Programming with SIMD-instructions
SIMD = Single Instruction stream, Multiple Data stream
From Flynn's taxonomy: SISD, SIMD, (MISD), MIMD
Extensions to the Intel and AMD x86 instruction set for parallel operations on packed integer or floating-point data
data parallelism
parallel vector operations
applies the same operation in parallel on a number of data items packed into a 64-, 128- or 256-bit vector
also supports scalar operations on integer or floating-point values
Originally designed to speed up media processing applications
can also be very useful in other types of applications
There are many different versions of SIMD extensions
MMX, SSE, SSE2, SSE3, SSE4, 3DNow!, Altivec, VIS, AVX, ...
MMX, SSE and AVX
Extensions to the IA-32 and x86-64 instruction sets for parallel SIMD operations on packed data
MMX Multimedia Extensions
introduced with the Pentium MMX processor in 1997
supports only integer operations
SSE Streaming SIMD Extension
introduced in the Pentium III in 1999
supports single-precision floating-point operations
SSE2 Streaming SIMD Extension 2
introduced in the Pentium 4 in 2000
also supports double-precision floating-point operations
later extensions: SSE3, SSSE3, SSE4
AVX Advanced Vector Extensions
announced in 2008, supported in the Intel Sandy Bridge processors
extends the vector registers to 256 bits
Characteristics of SIMD operations
The SIMD extensions were designed to speed up multimedia and communication applications
graphics and image processing
video and audio processing
speech compression and recognition
can also be used for data-intensive scientific computations
Applications can benefit from SIMD processing if they have the following characteristics
small integer or floating-point data types (8-bit pixel values or characters, 16-bit audio samples, 32-bit floating-point values)
small, highly repetitive loops
frequent additions, multiplications or other simple operations
compute-intensive algorithms
data parallelism: can operate on independent values in parallel
SIMD operation
SIMD execution
performs an operation in parallel on an array of 2, 4, 8 or 16 values
data-parallel operation
The operation can be a
data movement instruction
arithmetic instruction
logical instruction
comparison instruction
conversion instruction
shuffle instruction
[Figure: packed SIMD operation. Source 1 holds X3, X2, X1, X0 and Source 2 holds Y3, Y2, Y1, Y0; the destination holds X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0.]
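The element-wise operation in the figure can be sketched in C with SSE intrinsics. This is a minimal illustration, not the only way to write it: the function name add4 is made up for this example, and unaligned loads and stores are used so that the arrays need no special alignment.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Hypothetical helper: dst[i] = x[i] + y[i] for i = 0..3,
   computed with a single packed addition. */
void add4(const float *x, const float *y, float *dst)
{
    __m128 vx = _mm_loadu_ps(x);        /* X3 X2 X1 X0 */
    __m128 vy = _mm_loadu_ps(y);        /* Y3 Y2 Y1 Y0 */
    __m128 vsum = _mm_add_ps(vx, vy);   /* four additions in one operation */
    _mm_storeu_ps(dst, vsum);           /* write the four sums to dst */
}
```

Calling add4 with x = {1, 2, 3, 4} and y = {10, 20, 30, 40} fills dst with the four pairwise sums.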
MMX registers
8 64-bit MMX registers
aliased to the x87 floating-point registers
no stack organization
can store 1, 2, 4 or 8 packed integer values
MMX registers can only hold data
not memory addresses
the general-purpose registers are used for addresses
[Figure: the 64-bit MMX registers MM0 to MM7 (bits 63..0), mapped onto the floating-point registers.]
Cannot use the x87 floating-point unit and the MMX unit at the same time
they share the same set of registers
MMX data types
MMX instructions operate on 8-, 16-, 32- or 64-bit integer values, packed into a 64-bit field
4 MMX data types
packed byte: 8 bytes packed into a 64-bit quantity (b7 ... b0)
packed word: four 16-bit words packed into a 64-bit quantity (w3 ... w0)
packed doubleword: two 32-bit doublewords packed into a 64-bit quantity (dw1, dw0)
quadword: one 64-bit quantity (qw)
[Figure: layout of the four MMX data types within bits 63..0.]
MMX operations are limited to integer values
the SSE extensions also provide operations on floating-point data
MMX instructions
MMX introduced 47 new instructions for operation on packed integer data
arithmetic
addition, subtraction, multiplication, multiply and add
also with signed and unsigned saturation
comparison
compare for equal, compare for greater than
conversion
packing and unpacking of data
logical
and, or, xor, and not
The MMX instructions start with the prefix P (for Packed)
Ex: paddb, paddw, paddd add packed bytes/words/doublewords (8/16/32-bit integers)
SSE
Streaming SIMD Extension
introduced with the Pentium III processor
designed to speed up performance of advanced 2D and 3D graphics, motion video, videoconferencing, image processing, speech recognition, ...
supports only single-precision floating-point operations
Parallel operations on packed single precision floating-point values
128-bit packed single-precision floating-point data type
four IEEE 32-bit floating-point values packed into a 128-bit field
data must be aligned in memory on 16-byte boundaries
[Figure: four single-precision values s3, s2, s1, s0 packed into bits 127..0.]
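The 16-byte alignment requirement can be illustrated with a small C sketch. The helper names alloc_floats16, is_aligned16 and sum_aligned are assumptions made for this example. It uses _mm_malloc to obtain aligned memory and _mm_load_ps, which requires an aligned address; _mm_loadu_ps is the variant for data whose alignment is unknown.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics, also provides _mm_malloc/_mm_free */

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
   The memory must be released with _mm_free. */
float *alloc_floats16(int n)
{
    return (float *)_mm_malloc((size_t)n * sizeof(float), 16);
}

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Sums n floats, four at a time. Assumes n is a multiple of 4 and
   that a is 16-byte aligned, as required by _mm_load_ps. */
float sum_aligned(const float *a, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(a + i));  /* aligned load */

    float t[4];
    _mm_storeu_ps(t, acc);             /* t has no alignment guarantee */
    return t[0] + t[1] + t[2] + t[3];  /* add the four partial sums */
}
```

Passing a pointer that is not 16-byte aligned to sum_aligned would typically fault at the _mm_load_ps; replacing it with _mm_loadu_ps removes the requirement.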
XMM registers
SSE adds a set of new 128-bit XMM registers
8 XMM registers in 32-bit mode
16 XMM registers in 64-bit mode
The XMM registers are new physical registers
not aliased to any other registers
independent of the general-purpose and FPU/MMX registers
can mix MMX and SSE instructions
XMM registers can be accessed in 32-bit, 64-bit or 128-bit mode
only for operations on data, not addresses
There is also a 32 bit control and status register, MXCSR
flag and mask bits for floating-point exceptions
rounding control bits
flush-to-zero bit
denormals-are-zero bit
[Figure: the 128-bit XMM registers XMM0 to XMM15 (bits 127..0).]
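The MXCSR register can be read and modified from C. The sketch below, with a made-up function name set_flush_to_zero, switches the flush-to-zero bit with the _MM_SET_FLUSH_ZERO_MODE macro from xmmintrin.h and returns the raw register value read with _mm_getcsr.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _MM_SET_FLUSH_ZERO_MODE, ... */

/* Hypothetical helper: turns the flush-to-zero bit on (on != 0) or off
   (on == 0) and returns the MXCSR contents after the change. */
unsigned int set_flush_to_zero(int on)
{
    _MM_SET_FLUSH_ZERO_MODE(on ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);
    return _mm_getcsr();   /* 32-bit control and status register */
}
```

The returned value can be masked with _MM_FLUSH_ZERO_MASK to inspect only the flush-to-zero bit.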
SSE instructions
The SSE extension added 70 new instructions to the instruction set
50 for SIMD floating-point operations
12 for SIMD integer operations
8 for cache control
Supports both packed and scalar single precision floating-point instructions
operations on packed 32-bit floating-point values
packed instructions have the suffix PS
operations on a scalar 32-bit floating-point value (the 32 LSB)
scalar instructions have the suffix SS
Also included some 64-bit SIMD integer instructions
extension to MMX
operations on packed integer values stored in MMX registers
Packed and scalar operations
Packed SSE operations apply an operation in parallel on 2 or 4 floating-point values
[Figure: packed operation. Source 1 holds X3, X2, X1, X0 and Source 2 holds Y3, Y2, Y1, Y0; the destination holds X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0.]
Scalar SSE operations apply an operation on a single (scalar) floating-point value
[Figure: scalar operation. Source 1 holds X3, X2, X1, X0 and Source 2 holds Y3, Y2, Y1, Y0; the destination holds X3, X2, X1, X0 op Y0.]
The compiler uses this for floating-point operations instead of the x87 fp-unit
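The difference between a packed and a scalar SSE operation can be sketched with the intrinsics _mm_mul_ps (packed, suffix PS) and _mm_mul_ss (scalar, suffix SS). The function names mul_packed and mul_scalar are chosen for this example only.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed: dst[i] = x[i] * y[i] for all four elements. */
void mul_packed(const float *x, const float *y, float *dst)
{
    __m128 r = _mm_mul_ps(_mm_loadu_ps(x), _mm_loadu_ps(y));
    _mm_storeu_ps(dst, r);
}

/* Scalar: only the lowest element is multiplied, dst[0] = x[0] * y[0].
   The upper three elements are copied unchanged from the first source. */
void mul_scalar(const float *x, const float *y, float *dst)
{
    __m128 r = _mm_mul_ss(_mm_loadu_ps(x), _mm_loadu_ps(y));
    _mm_storeu_ps(dst, r);
}
```

With x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, mul_packed produces {10, 40, 90, 160} while mul_scalar produces {10, 2, 3, 4}.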
SSE2
Streaming SIMD Extension 2
introduced in the Pentium 4 processor
designed to speed up performance of advanced 3D graphics, video encoding/decoding, speech recognition, E-commerce and Internet, scientific and engineering applications
Extends MMX and SSE with support for
packed double-precision floating-point values
packed integer values
adds over 70 new instructions to the instruction set
Operates on 128-bit entities in the XMM registers
must be aligned on 16-byte boundaries when stored in memory
SSE2 data types
128-bit packed double precision floating point
2 IEEE double-precision floating-point values (d1, d0)
128-bit packed byte integer
16 byte integers (8 bits), numbered 15 to 0
128-bit packed word integer
8 word integers (16 bits) (w7 ... w0)
128-bit packed doubleword integer
4 doubleword integers (32 bits) (d3 ... d0)
128-bit packed quadword integer
2 quadword integers (64 bits) (q1, q0)
[Figure: layout of the five SSE2 data types within bits 127..0.]
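Two of the SSE2 data types can be exercised from C as a short sketch: packed double precision (type __m128d) and packed doubleword integers (type __m128i). The function names add2_double and add4_int32 are assumptions for this example; the intrinsics come from emmintrin.h.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, __m128i, ... */

/* Adds two pairs of doubles: dst[i] = x[i] + y[i] for i = 0, 1. */
void add2_double(const double *x, const double *y, double *dst)
{
    __m128d r = _mm_add_pd(_mm_loadu_pd(x), _mm_loadu_pd(y));
    _mm_storeu_pd(dst, r);
}

/* Adds four pairs of 32-bit integers: dst[i] = x[i] + y[i] for i = 0..3. */
void add4_int32(const int32_t *x, const int32_t *y, int32_t *dst)
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(vx, vy));
}
```

The same __m128i type is used for all integer widths; the intrinsic name (here the epi32 suffix) selects how the 128 bits are interpreted.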
SSE instructions
MMX instructions have names composed of different fields
a prefix P stands for Packed
the operation, for example ADD, SUB or MUL
for arithmetic operations: US (Unsigned Saturation) or S (Signed Saturation)
a suffix describing the data type
B Packed Byte, 8 bytes
W Packed Word, four 16-bit words
D Packed Doubleword, two 32-bit doublewords
Q Quadword, one single 64-bit quadword
Example:
PADDB Add Packed Byte Integers
PADDSB Add Packed Signed Byte Integers with Signed Saturation
SSE2 operations on packed double-precision data have the suffix PD
examples: ADDPD, MULPD, MAXPD, ANDPD
SSE2 operations on scalar double-precision data have the suffix SD
examples: MOVSD, ADDSD, MULSD, MINSD
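The effect of the saturation field in the instruction name can be sketched by comparing PADDB with PADDSB. The example below uses the 128-bit SSE2 forms of these instructions through the intrinsics _mm_add_epi8 and _mm_adds_epi8, so it works on 16 bytes rather than the 8 bytes of an MMX register; the function names are made up for the illustration.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* PADDB: adds 16 signed bytes, results wrap around on overflow. */
void add_bytes_wrap(const int8_t *x, const int8_t *y, int8_t *dst)
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi8(vx, vy));
}

/* PADDSB: adds 16 signed bytes with signed saturation,
   results are clamped to the range -128..127. */
void add_bytes_sat(const int8_t *x, const int8_t *y, int8_t *dst)
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epi8(vx, vy));
}
```

Adding 100 + 100 in each byte gives -56 with the wrapping version (200 does not fit in a signed byte) and 127 with the saturating version.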
Programming with MMX and SSE
There are different ways a programmer can use SSE in a program
automatic vectorization by the compiler
no explicit SSE programming needed, but requires a vectorizing compiler
arithmetic operations on vector data types
declare variables of vector type
express computations as normal arithmetic expressions
compiler intrinsic functions for SSE operations
functions that provide access to the MMX/SSE instructions from a high-level language
also requires a detailed knowledge of MMX/SSE operation
program with inline assembly language
very good possibilities to arrange instructions for efficient execution
difficult to program, error prone
requires detailed knowledge of MMX/SSE operation and assembly language programming
Automatic vectorization
The compiler automatically recognizes loops that can be implemented with vectorized code
very easy to use, no changes to the program code are needed
Only loops that can be analyzed and that are found to be suitable for SIMD execution are vectorized
does not guarantee any performance improvement
has no effect if the compiler cannot analyze the code and find the opportunities for SIMD operation
Requires a compiler with vectorizing capabilities
in gcc, vectorization is enabled by -O3 (use -ftree-vectorizer-verbose=1 to print reports about which loops are vectorized; newer gcc versions replace this flag with -fopt-info-vec)
the Intel compiler, icc, also does advanced vectorization

    gcc -O3 -ftree-vectorizer-verbose=1 saxpy.c -o saxpy

    saxpy.c:9: note: created 1 versioning for alias checks.
    saxpy.c:9: note: LOOP VECTORIZED.
    saxpy.c:6: note: vectorized 1 loops in function.
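The source of saxpy.c is not shown here, so the following is only a guess at what such a loop might look like: a plain C loop computing y = a*x + y, with no SIMD code in it. A loop of this shape is a typical candidate for automatic vectorization.

```c
/* Hypothetical saxpy: y[i] = a * x[i] + y[i] for i = 0..n-1.
   The loop body is independent across iterations, which is what
   allows a vectorizing compiler to process several elements at once. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Because x and y are plain pointers, the compiler cannot know that they do not overlap, which is consistent with the alias-check note in the report above. Declaring the pointers with the C99 restrict qualifier is one way to tell the compiler that they do not overlap.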
Arithmetic operations on vector data types
Declare variables of vector data types and express computations with normal arithmetic expressions
SSE2 data types defined in emmintrin.h
Vector data types
four 32-bit floating-point values: __m128
two 64-bit floating-point values: __m128d
integer data types: __m128i

    __m128 a;   /* 4 packed float values */
    __m128 b;
    __m128 c;

    c = a + b;
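A slightly larger sketch of arithmetic on vector types follows. It assumes a compiler that supports the usual arithmetic operators on __m128 values, which gcc does as a language extension; other compilers may require intrinsics such as _mm_mul_ps and _mm_add_ps instead. The function name axpy4 is made up for the example.

```c
#include <xmmintrin.h>  /* __m128, _mm_set1_ps, _mm_loadu_ps, ... */

/* Hypothetical helper: dst[i] = a * x[i] + y[i] for i = 0..3,
   written with ordinary operators on vector variables
   (relies on the compiler's vector extension). */
void axpy4(float a, const float *x, const float *y, float *dst)
{
    __m128 va = _mm_set1_ps(a);   /* a copied into all four elements */
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    __m128 r = va * vx + vy;      /* element-wise multiply and add */
    _mm_storeu_ps(dst, r);
}
```

The expression reads like scalar code, while instruction selection and register allocation are left to the compiler.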
Compiler intrinsic functions
Functions for performing MMX and SSE operations on packed data
implemented with inline assembly code
allows the programmer to use C function calls and variables
Defines a C function for each MMX/SSE instruction
there are also intrinsic functions composed of several MMX/SSE instructions
New data types to represent packed integer and floating-point values
__m64 represents the contents of a 64-bit MMX register (8-, 16- or 32-bit packed integers)
__m128 represents 4 packed single-precision floating-point values
__m128d represents 2 packed double-precision floating-point values
__m128i represents packed integer values (8-, 16-, 32- or 64-bit)
C intrinsic functions
Example:
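multiply two arrays A and B of 400 single-precision floating-point values

    #include <xmmintrin.h>   /* SSE intrinsics */

    #define SIZE 400

    /* _mm_load_ps and _mm_store_ps require 16-byte aligned addresses */
    float A[SIZE] __attribute__((aligned(16)));
    float B[SIZE] __attribute__((aligned(16)));
    float C[SIZE] __attribute__((aligned(16)));
    __m128 m1, m2, m3;

    /* SIZE is a multiple of 4, so every iteration handles a full vector */
    for (int i = 0; i < SIZE; i += 4) {
        m1 = _mm_load_ps(A + i);     /* load 4 floats from A */
        m2 = _mm_load_ps(B + i);     /* load 4 floats from B */
        m3 = _mm_mul_ps(m1, m2);     /* 4 multiplications in parallel */
        _mm_store_ps(C + i, m3);     /* store 4 results to C */
    }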
Register allocation and instruction scheduling is left to the compiler
Variables of vector data types have to be aligned to 16-byte boundaries
May also need to access the individual values in the packed data
can be done by using a union structure

    union mmdata {
        __m128 m;
        float f[4];
    };
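As a sketch of how such a union might be used, the function below reads the four elements of a packed value and adds them up. The function name horizontal_sum is an assumption for this example; reading a union member other than the one last written is accepted by gcc and by C99 and later.

```c
#include <xmmintrin.h>  /* __m128 */

/* Overlays a packed SSE value with an array of its four elements. */
union mmdata {
    __m128 m;
    float f[4];
};

/* Hypothetical helper: returns the sum of the four packed floats in v. */
float horizontal_sum(__m128 v)
{
    union mmdata u;
    u.m = v;                                  /* store the packed value */
    return u.f[0] + u.f[1] + u.f[2] + u.f[3]; /* read individual elements */
}
```

Since the union contains an __m128 member, the compiler aligns it to 16 bytes automatically.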