
Programming With SIMD-instructions

The document discusses programming with SIMD (Single Instruction Multiple Data) instructions. It describes several SIMD extensions including MMX, SSE, SSE2 and AVX, which allow parallel operations on packed integer or floating-point data. It provides details on the characteristics of SIMD operations, different data types, instructions, and programming techniques for utilizing SIMD, including using compiler intrinsics or inline assembly. Automatic vectorization by advanced compilers is also discussed, allowing SIMD parallelism without requiring explicit programming.

Uploaded by

Mahipal Yadav


Programming with SIMD-instructions

SIMD = Single Instruction stream, Multiple Data stream

From Flynn's taxonomy: SISD, SIMD, (MISD), MIMD

Extensions to the Intel and AMD x86 instruction set for parallel operations on packed integer or floating-point data
- data parallelism: parallel vector operations
- applies the same operation in parallel on a number of data items packed into a 64-, 128- or 256-bit vector
- also supports scalar operations on integer or floating-point values

Originally designed to speed up media processing applications
- can also be very useful in other types of applications

There are many different versions of SIMD extensions
- MMX, SSE, SSE2, SSE3, SSE4, 3DNow!, AltiVec, VIS, AVX, ...

MMX, SSE and AVX

Extensions to the IA-32 and x86-64 instruction sets for parallel SIMD operations on packed data

MMX (Multimedia Extensions)
- introduced with the Pentium MMX processor in 1997
- supports only integer operations

SSE (Streaming SIMD Extensions)
- introduced in the Pentium III, 1999
- support for single-precision floating-point operations
- SSE2 (Streaming SIMD Extensions 2) was introduced in the Pentium 4, 2000
- SSE2 also supports double-precision floating-point operations
- later extensions: SSE3, SSSE3, SSE4

AVX (Advanced Vector Extensions)
- announced in 2008, supported in the Intel Sandy Bridge processors
- extends the vector registers to 256 bits

Characteristics of SIMD operations

The SIMD extensions were designed to speed up multimedia and communication applications
- graphics and image processing
- video and audio processing
- speech compression and recognition
- can also be used for data-intensive scientific computations

Applications can benefit from SIMD processing if they have the following characteristics
- small integer or floating-point data types (8-bit pixel values or characters, 16-bit audio samples, 32-bit floating-point values)
- small, highly repetitive loops
- frequent additions, multiplications or other simple operations
- compute-intensive algorithms
- data parallelism: can operate on independent values in parallel

SIMD operation
SIMD execution
- performs an operation in parallel on an array of 2, 4, 8 or 16 values
- data parallel operation

The operation can be a
- data movement instruction
- arithmetic instruction
- logical instruction
- comparison instruction
- conversion instruction
- shuffle instruction

[Figure: Source 1 holds X3 X2 X1 X0 and Source 2 holds Y3 Y2 Y1 Y0; the Destination holds X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0]
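As a rough illustration of how the instruction classes listed above combine, the following sketch computes a lane-wise maximum of two groups of four floats using data movement, comparison and logical intrinsics from xmmintrin.h. The function name max4 is hypothetical, and the unaligned load/store variants are used only so that the sketch works for any pointer.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Lane-wise maximum of four floats, built from a comparison
   and three logical operations (max4 is an illustrative name). */
void max4(const float *x, const float *y, float *dst)
{
    __m128 vx = _mm_loadu_ps(x);              /* data movement: X3 X2 X1 X0 */
    __m128 vy = _mm_loadu_ps(y);              /* data movement: Y3 Y2 Y1 Y0 */
    __m128 mask = _mm_cmpgt_ps(vx, vy);       /* comparison: all ones where X > Y */
    __m128 pick_x = _mm_and_ps(mask, vx);     /* logical: keep X where the mask is set */
    __m128 pick_y = _mm_andnot_ps(mask, vy);  /* logical: keep Y where the mask is clear */
    _mm_storeu_ps(dst, _mm_or_ps(pick_x, pick_y));
}
```

In practice a single _mm_max_ps call would do the same job; the longer form is shown only to make the individual operation types visible.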

MMX registers
8 64-bit MMX registers (MM0 to MM7)
- aliased to the x87 floating-point registers
- no stack organization
- can store 1, 2, 4 or 8 packed integer values

MMX registers can only hold data
- not memory addresses
- the general-purpose registers are used for addresses

[Figure: the MMX registers MM0 to MM7 mapped onto the floating-point registers]

Cannot use the x87 floating-point unit and the MMX unit at the same time
- they share the same set of registers

MMX data types

MMX instructions operate on 8-, 16-, 32- or 64-bit integer values, packed into a 64-bit field

4 MMX data types
- packed byte: 8 bytes packed into a 64-bit quantity
- packed word: 4 16-bit words packed into a 64-bit quantity
- packed doubleword: 2 32-bit doublewords packed into a 64-bit quantity
- quadword: one 64-bit quantity

[Figure: bit layout (63 down to 0) of the packed byte (b7 ... b0), packed word (w3 ... w0) and packed doubleword (dw1, dw0) formats]

MMX operations are limited to integer values
- the SSE extensions also provide operations on floating-point data

MMX instructions
MMX introduced 57 new instructions for operations on packed integer data
- arithmetic: addition, subtraction, multiplication, multiply and add; also with signed and unsigned saturation
- comparison: compare for equal, compare for greater than
- conversion: packing and unpacking of data
- logical: and, or, xor, and-not

The MMX instructions start with the prefix P (for Packed)
- Ex: paddb, paddw, paddd add packed bytes/words/doublewords (8/16/32-bit integers)
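The difference between ordinary (wrap-around) and saturating packed addition can be sketched as follows. This is only an illustration: it uses the 128-bit SSE2 forms of PADDB and PADDUSB from emmintrin.h (16 bytes per operation instead of the 8 bytes of an MMX register), because the 64-bit MMX intrinsics are not available with every 64-bit compiler. The function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* PADDB: each of the 16 byte lanes wraps around modulo 256. */
void add_bytes_wrap(const uint8_t *a, const uint8_t *b, uint8_t *dst)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi8(va, vb));
}

/* PADDUSB: unsigned saturation, each byte lane is clamped at 255. */
void add_bytes_sat(const uint8_t *a, const uint8_t *b, uint8_t *dst)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(va, vb));
}
```

With every input byte set to 200 and 100 respectively, the wrap-around version yields 44 (300 modulo 256) in each lane, while the saturating version yields 255. Saturation is usually what is wanted for pixel or audio data, where overflow should clip rather than wrap.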

SSE
Streaming SIMD Extensions
- introduced with the Pentium III processor
- designed to speed up performance of advanced 2D and 3D graphics, motion video, videoconferencing, image processing, speech recognition, ...
- supports only single-precision floating-point operations

Parallel operations on packed single-precision floating-point values
- 128-bit packed single-precision floating-point data type
- four IEEE 32-bit floating-point values packed into a 128-bit field
- data must be aligned in memory on 16-byte boundaries (unless the separate unaligned load/store instructions are used)

XMM registers
SSE adds a set of new 128-bit XMM registers
- 8 XMM registers (XMM0 to XMM7) in 32-bit mode
- 16 XMM registers (XMM0 to XMM15) in 64-bit mode

The XMM registers are new physical registers
- not aliased to any other registers
- independent of the general-purpose and FPU/MMX registers
- can mix MMX and SSE instructions

XMM registers can be accessed in 32-bit, 64-bit or 128-bit mode
- only for operations on data, not addresses

There is also a 32-bit control and status register, MXCSR
- flag and mask bits for floating-point exceptions
- rounding control bits
- flush-to-zero bit
- denormals-are-zero bit
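The MXCSR register can be read and modified from C through helpers declared in xmmintrin.h. The sketch below is a minimal illustration that inspects and toggles the flush-to-zero bit and reads the rounding-control field. The three wrapper function names are hypothetical; _mm_getcsr and the _MM_* macros are the standard intrinsics.

```c
#include <xmmintrin.h>  /* _mm_getcsr and the _MM_* MXCSR helper macros */

/* Returns nonzero if the flush-to-zero bit is currently set in MXCSR. */
int ftz_enabled(void)
{
    return (_mm_getcsr() & _MM_FLUSH_ZERO_MASK) != 0;
}

/* Switches flush-to-zero on or off; the other MXCSR bits are preserved. */
void set_ftz(int on)
{
    _MM_SET_FLUSH_ZERO_MODE(on ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);
}

/* Returns the rounding-control field, e.g. _MM_ROUND_NEAREST. */
unsigned int rounding_mode(void)
{
    return _MM_GET_ROUNDING_MODE();
}
```

MXCSR is per-thread state, so a change made this way affects only the calling thread and should normally be restored afterwards.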

SSE instructions
The SSE extension added 70 new instructions to the instruction set
- 50 for SIMD floating-point operations
- 12 for SIMD integer operations
- 8 for cache control

Supports both packed and scalar single-precision floating-point instructions
- operations on packed 32-bit floating-point values: packed instructions have the suffix PS
- operations on a scalar 32-bit floating-point value (the 32 least significant bits): scalar instructions have the suffix SS

Also included some 64-bit SIMD integer instructions
- extension to MMX
- operations on packed integer values stored in MMX registers

Packed and scalar operations

Packed SSE operations apply an operation in parallel on 2 or 4 floating-point values

[Figure: Source 1 (X3 X2 X1 X0) and Source 2 (Y3 Y2 Y1 Y0) give the Destination X3 op Y3, X2 op Y2, X1 op Y1, X0 op Y0]

Scalar SSE operations apply an operation on a single (scalar) floating-point value
- the compiler uses this for floating-point operations instead of the x87 floating-point unit

[Figure: only the lowest elements X0 and Y0 are combined; the Destination holds X0 op Y0 in its lowest element]
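The packed/scalar distinction can be seen directly with the ADDPS and ADDSS intrinsics. The following is a small sketch with hypothetical function names; note that for the scalar form the three upper elements of the result are copied unchanged from the first operand.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed add (ADDPS): all four lanes are added. */
void add_packed(const float *x, const float *y, float *dst)
{
    _mm_storeu_ps(dst, _mm_add_ps(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}

/* Scalar add (ADDSS): only lane 0 is added,
   lanes 1 to 3 of the result are copied from x. */
void add_scalar_lane(const float *x, const float *y, float *dst)
{
    _mm_storeu_ps(dst, _mm_add_ss(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}
```

For x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, the packed version produces {11, 22, 33, 44} and the scalar version produces {11, 2, 3, 4}.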

SSE2
Streaming SIMD Extensions 2
- introduced in the Pentium 4 processor
- designed to speed up performance of advanced 3D graphics, video encoding/decoding, speech recognition, e-commerce and Internet, scientific and engineering applications

Extends MMX and SSE with support for
- packed double-precision floating-point values
- packed integer values
- adds over 70 new instructions to the instruction set

Operates on 128-bit entities in the XMM registers
- must be aligned on 16-byte boundaries when stored in memory

SSE2 data types

128-bit packed double-precision floating point
- 2 IEEE double-precision floating-point values

128-bit packed byte integer
- 16 byte integers (8 bits)

128-bit packed word integer
- 8 word integers (16 bits)

128-bit packed doubleword integer
- 4 doubleword integers (32 bits)

128-bit packed quadword integer
- 2 quadword integers (64 bits)

[Figure: bit layout (127 down to 0) of each format, with elements numbered from 0 at the least significant end]
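The same 128 bits behave differently depending on which packed integer type an instruction assumes. As a sketch (hypothetical function names, SSE2 intrinsics from emmintrin.h), the two functions below add 1 to every element, once treating the data as 16 bytes and once as 4 doublewords. Carries never cross an element boundary, so the results differ.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Adds 1 to each of the 16 byte elements (PADDB). */
void inc_bytes(const uint8_t *src, uint8_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi8(v, _mm_set1_epi8(1)));
}

/* Adds 1 to each of the 4 doubleword elements (PADDD). */
void inc_dwords(const uint8_t *src, uint8_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi32(v, _mm_set1_epi32(1)));
}
```

If the first input byte is 0xFF and all others are 0, inc_bytes wraps byte 0 to 0x00 and sets every other byte to 1. inc_dwords instead sees the 32-bit value 0x000000FF in the lowest element and produces 0x00000100, so the carry moves into byte 1 while bytes 2 and 3 stay 0 (x86 stores the least significant byte first).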

SSE instructions
MMX instructions have names composed of different fields
- a prefix P, which stands for Packed
- the operation, for example ADD, SUB or MUL
- for arithmetic operations, US (Unsigned Saturation) or S (Signed Saturation)
- a suffix describing the data type
  - B: Packed Byte, 8 bytes
  - W: Packed Word, four 16-bit words
  - D: Packed Doubleword, two 32-bit doublewords
  - Q: Quadword, one single 64-bit quadword

Example: PADDB, Add Packed Byte Integers; PADDSB, Add Packed Signed Byte Integers with Signed Saturation

SSE2 operations on packed double-precision data have the suffix PD
- examples: ADDPD, MULPD, MAXPD, ANDPD

SSE2 operations on scalar double-precision data have the suffix SD
- examples: MOVSD, ADDSD, MULSD, MINSD

Programming with MMX and SSE

There are different ways a programmer can use SSE in a program
- automatic vectorization by the compiler
  - no explicit SSE programming needed, but requires a vectorizing compiler
- arithmetic operations on vector data types
  - declare variables of vector type
  - express computations as normal arithmetic expressions
- compiler intrinsic functions for SSE operations
  - functions that provide access to the MMX/SSE instructions from a high-level language
  - requires a detailed knowledge of MMX/SSE operation
- programming with inline assembly language
  - very good possibilities to arrange instructions for efficient execution
  - difficult to program, error prone
  - requires detailed knowledge of MMX/SSE operation and assembly language programming
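Of the approaches listed above, inline assembly is the one most tied to a particular compiler. The following sketch assumes GCC or Clang extended asm on an x86 or x86-64 target with the default AT&T operand order; the function name add_ps_asm is hypothetical. It issues a single ADDPS instruction on two XMM registers.

```c
#include <xmmintrin.h>  /* only for the __m128 type */

/* Adds two packed-float vectors with an explicit ADDPS instruction.
   Assumes GCC/Clang extended asm syntax (not portable to MSVC). */
__m128 add_ps_asm(__m128 a, __m128 b)
{
    __asm__("addps %1, %0"   /* AT&T order: a = a + b */
            : "+x"(a)        /* a: XMM register, read and written */
            : "x"(b));       /* b: XMM register, read only */
    return a;
}
```

The "x" constraints leave the choice of XMM registers to the compiler, but the programmer now owns the instruction selection, which is where both the control and the risk of this approach come from.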

Automatic vectorization
The compiler automatically recognizes loops that can be implemented with vectorized code
- very easy to use, no changes to the program code are needed

Only loops that can be analyzed and that are found to be suitable for SIMD execution are vectorized
- does not guarantee any performance improvement
- has no effect if the compiler cannot analyze the code and find the opportunities for SIMD operation

Requires a compiler with vectorizing capabilities
- in gcc, vectorization is enabled by -O3 (use -ftree-vectorizer-verbose=1 to print reports about which loops are vectorized; newer gcc releases replace this option with -fopt-info-vec)
- the Intel compiler, icc, also does advanced vectorization

    gcc -O3 -ftree-vectorizer-verbose=1 saxpy.c -o saxpy

    saxpy.c:9: note: created 1 versioning for alias checks.

    saxpy.c:9: note: LOOP VECTORIZED.
    saxpy.c:6: note: vectorized 1 loops in function.
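The saxpy.c file itself is not shown in the slides. A loop of the kind a vectorizing compiler can typically handle might look like the sketch below; the signature and the use of restrict are assumptions, not the original source. The restrict qualifiers promise that x and y do not overlap, which can let the compiler skip the run-time alias checks mentioned in the report.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i] for i in [0, n).
   Hypothetical stand-in for saxpy.c; restrict (C99) tells the
   compiler that x and y refer to non-overlapping memory. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

The loop body is a simple, independent operation per element with unit-stride memory access, which matches the characteristics listed earlier for code that benefits from SIMD.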

Arithmetic operations on vector data types

Declare variables of vector data types and express computations with normal arithmetic expressions
- SSE2 data types are defined in emmintrin.h

Vector data types
- four 32-bit floating-point values: __m128
- two 64-bit floating-point values: __m128d
- integer data types: __m128i

    __m128 a; /* 4 packed float values */
    __m128 b;
    __m128 c;

    c = a + b;
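A complete version of this idea could look like the following sketch. It assumes GCC or Clang, where __m128 is a built-in vector type that accepts the ordinary arithmetic operators; other compilers (MSVC, for example) would need _mm_add_ps and _mm_mul_ps instead. The function names are hypothetical.

```c
#include <xmmintrin.h>  /* __m128 and the load/store intrinsics */

/* dst[i] = x[i] + y[i] for four floats, written with the + operator.
   Relies on the GCC/Clang vector extension for __m128. */
void add4_operator(const float *x, const float *y, float *dst)
{
    __m128 a = _mm_loadu_ps(x);
    __m128 b = _mm_loadu_ps(y);
    __m128 c = a + b;
    _mm_storeu_ps(dst, c);
}

/* dst[i] = x[i] * y[i] + x[i], a slightly longer expression. */
void muladd4_operator(const float *x, const float *y, float *dst)
{
    __m128 a = _mm_loadu_ps(x);
    __m128 b = _mm_loadu_ps(y);
    _mm_storeu_ps(dst, a * b + a);
}
```

The compiler chooses the instructions, so the code stays readable, at the cost of portability across compilers.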

Compiler intrinsic functions

Functions for performing MMX and SSE operations on packed data
- implemented with inline assembly code
- allow the programmer to use C function calls and variables

Defines a C function for each MMX/SSE instruction
- there are also intrinsic functions composed of several MMX/SSE instructions

New data types to represent packed integer and floating-point values
- __m64 represents the contents of a 64-bit MMX register (8-, 16- or 32-bit packed integers)
- __m128 represents 4 packed single-precision floating-point values
- __m128d represents 2 packed double-precision floating-point values
- __m128i represents packed integer values (8-, 16-, 32- or 64-bit)

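C intrinsic functions
Example:
multiply two arrays A and B of 400 single-precision floating-point values

    #include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, ... */

    #define SIZE 400

    /* _mm_load_ps and _mm_store_ps require 16-byte aligned addresses
       (_Alignas is C11) */
    _Alignas(16) float A[SIZE], B[SIZE], C[SIZE];
    __m128 m1, m2, m3;

    for (int i = 0; i < SIZE; i += 4) {
        m1 = _mm_load_ps(A + i);     /* load 4 floats from A */
        m2 = _mm_load_ps(B + i);     /* load 4 floats from B */
        m3 = _mm_mul_ps(m1, m2);     /* 4 multiplications in parallel */
        _mm_store_ps(C + i, m3);     /* store 4 products to C */
    }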
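The loop above relies on two assumptions: the array length is a multiple of four, and the arrays are 16-byte aligned. A more general variant, shown here only as a sketch with a hypothetical function name, uses the unaligned load/store intrinsics and finishes any leftover elements with a scalar loop.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* c[i] = a[i] * b[i] for i in [0, n), for any n and any alignment. */
void mul_arrays(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: four products per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
    }

    /* Remainder loop: at most three elements left. */
    for (; i < n; i++)
        c[i] = a[i] * b[i];
}
```

When the data is known to be aligned, the aligned forms _mm_load_ps and _mm_store_ps can be used in the main loop instead.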

Register allocation and instruction scheduling are left to the compiler

Variables of vector data types have to be aligned to 16-byte boundaries

May also need to access the individual values in the packed data
- can be done by using a union structure

    union mmdata {
        __m128 m;
        float f[4];
    };
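As an example of how such a union might be used, the sketch below adds up the four elements of a packed value by reading them through the float array member. The function name sum_lanes is hypothetical; reading a union member other than the one last written is permitted in C (though not in C++).

```c
#include <xmmintrin.h>  /* __m128 */

union mmdata {
    __m128 m;      /* the packed value as a whole */
    float  f[4];   /* the same 128 bits as four floats, f[0] is the lowest */
};

/* Returns the sum of the four packed floats in v. */
float sum_lanes(__m128 v)
{
    union mmdata u;
    u.m = v;
    return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}
```

For instance, sum_lanes(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)) returns 10. Going through memory like this is convenient but usually slower than staying in registers, so it is best kept outside inner loops.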
