Fpgas 29032016

FPGAs!
Basic Concepts – Building Blocks
• There are (3) fundamental building blocks found in digital devices:
– Gates
– Flip-Flops
– Interconnect (or routing)
[Figure: interconnect, gates, and flip flops (D/Q)]
Digital Logic Landscape
The following slides provide a history of the various logic devices.
[Figure: design capacity (gates) plotted against development time (hours, days, weeks, months, years). From lowest to highest on both axes: Standard Logic; Programmable Logic (SPLD, CPLD, FPGA); Gate Array; Standard Cell; Full Custom.]
Digital Logic History - PLDs
• Developed in the late 70s
• Major player today: Lattice
• First device that needs software
• 50 – 200 gates
A very common low cost IC package has pins on all 4 sides, called a Plastic-Leaded Chip Carrier (PLCC).
[Figure: PLD with interconnect, gates, and flip flops]
PLD Example
Digital Logic History - Gate Array
Definition: A pre-built IC consisting of a regular arrangement of gates and interconnect (routing), where the interconnect is modified to achieve a customer’s desired functions.
– The customer designs the behaviors/functions
– The vendor manipulates/changes the metal interconnect to arrive at the customer’s specified functions (that is, the vendor hooks up the gates)
– Sometimes called an Uncommitted Logic Array (ULA).
Packaging Enhancement: To increase the number of I/Os (Inputs/Outputs), the pin thickness and spacing (pitch) are dramatically reduced in this Thin Quad FlatPack package (TQFP).
[Figure: Gate Array in a TQFP package, 1,000,000+ gates, showing interconnect and gates]
Gate Array
• The ultimate building tool set for digital designers
• Advantages
– Very dense (today over 10 million gates)
– Fast performance (200 – 500 MHz)
– Very low unit cost
• Disadvantages
– Long turnaround time (3 - 6 months)
– $50K - $500K NRE
• NRE = Non-Recurring Engineering charges, which are one-time “set-up” charges to ready the “fab” to build the custom part (“fab” = the “factory” where the ICs are manufactured; the “fabrication plant”)
– Risk of re-spins
Digital Logic History - Standard Cell
• This device features a series of customized “cells”
– Each cell is optimized for its “standard” function
• Cells are chosen from a library provided by the Standard Cell vendor, customized, and connected to the other cells and the routing on the part.
• There are no standard layers to the device; each layer is a unique design
• Advantages:
– More optimized die size compared to GA
– Cheaper device price compared to GA
– Can add analog functions
• Disadvantages:
– Extremely high NRE charges (up to $1M)
– Requires more than 250k units/year
– Much longer development time
– Much higher risk (re-spins, etc.)
CPLDs, FPGAs
[Figure: the design capacity (gates) versus development time chart again, with the Programmable Logic group (SPLD, CPLD, FPGA) highlighted between Standard Logic and Gate Array, Standard Cell, Full Custom.]
Digital Logic History - CPLD
Complex Programmable Logic Device
Definition:
A CPLD contains multiple PLD blocks whose inputs and outputs are connected together by a global interconnection matrix.
A CPLD has two levels of programmability:
– Each PLD block can be programmed
– The interconnection between the PLDs can be programmed
CPLD technology was introduced in the late 80s.
[Figure: CPLD with interconnect and macrocells, 32-1024 macrocells]
CPLDs
• Vendors: Altera, Lattice, Cypress, Xilinx
• 2 Primary Technologies
– EEPROM (old technology)
– FLASH (technology used by Xilinx CPLDs)
• FPGAs vs. CPLDs
– FPGAs have much greater capacity
– CPLDs are faster for some small applications
– Both are easy to design
Digital Logic History - FPGA
Field Programmable Gate Array
Definition:
• An array of “logic cells” surrounded by substantial routing, both of which are under the user’s control
• The CLB (Configurable Logic Block) is/was the fundamental building block of the logic cell, although today’s FPGAs use a very sophisticated collection of gates that goes beyond the original CLB design
– The early Xilinx CLBs contained a 4-input look-up table (LUT), a flip-flop, and “carry logic”
[Figure: FPGA with interconnect and logic cells, >10 million gates]
FPGA Building Blocks
An Early Xilinx CLB
Digital Logic History
FPGA - Field Programmable Gate Array
2 types of FPGAs
• Reprogrammable (SRAM-based)
– Xilinx, Altera, Lattice, Atmel
• One-time Programmable (OTP)
– Actel, Quicklogic, EZchip
[Figure: SRAM logic cell (a LUT holding truth-table bits, feeding a flip flop) and OTP logic cell (gates feeding a flip flop)]
Basic Concepts - Logic Interconnect
• Method to hook-up gates inside a single device
• Need to have enough routing to connect most gates
• Larger gate counts result in lots of routing, bigger die size, increased cost
[Figure: gates joined by vertical and horizontal interconnect, with the used interconnect path between A and B highlighted]
Basic Concepts - I/Os
Inputs and Outputs
• All signals on & off chip must go through an I/O buffer
• User can choose many I/O buffer options
[Figure: silicon die, I/O buffer (I and O), and package pin]
Basic Concepts
Propagation Delay (tPD)
Definition: The time required for a signal to travel from A to B, measured in nanoseconds (ns).
Gate Delay, “A” to “B”: tPD = 3ns
Interconnect Delay, “A” to “B”: tPD = 1ns
Basic Concepts
Path Delay
Definition: The sum of all the gate and net delays from starting to ending point.
[Figure: path from “A” to “B” through three gates and two nets; labels “C” and fanout=2]
Gate tPD = 3ns, net tPD = 1.2ns, gate tPD = 3ns, net tPD = 1.8ns, gate tPD = 3ns
Path Delay “A” to “B” = sum of all gate + net delays
= 3ns + 1.2ns + 3ns + 1.8ns + 3ns
= 12ns
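The path delay sum can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function and variable names are made up for illustration.

```python
# Delays along the "A" to "B" path from the slide, in nanoseconds.
# The path alternates gate, net, gate, net, gate.
gate_delays_ns = [3.0, 3.0, 3.0]
net_delays_ns = [1.2, 1.8]


def path_delay_ns(gates, nets):
    """Return the path delay as the sum of all gate and net delays."""
    return sum(gates) + sum(nets)


total_ns = path_delay_ns(gate_delays_ns, net_delays_ns)
print(f"Path delay A to B = {total_ns:.1f} ns")
```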
Basic Concepts
Maximum System Performance (fMAX)
Definition: The fastest speed at which a circuit containing flip-flops can operate, measured in megahertz (MHz).
Circuit Events per Second:
1 = 1 Hertz (Hz)
1,000 = kilo (kHz)
1,000,000 = mega (MHz)
1,000,000,000 = giga (GHz)
[Figure: flip-flop to flip-flop path with tCQ = 2.5ns, tPD = 1ns, tPD = 2ns, tPD = 0.5ns, tPD = 2ns]
fMAX = 1/(longest flip-flop path delay)
fMAX = 1/(flip-flop delay + gate delays + net delays)
= 1/(2.5 + 1 + 2 + 0.5 + 2)ns
= 1/(8ns)
= 125 MHz
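The fMAX calculation is easy to reproduce in Python. The sketch below follows the slide's formula exactly (flip-flop delay plus gate and net delays, with no setup-time term); `fmax_mhz` is a hypothetical helper name.

```python
def fmax_mhz(delays_ns):
    """Return fMAX in MHz for a flip-flop to flip-flop path with delays in ns."""
    period_ns = sum(delays_ns)
    # 1 / (1 ns) is 1 GHz, so scale by 1000 to express the result in MHz.
    return 1000.0 / period_ns


# tCQ followed by the gate and net delays shown on the slide.
slide_path_ns = [2.5, 1.0, 2.0, 0.5, 2.0]
fmax_slide_mhz = fmax_mhz(slide_path_ns)
print(f"fMAX = {fmax_slide_mhz:.0f} MHz")
```

Shortening any one delay on the longest path raises fMAX; for example, removing the 2ns segment would give 1/(6ns), about 167 MHz.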
Xilinx FPGA
Architecture
How are they arranged
Spartan 6
[Figure: Spartan-6 layout with 18Kbit Dual Port RAM, 18×18 Multiplier, CLB (Configurable Logic Block) = 4 Slices, and 124 multi-standard I/O with JTAG. Each slice is drawn as 4-input LUTs (I0-I3, O) feeding D flip-flops with CE, SET, and RST controls.]
Low Cost Design 22

How they are arranged
Kintex-7 FPGA
Typical FPGA Logic Structure
• LUT
• Flip flop
Typical 4 Input LUT
• 4 Inputs
• One Output
• Any 4-input logic function can be implemented.
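A 4-input LUT is a 16-entry truth table addressed by its inputs, which is why it can implement any 4-input function. The Python sketch below models that idea; the helper names and the bit ordering of the table (I3 as the most significant address bit) are assumptions made for illustration, not a vendor definition.

```python
def make_lut4_init(func):
    """Build a 16-bit truth table for a function of four 0/1 inputs."""
    init = 0
    for address in range(16):
        i3 = (address >> 3) & 1
        i2 = (address >> 2) & 1
        i1 = (address >> 1) & 1
        i0 = address & 1
        if func(i3, i2, i1, i0):
            init |= 1 << address
    return init


def lut4(init, i3, i2, i1, i0):
    """Return the table bit selected by the four inputs."""
    address = (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
    return (init >> address) & 1


# Two different functions stored in the same kind of 16-bit table.
and4_init = make_lut4_init(lambda a, b, c, d: a & b & c & d)
xor4_init = make_lut4_init(lambda a, b, c, d: a ^ b ^ c ^ d)
print(hex(and4_init), hex(xor4_init))
```

Only the table contents change between functions; the lookup hardware, and therefore the delay, stays the same.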
Flip Flop
• Input D
• Input Clock
• Input Clock Enable (CE)
• Input Set (SET)
• Input Reset (RST)
• Output Q
Making the Most of Controls
Dedicated Flip-Flop controls make designs smaller and faster.
• 1 level of logic - fast and small
– Up to 4 data inputs plus 3 controls
[Figure: one LUT4 (I0-I3, O) driving the D input of a flip-flop with SET, CE, and RST controls; timing labelled tSU]
• 2 levels of logic - significantly slower and twice the size (and cost)
[Figure: two LUT4s in series, joined by a net, driving the D input of the flip-flop; timing labelled tSU, tSU]

Workshop - How can this be implemented?
This simple code describes a 4-input function followed by a Flip-Flop.
What size and performance is this function?
process (clk, reset)
begin
  if reset = '1' then                      -- reset
    data_out <= '0';
  elsif clk'event and clk = '1' then
    if enable = '1' then                   -- enable
      if force_high = '1' then             -- set
        data_out <= '1';
      else
        data_out <= a and b and c and d;   -- logic
      end if;
    end if;
  end if;
end process;


TWICE the Cost and Half the Speed
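TWICE as Big as it should be and Slow!
Report
Cell Usage :
# BELS : 2
# LUT2 : 1
# LUT4 : 1
# FlipFlops/Latches : 1
# FDCE : 1
Solution
[Figure: a LUT2 (inputs d on I1, c on I0) feeds a LUT4 (force_high on I3, b on I2, the LUT2 output on I1, a on I0). The LUT4 output O drives the D input of the flip-flop that produces data_out, with enable on CE and reset on CLR.]
Here force_high is implemented in the LUTs rather than on a dedicated flip-flop control, so the function has five logic inputs (a, b, c, d, force_high) and no longer fits in one LUT4.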
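The two-level result can be checked exhaustively in Python. The sketch below assumes the LUT2 computes `c and d` and the LUT4 computes `force_high or (a and b and LUT2 output)`; the report only gives the cell counts and pin connections, so this split of the logic between the two LUTs is an assumption.

```python
from itertools import product


def next_data(a, b, c, d, force_high):
    """Five-input function feeding the flip-flop D input in the workshop code."""
    return 1 if force_high else (a & b & c & d)


def lut2(c, d):
    """First level (assumed): combine two of the data inputs."""
    return c & d


def lut4_stage(force_high, b, lut2_out, a):
    """Second level (assumed): remaining inputs plus the LUT2 result."""
    return force_high | (a & b & lut2_out)


# Compare the two-level mapping with the original function for all 32 inputs.
matches = all(
    lut4_stage(fh, b, lut2(c, d), a) == next_data(a, b, c, d, fh)
    for a, b, c, d, fh in product((0, 1), repeat=5)
)
print("two-level mapping matches:", matches)
```

The check illustrates why a fifth logic input costs a second LUT and a second level of delay when only 4-input LUTs are available.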

CLB (Configurable Logic Block)
Multiple LUTs and FFs
[Figure: a CLB containing two slices; each slice holds two LUTs with carry logic, each feeding a D flip-flop with PRE, CE, and CLR controls]
• 2 Slices in Each CLB
• Each Slice has Two LUTs and Two Flip-flops
How do CLBs connect with each other
• Pairs of CLBs are arranged symmetrically
• Connect via Switch matrix
[Figure: two pairs of slices, each pair attached to a Switch Matrix, with clocks and data routed between them]
Fabric Routing
• Connections between CLBs and other resources use the fabric routing resources
• Routing lines connect to the switch matrices adjacent to the resources
• Routes connect resources vertically, horizontally, and diagonally
• Routes have different spans
– Horizontal: Single, Dual, Quad, Long (12)
– Vertical: Single, Dual, Hex, Long (18)
– Diagonal: Single, Dual, Hex
Different Architectures:
6 Input LUTs
• 6-input LUT can be two 5-input LUTs with common inputs
• Minimal speed impact to a 6-input LUT
• One or two outputs
• Any function of six variables or two independent functions of five variables
[Figure: a 6-LUT built from two 5-LUTs that share inputs A1-A5; A6 selects between them to produce O6, and O5 is the second output]
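The "one 6-LUT or two 5-LUTs" idea can be modelled in Python as a 64-entry table split into two 32-entry halves. This is a behavioral sketch only: the function names, the bit ordering, and the choice of which half drives O5 are assumptions for illustration.

```python
def lut5(table32, a5, a4, a3, a2, a1):
    """Look up one bit in a 32-entry table addressed by five inputs."""
    address = (a5 << 4) | (a4 << 3) | (a3 << 2) | (a2 << 1) | a1
    return (table32 >> address) & 1


def lut6(table64, a6, a5, a4, a3, a2, a1):
    """Return (O6, O5) for a 64-entry table built from two 5-LUT halves."""
    lower_half = table64 & 0xFFFFFFFF          # selected when A6 = 0, drives O5
    upper_half = (table64 >> 32) & 0xFFFFFFFF  # selected when A6 = 1
    o5 = lut5(lower_half, a5, a4, a3, a2, a1)
    upper_out = lut5(upper_half, a5, a4, a3, a2, a1)
    o6 = upper_out if a6 else o5
    return o6, o5


# One function of six variables: 6-input AND (only the last entry is 1).
and6_table = 1 << 63

# Two functions of five shared inputs: AND5 in the lower half, OR5 in the upper.
and5_table = 1 << 31
or5_table = 0xFFFFFFFE
dual_table = (or5_table << 32) | and5_table

print(lut6(and6_table, 1, 1, 1, 1, 1, 1))
print(lut6(dual_table, 1, 1, 0, 0, 0, 0))
```

In the two-function case A6 is held at 1, so O6 carries the upper-half function while O5 carries the lower-half function of the same five inputs.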
Different Architectures:
Slice Structure with 4 LUTs
• Four six-input Look Up Tables (LUT)
• Wide multiplexers
• Carry chain
• Four flip-flop/latches
• Four additional flip-flops
• The implementation tools (MAP) are responsible for packing slice resources into the slice
[Figure: slice containing four LUT/RAM/SRL blocks]
More Detailed Look at Flip Flops
• All flip-flops are D type
• All flip-flops have a single clock input (CLK)
– Clock can be inverted at the slice boundary
• All flip-flops have an active high chip enable (CE)
• All flip-flops have an active high SR input
– Input can be synchronous or asynchronous, as determined by the configuration bit stream
– Sets the flip-flop value to a pre-determined state, as determined by the configuration bit stream
[Figure: D flip-flops with D, Q, CE, CK, and SR pins]
Asynchronous Reset
• To infer asynchronous resets, the reset signal must be in the sensitivity list of the process
• Output takes reset value immediately
– Even if clock is not present
• SRVAL attribute is determined by reset value in RTL code
VHDL:
FF: process (CLK, RST)
begin
  if (RST = '1') then
    Q <= '0';              -- SRVAL
  elsif rising_edge(CLK) then
    Q <= D;
  end if;
end process;
Verilog:
always @(posedge CLK or posedge RST)
begin
  if (RST)
    Q <= 1'b0;             // SRVAL
  else
    Q <= D;
end
Using Asynchronous Resets
• Deassertion of reset should be synchronous to the clock
• Not synchronizing the deassertion of reset can create problems
– Flip-flops can go metastable
– Not all flip-flops are guaranteed to come out of reset on the same clock
• Use a reset bridge to synchronize reset to each domain
[Figure: reset bridge. rst_pin drives the SR inputs of two flip-flops in series, both clocked by clkA; the first has D tied to 0 and the second outputs rst_clkA. SR is configured as asynchronous, SRVAL=1.]
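The behavior of the reset bridge can be sketched in Python: reset asserts immediately, but is released only after clock edges. This is a cycle-level model under simplifying assumptions (no metastability, a hypothetical `ResetBridge` class, both flip-flops starting at 1); it is not a substitute for the HDL.

```python
class ResetBridge:
    """Two flip-flops in series whose asynchronous SR input forces them to 1."""

    def __init__(self):
        # Assumed starting state: both flip-flops hold the reset value (SRVAL=1).
        self.ff1 = 1
        self.ff2 = 1

    def assert_reset(self):
        """Asynchronous assertion: takes effect at once, with no clock needed."""
        self.ff1 = 1
        self.ff2 = 1

    def clock_edge(self, rst_pin):
        """Advance one rising edge of clkA and return rst_clkA."""
        if rst_pin:
            # SR is still asserted, so both flip-flops stay at 1.
            self.assert_reset()
        else:
            # First flip-flop has D tied to 0; second samples the first's old value.
            self.ff1, self.ff2 = 0, self.ff1
        return self.ff2


bridge = ResetBridge()
bridge.assert_reset()
# rst_pin is released before the first clock edge below.
release_trace = [bridge.clock_edge(rst_pin=0) for _ in range(3)]
print(release_trace)
```

In this model rst_clkA stays high for one more clock edge after rst_pin is released and then falls on a clock edge, so every flip-flop in the clkA domain sees the release at the same point in the cycle.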
Synchronous Reset
• A synchronous reset will not take effect until the first active clock edge after the assertion of the RST signal
• The RST pin of the flip-flop is a regular timing path endpoint
– The timing path ending at the RST pin will be covered by a PERIOD constraint on the clock
VHDL:
FF: process (CLK)
begin
  if rising_edge(CLK) then
    if (RST = '1') then
      Q <= '0';            -- SRVAL
    else
      Q <= D;
    end if;
  end if;
end process;
Verilog:
always @(posedge CLK)
begin
  if (RST)
    Q <= 1'b0;             // SRVAL
  else
    Q <= D;
end
Chip Enable
• All flip-flops in the 7 series FPGAs have a chip enable (CE) pin
– Active high, synchronous to CLK
– When asserted, the flip-flop clocks in the D input
– When not asserted, the flip-flop holds the current value
• Inferred naturally from RTL code
VHDL:
FF: process (CLK)
begin
  if rising_edge(CLK) then
    if (CE = '1') then
      Q <= D;
    end if;
  end if;
end process;
Verilog:
always @(posedge CLK)
begin
  if (CE)
    Q <= D;
end
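The hold-or-load behavior of the chip enable can be traced with a small Python model. `DFlipFlopCE` is a hypothetical class written for this sketch; it models only D, CE, and Q at rising clock edges.

```python
class DFlipFlopCE:
    """D flip-flop with an active-high, synchronous chip enable."""

    def __init__(self, initial=0):
        self.q = initial

    def clock_edge(self, d, ce):
        """Apply one rising clock edge and return the new Q."""
        if ce:
            self.q = d  # CE asserted: clock in the D input
        # CE not asserted: Q holds its current value
        return self.q


ff = DFlipFlopCE()
samples = [(1, 1), (0, 0), (0, 1), (1, 0)]  # (D, CE) at each rising edge
q_trace = [ff.clock_edge(d, ce) for d, ce in samples]
print(q_trace)
```

Q follows D on the first and third edges, where CE is 1, and holds its value on the second and fourth.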
LUTs can also be used as RAM
• Uses the same storage that is used for the look-up table function
• Synchronous write, asynchronous read
– Can be converted to synchronous read using the flip-flops available in the slice
• Various configurations
– Single port: One LUT6 = 64x1 or 32x2 RAM; cascadable up to 256x1 RAM
– Dual port (D): 1 read / write port + 1 read-only port
– Simple dual port (SDP): 1 write-only port + 1 read-only port
– Quad-port (Q): 1 read / write port + 3 read-only ports
• Each port has independent address inputs
Available configurations:
Single Port: 32x2, 32x4, 32x6, 32x8, 64x1, 64x2, 64x3, 64x4, 128x1, 128x2, 256x1
Dual Port: 32x2D, 32x4D, 64x1D, 64x2D, 128x1D
Simple Dual Port: 32x6SDP, 64x3SDP
Quad Port: 32x2Q, 64x1Q
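A single-port 64x1 LUT RAM can be sketched in Python to show the difference between the synchronous write and the asynchronous read. The class and method names are hypothetical; only the 64x1 organization and the read/write timing come from the slide.

```python
class LutRam64x1:
    """Behavioral model of one LUT6 used as a 64x1 single-port RAM."""

    def __init__(self):
        self.cells = [0] * 64

    def read(self, addr):
        """Asynchronous read: the output follows the address with no clock."""
        return self.cells[addr & 0x3F]

    def clock_edge(self, addr, data_in, write_enable):
        """Synchronous write: storage changes only on a clock edge with WE high."""
        if write_enable:
            self.cells[addr & 0x3F] = data_in & 1


ram = LutRam64x1()
before = ram.read(5)                                # nothing written yet
ram.clock_edge(addr=5, data_in=1, write_enable=1)   # write a 1 on this edge
after_write = ram.read(5)
ram.clock_edge(addr=5, data_in=0, write_enable=0)   # WE low: contents kept
after_idle = ram.read(5)
print(before, after_write, after_idle)
```

Registering the value returned by `read` in a flip-flop would turn this into the synchronous-read variant mentioned above.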
Block RAMs
(In built Memory)
Single-Port Block RAM
• Single read/write port
– Clock: CLKA
– Address: ADDRA
– Write enable: WEA
– Write data: DIA
– Read data: DOA
[Figure: 36 Kb Memory Array with Port A signals ADDRA, DIA, WEA, CLKA, and DOA; bus widths of 36, 36, and 4 are marked]
• 36-kbit configurations
– 32k x 1, 16k x 2, 8k x 4, 4k x 9, 2k x 18, 1k x 36
• 18-kbit configurations
– 16k x 1, 8k x 2, 4k x 4, 2k x 9, 1k x 18, 512 x 36
• Configurable write mode
– WRITE_FIRST: Data written on DIA is available on DOA
– READ_FIRST: Old contents of RAM at ADDRA are presented on DOA
– NO_CHANGE: The DOA holds its previous value (saves power)
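The three write modes differ only in what DOA shows during a write. The Python sketch below models that; `SinglePortBram` is a hypothetical class, and the write enable is simplified to a single flag rather than the multi-bit WEA of the real primitive.

```python
class SinglePortBram:
    """Cycle-level model of a single-port RAM with a configurable write mode."""

    MODES = ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE")

    def __init__(self, depth, write_mode):
        if write_mode not in self.MODES:
            raise ValueError(f"unknown write mode: {write_mode}")
        self.mem = [0] * depth
        self.write_mode = write_mode
        self.doa = 0

    def clock_edge(self, addra, dia, wea):
        """Apply one rising edge of CLKA and return DOA."""
        old_value = self.mem[addra]
        if wea:
            self.mem[addra] = dia
            if self.write_mode == "WRITE_FIRST":
                self.doa = dia         # new data appears on DOA
            elif self.write_mode == "READ_FIRST":
                self.doa = old_value   # old contents appear on DOA
            # NO_CHANGE: DOA keeps its previous value during a write
        else:
            self.doa = old_value       # plain read
        return self.doa


def demo(mode):
    ram = SinglePortBram(depth=16, write_mode=mode)
    ram.clock_edge(addra=3, dia=7, wea=1)         # store 7 at address 3
    ram.clock_edge(addra=0, dia=0, wea=0)         # read address 0, so DOA = 0
    return ram.clock_edge(addra=3, dia=9, wea=1)  # overwrite with 9, observe DOA


results = {mode: demo(mode) for mode in SinglePortBram.MODES}
print(results)
```

During the final write, DOA shows the new value 9 in WRITE_FIRST, the old value 7 in READ_FIRST, and the earlier read result 0 in NO_CHANGE.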
Summary of Block RAM Configurations
Single Port
– 18kbit: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18
– 36kbit: 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36
– 1 read/write port
– Read OR write in 1 cycle
True Dual Port
– 18kbit: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18
– 36kbit: 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36
– Two fully independent read/write ports
– Any two operations in 1 cycle
Simple Dual Port
– 18kbit: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18, 512x36
– 36kbit: 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36, 512x72
– 1 read port and 1 write port
– Read AND write in 1 cycle
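Each configuration name is depth x width, so the total storage can be computed directly. The Python sketch below does this for the 36-kbit list; the helper names are made up for illustration. It shows that the x1, x2, and x4 shapes use 32 Kb while the x9 and wider shapes use the full 36 Kb (the extra bit per byte is commonly used for parity, which is general background rather than something stated on the slide).

```python
def parse_config(text):
    """Turn a string such as '2Kx18' or '512 x 36' into (depth, width)."""
    depth_text, width_text = text.lower().replace(" ", "").split("x")
    if depth_text.endswith("k"):
        depth = int(depth_text[:-1]) * 1024
    else:
        depth = int(depth_text)
    return depth, int(width_text)


def total_bits(text):
    """Return the number of bits addressed by one configuration."""
    depth, width = parse_config(text)
    return depth * width


configs_36k = ["32Kx1", "16Kx2", "8Kx4", "4Kx9", "2Kx18", "1Kx36", "512x72"]
for name in configs_36k:
    print(f"{name:>7}: {total_bits(name)} bits")
```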
SelectI/O
• SelectI/O Allows Connection Directly to External Signals of Varied Voltages & Thresholds
• Future Standards Can be Supported Without Having to Make Silicon Changes
[Figure: 4 System Interfaces, showing voltages 5.0V, 3.3V, 2.5V, 1.8V and standards PCI, SSTL, HSTL, GTL, GTL+, AGP]
SelectI/O
• Allows Connection & Use of a Wide Variety of Devices
• Processors, Memory, Bus Specific Standards, Mixed Signal...
• Provides Industry Standard IEEE/JEDEC I/O Standards
• Maximizes Speed/Noise Tradeoff - Use Only What is Needed
• Can Connect to or Create High Performance Backplanes
• PCI, GTL+, HSTL
• DIY - Virtex Based Backplane Design in Progress
• Define I/O by Simply Placing Desired Input And/Or Output
Buffers Into the Design
• Special IBUF and OBUF Components Provided in Schematic Based and
HDL Based Design Flows
• For Example: SSTL3, Class I Output Buffer - OBUF_SSTL3_I
Simplified IOB Structure
• Fast I/O Drivers
• Separate Registers for Input, Output & Three-State Control
• Asynchronous Set or Reset Available on Each Flip-flop
• Common Clock, Separate Clock Enables
• Programmable Slew Rate, Pullup, Input Delay, Etc
• Selectable I/O Standard Support
• Supported Standards List can be Updated After Testing
[Figure: IOB with three DFF/LATCH elements (D, Q, CE, S/R) connected to the PAD]
How It Works
[Figure: Configuration Bits select the SelectI/O Output (OBUF_SSTL3_I, SSTL3 Class1 Output Driver) and the SelectI/O Input (IBUF_SSTL3_I, SSTL3 Class1 Input Receiver)]
Xilinx 7 Series
Lowest Power and Cost
Compared to Spartan-6
– 30% more performance
– Lower system cost
– 50% less power
– 30% smaller footprint
Industry’s Best Price-Performance, “New Class of FPGA”
Compared to Virtex-6
– Comparable performance
– 50% lower cost for 2x better price-performance
– 50% less power
Compared to Spartan-6
– 3.3x larger
– Over 2x performance with 4x transceiver speed
– Superior price-performance
Industry’s Highest System Performance and Capacity
Compared to Virtex-6
– 2.5x larger (2M LCs)
– 50% higher performance
– 50% lower power
– 2x line rate (28 Gb/s)
– Similar EasyPath™ cost reduction
7 Series FPGA Layout
• Similar Floorplan to Virtex-6 FPGAs
– Provides easy migration to 7 series FPGAs
• CMT columns moved from center of device to adjacent to I/O columns
– No more inner vs. outer column performance difference
– Support for higher performance interfaces
• Only one I/O column per half device
– Uniform skew from center of device
• GT columns replace I/O and CMT in smaller devices
• GT columns not always present
[Figure: device floorplan showing I/O Columns, CMT Columns, Clock Routing, CLB/Block RAM/DSP Columns, and GT Columns]
7 Series Slice Structure
• Four six-input Look Up Tables (LUT)
• Wide multiplexers
• Carry chain
• Four flip-flop/latches
• Four additional flip-flops
• The implementation tools (MAP) are responsible for packing slice resources into the slice
[Figure: slice containing four LUT/RAM/SRL blocks]
7-Series I/O Block Diagram
[Figure: a P/N I/O pair (master and slave). Each pin connects to the FPGA fabric interconnect through logical resources (ILOGIC/ISERDES, IDELAY, OLOGIC/OSERDES, ODELAY) and electrical resources (LVDS, Termination).]
7 Series FPGAs DSP
• 7 series FPGAs DSP slice 100% based on Virtex-6 FPGA DSP48E1
– 25x18 multiplier
– 25-bit pre-adder
– Flexible pipeline
– Cascade in and out
– Carry in and out
– 96-bit MACC
– SIMD support
– 48-bit ALU
– Pattern detect
– 17-bit shifter
– Dynamic operation (cycle by cycle)
Highly Capable, Dedicated DSP Logic in Every 7 Series FPGA
7-Series Gigabit Transceivers
[Figure: transmitter (Tx) and receiver (Rx), each with PMA and PCS blocks and a 2-wire serial side, connected to the FPGA Fabric Interface]
• Dedicated parallel-to-serial transmitter and serial-to-parallel receiver
– Unidirectional, differential bit-serial data I/O
– Integrated PLL-based Clock and Data Recovery (CDR)
• Parallel interface to the FPGA internal fabric
– Width varies by family, protocol, and line rate from 8 to 40 bits
• Serial interface to the printed circuit board (differential signaling)
– Differential Current Mode Logic (CML)
– Two traces for the transmitter and two traces for the receiver; removes common-mode noise
