
Fpgas 29032016


FPGAs!

Basic Concepts – Building Blocks

• There are three (3) fundamental building blocks found in digital devices:
– Gates
– Flip-Flops
– Interconnect (or routing)

[Figure: a digital device drawn as interconnect, gates, and D flip-flops]

Digital Logic Landscape
The following slides provide a history of the various logic devices.

[Figure: design capacity (gates) plotted against development time (hours, days, weeks, months, years). From lowest to highest: Standard Logic; Programmable Logic (SPLD, CPLD, FPGA); Gate Array; Standard Cell; Full Custom]

Digital Logic History - PLDs
• Developed in the late 70s
• Major player today: Lattice
• First device that needs software
• 50 – 200 gates

A very common low cost IC package has pins on all 4 sides and is called a Plastic-Leaded Chip Carrier (PLCC).

[Figure: PLD drawn as interconnect, gates, and D flip-flops]

PLD Example

Digital Logic History - Gate Array
Definition: A pre-built IC consisting of a regular arrangement of gates and interconnect (routing) where the interconnect is modified to achieve a customer’s desired functions.
– The customer designs the behaviors/functions
– The vendor manipulates/changes the metal interconnect to arrive at the customer’s specified functions (that is, the vendor hooks up the gates)
– Sometimes called an Uncommitted Logic Array (ULA).

Packaging Enhancement: To increase the number of I/Os (Inputs/Outputs), the pin thickness and spacing (pitch) are dramatically reduced in this Thin Quad FlatPack package (TQFP).

[Figure: Gate Array in a TQFP package, drawn as interconnect and gates, 1,000,000+ gates]

Gate Array
• The ultimate building tool set for digital designers
• Advantages

– Very dense (today over 10,000,000 gates (10 million))
– Fast performance (200 – 500 MHz)
– Very low unit cost
• Disadvantages
– Long turnaround time (3 – 6 months)
– $50K – $500K NRE
• NRE = Non-Recurring Engineering charges, which are one-time “set-up” charges to ready the “fab” to build the custom part (“fab” = the “factory” where the ICs are manufactured; the “fabrication plant”)
– Risk of re-spins

Digital Logic History - Standard Cell
• This device features a series of customized “cells”
– Each cell is optimized for its “standard” function
• Cells are chosen from the Standard Cell vendor’s library, customized, and connected to the other cells and the routing on the part.
• There are no standard layers to the device; each layer is a unique design
• Advantages:
– More optimized die size compared to GA
– Cheaper device price compared to GA
– Can add analog functions
• Disadvantages:
– Extremely high NRE charges (up to $1M)
– Requires 250k+ units/year
– Much longer development time
– Much higher risk (re-spins, etc.)

CPLDs, FPGAs
[Figure: the design capacity (gates) versus development time chart again, with SPLD, CPLD, and FPGA grouped as Programmable Logic between Standard Logic and Gate Array, Standard Cell, and Full Custom]

Digital Logic History - CPLD
Complex Programmable Logic Device
Definition:
A CPLD contains multiple PLD blocks whose inputs and outputs are connected together by a global interconnection matrix.

A CPLD has two levels of programmability:
– Each PLD block can be programmed
– The interconnection between the PLDs can be programmed

CPLD technology was introduced in the late 80s.

[Figure: CPLD drawn as interconnect and macrocells, 32-1024 macrocells]

CPLDs
• Vendors: Altera, Lattice, Cypress, Xilinx
• 2 Primary Technologies
– EEPROM (old technology)
– FLASH (technology used by Xilinx CPLDs)
• FPGAs vs. CPLDs
– FPGAs have much greater capacity
– CPLDs are faster for some small applications
– Both are easy to design

Digital Logic History - FPGA
Field Programmable Gate Array
Definition:
• An array of “logic cells” surrounded by substantial routing, both of which are under the user’s control
• The CLB (Configurable Logic Block) is/was the fundamental building block of the logic cell, although today’s FPGAs use a very sophisticated collection of gates that goes beyond the original CLB design
– The early Xilinx CLBs contained a 4-input look-up table (LUT), a flip-flop, and “carry logic”

[Figure: FPGA drawn as interconnect and logic cells, >10 million gates]

FPGA Building Blocks

An Early Xilinx CLB

Digital Logic History
FPGA - Field Programmable Gate Array
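2 types of FPGAs

• Reprogrammable (SRAM-based)
– Xilinx, Altera, Lattice, Atmel
– SRAM logic cell: a LUT followed by a flip-flop
• One-time Programmable (OTP)
– Actel, Quicklogic, EZchip
– OTP logic cell: gates followed by a flip-flop

[Figure: an SRAM logic cell, shown as a LUT truth table (example rows 0110, 1011, 1100, 0001, 1010, 1111 with their output bits) feeding a flip-flop, next to an OTP logic cell, shown as gates feeding a flip-flop]
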
Basic Concepts - Logic Interconnect
• Method to hook-up gates inside a single device
• Need to have enough routing to connect most gates
• Larger gate counts result in lots of routing,
bigger die size, increased cost

[Figure: a grid of gates with vertical and horizontal interconnect; the used interconnect path from A to B is highlighted]

Basic Concepts - I/Os
Inputs and Outputs

• All signals on & off chip must go through an I/O buffer
• User can choose many I/O buffer options

[Figure: the silicon die connected to a package pin through an I/O buffer with input (I) and output (O) paths]

Basic Concepts
Propagation Delay (tPD)

Definition: The time required for a signal to travel from A to B, measured in nanoseconds (ns).

[Figure: Gate Delay, from “A” to “B” through a gate, tPD = 3ns; Interconnect Delay, from “A” to “B” along a net, tPD = 1ns]

Basic Concepts
Path Delay
Definition: The sum of all the gate and net delays from starting to ending point.

[Figure: a path from “A” to “B” through a gate (tPD = 3ns), a net (tPD = 1.2ns), a gate (tPD = 3ns), a net (tPD = 1.8ns), and a gate (tPD = 3ns); a net with fanout=2 also drives “C”]

Path Delay “A” to “B” = sum of all gate + net delays
= 3ns + 1.2ns + 3ns + 1.8ns + 3ns
= 12ns

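The path-delay sum can be checked with a few lines of Python. This is only a sketch: the delay values are the ones shown on the slide, while the names `gate_delays_ns`, `net_delays_ns`, and `path_delay_ns` are made up for illustration. `Decimal` is used so the 1.2 ns and 1.8 ns values add up exactly.

```python
from decimal import Decimal

# Delays along the path from "A" to "B", in nanoseconds (values from the slide).
gate_delays_ns = [Decimal("3"), Decimal("3"), Decimal("3")]
net_delays_ns = [Decimal("1.2"), Decimal("1.8")]


def path_delay_ns(gates, nets):
    """Path delay = sum of all gate delays + sum of all net delays."""
    return sum(gates, Decimal(0)) + sum(nets, Decimal(0))


total = path_delay_ns(gate_delays_ns, net_delays_ns)
print(f"Path delay A to B = {total} ns")
```
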
Basic Concepts
Maximum System Performance (fMAX)
Definition: The fastest speed at which a circuit containing flip-flops can operate, measured in megahertz (MHz).

Circuit events per second:
1 = 1 Hertz (Hz)
1,000 = kilo (kHz)
1,000,000 = mega (MHz)
1,000,000,000 = giga (GHz)

[Figure: a flip-flop to flip-flop path with tCQ = 2.5ns followed by delays of tPD = 1ns, 2ns, 0.5ns, and 2ns]

fMAX = 1 / (longest flip-flop path delay)

fMAX = 1 / (flip-flop delay + gate delays + net delays)
= 1 / (2.5 + 1 + 2 + 0.5 + 2)ns
= 1 / 8ns
= 125 MHz

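The same fMAX arithmetic as a small Python sketch. The function name `fmax_mhz` is hypothetical, and the model follows the slide's formula only: it adds up the clock-to-Q, gate, and net delays and ignores effects such as setup time and clock skew that a real timing analysis would also count.

```python
def fmax_mhz(delays_ns):
    """Return fMAX in MHz for a flip-flop to flip-flop path, given its delays in ns."""
    period_ns = sum(delays_ns)
    # 1 / (1 ns) is 1 GHz, which is 1000 MHz.
    return 1000.0 / period_ns


# tCQ followed by the gate and net delays from the slide.
path = [2.5, 1.0, 2.0, 0.5, 2.0]
print(f"fMAX = {fmax_mhz(path)} MHz")
```

Because fMAX is set by the longest path, a design with several paths would evaluate each one and take the minimum result.
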
Xilinx FPGA
Architecture
How are they arranged?

[Figure: Spartan 6 layout showing CLBs (CLB = Configurable Logic Block = 4 slices; each slice drawn as 4-input LUTs, inputs I0-I3 and output O, feeding D flip-flops with SET, CE, and RST), 18Kbit dual port RAM, 18×18 multipliers, and 124 multi-standard I/O with JTAG]

Low Cost Design 22


How they are arranged
Kintex-7 FPGA
Typical FPGA Logic Structure
• LUT
• Flip flop

Typical 4 Input LUT
• 4 Inputs
• One Output
• Any 4-input logic function can be implemented.

Flip Flop
• Input D
• Input Clock
• Input Clock Enable (CE)
• Input Set (SET)
• Input Reset (RST)
• Output Q
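
A 4-input LUT can implement any 4-input function because it simply stores the function's 16 output bits and uses the inputs as an address. The Python sketch below illustrates that idea; `make_lut4` and `lut4_read` are illustrative names, and the bit ordering (I3 as the most significant address bit) is an assumption, not a statement about any vendor's INIT format.

```python
def make_lut4(func):
    """Build the 16-entry truth table (the LUT contents) for a 4-input function."""
    table = []
    for addr in range(16):
        i3, i2, i1, i0 = (addr >> 3) & 1, (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
        table.append(func(i3, i2, i1, i0) & 1)
    return table


def lut4_read(table, i3, i2, i1, i0):
    """Evaluate the LUT: the four inputs form an address into the stored table."""
    return table[(i3 << 3) | (i2 << 2) | (i1 << 1) | i0]


# Two different functions use the same hardware; only the stored bits differ.
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
print(and4)
print(lut4_read(xor4, 1, 0, 1, 1))
```
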
Making the Most of Controls
Dedicated Flip-Flop controls make designs smaller and faster.

• 1 level of logic - fast and small
– Up to 4 data inputs plus 3 controls

[Figure: a single LUT4 (inputs I0-I3, output O) feeding the D input of a flip-flop with SET, CE, and RST; the path is bounded by the setup time tSU]

• 2 levels of logic - significantly slower and twice the size (and cost)

[Figure: two LUT4s in series, connected by a net, feeding the D input of the same flip-flop]



Workshop - How can this be implemented?
This simple code describes a 4-input function followed by a Flip-Flop.
What size and performance is this function?

process (clk, reset)
begin
  if reset = '1' then                    -- reset
    data_out <= '0';
  elsif clk'event and clk = '1' then
    if enable = '1' then                 -- enable
      if force_high = '1' then           -- set
        data_out <= '1';
      else
        data_out <= a and b and c and d; -- logic
      end if;
    end if;
  end if;
end process;





TWICE the Cost and Half the Speed
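TWICE as big as it should be, and slow!

Report

Cell Usage :
# BELS : 2
# LUT2 : 1
# LUT4 : 1
# FlipFlops/Latches : 1
# FDCE : 1

[Figure: the synthesized circuit. A LUT2 (inputs c and d) feeds a LUT4 together with force_high, b, and a; the LUT4 output drives the D input of a flip-flop with PRE and CLR pins, whose CE is driven by enable and whose CLR is driven by reset, producing data_out]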
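
One way to read the report: the flip-flop that was inferred (FDCE) provides a clock enable and an asynchronous clear, so force_high ends up in the logic in front of the D input, as the figure shows. That logic then depends on five signals, one more than a LUT4 accepts. The Python sketch below checks this by brute force; `d_input` and `support` are illustrative names, and the final LUT count is a rough estimate for this one case rather than a general mapping rule.

```python
from itertools import product


def d_input(a, b, c, d, force_high):
    """Logic in front of the flip-flop's D pin once force_high is folded into it."""
    return 1 if force_high else (a & b & c & d)


def support(func, n):
    """Return the indices of the inputs that can change the output of an n-input function."""
    used = set()
    for bits in product((0, 1), repeat=n):
        for i in range(n):
            flipped = list(bits)
            flipped[i] ^= 1
            if func(*bits) != func(*flipped):
                used.add(i)
    return sorted(used)


inputs_needed = len(support(d_input, 5))
# Rough estimate: a function of up to 4 inputs fits one LUT4; this 5-input one needs a second LUT.
luts_needed = 1 if inputs_needed <= 4 else 2
print(inputs_needed, luts_needed)
```
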


CLB (Configurable Logic Block)
Multiple LUTs and FFs

[Figure: a CLB containing two slices; each slice holds two LUTs, each followed by carry logic and a D flip-flop with PRE, CE, and CLR]

• 2 Slices in Each CLB
• Each Slice has Two LUTs and Two Flip-flops

How do CLBs connect with each other?
• Pairs of CLBs are arranged symmetrically
• Connect via Switch matrix

[Figure: pairs of slices attached to switch matrices, with clock and data connections]

Fabric Routing
• Connections between CLBs and other resources use the fabric routing resources
• Routing lines connect to the switch matrices adjacent to the resources
• Routes connect resources vertically, horizontally, and diagonally
• Routes have different spans
– Horizontal: Single, Dual, Quad, Long (12)
– Vertical: Single, Dual, Hex, Long (18)
– Diagonal: Single, Dual, Hex

Different Architectures:
6 Input LUTs
• 6-input LUT can be two 5-input LUTs with common inputs
• Minimal speed impact to a 6-input LUT
• One or two outputs
• Any function of six variables or two independent functions of five variables

[Figure: a 6-LUT built from two 5-LUTs that share inputs A1-A5; A6 selects between their outputs to produce O6, and the second 5-LUT output is also available as O5]
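
The 6-LUT arrangement described above can be sketched in Python as two 32-entry tables plus a multiplexer controlled by A6. The function names and the address bit ordering are assumptions made for illustration only.

```python
def lut5(table32, a5, a4, a3, a2, a1):
    """Look up one bit in a 32-entry table addressed by five inputs."""
    return table32[(a5 << 4) | (a4 << 3) | (a3 << 2) | (a2 << 1) | a1]


def lut6(upper32, lower32, a6, a5, a4, a3, a2, a1):
    """Return (O6, O5): O6 selects between the two 5-LUTs using A6, O5 is the lower 5-LUT."""
    o5 = lut5(lower32, a5, a4, a3, a2, a1)
    upper = lut5(upper32, a5, a4, a3, a2, a1)
    o6 = upper if a6 else o5
    return o6, o5


# Two independent functions of five variables: hold A6 high so O6 shows the upper table.
and5 = [1 if addr == 31 else 0 for addr in range(32)]
xor5 = [bin(addr).count("1") & 1 for addr in range(32)]
print(lut6(and5, xor5, 1, 1, 0, 0, 0, 0))

# One function of six variables: a 6-input AND uses A6 as a real input.
zero5 = [0] * 32
print(lut6(and5, zero5, 1, 1, 1, 1, 1, 1)[0])
```
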
Different Architectures:
Slice Structure with 4 LUTs
• Four six-input Look Up Tables (LUT)
• Wide multiplexers
• Carry chain
• Four flip-flop/latches
• Four additional flip-flops
• The implementation tools (MAP) are responsible for packing slice resources into the slice

[Figure: a slice with four LUT/RAM/SRL elements]

More Detailed Look at Flip Flops
• All flip-flops are D type
• All flip-flops have a single clock input (CLK)
– Clock can be inverted at the slice boundary
• All flip-flops have an active high chip enable (CE)
• All flip-flops have an active high SR input
– Input can be synchronous or asynchronous, as determined by the configuration bit stream
– Sets the flip-flop value to a pre-determined state, as determined by the configuration bit stream

[Figure: D flip-flops with D, CE, CK, and SR inputs and a Q output]

Asynchronous Reset
• To infer asynchronous resets, the reset signal must be in the sensitivity list of the process
• Output takes reset value immediately
– Even if clock is not present
• SRVAL attribute is determined by reset value in RTL code

VHDL:
  FF: process (CLK, RST)
  begin
    if (RST = '1') then
      Q <= '0';          -- SRVAL
    elsif rising_edge(CLK) then
      Q <= D;
    end if;
  end process FF;

Verilog:
  always @(posedge CLK or posedge RST)
  begin
    if (RST)
      Q <= 1'b0;         // SRVAL
    else
      Q <= D;
  end

Using Asynchronous Resets
• Deassertion of reset should be synchronous to the clock
• Not synchronizing the deassertion of reset can create problems
– Flip-flops can go metastable
– Not all flip-flops are guaranteed to come out of reset on the same clock
• Use a reset bridge to synchronize reset to each domain

[Figure: reset bridge. Two flip-flops in series are clocked by clkA; rst_pin drives the SR input of both (SR configured as asynchronous, SRVAL=1); the D input of the first is tied to 0 and the Q output of the second is rst_clkA]
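
The behavior of the reset bridge can be followed with a small cycle-level model. This is a sketch under simplifying assumptions: the class and method names are invented here, the model is purely functional, and metastability (the reason two stages are used) is not modeled.

```python
class ResetBridge:
    """Two flip-flops in series; the async reset presets both to 1, and the first D is tied to 0."""

    def __init__(self):
        self.ff1 = 1
        self.ff2 = 1  # drives rst_clkA

    def set_reset(self, rst_pin):
        """Asynchronous assertion: takes effect immediately, without a clock edge."""
        if rst_pin:
            self.ff1 = 1
            self.ff2 = 1

    def clock_edge(self, rst_pin):
        """One rising edge of clkA; returns the new value of rst_clkA."""
        if rst_pin:
            self.ff1, self.ff2 = 1, 1  # still held in reset
        else:
            self.ff2 = self.ff1        # second stage samples the first
            self.ff1 = 0               # first stage samples the constant 0
        return self.ff2


bridge = ResetBridge()
bridge.set_reset(1)                                   # rst_pin asserted: output goes high at once
after_release = [bridge.clock_edge(0) for _ in range(3)]
print(after_release)                                  # rst_clkA on the first three edges after release
```

In this model rst_clkA rises as soon as rst_pin is asserted but falls only on a clock edge, which is the synchronous deassertion the slide asks for.
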
Synchronous Reset
• A synchronous reset will not take effect until the first active clock edge after the assertion of the RST signal
• The RST pin of the flip-flop is a regular timing path endpoint
– The timing path ending at the RST pin will be covered by a PERIOD constraint on the clock

VHDL:
  FF: process (CLK)
  begin
    if rising_edge(CLK) then
      if (RST = '1') then
        Q <= '0';        -- SRVAL
      else
        Q <= D;
      end if;
    end if;
  end process FF;

Verilog:
  always @(posedge CLK)
  begin
    if (RST)
      Q <= 1'b0;         // SRVAL
    else
      Q <= D;
  end

Chip Enable
• All flip-flops in the 7 series FPGAs have a chip enable (CE) pin
– Active high, synchronous to CLK
– When asserted, the flip-flop clocks in the D input
– When not asserted, the flip-flop holds the current value
• Inferred naturally from RTL code

VHDL:
  FF: process (CLK)
  begin
    if rising_edge(CLK) then
      if (CE = '1') then
        Q <= D;
      end if;
    end if;
  end process FF;

Verilog:
  always @(posedge CLK)
  begin
    if (CE)
      Q <= D;
  end
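
The hold-or-load behavior of the chip enable can be traced edge by edge with a short Python sketch. The function name `ff_next` and the sample input sequences are made up for illustration; only the CE behavior described above is modeled.

```python
def ff_next(q, d, ce):
    """Value of Q after a rising clock edge: load D when CE is asserted, otherwise hold."""
    return d if ce else q


q = 0
d_values = [1, 1, 0, 0]
ce_values = [0, 1, 0, 1]
trace = []
for d, ce in zip(d_values, ce_values):
    q = ff_next(q, d, ce)
    trace.append(q)
print(trace)
```
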
LUTs can also be used as RAM
• Uses the same storage that is used for the look-up table function
• Synchronous write, asynchronous read
– Can be converted to synchronous read using the flip-flops available in the slice
• Various configurations
– Single port: One LUT6 = 64x1 or 32x2 RAM; cascadable up to 256x1 RAM
– Dual port (D): 1 read / write port + 1 read-only port
– Simple dual port (SDP): 1 write-only port + 1 read-only port
– Quad-port (Q): 1 read / write port + 3 read-only ports
• Each port has independent address inputs

Available configurations:
– Single Port: 32x2, 32x4, 32x6, 32x8, 64x1, 64x2, 64x3, 64x4, 128x1, 128x2, 256x1
– Dual Port: 32x2D, 32x4D, 64x1D, 64x2D, 128x1D
– Simple Dual Port: 32x6SDP, 64x3SDP
– Quad Port: 32x2Q, 64x1Q

Block RAMs
(In built Memory)
Single-Port Block RAM
[Figure: Port A of a 36 Kb memory array with inputs ADDRA, DIA (36 bits), WEA (4 bits), and CLKA, and output DOA (36 bits)]

• Single read/write port
– Clock: CLKA
– Address: ADDRA
– Write enable: WEA
– Write data: DIA
– Read data: DOA
• 36-kbit configurations
– 32k x 1, 16k x 2, 8k x 4, 4k x 9, 2k x 18, 1k x 36
• 18-kbit configurations
– 16k x 1, 8k x 2, 4k x 4, 2k x 9, 1k x 18, 512 x 36
• Configurable write mode
– WRITE_FIRST: Data written on DIA is available on DOA
– READ_FIRST: Old contents of RAM at ADDRA are presented on DOA
– NO_CHANGE: DOA holds its previous value (saves power)
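
The three write modes differ only in what appears on DOA during a write cycle. The Python sketch below models that difference for a single port. It is a behavioral illustration, not the primitive's actual interface: the class name, the `clock` method, and the use of a plain integer for data are all assumptions.

```python
class SinglePortRam:
    """Behavioral model of a single-port synchronous RAM with a configurable write mode."""

    def __init__(self, depth, write_mode="WRITE_FIRST"):
        if write_mode not in ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE"):
            raise ValueError(f"unknown write mode: {write_mode}")
        self.mem = [0] * depth
        self.write_mode = write_mode
        self.doa = 0  # output value, updated only on a clock edge

    def clock(self, addra, dia=0, wea=False):
        """One rising edge of CLKA; returns the value now on DOA."""
        old = self.mem[addra]
        if wea:
            self.mem[addra] = dia
            if self.write_mode == "WRITE_FIRST":
                self.doa = dia      # new data appears on the output
            elif self.write_mode == "READ_FIRST":
                self.doa = old      # previous contents appear on the output
            # NO_CHANGE: DOA keeps its previous value during a write
        else:
            self.doa = old          # plain read
        return self.doa


for mode in ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE"):
    ram = SinglePortRam(depth=16, write_mode=mode)
    ram.clock(3, dia=5, wea=True)
    print(mode, ram.clock(3, dia=9, wea=True))
```
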
Summary of Block RAM Configurations
Single Port (1 read/write port; read OR write in 1 cycle)
– 18kbit: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18
– 36kbit: 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36

True Dual Port (two fully independent read/write ports; any two operations in 1 cycle)
– 18kbit: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18
– 36kbit: 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36

Simple Dual Port (1 read port and 1 write port; read AND write in 1 cycle)
– 18kbit: 16Kx1, 8Kx2, 4Kx4, 2Kx9, 1Kx18, 512x36
– 36kbit: 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, 1Kx36, 512x72

SelectI/O
• SelectI/O Allows Connection Directly to External Signals of Varied Voltages & Thresholds
• Future Standards Can be Supported Without Having to Make Silicon Changes

[Figure: supported voltages 5.0V, 3.3V, 2.5V, and 1.8V, and standards PCI, SSTL, HSTL, GTL, GTL+, and AGP]

4 System Interfaces
SelectI/O
• Allows Connection & Use of a Wide Variety of Devices
– Processors, Memory, Bus Specific Standards, Mixed Signal...
• Provides Industry Standard IEEE/JEDEC I/O Standards
• Maximizes Speed/Noise Tradeoff - Use Only What is Needed
• Can Connect to or Create High Performance Backplanes
– PCI, GTL+, HSTL
– DIY - Virtex Based Backplane Design in Progress
• Define I/O by Simply Placing Desired Input And/Or Output Buffers Into the Design
– Special IBUF and OBUF Components Provided in Schematic Based and HDL Based Design Flows
– For Example: SSTL3, Class I Output Buffer - OBUF_SSTL3_I
Simplified IOB Structure
• Fast I/O Drivers
• Separate Registers for Input, Output & Three-State Control
• Asynchronous Set or Reset Available on Each Flip-flop
• Common Clock, Separate Clock Enables
• Programmable Slew Rate, Pullup, Input Delay, Etc
• Selectable I/O Standard Support
• Supported Standards List can be Updated After Testing

[Figure: an IOB with three DFF/LATCH registers (D, CE, S/R, Q) connected to the PAD]

How It Works
[Figure: configuration bits select the I/O standard. SelectI/O Output: OBUF_SSTL3_I configures an SSTL3 Class 1 output driver. SelectI/O Input: IBUF_SSTL3_I configures an SSTL3 Class 1 input receiver]

Xilinx 7 Series

Lowest Power and Cost
Compared to Spartan-6:
– 30% more performance
– Lower system cost
– 50% less power
– 30% smaller footprint

Industry’s Best Price-Performance, “New Class of FPGA”
Compared to Virtex-6:
– Comparable performance with 50% lower cost for 2x better price-performance
– 50% less power
Compared to Spartan-6:
– 3.3x larger
– Over 2x performance with 4x transceiver speed
– Superior price-performance

Industry’s Highest System Performance and Capacity
Compared to Virtex-6:
– 2.5x larger (2M LCs)
– 50% higher performance
– 50% lower power
– 2x line rate (28 Gb/s)
– Similar EasyPath™ cost reduction

7 Series FPGA Layout
• Similar Floorplan to Virtex-6 FPGAs
– Provides easy migration to 7 series FPGAs
• CMT columns moved from center of device to adjacent to I/O columns
– No more inner vs. outer column performance difference
– Support for higher performance interfaces
• Only one I/O column per half device
– Uniform skew from center of device
• GT columns replace I/O and CMT in smaller devices
• GT columns not always present

[Figure: device floorplan showing I/O columns, CMT columns, clock routing, CLB/Block RAM/DSP columns, and GT columns]

7 Series Slice Structure
• Four six-input Look Up Tables (LUT)
• Wide multiplexers
• Carry chain
• Four flip-flop/latches
• Four additional flip-flops
• The implementation tools (MAP) are responsible for packing slice resources into the slice

[Figure: a slice with four LUT/RAM/SRL elements]

7-Series I/O Block Diagram
[Figure: an I/O block pair. Logical resources: ILOGIC/ISERDES, IDELAY, OLOGIC/OSERDES, and ODELAY for both the master (P) and the slave (N) pad, connected to the FPGA fabric through the interconnect. Electrical resources: the P and N pads with LVDS and termination between them]

7 Series FPGAs DSP
• 7 series FPGAs DSP slice 100% based on Virtex-6 FPGA DSP48E1
– 25x18 multiplier
– 25-bit pre-adder
– Flexible pipeline
– Cascade in and out
– Carry in and out
– 96-bit MACC
– SIMD support
– 48-bit ALU
– Pattern detect
– 17-bit shifter
– Dynamic operation (cycle by cycle)

Highly Capable, Dedicated DSP Logic in Every 7 Series FPGA

7-Series Gigabit Transceivers
[Figure: transceiver block diagram. Transmit path: FPGA fabric interface, PCS, PMA, then a 2-wire differential Tx output. Receive path: a 2-wire differential Rx input, PMA, PCS, then the FPGA fabric interface]

• Dedicated parallel-to-serial transmitter and serial-to-parallel receiver
– Unidirectional, differential bit-serial data I/O
– Integrated PLL-based Clock and Data Recovery (CDR)
• Parallel interface to the FPGA internal fabric
– Width varies by family, protocol, and line rate from 8 to 40 bits
• Serial interface to the printed circuit board (differential signaling)
– Differential Current Mode Logic (CML)
– Two traces for the transmitter and two traces for the receiver; removes common-mode noise