Physical Design: Performance, Area, Reliability, Power
Physical design is the process of converting a netlist (.v) into GDSII (layout) form: the logical
connectivity of cells is converted into physical connectivity.
During physical design, all design components are instantiated with their geometric
representations. In other words, all macros, cells, gates, transistors, etc., with fixed shapes
and sizes per fabrication layer, are assigned spatial locations (placement) and have
appropriate routing connections (routing) completed in metal layers.
Physical design directly impacts circuit performance, area, reliability, power, and
manufacturing yield. Examples of these impacts are discussed below.
Due to its high complexity, physical design is split into several key steps (Fig. 1).
Partitioning: breaks up a circuit into smaller sub-circuits or modules, which can each be
designed or analyzed individually.
Floorplanning: determines the shapes and arrangement of sub-circuits or modules, as
well as the locations of external ports and IP or macro-blocks
Power and ground routing (power planning): often intrinsic to floorplanning, distributes
power (VDD) and ground (GND) nets throughout the chip.
Placement: finds the spatial locations of all cells within each block.
Clock network synthesis: determines the buffering, gating (e.g., for power management)
and routing of the clock signal to meet prescribed skew and delay requirements
Global routing: allocates routing resources that are used for connections; example
resources include routing tracks in the channel and in the switch box
Detailed routing: assigns routes to specific metal layers and routing tracks within the
global routing resources.
Timing closure: optimizes circuit performance by specialized placement or routing
techniques
Physical verification:
After physical design is completed, the layout must be fully verified to ensure correct
electrical and logical functionality. Some problems found during physical verification can be
tolerated if their impact on chip yield is negligible. Therefore, at this stage, layout changes are
usually performed manually by experienced design engineers.
Design rule checking (DRC): verifies that the layout meets all technology-imposed
constraints. DRC also verifies layer density for chemical-mechanical polishing (CMP).
Layout vs. schematic (LVS): checking verifies the functionality of the design. From the
layout, a netlist is derived and compared with the original netlist produced from logic
synthesis or circuit design.
Parasitic extraction: derives electrical parameters of the layout elements from their
geometric representations; with the netlist, these are used to verify the electrical
characteristics of the circuit.
Antenna rule checking: seeks to prevent antenna effects, which may damage
transistor gates during manufacturing plasma-etch steps through the accumulation of excess
charge on metal wires that are not connected to a PN-junction node.
Electrical rule checking (ERC): verifies the correctness of power and ground
connections, and that signal transition times (slew), capacitive loads and fan-outs are
appropriately bounded
EDA Tools:
1. P&R
Synopsys: ICC (IC Compiler), ICC2 (IC Compiler II); DC Compiler (synthesis)
Cadence: Encounter, Innovus
2. Timing analysis
Synopsys: PrimeTime
Cadence: Tempus
3. Physical verification
Synopsys: Hercules
Cadence: Assura
Mentor: Calibre (mostly used)
4. RC extraction
Synopsys: StarRCXT
5. Formal verification
Synopsys: Formality
example of netlist:
module and_gate(y,a,b);
input a,b;
output y;
AND2 U1(.Y(y), .A(a), .B(b));
endmodule
CCS vs NLDM:
CCS is like a Norton equivalent circuit; NLDM is like a Thevenin equivalent circuit.
CCS uses a current source for driver modeling; NLDM uses a voltage source.
A CCS file is about 10x larger than an NLDM file because of its larger number of variables.
CCS runtime is longer; NLDM runtime is shorter.
The logical library (.lib) specifies the following units:
time units
voltage units
leakage power units
capacitive load units
slew rate
rise and fall times
For each cell, the following attributes are present:
area of the cell
leakage power
capacitance
rise and fall capacitance
for each pin, its direction and capacitance
Lookup tables are defined for different parameters like cell delay, hold, setup, recovery, and
removal, each with a different matrix (template).
cell_fall (delay_template_6x6) {
index_1 ("0.015, 0.04, 0.08, 0.2, 0.4, 0.8");
index_2 ("0.06, 0.18, 0.42, 0.6, 1.2, 1.8");
values ( \
"0.0606, 0.0624, 0.0744, 0.0768, 0.09, 0.098", \
"0.1146, 0.1152, 0.1164, 0.1212, 0.1314, 0.1514", \
"0.201, 0.2004, 0.2052, 0.2058, 0.2148, 0.2168", \
"0.48, 0.4806, 0.4812, 0.4824, 0.4866, 0.4889", \
"0.6804, 0.6820, 0.6836, 0.6851, 0.6874, 0.6895", \
"0.9504, 0.9504, 0.9504, 0.951, 0.9534, 0.975" \
); }
# index_1 represents the input transition.
# index_2 represents the output load, i.e., the output net capacitance.
Ques: what would be the cell_fall time if input_net_transition is 0.08 and the output load
is 0.6?
Ans: 0.2058
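For values that fall between the table's index points, a timing tool interpolates. A minimal Python sketch of bilinear interpolation over the cell_fall table shown above (rows ordered by increasing input transition; the exact interpolation scheme varies by tool):

```python
from bisect import bisect_right

# cell_fall lookup table from the text: rows = input transition (index_1),
# columns = output load (index_2).
index_1 = [0.015, 0.04, 0.08, 0.2, 0.4, 0.8]
index_2 = [0.06, 0.18, 0.42, 0.6, 1.2, 1.8]
values = [
    [0.0606, 0.0624, 0.0744, 0.0768, 0.09, 0.098],
    [0.1146, 0.1152, 0.1164, 0.1212, 0.1314, 0.1514],
    [0.201, 0.2004, 0.2052, 0.2058, 0.2148, 0.2168],
    [0.48, 0.4806, 0.4812, 0.4824, 0.4866, 0.4889],
    [0.6804, 0.6820, 0.6836, 0.6851, 0.6874, 0.6895],
    [0.9504, 0.9504, 0.9504, 0.951, 0.9534, 0.975],
]

def cell_fall(slew, load):
    """Bilinear interpolation over the delay table (clamped to the grid)."""
    def bounds(axis, x):
        # index of the lower grid point, and fractional position within the cell
        i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
        t = (x - axis[i]) / (axis[i + 1] - axis[i])
        return i, min(max(t, 0.0), 1.0)
    i, ti = bounds(index_1, slew)
    j, tj = bounds(index_2, load)
    a = values[i][j] * (1 - tj) + values[i][j + 1] * tj
    b = values[i + 1][j] * (1 - tj) + values[i + 1][j + 1] * tj
    return a * (1 - ti) + b * ti

print(cell_fall(0.08, 0.6))  # exact grid point -> 0.2058
```

At the exact grid point (input_net_transition 0.08, output load 0.6) this reproduces the 0.2058 answer from the question above; off-grid points are interpolated between the four surrounding entries.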
example of library:
cell(OR2_3) {
area : 6.00
power:
rise_time:
fall_time:
pin (O) {
direction : output;
timing () {
related_pin : "A";
rise_propagation() }
rise_transition()}
function : "(A||B)"; #functionality
max_cap:
min_cap: }
pin (A) {
direction: input;
cap: ;}
Physical Library(.lef) :
Technology File :
Contains the number of metal layers and vias, and their names and conventions.
Design rules for metal layers, like the width of each metal layer and the spacing between two
metal layers.
Metal layer resistance and capacitance, as well as the routing grid.
Unit, precision, color, and pattern of each metal layer and via.
The maximum current density is also present in the tech file.
Contains ERC rules, extraction rules, and LVS rules.
Physical and electrical characteristics of each layer and via.
It contains the nwell, pwell, and metal pitch.
The tech file should be compatible with both the physical and timing libraries.
example of tech file:
/* specify units and unit values*/
technology {
name
unit
operating conditions
routing_rule_models
}
/* define six basic colors used to create display colors*/
[ primary color {
primary_color_attributes
}]
/* define layer specific characteristics including display*/
/* characteristics and layer specific routing design rules*/
/* define layer and data types*/
/* defining vias used in the design */
/* define inter-layer routing design rules */
/* defining cell row spacing rules */
/* defining density rules*/
/* defining via and slot rules*/
/* defining capacitance, resistance and temperature coefficients of the layer*/
TLU+ (Table Look-Up) :
It is a table containing wire capacitance at different net lengths and spacings.
Contains RC coefficients for a specific technology.
TLU+ files are extracted or generated from ITF (contains interconnect details) file
results.
The main functions of these files are:
[a]. Extracted R, C parasitics of metal per unit length.
[b]. These RC parasitics are used for calculating net delays.
[c]. If TLU+ files are not present, these R, C parasitics are extracted from the .ITF files.
[d]. For loading of TLU+ we have to load 3 files: 1. TLU+ 2. Min TLU+ 3. Max TLU+
[e]. A map file maps the layer and via names of the .itf file and .tf files.
Milkyway.tf also contains a parasitic wire model, as TLU+ does. If you specify the TLU+ files
in ICC, then ICC uses the TLU+ files and does not read parasitics from the .tf; if not specified,
by default ICC will use the .tf.
Advantages of TLU+:
1. More accurate.
2. Different TLU+ for different RC corners and scenarios.
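As point [b] says, the per-unit-length R and C from TLU+ feed net-delay estimation. A hedged sketch of one common first-order model (Elmore delay of a distributed wire driving a pin load); the r_per_um and c_per_um numbers are illustrative placeholders, not values from any real ITF:

```python
# Hypothetical per-unit-length parasitics (the kind a TLU+ table provides)
# for one metal layer at one width/spacing corner. Units: ohm/um, fF/um.
r_per_um = 0.8
c_per_um = 0.2

def elmore_net_delay(length_um, load_ff):
    """Elmore delay (ps) of a distributed RC wire driving a lumped pin load."""
    r_wire = r_per_um * length_um   # total wire resistance (ohm)
    c_wire = c_per_um * length_um   # total wire capacitance (fF)
    # Distributed line contributes 0.5*R*C; the load sees the full wire R.
    # ohm * fF = 1e-15 s = fs
    delay_fs = 0.5 * r_wire * c_wire + r_wire * load_ff
    return delay_fs / 1000.0        # fs -> ps

print(round(elmore_net_delay(100.0, 2.0), 3))  # 100 um net, 2 fF pin load
```

Real extractors use far more detailed models (coupling, corners, temperature), but this is the role the TLU+ coefficients play in delay calculation.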
Sanity checks:
These checks ensure that the inputs received from the library team and synthesis team are correct.
If we do not do these checks, they create problems in later stages of the design.
Basically, we check the following input files and make sure that they are complete
and not erroneous:
1. design/netlist checks
2. SDC checks
3. Library checks
Design checks:
Checks whether the current design is consistent.
It checks the quality of the netlist and identifies:
1. Floating pins
2. Multidriven nets
3. Undriven input ports
4. Unloaded outputs
5. Unconstrained pins
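The netlist checks above amount to simple queries over nets and their driver/load pins. A toy Python sketch (the net and instance names are invented for illustration; a real checker would parse the .v netlist):

```python
from collections import defaultdict

# Toy netlist model: each net lists its driver pins and load pins.
nets = {
    "n1": {"drivers": ["U1/Y"], "loads": ["U2/A"]},
    "n2": {"drivers": [], "loads": ["U3/A"]},               # undriven
    "n3": {"drivers": ["U4/Y", "U5/Y"], "loads": ["U6/A"]}, # multidriven
    "n4": {"drivers": ["U7/Y"], "loads": []},               # unloaded
}

def check_netlist(nets):
    """Flag undriven, multidriven, and unloaded nets."""
    issues = defaultdict(list)
    for name, net in nets.items():
        if not net["drivers"]:
            issues["undriven"].append(name)
        elif len(net["drivers"]) > 1:
            issues["multidriven"].append(name)
        if not net["loads"]:
            issues["unloaded"].append(name)
    return dict(issues)

print(check_netlist(nets))
```

Each reported category corresponds to one of the design-check items listed above.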
SDC Checks:
1. If any unconstrained paths exist in the design, the PNR tool will not optimize them, so
these checks are used to report unconstrained paths.
2. Checks whether the clock reaches all the clock pins of the flip-flops.
3. Checks if multiple clocks drive the same registers.
4. Checks for unconstrained endpoints.
Library checks:
It validates the libraries, i.e., it checks the consistency between the logical and physical libraries.
It checks the quality of both libraries.
check_library: This command shows the name of the library, library type & its version, units
of time, capacitance, leakage power, and current. It shows the number of cells missing, the
number of metal or pins missing in the physical and logical library.
ICC2 Command: report_design_mismatch
Floor planning:
Floorplanning is an art in physical design. A good floorplan leads to an ASIC
design with higher performance and optimum area.
Floorplanning can be challenging in that it deals with the placement of I/O pads and macros
as well as the power and ground structure.
Before going for floorplanning, make sure that the inputs used for the floorplan are
prepared properly.
After physical design database creation using the imported netlist and the corresponding library
and technology files, the steps are:
1. Decide core width and height for die size estimation.
2. IO pad sites are created for IO pad placement.
3. Placement of macros.
4. The standard cell rows created for standard cell placement.
5. Power planning (pre routing)
6. Adding physical only cells
Apart from this, the aspect ratio of the core, utilization of the core area, cell orientation, and
core-to-IO clearance are also taken care of during the floorplan stage.
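Step 1 (deciding core width and height) is often a back-of-envelope calculation from the total cell area, target core utilization, and aspect ratio. A sketch with illustrative numbers:

```python
import math

def core_size(total_std_cell_area, macro_area, utilization, aspect_ratio):
    """Estimate core width/height (um) from cell area and target utilization.

    utilization: fraction of the core actually covered by cells (e.g. 0.7).
    aspect_ratio: core height / core width.
    All areas in um^2; the numbers below are illustrative only.
    """
    core_area = (total_std_cell_area + macro_area) / utilization
    width = math.sqrt(core_area / aspect_ratio)
    height = aspect_ratio * width
    return width, height

w, h = core_size(total_std_cell_area=50000.0, macro_area=20000.0,
                 utilization=0.7, aspect_ratio=1.0)
print(round(w, 1), round(h, 1))
```

Lower utilization leaves more room for buffers, clock tree cells, and routing; die size then adds the IO ring and core-to-IO clearance around this core.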
Pad placement:
In ASIC design there are three types of IO pads. Generally, pad placement and pin placement are done by
top-level people. It is critical to the functional operation of an ASIC design to ensure that the
pads have adequate power and ground connections and are placed properly in order to
eliminate electromigration and current-switching related problems.
1. Power
2. Ground
3. Signal
What is electromigration?
Electromigration is the gradual displacement of metal atoms in a conductor, caused by a
high-density electric current flowing in the metal. High-density
current can create voids or hillocks, resulting in increased metal resistance or shorts
between wires, and it can degrade the ASIC performance.
One can determine the minimum number of ground pads required to satisfy the current limit as
Ngnd = Itotal / Imax
where Itotal is the total current drawn by the chip and Imax is the maximum current a single
pad can carry. The required number of power pads is taken equal to the number of ground pads.
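Applying the pad-count formula directly (rounding up, and mirroring the count for power pads as described above); the currents are illustrative:

```python
import math

def pad_counts(i_total_ma, i_max_per_pad_ma):
    """Minimum ground pads Ngnd = Itotal / Imax (rounded up);
    power pads are taken equal to ground pads."""
    n_gnd = math.ceil(i_total_ma / i_max_per_pad_ma)
    return n_gnd, n_gnd  # (ground pads, power pads)

# e.g. 950 mA total chip current, 100 mA rating per pad:
print(pad_counts(950.0, 100.0))  # -> (10, 10)
```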
Current-switching noise is generated when there is a transition between states on metal
layers. This will be covered in the crosstalk topic.
Macro placement:
Macros may be memories, analog blocks. Proper placement of macros has a great impact on
the quality and performance of the ASIC design. Macro placement can be manual or
automatic.
Manual macro placement is more efficient when there are few macros to be placed. Manual
macro placement is done based on the connectivity information of macros to IO pin/pads and
macro to macro. Automatic macro placement is more appropriate if the number of macros is
large.
Types of macros:
Hard macros: the circuit is fixed. We cannot see the functionality information of the
macro; we only know its timing information.
Soft macros: the circuit is not fixed, and we can see the functionality and which types
of gates are used inside it. We also know the timing information.
Lots of iterations happen to get an optimum floorplan. The designer takes care of design
parameters such as power, area, timing, and performance during floorplanning.
outputs of floorplan:
1. get core and boundary area
2. IO ports/pins placed
3. macros placement done
4. floorplan def file
fly lines:
Macros are placed manually using fly lines. Fly lines are virtual connections between
macros, and between macros and IO pads.
They show the designer the logical connections between macros and pads.
Fly lines act as guidelines to the designer to reduce the interconnect length and
routing resources.
fly lines are of two types:
macros to IO pin:
macros to macros fly lines:
create_keepout_margin -type hard/soft -outer {0.5 0.5 0.5 0.5} [get_cells -physical_context -filter
design_type==macro]
Blockages :
1. Blockages are specific locations where the placing of cells is blocked.
2. Blockages do not guide the tool; rather, they forbid it from placing standard cells, buffers,
and inverters in a particular area, i.e., by using blockages we block the area so that
standard cells and other cells won't be placed there.
3. If a macro is moved from one place to another, blockages will not move with it.
4. Blockages are of three types: a) soft b) hard c) partial
soft blockages:
Only buffers can be placed.
Prevents the placement of standard cells and hard macros within the specified area
during coarse placement, but allows placement of buffers/inverters during optimization, legalization,
and clock tree synthesis.
create_placement_blockages -boundary {{10 20} {100 200}} -name PB1 -type soft
or derive_placement_blockage
hard blockages:
No standard cells, macros, or buffers/inverters can be placed within the specified area
during coarse placement, optimization, and legalization.
Used to avoid routing congestion at macro corners.
Controls power rail generation at the macros.
create_placement_blockages -boundary {{10 20} {100 200}} -name PB1 -type hard
partial blockages:
Partial blockages limit the cell density in the specified area.
By default the blockage factor is 100% so no cells can be placed in that region but if
we want to reduce density without blocking 100% area, we can change the blockage factor.
create_placement_blockages -boundary {{10 20} {100 200}} -type partial
-blocked_percentage 40
This creates a partial blockage with a maximum allowed cell density of 60% (blocked percentage is 40),
enclosed by a rectangle with corners (10 20) and (100 200).
To allow unlimited usage of a partial blockage area, set the blockage percentage to
zero.
Tap cells:
A tap cell is a special nonlogic cell with a well tie, substrate tie, or both.
Tap cells are placed at regular intervals in the standard cell rows; the distance between
two tap cells is given in the design rule manual.
These cells are typically used when most or all of the standard cells in the library
contain no substrate or well taps.
Generally, the design rules specify the maximum distance allowed between every
transistor in a standard cell and a well or substrate tap.
Before global placement (during the floorplanning stage), you can insert tap cells in
the block to form a two-dimensional array structure to ensure that all standard cells placed
subsequently comply with the maximum diffusion-to-tap distance limit.
command in ICC2: create_tap_cells, specify the name of the library cell to use for tap
cell insertion (-lib_cell option) and the maximum distance, in microns, between tap cells
(-distance option)
icc2_shell>create_tap_cells -lib_cell myreflib/mytapcell -distance 30
Transistors Q1 (NPN) and Q2 (PNP) are parasitic transistors that get formed during the
manufacturing of a CMOS inverter. If these two parasitic transistors are in the ON condition,
current starts flowing from VDD to VSS and creates a short circuit. While manufacturing these
devices, the designer makes sure that all PN junctions are reverse biased so that no
parasitic transistor turns on, and hence normal operation is not affected. But
sometimes, because of external elements (like input and output), the parasitic
transistors get turned on. There are two scenarios in which a parasitic transistor gets turned on:
1. When the input or output > VDD: the PNP transistor is in the ON condition, because now the P
region is more positive than the N region in the nwell; therefore the base-emitter junction of the PNP
(Q2) transistor is forward biased, and this transistor turns on. Now, as seen
in the figure, the collector of the PNP transistor is connected to the base of the NPN transistor;
because of this connection, current flows from the collector (PNP) to the base (NPN),
and this base current turns the NPN transistor on, so current
flows from VDD to VSS through these two parasitic transistors. This current keeps
flowing even if we remove the external inputs and outputs: the parasitic transistors
form a feedback path in which the current is latched up and creates a short-circuit path.
2. When the input or output < VSS: the NPN transistor is in the ON condition. Now the N region is more
negative than the P region in the P substrate; therefore the base-emitter junction of the NPN (Q1)
transistor is forward biased, and this transistor turns on. Now, as seen in
the figure, the base of the NPN transistor is connected to the collector of the PNP transistor;
because of this connection, current flows from the base (NPN) to the collector (PNP),
and this collector current turns the PNP transistor on, so current
flows from VDD to VSS through these two parasitic transistors. This current keeps
flowing even if we remove the external inputs and outputs: the parasitic transistors
form a feedback path in which the current is latched up and creates a short-circuit path.
In the figure shown above, the values of the Rnwell and Rpsub resistances are quite high. What
happens if these resistances are reduced? The current flowing from the
collector of the PNP transistor will flow through these resistance paths, i.e., the current finds the
low-resistance path from VDD to VSS, the NPN transistor never turns on, and in this
way the latchup problem will not occur.
Reducing the resistance values: tap the nwell to VDD and the P substrate to GND externally.
fig: tapping the VDD with Nwell and VSS with Psub externally
Tie cells:
These are special-purpose cells whose output is constant high or low. Unused inputs need
to be connected to the gates of transistors, and there are only two input logic values, 1 and
logic 0, but we do not connect them directly to the gate of a transistor, because supply glitches
can damage the transistor. So we use tie-high and tie-low cells (these are nothing but resistive
structures that make sure the PG network is connected through them), and the output of these cells is
connected to the gate of the transistor.
There will be floating nets because of unused inputs; they should be tied to some
value, either low or high, to make them stable.
Tie cells can also be inserted manually with the command connect_tie_cells; this command inserts
tie cells and connects them to the specified cell ports.
The gate oxide is very thin and it is very sensitive to voltage fluctuations. If the Gate oxide is
directly connected to the PG network, the gate oxide of the transistor may get damaged due
to voltage fluctuations in the power supply. To overcome this problem tie cells are used.
Tie-high cells: initially we directly connected VDD to the gate of the transistor; now we connect the
output of these cells to the gate of the transistor, so if there is any fluctuation in VDD due to ESD,
the PMOS circuit pulls it back to the stable state. The PMOS should always be ON; the input of the
PMOS transistor comes from the output of the NMOS transistor, and here the NMOS gate and
drain are shorted, which is the saturation condition (NMOS). The NMOS acts as a pull-down
and always gives a low voltage at the gate of the PMOS. The PMOS is then on and gives a stable
high output, and this output is connected to the gate of the transistor.
Tie-low cells: initially we directly connected the supply to the gate of the transistor; now we connect the
output of these cells to the gate of the transistor, so if there is any fluctuation due to ESD
(electrostatic discharge), the NMOS circuit pulls it back down to the constant low stable state.
In the figure, the gate and drain of the PMOS transistor are shorted, so it is in the saturation region
and acts as a pull-up resistor, always giving a high voltage to the gate of the NMOS transistor.
Because of this high voltage the NMOS transistor is ON all the time, and we get a stable
low output because it acts as a pull-down transistor; this output is connected to the gate of
the transistor.
Filler Cells
To ensure that all power nets are connected, you can fill empty space in the standard-
cell rows with filler cells.
Filler cells have no logical connectivity. These cells provide continuity in the rows
for the VDD and VSS nets, and they also contain the substrate/nwell connections to improve substrate
biasing.
Filler cell insertion is often used to add decoupling capacitors, to improve the stability
of the power supply and to fix discontinuities in the power rails.
The IC Compiler II tool supports filler cells with and without metal and supports
both single-height and multi-height filler cells.
fig2: Filler cells, Decap cells
If there is continuity between the nwell and implant layers, it is easier for the foundry people to generate
them, and the creation of a mask is a very costly process, so it is better to use only a single
mask.
If the nwell is discontinuous, the DRC rules will require cells to be placed further apart, i.e., to maintain the
minimum spacing, because of the well proximity effect.
We know the nwell is tapped to VDD and the P substrate is tapped to VSS to prevent the latchup problem. Now,
if there is a discontinuity in the nwell, it will not find well tap cells, so we would have to place well tap
cells explicitly, which would increase the area. Since we have filler cells, there is no
need to place extra well tap cells.
Decap cells:
These are typically poly gate transistors where source and drain are connected to the
ground rails and the gate is connected to the power rails. (fig2)
The gates in the transistors consume most of the (dynamic) power at the active edge
of the clock, at which most of the sequential elements are switching.
No voltage source is perfect, hence glitches are produced on the power line due to the
huge current drawn from the power grid at the active edge of the clock.
We already know that the power and ground rails are made of metal, and metal has
its own resistance and inductance; when current flows through these metal wires there
is some voltage drop.
If the power source is far away from a flip-flop, the chances are that this flop can go
into a metastable state due to this IR drop, because it will not get sufficient voltage to work
properly.
Decap filler cells are small capacitors placed between VDD and GND all over
the layout. When the logic circuit draws a high amount of current, the capacitor provides extra
charge to that circuit; when the logic circuit is not drawing any current, the decap is charged up to
its maximum capacitance.
We know V = IR + L di/dt.
Now if the voltage drop is greater than the noise margin (NM) of the inverter, it might lead to
improper functionality. Let's say NM is 0.4; now 1 - 0.4 = 0.6 V (anything less than 0.6 V will be
considered a 0 by this inverter).
Now let's say the voltage drop is 0.5, i.e., 1 - 0.5 = 0.5 V; this will be considered a 0 and will therefore lead
to improper functionality. We have seen voltage drop > NM, so here decap cells come
into the picture: these cells are capacitors that hold a reservoir of charge and are
placed near the power-hungry cells. Whenever those cells switch, the decap cells start
discharging themselves, provide the required voltage to the particular cells, and keep the
power supply constant for all the elements in the design.
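The reservoir-of-charge idea can be turned into a rough decap sizing estimate: the capacitor must supply the switching charge Q = I·Δt while keeping its own droop Q/C inside the noise budget. A sketch with illustrative numbers (real sizing also accounts for grid inductance, decap ESR, and leakage):

```python
def decap_needed(i_peak_ma, duration_ns, droop_budget_mv):
    """Decap (pF) so that supplying charge Q = I * dt locally keeps the
    droop Q / C within the budget. mA * ns = pC; pC / V = pF."""
    q_pc = i_peak_ma * duration_ns
    return q_pc / (droop_budget_mv / 1000.0)

# e.g. 20 mA drawn for 0.1 ns with a 50 mV droop budget:
print(round(decap_needed(20.0, 0.1, 50.0), 1))  # -> 40.0 pF
```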
With increasing clock frequency and decreasing supply voltage as technology scales,
maintaining the quality of power supply becomes a critical issue. Typically, decoupling
capacitors (de-caps) are used to keep the power supply within a certain percentage (e.g., 10%)
of the nominal supply voltage. De-caps holds a reservoir of charge and are placed close to the
power pads and near any large drivers. When large drivers switch, the de-caps provide
instantaneous current to the drivers to reduce IR drop and Ldi/dt effects, and hence keep the
supply voltage relatively constant. A standard de-cap is usually made from NMOS transistors
in a CMOS process. At the 90nm technology node, the oxide thickness of a transistor is
reduced to roughly 2.0nm. The thin oxide causes two new problems: possible electrostatic
discharge (ESD) induced oxide breakdown and gate tunneling leakage. Potential ESD oxide
breakdown increases the likelihood that an integrated circuit (IC) will be permanently
damaged during an ESD event and hence raises a reliability concern. Higher gate tunneling
leakage increases the total static power consumption of the chip. As technology scales further
down, with a thinner oxide, the result is an even higher ESD risk and more gate leakage. The
standard de-cap design experiences these two problems; for more detail, please visit the link below.
Before placing the standard cells, we can add boundary cells to the block. Boundary
cells consist of end-cap cells, which are added to the ends of the cell rows and around the
boundaries of objects such as the core area, hard macros, blockages, and voltage areas, and
corner cells, which fill the empty space between horizontal and vertical end-cap cells. (fig1)
End-cap cells are typically nonlogic cells such as a decoupling capacitor for the power
rail. Because the tool accepts any standard cell as an end-cap cell, ensure that you specify
suitable end-cap cells.
Boundary cells include end-cap cells placed on the left, right, top, and bottom
boundaries, as well as inside and outside corner cells.
Endcap cells are placed to mark the end of each row and the edges, to avoid damage to the
cells at the end of the row and to avoid incorrect laser wavelength effects during
manufacturing.
Boundary cells protect your design from external signals.
These cells ensure that gaps do not occur between the well and implant layers, and they
prevent DRC violations.
When you insert these at the end of the placement row, they make sure that the
cells are properly integrated into the design and have a clean nwell and are otherwise DRC clean,
so the next block will abut without any issues.
POWER PLANNING
Before 2000, area, delay, and performance were the most important parameters; when anyone
designed a circuit, the main focus was on how little area the circuit occupied on the chip
and what its speed was. Now the situation has changed: performance and speed are a secondary
concern. In all nanometer (deep sub-micron) technologies, power has become the most important
parameter in the design. Almost all portable devices run on battery power. Power
consumption is a very big challenge in modern-day VLSI design as technology
shrinks, because of:
1. Increasing transistors count on small chip
2. Higher speed of operations
3. Greater device leakage currents
Grid structure:
Power planning means providing power to every macro, standard cell, and all other
cells present in the design. Power and ground nets are usually laid out in the metal layers.
In this step, power and ground structures are created for both the IO pads and the core logic. The IO pad power
and ground buses are built into the pad itself and are connected by abutment.
For the core logic there is a core ring enclosing the core, with one or more sets of power and
ground rings. The next consideration is to construct the cell power and ground internal to
the core logic; these are called power and ground stripes, which repeat at regular intervals across
the logic or a specified region within the design. When each of these stripes runs both vertically and
horizontally at a regular interval, this is called a power mesh.
The total number of stripes and the interval distance depend solely on the ASIC core power
consumption. As power consumption (static and dynamic) increases, the spacing between power
and ground straps is reduced (more straps are added) to reduce the overall voltage drop, thereby improving
performance.
In addition to the core power and ground ring, macro power and ground rings need to be
created using vertical and horizontal metal layers. A macro ring encloses one or more macros,
completely or partially with one or more sets of power and ground rings.
It is strongly recommended to check for power and ground connectivity after construction of
the entire PG network.
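One quick sanity check on the PG grid is whether each strap stays under the EM current limit of its metal layer. A sketch with made-up numbers for the EM limit and strap width (real limits come from the technology's design rule manual):

```python
import math

def strap_count(power_w, vdd, j_max_ma_per_um, strap_width_um):
    """How many power straps are needed so each stays under the
    EM current limit of its metal layer. Numbers are illustrative."""
    i_total_ma = power_w / vdd * 1000.0        # total core current (mA)
    i_per_strap_ma = j_max_ma_per_um * strap_width_um
    return math.ceil(i_total_ma / i_per_strap_ma)

# 2 W core at 1.0 V, 5 mA/um EM limit, 4 um wide straps:
print(strap_count(2.0, 1.0, 5.0, 4.0))  # -> 100
```

This is the kind of calculation behind the statement above that strap count and spacing follow from the core power consumption.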
Standard cells are placed in rows; since they are placed side by side, they touch each
other, and the horizontal power routing across the cells is already done, so you don't have to do
anything: it is like a continuous line. That's why we place filler cells if any gap remains after
optimization.
standard cells are placed in the rows
NOTE: as we go to deep sub-micron technologies, the number of metal layers
increases: in 28 nm technology we have only 7 metal layers, while for 7 nm technology we
have 13 metal layers. Also, going from the lower metal layers to the higher
metal layers, the width of the layers increases, so it is better to use the higher layers for the power
lines; they can carry more current.
The calculation depends on the power consumed by chip, total current drawn, current rating
of power pad (A/m), current rating of metal layers, and the percentage of current flowing
through metal layers.
In this step, power rings are formed for the I/O cells, and trunks are created between the core power
rings and the power pads.
Real networks must balance these competing demands, meeting targets of noise and
reliability. The noise goal is typically ±10%; for example, a system with a nominal voltage Vdd=
1.0 V may guarantee the actual supply remains within 0.9-1.1V. The two fundamental sources
of power supply noise are IR drop and L di/dt noise.
IR Drop:
The power supply in the chip is distributed uniformly through the metal layers across the design,
and these metal layers have a finite amount of resistance. When we apply a voltage,
current starts flowing through these metal layers, and some voltage is dropped due to the
resistance of the metal wires and the current. This drop is called IR drop. Because of IR drop, delay
increases and timing is violated; noise increases and performance is
degraded.
The value of an acceptable IR drop is decided at the start of the project, and it is one of
the factors used to determine the derate value. If the value of the IR drop is more than the
acceptable value, it calls for a change in the derate value; without this change, the timing
calculation becomes optimistic.
For example, a design needs to operate at 1.2 volts and has a tolerance (acceptable IR drop
limit) of ±0.4 volts, so we ensure that the voltage across VDD and VSS does not fall below
0.8 V and does not rise above 1.6 V. If the drop is within this limit, it does not affect timing
or functionality.
Power routes generally conduct a lot of current, and due to this, IR drop comes into the
picture, so we go to higher layers to make these routes less resistive. For power routing,
the top metal layers are preferred because they are thicker and offer less resistance.
For example, if there are 15 metal layers in the project then top two or three layers used by
hierarchy people (top level or full chip) and metal 11 and metal 12 used for power planning
purposes.
Static IR drop:
This drop is independent of cell switching. It is calculated using only the metals' own
resistance.
Dynamic IR drop:
This drop is calculated taking the switching of cells into account. When a cell switches at the
active edge of the clock, it requires a large current to turn on, but due to the voltage
drop a sufficient voltage does not reach the particular cell, so the cell may go
into a metastable state, affecting timing and performance.
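Static IR drop along a single strap can be estimated by summing I·R over its segments, where each segment carries the current of all cell taps downstream of it. A sketch with illustrative numbers (real signoff uses a full-grid solver):

```python
def ir_drop_along_strap(r_per_um, tap_current_ma, tap_spacing_um, n_taps):
    """Static IR drop (mV) at the far end of a strap fed from one side,
    with n_taps equal-current taps spaced uniformly along it.
    Segment k carries the current of all taps at or beyond it."""
    r_seg = r_per_um * tap_spacing_um        # ohm per segment
    drop_mv = 0.0
    for k in range(1, n_taps + 1):
        remaining = n_taps - k + 1           # taps fed through segment k
        drop_mv += remaining * tap_current_ma * r_seg
    return drop_mv

# 10 taps of 1 mA each, 50 um apart, on a 0.02 ohm/um strap:
print(round(ir_drop_along_strap(0.02, 1.0, 50.0, 10), 1))  # -> 55.0 mV
```

Note how the drop grows quadratically with the number of taps fed from one side, which is why straps are fed from both ends and meshed.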
Electromigration:
When a high density of current is flowing through metal layers, the atoms (electron) in the
metal layers are displaced from their original position causing open and shorts in the metal
layers. Heating also accelerates EM because higher temperature cause a high number of
metal ions to diffuse.
At older (larger) technology nodes we saw EM mainly on the power and clock metal layers, but
at lower nodes the signal metal layers also need to be analyzed, due to the increased current
density.
Clock nets are more prone to EM because they have high switching activity; this is why
high-drive-strength clock buffers are avoided when building the clock tree.
Dynamic power:
Whenever input signals in the circuit change state with time, some power is dissipated.
Dynamic power is the power required to charge and discharge the load capacitance when the
transistor inputs switch.
When the input switches from 1 to 0, the PMOS transistor (pull-up network) turns ON and
charges the load to VDD. The energy then stored in the capacitance is
E_C = ½ · C_L · V_DD²
When the input switches from 0 to 1, the PMOS transistor turns OFF, the NMOS transistor turns
ON, and the capacitor discharges. The energy stored in the capacitor is dissipated in the
NMOS transistor; no energy is drawn from the power supply in this case.
Depending on the inputs at the transistor gates, one network is on while the other is off,
which is what charges and discharges the capacitor. The inverter is sized for equal rise and
fall times, so in one cycle we have one rising and one falling transition.
On the rising output edge, a charge Q = C·V_DD is required to charge the output node to VDD
(i.e. the capacitor is charged to VDD), and on the falling edge the load capacitance is
discharged to GND.
Now suppose the gate switches at some average frequency f_sw (switching frequency). Over a
time period T, the load is charged and discharged T·f_sw times, so the average power
dissipation is
P_switching = C · V_DD² · f_sw
This is called dynamic power because it arises from the switching of the load. Because most
gates don't switch every clock cycle, it is convenient to express the switching frequency as
an activity factor (α) times the clock frequency f; the power dissipation is then written as
P_switching = α · C · V_DD² · f
Activity factor:
The activity factor is the probability that a circuit node transitions from 0 to 1, because
that is the only time the circuit draws energy from the supply. A clock has an activity
factor α = 1 because it rises and falls every cycle. The activity factor is a powerful and
easy-to-use lever for reducing power: if a circuit can be turned off entirely, the activity
factor and dynamic power go to zero, while a block that is on switches with some nonzero
activity factor. Glitches in the circuit can increase the activity factor.
Techniques to reduce dynamic power:
We know P_switching = α · C · V_DD² · f, so we can:
Reduce α (clock gating, sleep mode)
Reduce C (smaller transistors, shorter wires, smaller fan-out)
Reduce VDD (lower the supply voltage as far as the circuit still works correctly)
Reduce f (lower the frequency where possible, e.g. a power-down mode, without sacrificing
performance beyond an acceptable level)
Dynamic power also includes a short-circuit component. It occurs in CMOS when the input of a
gate switches: for a small duration both the pull-up and pull-down networks conduct, creating
a direct path between VDD and VSS. During this scenario momentary spikes appear in the
current, as shown in the figure below; this current flowing from VDD to VSS is also called
crowbar (cross-bar) current.
This is normally less than 10% of the total power, so it can be estimated by adding 10% to
the switching power. It is directly related to the switching frequency: as the clock
frequency increases, the transition frequency increases, and so does the short-circuit power
dissipation.
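The switching-power formula and the ~10% short-circuit adder above can be sketched in a few lines (the function names and example numbers are illustrative):

```python
def switching_power(alpha, c_load, vdd, freq):
    """P_switching = alpha * C * VDD^2 * f, in watts."""
    return alpha * c_load * vdd ** 2 * freq

def total_dynamic_power(alpha, c_load, vdd, freq, sc_fraction=0.10):
    """Add the usual ~10% short-circuit estimate on top of switching power."""
    return switching_power(alpha, c_load, vdd, freq) * (1.0 + sc_fraction)

# Example: alpha = 0.1, 100 fF effective load, 1.2 V supply, 1 GHz clock
p_sw = switching_power(0.1, 100e-15, 1.2, 1e9)    # 14.4 uW of switching power
p_dyn = total_dynamic_power(0.1, 100e-15, 1.2, 1e9)
```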
Clock gating:
Power dissipation depends strongly on the signal transition activity in a circuit, and the
clock is responsible for most of it: the clock is the brain of the entire system, and
wherever a clock transition takes place, the circuit works synchronously. Some blocks do not
always need the clock; if we disable the clock to such a block, its switching activity is
reduced, the activity factor α goes down, and power dissipation drops.
The inactive registers are not clocked directly; the clock reaches them through an OR gate
controlled by an enable signal. When we know a functional unit is not required, we set the
enable to 1 so that the output of the OR gate is a constant 1: the register values do not
change, and there are no signal transitions inside the functional unit. Switching the clock
off requires additional logic to decide whether the functional unit is needed; this extra
circuitry (the OR gate and the enable generation) may add delay in the clock path, so skew
analysis is required.
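The OR-gate gating described above can be sketched as a tiny boolean model (purely illustrative; production clock-gating cells are latch-based integrated cells to avoid glitches on the enable):

```python
def gated_clock(clk, disable):
    """OR-based clock gate: when disable == 1 the gated clock is stuck at 1,
    so downstream registers see no edges and do not switch."""
    return clk | disable

# With disable = 1, the output is constantly 1 regardless of clk
frozen = [gated_clock(clk, 1) for clk in (0, 1, 0, 1)]
# With disable = 0, the clock passes through unchanged
passthrough = [gated_clock(clk, 0) for clk in (0, 1, 0, 1)]
```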
CLOCK GATING
Glitches:
In reality, every gate has a finite delay, and it is precisely this delay that causes
glitches.
Now suppose the input changes to 111: output O1 becomes 0, while O2 = 1 remains the same.
That is the ideal case, with gate delays of zero.
What happens when we also consider the gate delays? The delays add on both outputs, and
because of the small delay difference, a glitch appears.
So how can it be reduced?
In the balanced realization the outputs are the same and the gate count is the same, but the
glitch power is reduced because the signals arrive at the same time in figure 2 (at the
outputs of the first and second gates), so there is no delay difference. Another advantage is
that the critical path delay is also reduced: in the first figure the output passes through 3
gates, while in the second it takes only 2. So we can reduce glitch power dissipation by
realizing the circuit in balanced form instead of cascaded form.
Static approach: the distribution of power supply voltages is fixed a priori among the
various functional blocks.
First, analyze the circuit: reducing the supply voltage makes the circuit consume less power,
but the circuit may become slower. So find the parts of the circuit that are not critical in
terms of delay; those parts can be made a little slower without touching the overall
performance. Identify those parts and reduce the power supply for them.
Suppose a circuit has three functional modules, and only the central block must run fast
while the other two can run slower. We then use a pair of voltage rails, one low-supply and
one high-supply: the block that is supposed to run faster is fed from the high-supply rail,
and the other two from the low-supply rail, so power is saved.
Here the distribution of the voltages is always fixed, and additional circuitry is required
between different power domains: signals travel from slow blocks to fast blocks and vice
versa, so some voltage translation is required.
Dynamic approach:
We adjust the operating voltage and frequency dynamically to match the performance
requirements.
Modern processors have multiple power modes, just as our laptops have standby, sleep,
hibernate, etc. Previously power was not the issue; now it has become more important, because
everyone wants high performance together with long battery life, i.e. the system should run
for a longer time.
This approach provides flexibility and doesn't limit performance. However, the penalty of
transitioning between two states can be high, as moving from one mode to another takes time,
so it should not be done frequently.
Cells are arranged in rows, with two different voltage domains, high and low. All the cells
that are supposed to deliver high performance are placed in the high-voltage domain, and the
lower-performance cells sit in the low-voltage domain. This is allowed for both macros and
standard-cell voltage alignment. Sometimes a voltage island shares a circuit row with cells
of the other domain, but doing this makes the power routing problem more difficult.
Static power:
Static power is consumed even when a chip is not switching, because transistors leak a small
amount of current. CMOS replaced NMOS processes because the contention current inherent to
NMOS logic limited the number of transistors that could be integrated on one chip; static
CMOS gates have no contention current.
In processes with feature sizes above 180 nm, leakage was typically insignificant except in
very-low-power applications. In 90 nm and 65 nm processes, the threshold voltage has been
reduced to the point that subthreshold leakage reaches ones to tens of nA per transistor,
which is significant when multiplied by millions of transistors on a chip.
In a 45 nm process, the oxide thickness is reduced to the point that gate leakage becomes
comparable to subthreshold leakage unless a high-k dielectric is employed. Overall, leakage
has become an important design consideration in nanometer processes.
LEAKAGE CURRENT PATHS
Subthreshold leakage is caused by the thermal emission of carriers over the potential barrier
set by the threshold voltage. In a real transistor, the current does not cut off abruptly
below the threshold but rather drops off exponentially, as shown in the figure.
When the gate voltage is high (Vgs > Vt), the transistor is strongly ON. When the gate falls
below Vt (Vgs < Vt), the exponential decline in current appears as a straight line on a log
scale. This current increases with temperature, and it also increases as Vt is scaled down
along with the power supply for better performance.
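The exponential roll-off can be sketched with the usual decade-per-subthreshold-swing model (all values here are illustrative; the real swing is roughly 60-100 mV/decade at room temperature):

```python
def subthreshold_current(i_at_vt, vgs, vt, swing=0.1):
    """Exponential subthreshold current (a straight line on a log plot).
    swing is the subthreshold slope in V/decade (0.1 V = 100 mV/decade);
    i_at_vt is the current when Vgs == Vt. Both are illustrative numbers."""
    return i_at_vt * 10 ** ((vgs - vt) / swing)

# Each 100 mV below Vt cuts the current by one decade
i_on_edge = subthreshold_current(1e-6, 0.3, 0.3)   # at threshold: 1 uA
i_leak = subthreshold_current(1e-6, 0.2, 0.3)      # one decade lower: ~100 nA
```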
The p-n junctions between the diffusions and the substrate or well form diodes, as shown in
the figure; the well-to-substrate junction is another diode. The substrate and well are tied
to GND or VDD to ensure these diodes do not become forward-biased in normal operation.
However, a reverse-biased diode still conducts a small amount of current I_D: junction
leakage occurs when a source or drain diffusion region is at a different potential from the
substrate.
POWER GATING:
The easiest way to reduce static current during sleep mode is to turn off the power supply to
the sleeping blocks. This technique is called power gating, as shown in the figure:
POWER GATING
The logic block receives its power from a virtual VDD rail, VDDV. When the block is active,
the header switch transistors are ON, connecting VDDV to VDD.
When the block goes to sleep, the header switch turns OFF, allowing VDDV to float and
gradually sink towards zero. As this occurs, the outputs of the block may take on voltage
levels in the forbidden or unknown range. Isolation gates force the outputs to valid levels
during sleep so that they do not cause problems in downstream logic.
Power gating introduces a number of design issues. The header switch requires careful sizing:
it should add minimal delay to the circuit during active operation and should have low
leakage during sleep mode.
The transition between active and sleep modes takes time and energy, so power gating is only
effective when a block is turned off for long enough. When a block is gated, its state must
either be saved or be reset upon power-up. State-retention registers use a second power
supply to maintain the state; alternatively, the important registers can be saved to memory
so the entire block can be power-gated, and reloaded from memory when power is restored.
Power gating was originally proposed as multiple-threshold CMOS (MTCMOS) because it used
low-Vt transistors for logic and high-Vt transistors for the header and footer switches.
Power gating can be done externally, with a disable input to a voltage regulator, or
internally, with high-Vt header and footer switches. External power gating completely
eliminates leakage during sleep, but it takes a long time and significant energy because the
power network may have hundreds of nF of decoupling capacitance to discharge. On-chip power
gating can use PMOS header switch transistors or NMOS footer switch transistors; NMOS
transistors deliver more current per unit width, so they can be smaller.
On the other hand, if both internal and external power gating are used, it is more consistent
for both methods to cut off VDD.
Fine-grained power gating: applied to individual logic gates, but placing a switch in every
cell has enormous area overhead.
Coarse-grained power gating: the switch is shared across an entire block.
UPF is an IEEE standard, originally developed by members of Accellera. UPF is designed to
capture the power intent of a design at a relatively high level.
UPF scripts describe which power rails should be routed to individual blocks, when blocks are
expected to be powered up or shut down, how voltage levels should be shifted as signals cross
from one power domain to another, and whether measures should be taken to retain register and
memory-cell contents if the primary power supply to a domain is removed.
The backbone of UPF (Synopsys), as well as the similar Common Power Format (CPF, Cadence), is
the Tool Control Language (TCL). For example, the TCL command "create_power_domain" is used
to define a power domain and its characteristics: UPF-aware tools use it to define a set of
blocks in the design that are treated as one power domain, supplied differently from other
blocks on the same chip. The idea behind this type of command is that power-aware tools read
in a description of which blocks in a design can be powered up and down independently.
Content in UPF:
Power domains: a design sometimes has more than one power domain (multi-Vdd design) and
sometimes only a single one. A power domain is a group of elements that share a common set of
power supplies.
Supply rails: distribution of power via supply nets, supply ports, supply sets, and power
states.
Additional protection by special cells: level shifters, isolation cells, power switches,
retention registers.
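A minimal UPF fragment illustrating these pieces might look as follows (block, domain, and net names are hypothetical; consult the IEEE 1801 standard for exact command semantics):

```tcl
# Hypothetical names throughout; commands per IEEE 1801 (UPF)
create_power_domain PD_TOP
create_power_domain PD_CPU -elements {u_cpu}

create_supply_net VDD     -domain PD_TOP
create_supply_net VDD_CPU -domain PD_CPU
create_supply_net VSS     -domain PD_TOP

# Clamp PD_CPU outputs to 0 while the domain is shut down
set_isolation iso_cpu -domain PD_CPU \
    -isolation_power_net VDD -isolation_ground_net VSS \
    -clamp_value 0 -applies_to outputs
```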
Isolation cells:
Isolation cells are used between two domains, one in active mode and one in shutdown mode.
Consider two power domains, D1 in shutdown mode and D2 in active mode. If data passes from D1
to D2, the data received at D2 is not valid. To prevent this, we insert an isolation cell
between the two power domains to clamp a known value at D2; otherwise it would see an unknown
value.
The outputs of a shut-down domain may be floating, which is a problem when other, active
domains take these floating outputs as inputs: it could affect the proper functioning of the
active domain.
These cells are also called "clamp" cells because they convert invalid or floating outputs to
a known value, clamping them to some specified level.
Level shifter:
The chip may be divided into multiple voltage domains, where each domain is optimized for the
needs of certain circuits. For example, a system-on-chip might use a high supply voltage for
memories to ensure cell stability, a medium voltage for the processor, and a low voltage for
IO peripherals. The challenges in using voltage domains include handling the voltage levels
of signals that cross domains and selecting which circuits belong in which domain.
Whenever a signal goes from a low domain to a high domain, the full output swing is not
available at the input of the high domain (and vice versa). In this condition we require a
level shifter to shift the voltage level up or down according to the requirement.
Consider an example with two designs in different power domains, one working on a 1.5 V
supply and the other on 1 V. If a signal passes from the 1.5 V domain to the 1 V domain, the
wrong logic value may be interpreted by the 1 V domain. To prevent this, level shifters are
inserted between the two domains; their main function is to shift the voltage level according
to the requirement of the receiving domain.
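A trivial sketch of the decision (the threshold and names are illustrative; real rules come from the library's valid voltage ranges):

```python
def needs_level_shifter(v_driver, v_receiver, tolerance=0.1):
    """A signal crossing domains needs a level shifter when the two supply
    levels differ by more than the allowed tolerance (volts). The 0.1 V
    default is purely illustrative."""
    return abs(v_driver - v_receiver) > tolerance

print(needs_level_shifter(1.5, 1.0))  # 1.5 V -> 1.0 V domain: shifter required
print(needs_level_shifter(1.2, 1.2))  # same supply: no shifter needed
```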
Retention registers:
To reduce power consumption when devices are not in use, the power domain for those devices
is switched off. When design blocks are switched off or in sleep mode, the data in all
flip-flops within the block is lost. If the designer wants to retain this state, retention
registers are used. A retention register consists of a D flip-flop plus a latch and requires
an always-on supply to retain the data. A retention register therefore needs more area than a
normal flop, because it uses a flip-flop with an always-on supply.
Always-on cells:
These cells stay powered irrespective of where they are placed. Generally, buffers and
inverters are used as always-on cells, and they are also described in the UPF. They can be
special cells or regular buffers. Special cells have their own secondary power supply and can
be placed anywhere in the design.
If they are regular buffers/inverters, they require an always-on supply, which restricts the
placement of these cells to a specific region.
For example, if data needs to be routed through or from a sleeping block's domain to an
active block's domain, and the distance between the two domains is very long or the driven
load is very large, then buffers may be needed to drive the long nets from one domain to the
other.
After completing floorplanning, power planning, and placement of physical-only cells (endcap
cells and tap cells), we check the base DRCs and errors related to floorplanning, such as
vertical spacing errors, horizontal spacing errors, min site row, vertical site row, and
alignment.
After inserting the tap cells, check that the standard-cell placeable area is protected by
tap cells and that the tap cells are placed correctly. Tap cells are typically used when most
or all of the standard cells in the library contain no substrate or well taps. Generally, the
design rules specify the maximum distance allowed between every transistor in a standard cell
and a well or substrate tap.
Advanced nodes often require the insertion of additional tap cells to manage substrate and
well noise. Before placing the standard cells, we also add boundary cells (endcap cells) at
the ends of the cell rows and around the boundaries of objects such as the core and hard
macros, and we check that the endcap cells are in place. Once floorplanning is done, i.e. the
core area is created, the macros are placed, and the power network structure of the design is
decided, it is time to let the tool do standard-cell placement.
Placement:
Placement is the process of finding a suitable physical location for each cell in the block;
the tool determines the location of each standard cell on the die (macro locations were
already fixed during floorplanning).
Placement does not just place the standard cells of the synthesized netlist; it also
optimizes the design.
Various factors come into play, such as the timing requirements of the system, the
interconnect length (and hence the connections between cells), and power dissipation. The
interconnect length depends on the placement solution used, and it is very important in
determining the performance of the system as geometries shrink.
Placement can be driven by different criteria: timing-driven, congestion-driven, or power
optimization.
Placement is performed in two stages: coarse placement and legalization.
Goals of placement:
Timing, power, and area optimization
A routable design
Minimum cell density and pin density (reduce congestion due to cells and pins)
Minimum timing DRCs
Before placement optimization starts, all wire-load models are removed. Placement uses RC
values from the virtual route to calculate timing. The virtual route is the shortest
Manhattan distance between two pins; virtual-route RC values are more accurate than WLM RCs.
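The virtual route described above is just a Manhattan distance; a minimal sketch (pin coordinates are illustrative):

```python
def virtual_route_length(pin_a, pin_b):
    """Shortest Manhattan distance between two pin locations (x, y)."""
    (xa, ya), (xb, yb) = pin_a, pin_b
    return abs(xa - xb) + abs(ya - yb)

# Pins at (10, 20) and (40, 5): |10-40| + |20-5| = 45 units
length = virtual_route_length((10, 20), (40, 5))
```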
Coarse placement:
During the coarse placement, the tool determines an approximate location for each cell
according to the timing, congestion and multi-voltage constraints. The placed cells don’t fall
on the placement grid and might overlap each other. Large cells like RAM and IP blocks act as
placement blockages for standard cells. Coarse placement is fast and sufficiently accurate for
initial timing and congestion analysis.
Legalization:
During legalization, the tool moves the cells to legal locations on the placement grid and
eliminates any overlap between cells. These small changes to cell locations cause the lengths
of the wire connections to change, possibly causing new timing violations. Such violations
can often be fixed by incremental optimization, for example by resizing the driving cells.
Placement constraints provide guidance during placement, placement optimization, and
legalization so that congestion and timing violations are reduced.
1. Placement blockages
2. Placement bounds
3. Density constraint
4. Cell spacing constraint
Placement blockages:
A placement blockage is an area that cells must avoid during placement, optimization, and
legalization. It can be hard or soft.
The ICC tool supports two types of placement blockages:
Keep-out margin
Area-based placement blockage: soft, hard, partial
Keep-out margin: a region around the boundary of fixed cells in a block in which no other
cells may be placed. The width of the keep-out margin on each side of the fixed cell can be
the same or different. Keeping the placement of cells out of such regions avoids congestion
and net detouring and produces better QoR (quality of results).
Hard blockage: prevents standard cells from being placed in the blockage area.
create_placement_blockage -boundary {10 20 100 200} -name pb0. By default a blockage is hard.
Soft blockage: during optimization, buffers/inverters can still be placed in the blockage
area.
create_placement_blockage -boundary {10 20 100 200} -name pb1 -type soft
Note: if both blockages are present at the same place, the hard blockage takes priority over
the soft placement blockage.
Placement bounds:
A placement bound is a constraint that controls the placement of groups of leaf cells and
hierarchical cells. It allows you to group cells to minimize wire length and place them at
the most appropriate locations. When timing is critical during placement, we create bounds in
areas where two communicating cells sit far from each other. A bound is a fixed region in
which we place a set of cells; it comprises one or more rectangular or rectilinear shapes,
which can be abutted or disjoint. In general, we specify the cells and ports to be included
in the bound; if a hierarchical cell is included, all cells in its sub-design belong to the
bound.
Types of bounds:
1. Soft move bound
2. Hard move bound
3. Exclusive move bound
Power optimization:
Dynamic power: the energy dissipated due to voltage or logic transitions on design objects
such as cells, pins, and nets. Dynamic power consumption is directly proportional to the
number and frequency of transitions in the design.
Static (leakage) power: the energy dissipated even when there is no transition in the
circuit. It depends on device characteristics; the main contributor is the sub-threshold
leakage in the devices. At lower technology nodes, leakage power contributes significantly to
the total power consumption of the circuit.
During low-power placement, the tool tries to minimize the length of high-switching nets to
improve the power QoR.
During dynamic-power-driven placement, the tool tries to improve both the timing of the
critical nets and the power QoR without affecting the timing QoR.
place_opt: this command performs coarse placement, HFNS (high-fanout net synthesis),
optimization, and legalization. In the place_opt command, the -congestion option causes the
tool to apply high effort to congestion removal for better routability; this requires more
runtime and makes area utilization less uniform across the available placement area.
refine_opt: if congestion is found to be a problem after placement and optimization, it can
be improved incrementally with the refine_opt command.
Magnet placement:
To improve congestion for a complex floorplan, or to improve timing for the design, we can
use magnet placement: specify a fixed object as a magnet and have the tool place all the
standard cells connected to the magnet object close to it. We can use macro cells, pins of
fixed macros, or IO ports as the magnet object.
For best results, perform magnet placement before standard-cell placement.
Command: magnet_placement
DFT optimization:
If the block contains scan chains, the create_placement, place_opt, and clock_opt commands
perform DFT optimization by default. During initial placement, the tool focuses on the QoR of
the functional nets, ignoring the scan chains. After initial placement, the tool further
improves the QoR by repartitioning and reordering the scan chains based on the initial
placement.
Scan-chain reordering reduces wire length, which improves timing, and it minimizes congestion
and improves routability.
The scan-chain information (SCANDEF) from synthesis can be transferred to the ICC compiler in
two ways:
By loading the netlist in DDC format
By loading a SCANDEF file.
Optimization techniques:
Netlist restructuring only changes existing gates; it does not change functionality.
1. Cloning
2. Gate duplication
3. Gate sizing
4. Pin swapping, which can change the final delay
5. Fan-out splitting
Congestion:
Congestion occurs when the number of available routing resources is less than the number of
required routing resources; this condition is seen at global routing. A congestion map can
help us visualize the quality of placement. The congestion map highlights the borders between
global routing cells with different colors representing different levels of overflow (the
overflow and underflow of all selected layers). For example, if an edge between global
routing cells is highlighted light blue and shows 10/9, there are 9 available wire tracks but
10 tracks are required.
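The 10/9 reading above is demand over capacity; a small sketch of the overflow arithmetic (function names are illustrative):

```python
def grc_overflow(required, available):
    """Overflow on a global-routing-cell edge: demand minus capacity.
    A positive value (e.g. 10/9 -> 1) marks a congested edge."""
    return max(0, required - available)

def congestion_ratio(required, available):
    """Demand divided by capacity; values above 1.0 indicate congestion."""
    return required / available

print(grc_overflow(10, 9))       # 1 track short on this edge
print(congestion_ratio(10, 9))   # ~1.11, i.e. congested
```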
If the design is congested, we rerun place_opt with the -congestion and -effort high options.
During congestion-driven placement, the cells that sit together and cause the congestion are
spread apart.
place_opt -congestion -effort high
If the pin density is high, it can be reduced by adding cell padding to the cells that cause
the congestion. Cell padding can be applied with the keep-out margin command:
create_keepout_margin -type soft -outer {10 10 10 10} my_lib_macro
Macro padding (placement halos), soft blockages, and hard blockages around the macros act
like placement blockages around the macro edges. This makes sure that no standard cells are
placed near the pins and corners of macros, thereby giving extra breathing space for the
macro pin connections to standard cells.
We can also change the floorplan (macro placement, macro spacing, and pin orientation).
After placement we have the positions of all standard cells and macros, but during placement
the clock is still ideal (for simplicity, we assume we are dealing with a single clock for
the whole design). At the placement-optimization stage, buffer insertion, gate sizing, and
other optimization techniques are used only for data paths; nothing in the clock path is
changed.
CTS is the process of connecting the clock to all clock pins of the sequential cells, using
inverters/buffers, in order to balance the skew and minimize the insertion delay. All the
clock pins are driven by a single clock source. Clock balancing is important for meeting all
the design constraints.
Output of CTS:
CTS def
Latency and skew report
Clock structure report
Timing QoR report
CTS target:
Skew
Insertion delay
CTS goal:
Max Tran
Max cap
Max fan-out
A buffer tree is built to balance the loads and minimize skew; there are levels of buffers in
the clock tree between the clock source and the clock sinks.
Effects of CTS:
Clock buffers are added, congestion may increase, non-clock cells may be moved to less ideal
locations, and timing and tran/cap violations may be introduced.
When building the clock tree, the tool identifies two types of clock endpoints:
Sink pins (balancing pins):
Sink pins are the clock endpoints used for delay balancing. The tool assigns an insertion
delay of zero to all sink pins and uses this delay during delay balancing. During CTS, the
tool uses sink pins in calculations and optimizations for both design rule constraints and
clock tree timing (skew and insertion delay).
Ignore pins:
These are also clock endpoints, but they are excluded from clock-tree timing calculations and
optimizations; the tool uses ignore pins only in calculations and optimizations for design
rule constraints.
During CTS, the tool isolates ignore pins from the clock tree by inserting a guide buffer
before the pin. Beyond an ignore pin, the tool never performs skew or insertion-delay
optimization, but it does perform design rule fixing.
Output ports: clock endpoints that are treated as ignore pins.
Float pins: like stop pins, but with extra delay on the clock pin (macro internal delay).
Exclude pins: CTS ignores the skew and latency targets for these pins and only fixes the
clock tree DRCs (CTS goals).
Non-stop pins: pins through which clock tree tracing continues, against the default behavior.
Clock pins of sequential elements that generate divided clocks are considered non-stop pins.
Why clock routes are given more priority than signal nets:
The clock is propagated after placement because the exact locations of cells and modules are
needed to estimate accurate delay, skew, and insertion delay. The clock is propagated before
the signal nets are routed, and the clock is the signal net that switches most frequently,
acting as a major source of dynamic power dissipation.
NOTE: in post-placement optimization after the CTS stage, the hold slack is improved; as a
result of CTS, a lot of buffers are added.
Skew, Latency, Uncertainty, Jitter
Skew:
Skew is a phenomenon of synchronous circuits: the difference in arrival of the clock at two
consecutive pins of a sequential element.
Sources of skew:
Wire interconnect length
Capacitive loading mismatch
Material imperfections
Temperature variations
Differences in input capacitance on the clock inputs
Types of clock skew:
Positive skew: the capture clock arrives later than the launch clock.
Negative skew: the capture clock arrives earlier than the launch clock.
Zero skew: the capture and launch clocks arrive at the same time (ideally; not achievable
in practice).
Local skew: the difference in arrival of the clock at two consecutive pins of a sequential
element; it can be positive or negative.
Global skew: the difference between the max insertion delay and the min insertion delay.
Max insertion delay: the delay the clock signal takes to propagate to the farthest leaf cell
in the design.
Min insertion delay: the delay the clock signal takes to propagate to the nearest leaf cell
in the design.
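The skew definitions above reduce to simple arithmetic over measured insertion delays; a small sketch (the arrival times are illustrative):

```python
def global_skew(insertion_delays):
    """Global skew = max insertion delay - min insertion delay."""
    return max(insertion_delays) - min(insertion_delays)

def local_skew(arrival_launch, arrival_capture):
    """Local skew between the clock pins of a launch/capture flop pair;
    positive when the capture clock arrives later."""
    return arrival_capture - arrival_launch

# Clock arrival times at four sinks, in ns
skew = global_skew([1.10, 1.25, 1.18, 1.32])   # 1.32 - 1.10 = 0.22 ns
```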
Latency: the delay from the clock generation point to the clock endpoints.
Source latency: the delay from the clock source (e.g. the oscillator or PLL output) to the
clock definition point (create_clock).
Network latency: the delay from the clock definition point (create_clock) to the flip-flop
clock pins.
The total clock latency at the clock pin of a flip-flop is the sum of the source and network
latencies.
set_clock_latency 0.851 -source -min [get_clocks clk_name2] -----> min source latency
set_clock_latency 1.322 -source -max [get_clocks clk_name2] -----> max source latency
One important distinction between source and network latency is that once a clock tree is
built for the design, the network latency can be ignored; the source latency, however,
remains even after the clock tree is built.
The network latency is an estimate of the clock tree delay before clock tree synthesis. After
clock tree synthesis, the total clock latency from the clock source to a clock pin of a
flip-flop is the source latency plus the actual delay of the clock tree from the clock
definition point to the flip-flop.
Clock uncertainty: the difference between the arrivals of clocks at registers in one clock
domain or between domains. It can be classified as static or dynamic clock uncertainty.
Static clock uncertainty: does not vary (or varies very slowly) with time, e.g.
process-variation-induced clock uncertainty; clock skew is an example.
Dynamic clock uncertainty: varies with time; power-supply-induced delay variation and clock
jitter are examples.
Jitter: the short-term variation of a signal with respect to its ideal position in time. It is the variation of the clock period from edge to edge, which can vary by +/- the jitter value. From cycle to cycle, the period and duty cycle can change slightly due to the clock generation circuitry. This can be modeled by adding uncertainty regions around the rising and falling edges of the clock waveform.
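The effect of jitter/uncertainty on the clock period can be sketched numerically (the 1 ns / 100 ps figures are just an example):

```python
# A 1 ns clock with 100 ps total clock uncertainty: each active edge may land
# +/- 50 ps around its ideal position. Numbers are illustrative.
period_ns = 1.0
uncertainty_ns = 0.100  # total uncertainty window around an edge

earliest_next_edge = period_ns - uncertainty_ns / 2
latest_next_edge = period_ns + uncertainty_ns / 2
# Pessimistic setup view: the whole uncertainty is subtracted from the period.
effective_setup_period = period_ns - uncertainty_ns

print(round(earliest_next_edge, 2), round(latest_next_edge, 2))  # 0.95 1.05
print(round(effective_setup_period, 2))  # 0.9
```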
Sources of jitter:
Internal circuitry of the PLL
Thermal noise in crystal oscillators
Transmitters and receivers of resonating devices
NOTE:
The first important point is that there are two phases in the design flow with respect to the clock signal. In the first phase, i.e. during RTL design, synthesis, and placement, the clock is ideal: an ideal clock has no distribution tree and is connected directly, at the same time, to all flip-flop clock pins.
The second phase comes when CTS inserts clock buffers to build the clock tree that carries the clock signal from the clock source pin to all flip-flop clock pins. After CTS finishes, the clock is called a "propagated clock".
Clock latency is the term used while the clock is in ideal mode: the delay assumed from the clock source to the clock pin of the flip-flop, specified by the user (not a real, measured value).
When the clock is in propagated mode, the actual delay comes into the picture and is called insertion delay. Insertion delay is a real, measured delay through the tree of buffers. Sometimes the clock latency is interpreted as a desired target value for the insertion delay.
Clock uncertainty: in ideal mode we assume the clock arrives at all flip-flops at the same time, but in reality it does not; the clock may arrive at different times at different flip-flop clock pins, so in ideal mode we assume some uncertainty. For example, a 1 ns clock with 100 ps clock uncertainty means the next clock pulse will occur after 1 ns ± 50 ps (either + or -).
The question, then, is why the clock does not always arrive exactly after one clock period.
For crosstalk and useful skew we need the basics of setup and hold timing, so here are some small timing concepts that will be used for crosstalk and useful skew.
Setup time: the minimum time before the active edge of the clock for which the input data should be stable, i.e. data should not change during this time.
Hold time: the minimum time after the active edge of the clock for which the input data should be stable, i.e. data should not change during this time.
Capture edge: the edge of the clock at which data is captured by the capture flip-flop.
Launch edge: the edge of the clock at which data is launched by the launch flip-flop.
For setup not to be violated, the required time should be greater than the arrival time.
Crosstalk noise: noise refers to an undesired or unintentional effect between two or more signals that affects the proper functionality of the chip. It is caused by capacitive coupling between neighboring signals on the die. In deep submicron technologies, noise plays an important role in the functionality and timing of a device for several reasons.
An increasing number of metal layers. For example, 28 nm has 7 or 8 metal layers, while 7 nm has around 15 metal layers.
A vertically dominant metal aspect ratio: at smaller technology nodes wires are thin and tall, whereas at older nodes they were wide and thin, so a greater proportion of the capacitance is sidewall capacitance, which maps into wire-to-wire coupling between neighboring wires.
Higher routing density due to finer geometry, meaning more metal is packed in close physical proximity.
A large number of interacting devices and interconnects.
Faster waveforms due to higher frequencies: fast edge rates cause larger current spikes as well as greater coupling impact on neighboring nets.
Lower supply voltage: because the supply voltage is reduced, it leaves a smaller margin for noise.
The switching activity on one net can affect the coupled signal. The affected signal is called the victim, and the affecting signals are termed aggressors.
There are two types of noise effects caused by crosstalk.
Glitch: when one net is switching and another is constant, the switching signal may cause spikes on the other net because of the coupling capacitance (Cc) between the two nets; this is called crosstalk noise.
In the figure, a positive glitch is induced by crosstalk from a rising edge on the aggressor net. The magnitude of the glitch depends on various factors.
Coupling capacitance between the aggressor and victim nets: the greater the coupling capacitance, the larger the glitch.
Slew (transition) of the aggressor net: the faster the transition, the larger the glitch; a fast transition results from high output drive strength.
If the victim net's grounded capacitance is small, the glitch magnitude will be large.
If the victim net's drive strength is small, the glitch magnitude will be large.
Types of glitches:
Rise: when the victim net is low (constant 0) and the aggressor net has a rising edge.
Fall: when the victim net is high (constant 1) and the aggressor net has a falling edge.
Overshoot: when the victim net is high (constant 1) and the aggressor net has a rising edge.
Undershoot: when the victim net is low (constant 0) and the aggressor net has a falling edge.
Crosstalk delay: when both nets are switching (in transition), the transition on the victim may be delayed or advanced due to the coupling capacitance (Cc) between the two nets; this is called crosstalk delay.
Crosstalk delay depends on the switching directions of the aggressor and victim nets; depending on them, the victim's transition becomes either slower or faster.
Types of crosstalk:
1. Positive crosstalk: the aggressor net has a rising transition at the same time as the victim net's falling transition. The aggressor switching in the opposite direction increases the delay of the victim. Positive crosstalk impacts the driving cell as well as the net (interconnect): the delay of both increases because more charge is required for the coupling capacitance Cc.
2. Negative crosstalk: the aggressor net has a rising transition at the same time as the victim net's rising transition. The aggressor switching in the same direction decreases the delay of the victim. Negative crosstalk impacts the driving cell as well as the net (interconnect): the delay of both decreases because less charge is required for the coupling capacitance Cc.
Crosstalk effect on timing analysis:
If the aggressor transitions in the same direction as the victim on a data path, the victim transition becomes faster, data arrives early, and the arrival time decreases.
If the aggressor transitions in the same direction as the victim on a clock path, the clock arrives early and the required time decreases.
If the aggressor transitions in the opposite direction to the victim on a clock path, the clock arrives late and the required time increases.
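These shifts can be illustrated with a toy slack calculation (all delay numbers are invented for illustration):

```python
# Illustrative-only sketch of how aggressor switching shifts the setup slack;
# the delta values are made up, not from any tool.
arrival_ns = 3.0    # nominal data arrival time at the capture flop
required_ns = 3.5   # nominal required time set by the capture clock

# Aggressor opposite to the victim data transition: victim slows, data is late.
arrival_ns += 0.2
# Aggressor opposite to the victim clock transition: clock is late, required time grows.
required_ns += 0.1

setup_slack = required_ns - arrival_ns  # positive => setup still met
print(round(setup_slack, 2))  # 0.4
```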
Useful skew:
When clock skew is intentionally added to meet timing, we call it useful skew.
Slack = 11 ns - 12 ns = -1 ns (setup violation)
Let's introduce some clock skew on the path from FF1 to FF2 to meet timing. Here we add 2 ns of extra skew in the clock path, but we have to make sure the next path's timing is not violated.
Slack = 13 ns - 12 ns = +1 ns
Slack = 11 ns - 10 ns = +1 ns
slew_lower_threshold_pct_rise : 20.0;
slew_upper_threshold_pct_rise : 80.0;
slew_upper_threshold_pct_fall : 80.0;
slew_lower_threshold_pct_fall : 20.0;
Rise time: the time required for a signal to transition from 20% of its maximum value (VDD) to 80% of its maximum value.
Fall time: the time required for a signal to transition from 80% of its maximum value (VDD) to 20% of its maximum value.
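Under the 20%/80% thresholds above, a rise time measurement can be sketched like this, assuming an idealized linear ramp:

```python
# Measuring rise time between the 20% and 80% thresholds, assuming an
# idealized linear ramp from 0 to VDD over 1 ns (purely illustrative).
t0, t1 = 0.0, 1.0  # ramp start/end times in ns

def crossing_time(frac):
    # Time at which the linear ramp reaches frac * VDD.
    return t0 + frac * (t1 - t0)

rise_time = crossing_time(0.80) - crossing_time(0.20)
print(round(rise_time, 2))  # 0.6
```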
Propagation delay: the time required for the output to change state (0 to 1 or 1 to 0) in response to a change at the inputs.
The clock net is a high-fanout net and the most active signal in the design. Clock buffers are mainly used to distribute the clock and build the clock tree. The main goal of CTS is to meet skew and insertion delay targets, and for this we insert buffers in the clock path. If a buffer has different rise and fall times, it distorts the duty cycle; the tool can still do skew optimization under this condition, but it complicates the whole optimization process, as the tool has to deal with a different clock duty cycle at different flop paths. If the buffer delays are equal, the only thing the tool has to do is balance the delay by inserting buffers.
Clock buffers are designed with special properties such as high drive strength, equal rise and fall times, low delay, and low delay variation across PVT and OCV. Because a clock buffer has equal rise and fall times, the duty cycle of the clock signal is preserved as it passes through a chain of clock buffers.
A perfect clock tree gives minimum insertion delay and a 50% duty cycle for the clock. The clock can maintain a 50% duty cycle only if the rise and fall delays and transitions of the tree cells are equal.
How do we decide whether to use buffers or inverters to build the clock tree in the clock tree synthesis stage? The decision depends entirely on the libraries being used. The main factors considered when choosing an inverter or a buffer are the rise delay, fall delay, drive strength, and insertion delay (latency) of the cell. In most library files a buffer is a combination of two inverters, so an inverter will have a smaller delay than a buffer of the same drive strength. Inverters also have more drive capability than buffers, which is why most libraries prefer inverters over buffers for CTS.
Clock buffers sometimes have input and output pins on higher metal layers, so far fewer vias are needed in the clock distribution root; a normal buffer has its pins on lower metal layers like metal1. Some libraries also have clock buffers with input pins on higher metal layers and output pins on lower metal layers. Clock routing is normally done on higher metal layers than signal routing, so to provide easier access to the clock pins from these layers, clock buffers may have pins on higher metal layers, while normal buffer pins may be on lower metal layers.
Clock buffers are balanced, i.e. their rise and fall times are almost the same. If they are not equal, duty cycle distortion occurs in the clock tree, and minimum pulse width violations come into the picture. In a clock buffer the PMOS is sized larger than the NMOS.
Normal buffers, on the other hand, do not have equal rise and fall times. In other words, they do not need a PMOS/NMOS size ratio of about 2:1, i.e. the PMOS does not need to be bigger than the NMOS. Because of this, a normal buffer is smaller than a clock buffer, and a clock buffer consumes more power.
The advantage of an inverter-based tree is that it gives equal rise and fall transitions, so duty cycle jitter gets canceled out and we get symmetrical high and low pulse widths.
A buffer contains two inverters of unequal area and unequal drive strength: the first inverter is small with low drive strength, and the second inverter is large with high drive strength, connected back to back as shown in the figure below.
So the loads of these two inverters are unequal. The net between the two back-to-back inverters is short, so its wire capacitance is small and can be neglected; but the net to the next stage is longer, so its capacitance (wire capacitance plus the next inverter's input pin capacitance) is larger, and we get unequal rise and fall times. Jitter therefore gets added in the clock tree, with the additional cost of more area than an inverter.
In an NMOS the majority carriers are electrons, while in a PMOS they are holes, and electrons are much faster than holes.
Since electron mobility is greater than hole mobility, the PMOS width must be larger to compensate and make the pull-up network stronger. If the W/L of the PMOS were the same as that of the NMOS, the charging time of the output node would be longer than the discharging time, because discharging is handled by the stronger pull-down network.
So we make the PMOS larger so that we get equal rise and fall times.
Normal buffers are designed with W/L ratios such that the sum of the rise and fall times is minimum. To equalize the resistance of the two transistors, the PMOS is sized larger than the NMOS.
The duty cycle of a clock:
It is the fraction of one clock period during which the clock signal is in the high (active) state. A period is the time it takes the clock signal to complete one on-and-off cycle. Duty cycle (D) is expressed as a percentage (%).
High pulse width = half clock period - Σ(rise delay - fall delay)
= 0.5 - (0.055-0.048) - (0.039-0.032) - (0.025-0.022) - (0.048-0.043) - (0.058-0.054) = 0.474 ns
Low pulse width = half clock period + Σ(rise delay - fall delay)
= 0.5 + (0.055-0.048) + (0.039-0.032) + (0.025-0.022) + (0.048-0.043) + (0.058-0.054) = 0.526 ns
Here we can see that we get a minimum pulse width violation for the high pulse, as the total high pulse width is less than the required value. If uncertainty were not considered, no violation would occur in this scenario.
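The pulse-width arithmetic above can be verified with a short script (same numbers, 1 ns clock period assumed):

```python
# Recomputing the pulse widths from the per-stage (rise - fall) delay
# mismatches along the clock path; a 1 ns clock period is assumed.
half_period = 0.5
rise = [0.055, 0.039, 0.025, 0.048, 0.058]
fall = [0.048, 0.032, 0.022, 0.043, 0.054]

mismatch = sum(r - f for r, f in zip(rise, fall))  # total duty-cycle distortion
high_pulse = half_period - mismatch
low_pulse = half_period + mismatch
print(round(high_pulse, 3), round(low_pulse, 3))  # 0.474 0.526
```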
Two types of minimum pulse width checks are performed:
Clock pulse width check at sequential devices
Clock pulse width check at combinational circuits
How to report:
report_timing -check_type pulse_width
How to define pulse width:
By liberty file (.lib): by default, all registers in the design have a minimum pulse width defined in the .lib file, as this is the format used to convey standard cell requirements to the STA tool. By convention, the min pulse width is defined for the clock signal and reset pins.
Command name: min_pulse_width
NOTE:
Balanced buffers are buffers with equal rise and fall times.
Unbalanced buffers are buffers with unequal rise and fall times.
Several algorithms exist that try to achieve this goal (minimizing skew):
H-Tree
X-Tree
Method of Mean and Median
Geometric Matching Algorithms
Zero skew clock routing
The first four techniques try to minimize wire length, while the last one uses the actual interconnect delay to drive the skew to zero.
H-Tree
In this algorithm, clock routing takes place in the shape of the English letter H.
In the H-tree-based approach, the distance from the clock source point to each of the clock sink points is always the same.
In the H-tree approach, the tool tries to minimize skew by making the interconnections to subunits equal in length.
This type of algorithm is used where all the clock terminal points are arranged symmetrically, as gate arrays are in FPGAs.
In fig (a), all the terminal points are exactly 7 units from the reference point P0, and hence the skew is zero if interconnect delays are not considered.
It can be generalized to 4^i sinks: going down each level, the number of terminals grows as 4, 16, 64, and so on, regularly placed across the chip in an H structure.
fig: H tree with 16 sink points
In this routing algorithm, all wires are connected on the same metal layers; we do not need to switch between horizontal and vertical directions across two layers.
The H tree does not produce corners sharper than 90°, and no clock terminals in the H-tree approach are in close proximity as in the X tree.
Advantages:
Exact zero skew in terms of distance (ignoring parasitic delay) due to the symmetry of the H tree.
Typically used for very special structures like top-level clock distribution, which is then distributed to the different clock sinks, not for the entire clock tree.
Disadvantages:
Blockages can spoil the symmetry of the H tree, because blockages are sometimes present on the metal layers.
Non-uniform sink locations and varying sink capacitances also complicate the design of the H tree.
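A minimal sketch of how H-tree sink points multiply by four per level; it only generates sink coordinates (all equidistant from the root by symmetry), not the actual wire segments:

```python
# Recursively generate H-tree sink points: each level quadruples the number
# of sinks (4, 16, 64, ...), placed regularly around the parent center.
def h_tree_sinks(cx, cy, half, level):
    """Return sink coordinates of an H-tree centered at (cx, cy)."""
    if level == 0:
        return [(cx, cy)]
    sinks = []
    for dx in (-half, half):        # horizontal bar endpoints
        for dy in (-half, half):    # vertical bar endpoints
            sinks += h_tree_sinks(cx + dx, cy + dy, half / 2, level - 1)
    return sinks

print(len(h_tree_sinks(0, 0, 4, 1)))  # 4
print(len(h_tree_sinks(0, 0, 4, 2)))  # 16
```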
X-tree
If routing is not restricted to being rectilinear, there is an alternative tree structure with smaller delay that we can use. The X tree also ensures that the skew is zero.
The X-tree routing algorithm is similar to the H-tree; the only difference is that the connections are not rectilinear in the X-tree-based approach.
Although it is better than the H tree, it may cause crosstalk due to the close proximity of wires.
Like the H tree, it is applicable for a top-level tree, which then feeds the next-level trees.
Disadvantages:
Both the H-tree and X-tree approaches are basically designed for a four-ary tree structure: each node connects to 4 nodes in the next stage, so the number of terminal points (sinks) grows as a power of 4 (4, 16, 64, and so on).
These two methods basically do not consider the exact locations of the clock terminals; they create the clock tree independently, producing a regular array of sink locations across the surface of the chip.
Other approaches, however, do not ignore the exact locations of the actual clock terminal points. The question is what these approaches do with the exact locations: they look at where the clock needs to be delivered and systematically build a tree, which does not look like an H tree or X tree.
The Method of Mean and Median follows a strategy similar to the H-tree algorithm, but it can handle sink locations anywhere we want.
Step 1: It continuously partitions the set of terminals into two subsets of equal size (the median), as in the figure.
Step 2: It connects the center of mass of the whole set (module) to the centers of mass of the two partitioned subsets (the mean).
Let Lx denote the list of clock points sorted according to their x-coordinates.
Next we do a horizontal partition, where we partition the set of points into two sets Pb and Pt.
This algorithm ignores blockages and produces a non-rectilinear (not regularly spaced) tree; some wires may also intersect each other.
It is a top-down approach, as we keep partitioning until each partition consists of a single point.
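The two MMM steps can be sketched as a recursive routine (coordinates are arbitrary; blockages and ties are ignored, and leaf partitions are returned as bare points):

```python
# Method of Mean and Median sketch: split at the median (alternating x/y axis),
# and use each subset's center of mass (mean) as its tapping point.
def mmm(points, horizontal=False):
    if len(points) <= 1:
        return points[0] if points else None  # leaf: a single sink point
    axis = 1 if horizontal else 0
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                        # median split
    left, right = pts[:mid], pts[mid:]
    cx = sum(p[0] for p in points) / len(points)   # center of mass (mean)
    cy = sum(p[1] for p in points) / len(points)
    return {"tap": (cx, cy),
            "children": [mmm(left, not horizontal), mmm(right, not horizontal)]}

tree = mmm([(0, 0), (2, 0), (0, 2), (2, 2)])
print(tree["tap"])  # (1.0, 1.0)
```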
This is another binary-tree-based routing algorithm, in which clock routing is achieved by constructing a binary tree using recursive geometric matching.
Unlike the Method of Mean and Median (MMM) algorithm, which is top-down, this works bottom-up, using the concept of recursive matching.
To construct a clock tree by recursive matching, we determine a minimum-cost geometric matching of the n sink nodes: find a set of n/2 line segments that match the n endpoints with minimum total length.
The center of each segment is called the tapping point; if the clock signal is provided at this point, the signal arrives at the two endpoints of the segment with zero skew.
After each matching step, a balance (tapping) point is found on each matching segment to maintain zero skew to the related sinks. These n/2 tapping points then form the input to the next matching step.
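One bottom-up matching level can be sketched as follows; a greedy nearest-neighbor pairing stands in for the true minimum-cost matching, and the segment midpoint serves as the tapping point (exact zero skew only for equal wire delays):

```python
# One level of bottom-up recursive matching: pair up sinks, place a tapping
# point at each segment's midpoint, and return the n/2 tapping points that
# feed the next level. Greedy pairing is a simplification of min-cost matching.
def match_level(points):
    points = list(points)
    taps = []
    while points:
        a = points.pop()
        # nearest remaining sink to 'a' (squared Euclidean distance)
        b = min(points, key=lambda p: (p[0] - a[0]) ** 2 + (p[1] - a[1]) ** 2)
        points.remove(b)
        taps.append(((a[0] + b[0]) / 2, (a[1] + b[1]) / 2))  # midpoint = tapping point
    return taps

print(match_level([(0, 0), (0, 2), (4, 0), (4, 2)]))  # [(4.0, 1.0), (0.0, 1.0)]
```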
Timing analysis checks whether the circuit meets all its timing requirements; the term refers to either of two methods: static timing analysis (STA) or timing simulation (dynamic timing analysis, DTA).
STA is a technique to verify the timing of a digital design. The analysis is static: the design is analyzed without regard to the data values being applied at the input pins.
The more important aspect of static timing analysis is that the entire design (typically specified in a hardware description language like VHDL or Verilog) is analyzed once, and the required timing checks are performed for all possible timing paths and scenarios of the design. Thus, STA is a complete and exhaustive method for verifying the timing of a design.
In STA, the whole design is divided into a set of timing paths with start points and endpoints; the propagation delay of each path is calculated, checked for violations, and reported.
DTA is a simulation-based timing analysis in which a stimulus is applied to the input signals, the resulting behavior is observed and verified using a Verilog test bench, then time is advanced, a new stimulus is applied, and the new behavior is observed and verified, and so on. It is an approach used to verify the functionality as well as the timing of the design.
This analysis can only verify the portions of the design that get exercised by the stimulus (vectors); verification through timing simulation is only as exhaustive as the test vectors used. Simulating and verifying all the timing paths and timing conditions of a design with 10-100 million gates is very slow, and the timing cannot be verified completely. Thus, it is very difficult to do exhaustive verification through simulation.
Why Static Timing Analysis?
STA is a complete and exhaustive verification of all timing checks of a design.
STA provides a faster and simpler way of checking and analyzing all the timing paths in a design for any timing violations.
As the complexity of ASIC designs keeps increasing (a design may contain 10 to 100 million gates), STA has become a necessity for exhaustively verifying the timing of a design.
In ASIC design, static timing analysis can be performed at many stages of the implementation. STA is first done at the RTL level, where the focus is on verifying the functionality of the design rather than its timing.
Once the design is synthesized from RTL to gate level, STA is used to verify the timing of the design. STA is also used during logic optimization to identify the worst (critical) timing paths, and can be rerun after logic optimization to see whether failing paths remain that need to be optimized, or to identify the new worst paths in the design.
At the start of the physical design (PD) stages like floorplan and placement, the clock is considered ideal, which means the delay from the clock to all the sink pins of the flip-flops is zero (i.e. the clock reaches all flip-flops at the same time). After placement, the clock tree is built in the CTS stage, and STA can be performed to check the timing. During physical design, STA can be performed at each and every stage to identify the worst paths.
In the logic design phase, interconnect is ideal, since there is no physical information about the placement of macros and standard cells yet. At this stage, the interconnect length is estimated with a WLM (wire load model), which provides estimated RC interconnect values based on the fanout of the cell.
In the physical design stage, we have the placement of macros and standard cells, and these cells are connected by interconnect metal traces. The parasitic RC of the metal affects the delay and power dissipation of the design.
Before routing is finalized, in the global route phase, the implementation tool estimates the routing lengths, and these routing estimates are used to determine the resistance and capacitance parasitics needed to calculate the wire delays. Before the routing stages, we do not focus on the effect of coupling. After detailed routing completes, actual RC values are obtained from the extraction tool (used to extract the detailed parasitics from the design), and the effect of coupling is also analyzed.
Limitations of STA:
1. Whether all the flip-flops settle into their required values after applying a synchronous or asynchronous reset cannot be checked using static timing analysis.
2. STA deals only with known values, logic-0 and logic-1 (low and high). If an unknown value X appears in the design, it will not be checked by STA.
3. STA does not check whether a correct clock synchronizer is present wherever there are asynchronous clock domain crossings in the design; we must ensure this separately.
4. If the design has digital and analog blocks, the interface between them is not handled by STA, because STA does not deal with analog blocks. Other verification methodologies are used to ensure the connectivity between these kinds of blocks.
5. STA verifies all timing paths, including paths that need not meet their timing requirements because logic can never actually propagate through them; these are false paths. So we have to give proper timing constraints for false paths and multicycle paths, and only then will the STA QoR be better.
Standard cells:
Most of the complex functionality in a chip is designed using basic blocks such as AND, OR, NAND, NOR, AOI, and OAI cells and flip-flops. These blocks are predesigned and called standard cells.
The functionality and timing of these standard cells are pre-characterized and made available to the designer in the form of standard cell libraries, from which blocks are used according to the requirements.
Timing arcs:
A timing arc is a path from an input to an output of a cell. Every combinational logic cell has multiple timing arcs; basically, an arc represents how much time one input takes to reach the output (e.g. A to Y and B to Y), as in the AND, OR, NAND, and NOR cells shown in the figure. Sequential cells such as flip-flops have timing arcs from the clock to the outputs and from the clock to the data input.
Timing arcs can be further divided into two categories: cell arcs and net arcs.
Cell arcs: an arc between an input pin and an output pin of a cell, i.e. the source pin is an input pin of a cell and the sink pin is the output pin of the same cell. Cell arcs can be further divided into sequential and combinational arcs.
Combinational arcs are between an input and an output pin of a combinational cell or block. Sequential arcs are between the clock pin and either an input or an output pin. Setup and hold timing arcs are between the input data pin and the clock pin of a flip-flop, and are termed timing-check arcs, as they constrain the timing relationship between a set of signals. A sequential delay arc is between the clock pin and the output Q pin of a flip-flop: clk to Q is called a delay arc, while clk to the D input is a timing-check arc in sequential circuits.
Net arcs: arcs between the driver pin of a net and a load pin of the net, i.e. the source pin is the output pin of one cell and the sink pin is the input pin of another cell. Net arcs are always delay arcs.
Unateness:
Each timing arc has a timing sense, i.e. how the output changes for different types of transitions on the input; this is called unateness. Unateness is important for timing, as it specifies how the output responds to a particular input and how much time it takes.
Positive unate: if a rising transition on the input causes the output to rise and a falling transition on the input causes the output to fall, i.e. the transition direction is unchanged from input to output, then the timing arc is positive unate.
Buffer: there are two timing arcs in a buffer. The first is rising input A to rising output Y, and the second is falling input A to falling output Y, i.e. whatever edge is given at the input appears unchanged at the output.
AND gate: There are four timing arcs.
Check for rising edge, from the truth table we can see that
Negative unate: if a rising transition on the input causes the output to fall and a falling transition on the input causes the output to rise, i.e. the transition direction is inverted from input to output, then the timing arc is negative unate.
Inverter: there are two timing arcs in an inverter. The first is rising input A to falling output Y, and the second is falling input A to rising output Y, i.e. whatever edge is given at the input appears inverted at the output.
NAND gate: There are four timing arcs.
Check for the rising edge, from the truth table we can see that
Non-unate: the output transition cannot be determined from the direction of one input alone; it also depends on the state of the other inputs (e.g. XOR and XNOR gates).
Check for the rising edge, from the truth table we can see that
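The unateness of an arc can be derived mechanically from a truth table by toggling one input while holding the others fixed; a small sketch (the helper name and gate lambdas are illustrative):

```python
# Classify the unateness of the arc from input 'pin' to the output of a gate
# described by a truth-table function fn(bits) -> 0/1.
def unateness(fn, n_inputs, pin):
    senses = set()
    for other in range(2 ** (n_inputs - 1)):
        bits = [(other >> i) & 1 for i in range(n_inputs - 1)]
        lo = bits[:pin] + [0] + bits[pin:]   # pin = 0, others fixed
        hi = bits[:pin] + [1] + bits[pin:]   # pin = 1, others fixed
        y0, y1 = fn(lo), fn(hi)
        if y1 > y0:
            senses.add("positive")
        elif y1 < y0:
            senses.add("negative")
    if senses == {"positive"}:
        return "positive unate"
    if senses == {"negative"}:
        return "negative unate"
    return "non-unate"

print(unateness(lambda v: v[0] & v[1], 2, 0))        # AND  -> positive unate
print(unateness(lambda v: 1 - (v[0] & v[1]), 2, 0))  # NAND -> negative unate
print(unateness(lambda v: v[0] ^ v[1], 2, 0))        # XOR  -> non-unate
```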
We know that all cell information related to timing and functionality is present in the library (.lib). The cell delay is a function of its inputs and outputs; all combinations of inputs and outputs, and the rising and falling conditions of the inputs, are defined in the .lib file. If you have to read SDF, the delay is picked from the SDF (Standard Delay Format) file. Net arcs are taken from the parasitic values given in a SPEF (Standard Parasitic Exchange Format) file, or from SDF.
STA-II: TRANSMISSION GATE, D LATCH, DFF, SETUP & HOLD
Before trying to understand setup and hold timing, we should know about the D latch and D flip-flop, both of which are made up of transmission gates and inverters. So in this post I will cover the transmission gate, D latch, D FF, and setup and hold time.
Transmission gate:
The two gate voltages of the PMOS and NMOS are complements of each other.
The effective resistance of the transmission gate is almost constant because of the parallel connection of the PMOS and NMOS.
Transmission Gate
Truth table
Working: when control is high (1), as the truth table shows, both transistors are ON at the same time, and whatever is applied at the input appears at the output.
When control is low (0), both transistors are OFF at the same time, so the input does not reach the output and we get high impedance (Z) at the output.
D latch:
The latch is a level-sensitive device; it is transparent when the clock is high for a positive level-sensitive latch, and transparent when the clock is low for a negative level-sensitive latch.
In a latch, the output (Q) depends only on the level of the clock (Clk), and D controls the output (Q).
When Clk = low (0), T1 is OFF and T2 is ON; new data is stopped from entering the latch and we get only the previously stored data at the output.
The negative level-sensitive latch also consists of two transmission gates and two inverters, and works in exactly the opposite manner to the positive level-sensitive D latch.
Negative-triggered D latch waveform
When Clk = low (0), T1 is ON and T2 is OFF, so the output (Q) directly follows the input (D).
When Clk = high (1), T1 is OFF and T2 is ON; new data is stopped from entering the latch and we get only the previously stored data at the output.
D flip-flop:
A D flip-flop is an edge-triggered device: the output (Q) follows the input (D) only at the active edge of the clock (the rising edge for a positive edge-triggered FF) and retains that value until the next rising edge, i.e. the output does not change between two rising edges; it changes only at the rising edge.
Positive edge-triggered D FF waveform
It is a combination of a negative level-sensitive latch and a positive level-sensitive latch, together giving edge-sensitive behavior.
When Clk = LOW (0), T1 and T4 are ON while T2 and T3 are OFF. New data enters through T1 and is stored up to the input of T2 (the path is D-1-2-3-4, and at node 4 it stops); it cannot pass through the T2 and T3 transmission gates because they are off. This is the operation of the master latch. The slave latch keeps retaining the previously stored value of the output (Q) (the path is 5-6-7-8-5).
When Clk= HIGH (1) T2, T3 are ON and T1, T4 are OFF.
Now master latch did not allow new data to enter into the device because T1 is OFF
and the previously stored data at point 4 is going through the path 4-1-2-5-6-Q and this same
data is reflected at the output and this does not change until the next rising edge and this
same data is also going to the transmission gate T4 (path is 4-1-2-5-6-7-8 and stops because
transmission gate T4 is OFF).
when Clk is high
Again, if Clk goes low, the master latching circuit is enabled and there is no change in the output. Any change at the input is reflected at node 4, and it appears at the output at the next positive edge of the clock.
So we can say that if D changes, the change is reflected at node 4 while the clock is low, and it appears at the output only when Clk goes high.
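The edge-triggered behaviour described above can be sketched behaviourally (not at the transistor level) in a few lines of Python; the class name and sampling scheme here are illustrative, not from the original text.

```python
# Behavioural sketch of a positive edge-triggered D flip flop: Q takes the
# value of D only on a rising clock edge and holds it otherwise.
class DFlipFlop:
    def __init__(self):
        self.q = 0          # stored output
        self._prev_clk = 0  # previous clock sample, used to detect edges

    def tick(self, clk, d):
        # Rising edge: previous clock sample was 0, current sample is 1
        if self._prev_clk == 0 and clk == 1:
            self.q = d      # capture D only at the active edge
        self._prev_clk = clk
        return self.q

ff = DFlipFlop()
# Changes of D while the clock stays high or low are ignored; only edges matter.
outputs = [ff.tick(clk, d) for clk, d in
           [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]]
print(outputs)  # [0, 1, 1, 1, 0]
```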
Setup time:
The minimum time for which the data (D) should be stable at the input before the active edge of the clock arrives.
The data is launched from FF1 and captured at FF2 at the next clock edge. The launched data should be present at the D pin of the capture flop at least a setup time before the next active edge of the clock arrives.
So the total time available for the data to travel from launch to capture flop = one time period (T) - Tsu.
This is the required time for the data to travel from launch to capture flop.
The time the data actually takes to arrive at the D pin of the capture flop is Tcq (clock-to-Q delay of FF1) + Tcomb (combinational delay). This is called the arrival time.
For setup to be met, the required time must be greater than the arrival time: RT > AT.
Setup slack = RT - AT
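The required/arrival-time bookkeeping above can be sketched as follows; the function name and the example delay values are illustrative, not taken from the text.

```python
# Sketch of the setup check: slack = required time (RT) - arrival time (AT).
def setup_slack(T, tsu, tcq, tcomb, tskew=0.0):
    required = T + tskew - tsu   # RT: capture edge (shifted by skew) minus Tsu
    arrival = tcq + tcomb        # AT: clock-to-Q delay plus combinational delay
    return required - arrival    # positive slack means setup is met

# 10 ns clock, 0.2 ns setup, 0.5 ns clock-to-Q, 8 ns of combinational logic
print(round(setup_slack(T=10.0, tsu=0.2, tcq=0.5, tcomb=8.0), 2))  # 1.3
```

Positive skew adds to the required time, which is why it helps setup, as discussed below.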
Now, what is Tsu (setup time)? How do we determine it, how large should the setup time of a flip flop be, and where can we find it?
When the Clk is low, the input (D) follows the path D-1-2-3-4; the time it takes to reach node 4 is what we call the setup time.
What happens if the data (D) is not stable for the setup time before the next active edge of the clock arrives?
When the clock turns high, the data to be launched should already be present at node 4. But since the data is slow, it does not get enough time to travel to node 4 and may still be somewhere between nodes 2 and 3 (say). So we don't know which data will be launched at the rising edge, and the output will be indeterminate, because the data has not yet reached node 4, i.e. the data is late.
If skew is present in the design:
If there is positive skew, we are giving the data more time to arrive at the D pin of the capture FF, so positive skew is good for setup but bad for hold.
Tskew is positive or negative depending on whether the capture clock arrives later or earlier than the launch clock.
Positive skew: the capture clock arrives later than the launch clock.
Negative skew: the capture clock arrives earlier than the launch clock.
Hold time:
The minimum time for which the data (D) should be stable at the input after the active edge of the clock has arrived.
Data is launched from FF1 at 0 s, and this data should be captured at FF2 after one time period (T). The hold check is to make sure that the current data (n) being captured at FF2 is not corrupted by the next data (n+1), which is launched from the launch flop at the same edge.
The same edge goes to both flip flops: at this edge the capture flop FF2 is capturing the current data (n), and at the very same edge the launch flop FF1 is launching the next data (n+1). The whole check ensures that this new data (n+1) does not travel so fast that it corrupts the current data (n) being captured at the capture flop, i.e. the new data must not reach the capture flop earlier than the Thold time of capture flop FF2. Basically, the current data (n) should be held for enough time to be captured reliably; that time is called the hold time.
Hold slack = AT - RT
If the arrival time is small, the data is arriving very fast (early), so a hold violation occurs.
If positive skew is present:
It means the next data (n+1) is launched from the launch flop FF1 while the capture clock has not yet reached the capture flop FF2, so the current data (n) has not been captured yet. This nth data has to be stable at the capture flop for Tskew + Thold time, otherwise data n will be corrupted. So we can say positive skew is bad for hold.
Positive skew
It means the capture clock reaches the capture flop FF2 early, so the data (n) is captured at FF2 before the (n+1) data is launched from the launch flop FF1. The data (n) therefore gets enough time to be held at the input and captured reliably, while the launch flop has not yet launched the (n+1) data. So negative skew is good for hold.
Negative skew
So the condition for hold time with skew is: Tcq + Tcomb > Thold + Tskew.
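The hold check with skew can be sketched the same way as the setup check; the function name and numbers are illustrative, not from the text.

```python
# Sketch of the hold check: the new data's arrival time must exceed the
# required stable window (Thold, extended by positive skew).
def hold_slack(tcq, tcomb, thold, tskew=0.0):
    arrival = tcq + tcomb     # earliest arrival of the new (n+1) data
    required = thold + tskew  # data must stay stable this long after the edge
    return arrival - required # positive slack means hold is met

print(round(hold_slack(tcq=0.3, tcomb=0.5, thold=0.2), 3))             # no skew: met
print(round(hold_slack(tcq=0.3, tcomb=0.5, thold=0.2, tskew=0.8), 3))  # +ve skew: violated
```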
Hold time (Thold) of flip flop:
For the working of this structure, please read the section on the DFF using transmission gates.
When the clock turns from low to high, the T1 and T4 transmission gates turn off and stop new data (D) from entering the device. But transmission gate T1 does not turn off immediately: the clock (Clk) passes through many buffers and inverters before reaching T1, so the transmission gate takes some time to turn OFF. While it is turning from ON to OFF, new data should not come in and disturb the current data being captured.
Basically, new data should not enter the device during that window, so the hold time is the time the transmission gate takes to turn off completely after the clock edge has arrived.
If there is any combinational delay in the data path, the hold requirement will change.
STA - III: Global setup and hold time. Can the setup and hold time of a FF be negative?
Data path delay (Tcomb) = the time taken for the data to reach the input of the transmission gate, i.e. the D pin of the FF.
Clock path delay (Tclkint) = the time taken for the clock to travel from the clock generation point to the clk pin of the transmission gate, i.e. the Clk pin of the FF.
Both setup and hold time are measured with respect to the active edge of the clock. For a pure flip flop (containing no extra gate delays), the setup and hold times are always positive numbers and are fixed once the chip is fabricated; they can't be changed. But if we put some glue logic around the FF in the data path and the clock path, the setup and hold requirements change; we call these the global setup time (Tsunew) and global hold time (Tholdnew), and these components are available as part of the standard cell library.
Setup and hold time can also be negative, depending on where we measure them. If we measure setup and hold at the component level, they may be negative.
First scenario: some combinational logic is present between the data pin and transmission gate T1 (data path), i.e. before the D input of the flip flop.
For setup:
Some delay is added by the combinational logic, so the data takes more time to reach node 4 from the D pin of the FF. Since the setup time is the time for the data to reach node 4, the setup requirement increases because of the combinational logic.
Let us take the delay of the comb logic (Tcomb) as 1ns and the original setup time (Tsu) as 3ns.
Then the new setup time (Tsunew) = Tsu + Tcomb = 3ns + 1ns = 4ns.
NOTE: IF COMB DELAY IS PRESENT IN THE DATA PATH THEN THE SETUP REQUIREMENT IS
INCREASED.
For hold:
Some comb logic now sits between the data input and the D pin of transmission gate T1. While transmission gate T1 is turning OFF after the clock edge, any new data arriving at the input of the FF has to travel through the comb-logic delay before it can get inside and disturb the original data being captured. So if delay is added in the data path, the hold requirement decreases: the new data is held back by the comb logic until after the transmission gate T1 has turned off.
Let us take the original hold time of the FF (Thold) as 2ns: it takes 2ns for the clock to travel from the generation point to the clock pin of the FF and for the transmission gate to turn off, so during that time new data should not enter the device, and the data should be stable at the input for 2ns after the rising edge of the clock has arrived.
Then New hold time (Tholdnew) = Thold - Tcomb = 2ns - 1 ns =1ns (positive hold time)
If Tcomb = 2ns, Tholdnew = Thold - Tcomb = 2ns - 2ns = 0ns (zero hold time).
If Tcomb = 3ns, Tholdnew = Thold - Tcomb = 2ns - 3ns = -1ns (negative hold time).
Note: IF COMB DELAY IS PRESENT IN THE DATA PATH THEN THE HOLD REQUIREMENT IS
DECREASED.
Summary:
If comb logic is present between the data pin and the D pin of the flop, then
Tsunew = Tsu + Tcomb and Tholdnew = Thold - Tcomb.
Second scenario: clock delay is present in the clock path (the clock passes through some buffers and inverters to reach T1).
For setup:
It means the active edge of the clock takes more time to reach the transmission gate from the clock generation point; this is called internal clock delay. So what happens if this delay is present?
We know that as soon as the active edge of the clock arrives, transmission gate T1 turns off and stops data from entering the device. We also know that, because of the internal clock delay, the clock does not reach T1 instantly even though the active edge has arrived at the generation point; T1 is still ON, and the data still has some time to enter the device and reach node 4. So if internal clock delay is present in the design, the setup requirement decreases, because the data gets some extra time to enter the device and stabilize at node 4 even after the clock has arrived at the reference point of the flip flop.
Let us take internal clock delay (Tclkint) is 2ns and Tsu is 3ns.
Then new setup time (Tsunew) = Tsu - Tclkint = 3ns -2ns = 1ns (positive setup time).
If the internal clock delay (Tclkint) is 3ns, then Tsunew = Tsu - Tclkint = 3ns - 3ns = 0ns (zero setup time).
If the internal clock delay (Tclkint) is 4ns, then Tsunew = Tsu - Tclkint = 3ns - 4ns = -1ns (negative setup time).
It means the data can arrive 1ns later than the active edge of the clock: since the clock internally takes more time to reach transmission gate T1, data arriving 1ns late can still make it to node 4 and stabilize.
For hold:
We know that as soon as the active edge of the clock arrives at T1, the transmission gate turns off, stops data from entering the device, and makes the data wait at the D pin of T1 until the clock goes low again.
So if internal clock delay is present in the design, the hold requirement increases, because the data has to wait longer at the D pin of T1 before the gate actually turns off.
Let us take the internal clock delay (Tclkint) as 2ns and the hold time (Thold) of the FF as 2ns. Then the new hold time (Tholdnew) = Thold + Tclkint = 2ns + 2ns = 4ns.
Summary:
Tsunew = Tsu - Tclkint and Tholdnew = Thold + Tclkint.
Third scenario: if some comb logic is sitting b/w Data pin and the D pin of FF and
clock internal delay is present b/w clock generation point to the clk pin of FF.
New setup time (Tsunew) = Tsu + Tcomb - Tclkint
New hold time (Tholdnew) = Thold - Tcomb + Tclkint
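A small sketch of the global setup/hold arithmetic, using the example values from the scenarios above (Tsu = 3ns, Thold = 2ns, Tcomb = 1ns, Tclkint = 2ns):

```python
# Global (effective) setup and hold when glue logic surrounds the flop.
# Combinational delay in the data path and internal clock delay shift the
# requirements in opposite directions.
def global_setup(tsu, tcomb, tclkint):
    return tsu + tcomb - tclkint    # Tsunew = Tsu + Tcomb - Tclkint

def global_hold(thold, tcomb, tclkint):
    return thold - tcomb + tclkint  # Tholdnew = Thold - Tcomb + Tclkint

print(global_setup(tsu=3.0, tcomb=1.0, tclkint=2.0))   # 2.0
print(global_hold(thold=2.0, tcomb=1.0, tclkint=2.0))  # 3.0
```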
When the chip is fabricated, the components on the chip have different propagation delays because of PVT (process, voltage, temperature) conditions, so every cell has three types of delay: max, min, and typical.
For setup we consider the worst (max) data path and the min clock path.
For hold we consider the best (min) data path and the max clock path.
Setup and hold values cannot be negative simultaneously, but individually they may be negative. For the setup and hold checks to be consistent, the sum of the setup and hold values should be positive.
Where do the setup and hold values come from? The library file.
So the next post covers how setup and hold are defined for rise and fall constraints in the library file (.lib).
How the setup and hold checks are defined in the library?
For synchronous inputs such as D, SI, SE there are the following timing arcs:
Setup check arc (rising and falling)
Hold check arc (rising and falling)
For asynchronous inputs such as CDN there are the following timing arcs:
Recovery check arc
Removal check arc
The setup and hold timing checks are needed to verify the proper propagation of data through sequential circuits. These checks verify that the data input (D) is unambiguous at the active edge of the clock, so that the proper data is latched at that edge; in other words, they verify that the data input is stable before and after the clock edge.
The setup and hold timing constraints for the synchronous pin of a sequential cell are described in terms of two-dimensional tables, as shown in the figure.
The setup and hold timing constraints are on the input pin D with respect to the rising (active) edge of the clock CK of the FF.
The two-dimensional models are indexed by the transition times at the constrained_pin (data pin D) and the related_pin (clock pin CK): index_1 gives the data (D) transition and index_2 gives the clock (CK) transition. Thus, with a D-pin rise transition of 0.4ns and a CK-pin rise transition of 0.84ns, the setup constraint for the rising edge of the D pin is 0.112ns (this value is read from the rise_constraint table).
For the falling edge of the D pin, the setup constraint is read from the fall_constraint table of the setup check.
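A hedged sketch of reading such a rise_constraint table: the index values and table entries below are illustrative placeholders, except for the (0.4ns, 0.84ns) -> 0.112ns point mentioned in the text.

```python
# Illustrative rise_constraint table, indexed by data slew and clock slew.
index_1 = [0.1, 0.4, 1.5]    # data (D) transition times, ns
index_2 = [0.1, 0.84, 2.0]   # clock (CK) transition times, ns
rise_constraint = [          # rows follow index_1, columns follow index_2
    [0.060, 0.080, 0.110],
    [0.090, 0.112, 0.150],
    [0.140, 0.180, 0.250],
]

def lookup(data_slew, clock_slew):
    # Exact-index lookup only, for brevity (real tools interpolate).
    i = index_1.index(data_slew)
    j = index_2.index(clock_slew)
    return rise_constraint[i][j]

print(lookup(0.4, 0.84))  # 0.112
```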
In the last post I mentioned the reason for negative values in the library. We can treat a flip flop as a black box: we don't know what is inside. Sometimes the data reaches the internal latch point earlier than the clock does, and sometimes the clock gets there earlier than the data.
Notice that some of the hold values in the fig are negative. This normally happens when the path from the data pin of the flop to the internal latch point is longer than the corresponding path for the clock. Thus a negative hold timing check implies that the data pin (D) of the flop can change before the clock pin (CK) and still meet the hold check.
The setup values of a FF can also be negative: at the pins of the flip flop, the data can change after the clock pin and still meet the setup check.
Can both setup and hold be negative? This question relates to the library file.
No. For the setup and hold checks to be consistent and proper, the sum of the setup and hold values should be positive. If the setup (or hold) check contains negative values, the corresponding hold (or setup) value should be sufficiently positive so that setup plus hold is a positive quantity; the data is then properly stable within the setup-hold window.
The setup time plus the hold time is the width of the region where the data signal is required to be stable.
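The consistency rule can be written as a one-line check (the values are illustrative):

```python
# Setup or hold may individually be negative, but their sum -- the width of
# the stable-data window -- must be positive for the checks to be consistent.
def checks_consistent(tsu, thold):
    return (tsu + thold) > 0

print(checks_consistent(0.2, -0.1))   # True: 0.1 ns wide stability window
print(checks_consistent(-0.3, 0.1))   # False: no valid stability window
```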
For flip flops, it is helpful to have a negative hold time on scan data input pins. This gives flexibility in terms of clock skew and can eliminate the need for almost all buffer insertion for fixing hold violations in scan mode (scan mode is the one in which flip flops are tied serially, forming a scan chain: the output of a flip flop is typically connected to the scan data input pin of the next flip flop in the series; these connections are used for testability).
Time Borrowing concept in STA
Difference between latches and flip flop.
Flip flop (Register) vs. Latch:
Flip flop: edge-triggered, synchronous device. Latch: level-sensitive, asynchronous device.
Flip flops are of two types depending on clock polarity, i.e. positive edge-triggered and negative edge-triggered. Latches are of two types depending on enable polarity, i.e. positive level latch and negative level latch.
Flip flops are made up of latches (a -ve latch and a +ve latch). Latches are made up of logic gates.
A FF changes its state only at the active edge of the clock (positive or negative, depending on the requirement). Latches are transparent: when enabled, the output changes immediately if the input changes.
Flip-flop based designs are more complex to design because of the clock signal, and maintaining clock skew is a big problem. Latch-based designs are simpler (not based on a clock edge) and have a smaller die size.
Latch-based design is also more variation tolerant, giving a better yield than flip-flop based design.
The operation of a FF-based design is slower than a latch-based design due to the clock signal. Latches are faster because they don't wait for clock edges.
There are two types of storage sequential circuits: flip flops (registers) and latches.
Data launches on one rising edge of the clock and must set up before the next rising edge of the clock.
If the combinational-logic delay between two flip flops is very large, the data arrives late at the capture flip flop and may go metastable.
If the combinational-logic delay between two FFs is small, the data arrives early, and time is wasted.
Data can pass through a latch while the latch is transparent. If large-delay comb logic is present in the design, a path can borrow time from the next cycle. So latch-based designs are used where we want high performance from the design.
Time borrowing applies only to latch-based design; cycle stealing applies to flip-flop based design.
It is a property of the latch: a path ending at a latch can borrow time from the next path in the pipeline such that the overall time of the two paths remains the same. STA applies the concept of time borrowing to latch-based designs.
Whatever data is launched from Flip Flop1 at 0ns should reach Flip Flop2 at the next active edge, i.e. 10ns (the ideal case, where setup time, hold time, skew, and clock delay are all zero). If the data reaches Flip Flop2 after 10ns, it will not capture the correct data. Similarly, if we launch data from Flip Flop2 at 10ns, it should reach Flip Flop3 at 20ns (the next active clock edge) without any violation.
What if the combinational logic delay is large (greater than time period 10ns):
If the comb delay is greater than 10ns, Flip Flop2 does not capture the correct data, because the data reaches Flip Flop2 after 12ns while the flip flop sees only one positive edge between 0ns and 12ns. This is a setup violation, and to avoid it we can use a positive level-sensitive latch in between.
What is a latch: whenever the clock enables it, the latch is transparent for half the clock period, depending on its polarity.
Now Flip Flop2 is replaced with a latch whose gate is driven by the same clock line.
What is the benefit of using this latch? And what happens if the data reaches the latch after 10ns?
If it reaches the latch input before 10ns, the data waits at the latch's D pin. This is the same case as if we were using a FF in place of the latch.
Say the data arrives after 12ns. With a flip flop this is a problem, but with a latch it is not, because the latch is transparent from 10ns to 15ns and can still receive the data; we receive it by borrowing time from the next cycle. This means the latch provides an advantage of 2ns over the flip flop here. The maximum time we can borrow from the latch is 5ns. But the borrowed time is taken away from the latch-to-Flip Flop3 path: Flip Flop3 must get its data at 20ns, so the latch must send the data before 20ns.
So we borrow time from the latch because it is open from 10ns to 15ns, and this time is subtracted from the next stage of logic.
The data that comes out of the latch at 12ns is sampled at Flip Flop3 at 20ns. Thus the path from latch2 to Flip Flop3 gets only 8ns.
In the circuit that had all flops, the second path had 10ns; in this circuit it gets 2ns less.
Note that the path from Flip Flop1 to Flip Flop3 still has 20ns in total; that has not changed, only the distribution of time has.
So in a flip-flop based design, the combinational delay should not be longer than the clock period, apart from some exceptions like multicycle paths and false paths in static timing analysis.
In a latch-based design, a larger combinational delay can be compensated by a shorter combinational-delay path in a later logic stage. So for high-performance circuits we mostly use latch-based design.
In simple terms, time borrowing is the technique in which a longer path borrows time from the next path in the subsequent logic.
Time borrowing typically affects the setup check, since borrowing increases the data arrival time. It does not affect the hold check, because hold is concerned with the earliest data arrival, and borrowing only makes the data arrive later.
Time borrowing examples:
Example 1: there are two flip flops with two blocks of combinational logic between them. The clock period is 5ns.
A setup violation is present in this scenario, because the data reaches FF1 after 7ns and the clock period is only 5ns. If we increase the clock period beyond 7ns, timing can be met, but increasing the clock period hurts the performance of the design.
This timing violation can be resolved with the same 5ns clock period using the time borrowing concept. We replace flip flop1 with a positive level-sensitive latch. The latch opens at the same time the flip flop would have captured, and stays transparent for half the clock period, i.e. from 0ns to 2.5ns of that window.
Ideally the data from path1 should have arrived at 0ns of this window, but it has not. Path1 borrows 2ns from the latch; if the latch were not present, there would have been a timing violation at 0ns. The remaining 0.5ns of the transparent window is available to path2.
So path1 has up to 2.5ns of extra time to borrow from the next cycle. Since the latch closes 2.5ns after opening, there is no timing violation for path1: it arrives 0.5ns before the latch closes.
The output of the latch is immediately available to comb path2; from the fig, path2 starts where path1 left off. Path2 could have used up to 3ns (0.5ns left over from the previous stage + 2.5ns of the half clock period of the current cycle), but the given delay of path2 is only 1ns.
Valid data is available at capture flip flop2 at 3ns, and the rising edge of capture flip flop2 happens at 5ns, so flip flop2 has an extra 2ns: this is positive slack.
We borrowed time from the next cycle and used slack from the previous cycle. Timing is met without changing the clock, just by replacing a flip flop with a latch. Time borrowing is used only for latch-based designs.
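The arithmetic of Example 1 can be sketched as follows, using the example's numbers (5ns clock, path1 delay of 7ns, latch transparent for half the cycle); the function name is illustrative.

```python
# Time borrowing at a level-sensitive latch: data may arrive up to half a
# clock period after the nominal capture edge, while the latch is transparent.
def latch_borrow(T, path_delay, launch_edge=0.0):
    capture_edge = launch_edge + T        # where a flip flop would capture
    latch_close = capture_edge + T / 2.0  # latch stays transparent T/2 longer
    arrival = launch_edge + path_delay
    borrowed = max(0.0, arrival - capture_edge)  # time taken from next cycle
    ok = arrival <= latch_close                  # must beat the latch closing
    return borrowed, ok

borrowed, ok = latch_borrow(T=5.0, path_delay=7.0)
print(borrowed, ok)  # 2.0 True -> 2 ns borrowed, arriving before the latch closes
```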
Example 2: there are four positive level-sensitive latches with four blocks of combinational logic between them. Latch1 and latch3 are controlled by CK1; latch2 and latch4 are controlled by CK2.
The relationship between CK1 and CK2 is shown in the fig.
For simplicity, all four latches are assumed to have 0ns propagation delay and 0ns setup and hold times.
Similarly, path3 has a delay of 8ns and borrows 3ns from the next stage. In both cases the time borrowed from the next stage is not fully used, because path2 and path4 have delays of less than 5ns, half of the clock period.
Path4 has a delay of 2ns, which is less than 5ns (half the clock period), so NO TIME BORROWING is required from the succeeding stage. This is the same case as if there had been a flip flop instead of a latch.
NOTE: a flip flop in place of latch4 stops, at its launch edge, the time borrowing we have seen from latch1 to latch3.
If the circuit were made of 4 flip flops instead of 4 latches, the 4-stage flip-flop based design would consume a total of 28ns; using latches, the delay is reduced to 20ns. This is the advantage of latches over flip flops.
The high-speed CMOS clocking design styles are time borrowing in latches and time stealing in edge-triggered flip flops.
In an edge-triggered system, the operation time of each pipeline partition is never equal to the others', and the longest logic delay between two registers determines the maximum clock frequency of the system.
In any circuit, the question is how to fit more combinational logic within each logic partition (between flip flops): every pipeline partition wants more time than it is allocated. Time borrowing and time stealing, which allow one stage to pass slack time from fast logic to neighboring slower logic, remedy these problems.
Time borrowing, by definition, permits logic to automatically use slack time (borrow time) from a previous cycle. It refers to the situation where a logic partition in a pipeline structure (flip flops with combinational delays between them) uses leftover time from the previous stage; this passing of slack from one cycle to the next is automatic, without any additional circuitry or clock adjustments.
This transparent nature allows latches to be used for high-performance designs since they
offer more flexibility than edge-triggered circuits in terms of the minimum clock period
achievable – a result of time borrowing.
This method of borrowing time from the previous stage, allowing combinational paths to have a delay of more than the clock period, is referred to variously as time borrowing or cycle borrowing.
Time borrowing happens only with latches, because latches are level sensitive. An edge-triggered structure would require a clock arrival time adjustment at the circuit, which violates the definition of time borrowing. So time borrowing is ideally suited to static logic with latches in a two-phase clocking system (non-edge-triggered).
The advantage of slack (time) borrowing is that it allows logic between cycle boundaries to
use more than one clock cycle while satisfying the cycle time constraint. Mostly time
borrowing is only allowed in exceptional cases, which are carefully verified individually.
Time borrowing has been traditionally used to reduce the effect of clock skew and jitter on
the maximum clock frequency and it also has been widely used in critical delay paths of high-
speed circuits especially in high-speed adder designs.
Since time borrowing can automatically average out delay variations along a path caused by
process variation and inaccuracies, time borrowing is used to alleviate the mean maximum
clock frequency degradation caused by within-die parameter variations.
Time Stealing:
Time stealing gains time by taking it from the next cycle. It happens when a logic partition needs additional time for evaluation but cannot use leftover time from previous cycles or phases, as in time borrowing.
Therefore the most important difference between time stealing and time borrowing is that time stealing does not automatically use the leftover time, whereas time borrowing does.
Time stealing has to be forced to take evaluation time from the subsequent cycle, leaving less time for that cycle or phase; this is achieved by adjusting the clock arrival time. Since the additional time is obtained by adjusting the clock arrival time, time stealing is usually used in edge-triggered logic.
Time stealing can be used when a particular logic partition needs additional time. The additional time required should be deterministic at design time, so that we can adjust the clock phase of the capture flip flop (FF2) such that the data arrival time at the capture edge of FF2 does not violate setup.
So if dynamic logic needs more time to evaluate, it has to increase its phase time by widening the active clock time, and this can only be done by shifting the rising edge earlier or the falling edge later.
This means that instead of the symmetric two-phase system with a 50% duty cycle shown for time borrowing, time stealing has to use an asymmetric duty cycle to gain the additional time. Time stealing is used to reduce leakage power.
In the fig, there are 4 flip flops and 4 combinational logic blocks, where path1, path2, path3, and path4 have delays of 12ns, 1ns, 8ns, and 1ns respectively.
The combinational delay in the first stage, between FF1 and FF2, is path1 with a delay of 12ns, which is higher than the clock period (10ns). On the other hand, positive slack is available in the second stage, between FF2 and FF3, because of the smaller combinational delay of path2 (1ns). Here we were able to support the larger combinational path delay (path1) without increasing the clock period, by controlling the clock arrival time at FF2.
As shown in the fig, path1 stole 4ns (the CK2 offset, not the time borrowed by path1) from path2's available 10ns, leaving path2 with 6ns. Since path2 needs only 1ns, there is enough time for Flip Flop3 to capture data at 20ns.
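The stage-slack arithmetic of this time-stealing example can be sketched with the figure's numbers (10ns clock, 4ns CK2 offset, path1 = 12ns, path2 = 1ns); the function name is illustrative.

```python
# Time stealing: shifting the capture clock of the middle flop by "offset"
# gives the first stage extra time and leaves the second stage with less.
def stage_slacks(T, offset, path1, path2):
    slack1 = (T + offset) - path1  # stage 1 gets T + offset
    slack2 = (T - offset) - path2  # stage 2 is left with T - offset
    return slack1, slack2

print(stage_slacks(T=10.0, offset=4.0, path1=12.0, path2=1.0))  # (2.0, 5.0)
```

Both slacks stay positive, so the 12ns path is accommodated without stretching the 10ns clock period.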
If we replace latch1 with a negative edge-triggered flip flop as shown in the fig, path1 will still have that extra 2.5ns (half of the clock cycle) to borrow from the next clock cycle, just as with the latch. So why do we not use flip flops in place of latches?
On the input side, a negative edge-triggered flip flop behaves in just the same way as a latch. But the transparent nature of the latch helps the succeeding stage use the positive slack (leftover, if any) of the current stage, or pass the current stage's negative slack on to the succeeding stage.
By comparing the waveforms of both, we can understand the benefits of the latch over the negative edge-triggered flop.
In the case of the positive level-sensitive latch, data appears at the input of path2 (the output of latch1) at t = 2ns because of the transparent nature of the latch (with the negative-edge flop, data appears at the input of path2 at t = 2.5ns). With path2 consuming 1ns, data arrives at the output of path2 at 3ns, so in this case we have a positive slack of 2ns (5 - 3)ns.
Now let's assume path2 requires 2.7ns: the data arrival time at the second positive clock edge is 2ns + 2.7ns = 4.7ns. The second positive clock edge occurs at t = 5ns, so in this case we have a positive slack of 0.3ns (5 - 4.7)ns.
In the fig, in the case of the negative-edge flop, data appears at the input of path2 (the output of the negative edge-triggered FF) at t = 2.5ns. With path2 consuming 1ns, data arrives at the output of path2 at 3.5ns (2.5ns + 1ns), and we have a positive slack of 1.5ns (5ns - 3.5ns).
Now assume a situation where path2 requires 2.7ns instead of 1ns. The data then arrives at 2.5ns + 2.7ns = 5.2ns, so there is a timing violation at the clock edge at t = 5ns, with a negative slack of 0.2ns as shown in the fig below.
So, compared to the positive level-sensitive latch, with the negative edge-triggered flop the positive slack of 0.5ns (between t = 2ns and t = 2.5ns) available in path1 is wasted because of the nature of flops. The transparent nature of the latch makes use of the prior cycle's positive slack of 0.5ns in the current cycle.
Transistors within a gate take a finite time to switch. This means that a change in the input of
a gate takes a finite time to cause a change in the output.
Cnet: net capacitance.
Cell timing models are used to provide accurate timing for the various instances of cells present in the design. The timing model is normally obtained from detailed circuit simulation of the cell, to model the actual scenario of the cell's operation.
Let's consider the timing arc of a simple inverter. For the inverter, a rising transition at the input causes a falling transition at the output, and vice versa. So two types of delay are characterized for the cell: the output rise delay and the output fall delay.
These delays are measured with respect to the threshold value defined in the library, which is typically 50% of Vdd. Thus delay is measured from the input crossing its threshold to the output crossing its threshold.
The delay of the timing arc through the inverter depends on two factors: the output load, i.e. the capacitive load at the output, and the transition time at the input. The delay value is directly proportional to the load capacitance: the larger the load capacitance, the larger the delay.
In most cases the delay increases with increasing input transition time, but in some cases the delay through the cell may be non-monotonic with respect to the input transition time: a larger input transition time may produce a smaller delay, especially if the output is heavily loaded.
The transition time at the output is directly proportional to the output capacitance, i.e. the output transition time increases with the output load. The transition at the input can improve or degrade the transition at the output, depending on the cell type and its output load; the figure shows two cases where the transition time at the output of a cell improves or deteriorates depending on the load at the output.
In the linear delay model, the delay and the output transition time of cells are represented as linear functions of two parameters: the input transition time and the output load capacitance.
D = D0 + D1 * S + D2 * C
where D0, D1, and D2 are constants, S is the input transition time, and C is the output load capacitance.
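As a sketch, the linear model above can be evaluated directly; the coefficient values D0, D1, D2 below are invented for illustration and are not taken from any real library:

```python
# Linear delay model: D = D0 + D1*S + D2*C.
# d0, d1, d2 are hypothetical coefficients, not from a real .lib file.
def linear_delay(s_in, c_out, d0=0.05, d1=0.20, d2=1.10):
    """s_in: input transition time (ns); c_out: output load (pF)."""
    return d0 + d1 * s_in + d2 * c_out

print(linear_delay(0.1, 0.05))  # light load, fast input edge
print(linear_delay(0.5, 0.20))  # heavier load, slower edge: larger delay
```

The delay grows linearly in both slew and load, which is exactly the limitation that motivates the non-linear table models described next.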
The linear delay models are not accurate over the full range of input transition time and output capacitance for deep submicron technologies, so presently most cell libraries use more complex models such as the Non-Linear Delay Model (NLDM) and the CCS model.
Non-linear Delay Model (NLDM):
Nowadays, look-up tables (LUTs) based on the non-linear delay model (NLDM) or the composite current source (CCS) timing model are widely used for static timing analysis (STA). In these LUTs, characterization data such as cell delay and transition time is indexed by a fixed number of input transition time and load capacitance values.
A Synopsys Liberty (.lib) format file, also known as a timing library file (lib file), contains several kinds of LUTs for computing cell delay. Usually the lib file is provided by the foundry; however, when designers want their own cell library or want to change some process parameters, they need to generate the lib file themselves.
To do this work, the designer should decide the characteristic parameters of the LUT, input transition time and output load, and then obtain the timing information through HSPICE simulation or other EDA tools.
After completing the characterization of the whole library, the designer needs to analyze the accuracy of the timing file for future work, as no EDA tool can analyze it automatically. The accuracy of timing analysis depends on the accuracy of the delay model used in the cell characterization.
The LUT includes only a fixed number of points, for example a 5x5 or 7x7 grid of (input transition time, load) pairs; the delays for other pairs are obtained using linear interpolation.
As a result, when making a standard cell library of one's own, choosing appropriate characteristic values of input transition time and load is the key to the accuracy of the timing library file.
NLDM is a highly accurate timing model, as it is derived from SPICE characterizations. Most cell libraries use table models to specify the delay and timing checks for the various timing arcs of a cell; this table model is referred to as an NLDM and is used for calculating the delay, output slew, or other timing checks.
The table gives the delay through the cell for various combinations of input transition time at the cell input pin and total output capacitance at the cell output. An NLDM delay model is represented in two-dimensional form, with the two independent variables being the input transition time and the output load capacitance, and the table entries denoting the delay.
Some newer timing libraries also provide advanced timing models based on current sources
such as CCS, ECSM, etc. for deep submicron technologies.
Example of a delay table for an inverter
The snippet below shows the rise and fall delay models for the inverter timing arc from pin INP1 to OUT, as well as the max transition time allowed at pin OUT. There are separate timing models for the rise and fall delays at the output pin, denoted cell_rise and cell_fall respectively.
pin (OUT) {
max_transition: 1.0;
timing () {
related_pin: "INP1";
timing_sense: negative_unate;
cell_rise (delay_template_3x3) {
...
}
cell_fall (delay_template_3x3) {
...
}
}
}
In the lookup table template, two variables are specified: the first variable is the input transition time and the second is the output load capacitance. The order can also be reversed (the first variable can be the output capacitance), but a designer is usually consistent across all the templates in the library. This form of representing delays in a lookup table is called the non-linear delay model because of the non-linear variation of delay with input transition time and load capacitance.
The NLDM models are used not only for the output delay but also for the output transition time, which is likewise characterized by the input transition time and the output load. There are separate two-dimensional tables for computing the output rise and fall transition times of a cell, denoted rise_transition and fall_transition at the output.
pin (OUT) {
max_transition: 1.0;
timing () {
related_pin: "INP";
timing_sense: negative_unate;
rise_transition (delay_template_3x3) {
...
}
fall_transition (delay_template_3x3) {
...
}
}
}
An inverter cell with a Non-Linear Delay model has the following tables:
Rise delay
Fall delay
Rise transition
Fall transition
This information is specified in the timing_sense field of the timing arc. For the inverter, the timing arc is negative_unate, i.e. the output transition direction is opposite (negative) to the input transition direction. Thus a cell_rise table lookup corresponds to a falling transition at the input pin, and a cell_fall table lookup corresponds to a rising transition at the input pin.
Case 1: When the input transition and output load values match the lookup table index values.
Based upon the delay table, an input fall transition time of 0.3 ns and an output load of 0.16 pF correspond to an inverter rise delay of 0.1018 ns, because a falling transition at the input makes the inverter output rise.
Case 2: When the input transition and output load values do not match the table index values.
If any point on a timing path has a value beyond the LUT's range, the STA tool extrapolates the desired value from the existing LUT entries. However, linear extrapolation does not guarantee the accuracy of the extrapolated values.
This example covers the case where the lookup point does not correspond to any entry available in the table. In this case, two-dimensional interpolation is used to obtain the resulting delay value; the two nearest values are chosen in each dimension for the table interpolation.
Consider the lookup in the fall transition table for an input transition of 0.14 ns and an output capacitance of 1.15 pF. The relevant part of the fall transition table used for the two-dimensional interpolation is reproduced below.
fall_transition (delay_template_3x3)
values (\
". . . 0.1937, 0.7280" \
". . . 0.2327, 0.7676"
...
The input transition time of 0.14 ns lies between the index values 0.1 ns and 0.3 ns, so we take x1 = 0.1 ns, x2 = 0.3 ns, and x0 = 0.14 ns. Similarly, the output load of 1.15 pF lies between 0.35 and 1.43, so we take y1 = 0.35, y2 = 1.43, and y0 = 1.15. The corresponding table entries are T11 = 0.1937, T12 = 0.7280, T21 = 0.2327, and T22 = 0.7676.
The table value T00 at the lookup point (x0, y0) is obtained by interpolation using the formula below:
x01 = (x0 - x1) / (x2 - x1) = (0.14 - 0.1) / (0.3 - 0.1) = 0.04/0.2 = 0.2
x20 = (x2 - x0) / (x2 - x1) = (0.3 - 0.14) / (0.3 - 0.1) = 0.16/0.2 = 0.8
y01 = (y0 - y1) / (y2 - y1) = (1.15 - 0.35) / (1.43 - 0.35) = 0.8/1.08 = 0.7407
y20 = (y2 - y0) / (y2 - y1) = (1.43 - 1.15) / (1.43 - 0.35) = 0.28/1.08 = 0.2593
T00 = x20 * y20 * T11 + x20 * y01 * T12 + x01 * y20 * T21 + x01 * y01 * T22
T00 = 0.8 * 0.2593 * 0.1937 + 0.8 * 0.7407 * 0.7280 + 0.2 * 0.2593 * 0.2327 + 0.2 * 0.7407 * 0.7676
≈ 0.5974
The same formula is valid for extrapolation, i.e. when the lookup point (x0, y0) lies outside the characterized range of index1 and index2.
For example, for a lookup with 0.05 for index1 and 1.7 for index2, the fall transition value is obtained as:
x01 = (x0 - x1) / (x2 - x1) = (0.05 - 0.1) / (0.3 - 0.1) = -0.05/0.2 = -0.25
x20 = (x2 - x0) / (x2 - x1) = (0.3 - 0.05) / (0.3 - 0.1) = 0.25/0.2 = 1.25
y01 = (y0 - y1) / (y2 - y1) = (1.7 - 0.35) / (1.43 - 0.35) = 1.35/1.08 = 1.25
y20 = (y2 - y0) / (y2 - y1) = (1.43 - 1.7) / (1.43 - 0.35) = -0.27/1.08 = -0.25
T00 = x20 * y20 * T11 + x20 * y01 * T12 + x01 * y20 * T21 + x01 * y01 * T22
= 1.25 * (-0.25) * 0.1937 + 1.25 * 1.25 * 0.7280 + (-0.25) * (-0.25) * 0.2327 + (-0.25) * 1.25 * 0.7676
= 0.8516
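The two calculations above can be reproduced with a small helper; this is a minimal sketch using only the 2x2 window of table values quoted in this example:

```python
def lut_lookup(x0, y0, x1, x2, y1, y2, t11, t12, t21, t22):
    """Bilinear interpolation/extrapolation on a 2x2 LUT window.
    x: input transition time, y: output load capacitance."""
    x01 = (x0 - x1) / (x2 - x1)
    x20 = (x2 - x0) / (x2 - x1)
    y01 = (y0 - y1) / (y2 - y1)
    y20 = (y2 - y0) / (y2 - y1)
    return (x20 * y20 * t11 + x20 * y01 * t12 +
            x01 * y20 * t21 + x01 * y01 * t22)

# The 2x2 window from the fall_transition table above.
window = dict(x1=0.1, x2=0.3, y1=0.35, y2=1.43,
              t11=0.1937, t12=0.7280, t21=0.2327, t22=0.7676)

print(round(lut_lookup(0.14, 1.15, **window), 4))  # interpolation -> 0.5974
print(round(lut_lookup(0.05, 1.70, **window), 4))  # extrapolation -> 0.8516
```

For the extrapolation case, the same expression is used; the normalized coordinates simply fall outside [0, 1], which is why some of the weights become negative.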
Ques1: How do you interpolate and extrapolate delay values from the library?
Ques2: What are the standard cell characterization concepts in the library?
It is the process of analyzing a circuit using static and dynamic methods to generate models suitable for chip implementation flows.
No digital chip is possible without cell models. Characterization is necessary for the use of standard cells, and it is performed on extracted netlists.
The non-linear delay model (NLDM) is the most common delay model for the characterization of standard cells. As we move to deep sub-micron technology, more accurate timing models such as CCS (composite current source) are required.
Every digital chip implementation (RTL-to-GDSII) flow requires cell models for analysis (logic simulation, verification, timing, power, noise, etc.), implementation (synthesis, test insertion, placement, clock tree synthesis, routing), and fixing (engineering change order, rule fixing, etc.).
Wire Load Model (WLM)
How do you estimate the parasitics (RC) of a net before placement and routing?
Prior to the routing stage, net parasitics and delays cannot be accurately determined; we know only the fanout of a net and the size of the block.
Before floorplanning or layout, wire load models (WLM) can be used to estimate interconnect wiring delays (capacitance (C), resistance (R)) and the area overhead (A) due to interconnect.
The wire load model is also used to estimate the length of a net based upon the number of its fanouts.
The wire load model depends upon the area of the block, and designs with different areas may choose different wire load models.
The wire load model also maps the estimated length of the net into the resistance, the capacitance, and the corresponding area overhead due to routing. The average wire length within a block increases as the block size increases.
Generally, a number of wire load models are present in a Synopsys technology library, each representing a particular size of logic block. These models define the capacitance, resistance, and area factor of the net.
Typically, wire load model selection is based upon the chip area of the block; however, these WLM models can be modified or changed by designers according to their requirements.
The figure shows different areas (chip or block sizes) for which different wire load models would typically be used in determining the parasitics (RC); from the figure it is clear that the smaller block has the smaller capacitance.
For interpolation (e.g. a fanout of 5, which is not in the table):
Length = average of the neighboring fanout lengths = (net length at fanout 4 + net length at fanout 6)/2
Capacitance = interpolated length * capacitance coefficient given in the table
Resistance = interpolated length * resistance coefficient given in the table
Area overhead due to interconnect = interpolated length * area coefficient given in the table
Length = (4.1 + 5.1)/2 = 4.6 units
Capacitance = 4.6 * 1.2 = 5.52 units
Resistance = 4.6 * 6 = 27.6 units
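The arithmetic above can be sketched as follows; the fanout-to-length table mirrors the example numbers, and the area coefficient of 0.5 is an invented placeholder since the example does not give one:

```python
# Hypothetical wire load table: estimated net length indexed by fanout.
length_by_fanout = {4: 4.1, 6: 5.1}
cap_coeff, res_coeff, area_coeff = 1.2, 6.0, 0.5  # per-unit-length factors

# Fanout 5 is absent from the table, so interpolate between its neighbors.
length = (length_by_fanout[4] + length_by_fanout[6]) / 2  # 4.6 units
capacitance = length * cap_coeff   # 5.52 units
resistance = length * res_coeff    # 27.6 units
area = length * area_coeff         # area overhead (assumed coefficient)
```

This is exactly the estimate a synthesis tool makes before routing: length from fanout, then RC from per-unit-length coefficients.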
There are three wire load model modes, which determine which block's WLM applies to a net that crosses hierarchical boundaries: Top, Enclosed, and Segmented.
Top:
In this mode, all the nets within the hierarchy use the wire load model of the top level; if any wire load models are specified in lower-level blocks (sub-blocks), they are ignored. The top-level wire load model takes precedence over all others.
From the figure, the wlm_1 wire load model specified in block A1 is used in place of all the other wire load models specified in blocks A2, A3, and A4.
Enclosed:
In this WLM mode, the wire load model of the block that fully encompasses the net is used
for the entire net.
From the figure, the net Net1 is included in block A2 and thus the wire load model of block
A2, wlm_2 is used for this net.
Other nets that are fully contained in block A3 use the wlm_3 wire load model, and nets that are fully contained within block A4 use the wlm_4 wire load model.
Segmented:
In this WLM mode, each segment of the net gets its wire load model from the block that encompasses that net segment. Each portion of the net uses the appropriate wire load model within its block.
The figure illustrates an example of a net Net1 that has segments in three blocks. The interconnect within block A3 uses the wire load model wlm_3, the segment of the net within block A4 uses the wire load model wlm_4, and the segment within block A2 uses the wire load model wlm_2.
Standard Parasitic Extraction Format (SPEF)
SPEF allows the representation of the parasitic information of a design (R, L, and C) in an ASCII (American Standard Code for Information Interchange) exchange format. A user can read and check the values in a SPEF file. Users never create this file manually; it is automatically generated by a tool. It is mainly used to pass parasitic information from one tool to another.
Interconnect parasitics depend on the process. SPEF supports the specification of best-case, typical, and worst-case values. These triplets (best, typical, worst) are allowed for R, L, and C values, port slews, and loads. The units of the parasitics R, C, and L are specified at the beginning of the SPEF file.
The figure shows that SPEF can be generated by a place-and-route tool or a parasitic extraction tool; this SPEF is then used by a timing analysis tool to check timing, by circuit simulation, or to perform crosstalk analysis.
Parasitics can be represented at many different levels. SPEF supports three models.
Distributed net model: in this model (D_NET), each segment of a net route has its own R and C values.
Reduced net model: in this model (R_NET), there is a single reduced R and C on each load pin of the net, and a pie model (C-R-C) on the driver pin of the net.
Lumped capacitance model: in this model, only a single capacitance is specified for the entire net.
An example of a net with two fanouts (*I *8 and *I *10) is given below.
*CONN
*I *8: Q O *L 0 *D CELL1
*CAP
1 *9:0 0.00372945
2 *9:1 0.0206066
3 *9:2 0.035503
4 *9:3 0.0186259
5 *9:4 0.0117878
6 *9:5 0.0189788
7 *9:6 0.0194256
8 *9:7 0.0122347
9 *9:8 0.00972101
10 *9:9 0.298681
11 *9:10 0.305738
12 *9:11 0.0167775
*RES
14 *9:12 *8: Q 0
15 *9:13 *10: I 0
*END
header_definition: contains basic information such as the SPEF version number, design name,
and units for R, L and C.
[name_map]: specifies the mapping of net names and instance names to indices.
internal_definition: contains the guts of the file, which are the parasitics of the design.
HEADER DEFINITION
*DATE "Fri Sep 21 00:49:32 2005"---> Timestamp when the file was created
*VENDOR "SGP Design Automation_1"---> Vendor tool
*VERSION "V2000.09" ----> version number of the program that was used to generate the SPEF
*DESIGN_FLOW "PIN_CAP NONE" "NAME_SCOPE LOCAL" ---> specifies at what stage the SPEF
file was created. It describes information about the SPEF file that cannot be derived by reading
the file.
*BUS_DELIMITER [ ] ---> specifies the prefix and suffix used to identify a bit of a bus.
NAME MAP:
By using a name map, which maps net names and instance names to indices, the SPEF file size is made effectively smaller and, more importantly, each long name appears in only one place. It specifies the mapping of names to unique integer values (their indices).
*NAME_MAP
*positive_integer name
*positive_integer name
...
*NAME_MAP
*1 memclk
*2 memclk_2x
*3 reset_
*4 refresh
*5 resync
*6 int_d_out[63]
*7 int_d_out[62]
*8 int_d_out[61]
*9 int_d_out[60]
*10 int_d_out[59]
*11 int_d_out[58]
*12 int_d_out[57]
...
*364 mcdll_write_data/write19/d_out_2x_reg_19
*366 mcdll_write_data/write20/d_out_2x_reg_20
*368 mcdll_write_data/write21/d_out_2x_reg_21
...
*14954 test_se_15_S0
*14955 wr_sdly_course_enc[0]_L0
*14956 wr_sdly_course_enc[0]_L0_1
*14957 wr_sdly_course_enc[0]_S0
This helps in reducing the file size by making all future references of the name by the index.
A name can be an instance name or net name.
The name map thus avoids repeating long names and their paths by using their unique integer
representation.
*POWER_NETS VDDQ
*GROUND_NETS VSSQ
The ports section contains the definition of the physical and logical ports of the design.
*PORTS
...
Example:
*PORTS
*1 I
*2 I
*3 I
*4 I
*5 I
*6 I
*7 I
*8 I
*9 I
*10 I
*11 I
...
*450 O
*451 O
*452 O
*453 O
*454 O
*455 O
*456 O
PHYSICAL PORTS
Format:
*PHYSICAL_PORTS
*PDEFINE ---> used when the entity instance is a physical partition instead of a logical one.
Example:
This means that there would be another SPEF file with a *DESIGN value of ddrphy1; that file would contain the parasitics for the design ddrphy1. It is thus possible to have both physical and logical hierarchy.
The internal definition section describes the parasitics for the nets in the design. There are basically two forms:
1. Distributed net, D_NET
2. Reduced net, R_NET
*CONN
*I *14214:Q O *C 21.4950 76.6000 *D DFFQX1
*CAP
...
5 *5426:10278 0.529736
*RES
The capacitance value is the sum of all capacitances on the net: cross-coupling capacitances, which are assumed to be grounded, and load capacitances.
It may or may not include pin capacitances, depending on the setting of PIN_CAP in the *DESIGN_FLOW definition.
The second line is the start of the connectivity section, introduced by *CONN, which describes the drivers and loads of the net.
CAPACITANCE SECTION:
5 *5426:10278 0.529736
The first number is the capacitance identifier. In SPEF there are two forms of capacitance specification.
Entries one through four are of the first form, which specifies the cross-coupling capacitance between two nets. For example, capacitance index 1 has the value 0.217446 and represents the cross-coupling capacitance between nets *5426 and *5290.
Entry five is of the second form, which specifies the capacitance to ground: for capacitance index 5, the capacitance to ground is 0.529736.
Notice that the first node name is necessarily the net name of the D_NET, here *5426. The positive integer 10278 in *5426:10278 specifies an internal node or junction point.
So capacitance index 4 states that there is a coupling capacitance between net *5426 at internal node 10278 and net *5116 at internal node 9922, and the value of this coupling capacitance is 0.113918.
RESISTANCE SECTION
*RES
The first entry gives the resistance between the internal node *5426:10278 and the D pin on *14212; the resistance value is 0.34.
The capacitance and resistance sections can be better understood from the RC network shown pictorially in the figure.
This net has two loads and one driver, and the total capacitance on the net is 2.69358.
*D_NET*5423 2.69358
*CONN
*CAP
10 *5423:10213 1.30707
*RES
*END
Figure: the RC network for D_NET *5423 corresponds to the distributed net specification.
The general form of a distributed net is:
*D_NET net_index total_cap
[conn_section]
[cap_section]
[res_section]
[inductance_section]
*END
INDUCTANCE SECTION
This section is used to define inductances, and its format is very similar to that of the resistance section.
The *V construct is used to specify the accuracy of the parasitics of a net. The accuracy can be specified individually for a net, or globally using the *DESIGN_FLOW statement with the ROUTING_CONFIDENCE value, such as:
*DESIGN_FLOW "ROUTING_CONFIDENCE 100" ---> means the parasitics were extracted after final cell placement and final routing, and 3D extraction was used.
The possible values of routing confidence are 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100.
REDUCED NET
It is the net that is the reduced from the distributed net. There is one driver reduction section
for each driver on a net. The example of a reduced net SPEF is below
*CELL SEDFFX1
*C2_R1_C1 0.511 2.922 0.106 ----> shows the parasitics for the pie model on the driver pin of
the net
*RC *1203: A 0.946 --------> rc_value in the *RC construct is the Elmore delay (R*C)
*END
LUMPED CAPACITANCE
A lumped capacitance net can be defined using either a *D_NET or an *R_NET construct with only the total capacitance and no other information. An example of a lumped capacitance net is given below.
*D_NET *1 80.2096
*CONN
*I *2: Y O *L 0 *D CLKMX2X3
*P *1 O *L 0 *END
*END
Values in a SPEF file can be in a triplet form that represents the process variations, such as
0.243: 0.269: 0.300.
The CRPR (Clock Reconvergence Pessimism Removal) topic comes up once STA introduces OCV (On-Chip Variation) analysis, so a small introduction to OCV is given here first, so that we can understand CRPR and how CRPR is related to OCV.
We all know that during manufacturing, chips on the same die may suffer from variations due to process, voltage, or temperature changes; thus transistors can be faster or slower in different dies.
Delays vary across a single die due to PVT (process, voltage, temperature). The delay of an IC is different in cold weather and in hot weather: in cold weather the metal in the IC shrinks, so the delay decreases; in hot weather the metal expands, so the delay increases.
The variation may be random or deterministic. Random variations include oxide thickness variation, implant doses, and metal or dielectric thickness variations.
Suppose there are two inverters with the same characteristics on a single chip; due to process, voltage, and temperature variation, these two can have different propagation delays.
To compensate for these variations, STA introduces a concept called On-Chip Variation (OCV). In this concept, extra timing margins are added in the timing analysis: a fixed derate value is applied to all cells or nets in the launch clock path, data path, and capture clock path, bringing more pessimism into the timing analysis to compensate for the variation.
In simple words, OCV is a technique in which this flat derate is applied to make a faster path faster and a slower path slower. OCV thus adds pessimism in the common portion of the launch and capture paths, i.e. for the same cell there are two delays, min and max.
CRPR removes this extra pessimism from the common path. Generally, we add delay to every buffer in the process of OCV, but adding more delay also affects the speed of the chip and may cause violations; to overcome this, we remove the artificial delay difference on the common path in the process of CRPR.
The delay difference along the common part of the launching and capturing clock paths is called clock reconvergence pessimism, and CRPR removes it.
Problem: In the figure, the three buffers, the flip-flops, and the combinational circuit each have two delays: a min delay, and the delay after adding derating, i.e. a max delay. Consider a time period of 8 ns, and Tsetup and Thold of 0.2 ns.
Let's consider the buffer that is placed in the common path (shared by the data path and the clock path), ahead of buf2 and buf3.
The tool calculates max delays for the setup calculation and min delays for hold (worst-case and best-case analysis).
Without CRPR:
Setup slack = (required time)min - (arrival time)max
Arrival time = 0.70 + 0.65 + 0.60 + 3.6 = 5.55 ns
Required time = 8 + 0.60 + 0.45 - 0.2 = 8.85 ns
Setup slack = 8.85 ns - 5.55 ns = 3.3 ns
With CRPR:
When it comes to OCV analysis, the tool further considers max delays for the data path and min delays for the clock path during setup analysis, and max delays for the clock path and min delays for the data path during hold analysis.
So the buffer placed in the common path now has two values, a max and a min value. As we know, a cell can't have two different delays at a particular instant of time; the pessimism this introduces on the common path is therefore credited back.
Without CRPR, the setup and hold slacks are 3.3 ns and 2.48 ns.
With CRPR, the setup and hold slacks are 3.4 ns and 2.58 ns.
From the above results, it is clear that with the CRPR method both setup and hold benefit.
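The slack arithmetic of this example can be reproduced as a small sketch; the 0.1 ns CRPR credit (the max minus min delay of the common-path buffer) is an assumption inferred from the figure's numbers, not stated explicitly in the text:

```python
# Setup check without and with CRPR, using the example's numbers.
clock_period, t_setup = 8.0, 0.2

arrival = 0.70 + 0.65 + 0.60 + 3.6               # data path, max delays
required = clock_period + 0.60 + 0.45 - t_setup  # capture path, min delays
setup_slack = required - arrival                 # 3.3 ns (without CRPR)

crpr_credit = 0.1  # assumed: common buffer max delay - min delay = 0.1 ns
setup_slack_crpr = setup_slack + crpr_credit     # 3.4 ns (with CRPR)
```

CRPR always relaxes the check: the credit is added to the setup slack (and likewise to the hold slack), since the common segment cannot simultaneously be at its max and min delay.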
Process:
There are millions of transistors on a single chip as we move to lower nodes, and all the transistors in a chip cannot have the same properties. Process variation is the deviation in the parameters of the transistors during fabrication.
When manufacturing a die, the area at the center and the area at the boundary will see different process variations. This happens because the layers being fabricated cannot be uniform all over the die.
Below are a few important factors which can cause process variation:
1. The wavelength of the UV light
2. Manufacturing defects
3. Oxide thickness variation
4. Dopant and mobility fluctuation
5. Transistor width
6. RC variation
7. Channel length
8. Doping concentration
9. Metal thickness
10. Impurity concentration densities
11. Diffusion depths
12. Imperfections in the manufacturing process such as mask print and etching
These variations affect parameters like the threshold voltage, and the threshold voltage in turn depends on parameters such as doping concentration, surface potential, channel length, oxide thickness, temperature, source-to-body voltage, and implant impurities.
The threshold voltage equals the sum of the flat-band voltage, twice the bulk potential, and the voltage across the oxide due to the depletion layer charge.
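The sentence above can be written compactly; this is the standard textbook expression for the MOS threshold voltage, with the symbols defined in the usual way rather than taken from this document:

```latex
V_T = V_{FB} + 2\phi_B + \frac{Q_{dep}}{C_{ox}}
```

Here V_FB is the flat-band voltage, phi_B the bulk potential, Q_dep the depletion-layer charge per unit area, and C_ox the oxide capacitance per unit area; variation in any of these terms shows up directly as threshold voltage variation.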
So, the current flowing through the channel directly depends upon:
Mobility (μn), which itself depends upon temperature
Oxide capacitance Cox (Cox = εox/tox, hence on the oxide thickness tox)
The ratio of width to length (W/L)
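These three factors combine in the textbook long-channel square-law expression for the saturation drain current (a standard formula, not taken from this document):

```latex
I_{D,\mathrm{sat}} = \frac{1}{2}\,\mu_n C_{ox}\,\frac{W}{L}\,\left(V_{GS} - V_T\right)^2
```

Any process-induced shift in μn, Cox, W/L, or V_T therefore moves the drive current, and through the RC time constant, the gate delay.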
If any of these parameters changes, the current changes, and a change in current affects the delay of the circuit, because the delay depends upon the R and C values (time constant RC) of the circuit. The relation between process and delay is shown in the figure: delay is larger for slow-process MOSFETs and smaller for fast-process MOSFETs.
The process of fabrication includes oxidation, diffusion, ion implantation, deposition, etching, photolithography, drawing out of metal wires, gate drawing, etc. The diffusion density and the width of the metal wires are not uniform throughout the wafer, and the diffusion regions of all transistors will not have the same diffusion concentrations. So all transistors are expected to have different characteristics. This introduces variations in the sheet resistance (Rs) and in transistor parameters such as the threshold voltage (Vth), and causes (W/L) variations in the MOS transistors.
Process variation differs between technologies but is more dominant at lower technology nodes, where there are millions of transistors on the chip. Process variations are due to variations in the manufacturing conditions such as temperature, pressure, and dopant concentrations. As a consequence, different transistors have different lengths throughout the chip. This makes the propagation delay different everywhere in a chip, because a smaller transistor is faster and therefore has a smaller propagation delay.
Voltage:
As we move to lower nodes, the supply voltage for a chip also decreases. Let's say the chip is operating at 1.2 V. There are chances that at certain instants of time this voltage may vary; it can go to 1.5 V or 0.8 V. To take care of this scenario, we consider voltage variation.
There are multiple reasons for voltage variation.
IR drop is caused by the current flow over the power grid network.
Supply noise is caused by parasitic inductance in combination with resistance and capacitance: when current flows through a parasitic inductance (L), it causes voltage bounce.
The supply voltage is given to a chip either externally from a DC source or from a voltage regulator. The voltage regulator will not give the same voltage all the time; it can go above or below the expected voltage, and if the voltage changes, the current changes, making the circuit slower or faster than before.
Power is distributed to all transistors on the chip with the help of a power grid network. Throughout a chip, the supply voltage is not constant; it changes with the placement of cells. The power grid network is made up of metal, and metal has its own resistance and capacitance, so there is a voltage drop along the power grid.
The supply voltage reaching the power pins will not be the same for all standard cells and macros because of the resistance variation of the metal. Consider two cells, one placed closer to the DC power source and the other placed far away. As the interconnect length is greater for the farther cell, it has more resistance, which results in a higher IR drop and reduces the supply voltage reaching that cell. As its voltage is lower, this cell will have a larger delay than the cell placed closer. If nearer cells get a higher voltage, they are faster and their propagation delay is reduced. That is the reason there is variation in delays across the transistors.
The delay of a cell depends on the saturation current, and the saturation current of a cell depends on the power supply. In this way, the power supply affects the propagation delay of a cell.
The self-inductance of a supply line also contributes to the voltage drop. For example, when a transistor switches high, it draws current to charge up the output load. This time-varying current (over a short period of time) causes an opposing self-induced electromotive force. The amplitude of the voltage drop is given by V = L * dI/dt, where L is the self-inductance and I is the current through the line.
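As a quick numeric sketch of V = L * dI/dt (all values here are invented for illustration, not taken from the text):

```python
# Self-inductance voltage drop on a supply line: V = L * dI/dt.
L_line = 1e-9     # assumed 1 nH of supply-line self-inductance
delta_i = 10e-3   # assumed 10 mA current step while charging the load
delta_t = 0.1e-9  # assumed 0.1 ns switching time

v_drop = L_line * delta_i / delta_t  # 0.1 V of supply bounce
```

Even a nanohenry of parasitic inductance produces a tenth of a volt of bounce at these switching speeds, which is why fast edges make supply noise worse.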
Temperature:
The transistor density is not uniform throughout the chip. Some regions of the chip have higher density and higher switching activity, resulting in higher power dissipation, while other regions have lower density and lower switching, resulting in lower power dissipation. Hence the junction temperature in these regions may be higher or lower depending upon the density of transistors. Because of this variation in temperature across the chip, the transistors see different delays.
The temperature variation is with respect to the junction, not the ambient temperature. The temperature at a junction inside the chip can vary over a wide range, which is why temperature variation needs to be considered. The delay of a cell increases with an increase in temperature, but this is not true for all technology nodes: for deep sub-micron technologies the behavior is the opposite. This phenomenon is called temperature inversion.
When a chip is operating, the temperature can vary throughout the chip due to the power dissipation in the MOS transistors. The power consumption in the transistors is mainly due to switching, short-circuit, and leakage power.
The average switching power dissipation is due to the energy required to charge up the parasitic and load capacitances, the short-circuit power dissipation is due to the finite rise and fall times, and the leakage power consumption is due to the reverse leakage and sub-threshold currents.
The biggest contribution to power consumption is switching. The dissipated power increases the temperature. Mobility depends on temperature approximately as mobility ∝ T^(-m), where m is a positive, technology-dependent exponent.
We know that with an increase in temperature, the resistivity of a metal wire (conductor) increases. The reason for this phenomenon is that with an increase in temperature, thermal vibrations also increase. This gives rise to increased electron scattering, electrons collide with each other more, and the mobility of the primary carriers decreases with an increase in temperature.
Similarly, at higher doping concentrations and higher temperature, thermal vibrations increase and the electrons and holes move more slowly, i.e. mobility decreases, so the propagation delay increases. Hence, the propagation delay increases with increased temperature.
The threshold voltage of a transistor also depends on the temperature: a higher temperature decreases the threshold voltage. A lower threshold voltage means a higher current and therefore better delay performance. Which effect dominates depends strongly on the power supply, threshold voltage, load, and input slope of a cell. There is a competition between the two effects, and generally the mobility effect wins.