
White Paper

ASR9K Punt Fabric Data Path Failure


Mahesh S { maheshk@cisco.com }

Introduction
The purpose of this white paper is to provide understanding and clarity on the punt fabric data path failure message when it is seen during ASR9K router operation. When the message appears, it has the format:

"RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 1) (0/7/CPU0, 2) (0/7/CPU0, 3) (0/7/CPU0, 4) (0/7/CPU0, 5) (0/7/CPU0, 6) (0/7/CPU0, 7)"

This document should help customers, TMEs, TAC, the DE/DT community, and anyone else interested in understanding the error message and whether any action needs to be taken.

Abstract
A high-level understanding of ASR9K line cards, fabric cards, route processor cards, and chassis configurations is helpful. The document, however, does not require readers to be familiar with hardware details. The necessary background information is provided before this paper attempts to explain the error message.

Suggestion on How to Read

The author suggests the following reading approach, both to glean the essential details and to use this paper as a reference during debugging.

• When there is spare time and no urgency to root-cause a Punt Fabric Data Path failure, it helps to read all sections of the document. From the start of the document until the section "Analyzing Faults", this paper builds the background necessary to isolate the faulty component when such an error occurs.
• For further reading, to build more knowledge on both NP and punt fabric faults, use Cisco-provided documentation (if any).
• If there is a specific question in mind for which a quick answer is needed, use the "FAQ" section. If the question does not show up there, check whether this white paper answers it elsewhere.
• If we have a router where the fault has occurred, and we are in the process of debugging to narrow down the faulty component or to check whether it is a known issue, all sections starting from "Analyzing Faults" may help.

Background
A packet going through the switch fabric traverses either two or three hops depending upon the line card type. Typhoon-generation line cards add an extra switch fabric element, while Trident-based line cards switch all traffic using the fabric on the route processor card only. The picture below shows the fabric elements for both of these line card types.

The diagrams below show fabric connectivity to the route processor card.


[Figure: Fabric connectivity for both line card generations. On a Trident line card (A9K-4T), NP0–NP3 connect in pairs to bridges B0 and B1, which feed FIA0; FIA0 connects over backplane links to the switch fabric on RSP0 and RSP1. On a Typhoon line card (A9K-24x10G), NP0–NP7 connect in pairs to FIA0–FIA3, which feed the on-card switch fabric ASIC; that ASIC connects over backplane links to the switch fabric on RSP0 and RSP1.]

Punt Fabric Diagnostic Packet Path

The diagnostic application running on the route processor card CPU periodically injects diagnostic packets destined to each NP. Each diagnostic packet is looped back inside the NP and re-injected towards the route processor card CPU that sourced it. This periodic health check of every NP, using a unique packet per NP, alerts on any functional errors on the data path during router operation. It is essential to note that the diagnostic applications on both the active and the standby route processor card inject one packet per NP periodically and maintain a per-NP success or failure count. When consecutive diagnostic packets are dropped up to the threshold drop tolerance count, the application raises a fault.
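Conceptually, this per-NP bookkeeping is a consecutive-failure counter with a threshold. The short Python sketch below illustrates the idea only; the class and names are hypothetical and are not the actual diagnostic application code.

THRESHOLD = 3  # consecutive misses before a fault is raised (default)

class NpHealth:
    """Illustrative per-NP failure window, as described above."""
    def __init__(self, slot, np):
        self.slot, self.np = slot, np
        self.consecutive_misses = 0

    def record(self, response_received):
        if response_received:
            self.consecutive_misses = 0   # any response resets the window
        else:
            self.consecutive_misses += 1
            if self.consecutive_misses == THRESHOLD:
                print("PUNT_FABRIC_DATA_PATH_FAILED: (%s, %d)"
                      % (self.slot, self.np))

# One diagnostic packet per NP per minute; three misses raise the fault.
np1 = NpHealth("0/7/CPU0", 1)
for got_reply in (True, False, False, False):
    np1.record(got_reply)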

Conceptual view of Diagnostic Path

Before the document describes the diagnostic path on Trident and Typhoon based line cards, this section gives a general outline of the fabric diagnostic path, both from the active and from the standby route processor card towards an NP on a line card.

Packet Path between Active Route Processor Card and Line card:
Diagnostic packets injected from the active route processor card into the fabric towards an NP are treated as unicast packets by the switch fabric. Since for unicast packets the switch fabric chooses the outgoing link based on the current traffic load of each link, diagnostic packets are subject to the same traffic load as the rest of the box. (When there are multiple outgoing links towards an NP, the switch fabric ASIC chooses the link that is currently least loaded.)
The diagram below depicts the diagnostic packet path sourced from the active route processor card. Note that the first link connecting the FIA on the line card to the XBAR on the route processor card is chosen all the time for packets destined to the NP. Response packets from the NP are subject to the link load distribution algorithm (if the line card is Typhoon based). This means that the response packet from the NP towards the active route processor card can choose any of the backplane links connecting the line card to the route processor card, depending on the backplane link load.


[Figure: Diagnostic packet path between a line card and the active RSP. The active RSP CPU injects the packet through the RSP FIA and FPGA into the RSP switch fabric; the packet crosses a backplane link to the line card, is looped back inside the NP, and returns to the active RSP CPU. The line card fabric element is a fabric ASIC (multiple or a single XBAR); the bridge FPGA is absent on all Typhoon line cards except the 2x100GE/1x100GE and is present on all Trident-based line cards. Legend: one path marks the packet injected by the diagnostic application from the RSP to the NPU; the other marks the response packet looped back inside the NPU towards the RSP.]

Packet Path between Standby Route Processor Card and Line card:

[Figure: Diagnostic packet path between a line card and the standby RSP. Same elements as the previous figure, with packets injected from, and looped back to, the standby RSP CPU.]

Diagnostic packets injected from the standby route processor card into the fabric towards an NP are treated as multicast packets by the switch fabric. Although they are multicast packets, there is no replication inside the fabric: every diagnostic packet sourced from the standby route processor card still reaches only one NP at a time. The response packet from an NP towards the route processor card is also a multicast packet over the fabric, with no replication. Hence the diagnostic application on the standby route processor card receives a single response packet from each NP, one packet at a time. The diagnostic application thus keeps tabs on every NP in the system by injecting one packet per NP and expecting a response from every NP, one packet at a time. Since for multicast packets the switch fabric chooses the outgoing link based on a field value in the packet header, diagnostic packets can be injected over every fabric link between the route processor card and the line card backplane. The standby route processor card therefore keeps tabs on an NP's health over every fabric link connecting the route processor card to the line card slot.
The diagram above depicts the diagnostic packet path sourced from the standby route processor card. Note that, unlike in the active route processor card case, all links connecting the line card to the XBAR on the route processor card are exercised. Response packets from the NP take the same backplane link that was used in the route processor card to line card direction. This testing ensures that all links connecting the standby route processor card to the line card are monitored all the time.

Punt Fabric Diagnostic Packet Path on Trident Line Card

The diagram below depicts route processor card sourced diagnostic packets destined to an NP, looped back towards the route processor card. It is important to note the data path links and ASICs that are common to all NPs, as well as the links and components that are specific to a subset of NPs. (For example, B0 is common only to NP0 and NP1, but FIA0 is common to all NPs. On the route processor card end, all links, data path ASICs, and the FPGA are common to all line cards and hence to all NPs in a chassis.)
[Figure: Punt fabric diagnostic packet path on a Trident line card (A9K-4T): RSP CPU → RSP FIA/FPGA → RSP switch fabric → line card FIA0 → bridges B0/B1 → NPs, and back.]

Punt Fabric Diagnostic Packet Path on Typhoon Line Card

The diagram below depicts route processor card sourced diagnostic packets destined to an NP, looped back towards the route processor card. It is important to note the data path links and ASICs that are common to all NPs, as well as the links and components that are specific to a subset of NPs. (For example, FIA0 is common only to NP0 and NP1, but the switch fabric ASIC on the line card is common to all NPs. On the route processor card end, all links, data path ASICs, and the FPGA are common to all line cards and hence to all NPs in a chassis.)


[Figure: Punt fabric diagnostic packet path on a Typhoon line card (A9K-24x10G): RSP CPU → RSP FIA/FPGA → RSP switch fabric → line card switch fabric ASIC → FIA0–FIA3 → NP0–NP7, and back.]

In the next few sections, the document attempts to depict the packet path to every NP. This is necessary to understand the punt fabric data path error message and also to locate the failure point.

Punt Fabric Diagnostic Alarm and Failure Reporting

Failure to get a response from an NP in an ASR9K system results in an alarm. The online diagnostic application executing on the route processor card decides to raise an alarm when there are three consecutive failures; the application maintains a three-packet failure window for every NP. The active and standby route processor cards diagnose independently and in parallel. Hence, depending on fault location, only the active, only the standby, or both can report the error and raise the alarm.
By default, the frequency of diagnostic packets towards each NP is one packet per 60 seconds, i.e. one per minute, so with the default threshold a hard failure is typically flagged roughly three minutes after onset.

The format of the alarm message is shown below.

"RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 1) (0/7/CPU0, 2) (0/7/CPU0, 3) (0/7/CPU0, 4) (0/7/CPU0, 5) (0/7/CPU0, 6) (0/7/CPU0, 7)"

The message should be read as a failure to reach NPs 1, 2, 3, 4, 5, 6, and 7 on line card 0/7/CPU0 from route processor card 0/RSP0/CPU0.
From the list of online diagnostic tests, we can see the attributes of the punt fabric loopback test using:
RP/0/RSP1/CPU0:ios(admin)#sh diagnostic content location 0/rsp1/cpu0

RP 0/RSP1/CPU0:
Diagnostics test suite attributes:
M/C/* - Minimal bootup level test / Complete bootup level test / NA


B/O/* - Basic ondemand test / not Ondemand test / NA


P/V/* - Per port test / Per device test / NA
D/N/* - Disruptive test / Non-disruptive test / NA
S/* - Only applicable to standby unit / NA
X/* - Not a health monitoring test / NA
F/* - Fixed monitoring interval test / NA
E/* - Always enabled monitoring test / NA
A/I - Monitoring is active / Monitoring is inactive
                                                      Test Interval     Thre-
  ID   Test Name                          Attributes  (day hh:mm:ss.ms) shold
  ==== ================================== =========== ================= =====
1) PuntFPGAScratchRegister ---------> *B*N****A 000 00:01:00.000 1
2) FIAScratchRegister --------------> *B*N****A 000 00:01:00.000 1
3) ClkCtrlScratchRegister ----------> *B*N****A 000 00:01:00.000 1
4) IntCtrlScratchRegister ----------> *B*N****A 000 00:01:00.000 1
5) CPUCtrlScratchRegister ----------> *B*N****A 000 00:01:00.000 1
6) FabSwitchIdRegister -------------> *B*N****A 000 00:01:00.000 1
7) EccSbeTest ----------------------> *B*N****I 000 00:01:00.000 3
8) SrspStandbyEobcHeartbeat --------> *B*NS***A 000 00:00:05.000 3
9) SrspActiveEobcHeartbeat ---------> *B*NS***A 000 00:00:05.000 3
10) FabricLoopback ------------------> MB*N****A 000 00:01:00.000 3
11) PuntFabricDataPath --------------> *B*N****A 000 00:01:00.000 3
12) FPDimageVerify ------------------> *B*N****I 001 00:00:00.000 1

Here we see that the PuntFabricDataPath test sends one packet every minute and has a failure threshold of 3, implying that up to two consecutive packet losses are tolerated; the third consecutive loss results in an alarm. The test attributes shown are the default values. Using admin configuration commands, these attributes can be changed if desired, as the example below shows.
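For illustration only (the interval and count values here are arbitrary), the PuntFabricDataPath interval and threshold could be changed with admin configuration commands of the form listed later in "Some Useful Diagnostic Commands":

admin-config# diagnostic monitor interval location 0/RSP0/CPU0 test PuntFabricDataPath 0 00:02:00.000
admin-config# diagnostic monitor threshold location 0/RSP0/CPU0 test PuntFabricDataPath failure count 5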

Trident Line Card Diagnostic Packet Path

A.1 NP0 Diagnostic Failure

Fabric Diagnostic Path


[Figure: NP0 diagnostic path on a Trident line card (A9K-4T): RSP CPU → RSP FIA/FPGA → RSP switch fabric → FIA0 → B0 → NP0, looped back towards the RSP.]

The diagram above depicts the packet path between the route processor card CPU and line card NP0. Here we see that the link connecting Bridge 0 (B0) and NP0 is the only link specific to NP0; all other links fall in the common path.
Also make note of the packet path from the route processor card towards NP0. Although four links are available for packets destined towards NP0 from the route processor card, the first link between the route processor card and the line card slot is always used in the route processor card to line card direction. The returned packet from NP0, however, can be sent back to the active route processor card over either of the two fabric link paths from the line card slot to the active route processor card; which of the two links is used depends on link load at that time.
Response packets from NP0 towards the standby route processor card use both links, but one link at a time. The choice of link is based on a header field that the diagnostic application populates.

A.1.1 NP0 Diagnostic Failure Analysis

Single Fault Scenario

If we detect a single PFM punt fabric data path failure alarm with only NP0 in the failure message (if more than one fault occurred, please refer to the Multiple Fault Scenario section), the fault is only on the fabric path connecting the route processor card and the line card's NP0.
"RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 0)" [The discussion in this section applies to any line card slot in a chassis regardless of chassis type, and hence can be applied to all line card slots.]
Using the above data path diagram, we see that the fault has to be in one or more of the following:
• the link connecting NP0 and B0
• the queues inside B0 directed towards NP0
• NP0 itself.
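To check whether NP0 itself is still processing packets, its counters can be inspected with the NP counter CLI listed in the Appendix, for example:

show controllers NP counter NP0 location 0/7/CPU0

A counter set that has stopped incrementing between two reads is consistent with a stuck NP rather than a faulty fabric link.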

Multiple Fault Scenario

Multiple NP faults
If, apart from the PUNT_FABRIC_DATA_PATH_FAILED fault, any other faults are observed on NP0, or the PUNT_FABRIC_DATA_PATH_FAILED fault is also reported by other NPs on the same line card, then fault isolation can be done by correlating all the faults. For example, if both the PUNT_FABRIC_DATA_PATH_FAILED fault and the LC_NP_LOOPBACK_FAILED fault [please refer to the Appendix to understand the loopback fault] occur on NP0, then the NP has stopped processing packets. This could be an early indication of a critical failure inside NP0. However, if only one of the two faults occurred, then the fault is localized to either the punt fabric data path or the line card CPU to NP path; a filtered log search such as the one below helps make this correlation.
If more than one NP on a line card has the punt fabric data path fault, then we need to walk up the tree of fabric links to narrow down the faulty component. For example, if both NP0 and NP1 have the fault, then the fault has to be in B0 or in the link connecting B0 and FIA0: it is less likely that both NP0 and NP1 ran into a critical internal error at the same time. Albeit less likely, nothing precludes both NP0 and NP1 from running into a critical error due to incorrect processing of a particular kind of packet or a bad packet.
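A single filtered log search covering both fault strings makes this correlation quick, for example:

show logging | inc "PUNT_FABRIC_DATA_PATH_FAILED|LC_NP_LOOPBACK_FAILED"

If both strings appear for the same NP in the same time window, suspect the NP itself; if only one appears, suspect the corresponding path.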

Both route processor cards report Fault

If 0/RSP0/CPU0 is in the active redundancy state, 0/RSP1/CPU0 is in the standby state, and both route processor cards report a fault to one or more NPs on a line card, then the fault points to the common links and components on the data path between the NP(s) and both route processor cards; check each of them.

A.2 NP1 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP1. Here we see that the link connecting Bridge 0 (B0) and NP1 is the only link specific to NP1; all other links fall in the common path.
Also make note of the packet path from the route processor card towards NP1. Although four links are available for packets destined towards NP1 from the route processor card, the first link between the route processor card and the line card slot is always used in the route processor card to line card direction. The returned packet from NP1, however, can be sent back to the active route processor card over either of the two fabric link paths from the line card slot to the active route processor card; which of the two links is used depends on link load at that time.
Response packets from NP1 towards the standby route processor card use both links, but one link at a time. The choice of link is based on a header field that the diagnostic application populates.

Fabric Diagnostic Path


[Figure: NP1 diagnostic path on a Trident line card (A9K-4T): RSP CPU → RSP FIA/FPGA → RSP switch fabric → FIA0 → B0 → NP1, looped back towards the RSP.]

A.2.1 NP1 Diagnostic Failure Analysis

Refer to section A.1.1 and apply the same reasoning to NP1 (instead of NP0).

A.3 NP2 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP2. Here we see that the link connecting Bridge 1 (B1) and NP2 is the only link specific to NP2; all other links fall in the common path.
Also make note of the packet path from the route processor card towards NP2. Although four links are available for packets destined towards NP2 from the route processor card, the first link between the route processor card and the line card slot is always used in the route processor card to line card direction. The returned packet from NP2, however, can be sent back to the active route processor card over either of the two fabric link paths from the line card slot to the active route processor card; which of the two links is used depends on link load at that time.
Response packets from NP2 towards the standby route processor card use both links, but one link at a time. The choice of link is based on a header field that the diagnostic application populates.

Fabric Diagnostic Path


[Figure: NP2 diagnostic path on a Trident line card (A9K-4T): RSP CPU → RSP FIA/FPGA → RSP switch fabric → FIA0 → B1 → NP2, looped back towards the RSP.]

A.3.1 NP2 Diagnostic Failure Analysis

Refer to section A.1.1 and apply the same reasoning to NP2 (instead of NP0).

A.4 NP3 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP3. Here we see that the link connecting Bridge 1 (B1) and NP3 is the only link specific to NP3; all other links fall in the common path.
Also make note of the packet path from the route processor card towards NP3. Although four links are available for packets destined towards NP3 from the route processor card, the first link between the route processor card and the line card slot is always used in the route processor card to line card direction. The returned packet from NP3, however, can be sent back to the active route processor card over either of the two fabric link paths from the line card slot to the active route processor card; which of the two links is used depends on link load at that time.
Response packets from NP3 towards the standby route processor card use both links, but one link at a time. The choice of link is based on a header field that the diagnostic application populates.

Fabric Diagnostic Path

[Figure: NP3 diagnostic path on a Trident line card (A9K-4T): RSP CPU → RSP FIA/FPGA → RSP switch fabric → FIA0 → B1 → NP3, looped back towards the RSP.]

A.4.1 NP3 Diagnostic Failure Analysis

Refer to section A.1.1 and apply the same reasoning to NP3 (instead of NP0).

Typhoon Line Card Diagnostic Packet Path

In order to establish the background for fabric punt packets, we use two examples in this section. The first example uses NP1 and the second uses NP3. The description and analysis can be extended to the other NPs on any Typhoon-based card.

B.1 Typhoon NP1 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP1. Here we see that the link connecting FIA0 and NP1 is the only link specific to the NP1 path. All other links between the line card slot and the route processor card slot fall in the common path. Links connecting the fabric ASIC on the line card to the FIAs on the line card are specific to a subset of NPs. (For example, both links between FIA0 and the local fabric ASIC on the line card are used for traffic to NP1.)
Also make note of the packet path from the route processor card towards NP1 as depicted. Although 8 links are available for packets destined towards NP1 from the route processor card, a single path between the route processor card and the line card slot is used. The returned packet from NP1, however, can be sent back to the route processor card over any of the 8 fabric link paths from the line card slot to the route processor card. Each of these 8 links is exercised, one at a time, when the diagnostic packet is destined back to the route processor card CPU.


Fabric Diagnostic Path

[Figure: NP1 diagnostic path on a Typhoon line card (A9K-24x10G): RSP CPU → RSP FIA/FPGA → RSP switch fabric → line card switch fabric ASIC → FIA0 → NP1, looped back towards the RSP.]

B.2 Typhoon NP3 Diagnostic Failure

The diagram below depicts the packet path between the route processor card CPU and line card NP3. Here we see that the link connecting FIA1 and NP3 is the only link specific to the NP3 path. All other links between the line card slot and the route processor card slot fall in the common path. Links connecting the fabric ASIC on the line card to the FIAs on the line card are specific to a subset of NPs. (For example, both links between FIA1 and the local fabric ASIC on the line card are used for traffic to NP3.)
Also make note of the packet path from the route processor card towards NP3 as depicted. Although 8 links are available for packets destined towards NP3 from the route processor card, a single path between the route processor card and the line card slot is used. The returned packet from NP3, however, can be sent back to the route processor card over any of the 8 fabric link paths from the line card slot to the route processor card. Each of these 8 links is exercised, one at a time, when the diagnostic packet is destined back to the route processor card CPU.

Fabric Diagnostic Path

[Figure: NP3 diagnostic path on a Typhoon line card (A9K-24x10G): RSP CPU → RSP FIA/FPGA → RSP switch fabric → line card switch fabric ASIC → FIA1 → NP3, looped back towards the RSP.]


Analyzing Faults
This section categorizes faults into hard and transient cases and lists the steps to identify whether a fault is hard or transient. Once the fault type is determined, the document also specifies CLIs that can be executed on the router to understand the fault and which actions can be applied.

Transient Fault: If a Set PFM message is followed by a Clear PFM message, then a fault occurred and the router corrected the fault itself. Transient faults can occur due to environmental conditions or recoverable faults in hardware components, and sometimes it can be hard to associate a transient fault with any particular event. The suggested approach for transient errors is to simply monitor for further occurrences. If a transient fault occurs more than once, treat it as a hard fault and use the recommendations and steps for analyzing such faults described in the next section.
If faults towards a given NP set and clear at a rate of at least one per minute, and at least ten such faults are observed within a span of ten minutes, then even though each fault eventually clears, we can treat such a flurry of failures as a hard fault.

An example of a transient fabric fault is listed below for clarity.

RP/0/RSP0/CPU0:Feb 5 05:05:44.051 : pfm_node_rp[354]:%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[237686]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/2/CPU0, 0)

RP/0/RSP0/CPU0:Feb 5 05:05:46.051 : pfm_node_rp[354]:%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[237686]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/2/CPU0, 0)

Hard Fault: If a Set PFM message is not followed by a Clear PFM message, then a fault occurred and the router has not corrected it: either the fault-handling code could not recover it, or the hardware fault is not recoverable by nature. Hard faults can occur due to environmental conditions or unrecoverable faults in hardware components. The suggested approach for hard errors is to use the guidelines in section "Steps to Analyze Hard Faults".
An example of a hard fabric fault is listed below for clarity. For the example message below, there will not be a corresponding Clear PFM message.
RP/0/RSP0/CPU0:Feb 5 05:05:44.051 : pfm_node_rp[354]:%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[237686]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/2/CPU0, 0)

Recommendation: In a hard fault scenario, collect the output of all the CLIs mentioned in section "Data To Collect Before SR Creation" and raise an SR. In urgent cases, after collecting all debug output, initiate a route processor card or line card reload (depending on fault isolation). If the error has not recovered after the reload, initiate an RMA.

Steps to Analyze Transient Faults

• The first step is to find out whether the error occurred once or multiple times. [Use CLI 1 listed below in this section.]
• The second step is to note down the current status, i.e. whether the error is outstanding or cleared. [Use CLI 2 listed below in this section.]
• If the error status is flapping between Set and Clear, then one or more faults within the fabric data path are repeatedly occurring and being rectified, either by software or by hardware.
• To monitor future occurrences of the fault (when the last status of the error is CLEAR and no new faults are occurring), provision SNMP traps or run a script that periodically collects "show pfm location all" output and searches for the error string (a minimal sketch of such a script follows the CLI list below).

CLIs to Use
1: show logging | inc "PUNT_FABRIC_DATA_PATH"
2: show pfm location all ; check whether the status is SET or CLEAR
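A minimal sketch of such a monitoring script is below. It assumes an off-box management host with key-based SSH access to the router; the hostname, the polling interval, and the use of plain ssh are illustrative assumptions.

#!/usr/bin/env python3
# Minimal sketch: periodically collect "show pfm location all" from the
# router and flag any line containing the punt fabric fault string.
import subprocess
import time

ROUTER = "asr9k-lab"   # hypothetical management hostname
PATTERN = "PUNT_FABRIC_DATA_PATH_FAILED"
INTERVAL = 60          # seconds; matches the default 1-minute test cadence

while True:
    result = subprocess.run(
        ["ssh", ROUTER, "show pfm location all"],
        capture_output=True, text=True, check=False,
    )
    for line in result.stdout.splitlines():
        if PATTERN in line:
            print(time.strftime("%F %T"), line.strip())
    time.sleep(INTERVAL)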

Steps to Analyze Hard Faults

If we view the fabric data path links on a line card as a tree, whose details are described in the section "Background", it is immediate to infer that, depending on the point of fault, one or more NPs may not be accessible. When faults occur on multiple NPs, use the CLIs listed in this section to look at the faults.

CLIs to Use
1: show logging | inc "PUNT_FABRIC_DATA_PATH"
The output might contain one or more different NPs (example: NP2, NP3).
2: show controller fabric fia link-status location <lc>
Since both NP2 and NP3 (in section B.2) receive and send through a single FIA, it is reasonable to infer that the fault is in the associated FIA on the path.
3: show controller fabric crossbar link-status instance <0 and 1> location <LC or RSP>
If none of the NPs on a line card is reachable by the diagnostic application, then it is reasonable to infer that the links connecting the line card slot to the route processor card have a fault, or that one of the ASICs forwarding traffic between the route processor card and the line card has a fault.
show controller fabric crossbar link-status instance 0 location <lc>
show controller fabric crossbar link-status instance 0 location 0/rsp0/cpu0
show controller fabric crossbar link-status instance 1 location 0/rsp0/cpu0
show controller fabric crossbar link-status instance 0 location 0/rsp1/cpu0
show controller fabric crossbar link-status instance 1 location 0/rsp1/cpu0
4: show controller fabric fia link-status location 0/rsp*/cpu0
show controller fabric fia bridge sync-status location 0/rsp*/cpu0
If all NPs on all line cards report the fault, then most likely the fault is on a route processor card (active or standby). Please refer to the links connecting the route processor card CPU to the FPGA and the route processor card FIA in the Background section.
show controller fabric fia link-status location 0/rsp0/cpu0
show controller fabric fia link-status location 0/rsp1/cpu0
show controller fabric fia bridge sync-status location 0/rsp0/cpu0
show controller fabric fia bridge sync-status location 0/rsp1/cpu0

Past Failures
In the past we have seen faults that were 99% recoverable, and in most cases software-initiated recovery action fixed them. However, in very rare cases, unrecoverable errors are seen that can only be fixed with an RMA of the affected cards.
In this section we pull errors seen in our past experience so that they can serve as guidance if similar errors are observed.

Transient Error due to NP oversubscription


RP/0/RP1/CPU0:Jun 26 13:08:28.669 : pfm_node_rp[349]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[200823]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/10/CPU0, 0)
RP/0/RP1/CPU0:Jun 26 13:09:28.692 : pfm_node_rp[349]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[200823]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/10/CPU0, 0)
Hard Fault due to NP Lockup

When PUNT_FABRIC_DATA_PATH_FAILED occurs, and the failure is due to an NP lockup, faults similar to what is listed below will appear.

LC/0/2/CPU0:Aug 26 12:09:15.784 CEST: prm_server_ty[303]: prm_inject_health_mon_pkt : Error injecting health packet for NP0 status = 0x80001702
LC/0/2/CPU0:Aug 26 12:09:18.798 CEST: prm_server_ty[303]: prm_inject_health_mon_pkt : Error injecting health packet for NP0 status = 0x80001702
LC/0/2/CPU0:Aug 26 12:09:21.812 CEST: prm_server_ty[303]: prm_inject_health_mon_pkt : Error injecting health packet for NP0 status = 0x80001702
LC/0/2/CPU0:Aug 26 12:09:24.815 CEST: prm_server_ty[303]: NP-DIAG health monitoring failure on NP0
LC/0/2/CPU0:Aug 26 12:09:24.815 CEST: pfm_node_lc[291]:%PLATFORM-NP-0-NP_DIAG : Set|prm_server_ty[172112]|Network Processor Unit(0x1008000)| NP diagnostics warning on NP0.
LC/0/2/CPU0:Aug 26 12:09:40.492 CEST: prm_server_ty[303]: Starting fast reset for NP 0
LC/0/2/CPU0:Aug 26 12:09:40.524 CEST: prm_server_ty[303]: Fast Reset NP0 - successful auto-recovery of NP

Failures between RSP440 and Typhoon line cards

Cisco has fixed an issue where, rarely, links between the RSP440 and Typhoon-based line cards across the backplane get retrained. The retraining of a fabric link happens because the signal strength is not optimal. This issue is present in the base releases 4.2.1, 4.2.2, 4.2.3, 4.3.0, 4.3.1, and 4.3.2. SMUs for all these releases are posted on CCO and tracked with DDTS CSCuj10837.
When this known and fixed issue occurs on a router, either of the following can occur:
1. The link goes down and comes up (transient).
2. The link stays permanently down.
In both cases, platform fault messages will appear on the screen. Depending on how fast the link gets retrained, one or more NPs on the connected line card will report the fault. Logs of this issue occurring on a router at a customer site are listed below; in this case, seven NPs (NP1 through NP7) reported the error. However, we have seen instances where, when the link retrained quickly, only a subset of the NPs on the line card reported the error.

Jan 12 03:02:00 phlasr1.router1 3418: RP/0/RSP1/CPU0:Jan 12 03:02:00.857 : pfm_node_rp[357]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[348276]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/4/CPU0, 1) (0/4/CPU0, 2) (0/4/CPU0, 3) (0/4/CPU0, 4) (0/4/CPU0, 5) (0/4/CPU0, 6) (0/4/CPU0, 7)

Jan 12 04:16:37 phlasr1.router1 3431: RP/0/RSP1/CPU0:Jan 12 04:16:37.725 : pfm_node_rp[357]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED : Clear|online_diag_rsp[348276]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/4/CPU0, 1) (0/4/CPU0, 2) (0/4/CPU0, 3) (0/4/CPU0, 4) (0/4/CPU0, 5) (0/4/CPU0, 6) (0/4/CPU0, 7)

Ltrace To Confirm Link Retraining

Using the CLI mentioned below, we can check whether a link got retrained.
RP/0/RSP0/CPU0:edge1.ord1#sh controllers fabric ltrace crossbar location 0/rsp1/CPU0 | include link_retrain

Sep 20 07:38:49.772 crossbar 0/RSP1/CPU0 t1 detail xbar_fmlc_handle_link_retrain: rcvd link_retrain for (1,1,1),(2,1,0),0.

Here (1,1,1) and (2,1,0) denote the two end points. The parameters within the parentheses should be read as (SlotId, Asic-Type, Asic-Instance). Hence (1,1,1) implies physical slot 1 (hence RSP1 in a 9006 chassis), ASIC type 1 (which identifies the generation of the fabric XBAR ASIC), and XBAR instance 1. (2,1,0) implies physical slot 2 (hence line card 0/0/CPU0 in a 9006 chassis), XBAR ASIC type 1, and XBAR instance 0.
In (1,1,1),(2,1,0),0, the trailing number (in this case 0) identifies the link number, out of the 4 fabric links that connect to each route processor card; it will be between 0 and 3.
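The decoding of these tuples is mechanical, so it can be scripted when many retrain messages have to be triaged. A minimal Python sketch is below; the function name is hypothetical.

import re

def decode_retrain(tuples_text):
    """Decode '(1,1,1),(2,1,0),0' into source, destination, and link."""
    nums = [int(n) for n in re.findall(r"\d+", tuples_text)]
    fields = ("slot", "asic_type", "asic_instance")
    return dict(zip(fields, nums[0:3])), dict(zip(fields, nums[3:6])), nums[6]

src, dst, link = decode_retrain("(1,1,1),(2,1,0),0")
print(src, dst, link)
# {'slot': 1, 'asic_type': 1, 'asic_instance': 1} {'slot': 2, ...} 0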

Online Diagnostic Test Report

A summary of all online diagnostic tests and failures, along with the last time stamp when a test passed, is listed in the output of "show diagnostic result location <node> [test <test-id> detail]". (A list of all tests, along with the frequency of test packets, can be seen with "show diagnostic content location <node>"; in the listing shown earlier in this paper, FabricLoopback is test 10 and PuntFabricDataPath is test 11, but test IDs should be confirmed per node.)
The test result output will be similar to the sample listed below, shown here for the FabricLoopback test; the PuntFabricDataPath result has the same format.
RP/0/RSP0/CPU0:ios(admin)#show diagnostic result location 0/rsp0/cpu0 test 10 detail
Current bootup diagnostic level for RP 0/RSP0/CPU0: minimal
Test results: (. = Pass, F = Fail, U = Untested)
___________________________________________________________________________
10 ) FabricLoopback ------------------> .
Error code ------------------> 0 (DIAG_SUCCESS)
Total run count -------------> 357
Last test execution time ----> Sat Jan 10 18:55:46 2009
First test failure time -----> n/a
Last test failure time ------> n/a
Last test pass time ---------> Sat Jan 10 18:55:46 2009
Total failure count ---------> 0
Consecutive failure count ---> 0

FAQ
1. Does the primary or the standby route processor card send the keepalives or online diagnostic packets to every NP in the system?

A: Both. Both route processor cards send online diagnostic packets to every NP.

2. Is the path the same when route processor card 1 is active?

A: The diagnostic path is the same whether route processor card 0 or route processor card 1 is active. The path depends on the state (active or standby) of the route processor card, not on which physical card holds that state. The section "Punt Fabric Diagnostic Packet Path" has more details.

3. How often does the route processor card send keepalives, and how many keepalives do we need to miss to trigger the alarm?

A: By default, a packet is sent towards every NP every minute. It requires three consecutive misses to trigger the fault.

4. How do we determine whether an NP is or has been oversubscribed?

A: One way to check whether an NP was oversubscribed in the past, or is currently oversubscribed, is to check for certain kinds of drops inside the NP and for tail drops in the FIA. IFDMA drops inside the NP occur when the NP is oversubscribed and cannot keep up with incoming traffic. FIA tail drops occur when the egress NP asserts flow control (asking the ingress line card to send less traffic); under a flow control scenario, the FIA shows tail drops.
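For example (exact counter names vary by NP generation, so these filters are illustrative), the NP and FIA drop counters can be inspected with:

show controllers NP counters NP0 location 0/2/CPU0 | inc IFDMA
show controllers fabric fia drops ingress location 0/2/CPU0
show controllers fabric fia drops egress location 0/2/CPU0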

5. How do we determine whether an NP is locked up?

A: Typically an NP lockup is cleared by a fast reset. The reason given for the fast reset will clearly state the NP lockup.

6. What do we see if an NP is completely dead due to a HW failure?

A: We see both a Punt Fabric Data Path failure for that NP and an NP loopback test failure. (An NP loopback test failure message similar to the one shown in the Appendix, and a punt fabric data path failure message similar to the one shown in the Introduction of this paper, will appear.)

7. Please explain the meaning and numbering of (9,1,0),(5,1,0),1 in the ltrace message "Sep 3 13:47:07.027 crossbar 0/7/CPU0 t1 detail xbar_fmlc_handle_link_retrain: rcvd link_retrain for (9,1,0),(5,1,0),1."

A: Here (9,1,0) and (5,1,0) denote the two end points. The parameters within the parentheses should be read as (SlotId, Asic-Type, Asic-Instance). Hence (9,1,0) implies physical slot 9 (hence line card 0/7/CPU0), ASIC type 1 (which identifies the generation of the fabric XBAR ASIC), and XBAR instance 0. (5,1,0) implies physical slot 5 (hence route processor card 0/RSP1/CPU0 in a 9010 chassis), XBAR ASIC type 1, and XBAR instance 0.
In (9,1,0),(5,1,0),1, the trailing number (in this case 1) identifies the link number, out of the 4 fabric links that connect a Typhoon line card to each route processor card; it will be between 0 and 3.
8. Just because a diag message is sourced from one route processor card does not mean that it comes back to the same one. Yes or no?

A: No; it does come back to the same one. Diagnostic packets are sourced from both route processor cards and tracked on a per route processor card basis, and a diagnostic packet sourced from a route processor card is looped back to that same route processor card by the NP.


9. The CSCuj10837 SMU and critical announcement are only for the link retrain event. How did we determine to put out an announcement for this and not the other events? Is this just the most common event, and if so, how was it determined that this is the cause of 99% of these messages?

A: After considerable effort spent in root cause analysis and verification of the fix, the hardware team is confident that the link retraining issues are fixed. It is the software team's analysis that all or most of the diagnostic failures seen before are tied to link retraining. Hence the BU has acknowledged the issue tracked by CSCuj10837 as the single most prevalent cause of all the issues seen so far between the RSP440 and Typhoon-based line cards. This SMU and root cause analysis do not apply to any failures seen on Trident-based line cards.

10. How long does it take to retrain the serdes once the decision to do so is made?

A: The decision to retrain is made as soon as the fault is detected. Fabric drivers take up to 5 seconds to detect a link failure. Once the failure is detected, the link gets retrained (in the absence of a hard fault) within 2 seconds. Hence, in the worst case, traffic will black-hole for up to 5 seconds plus the retrain time.

11. At what point is the decision to retrain made?

A: As soon as a link fault is detected, the decision to retrain is made by the fabric ASIC driver.

12. So it's only between the FIA on the active route processor card and the fabric that we use the first link, and after that it's the least-loaded link when there are multiple links available?

A: Correct. The first link, connecting to the first XBAR instance on the active route processor card, is used to inject traffic into the fabric. The response packet from the NP can reach the active route processor card on any of the links connecting back to it; the choice of link depends on link load.

13. During the retrain, are all packets sent over that fabric link lost?

A: Yes. In the worst case this can last up to 7 seconds; on average, the link is retrained within 3 seconds if the fault is transient. If the link has a fatal, unrecoverable fault for any reason, the link is administratively shut down and traffic is re-routed around it. Hence, within about 5 seconds on average (3 seconds to detect and 2 seconds for the retraining to fail), traffic drops within the switch fabric will stop, and the faulty link will be isolated and no longer used for forwarding traffic.

14. How frequently do we expect to see a crossbar serdes retrain event after customers have the fix for CSCuj10837?

A: Once a customer has the fix for CSCuj10837, serdes retraining on the backplane links between the RSP440 and Typhoon-based line cards should never occur. If it does, then an SR case has to be raised and proper debugging should be done.

Data To Collect Before SR Creation


The minimum logging to collect before any action is taken is:

• show logging
• show pfm location all
• admin sh diagn result loc 0/rsp0/cpu0 test 8 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 8 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 9 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 9 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 10 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 10 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 11 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 11 detail
• show controller fabric fia link-status location <lc>
• show controller fabric fia link-status location <both rsp>
• show controller fabric fia bridge sync-status location <both rsp>
• show controller fabric crossbar link-status instance 0 location <lc>
• show controller fabric crossbar link-status instance 0 location <both rsp>
• show controller fabric crossbar link-status instance 1 location <both rsp>
• show controller fabric ltrace crossbar location <both rsp>
• show controller fabric ltrace crossbar location <affected lc>
• show tech fabric location <fault showing lc> file <path to file>
• show tech fabric location <both rsp> file <path to file>

Some Useful Diagnostic Commands


show diagnostic ondemand settings
show diagnostic content location < loc >
show diagnostic result location < loc > [ test {id|id_list|all} ] [ detail ]
show diagnostic status
admin#diagnostic start location < loc > test {id|id_list|test-suite}
admin#diagnostic stop location < loc >
admin#diagnostic ondemand iterations < iteration-count >
admin#diagnostic ondemand action-on-failure {continue failure-count|stop}
admin-config#[ no ] diagnostic monitor location < loc > test {id | test-name} [disable]
admin-config# [ no ] diagnostic monitor interval location < loc > test {id | test-name} day
hour:minute:second.millisec
admin-config# [ no ] diagnostic monitor threshold location < loc > test {id | test-name} failure count

Conclusion
As of the 4.3.4 release time frame, all known issues related to punt fabric data path failure have been addressed. The only SMU for this issue, for releases prior to 4.3.4, can be obtained from CCO using CSCuj10837.
The platform team has put in place extensive fault handling so that the router recovers in sub-second time if and when any recoverable data path failure occurs. We recommend using this document to understand the problem even if no such fault has been observed on your system.

Other References


https://supportforums.cisco.com/docs/DOC32083#What_to_collect_if_there_is_still_an_issue

Appendix
NP LoopBack Diagnostic Path
The diagnostic application executing on the line card CPU keeps tabs on the health of each NP by periodically monitoring its working status. A packet is injected from the line card CPU destined to a local NP, which the NP should loop back to the line card CPU. Any loss of such periodic packets is flagged with a platform message. An example of such a message is below:
"LC/0/7/CPU0:Aug 18 19:17:26.924 : pfm_node[182]: %PLATFORM-PFM_DIAGS-2-LC_NP_LOOPBACK_FAILED : Set|online_diag_lc[94283]|Line card NP loopback Test(0x2000006)|link failure mask is 0x8"
This means the test failed to get the loopback packet from NP3: "link failure mask is 0x8", i.e. bit 3 is set, which maps to NP3.
The output of the CLIs below can help to get more details.
• admin show diagnostic result location 0/x/cpu0 test 9 detail
• show controllers NP counter NP(0-3) location 0/x/cpu0
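The link failure mask is a plain bit mask, one bit per NP, so it can be decoded trivially; a minimal Python sketch is below.

def failed_nps(mask):
    """Return the NP numbers whose bits are set in the failure mask."""
    return [np for np in range(8) if mask & (1 << np)]

print(failed_nps(0x8))   # [3] -> NP3, matching the example message above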

[Figure: NP loopback diagnostic path on a Trident line card (A9K-4T): LC CPU → PCIe switch → LC FIA0 → bridges B0/B1 → NP0–NP3, and back.]

Fabric Debug Commands

The commands listed in this section apply to all Trident-based line cards as well as to the Typhoon-based 100GE line card. Since the "bridge" FPGA is missing on Typhoon-based line cards (except the 100GE Typhoon-based line cards), the CLIs under "sh controller fabric fia bridge ..." do not apply to them.

The pictorial representations in this document help map each CLI to a location on the data path; this helps in isolating the drop or fault point.

Author
Mahesh S
maheshk@cisco.com
October/2013
 

Printed in USA C11-539588-01 04/10
