ASR9K Punt Fabric Data Path Failure: Mahesh S
Introduction
The purpose of this white paper is to explain the punt fabric data path failure message that may be seen during ASR9K router operation. When it appears, the message has the following format:
“RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: %PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 1) (0/7/CPU0, 2) (0/7/CPU0, 3) (0/7/CPU0, 4) (0/7/CPU0, 5) (0/7/CPU0, 6) (0/7/CPU0, 7)”
This document should help customers, TME, TAC, the DE/DT community, and anyone else interested in understanding the error message and whether any action needs to be taken.
Abstract
A high-level understanding of ASR9K line cards, fabric cards, route processor cards, and chassis
configurations is helpful. However, the document does not require readers to be familiar with
hardware details. Necessary background information is provided before this paper attempts to
explain the error message.
• When there is no urgency to root cause a punt fabric data path failure, it helps to read
all sections of the document. From the start of the document until the section
“Analyzing Faults”, this paper builds the background needed to isolate the faulty
component when such an error occurs.
• For further reading on both NP and punt fabric faults, use Cisco provided
documentation (if any).
• If there is a specific question for which a quick answer is needed, please use the
“FAQ” section. If the question does not appear in that section, please check whether the
rest of this white paper answers it.
• If a fault has occurred on a router and debugging is in progress to narrow down the
faulty component or to check whether it is a known issue, all sections starting from
“Analyzing Faults” may help.
Background
A packet crossing the switch fabric traverses either two or three hops, depending on the line
card type. Typhoon generation line cards add an extra switch fabric element, while Trident based
line cards switch all traffic using the fabric on the route processor card only. The picture below
shows the fabric elements for both line card types.
All contents are Copyright © 2006–2013 Cisco Systems, Inc. All rights reserved. This document is Cisco Confidential Information. 1
White Paper
[Figure: Fabric elements for Trident line cards (A9K-4T with NP0 to NP3, bridges B0 and B1, and FIA0; A9K-8T with NP0 to NP7, bridges B0 to B3, and FIA0 and FIA1) and a Typhoon line card (A9K-24x10G with NP0 to NP7, FIA0 to FIA3, and an on-card switch fabric ASIC), each connected to the switch fabric on RSP0 and RSP1.]
Packet Path between Active Route Processor Card and Line card:
Diagnostic packets injected from the active route processor card into the fabric towards an NP
are treated as unicast packets by the switch fabric. For unicast packets, the switch fabric chooses
the outgoing link based on the current traffic load of each link, which subjects diagnostic packets
to the traffic load on the box. (When there are multiple outgoing links towards an NP, the switch
fabric ASIC chooses the link that is currently least loaded.)
The diagram below depicts the path of diagnostic packets sourced from the active route processor
card. Note that the first link connecting the FIA on the line card to the XBAR on the route
processor card is always chosen for packets destined to an NP. Response packets from the NP are
subject to the link load distribution algorithm (if the line card is Typhoon based). This means
that a response packet from an NP towards the active route processor card can use any of the
back plane links connecting the line card to the route processor card, depending on the back
plane link load.
[Figure: Diagnostic packet path between the active RSP and line card NPs. Elements shown: active RSP CPU, FPGA, and active RSP FIA on RSP0; switch fabric on RSP0 and RSP1; on the line card, the fabric ASIC (XBAR), a single FIA or multiple FIAs, an FPGA (absent on all Typhoon line cards except 2x100GE and 1x100GE; present on all Trident based line cards), NP0 to NP7, and the LC CPU. Legend: RSP to NPU packet injected by the diagnostic application; NPU to RSP response packet looped back inside the NPU.]
Packet Path between Standby Route Processor Card and Line card:
[Figure: Diagnostic packet path between the standby RSP and line card NPs. Elements shown: standby RSP CPU, FPGA, and standby RSP FIA on RSP0; switch fabric on RSP0 and RSP1; on the line card, the fabric ASIC (XBAR), a single FIA or multiple FIAs, an FPGA (absent on all Typhoon line cards except 2x100GE and 1x100GE; present on all Trident based line cards), NP0 to NP7, and the LC CPU. Legend: RSP to NPU packet injected by the diagnostic application; NPU to RSP response packet looped back inside the NPU.]
Diagnostic packets injected from the standby route processor card into the fabric towards an NP
are treated as multicast packets by the switch fabric. Although they are multicast packets, there
is no replication inside the fabric. Every diagnostic packet sourced from the standby route
processor card still reaches only one NP at a time. The response packet from an NP towards the
route processor card is also a multicast packet over the fabric with no replication. Hence the
diagnostic application on the standby route processor card receives a single response packet
from each NP, one packet at a time. The diagnostic application thus keeps tabs on every NP in
the system by injecting one packet per NP and expecting a response from every NP, one packet
at a time. For multicast packets, the switch fabric chooses the outgoing link based on a field
value in the packet header, which makes it possible to inject diagnostic packets over every
back plane fabric link between the route processor card and the line card. The standby route
processor card therefore keeps tabs on NP health over every fabric link connecting the route
processor card and the line card slot.
The diagram above depicts the path of diagnostic packets sourced from the standby route
processor card. Note that, unlike the active route processor card case, all links connecting the
line card to the XBAR on the route processor card are exercised. Response packets from an NP
take the same back plane link that was used in the route processor card to line card direction.
This testing ensures that all links connecting the standby route processor card to the line card
are monitored all the time.
[Figures: Fabric links between the RSPs (RSP CPU, FPGA, RSP FIA, and switch fabric on RSP0 and RSP1) and a Trident line card (A9K-4T: CPU, NP0 to NP3, B0, B1, FIA0) and a Typhoon line card (A9K-24x10G: CPU, NP0 to NP7, FIA0 to FIA3, switch fabric ASIC).]
In the next few sections, the document depicts the packet path to every NP. This is necessary
to understand the punt fabric data path error message and to locate the failure point.
The message shown in the Introduction should be read as a failure to reach NP 1, 2, 3, 4, 5, 6,
and 7 on line card 0/7/cpu0 from route processor card 0/rsp0/cpu0.
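The reading of the message described above can be sketched in a few lines of Python. This is only an illustration, not a Cisco tool: the function name `parse_punt_fabric_failure` is hypothetical, and the sketch assumes the message layout matches the sample from the Introduction (reporting node before the first colon, `Set` or `Clear` before the first `|`, and a list of `(slot, NP)` pairs after `failed:`).

```python
import re

# Sample message copied from the Introduction of this paper.
MESSAGE = (
    "RP/0/RSP0/CPU0:Sep 3 13:49:36.595 UTC: pfm_node_rp[358]: "
    "%PLATFORM-DIAGS-3-PUNT_FABRIC_DATA_PATH_FAILED: "
    "Set|online_diag_rsp[241782]|System Punt/Fabric/data Path Test(0x2000004)|"
    "failure threshold is 3, (slot, NP) failed: (0/7/CPU0, 1) (0/7/CPU0, 2) "
    "(0/7/CPU0, 3) (0/7/CPU0, 4) (0/7/CPU0, 5) (0/7/CPU0, 6) (0/7/CPU0, 7)"
)


def parse_punt_fabric_failure(message):
    """Return (reporting_node, state, [(slot, np), ...]), or None if no match.

    Hypothetical helper; assumes the layout of the sample message above.
    """
    if "PUNT_FABRIC_DATA_PATH_FAILED" not in message:
        return None
    # The reporting node is the prefix before the first colon.
    reporter = message.split(":", 1)[0]
    # "Set" or "Clear" directly precedes the first "|" separator.
    state_match = re.search(r"(Set|Clear)\|", message)
    state = state_match.group(1) if state_match else None
    # Everything after "failed:" is a list of "(slot, NP)" pairs.
    _, _, tail = message.partition("failed:")
    pairs = [(slot, int(np)) for slot, np in re.findall(r"\((\S+),\s*(\d+)\)", tail)]
    return reporter, state, pairs


if __name__ == "__main__":
    reporter, state, pairs = parse_punt_fabric_failure(MESSAGE)
    for slot, np in pairs:
        print(f"{state}: {reporter} cannot reach NP{np} on {slot}")
```

Grouping the parsed pairs by slot gives a quick view of whether a single NP, a whole line card, or several line cards are affected.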
From the list of online diagnostic tests, we can see attributes of punt fabric loopback test using
RP 0/RSP1/CPU0:
Diagnostics test suite attributes:
M/C/* - Minimal bootup level test / Complete bootup level test / NA
Here we see that the PuntFabricDataPath test sends a packet every minute and that the failure
threshold is 3, which means the loss of up to two consecutive packets is tolerated. Three
consecutive packet losses result in an alarm. The test attributes shown are default values. These
attributes can be changed with admin config commands if desired.
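As a rough illustration of how a failure threshold of 3 behaves, the sketch below counts consecutive missed responses and reports the test run on which the alarm would be raised. It is a simplified model, not the diagnostic application's code; the function name `alarm_run` is hypothetical, and the sketch assumes that any successful response resets the miss counter.

```python
def alarm_run(results, failure_threshold=3):
    """Return the 1-based test run that raises the alarm, or None.

    results: iterable of booleans, one per test run (True = response received).
    Assumption: a received response resets the consecutive-miss counter.
    """
    consecutive_misses = 0
    for run, response_received in enumerate(results, start=1):
        if response_received:
            consecutive_misses = 0
        else:
            consecutive_misses += 1
            if consecutive_misses >= failure_threshold:
                return run
    return None


if __name__ == "__main__":
    # Two misses, a success, then three misses in a row: alarm on run 7.
    print(alarm_run([True, False, False, True, False, False, False]))
    # Isolated pairs of misses never reach the threshold.
    print(alarm_run([False, False, True] * 3))
```

With the default one packet per minute, the run number is also the elapsed time in minutes in this model.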
[Figure: Trident line card (A9K-4T) packet path diagram: LC CPU, NP0 to NP3, bridges B0 and B1, FIA0, switch fabric on RSP0 and RSP1, RSP FIA, FPGA, and RSP CPU.]
The diagram depicts the packet path between the route processor card CPU and line card NP0.
Here we see that the link connecting Bridge 0 (B0) and NP0 is the only link specific to NP0. All
other links fall in the common path.
Also note the packet path from the route processor card towards NP0. Although four links exist
for packets destined to NP0 from the route processor card, the first link between the route
processor card and the line card slot is used in the route processor card to line card direction.
The returned packet from NP0, however, can be sent back to the active route processor card over
either of the two fabric link paths between the line card slot and the active route processor card.
The choice of which one of the two links is used depends on the link load at that time.
Response packets from NP0 towards the standby route processor card use both links, but one link
at a time. The choice of link is based on a header field that the diagnostic application populates.
Multiple NP faults
If faults other than PUNT_FABRIC_DATA_PATH_FAILED are observed on NP0, or if
PUNT_FABRIC_DATA_PATH_FAILED is also reported by other NPs on the same line card, then
fault isolation can be done by correlating all the faults. For example, if both the
PUNT_FABRIC_DATA_PATH_FAILED fault and the LC_NP_LOOPBACK_FAILED fault [please refer to
the Appendix to understand the loopback fault] occur on NP0, then the NP has stopped processing
packets. This could be an early indication of a critical failure inside NP0. However, if only one
of the two faults occurred, then the fault is localized to either the punt fabric data path or the
line card CPU to NP path.
If more than one NP on a line card has a punt fabric data path fault, then we need to walk up the
tree of fabric links to narrow down the faulty component. For example, if both NP0 and NP1 have
the fault, then the fault is most likely in B0 or in the link connecting B0 and FIA0. It is less
likely that both NP0 and NP1 run into a critical internal error at the same time. Although less
likely, nothing precludes both NP0 and NP1 from running into a critical error due to incorrect
processing of a particular kind of packet or a bad packet.
Check all common links and components on the data path between the NP or NPs and both route
processor cards.
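The "walk up the tree" reasoning above can be sketched as a small set computation. The example below is a sketch under stated assumptions: the `PATHS` table models a Trident A9K-4T card as described in this paper (NP0 and NP1 behind B0, NP2 and NP3 behind B1, both bridges behind FIA0), the element labels and the function name `suspects` are hypothetical, and the result lists only elements that are common to every failed NP and not used by any healthy NP.

```python
# Path elements from each NP up towards the RSP fabric on an A9K-4T card.
# Labels are illustrative, not names used by the router software.
PATHS = {
    0: ["NP0", "link NP0-B0", "B0", "link B0-FIA0", "FIA0", "links FIA0-RSP fabric"],
    1: ["NP1", "link NP1-B0", "B0", "link B0-FIA0", "FIA0", "links FIA0-RSP fabric"],
    2: ["NP2", "link NP2-B1", "B1", "link B1-FIA0", "FIA0", "links FIA0-RSP fabric"],
    3: ["NP3", "link NP3-B1", "B1", "link B1-FIA0", "FIA0", "links FIA0-RSP fabric"],
}


def suspects(failed_nps, paths=PATHS):
    """Return path elements shared by all failed NPs and by no healthy NP."""
    failed = set(failed_nps)
    if not failed:
        return []
    common = set.intersection(*(set(paths[np]) for np in failed))
    for np in set(paths) - failed:
        # An element that still carries traffic for a healthy NP is unlikely
        # to be the single point of failure.
        common -= set(paths[np])
    # Report in path order, from the NP side towards the fabric.
    return [element for element in paths[min(failed)] if element in common]


if __name__ == "__main__":
    print(suspects([0]))           # elements specific to NP0
    print(suspects([0, 1]))        # B0 and the link between B0 and FIA0
    print(suspects([0, 1, 2, 3]))  # FIA0 and everything beyond it
```

An empty result (for example for NP0 and NP2 failing together) means no single shared element explains the failures, which points back to the less likely case of independent NP faults discussed above.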
[Figure: Trident line card (A9K-4T) packet path diagram: LC CPU, NP0 to NP3, bridges B0 and B1, FIA0, switch fabric on RSP0 and RSP1, RSP FIA, FPGA, and RSP CPU.]
Refer to section A.1.1 but apply the same reasoning for NP1 (instead of NP0)
The choice of which one of the two links is used depends on the link load at that time. Response
packets from NP2 towards the standby route processor card use both links, but one link at a time.
The choice of link is based on a header field that the diagnostic application populates.
[Figure: Trident line card (A9K-4T) packet path diagram: LC CPU, NP0 to NP3, bridges B0 and B1, FIA0, switch fabric on RSP0 and RSP1, RSP FIA, FPGA, and RSP CPU.]
Refer to section A.1.1 but apply the same reasoning for NP2 (instead of NP0)
[Figure: Trident line card (A9K-4T) packet path diagram: LC CPU, NP0 to NP3, bridges B0 and B1, FIA0, switch fabric on RSP0 and RSP1, RSP FIA, FPGA, and RSP CPU.]
Refer to section A.1.1 but apply the same reasoning for NP3 (instead of NP0)
[Figures: Typhoon line card (A9K-24x10G) packet path diagrams: LC CPU, NP0 to NP7, FIA0 to FIA3, switch fabric ASIC, switch fabric on RSP0 and RSP1, RSP FIA, FPGA, and RSP CPU.]
Analyzing Faults
This section categorizes faults into hard and transient cases and lists the steps to identify
whether a fault is hard or transient. Once the fault type is identified, the document also
specifies the CLIs that can be executed on the router to understand the fault and the actions
that can be applied.
Transient Fault: If a Set PFM message is followed by a Clear PFM message, then a fault occurred
and the router corrected the fault itself. Transient faults can occur due to environmental
conditions or recoverable faults in hardware components, and sometimes it can be hard to
associate a transient fault with any particular event. The suggested approach for transient
errors is only to monitor for further occurrences. If a transient fault occurs more than once,
then treat it as a hard fault and use the recommendations and steps for analyzing such faults
described in the next section.
If faults towards an NP occur and clear at a rate of at least one per minute, and at least ten
faults are observed towards a given NP within a span of ten minutes, then treat the flurry of
failures as a hard fault even if every fault in the flurry is eventually cleared.
Hard Fault: If a Set PFM message is not followed by a Clear PFM message, then a fault occurred
and either the fault handling code in the router could not correct it or the hardware fault is
not recoverable. Hard faults can occur due to environmental conditions or unrecoverable faults
in hardware components. The suggested approach for hard errors is to use the guidelines in the
section “Steps to Analyze Hard Faults”.
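The transient and hard fault definitions above can be expressed as a short classification sketch. This is an illustration only: the function name `classify` and the event representation (a time-ordered list of timestamp and `"Set"`/`"Clear"` pairs for one NP) are assumptions made for the example, not an interface of the router.

```python
from datetime import datetime, timedelta


def classify(events, window=timedelta(minutes=10), flurry_count=10):
    """Classify the PFM history of one NP as "none", "transient", or "hard".

    events: time-ordered list of (datetime, "Set" | "Clear") tuples.
    """
    set_times = [when for when, state in events if state == "Set"]
    if not set_times:
        return "none"
    # A Set that was never followed by a Clear is a hard fault.
    if events[-1][1] == "Set":
        return "hard"
    # A flurry of at least flurry_count Sets inside one window is treated
    # as a hard fault even though every Set was eventually cleared.
    for index, start in enumerate(set_times):
        in_window = [when for when in set_times[index:] if when - start <= window]
        if len(in_window) >= flurry_count:
            return "hard"
    return "transient"


if __name__ == "__main__":
    base = datetime(2013, 9, 3, 13, 0, 0)
    one_blip = [(base, "Set"), (base + timedelta(seconds=70), "Clear")]
    print(classify(one_blip))  # transient
```

Calling `classify` with `flurry_count=2` and a long window approximates the stricter guidance that a transient fault seen more than once should be treated as a hard fault.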
An example of a hard fabric fault is listed below for clarity. For the example message below,
there will not be a corresponding Clear PFM message.
RP/0/RSP0/CPU0:Feb 5 05:05:44.051 : pfm_node_rp[354]:%PLATFORM-DIAGS-3-
PUNT_FABRIC_DATA_PATH_FAILED : Set|online_diag_rsp[237686]|System Punt/Fabric/data Path
Test(0x2000004)|failure threshold is 3, (slot, NP) failed: (0/2/CPU0, 0)
Recommendation: In a hard fault scenario, collect the output of all the CLIs mentioned in the
section “Data To Collect Before SR Creation” and raise an SR. In urgent cases, after collecting
all debug output, initiate a route processor card or line card reload (depending on fault
isolation). If the error persists after the reload, initiate an RMA.
CLIs to Use
1: show logging | inc "PUNT_FABRIC_DATA_PATH"
2: show pfm location all ; Check whether the status type is SET or CLEAR
CLIs to Use
1: show logging | inc "PUNT_FABRIC_DATA_PATH"
The output might name one or more NPs (example: NP2, NP3).
2: show controller fabric fia link-status location <lc>
Since both NP2 and NP3 (in section B.2) receive and send through a single FIA, it's
reasonable to infer that the fault is in the associated FIA on the path.
3: show controller fabric crossbar link-status instance <0 and 1> location <LC or RSP>
If no NP on a line card is reachable by the diagnostic application, then it's reasonable to
infer that the links connecting the line card slot to the route processor card have a fault, or that
one of the ASICs that forward traffic between the route processor card and the line card has a fault.
show controller fabric crossbar link-status instance 0 location <lc>
show controller fabric crossbar link-status instance 0 location 0/rsp0/cpu0
show controller fabric crossbar link-status instance 1 location 0/rsp0/cpu0
show controller fabric crossbar link-status instance 0 location 0/rsp1/cpu0
show controller fabric crossbar link-status instance 1 location 0/rsp1/cpu0
4: show controller fabric fia link-status location 0/rsp*/cpu0
show controller fabric fia bridge sync-status location 0/rsp*/cpu0
If all NPs on all line cards report a fault, then the fault is most likely on a route processor
card (the active or the standby route processor card). Please refer to the links connecting the
route processor card CPU to the FPGA and to the route processor card FIA in the Background section.
show controller fabric fia link-status location 0/rsp0/cpu0
show controller fabric fia link-status location 0/rsp1/cpu0
show controller fabric fia bridge sync-status location 0/rsp0/cpu0
show controller fabric fia bridge sync-status location 0/rsp1/cpu0
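The mapping from failure scope to CLIs described in the steps above can be sketched as follows. This is a minimal illustration rather than a supported tool: the function name `suggest_commands` and its inputs are hypothetical, the command strings are the ones listed in this section, and the NP count per line card must be supplied by the user.

```python
RSPS = ("0/rsp0/cpu0", "0/rsp1/cpu0")


def suggest_commands(failed, nps_per_lc):
    """Return the CLIs from this section that match the scope of the failure.

    failed: set of (line_card, np) pairs named in the failure messages.
    nps_per_lc: dict mapping every line card location to its NP count.
    """
    failed_by_lc = {}
    for lc, np in failed:
        failed_by_lc.setdefault(lc, set()).add(np)
    # Line cards on which every NP is unreachable.
    whole_cards = [lc for lc in sorted(failed_by_lc)
                   if len(failed_by_lc[lc]) == nps_per_lc[lc]]

    cmds = ['show logging | inc "PUNT_FABRIC_DATA_PATH"']
    # Some NPs failed: check the FIA on each affected line card.
    for lc in sorted(failed_by_lc):
        cmds.append(f"show controller fabric fia link-status location {lc}")
    # A whole line card failed: check crossbar links on the card and the RSPs.
    if whole_cards:
        for lc in whole_cards:
            cmds.append(f"show controller fabric crossbar link-status instance 0 location {lc}")
        for rsp in RSPS:
            for instance in (0, 1):
                cmds.append(f"show controller fabric crossbar link-status instance {instance} location {rsp}")
    # Every NP on every line card failed: suspect the route processor cards.
    if whole_cards and len(whole_cards) == len(nps_per_lc):
        for rsp in RSPS:
            cmds.append(f"show controller fabric fia link-status location {rsp}")
            cmds.append(f"show controller fabric fia bridge sync-status location {rsp}")
    return cmds


if __name__ == "__main__":
    chassis = {"0/0/CPU0": 4, "0/7/CPU0": 8}
    for cmd in suggest_commands({("0/7/CPU0", 2), ("0/7/CPU0", 3)}, chassis):
        print(cmd)
```

The checks are additive on purpose: a wider failure scope adds commands without dropping the narrower ones, so the collected output still covers the FIA on each affected line card.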
Past Failures
In the past, 99% of the faults we have seen were recoverable, and in most cases a software
initiated recovery action fixed the fault. However, in very rare cases, unrecoverable errors are
seen that can only be fixed with an RMA of the cards.
In this section we draw on errors seen in the past so that they can serve as guidance if similar
errors are observed.
Here (1,1,1) and (2,1,0) depict end points. The parameters within () should be read as (SlotId,
Asic-Type, Asic-Instance). Hence (1,1,1) implies slot 1 (hence RSP1 in a 9006 chassis),
asic-type 1 (identifies the generation of the fabric xbar ASIC), and xbar instance 1.
(2,1,0) implies physical slot 2 (hence line card 0/0/cpu0 in a 9006 chassis), xbar asic-type 1,
and xbar instance 0.
(1,1,1),(2,1,0),0: the trailing number (in this case 0) identifies the link number (out of the 4
fabric links that connect to each route processor card). This will be between 0 and 3.
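The end point notation explained above can be decoded mechanically. The sketch below is illustrative only: the names `Endpoint` and `parse_link` are hypothetical, and the mapping from physical slot numbers to card locations is chassis specific and deliberately left out.

```python
import re
from typing import NamedTuple


class Endpoint(NamedTuple):
    slot_id: int        # physical slot number
    asic_type: int      # generation of the fabric xbar ASIC
    asic_instance: int  # xbar instance on that card


LINK_PATTERN = re.compile(
    r"\((\d+),\s*(\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+),\s*(\d+)\),\s*(\d+)"
)


def parse_link(text):
    """Return (endpoint_a, endpoint_b, link_number) found in text, or None."""
    match = LINK_PATTERN.search(text)
    if match is None:
        return None
    numbers = [int(group) for group in match.groups()]
    return Endpoint(*numbers[0:3]), Endpoint(*numbers[3:6]), numbers[6]


if __name__ == "__main__":
    a, b, link = parse_link("(1,1,1),(2,1,0),0")
    print(f"slot {a.slot_id} xbar {a.asic_instance} <-> "
          f"slot {b.slot_id} xbar {b.asic_instance}, link {link}")
```

The same function can be applied to a full ltrace line, since it searches for the pattern anywhere in the text.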
FAQ
1. Does the primary or standby route processor card send keepalives or online
diagnostic packets to every NP in the system?
A: Yes. Both route processor cards send online diagnostic packets to every NP.
A: The diagnostic path is the same whether the packet is sourced from route processor card 0
or route processor card 1. The path depends on the state of the route processor card.
Section “Punt Fabric Diagnostic
Pack
3. How often does the route processor card send keepalives, and how many keepalives need
to be missed to trigger the alarm?
A: By default, a packet is sent towards every NP every minute. Three consecutive misses
are required to trigger the fault.
A: Typically NP lock up is cleared by fast reset. The reason for fast reset will clearly
state NP lock up.
A: We see both a punt fabric data path failure for that NP and an NP loopback test
failure. (An NP loopback test failure message similar to the one shown in the Appendix and
a punt fabric data path failure message similar to the one shown in the Introduction of
this paper will appear.)
7. Please explain the meaning and numbering of (9,1,0),(5,1,0),1 in the ltrace message “Sep 3
13:47:07.027 crossbar 0/7/CPU0 t1 detail xbar_fmlc_handle_link_retrain: rcvd
link_retrain for (9,1,0),(5,1,0),1.”
A: Here (9,1,0) and (5,1,0) depict end points. The parameters within () should be read
as (SlotId, Asic-Type, Asic-Instance). Hence (9,1,0) implies slot 9 (hence line card
0/7/CPU0), asic-type 1 (identifies the generation of the fabric xbar ASIC), and xbar instance 0.
(5,1,0) implies physical slot 5 (hence route processor card 0/rsp1/cpu0 in a 9010 chassis),
xbar asic-type 1, and xbar instance 0.
(9,1,0),(5,1,0),1: the trailing number (in this case 1) identifies the link number (out of the 4
fabric links that connect a Typhoon line card to each route processor card). This will be
between 0 and 3.
8. Just because a diag message is sourced from one route processor card does not mean that
it comes back to the same one. Yes or No?
A: It does come back to the same one. Since diagnostic packets are sourced from both route
processor cards and tracked on a per route processor card basis, a diagnostic packet sourced
from a route processor card is looped back to the same route processor card by the NP.
9. The CSCuj10837 SMU and critical announcement are only for the link retrain event. How
did we determine to put out an announcement for this and not for the other events? Is this
just the most common event, and if so, how was it determined that this is the cause of
99% of these messages?
A: After considerable effort spent on root cause analysis and verification of the fix, the
hardware team is confident that the link retraining issues will be fixed.
It is the software team's analysis that all or most of the diagnostic failures seen before are
tied to link retraining. Hence the BU has acknowledged the issue tracked by CSCuj10837 as
the single most prevalent cause of the issues seen until now between RSP440 and
Typhoon based line cards. This SMU and root cause analysis do not apply to any
failures seen on Trident based line cards.
10. How long does it take to retrain the serdes once the decision to do so is made?
A: As soon as a link fault is detected, the decision to retrain is made by the fabric ASIC driver.
12. So it's only between the FIA on the active route processor card and the fabric that we use
the first link, and after that it's the least loaded link when there are multiple links available?
A: Correct. The first link connecting to the first xbar instance on the active route processor
card is used to inject traffic into the fabric. The response packet from the NP can reach the
active route processor card on any of the links connecting back to that route processor card.
The choice of link depends on link load.
13. During the retrain, are all packets sent over that fabric link lost?
A: Yes. In the worst case this can last up to 7 seconds. However, on average the link is
retrained within 3 seconds if the fault is transient. If the link has a fatal unrecoverable
fault for any reason, the link is admin shut down and traffic is re-routed around the faulty
link. Hence, within 5 seconds on average (3 seconds to detect and 2 seconds for retraining
to fail), the traffic drop within the switch fabric stops and the faulty link is isolated and
no longer used for forwarding traffic.
14. How frequently do we expect to see a crossbar serdes retrain event after customers have
the fix for CSCuj10837?
A: Once a customer has the fix for CSCuj10837, a serdes retrain on a back plane link between
RSP440 and a Typhoon based line card should never occur. If it does, an SR case has to
be raised and proper debugging should be done.
• show logging
• show pfm location all
• admin sh diagn result loc 0/rsp0/cpu0 test 8 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 8 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 9 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 9 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 10 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 10 detail
• admin sh diagn result loc 0/rsp0/cpu0 test 11 detail
• admin sh diagn result loc 0/rsp1/cpu0 test 11 detail
• show controller fabric fia link-status location <lc>
• show controller fabric fia link-status location <both rsp>
• show controller fabric fia bridge sync-status location <both rsp>
• show controller fabric crossbar link-status instance 0 location <lc>
• show controller fabric crossbar link-status instance 0 location <both rsp>
• show controller fabric crossbar link-status instance 1 location <both rsp>
• show controller fabric ltrace crossbar <both rsp>
• show controller fabric ltrace crossbar <affected lc>
• show tech fabric location <fault showing lc> file <path to file>
• show tech fabric location <both rsp> file <path to file>
Conclusion
As of the 4.3.4 release time frame, all issues related to punt fabric data path failure are
addressed. For releases prior to 4.3.4, the only SMU for this issue can be obtained from CCO
using CSCuj10837.
The platform team has put in state of the art fault handling so that the router recovers in under
a second if and when any recoverable data path failure occurs. We recommend using this
document to understand the problem even if no such fault has been observed on your system.
Other References
https://supportforums.cisco.com/docs/DOC32083#What_to_collect_if_there_is_still_an_issue
Appendix
NP LoopBack Diagnostic Path
The diagnostic application executing on the line card CPU keeps tabs on the health of each NP
by periodically monitoring its working status. A packet is injected from the line card CPU
destined to the local NP, which the NP should loop back and return to the line card CPU. Any
loss of such periodic packets is flagged using a platform message. An example of such a message
is below:
“LC/0/7/CPU0:Aug 18 19:17:26.924 : pfm_node[182]: %PLATFORM-PFM_DIAGS-2-
LC_NP_LOOPBACK_FAILED : Set|online_diag_lc[94283]|Line card NP loopback Test(0x2000006)|link failure
mask is 0x8”
This means the test failed to get the loopback packet from NP3: "link failure mask is 0x8" means
bit 3 is set, which identifies NP3.
The output of the CLIs below can help to get more details.
• admin show diagnostic result location 0/x/cpu0 test 9 detail
• show controllers NP counter NP(0-3) location 0/x/cpu0
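The link failure mask described above can be decoded with a few lines of Python. This is a sketch for illustration: the function names `failed_nps` and `failed_nps_from_message` are hypothetical, and the sketch assumes, as in the example, that bit N of the mask corresponds to NP N.

```python
import re


def failed_nps(link_failure_mask):
    """Return the NP numbers whose bits are set in the mask (bit N = NP N)."""
    return [bit for bit in range(link_failure_mask.bit_length())
            if (link_failure_mask >> bit) & 1]


def failed_nps_from_message(message):
    """Extract the mask from an LC_NP_LOOPBACK_FAILED message and decode it."""
    match = re.search(r"link failure\s+mask is (0x[0-9a-fA-F]+)", message)
    if match is None:
        return None
    return failed_nps(int(match.group(1), 16))


if __name__ == "__main__":
    print(failed_nps(0x8))  # bit 3 is set, so NP3
    print(failed_nps(0x5))  # bits 0 and 2 are set, so NP0 and NP2
```

A mask with several bits set indicates that more than one NP on the line card failed the loopback test.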
[Figure: NP loopback diagnostic path on a Trident line card (A9K-4T): NP0 and NP1 attached to B0, NP2 and NP3 attached to B1.]
This pictorial representation helps to map each CLI to a location on the data path. This should
help in isolating the drop or fault point.
Author
Mahesh S
maheshk@cisco.com
October/2013