It's interesting, I have seen something similar, but MUCH more frequently than 1 in 10k packets... It was a misbehaving linecard in a 7609-S router (SUP720; the linecard was a 6748, I believe... the chassis also had a SIP-400 + SPA-5x1GE-v2 and an SSC-400 + SPA-IPSEC-2G). Cross-linecard port-channels, distributed forwarding, etc. all played a part. We rebooted the router and the issue never came back, though. Tons of duplicate ACKs flying across the network started causing all kinds of havoc (supposedly it was duplicating OTHER packets as well, but the ACKs and RSTs and whatnot really hosed things up).
|reply to falcon |
Just curious if you had any pointers on how you found that misbehaving
7600 / SUP card, deepblackmag.
One client I support I swear is going sniffer-happy on every slowness
issue they encounter across their network and our group's always the
one called to read the sniffer traces. DUP ACKs are very nearly always
encountered in the traces but how do you pin it down to TCP performing its
job versus flaky hardware? To be sure we do check the hardware status
along the visible hops in the network path, but I've personally found the
diag info in CatOS / IOS limited beyond telling you "OKAY / NOT OKAY", and
this is the type of client that takes the meaning of 5 Nines VERY seriously.
Unfortunately, that's a tricky question and the answer is very hardware dependent. You need either the vendor or in-house expertise to analyze equipment at a deeper level than "IOS claims to be functioning correctly".
There are an incredible number of bugs found, corrected, and still present in the hardware-driven platforms. Some are so difficult to recreate that even the vendor, with packet captures in hand, has hit-and-miss results in a lab.
That said, in the particular example in question:
1) We isolated the issue to a given POP. Having truly redundant locations can be a great thing. We cut production traffic out of the broken POP. With minimal traffic flowing through the broken environment but the issue still obvious, we moved to 2.
2) We started looking at one device pair at a time. Every piece of the infrastructure inside a POP is also redundant: an A device active and a B device standby. Some places call these odd and even, or 1 and 2, etc. We would cut traffic over on one pair at a time from the A chassis to the B chassis. Once we identified that a single chassis was causing the issues, the real fun began.
3) Isolate the interfaces that are misbehaving. We knew the traffic path causing the issue (entering via the SPA-5x1GE-v2 in a SIP-400, through the SUP720, to a pair of 6748s, out a cross-linecard port-channel to the next pair's A-side chassis).
We began with simple troubleshooting of the interfaces, looking for errors, etc. There were no issues reported by the supervisor's IOS. We went lower: shutting down interfaces in the port-channel to force traffic out exactly ONE physical interface at a time until we identified the problem point.
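For what it's worth, the "one member at a time" loop from step 3 can be sketched roughly like this. It is only an illustration, not what we actually ran: `set_only_active` and `path_is_clean` are hypothetical callbacks standing in for "shut every other member of the bundle" and "capture and check for duplicates", and the interface names are made up.

```python
from typing import Callable, Iterable, List


def find_bad_members(
    members: Iterable[str],
    set_only_active: Callable[[str], None],
    path_is_clean: Callable[[], bool],
) -> List[str]:
    """Force traffic through one bundle member at a time and note which misbehave."""
    bad = []
    for member in members:
        set_only_active(member)  # hypothetical: leave only this member up
        if not path_is_clean():  # hypothetical: e.g. sniff and look for duplicates
            bad.append(member)
    return bad


# Simulated run: pretend Gi2/1 sits on the faulty ASIC (names are invented).
state = {"active": None}
faulty = {"Gi2/1"}
members = ["Gi1/1", "Gi1/25", "Gi2/1", "Gi2/25"]

result = find_bad_members(
    members,
    set_only_active=lambda m: state.update(active=m),
    path_is_clean=lambda: state["active"] not in faulty,
)
print(result)
```

In real life each iteration is a change on a production box, so you would re-enable all members afterwards and only do this once traffic has been drained as in steps 1 and 2.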
A word about our port-channels: they are always split across linecards, and across multiple ASIC groups on a given linecard. That is all done to maximize bandwidth on oversubscribed linecards and to ensure that ASIC or linecard failures have a minimal impact on traffic.
Once we found that the issue was limited to a particular linecard / ASIC / interface group, we used some internal trickery. You can execute show commands on the linecards themselves (on their processors, on their running IOS or NXOS image). These commands are very hardware specific. We found nothing wrong, but the card was definitely duplicating packets.
At this point you need vendor involvement. Generally we are in the position of being able to beat up our SE's boss's boss into just replacing hardware without question. It's not worth the time to dig into WHY something's borked up; we just want a new one. If that doesn't fix it, we can go deeper.
We have sent off multiple cards to Cisco (et al.) for failure analysis: 6700s, Nexus linecards, etc. There was usually an issue with either the DFC or the ASIC. In the Nexus linecard case we had a borked Metropolis board (the board has 2 ASICs, and each ASIC controls 4 10GbE ports).
Sometimes simply rebooting a piece of equipment will temporarily relieve an issue, and it might never come back. Other times it makes repeat appearances (a SUP720 bug on 7600s with WCCP ACL-based redirection is still popping up).
In the end, if you have a redundant architecture, you can get the 5 9's but there are gotchas.
Failure detection needs to be "out of band", meaning you need something sitting out on the internet, like a client attempting to connect in (like an ACE and GSS setup). If there is a problem with the path, you can use the ACE and GSS together to track it and signal with other protocols like DNS. In all the cases I mentioned, the in-band fault detection in IOS failed. Routing did not flap or shift, interface status stayed up even though traffic was blackholed, and processors spiked so the control plane was unusable but the data plane was still passing packets.
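To make the out-of-band idea concrete, here is a minimal sketch of an external probe plus a "fail after N misses in a row" tracker. This is not how ACE/GSS work internally; the names `tcp_probe` and `FailureTracker` and the threshold of 3 are my own assumptions for illustration. A bare TCP connect also only proves the handshake works, so a real check would go on to fetch something at layer 7.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


class FailureTracker:
    """Declare the path down after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True if the path should be treated as down."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.threshold
```

The probe host would sit outside the POP and call something like `tracker.record(tcp_probe("www.example.com", 443))` on a timer, and whatever steers clients (DNS, in our case) would act on the result.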
IOS/NXOS/IOS-XE&XR lie. They are software and can only tell you something in English based on what they are told to interpret. They aren't smart enough to know a user can't get to the ABCompany.com website and reroute accordingly. Nor should they be. Layer 7 reachability is the job of higher layers. Reliance on the network itself won't get you 5x9's.
Back to your original question though: we saw hundreds and thousands of dupe packets, not just ACKs. We couldn't keep a telnet session open when traffic was going through that POS router, but BGP and OSPF to its SUP were functioning just fine; it was only transit traffic screwing up.
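One rough way to tell "TCP doing its job" from "hardware copying packets" in a trace: a duplicate ACK sent by the receiver's stack is a new packet and normally carries a new IP ID, while a copy made by a switch or router is identical on the wire, IP ID included. The sketch below assumes the sender increments the IP ID per packet (many stacks do, not all) and that you have already parsed the capture into records; the `Pkt` type and `count_duplicates` are hypothetical names, not part of any sniffer's API. It also ignores window updates and keepalives, which a real analyzer would filter out.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Pkt(NamedTuple):
    src: str
    dst: str
    sport: int
    dport: int
    ip_id: int
    seq: int
    ack: int
    payload_len: int


def count_duplicates(packets: Iterable[Pkt]) -> Counter:
    """Count wire-identical copies separately from TCP-level duplicate ACKs."""
    seen_wire = set()  # identity including the IP ID
    seen_acks = set()  # TCP-level identity of pure ACKs only
    counts = Counter()
    for p in packets:
        flow = (p.src, p.dst, p.sport, p.dport)
        wire_key = flow + (p.ip_id, p.seq, p.ack, p.payload_len)
        ack_key = flow + (p.seq, p.ack)
        if wire_key in seen_wire:
            counts["network_copy"] += 1  # same IP ID: copied in transit
        elif p.payload_len == 0 and ack_key in seen_acks:
            counts["tcp_dup_ack"] += 1   # new IP ID: the host re-sent the ACK
        seen_wire.add(wire_key)
        if p.payload_len == 0:
            seen_acks.add(ack_key)
    return counts


first_ack = Pkt("10.0.0.1", "10.0.0.2", 80, 5000, 100, 1, 1000, 0)
dup_ack = first_ack._replace(ip_id=101)      # receiver repeats the ACK
trace = [first_ack, dup_ack, dup_ack]        # last one is a wire-level copy
print(count_duplicates(trace))
```

A handful of `tcp_dup_ack` hits is normal loss recovery. A steady stream of `network_copy` hits across unrelated flows is what pointed at hardware in our case.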
Sniffers are powerful tools, but too often I think people misread their information or correlate it poorly.
"The network is slow" isn't a valid complaint. "Some application is performing too slowly" is much more accurate. The biggest performance problems I run into usually have something to do with the host, server, storage, or application (a transactional app that worked fine on a LAN with 1 ms latency doesn't work over a WAN with 40 ms latency because it chats back and forth 10,000 times between client and server before displaying the splash screen).
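The arithmetic behind that last example is worth spelling out. Assuming the 10,000 exchanges are strictly serialized (each waits for the previous reply) and ignoring server think time and bandwidth, the wait is just round trips times latency. The function name is mine, and I am treating the 1 ms and 40 ms figures as round-trip times.

```python
def serialized_wait_seconds(round_trips: int, rtt_ms: float) -> float:
    """Time spent purely waiting on the wire for back-to-back round trips."""
    return round_trips * rtt_ms / 1000.0


lan = serialized_wait_seconds(10_000, 1)   # LAN, 1 ms per round trip
wan = serialized_wait_seconds(10_000, 40)  # WAN, 40 ms per round trip
print(f"LAN: {lan:.0f} s, WAN: {wan:.0f} s")
```

That is 10 seconds on the LAN versus 400 seconds (almost 7 minutes) over the WAN, with zero packet loss and a perfectly healthy network.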