 | How to take down a network
Backstory: I was a sysadmin for 2 years at my current job, and I had 8 years of IT experience before coming here. I was promoted to network admin; the former network admin is now the IT Director. Two network support techs are on the networking team to assist the network admin.
Today was a f#$%@#^ nightmare! It started first thing in the morning, around 9. Our phones lost connection and then power (the phones are PoE), which was followed by a few people asking, "Is the network down, or is the internet down?" So I grabbed my tech bag and ran to the MDF. Once I got there, I did a quick visual check to see if the LEDs on the network gear were flashing, which they were, so it didn't look like anything was locked up or frozen. I then started to console into the gear one by one to check logs, test connectivity, run some other tools, analyze traffic, etc. After doing all those things, I restarted the devices one by one (I had to do this several times) to see if the conditions changed at all. After everything came back up, I noticed that traffic flowed normally for a short time and then stopped again.
I changed over to HP switches a few months ago, replacing unmanaged Dell junk. I configured the HP switches in a stacked configuration, so I logged in to the commander, which is located in another building on our campus, and saw 100% port utilization; then the connectivity issues came back. So I asked the network support techs, who were standing around me while I was doing all this troubleshooting (I'll come back to this, because it's a hot button for me), "Has anyone made any kind of changes this morning?" "I did," said one of the techs.
So I asked, "What did you do?" He said, "Well, one of the users called the helpdesk and said they had a computer in their office that couldn't log on to the domain, so I went over to check it out. I noticed that the pocket switch the computer was plugged into was off, so I turned it on a little while ago." I told him, "Go unplug it right now, and come back here afterwards!" I managed to get back into the commander to monitor the port utilization, and as soon as he unplugged it everything went back to normal. When he got back, I asked him to take me to where this pocket switch was located. We got there, I checked the cables to see what went where, and then I asked him, "Do you know what a network loop is?" He said, "Sort of... is that what happened?"
I walked away at that point because I'd had enough. I sought out the IT Director and brought him up to speed on what occurred, then told him I want change control, or the rest of these guys can't touch any components of the network anymore! He said, "I think we should have an internal meeting to discuss what happened and how it was fixed." I responded, "That's crazy... why should we do that? If they don't know anything about network loops, they should be computer technicians." He then said, "Well, if you feel we really need change control, then implement it."
After all that (the network being down for a few hours), I took some time to think... there's so much to do here: server upgrades, data migrations, an ASA that hasn't seen a software upgrade in 7 years (a security risk), and it's all on my shoulders. Last year I did 10 major infrastructure projects by myself, and I have another 12 planned for this year, but I'm starting to question myself. Should I be here? Should I be working as hard as I do? Having a good team is priceless, and these guys don't offer to contribute to any projects or troubleshoot anything; they just stand around and stare at me.
What would you guys do?
PS, yes I know my grammar probably sucks. So no need to point it out. |
|
|
|
 Reviews:
·Comcast
| Couple things.
1. Having myself been the troubleshooter on the end of a switch loop, the problem should have been immediately obvious (to you, I must emphasize) based on the lights on the switches. Depending on how many ports you're working with, start unplugging them one by one. Eventually, you'll unplug the problem port and all of the lights will settle down.
2. You have stacked switches but no STP?
3. You want techs that can't physically touch the network? It almost sounds like you don't want them to do anything, so that's why they aren't.
4. God help you when somebody starts backfeeding DHCP into your system. There's no way you'll be able to figure that one out. |
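For anyone on the team who only "sort of" knows what a switch loop is, here is a minimal Python sketch of the idea. The topology and device names are hypothetical (the thread never says exactly how the pocket switch was cabled); the sketch assumes it ended up with two paths back to the stack. It walks the list of links and reports any link that connects two devices that are already connected, which is exactly the link that closes a loop.

```python
def find_loop_links(links):
    """Return the links that close a loop in an undirected layer 2 topology."""
    parent = {}

    def find(node):
        # Each device starts as its own group; follow parents to the group root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps lookups short
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # a and b were already reachable from each other: this link makes a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b  # merge the two groups
    return redundant


# Hypothetical cabling: the pocket switch is patched into two wall jacks
# that both lead back to the same switch stack.
links = [
    ("stack", "wall-jack-1"),
    ("stack", "wall-jack-2"),
    ("wall-jack-1", "pocket-switch"),
    ("wall-jack-2", "pocket-switch"),
    ("pocket-switch", "pc"),
]

loop_links = find_loop_links(links)
print(loop_links)  # [('wall-jack-2', 'pocket-switch')]
```

Unplugging the reported link (or having STP block it) leaves a loop-free tree, which is the same end state you reach by pulling cables one by one until the lights settle down.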
|
 Wily_One Premium join:2002-11-24 San Jose, CA | reply to Some1 SaveMe Yeah, a layer 2 loop is what it sounded like, but I'll be honest, I've never heard the term "pocket switch" before.
I work for a big company, so we absolutely require change control and furthermore don't allow "pocket switches" on the network to begin with. 
As for your predicament, it sounds like you're overworked and have no support. Welcome to IT. Kidding aside, you can only do so much yourself. Why is everything on your shoulders if you have a team? There's a clear lack of leadership there. I say you have two choices: step up to fill the void, or move on. |
|
 DarkLogix Texan and Proud Premium join:2008-10-23 Baytown, TX kudos:3 | reply to Some1 SaveMe Ya, switching loops are crap.
Where I work, one dept was given permission to use an AP (because the powers that be haven't gotten around to adding any APs yet). It had worked fine, but suddenly one day it caused an intermittent L2 loop.
The end-user VLAN was maxed out but the others were OK, so it seemed like an issue with the router. I know Cisco, but not the Juniper crap we have.
The end result was that some setting on the AP got changed and caused the whole mess. I still have no idea what setting on an AP could make an L2 loop with only one link to the switch.
Later I configured a different AP and it hasn't given any problems since. We're still waiting on the Cisco APs we were told were coming down the line.
BTW, in my case pings were going through with crazy high response times on the client VLAN and timing out for any inter-VLAN traffic. Server to server was normal, but anything to the router was crap.
Seems the powers that be don't want STP (or rather don't know WTF it is). We have 3x Juniper 4200ex switches, but a POS Juniper SRX firewall is doing the inter-VLAN routing. -- »Death Star Petition |
|
 | reply to Some1 SaveMe Immediate takeaway from this is as follows :
a) change control. I've had it drilled into my head the last 5.5 years to do it every time. I can't count the number of times it has saved my backside, while all the other yahoos still run in cowboy mode.
b) STP -- get it implemented ASAP! Also look into securing your layer 2. VLAN attacks, DHCP snooping, no trunk autonegotiation, BPDU / ROOT guard, etc. I'm trying to visualize your network; it sounds like a collapsed core, but I'm dreading that you've also collapsed your access layer into the core as well. I'm not sure.
said by Some1 SaveMe: and these guys dont offer to contribute to any projects or any troubleshoot anything they just stand around and stare at me.
What's the knowledge base / qualifications? Across the board? Fresh out of school? Handy with a screwdriver but that's about it? Second Wily_One in that SOMEone needs to step up and say "we're going to fix this." First thing I can think of is cross-training. The fact that they just "stare at you" is the first thing that needs fixing. If this is a team, then anyone should have been able to pick this up and start working on it, with everyone else backing him / her up as needed. If all they did was stand around and only offer information when you asked for it, "Houston, we have a problem."
A good starting point, after this mess is cleaned up, is sit everyone down and ask "okay, now if I wasn't here, what would one of you have done in my place?"
My 00000010bits.
Regards |
|
| reply to DarkLogix said by DarkLogix: Seems the powers that be don't want STP (or rather don't know WTF it is. (we have 3x juniper 4200ex switches but a POS juniper firewall SRX is dooing the intervlan routing.)
Hopefully that doesn't impact your inter-vlan traffic throughput too much. |
|
 | reply to Some1 SaveMe ...Just another dumb question Some1 SaveMe... how many users were affected in total?
I've seen outages as few as one person, to multi-site / continent. Key thing is breathe, and go on a date with Crown Royal afterwards (or whatever it is you do) to keep your balance.
Regards |
|
 | reply to Oedipus said by Oedipus:Couple things.
1. Having myself been the troubleshooter on the end of a switch loop, the problem should have been immediately obvious (to you, I must emphasize) based on the lights on the switches. Depending on how many ports you're working with, start unplugging them one by one. Eventually, you'll unplug the problem port and all of the lights will settle down.
2. You have stacked switches but no STP?
3. You want techs that can't physically touch the network? It almost sounds like you don't want them to do anything, so that's why they aren't.
4. God help you when somebody starts backfeeding DHCP into your system. There's no way you'll be able to figure that one out.
1. It wasn't obvious to me that early in the morning (I'm not a coffee person; I just need proper hours of sleep, which didn't happen the night before).
2. I couldn't do STP because they wouldn't pay me OT to do it after hours (I'm salary), so what I had in front of me was seven 10/100 Dell switches, which I consolidated to four 48-port switches. They should have been upgraded years ago, but I swapped them out one by one during business hours and moved connections from old to new. It took days to complete.
3. It really would be beneficial to have other techs who can make changes, as long as they understand the implications of what they are doing. Trust me, this is not the first time an issue has cropped up; it's the same every time, even when I have gone over procedures. |
|
 | reply to HELLFIRE said by HELLFIRE:Immediate takeaway from this is as follows :
a) change control. I've had it drilled into my head the last 5.5 years to do it everytime. I can't count the number of times it has saved by backside, while all the other yahoos still run in cowboy mode.
b) STP -- get it implemented ASAP! Also look into securing your layer 2. VLAN attacks, DHCP snooping, no trunk autonegotiation, BPDU / ROOT guard, etc. I'm trying to visualize your network; it sounds like a collapsed core, but I'm dreading that you've also collapsed your access layer into the core as well. I'm not sure.
said by Some1 SaveMe: and these guys dont offer to contribute to any projects or any troubleshoot anything they just stand around and stare at me.
What's the knowledge base / qualifications? Across the board? Fresh out of school? Handy with a screwdriver but that's about it? Second Wily_One in that SOMEone needs to step up and say "we're going to fix this." First thing I can think of is cross-training. The fact that they just "stare at you" is the first thing that needs fixing. If this is a team, then anyone should have been able to pick this up and start working on it, with everyone else backing him / her up as needed. If all they did was stand around and only offer information when you asked for it, "Houston, we have a problem."
A good starting point, after this mess is cleaned up, is sit everyone down and ask "okay, now if I wasn't here, what would one of you have done in my place?"
My 00000010bits.
Regards
said by HELLFIRE: What's the knowledge base / qualifications? Across the board? Fresh out of school? Handy with a screwdriver but that's about it?
One of them had prior experience, but I'm not sure he has worked with enterprise-type applications or server roles/configurations... but he's good with hardware, so I guess he may fall into the "handy with a screwdriver" category? The other tech is entry level; he had some very basic knowledge but still has a lot to learn. He's really here because we couldn't find anyone who was qualified.
To answer everything else... well, let's just say I have had to document step by step and print out procedures, which they have posted all over their cube walls. |
|
 DarkLogix Texan and Proud Premium join:2008-10-23 Baytown, TX kudos:3 | reply to Oedipus said by Oedipus: said by DarkLogix: Seems the powers that be don't want STP (or rather don't know WTF it is. (we have 3x juniper 4200ex switches but a POS juniper firewall SRX is dooing the intervlan routing.) Hopefully that doesn't impact your inter-vlan traffic throughput too much.
It does, but the powers that be don't care.
We have a home drive for every office user and it's synced so they have an offline copy; the syncing app often considers the LAN to be too slow to sync. |
|
 DarkLogix Texan and Proud Premium join:2008-10-23 Baytown, TX kudos:3 | reply to Some1 SaveMe said by Some1 SaveMe: 2. I couldn't do STP because they wouldn't pay me OT to do it after hours
You're doing it wrong.
STP should have been enabled first thing; it's too easy not to (unless the powers that be say don't). |
|
| reply to Some1 SaveMe said by Some1 SaveMe: 2. I couldn't do STP because they wouldn't pay me OT to do it after hours (i'm salary)
I would be in the HR doghouse if I used that as an excuse, especially since your not doing so contributed to the downtime. |
|
 DarkLogix Texan and Proud Premium join:2008-10-23 Baytown, TX kudos:3 | said by Oedipus: said by Some1 SaveMe: 2. I couldn't do STP because they wouldn't pay me OT to do it after hours (i'm salary) I would be in the HR doghouse if I used that as an excuse, especially since your not doing so contributed to the downtime.
Ya, and salary doesn't mean you quit because it's time to leave; it means you stay and do what you need to.
I'm also salary, and I've stayed past midnight before working on something.
You should expect any major change to have the potential to require sleeping in the office.
Frankly, if someone implemented new switches and didn't apply STP, then when this happened they'd be out the door fast. |
|
 Badger3k We Don't Need No Stinkin Badgers Premium join:2001-09-27 Franklin, OH | reply to Some1 SaveMe It sounds like you are just dealing with young guys who are green. Coach them, mold them, and grow them into what your team needs; don't exclude them.
Change control is good but you don't want them to not touch anything because then they won't learn. Plus I'm not sure how a change management process would have stopped your loop from happening. Technically the change didn't occur to the equipment, something was plugged in at a user desk (or sounds like it anyway). I've never had change control down to requesting permission to plug in a device to the network. Now if the port was inactive and a config change was required that would make sense.
And no offense, but not doing something because you don't get paid OT is stupid. Do what is needed regardless of when it's needed and try to work in comp time if you are really bothered by it. -- Team Discovery: Project Hope |
|
 | said by Badger3k: And no offense, but not doing something because you don't get paid OT is stupid. Do what is needed regardless of when it's needed and try to work in comp time if you are really bothered by it.
I have to second this. I'm about to kick off a 30+ hour upgrade here at work that I've spent months planning. Unfortunately I can only go as fast as the systems. Will I get OT? No, but being salary gives you negotiation power. I fully expect some comp time in my future and management has no problem with that.
And yes, STP is your friend. Implement it now or die under Layer 2 loops for the rest of your life. So many people fear Layer 2 and I don't know why; it's very simple and solves so many problems. -- »vinfotech.blogspot.com |
|
 tubbynet reminds me of the danse russe Premium,MVM join:2008-01-16 Chandler, AZ kudos:1 | said by rsaturns: And yes STP is your friend implement it now or die under Layer 2 loops for the rest of your life. So many people fear Layer 2 and I don't know why, it's very simple and solves so many problems.
because it can fail *spectacularly* even when configured properly. the issue is all about scalability and how to ensure that l2 domains don't grow and that convergence is quick after an if-status change. given the anemic cpus that are put into most desktop switches, it's possible for stp bpdus to get lost/dropped if there is too much activity on the network. proper pruning and vlan scaling, along with stp mode choice, can go a long way in helping -- but at the end of the day -- the best way to prevent stp issues is to remove it from the network altogether.
campus networks should be l3 to the access-layer. very few (if any) applications require l2 reachability for end-hosts to function.
data-center networks should be kept in pods, with redundancy if possible. traffic to/from those pods should be routed to core. pod size should be kept as appropriate. stretched l2 requirements inside of a d/c should be provided by technologies such as vpls or otv -- and not pure stp links. mcec should be used as appropriate (or trill implementations by $vendor_of_choice).
q. -- "...if I in my north room dance naked, grotesquely before my mirror waving my shirt round my head and singing softly to myself..." |
|
 donoreo Premium join:2002-05-30 North York, ON | reply to Oedipus said by Oedipus: said by Some1 SaveMe: 2. I couldn't do STP because they wouldn't pay me OT to do it after hours (i'm salary) I would be in the HR doghouse if I used that as an excuse, especially since your not doing so contributed to the downtime.
That is WHY they pay salary! Do the job to get it done. -- The irony of common sense, it is not that common. I cannot deny anything I did not say. A kitten dies every time someone uses "then" and "than" incorrectly. I mock people who give their children odd spelling of names. |
|
 | reply to Some1 SaveMe Second Badger3k's comments in their entirety. Assuming everything's cleaned up from this little gongshow, sit down with your team and walk through it with everyone. I'm hoping they're eager to contribute and learn, but simply scared as they haven't a lot of practical experience -- we were all like that once upon a time. A walkthrough / postmortem would be a great chance for them to pick up some pointers.
The more I think about it, I'm guessing a "pocket switch" is a small 4- or 8-port switch? STP AND BPDU guard would've DEFINITELY been your friend. The minute the switch stack saw BPDUs coming in on one of its access ports, down the port goes... till you unshut it or the rogue device is removed.
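To make the BPDU guard behaviour concrete, here is a toy Python model of a single guarded access port. It is only a sketch of the logic described above, not any vendor's implementation: the class, the port name, and the state names ("forwarding", "err-disabled") are illustrative assumptions. Note that an unmanaged pocket switch typically does not generate BPDUs itself; in a loop it passes the stack's own BPDUs back in on another port, and that is what trips the guard.

```python
from dataclasses import dataclass


@dataclass
class EdgePort:
    """Toy model of an access port with BPDU guard (illustrative only)."""
    name: str
    bpdu_guard: bool = True
    state: str = "forwarding"

    def receive_frame(self, is_bpdu: bool) -> str:
        # A port that is already shut stays shut and ignores traffic.
        if self.state == "err-disabled":
            return self.state
        # An access port should never hear a BPDU; if it does, shut it down.
        if is_bpdu and self.bpdu_guard:
            self.state = "err-disabled"
        return self.state

    def reenable(self) -> str:
        # Stands in for an admin manually bringing the port back up.
        self.state = "forwarding"
        return self.state


port = EdgePort("A12")              # hypothetical port name
port.receive_frame(is_bpdu=False)   # ordinary PC traffic: nothing happens
port.receive_frame(is_bpdu=True)    # a BPDU arrives via the pocket switch
print(port.state)                   # err-disabled
port.reenable()
print(port.state)                   # forwarding
```

The design point is that the failure is contained to the one port with the rogue device, instead of the whole stack hitting 100% utilization.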
Regards |
|
 | reply to Some1 SaveMe Fun times indeed, been there and done that. What I'm paid to do is not IT related, but since I'm known to have knowledge of such, I get tossed into the SHTF situations since I'm physically there already. Basically, an end user plugged an IP phone into a wall jack and then ran a patch cord from the PC port on the phone to the other port of the wall jack. It completely crippled the Cisco router acting as primary gateway, so it brought everything to a halt. The equipment is capable of STP and BPDU guard, but it's all turned off. And when they were running out of available IP addresses, instead of segmenting the network they just switched to a 16-bit subnet. |
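A quick Python check of what that last change means, assuming "16-bit subnet" means a /16 mask and that they started from something like a /24 (both are assumptions; the post doesn't say, and the 10.0.0.0 addresses below are just placeholders). It uses only the standard-library `ipaddress` module.

```python
import ipaddress


def usable_hosts(cidr: str) -> int:
    """Usable host addresses in an IPv4 network (excludes network and broadcast)."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2


before = usable_hosts("10.0.0.0/24")   # assumed original size
after = usable_hosts("10.0.0.0/16")    # the "16-bit subnet"
print(before, after)                   # 254 65534

# The alternative they skipped: carve the same space into routed /24 segments.
segments = len(list(ipaddress.ip_network("10.0.0.0/16").subnets(new_prefix=24)))
print(segments)                        # 256
```

Going flat buys plenty of addresses, but every host now sits in one broadcast domain, so a single loop like the IP phone one reaches everything at once. Segmenting would have kept the blast radius to one segment.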
|
| reply to Some1 SaveMe I actually had a TV that was plugged into a network take it down. Started with users losing connection to the network drives and authentication issues. We had just moved the domain from 2003 to 2008 and thought we f'ed something up real bad. Couldn't figure out what was going on and started unplugging machines one by one off the switch to see if it helped, unplugged the port that the TV was on, everything went back to normal. Still don't know what on the TV caused the problem. -- I thought I knew everything...then I got married...
|
|