Some1 SaveMe

@..nehome-server.info

How to take down a network

Backstory
(I was a sys admin for 2 yrs at my current job, and I had 8 yrs of IT experience before coming here. I was promoted to network admin; the former network admin is now the IT Director. Two network support techs are on the networking team to assist the network admin position.)

Today was a f#$%@#^ nightmare! It started first thing in the morning, around 9: our phones lost connection, then power (the phones are PoE), followed by a few people asking whether the network or the internet was down.

So I grabbed my tech bag and ran to the MDF. Once I got there I did a quick visual check to see if the LEDs on the network gear were flashing, which they were, so nothing looked locked up or frozen. I then consoled into the gear one by one to check logs, test connectivity, and run some other tools to analyze traffic. After all that I restarted the devices one by one (I had to do this several times) to see if conditions changed at all. After everything came back up, traffic flowed normally for a short time and then stopped again.

I changed over to HP switches a few months ago, replacing unmanaged Dell junk, and configured them in a stacked configuration. So I logged in to the commander, which is located in another building on our campus, and saw 100% port utilization; then the connectivity issues came back.

So I asked the network support techs, who were standing around me while I was doing all this troubleshooting (I’ll come back to this because it’s a hot button for me), whether anyone had made any kind of changes this morning. “I did,” said one of the techs. So I asked what he did. He said, “Well, one of the users called the helpdesk and said they had a computer in their office that couldn't log on to the domain, so I went over to check it out and I noticed that the pocket switch that the computer was plugged into was off, so I turned it on a little while ago.” I told him, go unplug it right now, and come back here afterwards! I managed to get back into the commander to monitor the port utilization, and as soon as he unplugged it everything went back to normal. When he got back I asked him to take me to where this pocket switch was located. We got there and I checked out the cables to see what went where, then I asked him, do you know what a network loop is? He said, “Sort of… is that what happened?”

I walked away at that point because I’d had enough. I sought out the IT Director and brought him up to speed on what occurred, then told him I want change control or the rest of these guys can’t touch any components of the network anymore! He said, “I think we should have an internal meeting to discuss what happened and how it was fixed.” I responded, that’s crazy… why should we do that? If they don’t know anything about network loops they should be computer technicians. He then said, “Well, if you feel we really need change control then implement it.”

After all that (the network being down for a few hours) I took some time to think… there’s so much to do here: server upgrades, data migrations, an ASA that hasn’t seen a software upgrade in 7 yrs (security risk), and it’s all on my shoulders. Last year I did 10 major infrastructure projects all by myself and I have another 12 planned for this year, but I’m starting to question myself. Should I be here? Should I be working as hard as I do? Having a good team is priceless, and these guys don’t offer to contribute to any projects or troubleshoot anything; they just stand around and stare at me.

What would you guys do?

PS, yes I know my grammar probably sucks. So no need to point it out.


Oedipus

join:2005-05-09
kudos:1

Couple things.

1. Having myself been the troubleshooter on the end of a switch loop, the problem should have been immediately obvious (to you, I must emphasize) based on the lights on the switches. Depending on how many ports you're working with, start unplugging them one by one. Eventually, you'll unplug the problem port and all of the lights will settle down.

2. You have stacked switches but no STP?

3. You want techs that can't physically touch the network? It almost sounds like you don't want them to do anything, so that's why they aren't.

4. God help you when somebody starts backfeeding DHCP into your system. There's no way you'll be able to figure that one out.
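The "unplug them one by one" approach in point 1 can be sketched roughly as below. This is only a toy model, not anything from a real switch: `find_looped_port` and its `disable`, `enable`, and `storm_active` callbacks are hypothetical names standing in for whatever you actually use to shut a port and check whether the lights (or utilization) have settled down.

```python
from typing import Callable, Iterable, Optional


def find_looped_port(
    ports: Iterable[str],
    disable: Callable[[str], None],
    enable: Callable[[str], None],
    storm_active: Callable[[], bool],
) -> Optional[str]:
    """Disable ports one at a time; return the first one whose removal stops the storm."""
    for port in ports:
        disable(port)
        if not storm_active():
            return port  # leave the offending port disabled
        enable(port)  # not the culprit, so bring it back up
    return None  # no single port cleared the storm


if __name__ == "__main__":
    # Toy stand-in for a switch: the storm lasts as long as port "A7" is up.
    up = {"A1", "A2", "A7", "A9"}
    culprit = find_looped_port(
        ports=sorted(up),
        disable=up.discard,
        enable=up.add,
        storm_active=lambda: "A7" in up,
    )
    print(culprit)
```

On a real network you would want to start with uplinks or the busiest ports rather than go in plain order, but the idea is the same.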



Wily_One
Premium
join:2002-11-24
San Jose, CA
Reviews:
·AT&T U-Verse
reply to Some1 SaveMe

Yeah, a layer 2 loop is what it sounded like, but I'll be honest, I've never heard the term "pocket switch" before.

I work for a big company, so we absolutely require change control and furthermore don't allow "pocket switches" on the network to begin with.

As for your predicament, it sounds like you're overworked and have no support. Welcome to IT. Kidding aside, you can only do so much yourself. Why is everything on your shoulders if you have a team? There's a clear lack of leadership there. I say you have two choices: step up to fill the void, or move on.



DarkLogix
Texan and Proud
Premium
join:2008-10-23
Baytown, TX
kudos:3
reply to Some1 SaveMe

Ya switching loops are crap

Where I work, one dept was given permission to use an AP (because the powers that be haven't gotten around to adding any APs yet). Well, it had worked fine, but suddenly one day it caused an intermittent L2 loop.

So the end-user VLAN was maxed but the others were OK, so it seemed like an issue with the router. I know Cisco but won't touch the Juniper crap we have.

Well, the end result was that some setting on the AP got changed and caused the whole mess. Still no idea what setting on an AP could make it an L2 loop with only one link to the switch.

Later I configured a different AP and it hasn't given any problems since. Still waiting on the Cisco APs we were told were coming down the line.

BTW, in my case pings were going through with crazy high response times on the client VLAN, and timing out for any inter-VLAN traffic. Server to server was normal, but anything to the router was crap.

Seems the powers that be don't want STP (or rather don't know WTF it is).
(We have 3x Juniper EX4200 switches, but a POS Juniper SRX firewall is doing the inter-VLAN routing.)
--
»Death Star Petition


HELLFIRE
Premium
join:2009-11-25
kudos:18
reply to Some1 SaveMe

Immediate takeaway from this is as follows :

a) change control. I've had it drilled into my head the last 5.5 years to do it every time. I can't count the number of
times it has saved my backside, while all the other yahoos still run in cowboy mode.

b) STP -- get it implemented ASAP! Also look into securing your layer 2. VLAN attacks, DHCP snooping, no trunk
autonegotiation, BPDU / ROOT guard, etc. I'm trying to visualize your network; it sounds like a collapsed core, but
I'm dreading that you've also collapsed your access layer into the core as well. I'm not sure.

said by Some1 SaveMe :

and these guys don’t offer to contribute to any projects or troubleshoot anything; they just stand around and stare at me.

What's the knowledge base / qualifications? Across the board? Fresh out of school? Handy with a screwdriver but that's
about it? Second Wily_One in that SOMEone needs to step up and say "we're going to fix this." First thing I can
think of is cross-training. The fact that they just "stare at you" is the first thing that needs fixing. If this is a team, then
anyone should have been able to pick this up and start working on it, with everyone else backing him / her up as needed.
If all they did was stand around and only offer information when you asked for it, "Houston, we have a problem."

A good starting point, after this mess is cleaned up, is sit everyone down and ask "okay, now if I wasn't here, what would
one of you have done in my place?"

My 00000010bits.

Regards

Oedipus

join:2005-05-09
kudos:1
reply to DarkLogix

said by DarkLogix:

Seems the powers that be don't want STP (or rather don't know WTF it is).
(We have 3x Juniper EX4200 switches, but a POS Juniper SRX firewall is doing the inter-VLAN routing.)

Hopefully that doesn't impact your inter-vlan traffic throughput too much.

HELLFIRE
Premium
join:2009-11-25
kudos:18
reply to Some1 SaveMe

...Just another dumb question Some1 SaveMe... how many users were affected in total?

I've seen outages affect as few as one person, up to multi-site / multi-continent. Key thing is to breathe, and go on a date with
Crown Royal afterwards (or whatever it is you do) to keep your balance.

Regards



Some1 SaveMe

@vectro.com
reply to Oedipus

said by Oedipus:

Couple things.

1. Having myself been the troubleshooter on the end of a switch loop, the problem should have been immediately obvious (to you, I must emphasize) based on the lights on the switches. Depending on how many ports you're working with, start unplugging them one by one. Eventually, you'll unplug the problem port and all of the lights will settle down.

2. You have stacked switches but no STP?

3. You want techs that can't physically touch the network? It almost sounds like you don't want them to do anything, so that's why they aren't.

4. God help you when somebody starts backfeeding DHCP into your system. There's no way you'll be able to figure that one out.

1. It wasn't obvious to me that early in the morning (I'm not a coffee person; I just need proper hours of sleep, which didn't happen the night before).

2. I couldn't do STP because they wouldn't pay me OT to do it after hours (I'm salary), so what I had in front of me was 7 10/100 Dell switches, which I consolidated to 4 48-port switches. They should have been upgraded years ago, but I swapped them out 1 by 1 during business hours and moved connections from old to new; it took days to complete.

3. It really would be beneficial to have other techs that can make changes, as long as they understand the implications of what they are doing. Trust me, this is not the first time an issue has cropped up; it's the same every time, even when I have gone over procedures.


Some1 SaveMe

@vectro.com
reply to HELLFIRE

said by HELLFIRE:

Immediate takeaway from this is as follows :

a) change control. I've had it drilled into my head the last 5.5 years to do it every time. I can't count the number of
times it has saved my backside, while all the other yahoos still run in cowboy mode.

b) STP -- get it implemented ASAP! Also look into securing your layer 2. VLAN attacks, DHCP snooping, no trunk
autonegotiation, BPDU / ROOT guard, etc. I'm trying to visualize your network; it sounds like a collapsed core, but
I'm dreading that you've also collapsed your access layer into the core as well. I'm not sure.

said by Some1 SaveMe :

and these guys don’t offer to contribute to any projects or troubleshoot anything; they just stand around and stare at me.

What's the knowledge base / qualifications? Across the board? Fresh out of school? Handy with a screwdriver but that's
about it? Second Wily_One in that SOMEone needs to step up and say "we're going to fix this." First thing I can
think of is cross-training. The fact that they just "stare at you" is the first thing that needs fixing. If this is a team, then
anyone should have been able to pick this up and start working on it, with everyone else backing him / her up as needed.
If all they did was stand around and only offer information when you asked for it, "Houston, we have a problem."

A good starting point, after this mess is cleaned up, is sit everyone down and ask "okay, now if I wasn't here, what would
one of you have done in my place?"

My 00000010bits.

Regards

What's the knowledge base / qualifications? Across the board? Fresh out of school? Handy with a screwdriver but that's about it?

One of them had prior experience, but I'm not sure he has worked with enterprise-type applications or server roles/configurations... but he's good with hardware, so I guess he may fall into the handy-with-a-screwdriver category? The other tech is entry level; he had some very basic knowledge but still has A LOT to learn. He's really here because we couldn't find anyone who was qualified.

To answer everything else... well, let's just say I have had to document step by step and print out procedures, which they have posted all over their cube walls.


DarkLogix
Texan and Proud
Premium
join:2008-10-23
Baytown, TX
kudos:3
reply to Oedipus

said by Oedipus:

said by DarkLogix:

Seems the powers that be don't want STP (or rather don't know WTF it is).
(We have 3x Juniper EX4200 switches, but a POS Juniper SRX firewall is doing the inter-VLAN routing.)

Hopefully that doesn't impact your inter-vlan traffic throughput too much.

It does, but the powers that be don't care.

We have a home drive for every office user and it's synced so they have an offline copy; the syncing app often considers the LAN to be too slow to sync.
--
»Death Star Petition


DarkLogix
Texan and Proud
Premium
join:2008-10-23
Baytown, TX
kudos:3
reply to Some1 SaveMe

said by Some1 SaveMe :

2. I couldn't do STP because they wouldn't pay me OT to do it after hours

You're doing it wrong.

STP should have been enabled first thing; it's too easy not to. (Unless the powers that be say don't.)
--
»Death Star Petition

Oedipus

join:2005-05-09
kudos:1
reply to Some1 SaveMe

said by Some1 SaveMe :

2. I couldn't do STP because they wouldn't pay me OT to do it after hours (I'm salary)

I would be in the HR doghouse if I used that as an excuse, especially since your not doing so contributed to the downtime.


DarkLogix
Texan and Proud
Premium
join:2008-10-23
Baytown, TX
kudos:3


said by Oedipus:

said by Some1 SaveMe :

2. I couldn't do STP because they wouldn't pay me OT to do it after hours (I'm salary)

I would be in the HR doghouse if I used that as an excuse, especially since your not doing so contributed to the downtime.

Ya, and salary doesn't mean you quit because it's time to leave; it means you stay and do what you need to.

I'm also salary and I've stayed past midnight before working on something.

You should expect any major change to have the potential to require sleeping in the office.

Frankly if someone implemented new switches and didn't apply STP then when this happened they'd be out the door fast.
--
»Death Star Petition


Badger3k
We Don't Need No Stinkin Badgers
Premium
join:2001-09-27
Franklin, OH
reply to Some1 SaveMe

It sounds like you are just dealing with young guys that are green. Coach them, mold them, and grow them into what your team needs; don't exclude them.

Change control is good but you don't want them to not touch anything because then they won't learn. Plus I'm not sure how a change management process would have stopped your loop from happening. Technically the change didn't occur to the equipment, something was plugged in at a user desk (or sounds like it anyway). I've never had change control down to requesting permission to plug in a device to the network. Now if the port was inactive and a config change was required that would make sense.

And no offense, but not doing something because you don't get paid OT is stupid. Do what is needed regardless of when it's needed and try to work in comp time if you are really bothered by it.
--
Team Discovery: Project Hope



rsaturns

join:2004-12-06
Beaverton, OR

said by Badger3k:

And no offense, but not doing something because you don't get paid OT is stupid. Do what is needed regardless of when it's needed and try to work in comp time if you are really bothered by it.

I have to second this. I'm about to kick off a 30+ hour upgrade here at work that I've spent months planning. Unfortunately I can only go as fast as the systems. Will I get OT? No, but being salary gives you negotiation power. I fully expect some comp time in my future and management has no problem with that.

And yes, STP is your friend; implement it now or die under Layer 2 loops for the rest of your life. So many people fear Layer 2 and I don't know why; it's very simple and solves so many problems.
--
»vinfotech.blogspot.com


tubbynet
reminds me of the danse russe
Premium,MVM
join:2008-01-16
Chandler, AZ
kudos:1

said by rsaturns:

And yes, STP is your friend; implement it now or die under Layer 2 loops for the rest of your life. So many people fear Layer 2 and I don't know why; it's very simple and solves so many problems.

because it can fail *spectacularly* even when configured properly. the issue is all about scalability and how to ensure that l2 domains don't grow and that the convergence is quick after if-status change.
given the anemic cpus that are put into most desktop switches, it's possible for stp bpdus to get lost/dropped if there is too much activity on the network. proper pruning and vlan scaling, along with stp mode choice can go a long way in helping -- but at the end of the day -- the best way to prevent stp issues is to remove it from the network altogether.

campus networks should be l3 to the access-layer. very few (if any) applications require l2 reachability for end-hosts to function.

data-center networks should be kept in pods, with redundancy if possible. traffic to/from those pods should be routed to core. pod size should be kept as appropriate. stretched l2 requirements inside of a d/c should be provided by technologies such as vpls or otv -- and not pure stp links. mcec should be used as appropriate (or trill implementations by $vendor_of_choice).

q.
--
"...if I in my north room dance naked, grotesquely before my mirror waving my shirt round my head and singing softly to myself..."


donoreo
Premium
join:2002-05-30
North York, ON
reply to Oedipus

said by Oedipus:

said by Some1 SaveMe :

2. I couldn't do STP because they wouldn't pay me OT to do it after hours (I'm salary)

I would be in the HR doghouse if I used that as an excuse, especially since your not doing so contributed to the downtime.

That is WHY they pay salary! Do the job to get it done.
--
The irony of common sense, it is not that common.
I cannot deny anything I did not say.
A kitten dies every time someone uses "then" and "than" incorrectly.
I mock people who give their children odd spelling of names.

HELLFIRE
Premium
join:2009-11-25
kudos:18
reply to Some1 SaveMe

Second Badger3k's comments in their entirety. Assuming everything's cleaned up from this little gongshow, sit down
with your team and walk through it with everyone. I'm hoping they're eager to contribute and learn, but simply scared
as they haven't a lot of practical experience -- we were all like that once upon a time. A walkthrough / postmortem
would be a great chance for them to pick up some pointers.

The more I think about it, I'm guessing "pocket switch" is a small 4 or 8 port switch? STP AND BPDU guard
would've DEFINITELY been your friend. The minute the switch stack saw BPDUs coming in on one of its edge
ports, down the port goes... till you unshut it or the rogue device is removed.
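To make the BPDU guard behaviour concrete, here is a rough toy model in Python. It is not vendor configuration and not how any particular switch implements it; the `EdgePort` class, its fields, and the `"bpdu"` / `"data"` frame labels are all made up for illustration. The only point is the state change: a guarded edge port that sees a BPDU shuts itself down and stays down until someone re-enables it.

```python
from dataclasses import dataclass


@dataclass
class EdgePort:
    """Toy model of an access (edge) port with BPDU guard."""

    name: str
    bpdu_guard: bool = True
    err_disabled: bool = False

    def receive(self, frame_type: str) -> bool:
        """Return True if the frame is accepted, False if it is dropped."""
        if self.err_disabled:
            return False  # port is down, nothing gets through
        if frame_type == "bpdu" and self.bpdu_guard:
            # A BPDU on an edge port means a switch (or a loop) is attached.
            self.err_disabled = True
            return False
        return True

    def reenable(self) -> None:
        """Manual recovery, the equivalent of un-shutting the port."""
        self.err_disabled = False


if __name__ == "__main__":
    port = EdgePort("A7")
    print(port.receive("data"))   # normal traffic passes
    print(port.receive("bpdu"))   # guard trips, port goes down
    print(port.receive("data"))   # still down until re-enabled
    port.reenable()
    print(port.receive("data"))   # passes again
```

In the pocket-switch scenario, the stack's own BPDUs coming back around the loop would be what trips the guard.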

Regards


ImpetusEra
Premium
join:2004-05-19
00000
reply to Some1 SaveMe

Fun times indeed, been there and done that. What I'm paid to do is not IT related, but since I'm known to have knowledge of such I get tossed into the SHTF situations since I'm physically there already. Basically, an end user plugged an IP phone into a wall jack and then ran a patch cord from the PC port on the phone to the other port of the wall jack. That completely crippled the Cisco router acting as primary gateway, so it brought everything to a halt. The equipment is capable of STP and BPDU guard, but it's all turned off. When they were running out of available IP addresses, instead of segmenting the network they just switched to a 16-bit subnet.
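For a sense of scale on the 16-bit subnet point, here is a rough sketch using Python's standard `ipaddress` module. It assumes "16-bit subnet" means a /16 mask, and the 10.0.0.0 range is just a made-up example, not the network described above.

```python
import ipaddress

# Assumed example range; a /16 puts every host in one broadcast domain.
flat = ipaddress.ip_network("10.0.0.0/16")
print(flat.num_addresses - 2)  # usable hosts: 65534

# The alternative: carve the same space into routed /24 segments.
segments = list(flat.subnets(new_prefix=24))
print(len(segments))                   # 256 segments
print(segments[0])                     # 10.0.0.0/24
print(segments[0].num_addresses - 2)   # 254 usable hosts each
```

With one flat /16, a single loop or broadcast storm can reach every host; with /24 segments behind a router, the damage stays inside one segment.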



BeanBag

join:2001-12-06
Auburn, WA
reply to Some1 SaveMe

I actually had a TV that was plugged into a network take it down. Started with users losing connection to the network drives and authentication issues. We had just moved the domain from 2003 to 2008 and thought we f'ed something up real bad. Couldn't figure out what was going on and started unplugging machines one by one off the switch to see if it helped, unplugged the port that the TV was on, everything went back to normal. Still don't know what on the TV caused the problem.
--
I thought I knew everything...then I got married...



DC DSL
There's a reason I'm Command.
Premium
join:2000-07-30
Washington, DC
kudos:2
reply to Some1 SaveMe

This. The original and still the best.






Wily_One
Premium
join:2002-11-24
San Jose, CA

PoE?



Chiyo
Save Me Konata-Chan
Premium
join:2003-02-20
Charlotte, NC
kudos:1
reply to Some1 SaveMe

As someone who can relate being in the tech's shoes I would probably stare too because YOU are the person in charge, you are the person I would be looking to for leadership.

I would also keep my mouth shut if I didn't know what the issue could be. As I'm just a tech, I might be right, or lucky to be right, but honestly I would feel really stupid if I opened my mouth and was way wrong and we had just wasted time while our entire network was down.

Glad you figured it out.
--
That was the wild boar.... Moo!
My podcast: The Banzai Beat »www.banzaibeat.com


Oedipus

join:2005-05-09
kudos:1

said by Chiyo:

As someone who can relate being in the tech's shoes I would probably stare too because YOU are the person in charge, you are the person I would be looking to for leadership.

I would also keep my mouth shut if I didn't know what the issue could be. As I'm just a tech, I might be right, or lucky to be right, but honestly I would feel really stupid if I opened my mouth and was way wrong and we had just wasted time while our entire network was down.

Glad you figured it out.

Early on in my IT career I was sent out to a remote site on a weekend by myself to figure out why their entire (single vlan) network was down. Never having seen a switch loop in person before, I still noticed that something was awry with the lights on the switches. Didn't take me long to figure it out.

If I were one of the techs in the OP's situation, I would be standing there wondering why my boss was having such a hard time coming up with a diagnosis.

HELLFIRE
Premium
join:2009-11-25
kudos:18
reply to BeanBag

@BeanBag
Plug the TV into a home router and run Wireshark. It's probably spitting out some non-regulation frames / packets.

Regards


amungus
Premium
join:2004-11-26
America
Reviews:
·Cox HSI
·KCH Cable
reply to Some1 SaveMe

Oh the fun.
...it never ends...

STP. Yes. Do it. Do it today.
Plan changes, make them happen. There is only so much a person can do in a day. Make the most of each one, and work steadily towards making things better. Newsflash - you're probably going to have to work during off-times. Welcome to the life of an admin...

As for your techs, get them to write their own procedures, ASAP. Tell them to at least rewrite the existing ones, file them electronically, and update them continually. Have them store said procedures in a knowledgebase of some kind. Make them get in the routine of documenting their work well enough for others to learn.
When I started as a tech where I am now, I had to do it. Don't be Google for them.

Change control. I'd like to have more of that someday. We started making sure that some things are done with approval, but there is much to be done on this front.

As for working hard - only you can answer that.
I bust my ass.
I probably have more (and worse) gripes than you, but may or may not be as vocal.
I probably make less than you, and several others here.
I probably have at least as much chaos to deal with as everyone here.
I still think it's worth it for me to do the best I can. The experience alone is priceless.
Sometimes, you just have to suck it up, and keep on going.

Lastly - it can always be worse. Count what blessings you do have, and be thankful for the good things.


kc8jwt

join:2005-10-27
Syracuse, OH
reply to Some1 SaveMe

As other posters have said here, taking ownership of the problem is the first thing as an admin you need to do. There have been many times something was supposed to get done for me and it didn't happen. As I look at it, the only person to blame is me for the issue.

As for the two people that you have working under you, take the time to try and mentor them. On slow days, inject a small issue that won't harm anything and then set them loose on fixing it. Prod them along in the right direction, but don't show them the problem. I work in a vocational school and from time to time I have students that work with me. A good exercise is to give them a computer with a fairly simple issue to fix. I tell them to troubleshoot it and then I evaluate how they did.

I've had my share of loops here on our network, and it usually takes me 10 minutes tops to find them. STP keeps a lid on them for the most part, but again, educating and mentoring the techs you have will get you freed up so you can work on all of those other projects that you have.