koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

[General] SBC California network issues

Been seeing what look to be major failures across SBC's network here in California since approximately 0045 PDT.

Traceroutes going *OUT* of SBC's network (i.e. DSL --> elsewhere) imply either Redback maintenance or VLAN-related problems:

traceroute to pentarou.parodius.com (64.62.145.226), 22 hops max, 40 byte packets
1 gw (192.168.1.254) 0.727 ms 0.669 ms 0.727 ms
2 192.168.0.1 (192.168.0.1) 1.476 ms 1.403 ms 1.466 ms
3 adsl-64-171-255-254.dsl.snfc21.pacbell.net (64.171.255.254) 9.723 ms 9.671 ms 9.466 ms
4 dist1-vlan50.snfc21.pbi.net (206.171.134.130) 10.091 ms 9.917 ms 9.587 ms
5 * * *
6 * * *
7 * * *


Traceroutes going *IN* to SBC's network show backbone-related problems, in addition to reconvergence between NorCal and SoCal routers (likely BGP failing over), implying the issue may be specific to NorCal.

I've also been witnessing very bizarre network-related behaviour which seems to imply there's an ARP-related issue on the Redbacks (again, may be maintenance related if this is in fact maintenance).

In the case that this ordeal is maintenance:

Is there any way humanly possible to get SBC to inform customers (either here on BBR or via email) of upcoming maintenance windows?

This is getting to the point where, grudgingly, I'm starting to look at other DSL providers simply because I'm tired of these incredibly poorly managed maintenances. I don't mean to sound like I'm putting down SBC, but when I was with Speakeasy, they'd inform customers of maintenance windows; ditto with Internap maintenances.

I wouldn't mind these issues so much if I knew when/what to expect 3-7 days beforehand...

Rocky67
Pencil Neck Geek
Premium Member
join:2005-01-13
Orange, CA

Thanks for keeping after this issue, Koitsu.

I've been with PacBell since 1995 and have been very satisfied with their service since back in the dialup days, but these infrastructure maintenance nightmares during the last six months or so are starting to change my attitude about SBC.

As to informing us about this stuff - it would be a start if they would just admit it's happening and take it from there.
flotsamm
join:2003-06-08
San Jose, CA

flotsamm to koitsu

Member

Yes, I agree. I've been quite happy with SBC since 1998, but as of late it seems to be getting worse. Today, in fact, will be my 3rd MPOE meeting and 15th truck roll since July 8th to fix my dropping sync. It's beyond ridiculous.
c P Q a I M
Perfect
join:2002-11-24
San Francisco, CA

c P Q a I M to koitsu

Member

I do remember seeing something about notices for scheduled network maintenance back when I was testing the SBC "Connection Manager" software about a year ago.
Matthew
Premium Member
join:2001-08-03
Emmett, ID

Matthew to koitsu

This is a large part of why I have pushed for a much more informative status page. Those responsible for this type of work would be more visible, and even just stating that work is going to be done during a maintenance window, maybe a few days in advance (when possible), might help keep call and ticket volumes down.

However, I know that the ISP worked on getting information like this out via e-mail over a year ago. If I recall correctly, at the time it led to more calls asking "what is this you guys are doing?" I also recall that when SBCIS started e-mailing people about "upgrades to be done," suddenly people started pointing to the upgrades and saying "my trouble only started after you guys did these upgrades," and tended to shut out other troubleshooting. God forbid someone add a phone in another room the same night SBCIS was doing network upgrades, or that wet trouble happened in the cable, because suddenly the upgrades are at fault. Sort of like when 3 people in Northern California have higher than normal gateway pings: everyone in the state starts thinking that their intermittent connection is related, and the conversation turns from being able to troubleshoot the likely cause to "SBC, fix your network." Is there such a thing as too much information?

But let's try to look at this from a service point of view. When I brought one of your prior threads up to someone, his point of view was that your being able to perform the tracerts at all shows that it is not service affecting. I couldn't tell him that you weren't online, that you couldn't get e-mail, or offer him any other clue as to how this was affecting your use of the service. In my experience, this type of information comes from troubleshooting certain types of problems, not merely for the joy of watching packets. So, what led to the troubleshooting in the first place?

Without going into what you believe the problem is, how was this service affecting? Were you unable to use the internet? Was it slower than usual? In what way was this affecting your ability to use the DSL? Service-affecting trouble can oftentimes have more legs when talking with the people who provide service than telling someone how to do their job, or how they aren't doing it.

Please don't read me wrong, I would like to see your problem resolved, but putting it in service affecting terms helps me make your case.

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

said by Matthew:

This is a large part of why I have pushed for a much more informative status page. Those responsible for this type of work would be more visible, and even just stating that work is going to be done during a maintenance window, maybe a few days in advance (when possible), might help keep call and ticket volumes down.
This would be a fantastic solution to the problem.
said by Matthew:

However, I know that the ISP worked on getting information like this out via e-mail over a year ago. If I recall correctly, at the time it led to more calls asking "what is this you guys are doing?"
When I was much younger, I worked technical support for a couple of years. Maintenance emails we sent out to customers, the vast majority of the time, resulted in exactly what you describe -- "what is this thing you sent me, and what does it mean?"

We ended up going with a pure opt-in system, where those who wanted maintenance notifications could choose to sign up for them.
said by Matthew:

I also recall that when SBCIS started e-mailing people about "upgrades to be done," suddenly people started pointing to the upgrades and saying "my trouble only started after you guys did these upgrades," and tended to shut out other troubleshooting. God forbid someone add a phone in another room the same night SBCIS was doing network upgrades, or that wet trouble happened in the cable, because suddenly the upgrades are at fault. Sort of like when 3 people in Northern California have higher than normal gateway pings: everyone in the state starts thinking that their intermittent connection is related, and the conversation turns from being able to troubleshoot the likely cause to "SBC, fix your network." Is there such a thing as too much information?
I know exactly where you're coming from with this as well. People who aren't entirely clueful (or familiar with debugging connectivity issues) end up correlating their own experiences with something someone else has shed light on, ultimately resulting in a completely worthless + confusing thread. I've seen it here on BBR myself... heck, I've probably contributed a few times.
said by Matthew:

But let's try to look at this from a service point of view. When I brought one of your prior threads up to someone, his point of view was that your being able to perform the tracerts at all shows that it is not service affecting. I couldn't tell him that you weren't online, that you couldn't get e-mail, or offer him any other clue as to how this was affecting your use of the service. In my experience, this type of information comes from troubleshooting certain types of problems, not merely for the joy of watching packets. So, what led to the troubleshooting in the first place?
Whoever this "he" is needs to become a little more educated about IP -- as I'm sure you could explain to him: "Just because you have circuit frame sync doesn't mean your IP packets are going to reach their destination".

What led me to automated traceroutes: I grew tired of seeing SBC's core network fall off the face of the planet at seemingly "random" hours of the night (and sometimes during the day, but I assume daytime issues are purely emergency/unexpected situations, which are impossible to predict -- those aren't what I'm complaining about).

I'm both an SBC DSL customer and a co-location customer at Hurricane Electric (who is a customer of yours, or rather, of SBC ASI's -- they have direct peering with SBC). My traceroutes go from my DSL line to my co-location facility. This relies COMPLETELY on SBC's network, up until their peering point with HE. It's a much more "valid" test than, say, if I were with a DSL or cable provider who didn't have direct peering with my co-lo provider.
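The automation itself is nothing fancy. A minimal sketch of it (assuming a Unix-like box with the stock traceroute binary on the PATH; the interval and log path are arbitrary choices, not what I actually run) looks something like this:

```python
#!/usr/bin/env python3
# Minimal sketch: run a traceroute to the co-lo host every few minutes and
# append the timestamped output to a log, so overnight outages can be
# reviewed later. Assumes a Unix-like host with `traceroute` on the PATH;
# the interval and log path are arbitrary values for the sketch.
import subprocess
import time
from datetime import datetime

TARGET = "pentarou.parodius.com"    # co-lo host from the trace above
INTERVAL = 300                      # seconds between probes
LOGFILE = "traceroute-watch.log"

def probe(target):
    """Run one traceroute and return its combined stdout/stderr."""
    result = subprocess.run(
        ["traceroute", "-n", "-w", "2", target],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    return result.stdout

def main():
    while True:
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        with open(LOGFILE, "a") as log:
            log.write("=== %s ===\n%s\n" % (stamp, probe(TARGET)))
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()
```

Skimming the timestamped log afterward is enough to see when the hops past the Redback start timing out, as in the trace above.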

The traceroutes eventually showed me that network reconvergence does not happen in a clean way, particularly when maintenances are being done (since I assume nighttime outages -- especially ones which keep happening over and over -- are maintenance).

I see core routers in SBC's network fall off the net (for whatever reason), and BGP attempts to route around the problem by going through SoCal instead of, say, a router in San Jose, Santa Clara, or SFO. What you end up with is that packets go to SoCal (usually LA or Anaheim), then back up to NorCal again. This seems a bit silly (adding a good 15-20ms of latency solely because SBC only has ONE core routing POP in the Bay Area, OR because BGP is misconfigured), but is still acceptable.

What isn't acceptable are outages which result in my packets literally being dropped. All of the equipment and IP-related applications I use will retransmit TCP packets for a duration of time. Meaning, if a portion of SBC's core network changes, there may be a delay (stall, etc.) for up to 30 seconds while BGP has a chance to change routes. This is acceptable, as packets still end up making it to their destination.

But the reconvergence problem looks as if no one bothers "cleanly" routing IP traffic beforehand. It's like routers are literally being taken down without any pre-maintenance traffic routing, then brought back up as if nothing ever happened (almost as if they were spontaneously rebooting). Packets die/get dropped in this situation.
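To put a rough number on the difference between those two cases, here's a purely illustrative sketch (the host, port, and 30-second budget are placeholders of my own, not anything SBC-specific): an application that keeps retrying a TCP connect rides out a clean reconvergence, while a hard outage blows past the window entirely.

```python
# Purely illustrative sketch: keep retrying a TCP connect for up to ~30
# seconds, roughly the window that ordinary retransmission/retry behaviour
# gives you. A clean BGP reconvergence fits inside the window; a hard outage
# does not. Host, port, and timings are placeholder values.
import socket
import time

HOST, PORT = "pentarou.parodius.com", 22   # e.g. SSH on the co-lo box
DEADLINE = 30                              # seconds an application will tolerate
ATTEMPT_TIMEOUT = 5                        # per-attempt connect timeout

def connect_within(deadline):
    """Return True if a TCP connection succeeds before the deadline passes."""
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        try:
            with socket.create_connection((HOST, PORT), timeout=ATTEMPT_TIMEOUT):
                return True        # routes converged; packets are flowing again
        except OSError:
            time.sleep(1)          # brief stall: back off and retry
    return False                   # outage outlasted what retries can hide

if __name__ == "__main__":
    print("recovered" if connect_within(DEADLINE) else "hard outage")
```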
said by Matthew:

Without going into what you believe the problem is, how was this service affecting? Were you unable to use the internet? Was it slower than usual? In what way was this affecting your ability to use the DSL? Service-affecting trouble can oftentimes have more legs when talking with the people who provide service than telling someone how to do their job, or how they aren't doing it.
These maintenances/outages are service-affecting for me because I use Remote Desktop to connect to my home PC (which then connects directly to my co-location, usually over a VPN or via SSH) while I'm at work (I work 0000 until 0800 PST/PDT). During this time, I perform maintenances on my own servers/equipment, which requires me to have a stable, uninterrupted network connection. When packets are being blindly dropped, or SBC routers are being mucked with, not only does my Remote Desktop session get disconnected, but my SSH/VPN connections to my co-location also get severed -- meaning the problem is within SBC's IP network. My traceroutes show this.

Summary: yes, I am unable to use the Internet during the outages. If by slower you mean "packets can't get out onto SBC's network and therefore never reach their destination", then yes.
said by Matthew:

Please don't read me wrong, I would like to see your problem resolved, but putting it in service affecting terms helps me make your case.
No problem dude. If you need assistance in all of this, just let me know. I have no problem being a guinea pig for testing (i.e. "we have this other Redback that's on some other portion of our network, we'd like you to try it to see if your experiences are better/worse during maintenances", or other issues) as well, in case that ever comes up. I'll happily work with you and anyone else to get to the bottom of this.

This sort of thing is what I do for a living, and have done for nearly 15 years.
Matthew
Premium Member
join:2001-08-03
Emmett, ID


A status page or e-mail notification service that you sign up for? Is there a subscription fee for the page that those who sign up are granted access to, or is it a service granted to customers? Maybe the efficiencies gained by getting the network health information out pay for the initial cost of development.

Just trying to put this into a business case: is that a subscription service, a service offered to customers, maybe something else? Just what level of monitoring are you looking for?

Basically, I want to be able to query other lines on the same trunk as the customer I am talking to. Not sure how best to describe it. I think that is best done with CPE that does basic monitoring and reports back ping times, DNS resolution times, and other information about the connection's health to a central server. So, say koitsu calls in and says "my service dropped the connection"; I want to be able to easily compare what other customers sharing the same hardware in the network would likely be reporting as well.

Something that could observe, or try to observe, multiple customers on one path with common network equipment, and report the findings legibly to the person answering the call would be nice, to say the least. But then, how many parts of the network would the software have to know about? 4 points, 6? Just how many common points show up as identical in some way or at some address? And at what point do existing tools prove just what the problem is, to a larger group of people than are reporting the problem, all at the same time?
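(As a rough illustration of the CPE-side piece only: an agent along the lines of the sketch below, periodically reporting a couple of health numbers to a central collector. The collector URL, field names, and probe targets are invented for the sketch; nothing like this exists on today's CPE.)

```python
# Rough illustration of a CPE-side monitoring agent: measure gateway ping
# and DNS lookup times once a minute and post them to a central collector.
# The collector URL, field names, and probe targets are invented here.
import json
import socket
import subprocess
import time
import urllib.request

GATEWAY = "192.168.0.1"                                 # placeholder: customer's local gateway
DNS_NAME = "www.pacbell.net"                            # name to time a lookup against
COLLECTOR = "http://collector.example.invalid/report"   # hypothetical central server

def ping_ms(host):
    """Rough round-trip time to host in ms (process wall time), or None on loss."""
    start = time.monotonic()
    rc = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL).returncode
    return (time.monotonic() - start) * 1000.0 if rc == 0 else None

def dns_ms(name):
    """Time one DNS resolution in ms, or None on failure."""
    start = time.monotonic()
    try:
        socket.gethostbyname(name)
    except OSError:
        return None
    return (time.monotonic() - start) * 1000.0

def report_once():
    sample = {"ts": time.time(),
              "gateway_ping_ms": ping_ms(GATEWAY),
              "dns_lookup_ms": dns_ms(DNS_NAME)}
    req = urllib.request.Request(COLLECTOR,
                                 data=json.dumps(sample).encode(),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    while True:
        report_once()
        time.sleep(60)
```

Aggregating those samples by shared DSLAM or Redback on the server side is what would let the person answering the call see whether the neighbours on the same hardware are reporting the same thing.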

I believe that cones, representing sync rates over distance and adversity, best represent how to visualize an ADSL circuit (or even higher-speed products) down to one point. The closer to the DSLAM you are, the higher the theoretical rates get and the higher the actual throughputs become. But what I do not understand is exactly how this formula works out in the long run, as far as dollars invested vs. dollars gained by investing small now when the next monopoly network is in the balance.

Maybe I am being overly critical, but then again maybe I just do not know what the future of the company holds. Hope for the best, plan to survive the worst, and hope the kids you raised come out understanding the world in ways that you never thought of, and share those newfound realities with you.

An advanced status page would be nice; it may help registered subs get more info than what is generally presented.

Edit- tried to clear some of the 3:00 AM fog out of the original text.
Matthew to koitsu

Premium Member

said by koitsu:
The traceroutes eventually showed me that network reconvergence does not happen in a clean way, particularly when maintenances are being done (since I assume nighttime outages -- especially ones which keep happening over and over -- are maintenance).

I see core routers in SBC's network fall off the net (for whatever reason), and BGP attempts to route around the problem by going through SoCal instead of, say, a router in San Jose, Santa Clara, or SFO. What you end up with is that packets go to SoCal (usually LA or Anaheim), then back up to NorCal again. This seems a bit silly (adding a good 15-20ms of latency solely because SBC only has ONE core routing POP in the Bay Area, OR because BGP is misconfigured), but is still acceptable.

What isn't acceptable are outages which result in my packets literally being dropped. All of the equipment and IP-related applications I use will retransmit TCP packets for a duration of time. Meaning, if a portion of SBC's core network changes, there may be a delay (stall, etc.) for up to 30 seconds while BGP has a chance to change routes. This is acceptable, as packets still end up making it to their destination.

But the reconvergence problem looks as if no one bothers "cleanly" routing IP traffic beforehand. It's like routers are literally being taken down without any pre-maintenance traffic routing, then brought back up as if nothing ever happened (almost as if they were spontaneously rebooting). Packets die/get dropped in this situation.

In your experience, would this problem point more towards the provider or the IOS of the routers involved?

Just curious.

Boredness
So bored...
Premium Member
join:2005-07-07
Fresno, CA


Boredness to Matthew

Premium Member

Hey koitsu, I like your sig! I was born in 1977 and that is my motto as well!

koitsu
MVM
join:2002-07-16
Mountain View, CA
Humax BGW320-500

koitsu to Matthew

said by Matthew:

In your experience, would this problem point more towards the provider or the IOS of the routers involved?

Just curious.
From my perspective, it looks to be a provider problem. It could also be a hardware-related problem, such as a bad switch port, or a faulty PSU of some sort. I obviously can't debug those sorts of issues from where I am.

It's probably not a bug in Cisco IOS. I say "probably" because bugs in mission-critical hardware/software do exist. I cannot count how many times a month I see bug reports filed with Juniper and Network Appliance -- both companies that make hardware and software which demand 24x7 reliability...