lilarry Premium Member join:2010-04-06 1 edit |
lilarry
Premium Member
2014-Dec-18 10:04 pm
[Voip.ms] Voip.ms New York Servers Down
All New York Servers down once again - portal too! I've been begging them to get the heck out of Internap. Maybe this time? |
|
taoman Premium Member join:2013-09-13 Seattle, WA |
taoman
Premium Member
2014-Dec-18 10:09 pm
Re: Voip.ms New York Servers Down
said by lilarry: All New York Servers down once again - portal too! I've been begging them to get the heck out of Internap. Maybe this time?
And I just ported in today. Portal is down for me but my SIP server (Seattle2) is still up. |
|
crazyk4952 Premium Member join:2002-02-04 united state Ubiquiti EdgeRouter Lite Ubiquiti UniFi AP-LR Polycom VVX300
|
to lilarry
From their Twitter account:
"New York Data Center issues, Website not accessible, we're working on it with Internap." VoIP.ms (@voipms), December 19, 2014 |
|
lilarry Premium Member join:2010-04-06 |
lilarry
Premium Member
2014-Dec-18 10:21 pm
Outbound calls on working servers are failing too - as are some inbound calls. This is a big one. |
|
taoman Premium Member join:2013-09-13 Seattle, WA |
to lilarry
Yep. Inbound calls are now all busy and outbound calls are dead air........ |
|
|
to lilarry
Well this sucks, I'm on the Chicago server and I can only dial toll free calls, but incoming is ok. And I just ported in...God dammit. |
|
mts join:2000-10-06 Lansing, MI |
mts
Member
2014-Dec-18 10:31 pm
So now I'm curious... do all outbound calls route through New York on the backend somehow? |
|
mackey Premium Member join:2007-08-20 |
mackey
Premium Member
2014-Dec-18 10:35 pm
said by mts: So now I'm curious... do all outbound calls route through New York on the backend somehow?
No, I'm in Los Angeles and the latency on local calls isn't enough for it to be going to NY and back. My guess is something backend (DNS? CDR accounting?) tries to contact the NY servers. /M |
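A back-of-the-envelope check of the latency argument above, as a sketch only: the city coordinates and the roughly 200,000 km/s figure for light in fiber are assumptions added here, not numbers from the thread. A call that hairpinned from Los Angeles through New York would pick up at least the propagation delay computed below (on the order of 40 ms round trip) before any routing or queuing overhead.

```python
import math

# Approximate city-centre coordinates (assumed for illustration).
LOS_ANGELES = (34.05, -118.24)
NEW_YORK = (40.71, -74.01)


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def min_round_trip_ms(distance_km, fiber_km_per_s=200_000.0):
    """Lower bound on round-trip time over fiber; real paths are longer and slower."""
    return 2 * distance_km / fiber_km_per_s * 1000


if __name__ == "__main__":
    d = great_circle_km(*LOS_ANGELES, *NEW_YORK)
    print(f"distance: {d:.0f} km, minimum added round trip: {min_round_trip_ms(d):.1f} ms")
```

If local calls from Los Angeles show round-trip times well below that floor, the media path is very unlikely to be crossing the continent, which fits the guess that only a backend lookup touches New York.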
|
lilarry Premium Member join:2010-04-06 |
lilarry to mts
Premium Member
2014-Dec-18 10:35 pm
to mts
said by mts: So now I'm curious... do all outbound calls route through New York on the backend somehow?
Just speculating, but I believe the main database and CDRs are at Internap, thus even if calls don't route through New York, data entries need to be made there. |
|
lilarry |
to taoman
said by taoman: And I just ported in today.
No worries. This looks like a bad one, but Voip.ms is usually pretty good. |
|
lilarry |
lilarry
Premium Member
2014-Dec-18 11:13 pm
From what I'm seeing on Voip.ms Twitter feed, it looks like Voip.ms may be having some difficulty working with Internap tonight.
Aside from my angst as a powerless reseller with way too many angry customers screaming at me just now, one of the things that is really bugging me about this is that for weeks I've been repeatedly reporting to Voip.ms issues with the New York servers and the portal - via tickets and live chat. I even posted some portal issues here in the past couple of weeks. Pathping tests a couple of weeks ago showed 100% packet loss at Voxel (Internap) on the last hop before the portal server. I'm troubled that while support staff acknowledges my reports, I have no idea what they do with that info. And while I don't know whether the issues I've related to them have anything to do with tonight's fiasco, I do know that Internap genuinely sucks - and I know THEY know that Internap sucks. It is absolutely time for them to get the heck out of there. I wonder if or when they'll take action and move to someplace (anyplace) more reliable. |
|
lilarry |
lilarry
Premium Member
2014-Dec-18 11:31 pm
Re: [Voip.ms] Voip.ms New York Servers Down
Servers appear back up at 2330 Eastern - but outbound calls are still taking 60 seconds or longer to go through. I think they're switching to a mirror site. |
|
|
to lilarry
Outbound calls are still not working here. |
|
Mango Use DMZ and you get a kick in the dick. Premium Member join:2008-12-25 www.toao.net 2 edits
1 recommendation |
Mango
Premium Member
2014-Dec-18 11:36 pm
Thirty-three PoPs and they don't even have proper outbound failover?
EDIT: My post is no longer accurate; see below. |
|
|
And when I hear something that sounds like "Enter Nap" of course I will go to sleep.
That's how Locutus defeated the Borg, with an "Enter Nap" command. |
|
MartinM VoIP.ms Premium Member join:2008-07-21
1 recommendation |
to lilarry
Dad jokes aside,
We're aware of the issue, and many of us are working on that Internap mess.
There was one design flaw that caused some POPs to fail regardless. We're fixing that, and all POPs should be working properly in a few minutes regardless of the New York issues. |
|
|
MartinM |
to Mango
said by Mango: Thirty-three PoPs and they don't even have proper outbound failover?
Yes Mango. Strike us while we are down. Just kidding. We're up and running. |
|
|
GusHerb
Member
2014-Dec-18 11:48 pm
I see that, I can call out again! Do any of those fixes you guys are working on involve fixing the part where those of us with POPs that didn't fail still couldn't call out? |
|
MartinM VoIP.ms Premium Member join:2008-07-21 |
MartinM
Premium Member
2014-Dec-18 11:50 pm
said by GusHerb: Do any of those fixes you guys are working on involve fixing the fact that those of us with POPs that didn't fail still couldn't call out?
Indeed, it exposed major flaws that will be fixed tomorrow. No geographical data centre should ever be affected by an outage at another one. This will be addressed with a whip if necessary. Let's say some of us are really pissed, pardon my language. |
|
Mango Use DMZ and you get a kick in the dick. Premium Member join:2008-12-25 www.toao.net |
Mango
Premium Member
2014-Dec-18 11:57 pm
said by MartinM: This will be addressed with a whip if necessary.
I know you are not amused with the situation right now, but your above quote made me laugh! |
|
lilarry Premium Member join:2010-04-06 |
lilarry
Premium Member
2014-Dec-19 12:00 am
said by Mango: I know you are not amused with the situation right now, but your above quote made me laugh!
I found myself chuckling too - and as anyone reading this thread can tell, I'm not necessarily in the greatest mood either. |
|
|
to MartinM
said by MartinM: Indeed, it exposed major flaws, that will be fixed tomorrow
As usual, you guys handled this flaw.....well, flawlessly. I'm glad it happened as I know your platform will be improved as a result. |
|
|
to lilarry
At least this happened after the end of the East Coast workday, and before the weekend, and before Christmas.
-----
In a few hours everyone can sit down with a Labatt Blue, Anchor Steam Beer, or Montejo. |
|
MartinM VoIP.ms Premium Member join:2008-07-21
5 recommendations |
MartinM
Premium Member
2014-Dec-19 7:34 am
said by PX Eliezer1: At least this happened after the end of the East Coast workday, and before the weekend, and before Christmas
Indeed. It wasn't a busy night; well, for us it was.
---
I've waited a bit to post this, so that I could write calmly once the storm was over. Let's say that a series of unfortunate events led to this interruption of service.
- A core router in Internap's LGA6 data center in New York went down. This is not a piece of equipment we have control over.
- www2-mirror.voip.ms, an independent replica hosted in Chicago for the case where New York goes down, didn't take over immediately. The DNS took longer to update than expected. We'll be addressing that Monday in a meeting with the technical staff. This website should take over in a matter of minutes when the main site goes down. It's an exact live replica of our whole system. It was deployed years ago, and its hardware is regularly updated and kept up to date so it can take over in events just like this one. The website was eventually up and running at its mirror location. The mirror has served well many times in the past, specifically during the Sandy storm and the few times Internap took a nap. (Bad pun intended.)
- Regarding other geographical locations that experienced long call delays: our programmers found a deprecated piece of code that was in place to increase security for our customer accounts. Without going into specifics, it had been deprecated in favor of a completely independent "per-POP" system, which ensures that each individual POP has no point of failure other than itself. Some servers still used a connection to our old system in New York, which was down, and that prevented outgoing calls. We just spent the whole night with the programming team making sure no traces of this code are left and that each POP is now fully independent. We'll continue refining and conducting emergency exercises on test POPs next week. We have many servers that we use for emergency test procedures.
- We've moved traffic back to our main website in New York, but let's say we'll start moving all of our core infrastructure (websites, wiki, tickets, various databases) out of Internap in January. The datacenter is in Chicago and has had zero downtime in years. As for the New York POPs, we're actively looking for a replacement for Internap. Their Voxel days, when they were flawless, are long gone.
On behalf of the whole team, I truly apologize. Internap's data center failure should have resulted in a simple, quick relocation of our website to our mirror site and POP redirection, and should never have had any kind of impact on other geo-locations. As always, we'll use this incident as free education for all of our staff, including management, and we'll conduct more emergency exercises to reduce events like this to, at most, minutes of downtime, not an hour.
Regards, |
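The "per-POP" idea described above can be illustrated with a minimal sketch. Everything here is hypothetical (the account table, the function names, the timeout); it is not VoIP.ms's code, only one way a POP could decide on outbound calls from local data and treat a lookup against a central system as optional and time-bounded, so that an unreachable New York cannot stall calls elsewhere.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-POP account replica; a real POP would keep this in a local database.
LOCAL_ACCOUNTS = {"100001": {"active": True}}

_pool = ThreadPoolExecutor(max_workers=4)


def authorize_call(account_id, central_check=None, timeout_s=0.5):
    """Return (allowed, source).

    The local replica always produces an answer. If a central check is
    configured, it may override that answer, but only if it replies within
    timeout_s; a timeout or any error falls back to the local decision.
    """
    local = LOCAL_ACCOUNTS.get(account_id)
    allowed_locally = bool(local and local["active"])
    if central_check is None:
        return allowed_locally, "local"
    future = _pool.submit(central_check, account_id)
    try:
        return bool(future.result(timeout=timeout_s)), "central"
    except Exception:  # timeout, connection refused, or any other central failure
        return allowed_locally, "local"
```

The design choice in this sketch is that the central system can only add information, never block: the worst case for a caller is a delay of timeout_s, not the 60 seconds or more reported earlier in the thread.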
|
lilarry Premium Member join:2010-04-06 |
lilarry
Premium Member
2014-Dec-19 7:54 am
Thank you Martin for taking the time to elaborate on this. It means a lot. I know you guys have had a long night. This is one of many reasons we route so much of our traffic through VoIP.ms |
|
|
to MartinM
said by MartinM: On behalf of the whole team, I truly apologize. Internap's data center failure should have resulted in a simple, quick relocation of our website to our mirror site, POP redirection and never should have had any kind of impact to other geo-locations. As always, we'll use this incident as free education for all of our staff, including management, and we'll conduct more emergency exercises to reduce events like this to at most, minutes of downtime, not an hour.
It's nice to see companies actually take responsibility when things go wrong. I know "stuff" happens to everybody but not everybody would explain the issue(s) in this kind of detail in a very public forum. Kudos to you and your team! |
|
Mango Use DMZ and you get a kick in the dick. Premium Member join:2008-12-25 www.toao.net |
to MartinM
said by MartinM: Their Voxel days when they were flawless are long gone.
It is so frustrating when suppliers used to be awesome. Thanks for the post. |
|
|
said by Mango: It is so frustrating when suppliers used to be awesome.
Internap has had issues.
Unrelated UPS Failures Cause Three NYC Outages for Internap
In an unfortunate series of unrelated equipment failures, Internap recently experienced three outages at its Manhattan data centers in one week's time.
The May 16 outage at 111 8th Avenue we reported on earlier was followed by two outages of the hosting service provider's data center at 75 Broad Street. All three were caused by component failures in uninterruptible power supply systems....
» www.datacenterknowledge. ··· nternap/
Maybe Internap needs some generators.
-----
Kudos indeed to Voip.MS' MartinM for the detailed explanations and analysis during a very hard day's night. |
|
|
to lilarry
I think this serves as a reminder that, as flawlessly as we want everything in our lives to work, a lot of things can (and sometimes do) fail outside of anyone's control.
Kudos to Martin and the staff at voip.ms for being able to deal with this as quickly as you did.
NefCanuck |
|
4 recommendations |
to lilarry
Kudos Martin and the team.
Martin, I know this is the wrong time to bring this up, and I feel it's akin to kicking someone while they are down, but I'm hoping this will be a bit of a wakeup call for the team to start working more on redundancy and failovers. DNS redirection is not an acceptable solution: it takes a long time to propagate, and I've seen tons of ATAs and PBX software that seem to completely ignore DNS TTLs and such. Granted, that's not how it's supposed to be, but that's how it is.
I'm glad this nasty piece of code was located, so each server can operate independently should a large portion of the network go down. But there's gotta be something more done for failover in cases like this. I know I'm told many times when I bring this up that "voip.ms offers a robust network with many servers etc etc", essentially "We don't need failover since our network is so strong". It's gotta get looked into guys, and a little faster than the current priorities have it set at.
I'd really like to see automatic failover using tools such as DNS SRV, as well as maybe primary and backup servers for DIDs etc. It's not much use having DNS SRV if you still can't migrate your DID to a different server. Plus you lose voicemail etc in the dance.
I realize this is like asking to completely rebuild the infrastructure, but it's the "Achilles Heel" of the voip.ms network right now. |
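For readers unfamiliar with the DNS SRV approach suggested above, here is a minimal sketch of the client-side selection it enables. The record values and hostnames are made up for illustration, and the ordering is a simplified version of the priority/weight rule from RFC 2782: lower priority is tried first, and servers sharing a priority are ordered by weighted random choice. No DNS lookup is performed; the records are hard-coded.

```python
import random
from collections import defaultdict

# Hypothetical SRV answers for _sip._udp.example.net (illustrative values only).
RECORDS = [
    {"priority": 10, "weight": 60, "port": 5060, "target": "pop1.example.net"},
    {"priority": 10, "weight": 40, "port": 5060, "target": "pop2.example.net"},
    {"priority": 20, "weight": 0, "port": 5060, "target": "backup.example.net"},
]


def order_srv_records(records, rng=random):
    """Order records by ascending priority, weighted-random within each priority."""
    by_priority = defaultdict(list)
    for rec in records:
        by_priority[rec["priority"]].append(rec)
    ordered = []
    for priority in sorted(by_priority):
        pool = list(by_priority[priority])
        while pool:
            weights = [r["weight"] for r in pool]
            if sum(weights) == 0:
                pick = rng.choice(pool)  # all weights zero: plain random choice
            else:
                pick = rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(pick)
            ordered.append(pick)
    return ordered


def pick_server(ordered, is_up):
    """Return the first record whose server passes the caller's health check, else None."""
    for rec in ordered:
        if is_up(rec):
            return rec
    return None
```

A device that honours SRV would walk the ordered list and register with the first server that answers, so failover does not wait on an A record change to propagate. As the post notes, this only helps with registration; DID routing and voicemail would still need a server-side answer.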
|