Wednesday, July 8, 2015

"Hey my map isn't loading" or the tale of the failing switchover

So I'm just going about my normal business as IT (Reddit, "Have you reloaded?", "You didn't reload.") when a user complains of her mapping system not loading. I go over and see that it is in fact not loading and a refresh is certainly not fixing it. I get it to finally load and then head back to my desk.
Not two minutes later: "Hey, my phone isn't loading." "Help! Nothing is loading." Cue panic mode. Nothing was keeping a constant connection, calls were dropping left and right, and there was a loud hum of users all thinking they were the first to have an issue. I sat down at a few machines to try to diagnose what was going on, but nothing was apparent.
I'm back at my desk still churning away at solutions when I hear somebody say, "Hey I can't clock out"... Then it hits me. Our time card system is IP-locked, so it can only be used when coming from a certain IP. We had been failing over to our secondary Internet connection, which in most circumstances would be fine. Well, this time it wasn't. I pop open a shell and constantly poll my current external IP:
168.xx
168.xx
168.xx
75.xx
75.xx
75.xx
75.xx
75.xx
168.xx
168.xx
168.xx
168.xx
About every ten seconds we had a new external IP. Obviously bad for VOIP communication. We set up most of our users on a backup wifi connection and a couple of hotspots just to get them by while we sorted out what to do.
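The polling described above can be approximated with a short script. This is only a sketch, not what the original poster ran: `find_changes` and `poll` are hypothetical helper names, and `fetch_ip` stands in for whatever lookup returns your current external address (the story does not say which command or service was used).

```python
import time
from typing import Callable, Iterable, List, Tuple


def find_changes(samples: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (index, old_ip, new_ip) for every sample where the IP differs from the previous one."""
    changes = []
    previous = None
    for index, ip in enumerate(samples):
        if previous is not None and ip != previous:
            changes.append((index, previous, ip))
        previous = ip
    return changes


def poll(fetch_ip: Callable[[], str], count: int, interval: float = 1.0) -> List[str]:
    """Call fetch_ip() `count` times, sleeping `interval` seconds between calls."""
    samples = []
    for i in range(count):
        samples.append(fetch_ip())
        if i < count - 1:
            time.sleep(interval)
    return samples


if __name__ == "__main__":
    # The sequence from the story, with the addresses elided as in the original.
    observed = ["168.xx"] * 3 + ["75.xx"] * 5 + ["168.xx"] * 4
    for index, old, new in find_changes(observed):
        print(f"sample {index}: {old} -> {new}")
```

Passing the lookup in as a function keeps the sketch independent of any particular IP-echo service, so it can be tried against recorded samples like the ones above.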
After forty minutes of talking to our NOC (network operations center) and emailing back and forth, we still didn't really have an answer. My boss comes over and says, "Why don't we just unplug the primary?" OHMYGODWHYDIDNTWEDOTHIS30MINUTESAGO. Secondary became the only connection, and although it was a slower circuit, we had a constant connection.
What caused all this? Primary ISP had a fiber line cut in the area. Our routers saw those issues and started failing over to secondary (mapping not really loading/VOIP being dropped). Ten seconds go by and the routers think that the main connection is back up (which it was, intermittently) and switch back over. Lather, rinse, repeat for forty minutes.
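The flapping described here is what a hold-down (or dampening) rule is meant to prevent: fail over immediately, but fail back only after the primary has stayed healthy for a while. The sketch below is purely illustrative and is not how the routers in the story were configured; `choose_links` and `hold_down` are made-up names, and each boolean stands for one health check of the primary link.

```python
from typing import Iterable, List


def choose_links(primary_up: Iterable[bool], hold_down: int) -> List[str]:
    """Pick the active link after each health check.

    Fails over to secondary as soon as the primary check fails, and fails
    back only after `hold_down` consecutive successful primary checks.
    """
    active = "primary"
    good_streak = 0
    chosen = []
    for up in primary_up:
        # Count consecutive healthy checks; any failure resets the streak.
        good_streak = good_streak + 1 if up else 0
        if active == "primary" and not up:
            active = "secondary"
        elif active == "secondary" and good_streak >= hold_down:
            active = "primary"
        chosen.append(active)
    return chosen


if __name__ == "__main__":
    flapping = [True, False, True, False, True, False]
    print(choose_links(flapping, hold_down=1))  # switches on every check
    print(choose_links(flapping, hold_down=3))  # stays on secondary
```

With `hold_down=1` the sketch reproduces the back-and-forth from the story; with a larger value the intermittent primary never earns its way back, which is roughly what unplugging it achieved by hand.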
And now I'm back home and have 30 second delays on IRC :| (Same ISP)
TL;DR Internet connection issues, routers can't handle the truth
EDIT: Grammar

- User parkerlreed on Reddit


I found this story very interesting. It is funny how simple a solution can sometimes be, however temporary.
