Ann Arbor Area Business Monthly
Small Business and the Internet
IT: Information Turbulence
By Mike Gould
Eight miles high and when you touch down
You’ll find that it’s stranger than known…
The Byrds 1966
Or maybe not so strange; airline computer systems have been crashing fairly often of late. It’s the height of the tourist travel season, and stuff keeps breaking. We should be used to it by now.
As I write this in my cozy underground computer center here at MondoDyne Whirled HQ, a lot of the country’s vacationers are recovering from a major mishap at Delta Airlines. On Monday, August 8, 2016, a power failure, compounded by backup systems that did not kick in as designed, brought down the entire Delta digital infrastructure. Monday saw around 1,000 flights cancelled and another 3,000 delayed. The problems lasted through Wednesday; three days later, they are gone. Until the next time.
A similar event befell Southwest just three weeks before. A failed router was to blame for the Southwest outage, and the Delta debacle was caused by a blown “switchgear” – a piece of equipment like a circuit breaker that is ironically designed to guard against glitches like this.
There have even been problems with the iPads the pilots use to hold their flight plans and airport maps. American Airlines had to delay 74 flights in 2015 when third-party software caused bugs in the data needed by their pilots. A work-around ensued whereby the pilots were able to grab paper charts, a solution not possible with wide-scale computer problems.
When Good Gear Attacks
In the above recent cases, the problem was mostly hardware, not software, which I find interesting because the software that runs most airlines these days is a kludge job of really, really old code that has had various new pieces of functionality grafted, patched, and maybe epoxied onto it. [Geekspeak: kludge (v. or n.) - a workaround or quick-and-dirty solution that is clumsy, inelegant, inefficient, difficult to extend and hard to maintain. Per Wikipedia.]
The amazing thing is that it mostly works. 8.3 bazillion air travelers put in their eight miles high travels year after year, mile after mile, endless hour after endless hour. Each trip involves plane schedules across multiple states, time zones, and sometimes countries.
Each passenger needs a reservation, a ticket, a boarding pass, a (frequently updated) gate assignment, a seat, luggage checks, and a TSA pat-down. And all of this (well, maybe not the pat-down) requires immense amounts of data in enormous databases on big hairy servers in the bowels of who knows where. And then all that data has to be moved back and forth between online booking apps, smartphones, airline terminal terminals and (hopefully) onto big backup servers that offer redundancy when things go ker-blooey. Wheels-up mode. Dead parrot.
And when that happens, it’s like the butterfly effect; a plane late at the gate in Hong Kong causes a ripple effect of schedule snafus around the world. And this was no butterfly. I marvel that they were able to put this Humpty-Dumpty back together at all.
Going Down, Backing Up
Another problem is that a lot of these backups aren’t done in real time; they happen maybe two or three times a day. So if things go south at 3:00, and they have to re-load, they get the state of things three or four hours before. Given the incredible interlocked-ness of all those plane schedules, it is no wonder that disaster events take quite a while to untangle and repair. The New York Times has a good discussion of this at the URL below.
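For the arithmetically inclined, here is a minimal sketch of that backup gap in Python. Everything in it is an assumption made up for illustration: the date is arbitrary, the twice-a-day schedule (midnight and noon) is hypothetical, and the helper name last_backup_before is mine, not anything the airlines actually run.

```python
from datetime import datetime, timedelta

def last_backup_before(failure, backup_times):
    """Return the most recent backup taken at or before the failure."""
    earlier = [t for t in backup_times if t <= failure]
    if not earlier:
        raise ValueError("no backup exists before the failure")
    return max(earlier)

# Hypothetical schedule on an arbitrary day: backups at midnight and noon.
day = datetime(2020, 1, 1)
backups = [day, day + timedelta(hours=12)]

# Things go south at 3:00 in the afternoon.
failure = day + timedelta(hours=15)

restore_point = last_backup_before(failure, backups)
lost_window = failure - restore_point  # everything in this window must be redone

print(lost_window)  # prints 3:00:00
```

Every booking, gate change, and seat swap made inside that lost window has to be reconstructed some other way, which is where the untangling time goes.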
The Ker-Blooey Problem
This is where the problem was – the failure of an inexpensive part the size of an elongated lunchbox caused a power failure that brought down a boatload of spinning hard drives and their controllers, and chaos ensued. The gear that was supposed to supply the backup power didn’t work (maybe a software failure there – we don’t know), and that was that for a lot of irate customers.
With all those interlocking software pieces, sometimes the only way to fix it is to shut everything down, and bring it back online a piece at a time. This can sometimes take days, involving large crews of highly skilled, totally stressed out IT folk. It is to their credit that they scrambled and fixed it as quickly as they did.
I envisage some old COBOL programmer being roused in the middle of the night by a deeply shaken systems manager: “Carl, Carl, you gotta come down to the center and re-boot the AE-35 Unit – it isn’t recognizing our password!”
Every time this happens, the airlines lose millions of dollars and incalculable amounts of customer goodwill; you would think they would immediately drop everything and invest a few million to keep it from happening again. Well, they are pursuing various programs of updates, but they can’t really drop everything to work on this – they have planes in the air and ticket agents to support. It’s like piston replacement done while the motor is running.
And the kind of bottom-up wholesale software modernization that is needed is really, really expensive. Given the financial problems that the airlines are just now recovering from, it is unlikely that major software infrastructure re-writes will happen anytime soon.
The good news is that this wasn’t a terrorist attack on our airline industry. Just the usual human screw-ups, managerial mistakes, and Murphy’s law in full careen. Investigations are underway, and, hopefully, these particular errors won’t happen again at those airlines.
One would like to hope that the airlines exchange trouble reports among themselves, so that they are not all subject to the same failure of the AE-35 Unit, and if they are, that they all have Carl’s number on speed dial.
NY Times article:
Mike Gould isn’t flying anywhere this summer, was a mouse wrangler for the U of M for 20 years, runs the MondoDyne Web Works/Macintosh Training/Digital Photography mega-mall, is a laser artist, performs with the Illuminatus 3.0 Laser Lightshow, and welcomes comments addressed to firstname.lastname@example.org.
Entire Site © 2018, Mike Gould - All Rights Reserved