Case of the Premature Power Outage
Many years ago I worked on a large system with a number of interconnected services that had grown organically over the years. As new services were added, they became dependent on each other in ways that nobody on the team fully understood. When our hosting provider announced a planned outage, we had to figure out how to safely shut the system off, and more importantly, how to safely and quickly start it back up again afterward. We had never done either of these things before, so we put a lot of effort into making sure it would go smoothly.
I’ll never forget the day. About a week before the planned outage, we were in a big meeting of all the teams. We were talking through our shutdown script, moving things around and making sure everyone agreed on the order. It seemed like we could shut down smoothly if we followed the steps, but one false move could break everything. About 30 minutes into the meeting, someone came running in to announce that the whole system had just gone down.
We rushed back to our desks and started checking. It wasn’t all down, but several things had gone down at once, one of them our main firewall appliance.
We reached out to the data centre operators to get more information. The planned outage was for an upgrade to one of the two power feeds, and they had electricians preparing the new lines in advance. One of those electricians had accidentally hit a live wire with his drill, knocking out one of the feeds.
Normally this wouldn’t be a problem because of the second, redundant feed. Unfortunately, this was when we learned that the data centre operators had not always connected each device’s two power inputs to separate feeds correctly. Our firewall, for example, had both of its plugs connected to the affected feed. Our database servers and NAS were powered redundantly, though, so some parts of the system were able to keep working.
We immediately asked for the firewall to be fixed, and as we identified each remaining server that was down, we asked for it to be corrected as well. Many customers in the data centre had similar issues, so the operators were overwhelmed, and their responses got slower and slower. Things were almost working for us once more when suddenly everything went down again.
Fifteen minutes later, we got an email (sent half an hour earlier) saying that the data centre had to be completely shut down. The building’s air conditioning did not have a redundant power supply, and the interior had reached 60°C (140°F). Cutting the building’s power was the only choice the operators had; they could no longer safely enter the building.
Several hours later, power to the data centre was restored. We didn’t know what state things were in, and the operators didn’t have much time to help us, so they just turned everything on, and we took it from there.
It was a late night, but we had everything running again by the end of that same day. Thanks to diligent administration, redundant storage, and a design that supported regular incremental upgrades, we had no meaningful data loss, and most systems came up correctly on startup. The worst of it was a couple of behind-the-scenes services with interdependencies that had to be restarted a few times.
It was a sobering experience. We had a happy ending, but it could have gone very differently if one of the key systems hadn’t been designed or configured correctly. We had excellent, experienced people on our staff, and that helped a lot, but even experienced people make mistakes.
Regular testing is the only way to know for sure that your redundancy plans are good enough. If redundancy is something your business needs, you had better make sure you’re testing it. Using upgrade procedures that kill processes without any shutdown warning is a great way to make this a regular practice. Netflix’s Chaos Engineering approach, which involves routinely and automatically breaking things in production, also deserves a lot of respect. Recovery drills are tedious and time-consuming, but they too are an important part of ensuring system continuity.
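To make the “kill without warning” idea concrete, here is a minimal sketch, not anything we actually ran, of what such a drill could look like: a small script, run on a schedule, that picks one service at random and kills it with no chance to shut down cleanly, so recovery gets exercised routinely. The service names, the use of pkill, and the Linux host are assumptions for illustration.

    # A minimal chaos-style drill: kill one randomly chosen service outright,
    # mimicking the abrupt failure a power loss produces. Assumes a Linux host
    # with pkill available; the service names below are hypothetical.
    import random
    import subprocess
    import sys

    # Hypothetical non-critical services that are fair game for the drill.
    CANDIDATE_SERVICES = ["report-worker", "thumbnailer", "cache-warmer"]

    def kill_random_service() -> None:
        target = random.choice(CANDIDATE_SERVICES)
        # SIGKILL gives the process no chance to run its shutdown handlers,
        # which is exactly the failure mode we want to rehearse.
        result = subprocess.run(["pkill", "-9", "-x", target])
        if result.returncode == 0:
            print(f"killed {target}; verify it restarts and recovers cleanly")
        else:
            print(f"no running process matched {target}", file=sys.stderr)

    if __name__ == "__main__":
        kill_random_service()

The point is not the script itself but the habit: if something like this runs every week, a service that cannot survive an unceremonious kill gets found long before an electrician’s drill finds it for you.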