My company runs a 24/7 site with a substantial number of users and connections to partner systems all over the world. We do what we can to make the system fault tolerant, but problems can still appear at any time of day or night. Ideally we would have a technical support team that’s staffed around the clock, but that’s not in the cards for now.
We used to have a system where anyone who could fix a problem would get a text message, and whoever was closest to a computer would respond. It worked most of the time, but there were some issues. The team creating the tickets was often unsure which component was broken, so the alerts would hit lots of people who were unable to help. It also exposed a quirk of human behaviour, the bystander effect, that makes us less likely to help when more people are available.
Thankfully the whole plan is being reworked. Now a single volunteer will carry a phone that gets all the support tickets. They will respond to each issue within a reasonable amount of time and, if they can’t solve it themselves, find other people to help. The company has also added decent incentives to be on call, and is investing in tools to make the job easier.
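For the developers reading along, the new routing can be sketched in a few lines of Python. This is only an illustration of the idea, not our actual tooling: the `OnCallRota` class, its method names, and the people and components in the example are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OnCallRota:
    """Hypothetical model of the new plan: one first responder,
    plus a list of people who know each component."""

    first_responder: str
    experts: dict[str, list[str]] = field(default_factory=dict)

    def recipients(self) -> list[str]:
        # Every ticket goes to the single on-call volunteer,
        # no matter which component the ticket names.
        return [self.first_responder]

    def escalation_contacts(self, component: str) -> list[str]:
        # Only consulted when the first responder cannot solve the issue.
        # Unknown components simply yield no contacts.
        return [
            person
            for person in self.experts.get(component, [])
            if person != self.first_responder
        ]


# Made-up names and components, purely for illustration.
rota = OnCallRota(
    first_responder="alice",
    experts={"database": ["bob", "alice"], "payments": ["carol"]},
)
```

The point of the design is that a wrongly labelled ticket no longer wakes up a dozen people: it reaches one person, who then decides who else is needed.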
As part of the preparation, I’ve been asked to do presentations and create supporting documentation for all of the first responders. I enjoy doing the presentations, and everyone seems to be benefiting from the content, but it’s a lot more work than you might expect. The group of people I’m working with consists mostly of heads-down developers. They’re smart people, and all eager to learn, but most don’t have any IT experience.
My first session covered these topics:
how to think and act like a seasoned IT pro
a brief description of our hosting environment, servers, and where the various applications are deployed
how to recognize network problems, hardware problems, and OS problems
the symptoms to expect when any of our key systems fail
how to safely restart our key systems
It took me about an hour and a half to prepare, and two hours to present. I’ve spent about a day of effort writing it down, but would guess that I’ve covered less than half of the material I talked about. To be fair, the written material is laid out quite differently from the presentations; I am trying to capture a series of troubleshooting guides that can be used in an emergency.
Are you curious about the difference between a developer and an IT pro? The most significant difference I see is that we developers are used to an environment where we can test changes quickly and safely: change a line of code, hit F5, and see what happens. IT pros take the opposite approach because the cost of a bad change in a live production environment can be devastating. They need to be certain what any change will do before it’s made, and be prepared to roll it back if there’s any hint of trouble. It can be frustrating for a developer to work this way because it’s a serious effort to make even simple changes, but it’s this mentality that keeps the world’s servers running.
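To put that mindset in terms a developer will recognize, here is a minimal Python sketch of "apply, verify, and roll back at any hint of trouble". It is an illustration only, not a tool we use: `apply_with_rollback` and the three callables it takes are hypothetical names, and the dictionary merely stands in for a live setting.

```python
from typing import Callable


def apply_with_rollback(
    apply: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change, check it, and undo it if anything looks wrong.

    Returns True if the change was kept, False if it was rolled back.
    """
    try:
        apply()
        healthy = verify()
    except Exception:
        # An error while applying or verifying counts as trouble.
        healthy = False
    if not healthy:
        rollback()
    return healthy


# A config value stands in for a production setting (hypothetical).
config = {"timeout_seconds": 30}
previous = dict(config)  # snapshot taken before touching anything


def bad_change() -> None:
    config["timeout_seconds"] = -1


def restore() -> None:
    config.clear()
    config.update(previous)


kept = apply_with_rollback(
    apply=bad_change,
    verify=lambda: config["timeout_seconds"] > 0,
    rollback=restore,
)
```

Here the verification fails, so `kept` is `False` and `config` ends up back at its original value. The detail worth noticing is that the snapshot and the rollback step exist before the change is made, which is exactly the habit that feels slow to a developer used to hitting F5.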