Many years ago I worked on a large system with a number of interconnected services that had grown organically over the years. As new services were added, they became dependent on each other in ways that nobody on the team fully understood. When our hosting provider announced a planned outage, we had to figure out how to safely shut the system off, and more importantly, how to safely and quickly start it back up again afterward. We had never done either of these things before, so we put a lot of effort into making sure it would go smoothly.
Case of the Slow Matchmaking Routine
The most challenging bug I’ve ever fixed was a performance issue in a matchmaking routine. Matchmaking is the process of finding players to compete against each other in a video game. An excellent matchmaking algorithm doesn’t just stick players together randomly; it tries to make the game more fun by balancing power levels and preventing anyone from waiting too long for a match.
About six weeks before a game I was working on was scheduled to be feature complete, we discovered our routine couldn’t handle our load targets. The rate at which players were being removed from the matchmaking queue started dropping during load tests. Things got bad quickly once it fell below the rate at which we inserted them. Not only would this cause a bad user experience if we didn’t fix it, but it made it impossible for us to drive enough traffic to our game servers to test that they could handle the projected load. The company wasn’t going to release a game that could crash if it was successful, so we had to fix this issue, and we had to fix it quickly.
Case of the Appearing Users
A couple of years after solving the Case of the Disappearing Users, I was assigned another high-profile bug: new users were being spontaneously created. They were generated without a name or any profile information, but they still filled up space in lists and appeared on schedules. A couple of other developers had tried to fix it without success, so it was assigned to me.
I went through my usual bag of tricks: searched recent changes, searched for insert statements, tried to create empty users manually (and couldn’t). Nothing worked, and it was looking pretty hopeless.
Case of the Disappearing Users
Many years ago I worked on a program that had a serious problem: the users in one customer’s system were getting deleted periodically. When a user was deleted, any data linked with them was also deleted. We could restore the data from backups, but it was a difficult process, and having a system that loses data wasn’t great for our reputation, so we wanted to resolve it quickly. Our VP of development tried to find the issue first, but after a day without any progress he assigned the issue to me.