Case of the Slow Matchmaking Routine

The most challenging bug I’ve ever fixed was a performance issue in a matchmaking routine. Matchmaking is the process of finding players to compete against each other in a video game. An excellent matchmaking algorithm doesn’t just stick players together randomly; it tries to make the game more fun by balancing power levels and preventing anyone from waiting too long for a match.

About six weeks before a game I was working on was scheduled to be feature complete, we discovered our routine couldn’t handle our load targets. The rate at which players were being removed from the matchmaking queue started dropping during load tests. Things got bad quickly once it fell below the rate at which we inserted them. Not only would this cause a bad user experience if we didn’t fix it, but it made it impossible for us to drive enough traffic to our game servers to test that they could handle the projected load. The company wasn’t going to release a game that could crash if it was successful, so we had to fix this issue, and we had to fix it quickly.

[Read More]

Case of the Appearing Users

A couple of years after solving the Case of the Disappearing Users, I was assigned another high-profile bug where new users were being spontaneously created. They were being generated without a name or any profile information, but still filling up space in lists and appearing on schedules. A couple of other developers had tried fixing it but had no luck, so it was assigned to me.

I went through my usual bag of tricks: searched recent changes, searched for insert statements, tried to create empty users manually (and couldn’t). Nothing worked, and it was looking pretty hopeless.

[Read More]

Is the Bug Fun?

There are many things about producing video games that are surprising, but one of the weirdest has to be the approach to bugs. As with any piece of software, bugs are found through testing or user reports, triaged, then assigned to developers. Unlike teams building normal business software, though, game teams also ask the question, “is the bug fun?”

There are plenty of unintended features (bugs) in games that became beloved. Attack combos were an accident in Street Fighter II, but they became so popular that they are a part of basically every fighting game now. Rocket jumps are another example. The internet is full of examples.

[Read More]

Case of the Disappearing Users

Many years ago I worked on a program that had a serious problem: the users in one customer’s system were getting deleted periodically. When a user was deleted, any data linked with them was also deleted. We could restore the data from backups, but it was a difficult process, and having a system that loses data wasn’t great for our reputation, so we wanted to resolve it quickly. Our VP of development tried to find the issue first, but after a day without any progress he assigned the issue to me.

[Read More]

How to Fix a Bug

Building applications can be tricky, and it’s inevitable that mistakes will be made. As a result, we programmers spend a lot of time fixing bugs. Sometimes they are easy, but sometimes they can be pretty tough to figure out.

I’ve fixed a lot of bugs in my career, and to be honest with you, I usually enjoy the process. These days I am typically assigned the super urgent bugs that nobody else can figure out, and I kind of like it that way. Don’t get me wrong, I don’t like the bugs being there, but I enjoy being helpful and figuring out tough problems. I also think my successes have helped improve my reputation, which is always a good thing.

[Read More]

How to Report a Bug

Nobody likes bugs, least of all programmers. No matter how hard we try to catch them early, some will always escape into circulation. Until computers are smart enough to do what we meant instead of what we said, users are going to keep finding bugs, and we’re going to keep fixing them.

Before a bug is fixed, it needs to be reported. Unfortunately, it’s not uncommon to receive incomplete reports. We can spend a lot of time hunting and making guesses, and sometimes that’s enough, but if we can’t figure out the problem, it’s pretty hard to fix it. This can be especially unfortunate when the stakes are high, and oddly, this is when it also seems to be the most common.

[Read More]

Reading Server Graphs: Connected Users

I’ve spent the last several years working on multi-user server systems in two different companies. Both those companies had a giant monitor hanging off a wall showing a graph of connected users. It won’t give you detailed diagnostic information, but it is a good indicator for the health of your servers, and your product generally. If you learn to notice certain patterns in your user graph, it can also save you precious time when things go wrong.

[Read More]

InstallUtil and BadImageFormatException - Facepalm

I had a frustrating issue at work this week: one that was easy to fix, but embarrassingly difficult to find. I came pretty close to giving up, which is not a solution I often explore, but in the end we figured it out and got everything working.

A member of our operations team was installing a Windows service I’d built to monitor some stuff in our production environment. I’ve made a few Windows services in my day, and installed them many times on many machines. I’d even installed this one on my development machine with no issue. In our staging environment, however, this is what we got:

[Read More]

Doubling Data for Performance Testing

Or: The Most Impressive T-SQL Script I’ve Ever Written

I was recently working on a new application. After three months in the field, users were starting to complain about performance issues. We had done some limited performance tuning for the first release, and more as part of the second release, but new issues were popping up as more data got entered into the system. We could have continued fixing issues as they came up, one release at a time, but we wanted to get ahead of the problem, and the client wanted to know that the system would remain usable without developer intervention for a few years at least.

[Read More]

Teaching IT

My company runs a 24/7 site with a substantial number of users and connections to partner systems all over the world. We do what we can to make the system fault tolerant, but problems can still appear at any time of day or night. Ideally we would have a technical support team that’s staffed around the clock, but that’s not in the cards for now.

We used to have a system where anyone who could fix a problem would get a text message, and whoever was closest to a computer would respond. It worked most of the time, but there were some issues. The team creating the tickets would often be unsure of which component was broken, so the issues would hit lots of people who would be unable to help. It also exposed a quirk of human behaviour that makes us less likely to help when more people are available.

[Read More]