The most challenging bug I’ve ever fixed was a performance issue in a matchmaking routine. Matchmaking is the process of finding players to compete against each other in a video game. An excellent matchmaking algorithm doesn’t just stick players together randomly; it tries to make the game more fun by balancing power levels and preventing anyone from waiting too long for a match.

About six weeks before a game I was working on was scheduled to be feature complete, we discovered our matchmaking routine couldn’t handle our load targets. During load tests, the rate at which players were removed from the matchmaking queue started dropping, and things got bad quickly once it fell below the rate at which we inserted them. Not only would this cause a bad user experience if we didn’t fix it; it also made it impossible for us to drive enough traffic to our game servers to verify that they could handle the projected load. The company wasn’t going to release a game that could crash if it was successful, so we had to fix this issue, and we had to fix it quickly.

To make matters worse, we could only reproduce the bug in our production system. We had to simulate hundreds of thousands of users to hit it, which just wasn’t possible on a developer machine. Fortunately, our production system was available for testing. Unfortunately, a team on another continent was running the load tests. It took us a day to deploy a new version and a couple of days for them to run the test. With time zone differences expanding the handoffs, the fastest we could iterate was about once a week.

As I’ve described before, I like to find a bug, understand it, and then fix it. In this case, we couldn’t figure it out by looking at the code, even with many smart people trying. This was one of those rare cases where I had to try fixing a bug before understanding it.

The weeklong iteration cycle also meant I had about a week between tests to review the results, consider multiple hypotheses, and sprinkle logging and profiling code all over the place, so I did that, too.

I plugged away at it week after week, but the problem was stubbornly difficult to find. The failing load tests were visible to our management chain, so I also fielded a fair number of questions about when the problem would be fixed (which I couldn’t estimate), how we were fixing it, and what we would do if we couldn’t fix it. I was also helping the rest of the team with their deliverables, which had the same deadline, so it was stressful for everyone. Nevertheless, I stuck to my process, narrowed down the location of the problem, and tested my fixes as well as I could in my development environment. Eventually, I figured it out.

The problem originated in a database query we used to find matches. It used a cartesian join, whose cost grows quadratically as more rows are added (O(n²) complexity). At some queue length, the processing time jumps from milliseconds to seconds, and then it quickly climbs from seconds to minutes. It also didn’t help that our contrived load test had thousands of users all attempting to make matches with the same power rating, which made it harder to distinguish suitable matches.
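I can’t reproduce the original SQL here, but the shape of the problem looks roughly like the following sketch, which compares every queued player against every other queued player; the function name, fields, and rating threshold are hypothetical.

```python
# Hypothetical sketch of the quadratic shape of the problem: every queued
# player is compared against every other queued player to find an
# acceptable opponent, so the work grows as O(n^2) with queue length.
def find_matches_quadratic(queue, max_rating_gap):
    matches = []
    matched = set()
    for i, a in enumerate(queue):
        if i in matched:
            continue
        for j in range(i + 1, len(queue)):
            if j in matched:
                continue
            b = queue[j]
            if abs(a["rating"] - b["rating"]) <= max_rating_gap:
                matches.append((a, b))
                matched.update((i, j))
                break
    return matches
```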

After all the time it took to find the problem, the fix itself took only a day. I replaced the algorithm with one that processed matches linearly, giving it O(n) complexity.
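I can’t share the actual replacement either, but a minimal sketch of the idea, assuming the queue is already ordered by power rating (for example, pulled from the database with an indexed ORDER BY), is a single pass that pairs adjacent players:

```python
# Hypothetical sketch of a linear replacement: with the queue already
# sorted by rating, one pass over it pairs adjacent players whose
# ratings are close enough, doing O(n) work per matchmaking cycle.
def find_matches_linear(sorted_queue, max_rating_gap):
    matches = []
    i = 0
    while i + 1 < len(sorted_queue):
        a, b = sorted_queue[i], sorted_queue[i + 1]
        if abs(a["rating"] - b["rating"]) <= max_rating_gap:
            matches.append((a, b))
            i += 2  # both players are matched; skip past the pair
        else:
            i += 1  # leave the unmatched player for a later cycle
    return matches
```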

About a week later, I got an email confirming that the issue was fixed. Since the report came in overnight, it was the first thing I read in the morning. It was exhilarating, but the day is still bittersweet in my memory. About an hour later, I attended a last-minute team-wide meeting where we were all let go. The game was never released.