How to Fix a Bug
Building applications can be tricky, and it’s inevitable that mistakes will be made. As a result, we programmers spend a lot of time fixing bugs. Sometimes they are easy, but sometimes they can be pretty tough to figure out.
I’ve fixed a lot of bugs in my career, and to be honest with you, I usually enjoy the process. These days I am typically assigned the super urgent bugs that nobody else can figure out, and I kind of like it that way. Don’t get me wrong, I don’t like the bugs being there, but I enjoy being helpful and figuring out tough problems. I also think my successes have helped improve my reputation, which is always a good thing.
Fixing tough bugs can take a lot of hard work, but I’ve made the job considerably easier through a set of tricks and a process I’ve developed over my career. I never thought of these as particularly special, or as a secret weapon, but I realize now that they aren’t really taught. They are important though, so I wrote them down.
This is the process I generally follow when I’m approaching a bug. I will obviously not go through a formal process for the super simple and obvious ones, but for the toughest nuts, all the steps here will help me get it over the finish line.
Step 1: Read the Bug Report
When a bug report comes across my desk, I read it and make sure that I understand it. Fixing a bug properly takes time and carries its own risks, so I want to be sure I’m fixing the right bug.
If I can’t understand the bug report, this is also a good place to stop and request more information. I wrote another post about writing good bug reports.
Step 2: Reproduce the Problem
The most important step to fixing a bug is reproducing it. Even if the problem seems obvious, even if I am feeling particularly lazy, I force myself to do it anyway. If I can’t, I have no way to test my fix later. It also means I won’t be able to use a debugger to figure out what is happening when it fails.
Sometimes it’s possible to write an automated test before finding the problem, and if I can, I will. It makes debugging and testing considerably faster. In complicated software this can even be faster than starting up all the required pieces to test an issue manually.
It can be difficult to reproduce an intermittent issue, but I still put in the effort for the same reasons. I’ve sometimes had luck wrapping unit tests in a for loop, but it won’t help if the causes are environmental.
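As a minimal sketch of the for-loop trick, the helper below reruns a test until it fails and reports which attempt tripped it. The names repeat_until_failure and flaky_test are made up for illustration, and the flaky test is a deterministic stand-in so the example is repeatable. A real intermittent failure would not be this predictable.

```python
from typing import Callable, Optional, Tuple


def repeat_until_failure(
    test: Callable[[], None], attempts: int
) -> Optional[Tuple[int, Exception]]:
    """Run `test` up to `attempts` times.

    Returns (attempt number, error) for the first failing run,
    or None if every run passed.
    """
    for attempt in range(1, attempts + 1):
        try:
            test()
        except Exception as error:  # AssertionError is a subclass of Exception
            return attempt, error
    return None


# Hypothetical stand-in for an intermittent test: it fails on every 7th call.
calls = {"count": 0}


def flaky_test() -> None:
    calls["count"] += 1
    assert calls["count"] % 7 != 0, "intermittent failure reproduced"


failure = repeat_until_failure(flaky_test, attempts=100)
if failure is not None:
    attempt, error = failure
    print(f"failed on attempt {attempt}: {error}")
```

A loop like this only helps when the cause lives inside the test process, such as a race or leftover shared state. It does nothing for causes that come from the environment.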
There have been rare occasions when I’ve had to attempt fixing a bug before I can reproduce it, but I will only let this happen in extreme circumstances. When doing this kind of thing I also make sure to communicate it clearly with everyone involved.
Step 3: Find the Problem
This can be the most infuriating part of fixing bugs, but it is essential. There are a bunch of techniques that can work here. For this post I’ll stick to the tricks I use the most.
The first thing I do is read the error details again. If there is an error message in the bug report it can carry a lot of clues. It’s tempting to skim over it, and I have made this mistake plenty of times myself. Now I make sure I not only read the error message, but also understand it. This has saved me a lot of time. System errors can be especially helpful since they tend to be specific. Stack traces are also incredibly valuable. If I don’t understand what an error code means I look it up.
In the absence of a stack trace, I like to narrow down where the problem is occurring. I start by visualizing all the steps through the system that are involved in the malfunctioning operation, then I choose some point in the middle. Using my trusty debugger, I check whether things have already gone wrong at that point. If they have, the bug is somewhere earlier in the flow; if not, it is somewhere later. I continue along in a binary search pattern until I have narrowed down the problem to a specific spot.
For example: if I’m fixing a bug in a web application where a user’s name is getting saved incorrectly, I can start by checking the web request in my browser’s developer tools. If the request is wrong, I know the bug is on the client side. If the request is correct, the bug will be somewhere in the server. Assuming it’s in the server, I might set a breakpoint between my business layer and repository layer to narrow it down further. Continuing in this way I can find the exact location of the bug quickly and reliably.
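The narrowing-down can be sketched as an ordinary binary search over the steps in the flow. This is an illustration under assumptions, not a tool: the stage names and observed values are invented, and in practice each "check" is me looking at a breakpoint or a request, not a lookup in a list. It also assumes that once the value has gone wrong it stays wrong at every later step.

```python
from typing import Callable


def first_bad_index(count: int, is_bad: Callable[[int], bool]) -> int:
    """Return the first index in range(count) where is_bad(index) is True.

    Assumes every index before the answer is good, every index from the
    answer onward is bad, and the last index is known to be bad.
    """
    low, high = 0, count - 1
    while low < high:
        middle = (low + high) // 2
        if is_bad(middle):
            high = middle  # already broken here, so look earlier
        else:
            low = middle + 1  # still fine here, so look later
    return low


# Hypothetical flow for the "name saved incorrectly" example.
stages = [
    "browser request",
    "controller",
    "business layer",
    "repository layer",
    "database row",
]
expected = "Ada Lovelace"
# What a debugger would show at each stage (invented values).
observed = ["Ada Lovelace", "Ada Lovelace", "Ada Lovelace", "Ada Lovelac", "Ada Lovelac"]

culprit = first_bad_index(len(stages), lambda i: observed[i] != expected)
print(f"value first goes wrong at: {stages[culprit]}")
```

With five stages this takes two checks instead of up to five, and the saving grows quickly as the flow gets longer.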
It works for more than just software problems too. I’ve used the same approach to diagnose load balancer problems, problems with components in Kubernetes, computer hardware, even electrical wiring.
In some cases it’s easier to figure out what change introduced the bug instead. I will use the same kind of binary search pattern, but checking out commits between releases. A quick build time and a unit test make this a lot less painful. I’ve never had a bug where it was a good fit, but you can also try the git bisect command to automate the process fully.
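For anyone who does want to automate it, git bisect run accepts a script and reads its exit code: 0 means the commit is good, 125 means the commit cannot be tested and should be skipped, and any other code from 1 to 127 means the commit is bad. Below is a minimal sketch of such a script. The build and test commands in the closing comment are placeholders, not commands from any real project.

```python
import subprocess
import sys
from typing import Sequence

# Exit codes that `git bisect run` understands.
GOOD, BAD, SKIP = 0, 1, 125


def classify(build_ok: bool, test_ok: bool) -> int:
    """Map build and test outcomes to an exit code for `git bisect run`."""
    if not build_ok:
        return SKIP  # a commit that does not build tells us nothing
    return GOOD if test_ok else BAD


def succeeded(command: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(command).returncode == 0


def main(build_command: Sequence[str], test_command: Sequence[str]) -> int:
    build_ok = succeeded(build_command)
    test_ok = build_ok and succeeded(test_command)
    return classify(build_ok, test_ok)


# When saved as a script, end it with a call such as the following
# (both commands are hypothetical placeholders):
#   sys.exit(main(["make"], [sys.executable, "-m", "unittest", "tests.test_names"]))
```

Assuming the script is saved as check.py, the session would look something like: git bisect start, git bisect bad, git bisect good followed by the last known good release, then git bisect run python check.py.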
If I ever get stuck in my investigation, I go back to the beginning and re-check all my assumptions. Did I read the bug right? Was the process in a healthy state when it occurred? Am I looking at the right version of the code? Did the feature ever work? Did I misclassify a success or failure when I did the binary search? Even if all my assumptions were correct, going through the problem again can sometimes spark new theories for investigation. This is also where I start when someone else asks me for help with their bugs.
If all else fails, or even if it hasn’t, searching on the internet can help. This is especially true for third party components or services. Be careful though: I am finding internet resources less and less helpful, although your results may vary. Even though I’ve wasted a lot of time following red herrings from random internet threads, this is still sometimes the best option available.
Step 4: Test My Theory
I find it easy to jump to conclusions when I’m debugging, but my experience has taught me to approach bugs with a scientific kind of skepticism. Once I am pretty sure I know what the problem is, if it’s not a trivial change, I like to isolate it and prove to myself that I understand it.
I will write an automated test that reproduces the problem if at all possible. It might seem quicker to fix the bug first, but starting with the test will make the process simpler. Writing a failing test proves that you understand the bug, and it also proves that the test will fail if you don’t successfully fix it.
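Here is a small sketch of the test-first idea using the name-saving example from earlier. Everything in it is hypothetical: normalize_name_buggy stands in for the code under suspicion, normalize_name_fixed for the eventual fix, and the helper runs the same test against both so the failing and passing states are visible side by side.

```python
import unittest
from typing import Callable


def normalize_name_buggy(name: str) -> str:
    # Hypothetical bug: str.title() lowercases capitals inside a word,
    # so "McDonald" is saved as "Mcdonald".
    return name.strip().title()


def normalize_name_fixed(name: str) -> str:
    # Hypothetical fix: only tidy the whitespace and leave the casing alone.
    return " ".join(name.split())


def make_test(normalize: Callable[[str], str]) -> type:
    """Build the regression test against a given implementation."""

    class SaveNameTest(unittest.TestCase):
        def test_preserves_inner_capitals(self) -> None:
            self.assertEqual(normalize("  Ronald McDonald "), "Ronald McDonald")

    return SaveNameTest


def passes(normalize: Callable[[str], str]) -> bool:
    """Run the regression test and report whether it succeeded."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_test(normalize))
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


print("before the fix:", passes(normalize_name_buggy))
print("after the fix:", passes(normalize_name_fixed))
```

In a real codebase there would be only one implementation and one test. The point is the order: watch the test fail first, then change the code, then watch the same test pass.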
If you can’t write a test because that isn’t your team’s practice then you have my sympathies. On the bright side, you’ll be getting a lot more experience fixing bugs!
Step 5: Fix the Problem
I will remove some bad code and / or put some more good code in.
Step 6: Look for Similar Bugs
Sometimes a bug indicates a pattern of bugs. Before I fling my fix back out into the world I like to do a little research to see if the same mistake has been made in other places.
For example: if I was fixing a bug caused by a query operator that isn’t supported by an older database server version, I can do a quick search to see if the operator was used anywhere else. Since I’ve already figured out how to fix it, I can fix them all at the same time and eliminate a whole bunch of bugs.
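A quick search like that can be as simple as grep, but here is a minimal Python sketch for cases where I want the results as data. The function name, the file suffixes, and the $newOperator placeholder are all assumptions made for illustration. A real search would use whichever operator caused the bug.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def find_usages(
    root: Path, needle: str, suffixes: Iterable[str] = (".py", ".sql")
) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for each line containing `needle`."""
    wanted = tuple(suffixes)
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in wanted:
            continue
        # Replace undecodable bytes instead of crashing on an odd file.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits


def report(root: Path, needle: str) -> None:
    """Print one line per usage so each can be reviewed and fixed."""
    for name, number, line in find_usages(root, needle):
        print(f"{name}:{number}: {line}")


# Example call (the directory and operator are placeholders):
#   report(Path("src"), "$newOperator")
```

A plain substring match will miss usages that are built up dynamically, so I treat the output as a starting list and not as proof that nothing else is affected.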
Step 7: Test My Fix
Since I almost always have automated tests, I can check if I’ve fixed the problem pretty easily.
I’ll usually do a manual test as well to make sure I really have fixed the issue. Sometimes a bug has more than one cause, or is repeated in more than one place. To be honest, I often find this step tedious, so I have to remind myself why it’s important. A lot of time can get wasted if there was some aspect I missed. Also, if I’m going to put my name on a fix, I want people to be able to depend on it actually being fixed.
Step 8: Understand the Bug’s Impact
It depends on the bug, but if there is some damage left behind, I make sure to consider its impact. Sometimes this takes a bit of experimentation, but it is important. Sometimes a cleanup script is necessary, or sometimes manual steps can be provided to correct the issue. At the very least I want to make sure I can tell my stakeholders what the impact was.
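When a cleanup script is warranted, I find it safer to separate "what would change" from "change it". The sketch below is one way to do that and makes several assumptions: the records live in a plain dictionary instead of a real database, and the repair function is a made-up example of undoing stray whitespace. The dry-run list doubles as an impact report for stakeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Change:
    record_id: int
    before: str
    after: str


def plan_cleanup(records: Dict[int, str], repair: Callable[[str], str]) -> List[Change]:
    """Dry run: list the records the repair would change, modifying nothing."""
    changes: List[Change] = []
    for record_id, value in sorted(records.items()):
        fixed = repair(value)
        if fixed != value:
            changes.append(Change(record_id, value, fixed))
    return changes


def apply_cleanup(records: Dict[int, str], changes: List[Change]) -> int:
    """Apply planned changes, skipping records that changed since planning."""
    applied = 0
    for change in changes:
        if records.get(change.record_id) == change.before:
            records[change.record_id] = change.after
            applied += 1
    return applied


# Hypothetical damaged data: one name was saved with stray whitespace.
records = {1: "Ada Lovelace", 2: "  Grace   Hopper ", 3: "Alan Turing"}
changes = plan_cleanup(records, lambda name: " ".join(name.split()))
print(f"{len(changes)} of {len(records)} records affected")
applied = apply_cleanup(records, changes)
```

Checking the "before" value at apply time means a record that someone corrected by hand in the meantime is left alone.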
Step 9: Bug Retrospective (Post-mortem)
Once in a while I take a bit of time to reflect on the bugs I’ve encountered. How did the bug escape in the first place? Is it likely that similar bugs will be introduced again? Can I introduce tools or change processes to make this class of bug less likely to occur?
Some organizations have a formal post-mortem process for impactful issues. This is a great way to ensure a team is learning from its mistakes. I have introduced this process in a few of my teams and highly recommend it.
Even for bugs with less impact it can be worth spending a bit of time thinking about this. It’s not always feasible to prevent some types of bugs, but as craftspeople we should be trying!