The Hardest Bug I Have Had To Fix
A topic appeared on Quora about the hardest bugs people have had to work with. This was mine.
I spent years blaming compilers, libraries, SDKs, etc. for my bugs, only to find out each time that it was my own fault. This ingrained in me the habit of always blaming myself first. That habit ended up hurting me with this one bug, which really did turn out to be a third party’s fault.
Context: I inherited a major system whose original developer was no longer with the company. It used reflection everywhere, so the code was difficult to trace, and it relied on long-running PHP processes (intentional infinite loops) on a version of PHP where not all errors could be caught. There was also a home-rolled watchdog process to restart those long-running PHP processes whenever they failed. Oh, and there were settings files that had to be edited directly on production servers because we didn’t have config management in place.
Needless to say, the system was extremely fragile and I had to deal with production failures often. I refactored bits and pieces of it as much as I could in the time I was given, but the system still had some core architectural issues. One day, all the long-running PHP processes started hanging. Not dying, or the watchdog would have restarted them. Hanging, so each process looked fine from the watchdog’s perspective. I tried restarting the processes manually and that worked… for all of 15 minutes.
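A watchdog that only checks whether a process is alive cannot tell a hung worker from a healthy one. One common way to close that gap is a heartbeat: the worker records progress after every job, and the watchdog treats prolonged silence as a failure. Here is a minimal sketch of that idea, in Python rather than PHP. The file-based heartbeat and the names `beat` and `is_hung` are illustrative assumptions, not how the watchdog in this story worked.

```python
import os
import time


def beat(path):
    """Worker side: call after each job to signal that progress is being made."""
    # Create the heartbeat file if it does not exist yet.
    with open(path, "a"):
        pass
    # Bump the modification time to the current time.
    os.utime(path, None)


def is_hung(path, max_silence_seconds, now=None):
    """Watchdog side: True if the worker has been silent for too long."""
    if now is None:
        now = time.time()
    try:
        last_beat = os.path.getmtime(path)
    except FileNotFoundError:
        # The worker never checked in at all.
        return True
    return (now - last_beat) > max_silence_seconds
```

A watchdog built this way would kill and restart any worker for which `is_hung` returns true, instead of waiting for the process to die on its own.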
Another developer came by to help me out and we tried all sorts of things. We traced through the code as best we could. We threw debug statements in the code and traced through the production logs. Then we kept adding debug statements in the hopes that something would help. We had seen data corruption in the past, so we examined all the data we could find for anything that might be causing the hangs. We tried to replicate that data in our development environments to reproduce the problem. Eventually we had to set up a cron job that restarted the processes every 15 minutes just so we could go home and get some sleep.
The next day the processes started hanging every 5 minutes. At that point, cron was no good anymore, so we had to solve it that day. Did I mention that if these processes didn’t run, revenue couldn’t flow into the company? Well, there is that.
Neither of us wanted to do it, but we were forced to start using strace on these hung processes. I also forgot to mention that the processes would kill themselves off after a number of successful jobs (at which point the watchdog restarts them), so we couldn’t assume that a process had hung when strace stopped. We would have to detect the process restart and run strace again. It was the right call though, because strace told us that the issue was a race condition in some bad socket connection code in the third-party library that was reading messages from our message queue. I was surprised because this library had been working fine for 3 years. Since we could not reproduce the problem in our development environments, we regression tested an upgrade and then released it in the blind hope that it would work. Fortunately it did, though the nature of race conditions means the problem could still be there. We wanted to do a big refactor, but everyone else saw that the fix worked and thought it was time to build more features!
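Socket code that waits indefinitely for data is one way a worker can hang without ever dying. A general defense, regardless of which library is at fault, is to put a deadline on every read so a stalled connection surfaces as an error the worker can handle. The sketch below shows the idea in Python. The helper name `recv_with_deadline` is hypothetical, and this is not the fix the library upgrade actually shipped.

```python
import socket


def recv_with_deadline(sock, max_bytes, timeout_seconds):
    """Read from sock, returning None instead of blocking forever when no data arrives."""
    sock.settimeout(timeout_seconds)
    try:
        # Returns b"" if the peer closed the connection cleanly.
        return sock.recv(max_bytes)
    except socket.timeout:
        # Nothing arrived before the deadline; let the caller reconnect or retry.
        return None
```

With a deadline in place, the caller can drop the connection and reconnect when it gets `None`, rather than sitting in a read that never returns.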