2018-02-12

Building For Resilience

All software systems will fail. This is unfortunately a fact of life. Not even the biggest companies talk about 100% uptime. They talk about the number of ‘9’s of uptime a system has (e.g. three 9s is 99.9% and five 9s is 99.999%).

Being able to claim a large number of 9s is a point of pride. While all software fails, developers can show off their skills by claiming to have a large number of 9s. Four 9s would mean about 53 minutes of downtime in a year, five 9s would mean a system is only down about 5.3 minutes in a year, and six 9s would mean about 32 seconds of downtime in a year. Each additional 9 provides a smaller and smaller benefit. Yet each one is also exponentially more difficult to achieve.
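The downtime figures follow directly from the percentages. Here is a small sketch of the arithmetic, assuming a 365 day year; the function name is just something made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: ignores leap years


def downtime_seconds_per_year(nines: int) -> float:
    """Return the downtime per year allowed by an uptime of `nines` 9s."""
    unavailability = 10 ** -nines  # e.g. three 9s -> 0.001 (0.1% downtime)
    return SECONDS_PER_YEAR * unavailability


for nines in range(3, 7):
    minutes = downtime_seconds_per_year(nines) / 60
    print(f"{nines} nines: {minutes:.1f} minutes of downtime per year")
```

Each extra 9 cuts the allowed downtime by a factor of ten, which is why the absolute gain shrinks so quickly: going from four 9s to five 9s saves roughly 47 minutes a year, while going from five to six saves less than 5.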

The same is true of bugs. Bugs aren’t as easy to quantify as uptime, but the idea still applies. Preventing 99.9% of bugs is much easier than preventing 99.99%. And while it is important to prevent bugs, we are still stuck with the reality that some bugs will always be in the system. Is it wise to spend exponentially greater resources trying to prevent a smaller and smaller number of bugs?

At a certain point, there is something much more important than preventing bugs: making a system resilient to bugs.

This means being able to detect bugs when they happen and being able to recover from them. A common place where this applies is preventing data loss.

Let’s say you have a ToDo List application. There is a feature to delete todo list items automatically to help make it easier for users to clean out their todo list.

Making sure this feature works is obviously very important. However, there is always a slight chance that a bug will appear that will delete things improperly. There are a number of things that can be done to make this feature more resilient to those bugs.

The first is backups of the datastore that contains those todo list items. A backup is a complete copy of the entire datastore at a certain point in time. While this is usually done to account for system/hardware issues, these backups can be used to restore data for a single user if necessary. Gmail had an issue in 2011 where data was lost for 40,000 users and they restored user data from backups on tape drives. (Also note: if data loss can happen to Gmail, it can happen to you. You are most likely not better at making software bug-free.)

There are unfortunately multiple problems with using backups in this way. The first is that it is expensive. The backup has to be loaded on separate hardware and then a developer will have to go in, extract the right data, and put it in the production database. All without making any human errors.
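To make the "extract the right data" step concrete, here is a rough sketch of restoring one user's items from a backup. It assumes both the loaded backup and production are SQLite databases with a hypothetical todo_items table; a real datastore and schema will differ, and the function name is made up for illustration.

```python
import sqlite3

# Hypothetical schema, assumed identical in the backup and in production.
SCHEMA = """
CREATE TABLE todo_items (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    title TEXT NOT NULL
)
"""


def restore_user_items(backup: sqlite3.Connection,
                       production: sqlite3.Connection,
                       user_id: int) -> int:
    """Copy one user's rows from the backup into production.

    Rows whose id still exists in production are left untouched.
    Returns the number of rows actually restored.
    """
    rows = backup.execute(
        "SELECT id, user_id, title FROM todo_items WHERE user_id = ?",
        (user_id,),
    ).fetchall()
    before = production.total_changes
    with production:  # commits on success, rolls back on an exception
        production.executemany(
            "INSERT OR IGNORE INTO todo_items (id, user_id, title) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return production.total_changes - before
```

Even in this tiny form there are judgment calls, such as what to do when the user has edited an item since the backup was taken. That is part of what makes this kind of restore expensive and error prone.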

Another issue is that backups tend to be run hourly or daily. Imagine a user spending 20 minutes typing in details for their todo list item. When a bug deletes it, the company’s response is “Oops, we can’t restore your work because we haven’t run a backup in the past 20 minutes.”

That’s going to be one angry user!

Another tactic that can be used in addition to backups is the soft delete. This means that instead of deleting data, a deletion timestamp is stored alongside that data. If the timestamp is missing/null/zero, then the data is not deleted. If the timestamp is a real timestamp, then the data is considered “deleted”.

There is no visible effect here for the user. Everything still functions the same. The only difference is that the data is never permanently deleted from the system. If a bug occurs that deletes more data than it should, that bug would really be setting more timestamps than it should. Recovering from the bug is as simple as finding out what should be restored and clearing the timestamp on that data.
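A soft delete can be sketched in a few lines. This is a minimal example using SQLite, where the todo_items table, the deleted_at column, and the function names are all assumptions made for illustration rather than anything from a real application.

```python
import sqlite3
import time


def create_schema(conn: sqlite3.Connection) -> None:
    # deleted_at is NULL for live rows and a Unix timestamp for "deleted" ones.
    conn.execute(
        "CREATE TABLE todo_items ("
        " id INTEGER PRIMARY KEY,"
        " user_id INTEGER NOT NULL,"
        " title TEXT NOT NULL,"
        " deleted_at REAL)"
    )


def soft_delete(conn: sqlite3.Connection, item_id: int) -> None:
    """Mark an item as deleted by setting its timestamp; the row stays."""
    conn.execute(
        "UPDATE todo_items SET deleted_at = ? "
        "WHERE id = ? AND deleted_at IS NULL",
        (time.time(), item_id),
    )


def restore(conn: sqlite3.Connection, item_id: int) -> None:
    """Undo a soft delete by clearing the timestamp."""
    conn.execute(
        "UPDATE todo_items SET deleted_at = NULL WHERE id = ?", (item_id,)
    )


def visible_items(conn: sqlite3.Connection, user_id: int) -> list:
    """Return the titles the user should see: rows without a timestamp."""
    rows = conn.execute(
        "SELECT title FROM todo_items "
        "WHERE user_id = ? AND deleted_at IS NULL ORDER BY id",
        (user_id,),
    )
    return [row[0] for row in rows]
```

The catch is that every query that shows data to the user has to remember the deleted_at IS NULL filter, so it helps to keep that filter in one place such as visible_items.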

The problem here is that some countries have regulations on making sure user data is actually deleted when the user wants it to be. They usually provide 24 hours of leniency though. This means we can keep data soft deleted for 24 hours and then permanently delete it. 24 hours doesn’t seem like a lot of time, but it’s plenty if bugs are detected early.
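The permanent delete can then run as a scheduled cleanup job. Here is a sketch, again assuming SQLite and a hypothetical todo_items table whose deleted_at column holds a Unix timestamp; the 24 hour window comes from the paragraph above and should be replaced with whatever your regulations actually allow.

```python
import sqlite3
import time
from typing import Optional

RETENTION_SECONDS = 24 * 60 * 60  # assumption: the 24 hour leniency window


def purge_expired(conn: sqlite3.Connection,
                  now: Optional[float] = None) -> int:
    """Permanently delete rows that were soft deleted over 24 hours ago.

    `now` can be passed in for testing; it defaults to the current time.
    Returns the number of rows removed.
    """
    if now is None:
        now = time.time()
    cutoff = now - RETENTION_SECONDS
    with conn:  # commits on success, rolls back on an exception
        cursor = conn.execute(
            "DELETE FROM todo_items "
            "WHERE deleted_at IS NOT NULL AND deleted_at <= ?",
            (cutoff,),
        )
    return cursor.rowcount
```

Rows that were never soft deleted have no timestamp and are never touched, and anything deleted within the window can still be restored.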

That leads to the next tactic: robust logging.

Often when an error occurs, the code will create a log message that contains the error message. This error message will tell you where in the code the error occurred. What it does not tell you is which user the error happened to and what data was affected.

This means you would be completely reliant on all users who encounter a bug to contact you and actually provide all the details you need to track down the issue. Those of us who have spent time in support know that users do not always do that. Including relevant data with errors is critical in being able to recover from those errors.
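Here is a sketch of what including relevant data can look like with Python's standard logging module. The logger name, the function, and the delete_fn callback are hypothetical stand-ins for whatever actually performs the delete in your application.

```python
import logging

logger = logging.getLogger("todo.cleanup")  # hypothetical logger name


def delete_items_logged(user_id, item_ids, delete_fn):
    """Run delete_fn and, on failure, log who and what was affected.

    delete_fn is a stand-in for the real delete routine.
    """
    try:
        delete_fn(user_id, item_ids)
    except Exception:
        # logger.exception records the traceback (where it happened);
        # the message adds the user and the data (who and what).
        logger.exception(
            "auto-delete failed user_id=%s item_ids=%s", user_id, item_ids
        )
        raise
```

With the user and item ids in the log, you can find and repair affected data without waiting for a support ticket. Be careful to log identifiers rather than the contents of the user's data.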

The nice thing about all these tactics is that they are relatively easy to implement. They aren’t substitutes for trying to ensure high quality software. Recovering from errors using these tactics is quite time consuming. Low quality software means spending most if not all of your time handling customer support issues. However, these are important fail-safes in the likely event your code has bugs in it.


Professor Beekums Blog
