AWS Went Down. Now What?
Amazon Web Services went down last Tuesday. It doesn’t happen often, but it is a harrowing experience when it does.
Understandably, people get angry during these times. They have come to rely on AWS to provide services necessary for their business. Who isn’t going to be angry when their business stops because of something they can’t control? My day job was certainly impacted, as were my businesses.
So what do we do?
There are a lot of options:
- Use another cloud service
- Go multi-cloud
- Go multi-region in AWS
- Go on-prem (use your own data center or colo)
- Do nothing
I’ll be upfront and say I am in the do nothing camp. To understand why, we should go into each of the other options.
The first is to use another cloud service. That’s a perfectly fine solution. At the end of my post on switching costs, I advocate for trying Google Cloud mostly because we want a market with a lot of players.
But that’s a big move for one outage. If a cloud provider were going down multiple times a year, or even every year, I’d consider a move.
Widespread AWS outages are not that frequent. There may be some event every year, but it will affect a service here or there. Even during this past one, EC2 and RDS were operational the entire time. The laboratory in my day job was not impacted at all. Eight One Books was not impacted at all. Dynomantle, which heavily uses S3, was only impacted for the latter half of the outage.
AWS outages are noticeable. That’s because of how many big services use them. That’s not necessarily a bad thing though. AWS can’t hide their outages because of their size. Even if they don’t update their status page in a timely manner, everyone knows when they go down.
But every cloud will go down at some point. The scale and complexity of the services involved guarantee it. What’s important is making those outages as infrequent as possible and handling them well when they happen.
I have definitely experienced a cloud provider go down and pretend it didn’t (NOTE: this was not Google Cloud). The outage was localized to a handful of services and they weren’t as big as AWS so they just quietly fixed it after a while. Yet during that time, my team and I went crazy trying to figure out what was up. We combed through all our configurations and code thinking it was us when it wasn’t.
I’m not saying AWS is filled with saints. I’m saying their size means they can’t hide even if they wanted to. That’s a feature.
Multi-cloud and multi-region are also valid strategies. They are not free, however. There’s no checkbox that instantly gives you this. It takes a lot of time to do it right. Every cloud has data transfer costs. AWS charges you for going between regions. Running parallel infrastructure can get expensive as well.
And of course, a failover environment doesn’t really provide much peace of mind unless you periodically test it. Doing that well is also a big time investment.
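To make that concrete, here is a minimal sketch of what an automated failover check might look like, assuming a hypothetical health endpoint in the standby region that reports status and replication lag. The URL, response fields, and threshold are placeholders, not anything a cloud gives you out of the box:

```python
# A minimal sketch of a periodic failover check. The endpoint URL,
# response fields, and threshold below are hypothetical placeholders.
import json
import urllib.request

STANDBY_HEALTH_URL = "https://standby.example.com/health"  # hypothetical
MAX_REPLICATION_LAG_SECONDS = 60

def standby_is_ready() -> bool:
    """Return True if the standby environment looks able to take traffic."""
    with urllib.request.urlopen(STANDBY_HEALTH_URL, timeout=5) as resp:
        health = json.load(resp)
    if health.get("status") != "ok":
        return False
    lag = health.get("replication_lag_seconds", float("inf"))
    return lag <= MAX_REPLICATION_LAG_SECONDS

if __name__ == "__main__":
    print("standby ready" if standby_is_ready() else "standby NOT ready")
```

A real version would also exercise DNS failover and run a synthetic transaction end to end on a schedule, which is where the time investment really starts to add up.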
More importantly, multi-cloud adds a huge amount of complexity to your infrastructure. If you implement it well, it can provide more reliability. That’s a big if. Not handling that complexity well can result in less reliability and more outages.
Multi-cloud and multi-region make sense for a lot of bigger businesses. But you really need to do a thorough cost-benefit analysis to see whether it is right for you.
The on-prem strategy is… well this one I’ve got some personal trauma around so I’ve got a lot of bias here.
Fun story for anyone thinking of going on-prem after the #AWS outage: I once saw a data center go down because some construction workers cut the primary networking cables and the backups didn't have enough bandwidth.
— Professor Beekums (@PBeekums) December 7, 2021
There are very few situations where going on-prem makes sense. Dropbox makes sense because they offer storage at a cheaper rate than S3. Hybrid cloud makes sense if you have a physical component to your product such as a warehouse or a laboratory.
But for most other situations, going on-prem is much, much more expensive than you think.
I could start listing all the things you have to take care of, such as cosmic rays or making sure it doesn’t rain inside. For every one, someone will go “That’s easy to solve! Git gud noob.”
The problem is not that going on-prem gives you hard problems to solve. The problem is the quantity of problems you have to deal with when going on-prem. There are so many that I doubt any one person can possibly know them all. I’ve seen people claim they can and the results were… the opposite of spectacular.
You’re going to need a team. A large one. With good ops people. Who get paid very well.
That bit is often the most underestimated part when I’ve seen people do a cost-benefit analysis of cloud vs. on-prem. You may get away with a skeleton crew for the initial build. The number of people you need to prepare for every possible scenario that can take down your data center is an order of magnitude (maybe two) greater.
You can of course try to get lucky and not do all that preparation. Maybe you’ll be fine for a few years. But when you see an outage, and you will see an outage, you’re unlikely to fix it in 24-48 hours like AWS. You’ll be lucky if you can fix it in weeks.
That moves us into the do nothing option.
For those of us using AWS, what did we do during this outage?
In my first AWS outage, I tried to do quite a bit. I frivolously tried to access all our systems to no avail. I doomscrolled and refreshed the AWS status page constantly. I stayed awake the entire time so I could check all our systems when AWS came back. This was over 10 years ago.
With this last outage, I relaxed. I periodically checked the status page. Talked to some peers dealing with the same issue. Told my team not to worry. Made a list of things to sanity check. Went to sleep. Ran the sanity checks the next day. Everything recovered fine.
The team at AWS did the hard work. They did the prep. They handled the stress of the situation. They fixed it. They made sure everything came back online in a good state. And they do the hard work of minimizing outages in the future.
I have never been through an AWS outage where any of my actions during the outage actually mattered. Sure, I run a sanity check of the systems, but I’ve never seen something not come back online okay. Never had data loss or corruption. The output of my actions was to make me feel better about doing something. I never really need to do anything though.
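For a rough picture of what a post-outage sanity check can look like, here is a sketch. The health endpoint and bucket name are placeholders, and your own list will look different:

```python
# Rough sketch of a post-outage sanity check. The endpoint URL and
# bucket name are placeholders for whatever your systems actually use.
import urllib.request

import boto3

HEALTH_URL = "https://app.example.com/health"  # hypothetical app endpoint
BUCKET = "my-app-bucket"                       # hypothetical S3 bucket

def app_responds() -> bool:
    """The application answers requests again."""
    with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
        return resp.status == 200

def s3_round_trip() -> bool:
    """We can still write to S3 and read the object back."""
    s3 = boto3.client("s3")
    s3.put_object(Bucket=BUCKET, Key="sanity-check", Body=b"ok")
    body = s3.get_object(Bucket=BUCKET, Key="sanity-check")["Body"].read()
    return body == b"ok"

if __name__ == "__main__":
    for name, check in (("app health", app_responds), ("s3 round trip", s3_round_trip)):
        print(name, "ok" if check() else "FAILED")
```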
Ironically, it is big tech companies like Amazon that have made our entire profession obsess over the number of 9s of availability. It feels like it matters a lot.
Sure, if you’re down all the time you will lose customers. But how many customers will you actually lose from going down 1-2 days every few years?
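Some quick back-of-the-envelope arithmetic (my own math, not any SLA) puts those numbers in perspective:

```python
# What each extra nine buys you, and what "down two days every
# three years" actually works out to. Back-of-the-envelope math only.
HOURS_PER_YEAR = 24 * 365

for availability in (0.99, 0.999, 0.9999):
    downtime_hours = HOURS_PER_YEAR * (1 - availability)
    print(f"{availability:.2%} availability -> {downtime_hours:.1f} hours of downtime per year")

# Two full days of downtime over three years:
print(f"{1 - 48 / (3 * HOURS_PER_YEAR):.2%} availability")  # roughly 99.8%
```

Two days down every three years is still roughly 99.8% availability.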
How many people do you think unsubscribed from Netflix because of this AWS outage?
How many people do you think will stop going to McDonald’s because of this AWS outage?
How many people do you think uninstalled Venmo because of this AWS outage?
World of Warcraft used to go down every Tuesday for maintenance during the first few years after launch. That product saw over 15 years of success.
If you have a product that people want, an occasional outage is not going to cost you much.
See you at the next AWS outage.