The Startup Slows Down
This is a continuation of the last post on the challenges a hypothetical startup starts seeing as it starts gaining more and more users. We left off having solved the problems of an overloaded web server and putting images in an appropriate storage solution. Unfortunately, the startup’s problems have just gotten started.
While the startup may have fixed things for users, a costly side effect has been introduced. Developers tend to write code on a single computer. Yet now the startup has a number of separate systems in place:
- Load balancer
- Web servers
- Database
- File storage
This means that there will be some bugs that won’t be caught during development. They will only appear once the feature is in front of users! How terrifying.
There is a solution to this, of course. There are many ways to create a simulated environment on a developer’s computer to make it act like the production system that serves users. Unfortunately, this takes quite a bit of time to set up. This setup rarely goes flawlessly either. There are usually kinks that take time to work out. Things also get tougher if some developers prefer to work in different operating systems. Linux, Mac, and Windows all have their own quirks when attempting something as complex as simulating multiple machines.
All said and done, the effort here is probably weeks of work for a couple of people to do the setup. Add in a few hours or days of work over a few months for each developer on the team to work out the issues they see with it. This is a lot of work, but it is well worth it to catch bugs during development instead of letting users see them.
Speaking of users. A successful product has lots of users. Lots of them. A very diverse group of them too. Users on MacBooks. Users on Windows. Users with Firefox. Users with Chrome. Users with iPhones. Users with Android phones. Users with great broadband internet connections. Users in mobile dead zones.
Unfortunately, some bugs will only appear for some of these users. That’s because the startup’s product is software that interacts with other software: operating systems, browsers, etc. That introduces the potential for more bugs because those different operating systems and browsers… work differently.
A small startup may have a limited focus because not having a product at all is worse than having a product that only works in Chrome. With a product that has a growing user base, more attention needs to be paid to making sure the product works everywhere. Growth can’t continue if the product doesn’t work for a large number of potential users. Most people would rather not use a product than switch their browser.
Testing in all those different operating systems and browsers takes time though. It may only be minutes or hours per feature, but that quickly adds up if there are a lot of features.
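To get a feel for how quickly this adds up, here is a rough back-of-the-envelope sketch. The platforms are taken loosely from the list of users above. The feature count and the minutes per check are made-up numbers purely for illustration, and `manual_test_hours` is a hypothetical helper.

```python
from itertools import product

# Platforms loosely based on the user list above; not exhaustive.
OPERATING_SYSTEMS = ["macOS", "Windows", "iOS", "Android"]
BROWSERS = ["Firefox", "Chrome"]


def manual_test_hours(feature_count, minutes_per_check):
    """Estimate hours needed to check every feature on every OS/browser pair."""
    combinations = list(product(OPERATING_SYSTEMS, BROWSERS))
    total_minutes = feature_count * len(combinations) * minutes_per_check
    return total_minutes / 60


# Hypothetical numbers: 40 features, 5 minutes per check.
hours = manual_test_hours(feature_count=40, minutes_per_check=5)
print(f"{hours:.1f} hours for one full pass")
```

Even with only two browsers and a few minutes per check, a single full pass takes more than a day of someone's time under these assumptions.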
A large number of users also means a large number of people using a product differently than was intended. That opens the possibility of users using the product in a different way than was tested by the startup’s team. That almost certainly means more bugs. Bugs that the users will hopefully report to the startup.
While having users report bugs is a good thing, it now creates another challenge. With lots of users come lots of bugs. Spending time fixing bugs means not spending that time building new features. Someone at the startup needs to do cost/benefit analyses in order to make sure time is spent in a way that brings the most value to all users.
In short: time is spent figuring out when to fix a bug and time is spent fixing the bug. A well-engineered product will need to spend less time doing these things, but having a perfect product with no bugs is unrealistic.
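One very simple way to picture that cost/benefit analysis is a score per bug: the benefit of fixing it divided by the cost of fixing it. The sketch below is only an illustration. The `Bug` fields, the 1 to 5 severity scale, the formula, and the example bugs are all assumptions, not a method any particular startup uses.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    title: str
    affected_users: int  # estimated number of users hitting the bug
    severity: int        # hypothetical scale: 1 (cosmetic) to 5 (data loss)
    fix_days: float      # estimated developer days needed for a fix


def priority(bug):
    """Benefit (users x severity) divided by cost (days of work)."""
    return bug.affected_users * bug.severity / bug.fix_days


def rank(bugs):
    """Return bugs ordered from most to least valuable to fix."""
    return sorted(bugs, key=priority, reverse=True)


# Made-up bug reports for illustration.
bugs = [
    Bug("Rare crash on old Android", affected_users=50, severity=5, fix_days=5),
    Bug("Upload fails in Firefox", affected_users=800, severity=4, fix_days=2),
    Bug("Typo on pricing page", affected_users=5000, severity=1, fix_days=0.5),
]
for bug in rank(bugs):
    print(f"{priority(bug):8.0f}  {bug.title}")
```

A real team would weigh more than three numbers, but even a crude score like this makes the trade-off between fixing bugs and building features explicit.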
With ever larger success come ever larger problems. One of the most time-consuming tasks is fixing a database that can’t handle a certain number of users. Maybe the startup needs to optimize how it has set up its existing database. Maybe it needs to switch database technologies entirely. Either way, the startup is going to be spending a lot of time on it.
There are a number of things that eat up time here. The first is that the downtime for users should be minimized. A large enough data set could take weeks to migrate to a new database. How many successful products do you know that decided to go down for a few weeks?
Not many, right? Trying to fundamentally change how a product stores data is going to be a huge effort on its own. Doing so while allowing users to continue to use the product will easily add a 10x multiplier to the effort. Probably more.
Sounds over the top? Think about it a little. Imagine a really large Google Sheets document. Thousands of rows over multiple sheets. You realize you need to restructure everything and move data around. How do you do this while other people are constantly adding new data or modifying the existing data?
Probably won’t go well in a Google Sheets document. You can always ask everyone to stop editing. Startups can’t ask users to stop using their product though. Developers need to put in the time to make it possible for them to use the product while the database is being modified.
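One common pattern for this is to write every change to both the old and the new database while the existing data is copied over in the background in small batches. The sketch below is a toy version of that idea, not how any specific company did it. Plain dicts stand in for real databases, `MigratingStore` and its methods are hypothetical names, and deletes, failures, and concurrent access are all ignored.

```python
class MigratingStore:
    """Serves reads and writes while data moves from old_db to new_db."""

    def __init__(self, old_db, new_db):
        self.old_db = old_db  # plain dicts standing in for real databases
        self.new_db = new_db

    def write(self, key, value):
        # Dual write: every change made during the migration lands in both.
        self.old_db[key] = value
        self.new_db[key] = value

    def read(self, key):
        # Prefer the new store, fall back to the old one for unmigrated keys.
        if key in self.new_db:
            return self.new_db[key]
        return self.old_db[key]

    def backfill(self, batch_size=100):
        """Copy up to batch_size missing records; return how many were copied."""
        missing = [k for k in self.old_db if k not in self.new_db][:batch_size]
        for key in missing:
            self.new_db[key] = self.old_db[key]
        return len(missing)


old = {"user:1": "Ann", "user:2": "Bob"}
store = MigratingStore(old, {})
store.write("user:2", "Bobby")  # a user edits data mid-migration
store.write("user:3", "Cy")     # a user adds data mid-migration
while store.backfill(batch_size=1):  # copy the rest in small batches
    pass
print(store.new_db == store.old_db)
```

Even this toy needs three code paths (write, read, backfill) where a stop-the-world copy needs one, which hints at where the extra effort goes.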
Another thing that takes time is the fact that data is precious. Would you appreciate it if for some reason half your emails disappeared? Or you lost half your friends on Facebook? Users like their data. Any type of switch of a database setup needs to be nearly flawless. Anything less will result in lost users because they won’t trust the product anymore. Flawless work is expensive though.
Lastly, we have to deal with two interesting points. As mentioned above, it takes time to switch database setups. But the decision to switch is only made once the application is starting to fail because the existing database setup can’t handle that many users.
So… what does the startup do while the new database setup is being developed?
Whatever it takes to keep things running. Lots and lots of software duct tape. This part is a serious struggle. Hopefully with success, the startup has enough revenue to hire a separate team to develop the new database setup and another to add duct tape to the existing setup. Having the same team do both will most likely result in the developers not doing their best work and even more time spent in the development because the team members will be tired.
For reference, Yandex spent 3 calendar years and over 10 person-years in order to do this kind of work.
That’s a lot of time spent to build zero features!
While it is easy to associate big companies with bureaucracy that slows them down, it is important to understand that there is more to it. Startups have a disadvantage in that they have fewer resources, but they have a huge advantage in that they don’t have to worry about most of the things a larger company has to worry about. This allows them to move faster at first. Yet most startups have a goal of becoming that big company. As that happens, they will find that success requires a lot of work to maintain. They will find themselves slowing down.