I really like Amazon EC2. Hosting companies everywhere should be learning about virtualization and web services, because this kind of thing is clearly the future of managed hosting.
This past Tuesday, I posted about my codepad.org project on news.YC. To my surprise, it got a lot of attention. Traffic increased by 1000x. That still amounts to only a modest amount of traffic, but the increase was very sudden, and I'm the only one working on the site. And as it turned out, I had a serious bug in my backend execution servers. I managed to keep the site working most of the time anyway, though, because EC2 saved my bacon.
First, a couple details about the architecture: codepad.org's backend untrusted-code-execution servers run in heavily firewalled virtual machines on Amazon EC2 instances (which are themselves firewalled and virtual -- it's like turtles all the way down or something.) The web frontend, running on a traditional colocated server, transmits code to the backend and gets output in response. The backend servers don't store any state between requests, so they're easy to replace, and easy to scale horizontally.
At the beginning of the influx of traffic on Tuesday, I had only one execution server instance running. I quickly discovered that it couldn't keep up with the load, so I launched a couple more instances, and had my web frontend load-balance between them. This is the benefit I expected from EC2: requisitioning new machines takes seconds rather than days, and cloning them from existing servers is effortless.
Then I discovered that I had a bug. My execution servers would occasionally stop servicing requests for one language or another. At first, when this happened, I would just replace the machine in a hurry, and then continue to hunt for the bug. Because of the firewalling and virtualization involved, just replacing the whole machine was a lot more convenient than trying to swap in a new server process would have been.
As traffic increased, the lock-ups started happening too frequently for me to replace machines by hand. The first thing I did was to spin up 10 machines, so that load on each machine would be lower, making failures both less frequent and less problematic (since a single-machine failure would only affect 10% of users.) Not long after that, I got a script running that would switch out all 10 machines for new ones every hour. This actually got the site working flawlessly, before I'd even found the bug! It allowed me to get some sleep on Tuesday night.
Of course, I could have done the same thing at the process level, rather than the machine level. But I've done things like that in the past, and I was surprised by how much easier this was. If you don't write your servers right, switching them out when they fall over becomes a mess of sorting out config files and ports. If they use a lot of memory, you can get seriously stuck, because running them in parallel can cause them to start swapping when you need it least. With machines, there's no chance that they'll step on one another's toes like that. You just start up a new one, change the address in the frontend config, and kill the old one.
Another benefit of this approach is that the routines I built up in my machine-swapping script are of general utility. Parts of that script are going to be useful in building a system to launch and kill machines automatically in response to fluctuations in load. I'm really curious to find out whether, by using only the number of machines I need at any particular time, I'll be able to make EC2 cheaper than traditional managed hosting, despite slightly higher costs per machine-hour. Is there enough variance (and slow enough variations) in typical web site traffic to do that?