Friday, March 7, 2008

Scaling at 2am with EC2

I really like Amazon EC2. Hosting companies everywhere should be learning about virtualization and web services, because this kind of thing is clearly the future of managed hosting.

This past Tuesday, I posted about my codepad.org project on news.YC. To my surprise, it got a lot of attention. Traffic increased by 1000x. That still amounts to only a modest amount of traffic, but the increase was very sudden, and I'm the only one working on the site. And as it turned out, I had a serious bug in my backend execution servers. I managed to keep the site working most of the time anyway, though, because EC2 saved my bacon.

First, a couple of details about the architecture: codepad.org's backend untrusted-code-execution servers run in heavily firewalled virtual machines on Amazon EC2 instances (which are themselves firewalled and virtual -- it's like turtles all the way down or something). The web frontend, running on a traditional colocated server, transmits code to the backend and gets output in response. The backend servers don't store any state between requests, so they're easy to replace, and easy to scale horizontally.

At the beginning of the influx of traffic on Tuesday, I had only one execution server instance running. I quickly discovered that it couldn't keep up with the load, so I launched a couple more instances, and had my web frontend load-balance between them. This is the benefit I expected from EC2: requisitioning new machines takes seconds rather than days, and cloning them from existing servers is effortless.
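A minimal sketch of that kind of frontend load balancing, assuming a simple round-robin policy with failover. This is not codepad.org's actual code: the `RoundRobinBalancer` class, the `send` callable, and the use of `ConnectionError` to signal a dead backend are all hypothetical. It works only because the backends are stateless, so any of them can serve any request.

```python
from typing import Callable, List, Optional, Sequence


class RoundRobinBalancer:
    """Hands each request to the next backend in rotation (hypothetical helper)."""

    def __init__(self, backends: Sequence[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends: List[str] = list(backends)
        self._next = 0  # index of the backend that gets the next request

    def execute(self, code: str, send: Callable[[str, str], str]) -> str:
        """Send `code` to a backend, trying each backend at most once.

        `send(backend, code)` is an assumed transport function that returns
        the program output or raises ConnectionError if the backend is down.
        """
        count = len(self._backends)
        start = self._next
        self._next = (self._next + 1) % count
        last_error: Optional[Exception] = None
        for offset in range(count):
            backend = self._backends[(start + offset) % count]
            try:
                return send(backend, code)
            except ConnectionError as exc:
                last_error = exc  # fall through to the next backend
        raise RuntimeError("no execution server available") from last_error
```

Adding a newly launched instance then amounts to constructing the balancer with a longer address list.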

Then I discovered that I had a bug. My execution servers would occasionally stop servicing requests for one language or another. At first, when this happened, I would just replace the machine in a hurry, and then continue to hunt for the bug. Because of the firewalling and virtualization involved, just replacing the whole machine was a lot more convenient than trying to swap in a new server process would have been.

As traffic increased, the lock-ups started happening too frequently for me to replace machines by hand. The first thing I did was to spin up 10 machines, so that load on each machine would be lower, making failures both less frequent and less problematic (since a single-machine failure would only affect 10% of users). Not long after that, I got a script running that would switch out all 10 machines for new ones every hour. This actually got the site working flawlessly, before I'd even found the bug! It allowed me to get some sleep on Tuesday night.

Of course, I could have done the same thing at the process level, rather than the machine level. But I've done things like that in the past, and I was surprised by how much easier this was. If you don't write your servers right, switching them out when they fall over becomes a mess of sorting out config files and ports. If they use a lot of memory, you can get seriously stuck, because running them in parallel can cause them to start swapping when you need it least. With machines, there's no chance that they'll step on one another's toes like that. You just start up a new one, change the address in the frontend config, and kill the old one.
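The start, repoint, kill sequence can be sketched roughly as follows, applied to a whole fleet on a timer. This is an illustration under assumptions, not the original script: `launch`, `update_frontend`, and `terminate` are hypothetical callables standing in for whatever EC2 API calls and frontend config changes are actually used, and launching every replacement before terminating any old machine is one possible ordering (it briefly doubles the instance count).

```python
import time
from typing import Callable, List


def rotate_fleet(
    current: List[str],
    launch: Callable[[], str],
    update_frontend: Callable[[List[str]], None],
    terminate: Callable[[str], None],
) -> List[str]:
    """Replace every machine in `current` with a fresh one and return the new fleet."""
    # 1. Start one replacement per existing machine.
    fresh = [launch() for _ in current]
    # 2. Point the frontend at the replacements before touching the old machines.
    update_frontend(fresh)
    # 3. Only now kill the old machines, which no longer receive traffic.
    for machine in current:
        terminate(machine)
    return fresh


def rotate_periodically(
    fleet: List[str],
    launch: Callable[[], str],
    update_frontend: Callable[[List[str]], None],
    terminate: Callable[[str], None],
    rounds: int,
    interval_seconds: float = 3600,  # once an hour, as in the post
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Wait `interval_seconds`, rotate the fleet, and repeat `rounds` times."""
    for _ in range(rounds):
        sleep(interval_seconds)
        fleet = rotate_fleet(fleet, launch, update_frontend, terminate)
    return fleet
```

Passing `sleep` and the three actions in as parameters keeps the rotation logic testable without touching real machines.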

Another benefit of this approach is that the routines I built up in my machine-swapping script are of general utility. Parts of that script are going to be useful in building a system to launch and kill machines automatically in response to fluctuations in load. I'm really curious to find out whether, by using only the number of machines I need at any particular time, I'll be able to make EC2 cheaper than traditional managed hosting, despite slightly higher costs per machine-hour. Is there enough variance (and slow enough variations) in typical web site traffic to do that?
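One way to frame that question is as a break-even calculation. The sketch below assumes billing per machine-hour and a fixed fleet that must be sized for peak load. The hourly demand profile and the function names are made up for illustration; they are not measurements from codepad.org.

```python
from typing import Sequence


def machine_hours_on_demand(needed_per_hour: Sequence[int]) -> int:
    """Machine-hours paid for when running only what each hour needs."""
    return sum(needed_per_hour)


def machine_hours_fixed(needed_per_hour: Sequence[int]) -> int:
    """Machine-hours paid for when a fixed fleet is sized for the peak hour."""
    return max(needed_per_hour) * len(needed_per_hour)


def break_even_price_ratio(needed_per_hour: Sequence[int]) -> float:
    """Largest on-demand/fixed price ratio at which on-demand still costs no more.

    This equals peak load divided by average load.
    """
    return machine_hours_fixed(needed_per_hour) / machine_hours_on_demand(needed_per_hour)


# Hypothetical day: 1 machine is enough for 16 hours, 3 are needed for 8 hours.
demand = [1] * 16 + [3] * 8
ratio = break_even_price_ratio(demand)  # 72 fixed vs. 40 on-demand machine-hours
```

With this made-up profile, on-demand machines could cost up to 1.8 times as much per machine-hour as fixed ones before the advantage disappears. The flatter the traffic, the closer that ratio gets to 1.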

8 comments:

Unknown said...

As soon as a site becomes big enough to run across more than one instance (which isn't really *that* much), this can definitely bring benefits if you have a flexible system for adding and removing servers.

Nobody's quite got automating it right just yet (RightScale is doing interesting things with EC2 in that respect), and we (at FlexiScale) are planning on releasing this functionality for our own platform later this year.

It can, however, be done relatively easily on an individual basis if you have taken scaling into account in your application, by setting up something simple to watch the load on the servers and use the relevant API to launch or shut down servers.

Thorsten von Eicken said...

Very cool post, Steven, and congrats on surviving the traffic onslaught! Scripting the whole boot process through getting the app running and tied into the deployment is definitely the way to go. As toons mentioned (thanks), RightScale is all about making this easier through a web dashboard and extensive scripting and server template support. And most of it is actually free...

Regarding your question about scaling up and down with load: almost all web sites have a 3x-10x variance between peaks and troughs. But you need enough traffic to require more than 1-2 machines at peak, else there's not much to scale down. Looking at your use-case, the way we'd implement it in RightScale is by controlling the scaling from the monitoring data. We'd write a rule that says "if >50% of your execution servers have >50% cpu busy time, then launch 2 additional execution servers". A similar rule can be used to prune the number of servers when they become mostly idle. This could definitely save you a pile of money.
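As an illustration only (this is not RightScale's implementation or API), the quoted rule could be sketched as a plain function over per-server CPU readings. The `scaling_decision` name, the 10% idle threshold, the one-at-a-time pruning, and the `min_servers` floor are all assumptions; the comment only specifies the scale-up half of the rule.

```python
from typing import Sequence


def scaling_decision(
    cpu_busy: Sequence[float],
    busy_threshold: float = 0.5,   # ">50% cpu busy time" from the quoted rule
    quorum: float = 0.5,           # ">50% of your execution servers"
    scale_up_step: int = 2,        # "launch 2 additional execution servers"
    idle_threshold: float = 0.1,   # assumed meaning of "mostly idle"
    min_servers: int = 1,          # assumed floor so the fleet never empties
) -> int:
    """Return how many servers to add (positive) or remove (negative).

    `cpu_busy` holds one busy fraction (0.0 to 1.0) per running server.
    """
    if not cpu_busy:
        raise ValueError("need a CPU reading from at least one server")
    total = len(cpu_busy)
    busy = sum(1 for load in cpu_busy if load > busy_threshold)
    if busy / total > quorum:
        return scale_up_step
    idle = sum(1 for load in cpu_busy if load < idle_threshold)
    if idle / total > quorum and total > min_servers:
        return -1  # prune one server at a time
    return 0
```

For example, readings of `[0.9, 0.8, 0.2]` give `2` (two of three servers are busy), while `[0.05, 0.02, 0.6]` give `-1`.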

Todd Hoff said...

I was wondering if you had a number on the cost of spinning up all those instances?

Steven Hazel said...

Thorsten:

Do you have a source on that 3x-10x range for variance? That's good info. In conversation about this entry last night, the point came up that diurnal cycles should have lower amplitude for sites with a more international user base, since their users aren't all sleeping at the same time. But only down to a certain level, because of the effect of the size of the Pacific Ocean!

The load-based rules that both you and toons allude to are of course what I'm planning to implement.

Steven Hazel said...

Todd:

I'm using all small instances, which cost $0.10/hr (plus bandwidth). The time I spent with 10 instances running probably cost me between $25 and $50.

Your Correspondent said...

If you can use a fraction of a machine-month, can you use 10% of a machine for a month?

I was under the impression that EC2 billed for whole machine-months or fraction thereof.

Can I run a teeny app for a teeny price on 1% of a machine for a month and pay 1% of the monthly price?

This would be nice for testing and staging. Also demos.

Happy Bear said...

1000x traffic sounds impressive but as you said it was still a modest amount in total. I'd be curious to hear what that number was.

I think EC2 is vastly overrated, because people are simply used to bloated shared server environments with 1,000 other customers on the same machine. You get yourself a dedicated server and it can handle a lot more than you would expect. I have a $1500 server that handles almost 10 million hits per day to Apache/PHP.

I don't need no stinking EC2 :P

trungson said...

Any creative ideas for how to scale down if there is existing data on the instance?