Thursday, March 13, 2008

Tree Lists as a Default List Data Structure

I'd like to call your attention to Tom Kleinpeter's new blog, in which he's been posting lots of great stuff about his experiences building and scaling Audiogalaxy and FolderShare. I worked with Tom at both of those companies, helping him build the p2p clients and backend systems. If you know me, you know that I think most software engineers are pretty crappy. Statistically speaking, I probably think you're a pretty crappy engineer. Tom, on the other hand, is a really good engineer, one of the best I've worked with. His blog is part of what inspired me to finally finish setting up this one.

Anyway, in an earlier article, he mentioned skip lists and unrolled link lists, which reminded me of my love for this category of list data structures.

Most languages that have a built-in list type have one of two alternatives (and often strangely lack any standard implementation of the other alternative):

  • Linked lists are the popular built-in list type in functional programming languages like Scheme and Haskell. They allow fast O(1) inserts and deletes at any point in the list. Unfortunately, lookup is O(n), which makes many common insert and delete operations (those that don't start from an iterator in the correct position in the list) O(n) as well. Memory overhead for linked lists involves at least one link pointer for every element in the list. All in all, they're an algebraically fundamental but pragmatically somewhat specialized type of list.

  • Array lists (that is, lists implemented as arrays which are dynamically resized to fit the length of the list, usually by doubling/halving the array length when necessary) are popular in many other dynamic languages, such as Python, Ruby, and Perl, and are the most popular type of list in the collection libraries of many langauges that don't have built-in list support, including C++, Java, and C#. Since they require shifting large portions of the array around in memory, inserts, deletes, and prepends are O(n), but appends and lookups are O(1) (and cache-friendly). Amortized memory overhead is lower than for linked lists, since link pointers are not required, and in the worst case the overhead is equal to that of linked lists. They're practical for a wide range of applications, but fall down when you want to insert or delete in the middle of the list.

Tree-based list data structures — let's call them "tree lists" — are a compromise between those two options. The basic strategy for tree lists is to split up your list into segments stored as leaves in a tree (as either linked lists or array lists). Lookups in a tree list require tree traversal, but not traversal of the entire list. Inserts and deletes may require shifting some elements around in memory, but not many of them. Inserts, deletes, prepends, appends, and lookups all wind up in the neighborhood of O(log n). That's acceptable performance for a wide variety of operations.

Skip lists and unrolled link lists, though they use unusual representations of trees, are both examples of this strategy. (The major trade-off between them is that by using linked lists for the leaves, skip lists retain the O(1) insert/delete performance of linked lists, but at the cost of being less cache-friendly. Unrolled linked lists make the opposite choice.) Files in many filesystems are also implemented as tree lists, because of another advantage of these structures over array lists: segmenting the list allows for non-contiguous storage, which can help avoid expensive move operations in the heap as well as on disk.

BList is an implementation of a type of tree list in Python using B+Trees, which have the advantage that for small lists, they retain the excellent performance characteristics of small array lists. Python has rejected the proposal that standard Python lists be replaced with BList, and I can see why that makes sense for Python; drastically changing the performance characteristics of a basic type used in most existing code seems like a bad idea. But I think it would be interesting for a new language to adopt something like BList for its built-in list type. Tree lists in general seem like a good generic choice for lists, since from a performance perspective they're less specialized than linked lists and array lists.

Friday, March 7, 2008

Scaling at 2am with EC2

I really like Amazon EC2. Hosting companies everywhere should be learning about virtualization and web services, because this kind of thing is clearly the future of managed hosting.

This past Tuesday, I posted about my project on news.YC. To my surprise, it got a lot of attention. Traffic increased by 1000x. That still amounts to only a modest amount of traffic, but the increase was very sudden, and I'm the only one working on the site. And as it turned out, I had a serious bug in my backend execution servers. I managed to keep the site working most of the time anyway, though, because EC2 saved my bacon.

First, a couple details about the architecture:'s backend untrusted-code-execution servers run in heavily firewalled virtual machines on Amazon EC2 instances (which are themselves firewalled and virtual -- it's like turtles all the way down or something.) The web frontend, running on a traditional colocated server, transmits code to the backend and gets output in response. The backend servers don't store any state between requests, so they're easy to replace, and easy to scale horizontally.

At the beginning of the influx of traffic on Tuesday, I had only one execution server instance running. I quickly discovered that it couldn't keep up with the load, so I launched a couple more instances, and had my web frontend load-balance between them. This is the benefit I expected from EC2: requisitioning new machines takes seconds rather than days, and cloning them from existing servers is effortless.

Then I discovered that I had a bug. My execution servers would occasionally stop servicing requests for one language or another. At first, when this happened, I would just replace the machine in a hurry, and then continue to hunt for the bug. Because of the firewalling and virtualization involved, just replacing the whole machine was a lot more convenient than trying to swap in a new server process would have been.

As traffic increased, the lock-ups started happening too frequently for me to replace machines by hand. The first thing I did was to spin up 10 machines, so that load on each machine would be lower, making failures both less frequent and less problematic (since a single-machine failure would only affect 10% of users.) Not long after that, I got a script running that would switch out all 10 machines for new ones every hour. This actually got the site working flawlessly, before I'd even found the bug! It allowed me to get some sleep on Tuesday night.

Of course, I could have done the same thing at the process level, rather than the machine level. But I've done things like that in the past, and I was surprised by how much easier this was. If you don't write your servers right, switching them out when they fall over becomes a mess of sorting out config files and ports. If they use a lot of memory, you can get seriously stuck, because running them in parallel can cause them to start swapping when you need it least. With machines, there's no chance that they'll step on one another's toes like that. You just start up a new one, change the address in the frontend config, and kill the old one.

Another benefit of this approach is that the routines I built up in my machine-swapping script are of general utility. Parts of that script are going to be useful in building a system to launch and kill machines automatically in response to fluctuations in load. I'm really curious to find out whether, by using only the number of machines I need at any particular time, I'll be able to make EC2 cheaper than traditional managed hosting, despite slightly higher costs per machine-hour. Is there enough variance (and slow enough variations) in typical web site traffic to do that?

Monday, March 3, 2008

Murmur and Hash Tables for Scalability

Murmur is a new (non-cryptographic) hash function. Tanjent says: "Way faster than Hsieh, better mixing than Jenkins, as simple as FNV." He's also made some speed and mixing improvements since its initial release.

This got me thinking about hash-based load distribution, in which the modulus of a hashed identifier is used to route a request to the machine that will be responsible for handling it, so that load is distributed in a way that allows server machines to keep local state across a series of requests (many load balancers call this "affinity"). Under heavy load, fast hash functions with poor mixing can route too many requests to one machine, overwhelming it while the others have plenty of unused capacity. Cryptographic hash functions will eliminate that problem by distributing requests more evenly, but are sometimes too CPU-intensive on the router. Functions like Hsieh and Jenkins are useful in cases like that, and Murmur might be even better.

I was talking about this with my brother, and he came up with a way to trade away memory for CPU in applications where the number of possible identifiers is small enough: use a very fast hash function as the basis of a large hash table on the router, mapping identifiers to slow-computed hashes which can in turn be used to select servers. The first time you see an identifier, you perform a slower hash, store it in your local hash table, and use the slow-computed hash value to pick a server. Next time you see the same identifier, you'll find the slow-computed hash in your table at the reduced CPU cost of a much quicker hash. If your identifiers are unlimited, though, your hash table will just keep growing, unless you turn it into an LRU cache by culling old entries.

Another approach is to generate a random number instead of using a slow-computed hash. The hash function and random number generation are fast, and a pseudo-random number generator should get a very even distribution. An identifier will only map the the same machine for as long as your router stays up, though, so this approach is only appropriate if the cost of having requests sent to the wrong machines is low.

Update: Murmur now has a web site.