Monday, March 3, 2008

Murmur and Hash Tables for Scalability

Murmur is a new (non-cryptographic) hash function. Tanjent says: "Way faster than Hsieh, better mixing than Jenkins, as simple as FNV." He's also made some speed and mixing improvements since its initial release.

This got me thinking about hash-based load distribution, in which the modulus of a hashed identifier is used to route each request to the machine responsible for handling it. Distributing load this way lets server machines keep local state across a series of requests (many load balancers call this "affinity"). Under heavy load, fast hash functions with poor mixing can route too many requests to one machine, overwhelming it while the others have plenty of unused capacity. Cryptographic hash functions can eliminate that problem by distributing requests more evenly, but they are sometimes too CPU-intensive for the router. Functions like Hsieh and Jenkins are useful in cases like that, and Murmur might be even better.
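The basic scheme is simple enough to sketch in a few lines. This is a minimal illustration, not anything from Murmur itself: the server list is made up, and `zlib.crc32` stands in for whatever fast non-cryptographic hash the router would actually use.

```python
import zlib

# Hypothetical backend pool; any stable, ordered list of servers works.
SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def route(identifier: str) -> str:
    # Hash the identifier, then use the modulus to pick a server.
    # zlib.crc32 is a stand-in for a fast hash like Murmur or Hsieh.
    h = zlib.crc32(identifier.encode("utf-8"))
    return SERVERS[h % len(SERVERS)]
```

The same identifier always lands on the same server, which is what gives you affinity; the quality of the hash's mixing is what determines how evenly the pool is loaded.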

I was talking about this with my brother, and he came up with a way to trade away memory for CPU in applications where the number of possible identifiers is small enough: use a very fast hash function as the basis of a large hash table on the router, mapping identifiers to slow-computed hashes which can in turn be used to select servers. The first time you see an identifier, you perform a slower hash, store it in your local hash table, and use the slow-computed hash value to pick a server. Next time you see the same identifier, you'll find the slow-computed hash in your table at the reduced CPU cost of a much quicker hash. If your identifiers are unlimited, though, your hash table will just keep growing, unless you turn it into an LRU cache by culling old entries.
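Here's one way that idea might look, as a sketch under assumptions the post doesn't pin down: MD5 plays the "slow" hash, a plain Python dict (whose key lookup uses a fast built-in hash) plays the fast hash table, and an `OrderedDict` gives a crude LRU cull. All names and the cache limit are illustrative.

```python
import hashlib
from collections import OrderedDict

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical backend pool
CACHE_LIMIT = 100_000  # arbitrary bound to keep the table from growing forever

cache = OrderedDict()  # identifier -> slow-computed hash value

def route(identifier: str) -> str:
    h = cache.get(identifier)
    if h is None:
        # First sight of this identifier: pay for the slow hash once.
        h = int.from_bytes(hashlib.md5(identifier.encode("utf-8")).digest()[:8], "big")
        cache[identifier] = h
        if len(cache) > CACHE_LIMIT:
            cache.popitem(last=False)  # cull the least-recently-inserted entry
    else:
        cache.move_to_end(identifier)  # mark as recently used
    return SERVERS[h % len(SERVERS)]
```

Repeat requests for the same identifier now cost only a dict lookup, while still getting the even distribution of the cryptographic hash.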

Another approach is to generate a random number instead of using a slow-computed hash. The hash function and random number generation are fast, and a pseudo-random number generator should get a very even distribution. An identifier will only map to the same machine for as long as your router stays up, though, so this approach is only appropriate if the cost of having requests sent to the wrong machines is low.
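That variant is even simpler to sketch; again the server list is hypothetical, and the table would need the same LRU treatment as before if identifiers are unbounded:

```python
import random

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical backend pool
assignments = {}  # identifier -> server; lost when the router restarts

def route(identifier: str) -> str:
    server = assignments.get(identifier)
    if server is None:
        # First sight of this identifier: pick a server uniformly at random
        # and remember the choice for the lifetime of this process.
        server = random.choice(SERVERS)
        assignments[identifier] = server
    return server
```

The randomness buys an even spread without any slow hashing at all, at the price of the trade-off the paragraph above describes: the mapping evaporates on restart.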

Update: Murmur now has a web site.

3 comments:

Emil Dotchevski said...

A similar approach to the one suggested by your brother is used in algorithms such as rsync, where some kind of fast hash with a relatively high collision rate is used to rule out most misses, followed by a slower, cryptographic-quality hash to eliminate collisions.

Emil Dotchevski
http://www.revergestudios.com/reblog/index.php?n=ReCode

ltbarcly said...

There is no way even a 'slow' hash function is too slow to route web requests. Just dealing with connection overhead is hundreds of times more expensive than an 'expensive' hash function. Plus, latency is on average like 100ms over the internet. Saving a few dozen CPU instructions per request will have absolutely no effect.

Steven Hazel said...

ltbarcly,

You're right, this kind of thing shouldn't matter for web request load balancing. The case where I've seen it be an issue was in doing backend lookups that happen many times per request, over a persistent connection. Even then, it might not have been a big deal if the routing machines had not been near capacity when we were considering a switch to MD5.