Monday, September 1, 2008

Google Chrome and the Future of Browsers

The big news today in the nerd world is that Google is going to ship an open-source browser, called Google Chrome.

Chrome incorporates some of the same ideas Bram and I wrote about recently: each site runs in its own process, so that memory can be bulk-reclaimed when you navigate away from a domain or close a tab, and it provides a task-manager-like interface that lets you see which site is using all your CPU or RAM. It also uses a more sophisticated Javascript VM called V8, which features precise, incremental garbage collection and dynamic compilation.

Between these good design ideas and Google's ability (and intention) to test against their enormous, ranked web cache, I think Chrome is pretty much guaranteed to be a substantially better browser than IE, Firefox, and Safari are today.

By treating web apps more like regular applications, Chrome, together with Gears, constitutes a serious bid by Google to grow browsers into decently capable application platforms almost on the scale of operating systems. That means Google is slowly but surely working to erode Microsoft's monopoly. If browsers provide a high-quality, universally popular, OS-independent platform for applications, OS vendors no longer have any particular advantage in the desktop applications market. You don't have to look too far beyond Chrome to see operating systems becoming completely interchangeable commodities, just invisible layers that provide hardware support and enough infrastructure to run a browser and maybe some games.

Maybe Google will finally be able to make good on Marc Andreessen's declaration over a decade ago that the browser would make the operating system irrelevant.

Tuesday, July 15, 2008

codepad vim plugin

Nicolas Weber has written a handy Vim plugin that adds commands to paste the contents of your current buffer into codepad, and optionally run it. You can get it here.

I spend most of my time in Emacs, so if people find this useful, maybe I'll write an Emacs version. Then again, I've been considering a switch to Vim. I'm thinking maybe its multi-keystroke commands will be easier on my wrists than the control- and meta-chording I do so much of in Emacs. I can't decide whether breaking with a decade of muscle-memory training will be more or less painful.

Thursday, June 19, 2008

Web Browsers and Memory Fragmentation

I've been following Stuart Parmenter's blog posts about improving memory usage in Firefox 3. The most interesting part has been his excellent work on reducing memory fragmentation. By making pretty pictures of memory fragmentation under various malloc implementations, he was able to identify jemalloc as the best option for Firefox, and the improvement made by switching was pretty impressive.

But I suspect that ultimately it won't be enough. The advent of Ajax is dramatically changing the way browsers use memory. Rather than loading a fairly static DOM and occasionally doing a few small manipulations in Javascript as the site is used, modern web apps mutate the DOM much more extensively, adding and removing sections repeatedly as new data is received from the server. And they're going to be doing that kind of thing more and more.

There's an interesting problem here which is unique to browsers. Javascript doesn't give programmers direct access to pointers, so it's free to move objects around in memory however it pleases as it executes code. Modern Javascript implementations take advantage of that abstraction, and manage memory allocation with generational garbage collectors that compact memory by relocating objects as they operate. But the DOM, which also has to be manipulated by browser code in C or C++, is reference-counted. The reference counter can't migrate objects between regions of memory, so it depends on the underlying malloc implementation to control memory fragmentation. That's not always possible, even for a very smart malloc. DOMs in particular are likely to be challenging, because they contain large numbers of strings of varying sizes.

Reference counting introduces memory fragmentation problems in any language, but the situation is particularly bad in browsers. Javascript programs and DOMs on different web pages share the same process memory space. The fragmentation caused by one page can interact with the memory used by other pages, and even after a problematic page is closed, the fragmented memory often can't be reclaimed. With ordinary applications, fragmented memory is all neatly reclaimed when the process terminates. Right now, that doesn't work for web apps.

But this is a problem for browsers, not for web apps. Individual web app authors won't be motivated to care if your browser uses a lot of memory and causes your machine to start swapping once or twice a day, especially if the blame is shared among many popular sites. Unless one site is much worse than its peers, high memory usage makes the browser look bad, not the app.

So I think browsers will eventually experience pressure to fix this problem. As the correct metaphor for a web page moves farther from "document" and closer to "application", maybe it makes sense for browsers to act more like operating systems. Local memory for a web page could be allocated in dedicated regions, and could all be bulk-reclaimed when the page is closed.

That model is simple, but it still allows individual pages to fragment their own memory and consume more and more over time. Maybe a better answer is to find a way to use a generational garbage collector on the DOM. The familiar OS model of per-process virtual memory management was designed for running programs written in languages like C and C++, and that's why it's appealing for handling browser DOMs. But Javascript is a high-level, functional, garbage-collected language. If browsers are little operating systems that run Javascript applications, maybe they should operate more like Lisp machine operating systems than like Unix.

Tuesday, June 17, 2008

Private Pastes (and Projects) for codepad.org

I've noticed that some pastes on codepad.org include confidentiality notices in their copyright boilerplate. You own (or your employer owns) the copyright on code you paste on codepad.org, but it is a public forum! Pasted code appears on the "Recent Pastes" page, for example, where anyone can see it, and search engines can index it.

It makes sense that some people might want to use the site without revealing their code to the public, though. To support that kind of usage, I've made it possible to flag a paste as private, so that it won't be publicly linked on the site, and will include a noindex meta tag to advise search engines not to index it. Just in case, I went and flagged all the pastes in the database that included the words "confidential" or "copyright" as private pastes.

While I was at it, I added a similar feature for codepad projects. Now you can set up a private codepad project to collaborate on code that you'd rather not show the whole world.

Since paste URLs are randomly generated, they shouldn't be vulnerable to the kind of URL-guessing attacks SmugMug had problems with earlier this year. Still, these measures aren't really good enough for protecting important trade secrets — especially for projects, where the name might be guessable. If it seems like there's demand for it, I'll consider adding access control lists in the future.

Thursday, May 15, 2008

How Not to Handle Streams

Let me tell you about what I did yesterday. I set out to write a little HTTP proxy server that would transfer large (> 1GB) files to and from Amazon S3, doing a tiny bit of processing on the beginning of the data as it came through. Sounds pretty simple, right? You're probably thinking that's just a matter of gluing some HTTP libraries to some S3 libraries. I thought so too.

I've been enjoying Haskell lately, so my first thought was to try to write it in Haskell, even if it meant producing my own Haskell S3 library in the process. But I soon discovered that Haskell's HTTP library doesn't have any facility for streaming large amounts of data over HTTP: it reads the entire request or response into a String in memory.

This is an extremely common problem. Library authors often don't consider applications involving large amounts of data, and assume that everything will fit in memory. This is a particularly sad state of affairs in Haskell, because the language itself supports streaming behavior via lazy evaluation — there's no need even to change the API to support streaming. Just make your functions do lazy IO, and those Strings can be used to stream data as needed. (In fact, there's been some work on this in Haskell.)

Note also that TCP supports reading and writing as you process data. A lot of people seem to be at least slightly confused about this, particularly in the case of receiving data from a remote sender. If the other side of a TCP connection is sending data faster than you can process it, nothing terrible happens. Your kernel's TCP buffer will fill up, it will advertise a zero-sized receive window, and the remote client will stop being able to send. The remote TCP buffer will fill, and the socket will correctly block, or stop signaling poll(), until it can write again. To the other side, it won't look any different from a slow, bursty network connection.
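
To make this concrete, here is a minimal sketch of the kind of relay loop I have in mind, in modern Python with plain sockets (not the proxy I actually wrote). The blocking sendall() call is where TCP's flow control does the throttling described above.

    import socket

    CHUNK = 64 * 1024

    def relay(src: socket.socket, dst: socket.socket) -> None:
        """Copy bytes from src to dst in chunks, letting TCP pace both sides.

        If the reader on dst is slower than the sender on src, dst's send
        buffer fills and sendall() blocks; we stop reading from src, src's
        receive window closes, and the remote sender is throttled by TCP
        itself. No unbounded buffering is needed anywhere.
        """
        while True:
            chunk = src.recv(CHUNK)
            if not chunk:          # the sender closed the connection
                break
            dst.sendall(chunk)     # blocks until the kernel can take the bytes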

Anyway, simple string-based APIs are disappointing, but it gets worse. Even when authors do consider the need to deal with large amounts of data, they often get the API wrong.

I eventually gave up on Haskell, and switched to Python. Things started out okay. BaseHTTPServer and urllib2 both let you stream data to and from file-like objects. But the example Python S3 library doesn't provide a streaming API; it reads everything into strings. Well, okay, it's just an example. So I took a look at boto.

Boto has a variety of methods, some with names that sound promising. Like get_contents_to_file, that'll take a file-like object and write the data... wait. I'm getting the data from S3. I want to read it from a file-like object, not pass in a file-like object for boto to write to. Boto goes out of its way here to do the wrong thing! Rather than just handing me the wrapped socket object it no doubt has internally, it reads and writes all the data in a loop, converting the read operations I want to do into write operations I'll have to handle. To do any processing on this data, I'd have to construct a file-like object that can handle writes from boto!

Another thing I considered was writing my own S3 interface using PycURL. But PycURL requires you to give it a "write function" callback, which it will call periodically with data until the entire response body has been read. It's the same situation as with boto: you have to handle writes where you would prefer to perform reads. If you wanted to turn that into a readable file-like object, you'd have to run PycURL in a thread, and handle writes by putting data into a queue, and blocking when the queue is full. Then you could read out of the other side of the queue in the main thread.
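
For what it's worth, here is roughly what that thread-and-queue workaround looks like in Python. It's only a sketch: the producer argument stands in for whatever callback-driven API you're stuck with (PycURL's write function, or something shaped like it), and the bounded queue is what blocks the producer when the reader falls behind.

    import queue
    import threading

    class CallbackReader:
        """Turn a 'call me with each chunk' producer into a readable stream."""

        _EOF = object()

        def __init__(self, producer, max_chunks=16):
            # producer is any function that takes a write-callback and calls
            # it repeatedly with chunks of data (hypothetical here).
            self._queue = queue.Queue(maxsize=max_chunks)
            self._buffer = b""
            self._done = False
            self._thread = threading.Thread(
                target=self._run, args=(producer,), daemon=True)
            self._thread.start()

        def _run(self, producer):
            producer(self._queue.put)   # each chunk the producer writes is queued
            self._queue.put(self._EOF)  # signal end of stream

        def read(self, size):
            while len(self._buffer) < size and not self._done:
                chunk = self._queue.get()
                if chunk is self._EOF:
                    self._done = True
                else:
                    self._buffer += chunk
            data, self._buffer = self._buffer[:size], self._buffer[size:]
            return data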

For some reason this kind of read/write inversion is almost as common as string-based interfaces. Sometimes API users are required to provide a file-like object, and other times it's a callback function. Either way, the user ends up writing code to handle IO as it is performed by the API function, rather than performing the IO themselves. The most convenient read/write direction for an API to provide is almost always the one provided by the underlying IO: the one that allows the user to perform the IO directly, as they need it.

Frustrated with the situation in Python, I considered using Ruby. Ruby's AWS::S3 has a nicer interface than boto, and allows me to stream data reasonably. (It actually does the same kind of read/write inversion that PycURL does, but Ruby's block syntax makes that somewhat less annoying.) And Mongrel — the web server everybody uses to host their Rails apps — can be used as a library for writing HTTP servers, a lot like Python's BaseHTTPServer. Oh, but wait. Mongrel delivers the HTTP request body either as a string, or, if it exceeds some configurable size, as a tempfile. So it'll let me read from a file-like object, but insists on writing the data to an actual file on disk first!

Perhaps Mongrel is trying to be too smart, to anticipate my needs rather than exposing the more generally useful API underneath. In any case, it didn't work for my application.

The expedient solution for me was to hack up boto to make it not deliberately break the functionality I wanted. I'll be submitting a patch, so perhaps one of the five unhelpful APIs I encountered yesterday will someday be fixed. Why does this seem to be so difficult for library authors to get right?

Sunday, May 11, 2008

codepad.org projects

codepad.org now has project pastebins, which allow groups of developers to work with their own history of recently pasted code. Here's an example. I'm thinking about ways to expand the project functionality. If you're considering using it and have any suggestions, I'd love to hear them.

Thursday, May 8, 2008

The Terrible Legacy of Legacy

Out of curiosity, I recently started reading the haskell-prime mailing list. Haskell' is the interim name for the language standard that will supersede Haskell98, like C++0x is to C++03. There's been some discussion on the list about where to break backward compatibility with Haskell98, and which features are worth the trouble. It's interesting to see a language just starting to have to deal with these issues. On one side, you've got researchers and language designers angling to improve the language as much as possible before a new standard is nailed down. On the other are mainly industry users, pushing for less change and more backward compatibility, so they don't have to spend time and money upgrading their codebase or working with old and unsupported tools.

The influence of backward compatibility on software is hard to overestimate. Windows Vista is still binary-compatible with many MS-DOS programs dating back to 1981, and the DOS API was in turn meant to make it easy to port CP/M programs from the late ’70s. Meanwhile, Mac OS X and Linux are both Unix variants, with some core API features dating back to the early ’70s.

The situation with processor instruction sets is similar: the processors in today's PCs (and Macs!) are backward-compatible with Intel's original 8086, which itself was designed so that assembly for the 8008 — the first 8-bit microprocessor, released in 1972 — could be converted automatically.

This means that there are 30-year-old programs which would require very little modification to run on today's operating systems. And there's no reason to expect that in another 30 years we won't still be using systems with an unbroken 60-year-long chain of backward compatibility.

It's not that these technologies were so ingenious that we haven't managed to think of anything better in the intervening decades. Rather, when the quality of a software interface gets good enough to become a widespread standard, the value of any given improvement on that interface is dwarfed by the value of compatibility. Progress in any direction that would require a break in compatibility slows dramatically. The bar for de-facto standards isn't "great", but merely "good enough".

What this means is that an increasing number of design features in the software systems we use every day are attributable to historical reasons. That's the terrible legacy of legacy code. The crushing gravity of installed bases eventually pulls even the best-designed systems down into a mire of hard-to-learn, hard-to-use arcana.

A lot of programming languages from the ’90s are feeling that pressure today, and the result is a number of planned backward-incompatible major revisions, including Python 3000, Ruby 2, and Perl 6. I'm going to go out on a limb and claim that those languages have more users and larger existing codebases than Haskell does. If they can make backward-incompatible changes just to clean house, surely a primarily research-oriented language like Haskell can.

Don't get me wrong, there are good reasons to maintain backward compatibility in a wide variety of situations. But if you don't have those reasons, why in the world would you subject your programming language to the mangling, bloating influence of backward compatibility? While backward compatibility is a great default position when you don't have any improvements to make, giving up too much to maintain compatibility is bad for everyone.

The question is, how does the cost to existing users compare to the value to all future users? If your language is still growing in popularity, the number of future users can easily exceed the number of current users by orders of magnitude. If you don't break compatibility to fix a problem, you're hurting all the users who will have to live with the problem for who knows how long, in order to avoid hurting the few who will have to upgrade now. And if you don't fix it, someone else will, in a new language, so your users get stuck in COBOL-support hell eventually anyway. That's if we're lucky. If we're not lucky, your language will become so popular that the value of compatibility will outweigh the value of better languages, and your language will be a drag on progress. It's practically a moral imperative to fix it while you still can.

C++ is an excellent example of a language that has valued backward compatibility over simplicity and consistency. It's a popular, practical tool, but few people consider it beautiful, easy to learn, or easy to work with. As Bjarne Stroustrup put it:

I consider it reasonably obvious that in the absence of compatibility requirements, a new language on this model can be much smaller, simpler, more expressive, and more amenable to tool use than C++, without loss of performance or restriction of application domains. How much smaller? Say 10% of the size of C++ in definition and similar in front-end compiler size. In the "Retrospective" chapter of D&E, I expressed that idea as "Inside C++, there is a much smaller and cleaner language struggling to get out". Most of the simplification would come from generalization — from eliminating the mess of special cases that makes C++ so hard to handle, rather than restriction or moving work from compile time to run time.

What Stroustrup originally said was that "Within C++, there is a much smaller and cleaner language struggling to get out," which "would [...] have been an unimportant cult language." I'm not sure I agree with that last part. Java, for example, is a syntactically similar language whose designers did decide to give up compatibility with C in order to achieve greater simplicity and consistency. Even though Java is virtual-machine-based and unsuitable for a wide variety of systems programming tasks, its popularity has by some metrics exceeded that of C++.

What has compatibility bought C++? Java shows that it wasn't a requirement for popularity. And despite being largely backward-compatible with C, C++ has had difficulty supplanting it for many tasks. Indeed, there's more than one place where a break with backward compatibility might have simplified and improved C++ at little cost. The modern C++ style that has become prominent in recent years bears little relation to the way C++ was written in the ’90s. How much value is there, really, in being able to compile both dialects with the same compilers?

So that's my position: backward compatibility at the expense of simplification is only appropriate when you can't gain acceptance any other way. If a given backward-incompatible improvement isn't going to cause a fork in the language, its value probably outweighs the value of compatibility.

Saturday, April 26, 2008

Debugging Science

Computer Science is badly misnamed. For the most part, I don't see how it's a science at all. If you look at the curriculum of a typical undergraduate CS program, about half of it looks like a branch of mathematics. There are classes on abstractions like automata, algorithms, and computational complexity — areas in which many important results were worked out by mathematicians long before the first computers were built. Interspersed with the mathematics are classes on software engineering; classes about how to build things, like operating systems, compilers, and networks. There aren't many labs, because the field isn't too big on experiments. How many people in CS departments even consider themselves scientists?

But there's an important component of software engineering that does qualify as a science: debugging computer programs. When you find a bug in your program, your mental model of the program has been invalidated. The program didn't behave as you expected. Your task in debugging is to find out why. Or, put another way, to find out how your program actually behaves — just because you see unexpected results doesn't mean that you understand exactly what your program did to produce them. Experiment has invalidated your old theory, and you're about to perform additional experiments to help you find a new one. Science!

Debugging is particularly interesting because it's such a simple example of science. The relevant experiments can be performed rapidly and without expensive equipment. The measurements are precise. The problems can be simple, or very complex. It seems like an ideal model for investigating how science works. If you want to learn something about effective scientific technique, look at the neat little microcosm of debugging technique.

For example, some people are very bad at debugging. These people typically "guess and check": they guess at what must be going wrong, "fix" their assumed problem, and then check to see whether the program's behavior improves. Sometimes you get lucky this way, but if there are a lot of possibilities, you can spend a long time guessing. You might even forget what you've already tried and end up testing potential solutions more than once. Or, worse, you could think you've tried something you haven't.

This is poor scientific method, but it's also the form of the scientific method that you were probably taught in grade school. It has all the components of a science fair project: problem, hypothesis, experiment, and conclusion. But if you've ever watched someone debug this way, you know what a disaster it is.

The problem is that science as described above is only a filter. Really, all it describes are the inputs and outputs of experimentation. Given a situation you don't understand (problem) and a guess about what is going on (hypothesis), you perform some procedure (experiment) to determine whether your guess is plausible (conclusion). This explains how to distinguish the plausible guesses from the implausible ones, but doesn't provide any guidance about which guesses you should make, or in what order.

In contrast, here's how effective software engineers debug, in my experience:

  1. They find a very general open question about the problem. They don't jump to conclusions about exactly what might be going wrong, like "should the loop in function f in module Foo start at 1 rather than 0?". Instead, they start with more general questions, like "which modules are involved?".
  2. They come up with an experiment that will give as definitive an answer as is practical. Sometimes this is a potential fix, but often it's some more direct type of data-gathering, like tracing, assertions, or inspection at a breakpoint in a debugger.
  3. They perform the experiment.
  4. They repeat the above steps, narrowing down their questions until they have an accurate idea of what must be going wrong.

What's going on here? The middle steps could be part of the same flailing experimental process that even the poorest programmers use. The important parts are really steps 1 and 4, the identification of open questions that cover a broad range of possibilities, and the continual narrowing of those questions as you home in on the answer.

While the "guess and check" method resembled a science fair project, this method more closely resembles John R. Platt's description of what he called "strong inference" in his famous speech on effective scientific technique, The New Baconians:

  1. Devising alternative hypotheses;
  2. Devising a crucial experiment (or several of them), with alternative possible outcomes, each of which will, as nearly as possible, exclude one or more of the hypotheses;
  3. Carrying out the experiment so as to get a clean result;
  4. Recycling the procedure, making subhypotheses or sequential hypotheses to refine the possibilities that remain, and so on.

Where the scientific method you were taught in grade school is a filter, strong inference is a search. The idea goes back to Francis Bacon, who described science as tree traversal using the metaphor of "fingerposts which are set up where roads part, to indicate the several directions". Using the filter of experimentation to distinguish valid explanations from invalid ones, strong inference describes a method of ordering the space of possible explanations, and sort of binary-searching (actually, n-ary-searching) your way down to the correct explanation.

Platt didn't talk about starting with general open questions, although he got close with his comment about "subhypotheses". Part of the reason Platt's speech was so popular is that effective scientific technique isn't easy to pin down. Like the algorithm you use when searching through the phone book, debugging and other science may be easier to do than it is to precisely specify. (Hint: searching the phone book is sort of an interpolation search.) But debugging at least provides more convenient guiding examples than the slower, more complex sciences.

So how does this effective debugging procedure work, in detail? The fundamental technique is to arrange the search space into a tree based on the generality of the assertions each hypothesis makes. Every hypothesis makes at least one assertion, which can be viewed as a boolean question: is the assertion true, or false? Your tree branches for each of those possibilities. A more general hypothesis like "the problem is in function f" appears higher in your search tree than one that makes more specific assertions, like "the problem is in function f, and f's local variable foo can be null". Then you traverse the tree, doing an experiment at each level to select the correct branch.
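
The skeleton of that traversal is almost embarrassingly small when you write it down. Here's a sketch in Python, with a made-up node format (an "experiment" callable plus a dict of branches) standing in for the bookkeeping you normally do in your head:

    def diagnose(hypothesis):
        """Walk a hypothesis tree, running one experiment per level.

        Each node is a dict with an 'experiment' callable that returns the
        key of the branch it supports, a 'branches' dict of more specific
        sub-hypotheses, and, at the leaves, an 'explanation'.
        """
        while hypothesis.get("branches"):
            outcome = hypothesis["experiment"]()          # run the deciding experiment
            hypothesis = hypothesis["branches"][outcome]  # descend into that branch
        return hypothesis["explanation"]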

One variant on this process has been automated in git bisect. When you know that a bug used to be absent from your code, you can start with a general hypothesis that the bug was introduced somewhere in a large range of versions, and narrow that hypothesis down to smaller ranges and eventually a specific version. Given a test case and an old revision that doesn't contain the bug, you can literally binary search your way to the exact revision in which the bug was introduced. From there, you may need to do additional searching to get down to the specific problem.
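
Stripped of the version-control plumbing, the search git bisect automates is an ordinary binary search with a predicate. A sketch, where revisions and is_bad are placeholders for a real revision list and a real "check out and run the test" function:

    def find_first_bad(revisions, is_bad):
        """Return the first revision for which is_bad(revision) is True.

        Assumes revisions is ordered oldest to newest, the first is known
        good and the last is known bad -- the same invariant git bisect
        relies on.
        """
        lo, hi = 0, len(revisions) - 1     # revisions[lo] good, revisions[hi] bad
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if is_bad(revisions[mid]):
                hi = mid                   # bug already present at mid
            else:
                lo = mid                   # mid is still good
        return revisions[hi]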

Sometimes a single experiment can resolve more than one binary question at a time, so it's useful to look at this process as an n-ary search through a tree, rather than just a binary search. At any given level of the tree, you might have branches for more than two alternative hypotheses.

Similarly, sometimes you run into a number of equally-general hypotheses. For example, if you already know that the problem must be in exactly one of three functions, it might not be clear which one to test first. One useful ordering for equally-general hypotheses is by the probability of successful identification of a plausible branch. If you know that functions f and g are old and have been subject to a fair amount of testing, but function h is much newer, you might test h first. The idea of arranging multiple levels of a search tree according to that principle reminds me of Huffman coding; you organize your tree so that the most probable hypotheses can be investigated in the least time.

Of course, questions aren't all created equal, and there isn't always an easy experiment to resolve every question. You might find that the obtainable evidence supports more than one hypothesis at some level in your tree. This happens all the time in multithreaded programming, where complex race conditions and deadlocks can be extremely difficult to reproduce. If you can't devise an experiment to determine which of two branches is correct, you have to take both branches, and hope one of them is invalidated further down. You can do this one branch at a time: if the branch you choose first is invalidated, you backtrack to focus on the correct branch. This means that in science, having multiple camps in support of divergent theories is perhaps a good thing: it's parallel processing! On the other hand, there's something to be said for the idea that a single scientist should be able to find more than one incomplete explanation plausible simultaneously.

Worse than merely taking the wrong branch first is what happens if you fail to consider all the possible explanations at some level in your tree. You could wind up following down some initially plausible but ultimately incorrect path without even being aware of the alternative. In debugging, when you realize you've done this, it's not too dramatic. Maybe you slap yourself on the forehead. But in more complex subjects like physics or chemistry, major backtracks to previously unimagined possibilities are what Thomas Kuhn called "paradigm shifts" — these are upsets like special relativity, and the abandonment of the phlogiston theory of combustion.

The fact that Kuhn wrote about paradigm shift in the 1960s, hundreds of years after Francis Bacon wrote about his fingerposts, shows how our understanding of scientific technique has lagged behind our ability to use it. It might make sense to use debugging as a model for teaching science. Students could develop methodological intuition on a simplified model, rather than trying to accumulate it over years of much slower work with a harder problem. Of course, fledgling scientists would have to learn some computer programming to be able to play with debugging tasks. There's a fair argument that tomorrow's scientists would do well to pick up a little bit of computer programming anyway. But simplified computational models of physics or chemistry could provide just as nice a model as debugging. In fact, in computational biology, serious scientific experiments are already being done on computer models of physics, rather than the real thing! Why not start with something simpler, for educational purposes?

Update: I originally asserted that Francis Bacon, "using modern CS terminology three hundred years ahead of time, called [strong inference] a 'conditional inductive tree'". I was confused — those were Platt's words, not Bacon's.

Sunday, April 13, 2008

PHP and Lua Support for codepad.org

I spent last week making some fairly substantial improvements to the codepad.org back-end. The highlight for the moment is that codepad now speaks PHP and Lua.

Thursday, March 13, 2008

Tree Lists as a Default List Data Structure

I'd like to call your attention to Tom Kleinpeter's new blog, in which he's been posting lots of great stuff about his experiences building and scaling Audiogalaxy and FolderShare. I worked with Tom at both of those companies, helping him build the p2p clients and backend systems. If you know me, you know that I think most software engineers are pretty crappy. Statistically speaking, I probably think you're a pretty crappy engineer. Tom, on the other hand, is a really good engineer, one of the best I've worked with. His blog is part of what inspired me to finally finish setting up this one.

Anyway, in an earlier article, he mentioned skip lists and unrolled linked lists, which reminded me of my love for this category of list data structures.

Most languages that have a built-in list type offer one of two alternatives (and often, strangely, lack any standard implementation of the other):

  • Linked lists are the popular built-in list type in functional programming languages like Scheme and Haskell. They allow fast O(1) inserts and deletes at any point in the list. Unfortunately, lookup is O(n), which makes many common insert and delete operations (those that don't start from an iterator in the correct position in the list) O(n) as well. Memory overhead for linked lists involves at least one link pointer for every element in the list. All in all, they're an algebraically fundamental but pragmatically somewhat specialized type of list.

  • Array lists (that is, lists implemented as arrays which are dynamically resized to fit the length of the list, usually by doubling/halving the array length when necessary) are popular in many other dynamic languages, such as Python, Ruby, and Perl, and are the most popular type of list in the collection libraries of many languages that don't have built-in list support, including C++, Java, and C#. Since they require shifting large portions of the array around in memory, inserts, deletes, and prepends are O(n), but appends and lookups are O(1) (and cache-friendly). Amortized memory overhead is lower than for linked lists, since link pointers are not required, and in the worst case the overhead is equal to that of linked lists. They're practical for a wide range of applications, but fall down when you want to insert or delete in the middle of the list.

Tree-based list data structures — let's call them "tree lists" — are a compromise between those two options. The basic strategy for tree lists is to split up your list into segments stored as leaves in a tree (as either linked lists or array lists). Lookups in a tree list require tree traversal, but not traversal of the entire list. Inserts and deletes may require shifting some elements around in memory, but not many of them. Inserts, deletes, prepends, appends, and lookups all wind up in the neighborhood of O(log n). That's acceptable performance for a wide variety of operations.
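
To make the strategy concrete, here is a toy Python sketch of the simplest two-level case: items live in bounded segments, so an insert only ever shifts one small segment. Finding the right segment here is a linear walk; a real tree list (a B+Tree variant like the BList mentioned below) indexes the segments so that step is logarithmic too. The class and its constants are invented for illustration.

    class SegmentedList:
        """A toy segmented list: the two-level special case of a tree list."""

        MAX_SEG = 64    # split segments that grow beyond this

        def __init__(self):
            self.segments = [[]]

        def __len__(self):
            return sum(len(seg) for seg in self.segments)

        def _locate(self, index):
            # Find (segment position, offset) for index; the end of the last
            # segment is a valid insertion point.
            for pos, seg in enumerate(self.segments):
                if index < len(seg):
                    return pos, index
                index -= len(seg)
            return len(self.segments) - 1, len(self.segments[-1])

        def __getitem__(self, index):
            pos, offset = self._locate(index)
            return self.segments[pos][offset]

        def insert(self, index, value):
            pos, offset = self._locate(index)
            seg = self.segments[pos]
            seg.insert(offset, value)          # shifts at most MAX_SEG items
            if len(seg) > self.MAX_SEG:        # split an overfull segment in two
                half = len(seg) // 2
                self.segments[pos:pos + 1] = [seg[:half], seg[half:]]

        def append(self, value):
            self.insert(len(self), value)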

Skip lists and unrolled linked lists, though they use unusual representations of trees, are both examples of this strategy. (The major trade-off between them is that by using linked lists for the leaves, skip lists retain the O(1) insert/delete performance of linked lists, but at the cost of being less cache-friendly. Unrolled linked lists make the opposite choice.) Files in many filesystems are also implemented as tree lists, because of another advantage of these structures over array lists: segmenting the list allows for non-contiguous storage, which can help avoid expensive move operations in the heap as well as on disk.

BList is an implementation of a type of tree list in Python using B+Trees, which have the advantage that for small lists, they retain the excellent performance characteristics of small array lists. Python has rejected the proposal that standard Python lists be replaced with BList, and I can see why that makes sense for Python; drastically changing the performance characteristics of a basic type used in most existing code seems like a bad idea. But I think it would be interesting for a new language to adopt something like BList for its built-in list type. Tree lists in general seem like a good generic choice for lists, since from a performance perspective they're less specialized than linked lists and array lists.

Friday, March 7, 2008

Scaling at 2am with EC2

I really like Amazon EC2. Hosting companies everywhere should be learning about virtualization and web services, because this kind of thing is clearly the future of managed hosting.

This past Tuesday, I posted about my codepad.org project on news.YC. To my surprise, it got a lot of attention. Traffic increased by 1000x. That still amounts to only a modest amount of traffic, but the increase was very sudden, and I'm the only one working on the site. And as it turned out, I had a serious bug in my backend execution servers. I managed to keep the site working most of the time anyway, though, because EC2 saved my bacon.

First, a couple details about the architecture: codepad.org's backend untrusted-code-execution servers run in heavily firewalled virtual machines on Amazon EC2 instances (which are themselves firewalled and virtual -- it's like turtles all the way down or something.) The web frontend, running on a traditional colocated server, transmits code to the backend and gets output in response. The backend servers don't store any state between requests, so they're easy to replace, and easy to scale horizontally.

At the beginning of the influx of traffic on Tuesday, I had only one execution server instance running. I quickly discovered that it couldn't keep up with the load, so I launched a couple more instances, and had my web frontend load-balance between them. This is the benefit I expected from EC2: requisitioning new machines takes seconds rather than days, and cloning them from existing servers is effortless.
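
The frontend's side of that is not much more than a list of addresses and a cycle through them. Here's a hedged sketch in Python; the addresses, port, and /run endpoint are all invented, since the real protocol isn't the point:

    import itertools
    import urllib.request

    # Hypothetical pool of backend execution-server addresses (EC2 instances).
    BACKENDS = itertools.cycle([
        "http://10.0.1.10:8000",
        "http://10.0.1.11:8000",
        "http://10.0.1.12:8000",
    ])

    def run_code(source):
        """Send pasted code to the next backend in round-robin order.

        The backends keep no state between requests, so any of them can
        serve any request, and swapping one out is just a config change.
        """
        backend = next(BACKENDS)
        request = urllib.request.Request(
            backend + "/run", data=source.encode("utf-8"))
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8")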

Then I discovered that I had a bug. My execution servers would occasionally stop servicing requests for one language or another. At first, when this happened, I would just replace the machine in a hurry, and then continue to hunt for the bug. Because of the firewalling and virtualization involved, just replacing the whole machine was a lot more convenient than trying to swap in a new server process would have been.

As traffic increased, the lock-ups started happening too frequently for me to replace machines by hand. The first thing I did was to spin up 10 machines, so that load on each machine would be lower, making failures both less frequent and less problematic (since a single-machine failure would only affect 10% of users.) Not long after that, I got a script running that would switch out all 10 machines for new ones every hour. This actually got the site working flawlessly, before I'd even found the bug! It allowed me to get some sleep on Tuesday night.

Of course, I could have done the same thing at the process level, rather than the machine level. But I've done things like that in the past, and I was surprised by how much easier this was. If you don't write your servers right, switching them out when they fall over becomes a mess of sorting out config files and ports. If they use a lot of memory, you can get seriously stuck, because running them in parallel can cause them to start swapping when you need it least. With machines, there's no chance that they'll step on one another's toes like that. You just start up a new one, change the address in the frontend config, and kill the old one.

Another benefit of this approach is that the routines I built up in my machine-swapping script are of general utility. Parts of that script are going to be useful in building a system to launch and kill machines automatically in response to fluctuations in load. I'm really curious to find out whether, by using only the number of machines I need at any particular time, I'll be able to make EC2 cheaper than traditional managed hosting, despite slightly higher costs per machine-hour. Is there enough variance (and slow enough variations) in typical web site traffic to do that?

Monday, March 3, 2008

Murmur and Hash Tables for Scalability

Murmur is a new (non-cryptographic) hash function. Tanjent says: "Way faster than Hsieh, better mixing than Jenkins, as simple as FNV." He's also made some speed and mixing improvements since its initial release.

This got me thinking about hash-based load distribution, in which the modulus of a hashed identifier is used to route a request to the machine that will be responsible for handling it, so that load is distributed in a way that allows server machines to keep local state across a series of requests (many load balancers call this "affinity"). Under heavy load, fast hash functions with poor mixing can route too many requests to one machine, overwhelming it while the others have plenty of unused capacity. Cryptographic hash functions will eliminate that problem by distributing requests more evenly, but are sometimes too CPU-intensive on the router. Functions like Hsieh and Jenkins are useful in cases like that, and Murmur might be even better.
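
The routing itself is just a modulus over a digest. A minimal sketch, with MD5 standing in for "well mixed but slower than you'd like" (Murmur, Hsieh, or Jenkins would be the faster drop-ins) and a made-up server list:

    import hashlib

    SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical backend pool

    def route(identifier):
        """Pick the server responsible for this identifier (bytes).

        A well-mixed hash spreads identifiers evenly across the servers; a
        poorly mixed one can pile too many of them onto a single machine.
        """
        digest = hashlib.md5(identifier).digest()
        bucket = int.from_bytes(digest[:8], "big") % len(SERVERS)
        return SERVERS[bucket]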

I was talking about this with my brother, and he came up with a way to trade away memory for CPU in applications where the number of possible identifiers is small enough: use a very fast hash function as the basis of a large hash table on the router, mapping identifiers to slow-computed hashes which can in turn be used to select servers. The first time you see an identifier, you perform a slower hash, store it in your local hash table, and use the slow-computed hash value to pick a server. Next time you see the same identifier, you'll find the slow-computed hash in your table at the reduced CPU cost of a much quicker hash. If your identifiers are unlimited, though, your hash table will just keep growing, unless you turn it into an LRU cache by culling old entries.
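
Here's a sketch of that idea, using a Python dict (whose built-in key hashing plays the part of the very fast hash) to memoize the slow-computed bucket, plus LRU eviction so the table can't grow without bound. The class name and sizes are made up:

    from collections import OrderedDict
    import hashlib

    class AffinityRouter:
        """Cache a slow, well-mixed hash behind a fast hash-table lookup."""

        def __init__(self, servers, max_entries=1000000):
            self.servers = servers
            self.max_entries = max_entries
            self.cache = OrderedDict()   # identifier (bytes) -> slow-computed bucket

        def route(self, identifier):
            bucket = self.cache.get(identifier)
            if bucket is None:
                # First sighting: pay for the slow, evenly mixing hash once.
                digest = hashlib.sha1(identifier).digest()
                bucket = int.from_bytes(digest[:8], "big") % len(self.servers)
                if len(self.cache) >= self.max_entries:
                    self.cache.popitem(last=False)       # evict least recently used
                self.cache[identifier] = bucket
            else:
                self.cache.move_to_end(identifier)       # keep the LRU order fresh
            return self.servers[bucket]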

Another approach is to generate a random number instead of using a slow-computed hash. The hash function and random number generation are fast, and a pseudo-random number generator should get a very even distribution. An identifier will only map to the same machine for as long as your router stays up, though, so this approach is only appropriate if the cost of having requests sent to the wrong machines is low.

Update: Murmur now has a web site.