<h4>P vs. NP and the Computational Complexity Zoo (2014-08-27)</h4>
At long last, my second video is finished! <br />
<br />
<h4>Hackerdashery #2: P vs. NP and the Computational Complexity Zoo</h4><br />
<p><iframe width="560" height="315" src="//www.youtube.com/embed/YX40hbAHx3s" frameborder="0" allowfullscreen></iframe>

<h4>Hackerdashery YouTube Channel (2013-04-21)</h4>
I'm back! This time in video.<br />
<br />
<iframe width="560" height="315" src="http://www.youtube.com/embed/AmySxYHqQCQ" frameborder="0" allowfullscreen></iframe><br />
<br />
If you liked my blog back in 2008, you might like <a href="http://youtube.com/hackerdashery">my new YouTube channel</a>. I'll be talking about programming and all things tangentially related. Check it out and subscribe for more.

<h4>Google Chrome and the Future of Browsers (2008-09-01)</h4>
The big news today in the nerd world is that <a href="http://blogoscoped.com/archive/2008-09-01-n47.html">Google is going to ship an open-source browser</a>, called <a href="http://googleblog.blogspot.com/2008/09/fresh-take-on-browser.html">Google Chrome</a>.
<p>
Chrome incorporates some of the same ideas <a href="http://bramcohen.livejournal.com/51080.html">Bram</a> and I <a href="http://www.hackerdashery.com/2008/06/web-browsers-and-memory-fragmentation.html">wrote about recently</a>: each site runs in its own process, so that memory can be bulk-reclaimed when you navigate away from a domain or close a tab, and it provides a task-manager-like interface that lets you see which site is using all your CPU or RAM. It also uses a more sophisticated Javascript VM called V8, which features precise, incremental garbage collection and dynamic compilation.
<p>
Between these good design ideas and Google's ability (and intention) to test against their enormous, ranked web cache, I think Chrome is pretty much guaranteed to be a substantially better browser than IE, Firefox, and Safari are today.
<p>
By treating web apps more like regular applications, Chrome, together with <a href="http://gears.google.com/">Gears</a>, constitutes a serious bid by Google to grow browsers into decently capable application platforms almost on the scale of operating systems. That means Google is slowly but surely working to erode Microsoft's monopoly. If browsers provide a high-quality, universally popular, OS-independent platform for applications, OS vendors no longer have any particular advantage in the desktop applications market. You don't have to look too far beyond Chrome to see operating systems becoming completely interchangeable commodities, just invisible layers that provide hardware support and enough infrastructure to run a browser and maybe some games.
<p>
Maybe Google will finally be able to make good on Marc Andreessen's declaration over a decade ago that the browser would make the operating system irrelevant.

<h4>codepad vim plugin (2008-07-15)</h4>
Nicolas Weber has written a handy Vim plugin that adds commands to paste the contents of your current buffer into <a href="http://codepad.org/">codepad</a>, and optionally run it. You can get it <a href="http://www.vim.org/scripts/script.php?script_id=2298">here</a>.
<p>
I spend most of my time in Emacs, so if people find this useful, maybe I'll write an Emacs version. Then again, I've been considering a switch to Vim. I'm thinking maybe its multi-keystroke commands will be easier on my wrists than the control- and meta- chording I do so much of in Emacs. I can't decide whether breaking with a decade of muscle-memory training will be more or less painful.

<h4>Web Browsers and Memory Fragmentation (2008-06-19)</h4>
I've been following <a href="http://blog.pavlov.net/">Stuart Parmenter's blog posts</a> about <a href="http://blog.pavlov.net/2008/03/11/firefox-3-memory-usage/">improving memory usage in Firefox 3</a>. The most interesting part has been his excellent work on reducing <a href="http://blog.pavlov.net/2007/11/10/memory-fragmentation/">memory fragmentation</a>. By making <a href="http://blog.pavlov.net/2007/12/04/vlad-and-analysis-of-dtrace-was-used/">pretty pictures</a> of memory fragmentation under various malloc implementations, he was able to identify <a href="http://people.freebsd.org/~jasone/jemalloc/">jemalloc</a> as the best option for Firefox, and the improvement made by switching was pretty impressive.
<p>
But I suspect that ultimately it won't be enough. The advent of Ajax
is dramatically changing the way browsers use memory. Rather than
loading a fairly static DOM and occasionally doing a few small
manipulations in Javascript as the site is used, modern web apps
mutate the DOM much more extensively, adding and removing sections
repeatedly as new data is received from the server. And they're going
to be doing that kind of thing more and more.
<p>
There's an interesting problem here which is unique to browsers.
Javascript doesn't give programmers direct access to pointers, so it's
free to move objects around in memory however it pleases as it
executes code. Modern Javascript implementations take advantage of
that abstraction, and manage memory allocation with generational
garbage collectors that compact memory by relocating objects as they
operate. But the DOM, which also has to be manipulated by browser
code in C or C++, is reference-counted. The reference counter can't
migrate objects between regions of memory, so it depends on the
underlying malloc implementation to control memory fragmentation.
That's not something even a very smart malloc can always accomplish.
DOMs in particular should be challenging, because they contain large
numbers of strings of varying sizes.
<p>
Reference counting introduces memory fragmentation problems in any
language, but the situation is particularly bad in browsers.
Javascript programs and DOMs on different web pages share the same
process memory space. The fragmentation caused by one page can
interact with the memory used by other pages, and even after a
problematic page is closed, the fragmented memory often can't be
reclaimed. With ordinary applications, fragmented memory is all
neatly reclaimed when the process terminates. Right now, that doesn't
work for web apps.
<p>
But this is a problem for browsers, not for web apps. Individual web
app authors won't be motivated to care if your browser uses a lot of
memory and causes your machine to start swapping once or twice a day,
especially if the blame is shared among many popular sites. Unless
one site is much worse than its peers, high memory usage makes
the browser look bad, not the app.
<p>
So I think browsers will eventually experience pressure to fix this
problem. As the correct metaphor for a web page moves farther from
"document" and closer to "application", maybe it makes sense for
browsers to act more like operating systems. Local memory for a web
page could be allocated in dedicated regions, and could all be
bulk-reclaimed when the page is closed.
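<p>
A toy sketch of that bulk-reclamation model, in Python (my own illustration, not anything a real browser does): every allocation is tagged with the page that made it, and closing the page drops the page's whole region at once.
<blockquote><pre>class PageArenas:
    # Toy model of per-page regions: every allocation is recorded in
    # its page's arena, and closing the page drops the arena wholesale,
    # the way an exiting process gives back all of its memory.
    def __init__(self):
        self.arenas = {}  # page id -> objects allocated for that page

    def alloc(self, page, obj):
        self.arenas.setdefault(page, []).append(obj)
        return obj

    def close_page(self, page):
        # Bulk reclamation: the page's fragmentation can't outlive it.
        self.arenas.pop(page, None)</pre></blockquote>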
<p>
That model is simple, but it still allows individual pages to fragment
their own memory and consume more and more over time. Maybe a better
answer is to find a way to use a generational garbage collector on the
DOM. The familiar OS model of per-process virtual memory management
was designed for running programs written in languages like C and C++,
and that's why it's appealing for handling browser DOMs. But
Javascript is a high-level, functional, garbage-collected language.
If browsers are little operating systems that run Javascript
applications, maybe they should operate more like <a href="http://en.wikipedia.org/wiki/Lisp_machine">Lisp machine</a> <a href="http://lispm.dyndns.org/genera-concepts/genera.html">operating systems</a> than like Unix.

<h4>Private Pastes (and Projects) for codepad.org (2008-06-17)</h4>
I've noticed that some pastes on <a href="http://codepad.org/">codepad.org</a> include confidentiality notices in their copyright boilerplate. You own (or your employer owns) the copyright on code you paste on codepad.org, but it is a public forum! Pasted code appears on the "Recent Pastes" page, for example, where anyone can see it, and search engines can index it.
<p>
It makes sense that some people might want to use the site without revealing their code to the public, though. To support that kind of usage, I've made it possible to flag a paste as <a href="http://codepad.org/help/private-pastes">private</a>, so that it won't be publicly linked on the site, and will include a <a href="http://en.wikipedia.org/wiki/Noindex">noindex meta tag</a> to advise search engines not to index it. Just in case, I went and flagged all the pastes in the database that included the words "confidential" or "copyright" as private pastes.
<p>
While I was at it, I added a similar feature for <a href="http://codepad.org/mkproj">codepad projects</a>. Now you can set up a private codepad project to collaborate on code that you'd rather not show the whole world.
<p>
Since paste URLs are randomly generated, they shouldn't be vulnerable to the kind of URL-guessing attacks <a href="http://blogoscoped.com/archive/2008-01-28-n59.html">SmugMug had problems with earlier this year</a>. Still, these measures aren't really good enough for protecting important trade secrets — especially for projects, where the name might be guessable. If it seems like there's demand for it, I'll consider adding access control lists in the future.

<h4>How Not to Handle Streams (2008-05-15)</h4>
<p>
Let me tell you about what I did yesterday. I set out to write a
little HTTP proxy server that would transfer large (> 1GB) files to and
from Amazon S3, doing a tiny bit of processing on the beginning of the
data as it came through. Sounds pretty simple, right? You're
probably thinking that's just a matter of gluing some HTTP libraries
to some S3 libraries. I thought so too.
<p>
I've been enjoying Haskell lately, so my first thought was to try to
write it in Haskell, even if it meant producing my own Haskell S3
library in the process. But I soon discovered that <a href="http://www.haskell.org/http/">Haskell's HTTP library</a> doesn't
have any facility for streaming large amounts of data over HTTP: it
reads the entire request or response into a <tt>String</tt> in memory.
<p>
This is an extremely common problem. Library authors often don't
consider applications involving large amounts of data, and assume that
everything will fit in memory. This is a particularly sad state of
affairs in Haskell, because the language itself supports streaming
behavior via lazy evaluation — there's no need to even change
the API to support streaming in Haskell. Just make your functions do
lazy IO, and those <tt>Strings</tt> can be used to stream data
as-needed. (In fact, there's been <a href="http://nominolo.blogspot.com/2007/05/networkhttp-bytestrings.html">some
work</a> on this in Haskell.)
<p>
Note also that reading and writing as you process data is supported by
TCP. A lot of people seem to be at least slightly confused about this
situation, particularly in the case of receiving data from a remote
sender. If the other side of a TCP connection is sending data faster
than you can process it, nothing terrible happens. Your kernel's TCP
buffer will fill up, it will advertise a 0-sized receiver window, and
the remote client will stop being able to send. The remote TCP buffer
will fill, and the socket will correctly block, or stop signaling <tt>poll()</tt>,
until it can write again. To the other side, it won't look any
different from a slow, bursty network connection.
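<p>
Here's a little Python sketch of that backpressure in action (using a Unix-domain socket pair for convenience, but kernel buffers behave the same way for TCP): a writer paired with a reader that never reads simply blocks once the buffers fill.
<blockquote><pre>import socket

# Writer vs. a reader that never reads: send() fills the kernel
# buffers and then blocks, instead of anything terrible happening.
a, b = socket.socketpair()
a.settimeout(1.0)  # so the demo exits instead of blocking forever
sent = 0
try:
    while True:
        sent += a.send(b"x" * 4096)
except socket.timeout:
    print("writer blocked after buffering", sent, "bytes")</pre></blockquote>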
<p>
Anyway, simple string-based APIs are disappointing, but it gets worse.
Even when authors do consider the need to deal with large amounts of
data, they often get the API wrong.
<p>
I eventually gave up on Haskell, and switched to Python. Things
started out okay. <a href="http://www.python.org/doc/lib/module-BaseHTTPServer.html">BaseHTTPServer</a>
and <a href="http://docs.python.org/lib/module-urllib2.html">urllib2</a> both
let you stream data to and from file-like objects. But the <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=134">example
Python S3 library</a> doesn't provide a stream API; it reads
everything into strings. Well, okay, it's just an example. So I took
a look at <a href="http://code.google.com/p/boto/">boto</a>.
<p>
Boto has a variety of methods, some with names that sound promising.
Like <tt>get_contents_to_file</tt>, that'll take a file-like object
and write the data... wait. I'm <i>getting</i> the data <i>from</i>
S3. I want to <i>read</i> it from a file-like object, not pass in a
file-like object for boto to write to. Boto goes out of its way
here to do the wrong thing! Rather than just handing me the wrapped
socket object it no doubt has internally, it reads and writes all the
data in a loop, converting the read operations I want to do into write
operations I'll have to handle. To do any processing on
this data, I'd have to construct a file-like object that can handle
writes from boto!
<p>
Another thing I considered was writing my own S3 interface using <a href="http://pycurl.sourceforge.net/">PycURL</a>. But PycURL requires
you to give it a "write function" callback, which it will call
periodically with data until the entire response body has been read.
It's the same situation as with boto: you have to handle writes where
you would prefer to perform reads. If you wanted to turn that into a
readable file-like object, you'd have to run PycURL in a thread, and
handle writes by putting data into a queue, and blocking when the
queue is full. Then you could read out of the other side of the queue
in the main thread.
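<p>
For reference, here's roughly what that thread-and-queue adapter looks like in Python. This is just a sketch: <tt>start_transfer</tt> is a hypothetical stand-in for whatever callback-driven transfer (PycURL or otherwise) you're wrapping, and the bounded queue is what turns the producer's writes into backpressure.
<blockquote><pre>import queue
import threading

class CallbackReader:
    # Adapts a push-style API (a callback that receives chunks) into a
    # pull-style file-like object. start_transfer is a hypothetical
    # stand-in: it should call the function we hand it once per chunk,
    # and return when the transfer is done.
    _EOF = object()

    def __init__(self, start_transfer, maxsize=16):
        self._q = queue.Queue(maxsize=maxsize)
        self._buf = b""
        self._done = False

        def run():
            # put() blocks when the queue is full, which stalls the
            # producer (and, transitively, its socket reads).
            start_transfer(self._q.put)
            self._q.put(self._EOF)

        threading.Thread(target=run, daemon=True).start()

    def read(self, n):
        while not self._done and len(self._buf) < n:
            chunk = self._q.get()
            if chunk is self._EOF:
                self._done = True
            else:
                self._buf += chunk
        out, self._buf = self._buf[:n], self._buf[n:]
        return out</pre></blockquote>
<p>
Wrap the callback API once, and the rest of your program can just call <tt>read()</tt> as it needs data.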
<p>
For some reason this kind of read/write inversion is almost as common
as string-based interfaces. Sometimes API users are required to
provide a file-like object, and other times it's a callback function.
Either way, the user ends up writing code to handle IO as it is
performed by the API function, rather than performing the IO
themselves. The most convenient read/write direction for an API to
provide is almost always the one provided by the underlying IO: the
one that allows the user to perform the IO directly, as they need it.
<p>
Frustrated with the situation in Python, I considered using Ruby.
Ruby's <a href="http://amazon.rubyforge.org/">AWS::S3</a> has a nicer
interface than boto, and allows me to stream data reasonably. (It actually does the same kind of read/write inversion that PycURL does, but Ruby's block syntax makes that somewhat less annoying.) And <a href="http://mongrel.rubyforge.org/">Mongrel</a> — the web
server everybody uses to host their Rails apps — can be used as
a library for writing HTTP servers, a lot like Python's
BaseHTTPServer. Oh, but wait. Mongrel delivers the HTTP request body
either as a string, or, if it exceeds some configurable size, as a
tempfile. So it'll let me read from a file-like object, but insists
on writing the data to an actual file on disk first!
<p>
Perhaps Mongrel is trying to be too smart, to anticipate my needs
rather than exposing the more generally useful API underneath. In any
case, it didn't work for my application.
<p>
The expedient solution for me was to hack up boto to make it not
deliberately break the functionality I wanted. I'll be submitting a
patch, so perhaps one of the <i>five</i> unhelpful APIs I encountered
yesterday will someday be fixed. Why does this seem to be so
difficult for library authors to get right?

<h4>codepad.org projects (2008-05-11)</h4>
codepad.org now has <a href="http://codepad.org/mkproj">project pastebins</a>, which allow groups of developers to work with their own history of recently pasted code. Here's an <a href="http://example.codepad.org/">example</a>. I'm thinking about ways to expand the project functionality. If you're considering using it and have any suggestions, I'd love to hear them.

<h4>The Terrible Legacy of Legacy (2008-05-08)</h4>
Out of curiosity, I recently started reading the <a
href="http://www.haskell.org/pipermail/haskell-prime/">haskell-prime
mailing list</a>. <a href="http://hackage.haskell.org/trac/haskell-prime/">Haskell'</a> is
the interim name for the language standard that will supersede
Haskell98, like <a href="http://www.artima.com/cppsource/cpp0x.html">C++0x</a> is to
C++03. There's been some discussion on the list about where to break
backward compatibility with Haskell98, and which features are worth
the trouble. It's interesting to see a language just starting to have
to deal with these issues. On one side, you've got researchers and
language designers angling to improve the language as much as possible
before a new standard is nailed down. On the other are mainly
industry users, pushing for less change and more backward
compatibility, so they don't have to spend time and money upgrading
their codebase or working with old and unsupported tools.
<p>
The influence of backward compatibility on software is hard to
overestimate. Windows Vista is still binary-compatible with many
MS-DOS programs dating back to 1981, and the <a href="http://en.wikipedia.org/wiki/86-DOS">DOS API</a> was in turn
meant to make it easy to port CP/M programs from the late ’70s.
Meanwhile, Mac OS X and Linux are both Unix variants, with some core
API features dating back to the early ’70s.
<p>
The situation with processor instruction sets is similar: the
processors in today's PCs (and Macs!) are backward-compatible with
Intel's original 8086, which itself was designed so that assembly for
the 8008 — the first 8-bit microprocessor, released in 1972
— could be converted automatically.
<p>
This means that there are 30-year-old programs which would require
very little modification to run on today's operating systems. And
there's no reason to expect that in another 30 years we won't still be
using systems with an unbroken 60-year-long chain of backward
compatibility.
<p>
It's not that these technologies were so <a href="http://web.mit.edu/~simsong/www/ugh.pdf">ingenious</a> that we
haven't managed to think of anything better in the intervening
decades. Rather, when the quality of a software interface gets good
enough to become a widespread standard, the value of any given
improvement on that interface is dwarfed by the value of
compatibility. Progress in any direction that would require a break
in compatibility slows dramatically. The bar for de-facto standards
isn't "great", but merely "<a href="http://www.jwz.org/doc/worse-is-better.html">good enough</a>".
<p>
What this means is that an increasing number of design features in the
software systems we use every day are attributable to <a href="http://www.catb.org/jargon/html/H/hysterical-reasons.html">historical
reasons</a>. That's the terrible legacy of legacy code. The crushing
gravity of installed bases eventually pulls even the best-designed
systems down into a mire of hard-to-learn, hard-to-use arcana.
<p>
A lot of programming languages from the ’90s are feeling that
pressure today, and the result is a number of planned
backward-incompatible major revisions, including <a href="http://docs.python.org/dev/3.0/whatsnew/3.0.html">Python
3000</a>, <a href="http://www.rubyist.net/~matz/slides/rc2003/">Ruby
2</a>, and <a href="http://www.pugscode.org/">Perl 6</a>. I'm going
to go out on a limb and claim that those languages have more users and
larger existing codebases than Haskell does. If they can make
backward-incompatible changes just to clean house, surely a primarily
research-oriented language like Haskell can.
<p>
Don't get me wrong, there are <a href="http://www.joelonsoftware.com/articles/APIWar.html">good
reasons</a> to maintain backward compatibility in a wide variety of
situations. But if you don't have those reasons, why in the world
would you subject your programming language to the mangling, bloating
influence of backward compatibility? While backward compatibility is
a great default position when you don't have any improvements to make,
giving up too much to maintain compatibility is bad for everyone.
<p>
The question is, how does the cost to existing users compare to the
value to <i>all future users</i>? If your language is still growing
in popularity, the number of future users can easily exceed the number
of current users by orders of magnitude. If you don't break
compatibility to fix a problem, you're hurting all the users who will
have to live with the problem for who knows how long, in order to
avoid hurting the few who will have to upgrade now. And if you don't
fix it, someone else will, in a new language, so your users get stuck
in COBOL-support hell eventually anyway. That's if we're lucky. If
we're not lucky, your language will become so popular that the value
of compatibility will outweigh the value of better languages, and your
language will be a drag on progress. It's practically a moral
imperative to fix it <a href="http://glyf.livejournal.com/61347.html">while you still can</a>.
<p>
C++ is an excellent example of a language that has valued backward
compatibility over simplicity and consistency. It's a popular,
practical tool, but few people consider it beautiful, easy to learn,
or easy to work with. As Bjarne Stroustrup <a href="http://www.research.att.com/~bs/hopl-almost-final.pdf">put
it</a>:
<blockquote type="cite">
I consider it reasonably obvious that in the absence of compatibility
requirements, a new language on this model can be much smaller,
simpler, more expressive, and more amenable to tool use than C++,
without loss of performance or restriction of application domains. How
much smaller? Say 10% of the size of C++ in definition and similar in
front-end compiler size. In the "Retrospective" chapter of D&E, I
expressed that idea as "Inside C++, there is a much smaller and
cleaner language struggling to get out". Most of the simplification
would come from generalization — from eliminating the mess of
special cases that makes C++ so hard to handle rather than restriction
or moving work from compile time to run time.
</blockquote>
<p>
What Stroustrup originally said was that "Within C++, there is a much
smaller and cleaner language struggling to get out," which "would
[...] have been an unimportant cult language." I'm not sure I agree
with that last part. Java, for example, is a syntactically similar
language whose designers did decide to give up compatibility with C in
order to achieve greater simplicity and consistency. Even though Java
is virtual-machine-based and unsuitable for a wide variety of systems
programming tasks, its popularity has by <a href="http://www.langpop.com/">some metrics</a> exceeded that of C++.
<p>
What has compatibility bought C++? Java shows that it wasn't a
requirement for popularity. And despite being largely
backward-compatible with C, C++ has had <a href="http://article.gmane.org/gmane.comp.version-control.git/57918">difficulty</a>
supplanting it for many tasks. Indeed, there's more than one place
where a break with backward compatibility might have simplified and
improved C++ at little cost. The <a href="http://en.wikipedia.org/wiki/Modern_C%2B%2B_Design">modern
C++</a> style that has become prominent in recent years bears little
relation to the way C++ was written in the ’90s. How much value
is there, really, in being able to compile both dialects with the same
compilers?
<p>
So that's my position: backward compatibility at the expense of
simplification is only appropriate when you can't gain acceptance any
other way. If a given backward-incompatible improvement isn't going
to cause a fork in the language, its value probably outweighs the
value of compatibility.

<h4>Debugging Science (2008-04-26)</h4>
Computer Science is badly misnamed. For the most part, I don't see
how it's a science at all. If you look at the curriculum of a typical
undergraduate CS program, about half of it looks like a branch of
mathematics. There are classes on abstractions like automata,
algorithms, and computational complexity — areas in which many
important results were worked out by mathematicians long before the
first computers were built. Interspersed with the mathematics are
classes on software engineering; classes about how to build things,
like operating systems, compilers, and networks. There aren't many
labs, because the field isn't too big on experiments. How many people
in CS departments even consider themselves scientists?
<p>
But there's an important component of software engineering that does
qualify as a science: debugging computer programs. When you find a
bug in your program, your mental model of the program has been
invalidated. The program didn't behave as you expected. Your task in
debugging is to find out why. Or, put another way, to find out how
your program actually behaves — just because you see unexpected
results doesn't mean that you understand exactly what your program did
to produce them. Experiment has invalidated your old theory, and
you're about to perform additional experiments to help you find a new
one. Science!
<p>
Debugging is particularly interesting because it's such a simple
example of science. The relevant experiments can be performed rapidly
and without expensive equipment. The measurements are precise. The
problems can be simple, or very complex. It seems like an ideal model
for investigating how science works. If you want to learn something
about effective scientific technique, look at the neat little
microcosm of debugging technique.
<p>
For example, some people are very bad at debugging. These people
typically "guess and check": they guess at what must be going wrong,
"fix" their assumed problem, and then check to see whether the
program's behavior improves. Sometimes you get lucky this way, but if
there are a lot of possibilities, you can spend a long time guessing.
You might even forget what you've already tried and end up testing
potential solutions more than once. Or, worse, you could think you've
tried something you haven't.
<p>
This is poor scientific method, but it's also the form of the
scientific method that you were probably taught in grade school. It
has all the components of a science fair project: problem, hypothesis,
experiment, and conclusion. But if you've ever watched someone debug
this way, you know what a disaster it is.
<p>
The problem is that science as described above is only a
<i>filter</i>. Really, all it describes are the inputs and outputs of
experimentation. Given a situation you don't understand (problem) and
a guess about what is going on (hypothesis), you perform some
procedure (experiment) to determine whether your guess is plausible
(conclusion). This explains how to distinguish the plausible guesses
from the implausible ones, but doesn't provide any guidance about
which guesses you should make, or in what order.
<p>In contrast, here's how effective software engineers debug, in my
experience:
<ol>
<li>They find a very general open question about the problem. They don't jump to conclusions about exactly what might be going wrong, like <i>"should the loop in function <tt>f</tt> in module <tt>Foo</tt> start at 1 rather than 0?"</i>. Instead, they start with more general questions, like <i>"which modules are involved?"</i>.
<li>They come up with an experiment that will give as definitive an
answer as is practical. Sometimes this is a potential fix, but often
it's some more direct type of data-gathering, like tracing,
assertions, or inspection at a breakpoint in a debugger.
<li>They perform the experiment.
<li>They repeat the above steps, narrowing down their questions until
they have an accurate idea of what must be going wrong.
</ol>
<p>
What's going on here? The middle steps could be part of the same
flailing experimental process that even the poorest programmers use.
The important parts are really steps 1 and 4, the identification of
open questions that cover a broad range of possibilities, and the
continual narrowing of those questions as you home in on the answer.
<p>
While the "guess and check" method resembled a science fair project,
this method more closely resembles John R. Platt's description of what
he called "strong inference" in his famous speech on effective
scientific technique, <a href="http://256.com/gray/docs/strong_inference.html">The New Baconians</a>:
<blockquote type="cite">
<ol>
<li> Devising alternative hypotheses;
<li> Devising a crucial experiment (or several of them), with alternative possible outcomes, each of which will, as nearly is possible, exclude one or more of the hypotheses;
<li> Carrying out the experiment so as to get a clean result;
<li> Recycling the procedure, making subhypotheses or sequential hypotheses to refine the possibilities that remain, and so on.
</ol>
</blockquote>
<p>
Where the scientific method you were taught in grade school is a
filter, strong inference is a <i>search</i>. The idea goes back to
Francis Bacon, who described science as tree traversal using the metaphor of "fingerposts which are set up where roads part, to indicate the several directions". Using the
filter of experimentation to distinguish valid explanations from
invalid ones, strong inference describes a method of ordering the space of possible
explanations, and sort of binary-searching (actually, n-ary-searching)
your way down to the correct explanation.
<p>
Platt didn't talk about starting with general open questions, although
he got close with his comment about "subhypotheses". Part of the
reason Platt's speech was so popular is that effective scientific
technique isn't easy to pin down. Like the algorithm you use when
searching through the phone book, debugging and other science may be
easier to do than it is to precisely specify. (Hint: searching the
phone book is sort of an <a href="http://en.wikipedia.org/wiki/Interpolation_search">interpolation
search</a>.) But debugging at least provides more convenient guiding
examples than the slower, more complex sciences.
<p>
So how does this effective debugging procedure work, in detail? The
fundamental technique is to arrange the search space into a tree based
on the generality of the assertions each hypothesis makes. Every
hypothesis makes at least one assertion, which can be viewed as a
boolean question: is the assertion true, or false? Your tree branches
for each of those possibilities. A more general hypothesis like "the
problem is in function f" appears higher in your search tree than one
that makes more specific assertions, like "the problem is in function
f, and f's local variable foo can be null". Then you traverse the
tree, doing an experiment at each level to select the correct branch.
<p>
One variant on this process has been automated in <a href="http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html">git
bisect</a>. When you know that a bug used to be absent from your
code, you can start with a general hypothesis that the bug was
introduced somewhere in a large range of versions, and narrow that
hypothesis down to smaller ranges and eventually a specific version.
Given a test case and an old revision that doesn't contain the bug,
you can literally binary search your way to the exact revision in
which the bug was introduced. From there, you may need to do
additional searching to get down to the specific problem.
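<p>
The search at the core of that is tiny. Here's a Python sketch of the invariant git bisect exploits (my own illustration, not git's actual code): the test flips from passing to failing exactly once along the revision history, so binary search finds the flip point.
<blockquote><pre>def find_first_bad(is_bad, good, bad):
    # Precondition: revision `good` passes the test, revision `bad`
    # fails it, and once the bug appears it stays (no flip-flopping).
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # bug was introduced at or before mid
        else:
            good = mid  # bug was introduced after mid
    return bad          # the first failing revision</pre></blockquote>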
<p>
Sometimes a single experiment can resolve more than one binary
question at a time, so it's useful to look at this process as an n-ary
search through a tree, rather than just a binary search. At any given
level of the tree, you might have branches for more than two
alternative hypotheses.
<p>
Similarly, sometimes you run into a number of equally-general
hypotheses. For example, if you already know that the problem must be
in exactly one of three functions, it might not be clear which one to
test first. One useful ordering for equally-general hypotheses is by
the probability of successful identification of a plausible branch.
If you know that functions <tt>f</tt> and <tt>g</tt> are old and have
been subject to a fair amount of testing, but function <tt>h</tt> is
much newer, you might test <tt>h</tt> first. The idea of arranging
multiple levels of a search tree according to that principle reminds
me of <a href="http://en.wikipedia.org/wiki/Huffman_coding">Huffman
coding</a>; you organize your tree so that the most probable
hypotheses can be investigated in the least time.
<p>
Of course, questions aren't all created equal, and there isn't always
an easy experiment to resolve every question. You might find that the
obtainable evidence supports more than one hypothesis at some level in
your tree. This happens all the time in multithreaded programming,
where complex race conditions and deadlocks can be extremely difficult
to reproduce. If you can't devise an experiment to determine which of
two branches is correct, you have to take both branches, and hope one
of them is invalidated further down. You can do this one branch at a
time: if the branch you choose first is invalidated, you backtrack to
focus on the correct branch. This means that in science, having
multiple camps in support of divergent theories is perhaps a good
thing: it's parallel processing! On the other hand, there's something
to be said for the idea that a single scientist should be able to find
more than one incomplete explanation plausible simultaneously.
<p>
Worse than merely taking the wrong branch first is what happens if you
fail to consider all the possible explanations at some level in your
tree. You could wind up following down some initially plausible but
ultimately incorrect path without even being aware of the alternative.
In debugging, when you realize you've done this, it's not too
dramatic. Maybe you slap yourself on the forehead. But in more
complex subjects like physics or chemistry, major backtracks to
previously unimagined possibilities are what Thomas Kuhn called
<a href="http://en.wikipedia.org/wiki/The_Structure_of_Scientific_Revolutions">"paradigm shifts"</a> — these are upsets like special relativity,
and the abandonment of the phlogiston theory of combustion.
<p>
The fact that Kuhn wrote about paradigm shift in the 1960s, hundreds
of years after Francis Bacon wrote about conditional inductive trees,
shows how our understanding of scientific technique has lagged behind
our ability to use it. It might make sense to use debugging as a
model for teaching science. Students could develop methodological
intuition on a simplified model, rather than trying to accumulate it
over years of much slower work with a harder problem. Of course,
fledgling scientists would have to learn some computer programming to
be able to play with debugging tasks. There's a fair argument that
tomorrow's scientists would do well to pick up a little bit of
computer programming anyway. But, simplified computational models of
physics or chemistry could provide just as nice a model as debugging.
In fact, in computational biology, serious scientific experiments are
already being done on computer models of physics, rather than the real
thing! Why not start with something simpler, for educational
purposes?
<p>
<span style="font-size:80%">
<b>Update</b>: I originally asserted that Francis Bacon, "using modern CS terminology three hundred years ahead of time, called [strong inference] a 'conditional inductive tree'". I was confused — those were Platt's words, not Bacon's.</span>

<h4>PHP and Lua Support for codepad.org (2008-04-13)</h4>
I spent last week making some fairly substantial improvements to the <a href="http://codepad.org">codepad.org</a> back-end. The highlight for the moment is that codepad now speaks PHP and Lua.

<h4>Tree Lists as a Default List Data Structure (2008-03-13)</h4>
I'd like to call your attention to <a
href="http://www.spiteful.com/about/">Tom Kleinpeter</a>'s new <a
href="http://www.spiteful.com/">blog</a>, in which he's been posting
lots of great stuff about his experiences building and scaling
Audiogalaxy and FolderShare. I worked with Tom at both of those
companies, helping him build the p2p clients and backend systems. If
you know me, you know that I think most software engineers are pretty
crappy. Statistically speaking, I probably think you're a pretty
crappy engineer. Tom, on the other hand, is a really good engineer,
one of the best I've worked with. His blog is part of what inspired
me to finally finish setting up this one.
<p> Anyway, in an <a
href="http://www.spiteful.com/2008/02/25/5-more-essentials-for-your-programming-toolbox/">earlier
article</a>, he mentioned <a
href="http://en.wikipedia.org/wiki/Skip_list">skip lists</a> and <a
href="http://en.wikipedia.org/wiki/Unrolled_linked_list">unrolled link
lists</a>, which reminded me of my love for this category of list data
structures.
<p> Most languages that have a built-in list type have one of two
alternatives (and often strangely lack any standard implementation of
the other alternative):
<ul>
<li><a href="http://en.wikipedia.org/wiki/Linked_list">Linked
lists</a> are the popular built-in list type in functional programming
languages like Scheme and Haskell. They allow fast O(1) inserts and
deletes at any point in the list. Unfortunately, lookup is O(n),
which makes many common insert and delete operations (those that don't
start from an iterator in the correct position in the list) O(n) as
well. Memory overhead for linked lists involves at least one link
pointer for every element in the list. All in all, they're an
algebraically fundamental but pragmatically somewhat specialized type of
list.<br/><br/>
<li>Array lists (that is, lists implemented as arrays which are
dynamically resized to fit the length of the list, usually by
doubling/halving the array length when necessary) are popular in many
other dynamic languages, such as Python, Ruby, and Perl, and are the
most popular type of list in the collection libraries of many
languages that don't have built-in list support, including C++, Java,
and C#. Since they require shifting large portions of the array around
in memory, inserts, deletes, and prepends are O(n), but appends and
lookups are O(1) (and cache-friendly). Amortized memory overhead
is lower than for linked lists, since link pointers are not required,
and in the worst case the overhead is equal to that of linked lists.
They're practical for a wide range of applications, but fall down when
you want to insert or delete in the middle of the list (see the quick
timing sketch just after this list).
</ul>
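<p>
You can see the array-list trade-off described above for yourself with a quick (and admittedly unscientific) timing check in Python; absolute numbers vary by machine, but appends come out orders of magnitude cheaper than front inserts:
<blockquote><pre>import timeit

setup = "xs = list(range(100000))"
# Appends amortize to O(1)...
print(timeit.timeit("xs.append(0)", setup=setup, number=10000))
# ...but each front insert shifts the whole array: O(n).
print(timeit.timeit("xs.insert(0, 0)", setup=setup, number=10000))</pre></blockquote>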
<p> Tree-based list data structures — let's call them "tree
lists" — are a compromise between those two options. The basic
strategy for tree lists is to split up your list into segments stored
as leaves in a tree (as either linked lists or array lists). Lookups
in a tree list require tree traversal, but not traversal of the entire
list. Inserts and deletes may require shifting some elements around
in memory, but not many of them. Inserts, deletes, prepends, appends,
and lookups all wind up in the neighborhood of O(log n). That's
acceptable performance for a wide variety of operations.
<p>Skip lists and unrolled link lists, though they use unusual
representations of trees, are both examples of this
strategy. (The major trade-off between them is that by using linked
lists for the leaves, skip lists retain the O(1) insert/delete
performance of linked lists, but at the cost of being less
cache-friendly. Unrolled linked lists make the opposite choice.) Files in many filesystems are also implemented as tree lists, because of another advantage of these structures over array lists: segmenting the list allows for non-contiguous storage, which can help avoid expensive move operations in the heap as well as on disk.
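<p>
To make the segmenting strategy concrete, here's a toy Python sketch of my own (not how BList or any real library is built): a flat sequence of fixed-size chunks, so an insert shifts at most one small chunk. A real tree list arranges the chunks as leaves of a tree, replacing this toy's linear chunk walk with an O(log n) descent.
<blockquote><pre>class ChunkedList:
    # Toy "tree list" with one level of segmenting: an insert shifts
    # at most one small chunk, never the whole list.
    CHUNK = 64

    def __init__(self):
        self.chunks = [[]]

    def _locate(self, i, for_insert=False):
        for ci, chunk in enumerate(self.chunks):
            if i < len(chunk) + (1 if for_insert else 0):
                return ci, i
            i -= len(chunk)
        raise IndexError(i)

    def __getitem__(self, i):
        ci, off = self._locate(i)
        return self.chunks[ci][off]

    def insert(self, i, x):
        ci, off = self._locate(i, for_insert=True)
        chunk = self.chunks[ci]
        chunk.insert(off, x)
        if len(chunk) > self.CHUNK:
            half = len(chunk) // 2  # split a full chunk in two
            self.chunks[ci:ci + 1] = [chunk[:half], chunk[half:]]</pre></blockquote>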
<p><a
href="http://pypi.python.org/pypi/blist/0.9.4">BList</a> is an
implementation of a type of tree list in Python using B+Trees, which have the
advantage that for small lists, they retain the excellent performance
characteristics of small array lists. Python has <a
href="http://www.python.org/dev/peps/pep-3128/">rejected the
proposal</a> that standard Python lists be replaced with BList, and I
can see why that makes sense for Python; drastically changing the
performance characteristics of a basic type used in most existing code
seems like a bad idea. But I think it would be interesting for a new
language to adopt something like BList for its built-in list type.
Tree lists in general seem like a good generic choice for lists, since
from a performance perspective they're less specialized than linked
lists and array lists.

<h4>Scaling at 2am with EC2 (2008-03-07)</h4>
I really like Amazon EC2. Hosting companies everywhere should be learning about virtualization and web services, because this kind of thing is clearly the future of managed hosting.
<p>
This past Tuesday, I posted about my <a href="http://codepad.org/">codepad.org</a> project on <a href="http://news.ycombinator.com/">news.YC</a>. To my surprise, it got a lot of attention. Traffic increased by 1000x. That still amounts to only a modest amount of traffic, but the increase was very sudden, and I'm the only one working on the site. And as it turned out, I had a serious bug in my backend execution servers. I managed to keep the site working most of the time anyway, though, because EC2 saved my bacon.
<p>
First, a couple details about the architecture: codepad.org's backend untrusted-code-execution servers run in heavily firewalled virtual machines on Amazon EC2 instances (which are themselves firewalled and virtual -- it's like turtles all the way down or something.) The web frontend, running on a traditional colocated server, transmits code to the backend and gets output in response. The backend servers don't store any state between requests, so they're easy to replace, and easy to scale horizontally.
<p>
At the beginning of the influx of traffic on Tuesday, I had only one execution server instance running. I quickly discovered that it couldn't keep up with the load, so I launched a couple more instances, and had my web frontend load-balance between them. This is the benefit I expected from EC2: requisitioning new machines takes seconds rather than days, and cloning them from existing servers is effortless.
<p>
Then I discovered that I had a bug. My execution servers would occasionally stop servicing requests for one language or another. At first, when this happened, I would just replace the machine in a hurry, and then continue to hunt for the bug. Because of the firewalling and virtualization involved, just replacing the whole machine was a lot more convenient than trying to swap in a new server process would have been.
<p>
As traffic increased, the lock-ups started happening too frequently for me to replace machines by hand. The first thing I did was to spin up 10 machines, so that load on each machine would be lower, making failures both less frequent and less problematic (since a single-machine failure would only affect 10% of users.) Not long after that, I got a script running that would switch out all 10 machines for new ones every hour. This actually got the site working flawlessly, before I'd even found the bug! It allowed me to get some sleep on Tuesday night.
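<p>
In outline, the swapping script did something like this (a reconstruction for illustration; <tt>launch_instance</tt>, <tt>terminate_instance</tt>, and <tt>update_frontend</tt> are hypothetical stand-ins for the EC2 API calls and the frontend config push, not real library functions):
<blockquote><pre>import time

# launch_instance / terminate_instance / update_frontend are
# hypothetical stand-ins for the EC2 API and the config push.
POOL_SIZE = 10

pool = [launch_instance() for _ in range(POOL_SIZE)]
update_frontend([m.address for m in pool])

while True:
    time.sleep(3600)  # swap out the whole pool every hour
    fresh = [launch_instance() for _ in range(POOL_SIZE)]
    update_frontend([m.address for m in fresh])  # route traffic to the new pool
    for old in pool:
        terminate_instance(old)  # backends are stateless, so this is safe
    pool = fresh</pre></blockquote>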
<p>
Of course, I could have done the same thing at the process level, rather than the machine level. But I've done things like that in the past, and I was surprised by how much easier this was. If you don't write your servers right, switching them out when they fall over becomes a mess of sorting out config files and ports. If they use a lot of memory, you can get seriously stuck, because running them in parallel can cause them to start swapping when you need it least. With machines, there's no chance that they'll step on one another's toes like that. You just start up a new one, change the address in the frontend config, and kill the old one.
<p>
Another benefit of this approach is that the routines I built up in my machine-swapping script are of general utility. Parts of that script are going to be useful in building a system to launch and kill machines automatically in response to fluctuations in load. I'm really curious to find out whether, by using only the number of machines I need at any particular time, I'll be able to make EC2 cheaper than traditional managed hosting, despite slightly higher costs per machine-hour. Is there enough variance (and slow enough variations) in typical web site traffic to do that?

<h4>Murmur and Hash Tables for Scalability (2008-03-03)</h4>
<a href="http://tanjent.livejournal.com/755617.html">Murmur</a> is a new (non-cryptographic) hash function. Tanjent says: "Way faster than <a href="http://www.azillionmonkeys.com/qed/hash.html">Hsieh</a>, better mixing than <a href="http://burtleburtle.net/bob/hash/doobs.html">Jenkins</a>, as simple as <a href="http://www.isthe.com/chongo/tech/comp/fnv/">FNV</a>." He's also made some <a href="http://tanjent.livejournal.com/756375.html">speed and mixing improvements</a> since its initial release.
<p>
This got me thinking about hash-based load distribution, in which the modulus of a hashed identifier is used to route a request to the machine that will be responsible for handling it, so that load is distributed in a way that allows server machines to keep local state across a series of requests (many load balancers call this "affinity"). Under heavy load, fast hash functions with poor mixing can route too many requests to one machine, overwhelming it while the others have plenty of unused capacity. Cryptographic hash functions will eliminate that problem by distributing requests more evenly, but are sometimes too CPU-intensive on the router. Functions like Hsieh and Jenkins are useful in cases like that, and Murmur might be even better.
<p>
I was talking about this with my brother, and he came up with a way to trade away memory for CPU in applications where the number of possible identifiers is small enough: use a very fast hash function as the basis of a large hash table on the router, mapping identifiers to slow-computed hashes which can in turn be used to select servers. The first time you see an identifier, you perform a slower hash, store it in your local hash table, and use the slow-computed hash value to pick a server. Next time you see the same identifier, you'll find the slow-computed hash in your table at the reduced CPU cost of a much quicker hash. If your identifiers are unlimited, though, your hash table will just keep growing, unless you turn it into an LRU cache by culling old entries.
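<p>
A sketch of that idea in Python, with <tt>hashlib</tt>'s SHA-256 standing in for the slow, well-mixed hash and the dict's own fast hashing of the key serving as the quick one:
<blockquote><pre>import hashlib

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
slow_hashes = {}  # identifier -> well-mixed hash, computed once

def route(identifier):
    # The dict lookup costs one fast hash of the identifier; the
    # expensive, well-mixed hash is only computed on a miss.
    h = slow_hashes.get(identifier)
    if h is None:
        digest = hashlib.sha256(identifier.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        slow_hashes[identifier] = h
    return SERVERS[h % len(SERVERS)]</pre></blockquote>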
<p>
Another approach is to generate a random number instead of using a slow-computed hash. The hash function and random number generation are fast, and a pseudo-random number generator should get a very even distribution. An identifier will only map to the same machine for as long as your router stays up, though, so this approach is only appropriate if the cost of having requests sent to the wrong machines is low.
<p>
<b>Update:</b> <a href="http://murmurhash.googlepages.com/">Murmur now has a web site</a>.

<h4>Code Blocks and C-like Lambda Expressions (2006-10-07)</h4>
<p>
A while back, I started using Ruby for some personal software projects. I was immediately impressed with its "code block" syntax. It's a fairly natural Algol-like syntax for lambda expressions, and is used to great effect in the language. I'd like to advocate wider adoption of this kind of syntax, and suggest some improvements.
<p>
Briefly, Ruby functions can take a special "code block" parameter, which appears in their scope as a function called "yield":
<blockquote><pre>def twice()
yield()
yield()
end</pre></blockquote>
<p>
The magic <tt>yield</tt> function is then specified using the code block syntax, when <tt>twice</tt> is called:
<blockquote><pre>twice() do
print "hello"
end</pre></blockquote>
<p>
Output:
<blockquote><pre>hello
hello</pre></blockquote>
<p>
(Ruby also allows the use of the more C-like <tt>{}</tt> delimiters -- which I generally prefer -- in place of <tt>do</tt>/<tt>end</tt>, but <tt>do</tt>/<tt>end</tt> matches the Algol-like syntax of Ruby's built-in constructs, so I prefer it in that context.)
<p>
Like regular function declarations, code blocks can take parameters:
<blockquote><pre>def with(a, b)
yield(a, b)
end
with(7, 3) do |x, y|
print x+y
end</pre></blockquote>
<p>
Output:
<blockquote><pre>10</pre></blockquote>
<p>
Ruby uses this for all kinds of familiar things, both from procedural and functional languages. Here are some examples:
<blockquote><pre><i># a foreach loop over a list</i>
[1, 2, 3].each do |x|
print x
end
<i># mapping</i>
squares = [1, 2, 3].map do |x|
x*x <i># note the handy auto-return here</i>
end
<i># like what some languages call "using" or "with"</i>
File::open('foo.txt') do |fd|
<i># the file will automatically be closed when we exit this scope</i>
print fd.read()
end</pre></blockquote>
<p>
First of all, I think this is a great syntax for lambda expressions in Algol- and C-like languages. I prefer the C-like variant in general, so I'll use that from now on. Also, in imitation of C/C++/Java/etc function declaration, I would pull the <tt>|x, y|</tt> signature definitions out in front of the <tt>{}</tt>s. (I think I like the vertical bars better than overloading <tt>(x, y)</tt> to handle both invocation and declaration the way C does. Recruiting <tt>||</tt> for delimiters strikes me as an okay idea, but there are some good reasons to prefer distinct symbols for open and close -- you could also overload <tt>()</tt>, <tt>[]</tt>, or even <tt><></tt>.) So I think the standard syntax for <tt>(lambda (x y) code)</tt> in C-like languages should be something like <tt>|x, y| { code }</tt>. This has the fun consequence that function declaration could be done like this:
<blockquote><pre>foo = |x, y| {
do_stuff(x*y)
}</pre></blockquote>
<p>
Or, for a statically-typed language, maybe something like:
<blockquote><pre>int func foo = |int x, float y| {
do_stuff(x*y)
}</pre></blockquote>
<p>
Some of the things Ruby does with its code blocks are built-in constructs (like <tt>foreach</tt> and <tt>using</tt>) in other Algol-like languages. This got me thinking: it would be nice to have a language in which all the control structures were expressible as functions that take code blocks. In Ruby (using the <tt>{}</tt> syntax) you could write an "if" function and call it like this:
<blockquote><pre>if (a > b) {
do_stuff()
}</pre></blockquote>
<p>
But not like this:
<blockquote><pre>if (a > b) {
do_stuff()
} else {
other_stuff()
}</pre></blockquote>
Ruby's code block syntax is based on a similar idea in Smalltalk, which is powerful enough to express <tt>if</tt>/<tt>else</tt> (although not in C-like syntax):
<blockquote><pre>(a > b)
ifTrue: [ doStuff ]
ifFalse: [ otherStuff ].</pre></blockquote>
<p>
The key concept employed in the Smalltalk code above is just <i>named parameters</i>. There's no reason named parameters couldn't work with the C-like code block syntax -- instead of setting the magic "yield" variable the way Ruby does, just allow function-type parameters to be declared, say, at the end of the parameter list, and allow parameters to be specified by name in much the same way Ruby and Python already do. It would be cute to write a language in which "if" could be implemented like this:
<blockquote><pre>if = |cond, block, else=nil| {
<i># using logic operators for control flow is ugly</i>
(cond and (block() or true)) or (else and (else() or false))
}</pre></blockquote>
<p>
and called like this:
<blockquote><pre>if (cond) {
do_stuff()
} else {
other_stuff()
}</pre></blockquote>
<p>
One interesting thing about a syntax like that is that it won't allow you to implement a standard <tt>while (cond) { code }</tt> construct, because the condition doesn't work like a parameter to a function: it can't be evaluated at call time, it needs to be re-evaluated each time the loop restarts. Instead, you'd get something like <tt>while {cond} { code }</tt>, which I find I prefer -- <tt>()</tt> and <tt>{}</tt> have clear and separate meanings, so we'd be clarifying a situation that sometimes surprises beginners by using the correct brackets for the semantics!

<h4>Close-block Delimiter Symbols Considered Helpful (2006-09-10)</h4>
<p>
One of the first things one notices about Python is that it has "significant whitespace"; in particular, its syntax relies on indentation to delimit blocks of code. Most languages make indentation optional, but a few, like Python and Haskell, require consistent indentation. Some people immediately hate this feature, and some people immediately love it. My own initial reaction was that this was not a good idea, but I didn't feel strongly about it, and I gave it a chance.
<p>
Now I <em>do</em> feel strongly about it.
<p>
Over time I have come to feel that this is the worst single mistake in the design of the Python language. What's particularly bad about this mistake is that it might be propagated into future languages -- it could conceivably irritate programmers like me for decades to come. So this is my attempt to explain why Python should not be imitated in this respect.
<p>
First, let me separate out two aspects of this feature. The first is <em>required indentation</em>. Required indentation is pretty much okay with me. I don't think it's particularly wise to make your language interact badly with the <a href="http://www.jwz.org/doc/tabs-vs-spaces.html">tabs-vs-spaces quagmire</a>, but who knows? Maybe a language feature like this, in a popular language, could finally end that holy war. Anyway, that's not what I'm talking about here. The second aspect, the one I'm interested in, is <em>doing without a close-block delimiter symbol</em>. (An example of block delimiter symbols, if that phrase isn't self-explanitory: Java's block delimiter symbols are "{" and "}".) Note that this is entirely unrelated to the first aspect -- a language could require consistent indentation just like Python does, and also have an explicit close-block delimiter.
<p>
Oh, you say, but that would be redundant. How horrible and ugly! Well, contrary to popular opinion, Python is not entirely without block delimiter symbols. In fact, every Python block must begin with the open-block delimiter, ":", like this:
<blockquote><pre>if a == b<b>:</b>
do_stuff()</pre></blockquote>
<p>
Compare this to the equivalent Ruby code, which has a close-block delimiter, "end", but no explicit open-block delimiter:
<blockquote><pre>if a == b
do_stuff()
<b>end</b></pre></blockquote>
<p>
So there is no ideological purity involved in Python's abandonment of the close-block delimiter; it already has a redundant block delimiter, it just has the wrong one! Why the "wrong" one? Because, perhaps unlike open-block delimiters, close-block delimiter symbols are <em>useful</em>. They're so useful that even Python can't get along without one: it has a keyword that it uses a lot like a restricted close-block delimiter symbol for a few special cases where it just can't get around the need for such a thing! It's called "pass", and it's used in these cases:
<blockquote><pre>
# I don't care whether this works or not
try:
this_might_fail()
except:
<b>pass</b></pre>
<pre># make this method do nothing in this subclass
def override_method():
<b>pass</b></pre></blockquote>
<p>
Again, some Ruby for comparison:
<blockquote><pre># make this method do nothing in this subclass
def override_method()
<b>end</b></pre></blockquote>
<p>
Okay, so technically, pass isn't a close-block delimiter, it's a no-op. Other lines can follow it at the same indentation level. But it's required for otherwise-empty blocks, and there's no reason for it to exist other than to make clear the existence of such a block, and in practice it is used very much like a close-block delimiter. The reasoning here is interesting. When a block contains lines of code, Python's designers presumably feel that the meaning of the code is clear without a close-block delimiter. But when a block contains no code, it's somehow too weird; one "def" statement directly before another, or an "except" with unindented lines after it would be too confusing. So an indication that the block is empty and over is desirable, even though it's not strictly necessary in order for the program to be parsable.
<p>
I also find it interesting that the case with open-block delimiters is not so weird -- in languages which lack them, as Ruby does, the only consequence is that while some blocks start with something like "if a == b", others start with a more generic symbol, like "begin". Many languages do that anyway, even when they have an open-block delimiter, as with the C/C++/Java "do {} while();".
<p>
Speaking of which, you may have noticed that Python lacks a do-while loop. Why? I think it's <a href="http://mail.python.org/pipermail/python-dev/2006-February/060718.html">because it would be awkward</a>:
<blockquote><pre>do:
some_stuff()
while a > b
foo()</pre></blockquote>
<p>
Because of the lack of a close-block delimiter symbol, "while a > b" above looks like an ordinary Python statement, or the start of a new while loop, after the end of the block. Tellingly, <a href="http://www.python.org/dev/peps/pep-0315/">the proposal for adding a do-while loop to Python</a> suggests that it be integrated with the existing while loop, so that the "while" clause could open a new block, sort of like the way if/else and try/except work. The proposed syntax for the example I gave above would employ "pass" to visually close the block, like this:
<blockquote><pre>do:
some_stuff()
while a > b:
<b>pass</b>
foo()</pre></blockquote>
<p>(This combination of "do-while" and "while" into a single looping construct is actually very appealing for other reasons -- I encourage you to read the reasoning in <a href="http://www.python.org/dev/peps/pep-0315/">the PEP</a> to see why more languages should have a construct like Python's proposed do-while-else.)
<p>
Aside from simple readability, there are some important reasons why close-block delimiter symbols are useful. Let me point out what a big deal that is. The whole question about how to do block-delimiting is usually about readability. The questions are things like "does {} look better than begin/end?", and "is (foo bar) clearer than <foo>bar</foo>?", and "which lines should the delimiters go on?" -- questions about what the code looks like. When you use indentation as your only close-block delimiter, though, there are some problems that go <em>beyond</em> readability. Here are the ones I'm aware of:
<ul>
<li> Many technologies are bad at preserving indentation in text. Pasting Java or Ruby code into a chat message or web page (without &lt;pre&gt;) might make it look uglier than you'd like, but pasting Python code into the same places is likely to <em>destroy information</em>, rendering the code uninterpretable or just subtly incorrect. I see this happen pretty frequently.
<p>
<li> When moving a section of code from one nesting level to another, or changing the indentation scheme of a block of code to match the local convention, close-block delimiter symbols preserve the information about where blocks exist in the moving code section. Indentation is very bad at this, because its addition or removal can be done in a non-atomic way, affecting some lines before others. If you indent by hand, it's easy to lose track, as you move down through the pasted section, of the intended containing block for the next line of code. If you get this wrong in a language like Java, your code may look confusing, but it will behave correctly. In a language like Python, your code can quietly do the wrong thing! (This has actually happened to me.) Getting this right without undue mental effort requires special editor features, but even those can't always save you. My editor happens to have those features, and I still occasionally get bitten by this when someone else's editor interprets tab characters differently from mine, or when a nesting-level change is not a simple copy-and-paste operation (for example, when an enclosing conditional block is being removed, or when two-space indentation is being converted to four-space indentation.)
</ul>
Sure, these are not such horrible problems. They don't irritate me most days. But they do irritate me <em>sometimes</em>, and there's no reason why I should ever have to tolerate them! I get nothing in return. Redundant close-block delimiter symbols have no associated cost, and some tangible benefits. Please, when designing a new language -- even a language that requires consistent indentation -- include a close-block delimiter symbol!