codepad.org now has project pastebins, which allow groups of developers to work with their own history of recently pasted code. Here's an example. I'm thinking about ways to expand the project functionality. If you're considering using it and have any suggestions, I'd love to hear them.
Sunday, May 11, 2008
Thursday, May 8, 2008
The Terrible Legacy of Legacy
Out of curiosity, I recently started reading the haskell-prime mailing list. Haskell' is the interim name for the language standard that will supersede Haskell98, like C++0x is to C++03. There's been some discussion on the list about where to break backward compatibility with Haskell98, and which features are worth the trouble. It's interesting to see a language just starting to have to deal with these issues. On one side, you've got researchers and language designers angling to improve the language as much as possible before a new standard is nailed down. On the other are mainly industry users, pushing for less change and more backward compatibility, so they don't have to spend time and money upgrading their codebase or working with old and unsupported tools.
The influence of backward compatibility on software is hard to overestimate. Windows Vista is still binary-compatible with many MS-DOS programs dating back to 1981, and the DOS API was in turn meant to make it easy to port CP/M programs from the late ’70s. Meanwhile, Mac OS X and Linux are both Unix variants, with some core API features dating back to the early ’70s.
The situation with processor instruction sets is similar: the processors in today's PCs (and Macs!) are backward-compatible with Intel's original 8086, which itself was designed so that assembly for the 8008 — the first 8-bit microprocessor, released in 1972 — could be converted automatically.
This means that there are 30-year-old programs which would require very little modification to run on today's operating systems. And there's no reason to expect that in another 30 years we won't still be using systems with an unbroken 60-year-long chain of backward compatibility.
It's not that these technologies were so ingenious that we haven't managed to think of anything better in the intervening decades. Rather, when the quality of a software interface gets good enough to become a widespread standard, the value of any given improvement on that interface is dwarfed by the value of compatibility. Progress in any direction that would require a break in compatibility slows dramatically. The bar for de-facto standards isn't "great", but merely "good enough".
What this means is that an increasing number of design features in the software systems we use every day are attributable to historical reasons. That's the terrible legacy of legacy code. The crushing gravity of installed bases eventually pulls even the best-designed systems down into a mire of hard-to-learn, hard-to-use arcana.
A lot of programming languages from the ’90s are feeling that pressure today, and the result is a number of planned backward-incompatible major revisions, including Python 3000, Ruby 2, and Perl 6. I'm going to go out on a limb and claim that those languages have more users and larger existing codebases than Haskell does. If they can make backward-incompatible changes just to clean house, surely a primarily research-oriented language like Haskell can.
Don't get me wrong, there are good reasons to maintain backward compatibility in a wide variety of situations. But if you don't have those reasons, why in the world would you subject your programming language to the mangling, bloating influence of backward compatibility? While backward compatibility is a great default position when you don't have any improvements to make, giving up too much to maintain compatibility is bad for everyone.
The question is, how does the cost to existing users compare to the value to all future users? If your language is still growing in popularity, the number of future users can easily exceed the number of current users by orders of magnitude. If you don't break compatibility to fix a problem, you're hurting all the users who will have to live with the problem for who knows how long, in order to avoid hurting the few who will have to upgrade now. And if you don't fix it, someone else will, in a new language, so your users get stuck in COBOL-support hell eventually anyway. That's if we're lucky. If we're not lucky, your language will become so popular that the value of compatibility will outweigh the value of better languages, and your language will be a drag on progress. It's practically a moral imperative to fix it while you still can.
C++ is an excellent example of a language that has valued backward compatibility over simplicity and consistency. It's a popular, practical tool, but few people consider it beautiful, easy to learn, or easy to work with. As Bjarne Stroustrup put it:
I consider it reasonably obvious that in the absence of compatibility requirements, a new language on this model can be much smaller, simpler, more expressive, and more amenable to tool use than C++, without loss of performance or restriction of application domains. How much smaller? Say 10% of the size of C++ in definition and similar in front-end compiler size. In the "Retrospective" chapter of D&E, I expressed that idea as "Inside C++, there is a much smaller and cleaner language struggling to get out". Most of the simplification would come from generalization — from eliminating the mess of special cases that makes C++ so hard handle to rather than restriction or moving work from compile time to run time.
What Stroustrup originally said was that "Within C++, there is a much smaller and cleaner language struggling to get out," which "would [...] have been an unimportant cult language." I'm not sure I agree with that last part. Java, for example, is a syntactically similar language whose designers did decide to give up compatibility with C in order to achieve greater simplicity and consistency. Even though Java is virtual-machine-based and unsuitable for a wide variety of systems programming tasks, its popularity has by some metrics exceeded that of C++.
What has compatibility bought C++? Java shows that it wasn't a requirement for popularity. And despite being largely backward-compatible with C, C++ has had difficulty supplanting it for many tasks. Indeed, there's more than one place where a break with backward compatibility might have simplified and improved C++ at little cost. The modern C++ style that has become prominent in recent years bears little relation to the way C++ was written in the ’90s. How much value is there, really, in being able to compile both dialects with the same compilers?
So that's my position: backward compatibility at the expense of simplification is only appropriate when you can't gain acceptance any other way. If a given backward-incompatible improvement isn't going to cause a fork in the language, its value probably outweighs the value of compatibility.
Saturday, April 26, 2008
Debugging Science
Computer Science is badly misnamed. For the most part, I don't see how it's a science at all. If you look at the curriculum of a typical undergraduate CS program, about half of it looks like a branch of mathematics. There are classes on abstractions like automata, algorithms, and computational complexity — areas in which many important results were worked out by mathematicians long before the first computers were built. Interspersed with the mathematics are classes on software engineering; classes about how to build things, like operating systems, compilers, and networks. There aren't many labs, because the field isn't too big on experiments. How many people in CS departments even consider themselves scientists?
But there's an important component of software engineering that does qualify as a science: debugging computer programs. When you find a bug in your program, your mental model of the program has been invalidated. The program didn't behave as you expected. Your task in debugging is to find out why. Or, put another way, to find out how your program actually behaves — just because you see unexpected results doesn't mean that you understand exactly what your program did to produce them. Experiment has invalidated your old theory, and you're about to perform additional experiments to help you find a new one. Science!
Debugging is particularly interesting because it's such a simple example of science. The relevant experiments can be performed rapidly and without expensive equipment. The measurements are precise. The problems can be simple, or very complex. It seems like an ideal model for investigating how science works. If you want to learn something about effective scientific technique, look at the neat little microcosm of debugging technique.
For example, some people are very bad at debugging. These people typically "guess and check": they guess at what must be going wrong, "fix" their assumed problem, and then check to see whether the program's behavior improves. Sometimes you get lucky this way, but if there are a lot of possibilities, you can spend a long time guessing. You might even forget what you've already tried and end up testing potential solutions more than once. Or, worse, you could think you've tried something you haven't.
This is poor scientific method, but it's also the form of the scientific method that you were probably taught in grade school. It has all the components of a science fair project: problem, hypothesis, experiment, and conclusion. But if you've ever watched someone debug this way, you know what a disaster it is.
The problem is that science as described above is only a filter. Really, all it describes are the inputs and outputs of experimentation. Given a situation you don't understand (problem) and a guess about what is going on (hypothesis), you perform some procedure (experiment) to determine whether your guess is plausible (conclusion). This explains how to distinguish the plausible guesses from the implausible ones, but doesn't provide any guidance about which guesses you should make, or in what order.
In contrast, here's how effective software engineers debug, in my experience:
- They find a very general open question about the problem. They don't jump to conclusions about exactly what might be going wrong, like "should the loop in function f in module Foo start at 1 rather than 0?". Instead, they start with more general questions, like "which modules are involved?".
- They come up with an experiment that will give as definitive an answer as is practical. Sometimes this is a potential fix, but often it's some more direct type of data-gathering, like tracing, assertions, or inspection at a breakpoint in a debugger.
- They perform the experiment.
- They repeat the above steps, narrowing down their questions until they have an accurate idea of what must be going wrong.
What's going on here? The middle steps could be part of the same flailing experimental process that even the poorest programmers use. The important parts are really steps 1 and 4, the identification of open questions that cover a broad range of possibilities, and the continual narrowing of those questions as you hone in on the answer.
While the "guess and check" method resembled a science fair project, this method more closely resembles John R. Platt's description of what he called "strong inference" in his famous speech on effective scientific technique, The New Baconians:
- Devising alternative hypotheses;
- Devising a crucial experiment (or several of them), with alternative possible outcomes, each of which will, as nearly is possible, exclude one or more of the hypotheses;
- Carrying out the experiment so as to get a clean result;
- Recycling the procedure, making subhypotheses or sequential hypotheses to refine the possibilities that remain, and so on.
Where the scientific method you were taught in grade school is a filter, strong inference is a search. The idea goes back to Francis Bacon, who described science as tree traversal using the metaphor of "fingerposts which are set up where roads part, to indicate the several directions". Using the filter of experimentation to distinguish valid explanations from invalid ones, strong inference describes a method of ordering the space of possible explanations, and sort of binary-searching (actually, n-ary-searching) your way down to the correct explanation.
Platt didn't talk about starting with general open questions, although he got close with his comment about "subhypotheses". Part of the reason Platt's speech was so popular is that effective scientific technique isn't easy to pin down. Like the algorithm you use when searching through the phone book, debugging and other science may be easier to do than it is to precisely specify. (Hint: searching the phone book is sort of an interpolation search.) But debugging at least provides more convenient guiding examples than the slower, more complex sciences.
So how does this effective debugging procedure work, in detail? The fundamental technique is to arrange the search space into a tree based on the generality of the assertions each hypothesis makes. Every hypothesis makes at least one assertion, which can be viewed as a boolean question: is the assertion true, or false? Your tree branches for each of those possibilities. A more general hypothesis like "the problem is in function f" appears higher in your search tree than one that makes more specific assertions, like "the problem is in function f, and f's local variable foo can be null". Then you traverse the tree, doing an experiment at each level to select the correct branch.
One variant on this process has been automated in git bisect. When you know that a bug used to be absent from your code, you can start with a general hypothesis that the bug was introduced somewhere in a large range of versions, and narrow that hypothesis down to smaller ranges and eventually a specific version. Given a test case and an old revision that doesn't contain the bug, you can literally binary search your way to the exact revision in which the bug was introduced. From there, you may need to do additional searching to get down to the specific problem.
Sometimes a single experiment can resolve more than one binary question at a time, so it's useful to look at this process as an n-ary search through a tree, rather than just a binary search. At any given level of the tree, you might have branches for more than two alternative hypotheses.
Similarly, sometimes you run into a number of equally-general hypotheses. For example, if you already know that the problem must be in exactly one of three functions, it might not be clear which one to test first. One useful ordering for equally-general hypotheses is by the probability of successful identification of a plausible branch. If you know that functions f and g are old and have been subject to a fair amount of testing, but function h is much newer, you might test h first. The idea of arranging multiple levels of a search tree according to that principle reminds me of Huffman coding; you organize your tree so that the most probable hypotheses can be investigated in the least time.
Of course, questions aren't all created equal, and there isn't always an easy experiment to resolve every question. You might find that the obtainable evidence supports more than one hypothesis at some level in your tree. This happens all the time in multithreaded programming, where complex race conditions and deadlocks can be extremely difficult to reproduce. If you can't devise an experiment to determine which of two branches is correct, you have to take both branches, and hope one of them is invalidated further down. You can do this one branch at a time: if the branch you choose first is invalidated, you backtrack to focus on the correct branch. This means that in science, having multiple camps in support of divergent theories is perhaps a good thing: it's parallel processing! On the other hand, there's something to be said for the idea that a single scientist should be able to find more than one incomplete explanation plausible simultaneously.
Worse than merely taking the wrong branch first is what happens if you fail to consider all the possible explanations at some level in your tree. You could wind up following down some initially plausible but ultimately incorrect path without even being aware of the alternative. In debugging, when you realize you've done this, it's not too dramatic. Maybe you slap yourself on the forehead. But in more complex subjects like physics or chemistry, major backtracks to previously unimagined possibilities are what Thomas Kuhn called "paradigm shifts" — these are upsets like special relativity, and the abandonment of the phlogiston theory of combustion.
The fact that Kuhn wrote about paradigm shift in the 1960s, hundreds of years after Francis Bacon wrote about conditional inductive trees, shows how our understanding of scientific technique has lagged behind our ability to use it. It might make sense to use debugging as a model for teaching science. Students could develop methodological intuition on a simplified model, rather than trying to accumulate it over years of much slower work with a harder problem. Of course, fledgling scientists would have to learn some computer programming to be able to play with debugging tasks. There's a fair argument that tomorrow's scientists would do well to pick up a little bit of computer programming anyway. But, simplified computational models of physics or chemistry could provide just as nice a model as debugging. In fact, in computational biology, serious scientific experiments are already being done on computer models of physics, rather than the real thing! Why not start with something simpler, for educational purposes?
Update: I originally asserted that Francis Bacon, "using modern CS terminology three hundred years ahead of time, called [strong inference] a 'conditional inductive tree'". I was confused — those were Platt's words, not Bacon's.
Sunday, April 13, 2008
PHP and Lua Support for codepad.org
I spent last week making some fairly substantial improvements to the codepad.org back-end. The highlight for the moment is that codepad now speaks PHP and Lua.
Thursday, March 13, 2008
Tree Lists as a Default List Data Structure
I'd like to call your attention to Tom Kleinpeter's new blog, in which he's been posting lots of great stuff about his experiences building and scaling Audiogalaxy and FolderShare. I worked with Tom at both of those companies, helping him build the p2p clients and backend systems. If you know me, you know that I think most software engineers are pretty crappy. Statistically speaking, I probably think you're a pretty crappy engineer. Tom, on the other hand, is a really good engineer, one of the best I've worked with. His blog is part of what inspired me to finally finish setting up this one.
Anyway, in an earlier article, he mentioned skip lists and unrolled link lists, which reminded me of my love for this category of list data structures.
Most languages that have a built-in list type have one of two alternatives (and often strangely lack any standard implementation of the other alternative):
- Linked
lists are the popular built-in list type in functional programming
languages like Scheme and Haskell. They allow fast O(1) inserts and
deletes at any point in the list. Unfortunately, lookup is O(n),
which makes many common insert and delete operations (those that don't
start from an iterator in the correct position in the list) O(n) as
well. Memory overhead for linked lists involves at least one link
pointer for every element in the list. All in all, they're an
algebraically fundamental but pragmatically somewhat specialized type of
list.
- Array lists (that is, lists implemented as arrays which are dynamically resized to fit the length of the list, usually by doubling/halving the array length when necessary) are popular in many other dynamic languages, such as Python, Ruby, and Perl, and are the most popular type of list in the collection libraries of many langauges that don't have built-in list support, including C++, Java, and C#. Since they require shifting large portions of the array around in memory, inserts, deletes, and prepends are O(n), but appends and lookups are O(1) (and cache-friendly). Amortized memory overhead is lower than for linked lists, since link pointers are not required, and in the worst case the overhead is equal to that of linked lists. They're practical for a wide range of applications, but fall down when you want to insert or delete in the middle of the list.
Tree-based list data structures — let's call them "tree lists" — are a compromise between those two options. The basic strategy for tree lists is to split up your list into segments stored as leaves in a tree (as either linked lists or array lists). Lookups in a tree list require tree traversal, but not traversal of the entire list. Inserts and deletes may require shifting some elements around in memory, but not many of them. Inserts, deletes, prepends, appends, and lookups all wind up in the neighborhood of O(log n). That's acceptable performance for a wide variety of operations.
Skip lists and unrolled link lists, though they use unusual representations of trees, are both examples of this strategy. (The major trade-off between them is that by using linked lists for the leaves, skip lists retain the O(1) insert/delete performance of linked lists, but at the cost of being less cache-friendly. Unrolled linked lists make the opposite choice.) Files in many filesystems are also implemented as tree lists, because of another advantage of these structures over array lists: segmenting the list allows for non-contiguous storage, which can help avoid expensive move operations in the heap as well as on disk.
BList is an implementation of a type of tree list in Python using B+Trees, which have the advantage that for small lists, they retain the excellent performance characteristics of small array lists. Python has rejected the proposal that standard Python lists be replaced with BList, and I can see why that makes sense for Python; drastically changing the performance characteristics of a basic type used in most existing code seems like a bad idea. But I think it would be interesting for a new language to adopt something like BList for its built-in list type. Tree lists in general seem like a good generic choice for lists, since from a performance perspective they're less specialized than linked lists and array lists.
Friday, March 7, 2008
Scaling at 2am with EC2
I really like Amazon EC2. Hosting companies everywhere should be learning about virtualization and web services, because this kind of thing is clearly the future of managed hosting.
This past Tuesday, I posted about my codepad.org project on news.YC. To my surprise, it got a lot of attention. Traffic increased by 1000x. That still amounts to only a modest amount of traffic, but the increase was very sudden, and I'm the only one working on the site. And as it turned out, I had a serious bug in my backend execution servers. I managed to keep the site working most of the time anyway, though, because EC2 saved my bacon.
First, a couple details about the architecture: codepad.org's backend untrusted-code-execution servers run in heavily firewalled virtual machines on Amazon EC2 instances (which are themselves firewalled and virtual -- it's like turtles all the way down or something.) The web frontend, running on a traditional colocated server, transmits code to the backend and gets output in response. The backend servers don't store any state between requests, so they're easy to replace, and easy to scale horizontally.
At the beginning of the influx of traffic on Tuesday, I had only one execution server instance running. I quickly discovered that it couldn't keep up with the load, so I launched a couple more instances, and had my web frontend load-balance between them. This is the benefit I expected from EC2: requisitioning new machines takes seconds rather than days, and cloning them from existing servers is effortless.
Then I discovered that I had a bug. My execution servers would occasionally stop servicing requests for one language or another. At first, when this happened, I would just replace the machine in a hurry, and then continue to hunt for the bug. Because of the firewalling and virtualization involved, just replacing the whole machine was a lot more convenient than trying to swap in a new server process would have been.
As traffic increased, the lock-ups started happening too frequently for me to replace machines by hand. The first thing I did was to spin up 10 machines, so that load on each machine would be lower, making failures both less frequent and less problematic (since a single-machine failure would only affect 10% of users.) Not long after that, I got a script running that would switch out all 10 machines for new ones every hour. This actually got the site working flawlessly, before I'd even found the bug! It allowed me to get some sleep on Tuesday night.
Of course, I could have done the same thing at the process level, rather than the machine level. But I've done things like that in the past, and I was surprised by how much easier this was. If you don't write your servers right, switching them out when they fall over becomes a mess of sorting out config files and ports. If they use a lot of memory, you can get seriously stuck, because running them in parallel can cause them to start swapping when you need it least. With machines, there's no chance that they'll step on one another's toes like that. You just start up a new one, change the address in the frontend config, and kill the old one.
Another benefit of this approach is that the routines I built up in my machine-swapping script are of general utility. Parts of that script are going to be useful in building a system to launch and kill machines automatically in response to fluctuations in load. I'm really curious to find out whether, by using only the number of machines I need at any particular time, I'll be able to make EC2 cheaper than traditional managed hosting, despite slightly higher costs per machine-hour. Is there enough variance (and slow enough variations) in typical web site traffic to do that?
Monday, March 3, 2008
Murmur and Hash Tables for Scalability
Murmur is a new (non-cryptographic) hash function. Tanjent says: "Way faster than Hsieh, better mixing than Jenkins, as simple as FNV." He's also made some speed and mixing improvements since its initial release.
This got me thinking about hash-based load distribution, in which the modulus of a hashed identifier is used to route a request to the machine that will be responsible for handling it, so that load is distributed in a way that allows server machines to keep local state across a series of requests (many load balancers call this "affinity"). Under heavy load, fast hash functions with poor mixing can route too many requests to one machine, overwhelming it while the others have plenty of unused capacity. Cryptographic hash functions will eliminate that problem by distributing requests more evenly, but are sometimes too CPU-intensive on the router. Functions like Hsieh and Jenkins are useful in cases like that, and Murmur might be even better.
I was talking about this with my brother, and he came up with a way to trade away memory for CPU in applications where the number of possible identifiers is small enough: use a very fast hash function as the basis of a large hash table on the router, mapping identifiers to slow-computed hashes which can in turn be used to select servers. The first time you see an identifier, you perform a slower hash, store it in your local hash table, and use the slow-computed hash value to pick a server. Next time you see the same identifier, you'll find the slow-computed hash in your table at the reduced CPU cost of a much quicker hash. If your identifiers are unlimited, though, your hash table will just keep growing, unless you turn it into an LRU cache by culling old entries.
Another approach is to generate a random number instead of using a slow-computed hash. The hash function and random number generation are fast, and a pseudo-random number generator should get a very even distribution. An identifier will only map the the same machine for as long as your router stays up, though, so this approach is only appropriate if the cost of having requests sent to the wrong machines is low.
Update: Murmur now has a web site.