Saturday, April 26, 2008

Debugging Science

Computer Science is badly misnamed. For the most part, I don't see how it's a science at all. If you look at the curriculum of a typical undergraduate CS program, about half of it looks like a branch of mathematics. There are classes on abstractions like automata, algorithms, and computational complexity — areas in which many important results were worked out by mathematicians long before the first computers were built. Interspersed with the mathematics are classes on software engineering; classes about how to build things, like operating systems, compilers, and networks. There aren't many labs, because the field isn't too big on experiments. How many people in CS departments even consider themselves scientists?

But there's an important component of software engineering that does qualify as a science: debugging computer programs. When you find a bug in your program, your mental model of the program has been invalidated. The program didn't behave as you expected. Your task in debugging is to find out why. Or, put another way, to find out how your program actually behaves — just because you see unexpected results doesn't mean that you understand exactly what your program did to produce them. Experiment has invalidated your old theory, and you're about to perform additional experiments to help you find a new one. Science!

Debugging is particularly interesting because it's such a simple example of science. The relevant experiments can be performed rapidly and without expensive equipment. The measurements are precise. The problems can be simple, or very complex. It seems like an ideal model for investigating how science works. If you want to learn something about effective scientific technique, look at the neat little microcosm of debugging technique.

For example, some people are very bad at debugging. These people typically "guess and check": they guess at what must be going wrong, "fix" their assumed problem, and then check to see whether the program's behavior improves. Sometimes you get lucky this way, but if there are a lot of possibilities, you can spend a long time guessing. You might even forget what you've already tried and end up testing potential solutions more than once. Or, worse, you could think you've tried something you haven't.

This is poor scientific method, but it's also the form of the scientific method that you were probably taught in grade school. It has all the components of a science fair project: problem, hypothesis, experiment, and conclusion. But if you've ever watched someone debug this way, you know what a disaster it is.

The problem is that science as described above is only a filter. Really, all it describes are the inputs and outputs of experimentation. Given a situation you don't understand (problem) and a guess about what is going on (hypothesis), you perform some procedure (experiment) to determine whether your guess is plausible (conclusion). This explains how to distinguish the plausible guesses from the implausible ones, but doesn't provide any guidance about which guesses you should make, or in what order.

In contrast, here's how effective software engineers debug, in my experience:

  1. They find a very general open question about the problem. They don't jump to conclusions about exactly what might be going wrong, like "should the loop in function f in module Foo start at 1 rather than 0?". Instead, they start with more general questions, like "which modules are involved?".
  2. They come up with an experiment that will give as definitive an answer as is practical. Sometimes this is a potential fix, but often it's some more direct type of data-gathering, like tracing, assertions, or inspection at a breakpoint in a debugger.
  3. They perform the experiment.
  4. They repeat the above steps, narrowing down their questions until they have an accurate idea of what must be going wrong (a small sketch of this narrowing process follows the list).
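
To make that narrowing concrete, here is a minimal, self-contained sketch in Python. The toy pipeline, its deliberate bug, and the print/assert instrumentation are all invented for illustration; the point is only that each round of instrumentation answers a more specific question than the last.

    def parse(text):
        return [int(tok) for tok in text.split(",")]

    def transform(values):
        return [v * 2 for v in values[:-1]]   # deliberate bug: drops the last value

    def render(values):
        return ",".join(str(v) for v in values)

    def investigate(raw):
        # Round 1 (general question): which stage is involved?
        parsed = parse(raw)
        print("after parse:", parsed)        # looks right, so parse is ruled out
        doubled = transform(parsed)
        print("after transform:", doubled)   # one element short: narrow to transform

        # Round 2 (narrower question): assert a property we believe must hold.
        assert len(doubled) == len(parsed), "transform changed the number of values"
        return render(doubled)

    investigate("1,2,3")   # the failing assertion points at the broken assumption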

What's going on here? The middle steps could be part of the same flailing experimental process that even the poorest programmers use. The important parts are really steps 1 and 4: the identification of open questions that cover a broad range of possibilities, and the continual narrowing of those questions as you home in on the answer.

While the "guess and check" method resembled a science fair project, this method more closely resembles John R. Platt's description of what he called "strong inference" in his famous speech on effective scientific technique, The New Baconians:

  1. Devising alternative hypotheses;
  2. Devising a crucial experiment (or several of them), with alternative possible outcomes, each of which will, as nearly as possible, exclude one or more of the hypotheses;
  3. Carrying out the experiment so as to get a clean result;
  4. Recycling the procedure, making subhypotheses or sequential hypotheses to refine the possibilities that remain, and so on.

Where the scientific method you were taught in grade school is a filter, strong inference is a search. The idea goes back to Francis Bacon, who described science as tree traversal using the metaphor of "fingerposts which are set up where roads part, to indicate the several directions". Using the filter of experimentation to distinguish valid explanations from invalid ones, strong inference describes a method of ordering the space of possible explanations, and sort of binary-searching (actually, n-ary-searching) your way down to the correct explanation.

Platt didn't talk about starting with general open questions, although he got close with his comment about "subhypotheses". Part of the reason Platt's speech was so popular is that effective scientific technique isn't easy to pin down. Like the algorithm you use when searching through a phone book, debugging and other sciences may be easier to do than to specify precisely. (Hint: searching the phone book is sort of an interpolation search.) But debugging at least provides more convenient guiding examples than the slower, more complex sciences.
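
As an aside, here is a small Python sketch of that phone-book style of lookup. It searches a sorted list of numbers, which is a simplification of the real name lookup, and the example list is made up:

    def interpolation_search(keys, target):
        # Guess a position proportional to where target falls in the key range,
        # the way you open a phone book near the back when looking for "Smith".
        lo, hi = 0, len(keys) - 1
        while lo <= hi and keys[lo] <= target <= keys[hi]:
            if keys[lo] == keys[hi]:
                return lo                    # all remaining keys equal the target
            pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
            if keys[pos] == target:
                return pos
            if keys[pos] < target:
                lo = pos + 1
            else:
                hi = pos - 1
        return -1

    print(interpolation_search([2, 5, 8, 13, 21, 34, 55], 21))   # prints 4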

So how does this effective debugging procedure work, in detail? The fundamental technique is to arrange the search space into a tree based on the generality of the assertions each hypothesis makes. Every hypothesis makes at least one assertion, which can be viewed as a boolean question: is the assertion true, or false? Your tree branches for each of those possibilities. A more general hypothesis like "the problem is in function f" appears higher in your search tree than one that makes more specific assertions, like "the problem is in function f, and f's local variable foo can be null". Then you traverse the tree, doing an experiment at each level to select the correct branch.
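
Here is a sketch of that traversal in Python. The hypothesis tree and the "experiments" (plain booleans wrapped in lambdas) are invented; in real debugging each experiment would be a trace, an assertion, or a run under a debugger:

    class Hypothesis:
        def __init__(self, assertion, experiment, children=()):
            self.assertion = assertion      # e.g. "the problem is in function f"
            self.experiment = experiment    # callable: does the evidence support this?
            self.children = list(children)  # more specific hypotheses

    def diagnose(root):
        # Walk the tree, running one round of experiments per level and
        # descending into the first branch the evidence supports.
        node, trail = root, []
        while True:
            trail.append(node.assertion)
            supported = [child for child in node.children if child.experiment()]
            if not supported:
                return trail                # as specific as the evidence lets us get
            node = supported[0]

    # An invented bug that turns out to live in the renderer.
    tree = Hypothesis("the program misbehaves", lambda: True, [
        Hypothesis("the problem is in the parser", lambda: False),
        Hypothesis("the problem is in the renderer", lambda: True, [
            Hypothesis("the renderer mishandles empty input", lambda: True),
        ]),
    ])

    print(diagnose(tree))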

One variant on this process has been automated in git bisect. When you know that a bug used to be absent from your code, you can start with a general hypothesis that the bug was introduced somewhere in a large range of versions, and narrow that hypothesis down to smaller ranges and eventually a specific version. Given a test case and an old revision that doesn't contain the bug, you can literally binary search your way to the exact revision in which the bug was introduced. From there, you may need to do additional searching to get down to the specific problem.
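
Conceptually, the search that git bisect automates looks something like the sketch below; the revision names and the is_bad test are stand-ins for a real history and a real test case, not git's actual interface:

    def first_bad_revision(revisions, is_bad):
        # Binary search for the first revision where is_bad(rev) is true.
        # Assumes revisions run oldest to newest, the oldest is known good,
        # the newest is known bad, and the bug stays present once introduced.
        lo, hi = 0, len(revisions) - 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if is_bad(revisions[mid]):
                hi = mid                    # bug already present here: look earlier
            else:
                lo = mid                    # still good here: look later
        return revisions[hi]

    history = [f"r{i}" for i in range(10)]                             # r0 .. r9
    print(first_bad_revision(history, lambda rev: int(rev[1:]) >= 6))  # prints r6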

Sometimes a single experiment can resolve more than one binary question at a time, so it's useful to look at this process as an n-ary search through a tree, rather than just a binary search. At any given level of the tree, you might have branches for more than two alternative hypotheses.

Similarly, sometimes you run into a number of equally general hypotheses. For example, if you already know that the problem must be in exactly one of three functions, it might not be clear which one to test first. One useful way to order equally general hypotheses is by how likely each one is to be the correct explanation. If you know that functions f and g are old and have been subject to a fair amount of testing, but function h is much newer, you might test h first. The idea of arranging multiple levels of a search tree according to that principle reminds me of Huffman coding; you organize your tree so that the most probable hypotheses can be investigated in the least time.
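
A tiny sketch of that ordering, with made-up probabilities; a real estimate might come from the code's age, churn, or test coverage:

    # Three equally general hypotheses with guessed probabilities of being the
    # culprit; the numbers are invented for illustration.
    hypotheses = [
        ("the problem is in function f", 0.15),   # old, well-tested code
        ("the problem is in function g", 0.15),   # old, well-tested code
        ("the problem is in function h", 0.70),   # new, lightly tested code
    ]

    # Investigate the most probable hypotheses first, so that on average the
    # culprit is found after the fewest experiments.
    for assertion, probability in sorted(hypotheses, key=lambda h: h[1], reverse=True):
        print(f"next experiment: {assertion} (p={probability})")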

Of course, questions aren't all created equal, and there isn't always an easy experiment to resolve every question. You might find that the obtainable evidence supports more than one hypothesis at some level in your tree. This happens all the time in multithreaded programming, where complex race conditions and deadlocks can be extremely difficult to reproduce. If you can't devise an experiment to determine which of two branches is correct, you have to take both branches, and hope one of them is invalidated further down. You can do this one branch at a time: if the branch you choose first is invalidated, you backtrack to focus on the correct branch. This means that in science, having multiple camps in support of divergent theories is perhaps a good thing: it's parallel processing! On the other hand, there's something to be said for the idea that a single scientist should be able to find more than one incomplete explanation plausible simultaneously.
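
In search terms, that is depth-first exploration with backtracking. Here is a self-contained sketch; the hypotheses and experiments (again just lambdas) are invented, and the corrupt-input branch is the one that gets invalidated further down:

    def explore(assertion, experiment, children, trail=()):
        trail = trail + (assertion,)
        if not experiment():
            return []                    # invalidated: backtrack out of this branch
        if not children:
            return [list(trail)]         # a maximally specific surviving explanation
        surviving = []
        for child in children:           # take every branch the evidence supports
            surviving.extend(explore(*child, trail=trail))
        return surviving

    # Two equally supported branches; only one survives deeper experiments.
    tree = ("the output is wrong", lambda: True, [
        ("the input data is corrupt", lambda: True, [
            ("the upstream feed changed format", lambda: False, []),
        ]),
        ("the aggregation is wrong", lambda: True, [
            ("totals overflow on large batches", lambda: True, []),
        ]),
    ])

    print(explore(*tree))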

Worse than merely taking the wrong branch first is what happens if you fail to consider all the possible explanations at some level in your tree. You could wind up following down some initially plausible but ultimately incorrect path without even being aware of the alternative. In debugging, when you realize you've done this, it's not too dramatic. Maybe you slap yourself on the forehead. But in more complex subjects like physics or chemistry, major backtracks to previously unimagined possibilities are what Thomas Kuhn called "paradigm shifts" — these are upsets like special relativity, and the abandonment of the phlogiston theory of combustion.

The fact that Kuhn wrote about paradigm shifts in the 1960s, hundreds of years after Francis Bacon described what Platt would later call conditional inductive trees, shows how our understanding of scientific technique has lagged behind our ability to use it. It might make sense to use debugging as a model for teaching science. Students could develop methodological intuition on a simplified model, rather than trying to accumulate it over years of much slower work on a harder problem. Of course, fledgling scientists would have to learn some computer programming to be able to play with debugging tasks. There's a fair argument that tomorrow's scientists would do well to pick up a little computer programming anyway. But simplified computational models of physics or chemistry could provide just as nice a model as debugging. In fact, in computational biology, serious scientific experiments are already being done on computer models of physics rather than on the real thing! Why not start with something simpler, for educational purposes?

Update: I originally asserted that Francis Bacon, "using modern CS terminology three hundred years ahead of time, called [strong inference] a 'conditional inductive tree'". I was confused — those were Platt's words, not Bacon's.

Sunday, April 13, 2008

PHP and Lua Support for codepad.org

I spent last week making some fairly substantial improvements to the codepad.org back-end. The highlight for the moment is that codepad now speaks PHP and Lua.