<h4>P vs. NP and the Computational Complexity Zoo (2014-08-27)</h4>
At long last, my second video is finished! <br />
<br />
<h4>Hackerdashery #2: P vs. NP and the Computational Complexity Zoo</h4><br />
<p><iframe width="560" height="315" src="//www.youtube.com/embed/YX40hbAHx3s" frameborder="0" allowfullscreen></iframe>

<h4>Hackerdashery YouTube Channel (2013-04-21)</h4>
I'm back! This time in video.<br />
<br />
<iframe width="560" height="315" src="http://www.youtube.com/embed/AmySxYHqQCQ" frameborder="0" allowfullscreen></iframe><br />
<br />
If you liked my blog back in 2008, you might like <a href="http://youtube.com/hackerdashery">my new YouTube channel</a>. I'll be talking about programming and all things tangentially related. Check it out and subscribe for more.

<h4>Google Chrome and the Future of Browsers (2008-09-01)</h4>
The big news today in the nerd world is that <a href="http://blogoscoped.com/archive/2008-09-01-n47.html">Google is going to ship an open-source browser</a>, called <a href="http://googleblog.blogspot.com/2008/09/fresh-take-on-browser.html">Google Chrome</a>.
<p>
Chrome incorporates some of the same ideas <a href="http://bramcohen.livejournal.com/51080.html">Bram</a> and I <a href="http://www.hackerdashery.com/2008/06/web-browsers-and-memory-fragmentation.html">wrote about recently</a>: each site runs in its own process, so that memory can be bulk-reclaimed when you navigate away from a domain or close a tab, and it provides a task-manager-like interface that lets you see which site is using all your CPU or RAM. It also uses a more sophisticated Javascript VM called V8, which features precise, incremental garbage collection and dynamic compilation.
<p>
Between these good design ideas and Google's ability (and intention) to test against their enormous, ranked web cache, I think Chrome is pretty much guaranteed to be a substantially better browser than IE, Firefox, and Safari are today.
<p>
By treating web apps more like regular applications, Chrome, together with <a href="http://gears.google.com/">Gears</a>, constitutes a serious bid by Google to grow browsers into decently capable application platforms almost on the scale of operating systems. That means Google is slowly but surely working to erode Microsoft's monopoly. If browsers provide a high-quality, universally popular, OS-independent platform for applications, OS vendors no longer have any particular advantage in the desktop applications market. You don't have to look too far beyond Chrome to see operating systems becoming completely interchangeable commodities, just invisible layers that provide hardware support and enough infrastructure to run a browser and maybe some games.
<p>
Maybe Google will finally be able to make good on Marc Andreessen's declaration over a decade ago that the browser would make the operating system irrelevant.

<h4>codepad vim plugin (2008-07-15)</h4>
Nicolas Weber has written a handy Vim plugin that adds commands to paste the contents of your current buffer into <a href="http://codepad.org/">codepad</a>, and optionally run it. You can get it <a href="http://www.vim.org/scripts/script.php?script_id=2298">here</a>.
<p>
I spend most of my time in Emacs, so if people find this useful, maybe I'll write an Emacs version. Then again, I've been considering a switch to Vim. I'm thinking maybe its multi-keystroke commands will be easier on my wrists than the control- and meta- chording I do so much of in Emacs. I can't decide whether breaking with a decade of muscle-memory training will be more or less painful.

<h4>Web Browsers and Memory Fragmentation (2008-06-19)</h4>
I've been following <a href="http://blog.pavlov.net/">Stuart Parmenter's blog posts</a> about <a href="http://blog.pavlov.net/2008/03/11/firefox-3-memory-usage/">improving memory usage in Firefox 3</a>. The most interesting part has been his excellent work on reducing <a href="http://blog.pavlov.net/2007/11/10/memory-fragmentation/">memory fragmentation</a>. By making <a href="http://blog.pavlov.net/2007/12/04/vlad-and-analysis-of-dtrace-was-used/">pretty pictures</a> of memory fragmentation under various malloc implementations, he was able to identify <a href="http://people.freebsd.org/~jasone/jemalloc/">jemalloc</a> as the best option for Firefox, and the improvement made by switching was pretty impressive.
<p>
But I suspect that ultimately it won't be enough. The advent of Ajax
is dramatically changing the way browsers use memory. Rather than
loading a fairly static DOM and occasionally doing a few small
manipulations in Javascript as the site is used, modern web apps
mutate the DOM much more extensively, adding and removing sections
repeatedly as new data is received from the server. And they're going
to be doing that kind of thing more and more.
<p>
There's an interesting problem here which is unique to browsers.
Javascript doesn't give programmers direct access to pointers, so it's
free to move objects around in memory however it pleases as it
executes code. Modern Javascript implementations take advantage of
that abstraction, and manage memory allocation with generational
garbage collectors that compact memory by relocating objects as they
operate. But the DOM, which also has to be manipulated by browser
code in C or C++, is reference-counted. The reference counter can't
migrate objects between regions of memory, so it depends on the
underlying malloc implementation to control memory fragmentation.
That's not something even a very smart malloc can always accomplish.
DOMs in particular should be challenging, because they contain large
numbers of strings of varying sizes.
<p>
Reference counting introduces memory fragmentation problems in any
language, but the situation is particularly bad in browsers.
Javascript programs and DOMs on different web pages share the same
process memory space. The fragmentation caused by one page can
interact with the memory used by other pages, and even after a
problematic page is closed, the fragmented memory often can't be
reclaimed. With ordinary applications, fragmented memory is all
neatly reclaimed when the process terminates. Right now, that doesn't
work for web apps.
<p>
But this is a problem for browsers, not for web apps. Individual web
app authors won't be motivated to care if your browser uses a lot of
memory and causes your machine to start swapping once or twice a day,
especially if the blame is shared among many popular sites. Unless
one site is much worse than its peers, high memory usage makes
the browser look bad, not the app.
<p>
So I think browsers will eventually experience pressure to fix this
problem. As the correct metaphor for a web page moves farther from
"document" and closer to "application", maybe it makes sense for
browsers to act more like operating systems. Local memory for a web
page could be allocated in dedicated regions, and could all be
bulk-reclaimed when the page is closed.
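<p>
A toy sketch of that bulk-reclamation model, in Python (my own illustration, not anything a real browser does): every allocation is tagged with the page that made it, and closing the page drops the page's whole region at once.
<blockquote><pre>class PageArenas:
    # Toy model of per-page regions: every allocation is recorded in
    # its page's arena, and closing the page drops the arena wholesale,
    # the way an exiting process gives back all of its memory.
    def __init__(self):
        self.arenas = {}  # page id -> objects allocated for that page

    def alloc(self, page, obj):
        self.arenas.setdefault(page, []).append(obj)
        return obj

    def close_page(self, page):
        # Bulk reclamation: the page's fragmentation can't outlive it.
        self.arenas.pop(page, None)</pre></blockquote>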
<p>
That model is simple, but it still allows individual pages to fragment
their own memory and consume more and more over time. Maybe a better
answer is to find a way to use a generational garbage collector on the
DOM. The familiar OS model of per-process virtual memory management
was designed for running programs written in languages like C and C++,
and that's why it's appealing for handling browser DOMs. But
Javascript is a high-level, functional, garbage-collected language.
If browsers are little operating systems that run Javascript
applications, maybe they should operate more like <a href="http://en.wikipedia.org/wiki/Lisp_machine">Lisp machine</a> <a href="http://lispm.dyndns.org/genera-concepts/genera.html">operating systems</a> than like Unix.

<h4>Private Pastes (and Projects) for codepad.org (2008-06-17)</h4>
I've noticed that some pastes on <a href="http://codepad.org/">codepad.org</a> include confidentiality notices in their copyright boilerplate. You own (or your employer owns) the copyright on code you paste on codepad.org, but it is a public forum! Pasted code appears on the "Recent Pastes" page, for example, where anyone can see it, and search engines can index it.
<p>
It makes sense that some people might want to use the site without revealing their code to the public, though. To support that kind of usage, I've made it possible to flag a paste as <a href="http://codepad.org/help/private-pastes">private</a>, so that it won't be publicly linked on the site, and will include a <a href="http://en.wikipedia.org/wiki/Noindex">noindex meta tag</a> to advise search engines not to index it. Just in case, I went and flagged all the pastes in the database that included the words "confidential" or "copyright" as private pastes.
<p>
While I was at it, I added a similar feature for <a href="http://codepad.org/mkproj">codepad projects</a>. Now you can set up a private codepad project to collaborate on code that you'd rather not show the whole world.
<p>
Since paste URLs are randomly generated, they shouldn't be vulnerable to the kind of URL-guessing attacks <a href="http://blogoscoped.com/archive/2008-01-28-n59.html">SmugMug had problems with earlier this year</a>. Still, these measures aren't really good enough for protecting important trade secrets — especially for projects, where the name might be guessable. If it seems like there's demand for it, I'll consider adding access control lists in the future.

<h4>How Not to Handle Streams (2008-05-15)</h4>
<p>
Let me tell you about what I did yesterday. I set out to write a
little HTTP proxy server that would transfer large (> 1GB) files to and
from Amazon S3, doing a tiny bit of processing on the beginning of the
data as it came through. Sounds pretty simple, right? You're
probably thinking that's just a matter of gluing some HTTP libraries
to some S3 libraries. I thought so too.
<p>
I've been enjoying Haskell lately, so my first thought was to try to
write it in Haskell, even if it meant producing my own Haskell S3
library in the process. But I soon discovered that <a href="http://www.haskell.org/http/">Haskell's HTTP library</a> doesn't
have any facility for streaming large amounts of data over HTTP: it
reads the entire request or response into a <tt>String</tt> in memory.
<p>
This is an extremely common problem. Library authors often don't
consider applications involving large amounts of data, and assume that
everything will fit in memory. This is a particularly sad state of
affairs in Haskell, because the language itself supports streaming
behavior via lazy evaluation — there's no need to even change
the API to support streaming in Haskell. Just make your functions do
lazy IO, and those <tt>Strings</tt> can be used to stream data
as-needed. (In fact, there's been <a href="http://nominolo.blogspot.com/2007/05/networkhttp-bytestrings.html">some
work</a> on this in Haskell.)
<p>
Note also that reading and writing as you process data is supported by
TCP. A lot of people seem to be at least slightly confused about this
situation, particularly in the case of receiving data from a remote
sender. If the other side of a TCP connection is sending data faster
than you can process it, nothing terrible happens. Your kernel's TCP
buffer will fill up, it will advertise a 0-sized receiver window, and
the remote client will stop being able to send. The remote TCP buffer
will fill, and the socket will correctly block, or stop signaling <tt>poll()</tt>,
until it can write again. To the other side, it won't look any
different from a slow, bursty network connection.
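<p>
Here's a little Python sketch of that backpressure in action (using a Unix-domain socket pair for convenience, but kernel buffers behave the same way for TCP): a writer paired with a reader that never reads simply blocks once the buffers fill.
<blockquote><pre>import socket

# Writer vs. a reader that never reads: send() fills the kernel
# buffers and then blocks, instead of anything terrible happening.
a, b = socket.socketpair()
a.settimeout(1.0)  # so the demo exits instead of blocking forever
sent = 0
try:
    while True:
        sent += a.send(b"x" * 4096)
except socket.timeout:
    print("writer blocked after buffering", sent, "bytes")</pre></blockquote>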
<p>
Anyway, simple string-based APIs are disappointing, but it gets worse.
Even when authors do consider the need to deal with large amounts of
data, they often get the API wrong.
<p>
I eventually gave up on Haskell, and switched to Python. Things
started out okay. <a href="http://www.python.org/doc/lib/module-BaseHTTPServer.html">BaseHTTPServer</a>
and <a href="http://docs.python.org/lib/module-urllib2.html">urllib2</a> both
let you stream data to and from file-like objects. But the <a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=134">example
Python S3 library</a> doesn't provide a stream API; it reads
everything into strings. Well, okay, it's just an example. So I took
a look at <a href="http://code.google.com/p/boto/">boto</a>.
<p>
Boto has a variety of methods, some with names that sound promising.
Like <tt>get_contents_to_file</tt>, that'll take a file-like object
and write the data... wait. I'm <i>getting</i> the data <i>from</i>
S3. I want to <i>read</i> it from a file-like object, not pass in a
file-like object for boto to write to. Boto goes out of its way
here to do the wrong thing! Rather than just handing me the wrapped
socket object it no doubt has internally, it reads and writes all the
data in a loop, converting the read operations I want to do into write
operations I'll have to handle. To do any processing on
this data, I'd have to construct a file-like object that can handle
writes from boto!
<p>
Another thing I considered was writing my own S3 interface using <a href="http://pycurl.sourceforge.net/">PycURL</a>. But PycURL requires
you to give it a "write function" callback, which it will call
periodically with data until the entire response body has been read.
It's the same situation as with boto: you have to handle writes where
you would prefer to perform reads. If you wanted to turn that into a
readable file-like object, you'd have to run PycURL in a thread, and
handle writes by putting data into a queue, and blocking when the
queue is full. Then you could read out of the other side of the queue
in the main thread.
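<p>
For reference, here's roughly what that thread-and-queue adapter looks like in Python. This is just a sketch: <tt>start_transfer</tt> is a hypothetical stand-in for whatever callback-driven transfer (PycURL or otherwise) you're wrapping, and the bounded queue is what turns the producer's writes into backpressure.
<blockquote><pre>import queue
import threading

class CallbackReader:
    # Adapts a push-style API (a callback that receives chunks) into a
    # pull-style file-like object. start_transfer is a hypothetical
    # stand-in: it should call the function we hand it once per chunk,
    # and return when the transfer is done.
    _EOF = object()

    def __init__(self, start_transfer, maxsize=16):
        self._q = queue.Queue(maxsize=maxsize)
        self._buf = b""
        self._done = False

        def run():
            # put() blocks when the queue is full, which stalls the
            # producer (and, transitively, its socket reads).
            start_transfer(self._q.put)
            self._q.put(self._EOF)

        threading.Thread(target=run, daemon=True).start()

    def read(self, n):
        while not self._done and len(self._buf) < n:
            chunk = self._q.get()
            if chunk is self._EOF:
                self._done = True
            else:
                self._buf += chunk
        out, self._buf = self._buf[:n], self._buf[n:]
        return out</pre></blockquote>
<p>
Wrap the callback API once, and the rest of your program can just call <tt>read()</tt> as it needs data.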
<p>
For some reason this kind of read/write inversion is almost as common
as string-based interfaces. Sometimes API users are required to
provide a file-like object, and other times it's a callback function.
Either way, the user ends up writing code to handle IO as it is
performed by the API function, rather than performing the IO
themselves. The most convenient read/write direction for an API to
provide is almost always the one provided by the underlying IO: the
one that allows the user to perform the IO directly, as they need it.
<p>
Frustrated with the situation in Python, I considered using Ruby.
Ruby's <a href="http://amazon.rubyforge.org/">AWS::S3</a> has a nicer
interface than boto, and allows me to stream data reasonably. (It actually does the same kind of read/write inversion that PycURL does, but Ruby's block syntax makes that somewhat less annoying.) And <a href="http://mongrel.rubyforge.org/">Mongrel</a> — the web
server everybody uses to host their Rails apps — can be used as
a library for writing HTTP servers, a lot like Python's
BaseHTTPServer. Oh, but wait. Mongrel delivers the HTTP request body
either as a string, or, if it exceeds some configurable size, as a
tempfile. So it'll let me read from a file-like object, but insists
on writing the data to an actual file on disk first!
<p>
Perhaps Mongrel is trying to be too smart, to anticipate my needs
rather than exposing the more generally useful API underneath. In any
case, it didn't work for my application.
<p>
The expedient solution for me was to hack up boto to make it not
deliberately break the functionality I wanted. I'll be submitting a
patch, so perhaps one of the <i>five</i> unhelpful APIs I encountered
yesterday will someday be fixed. Why does this seem to be so
difficult for library authors to get right?

<h4>codepad.org projects (2008-05-11)</h4>
codepad.org now has <a href="http://codepad.org/mkproj">project pastebins</a>, which allow groups of developers to work with their own history of recently pasted code. Here's an <a href="http://example.codepad.org/">example</a>. I'm thinking about ways to expand the project functionality. If you're considering using it and have any suggestions, I'd love to hear them.

<h4>The Terrible Legacy of Legacy (2008-05-08)</h4>
Out of curiosity, I recently started reading the <a
href="http://www.haskell.org/pipermail/haskell-prime/">haskell-prime
mailing list</a>. <a href="http://hackage.haskell.org/trac/haskell-prime/">Haskell'</a> is
the interim name for the language standard that will supersede
Haskell98, like <a href="http://www.artima.com/cppsource/cpp0x.html">C++0x</a> is to
C++03. There's been some discussion on the list about where to break
backward compatibility with Haskell98, and which features are worth
the trouble. It's interesting to see a language just starting to have
to deal with these issues. On one side, you've got researchers and
language designers angling to improve the language as much as possible
before a new standard is nailed down. On the other are mainly
industry users, pushing for less change and more backward
compatibility, so they don't have to spend time and money upgrading
their codebase or working with old and unsupported tools.
<p>
The influence of backward compatibility on software is hard to
overestimate. Windows Vista is still binary-compatible with many
MS-DOS programs dating back to 1981, and the <a href="http://en.wikipedia.org/wiki/86-DOS">DOS API</a> was in turn
meant to make it easy to port CP/M programs from the late ’70s.
Meanwhile, Mac OS X and Linux are both Unix variants, with some core
API features dating back to the early ’70s.
<p>
The situation with processor instruction sets is similar: the
processors in today's PCs (and Macs!) are backward-compatible with
Intel's original 8086, which itself was designed so that assembly for
the 8008 — the first 8-bit microprocessor, released in 1972
— could be converted automatically.
<p>
This means that there are 30-year-old programs which would require
very little modification to run on today's operating systems. And
there's no reason to expect that in another 30 years we won't still be
using systems with an unbroken 60-year-long chain of backward
compatibility.
<p>
It's not that these technologies were so <a href="http://web.mit.edu/~simsong/www/ugh.pdf">ingenious</a> that we
haven't managed to think of anything better in the intervening
decades. Rather, when the quality of a software interface gets good
enough to become a widespread standard, the value of any given
improvement on that interface is dwarfed by the value of
compatibility. Progress in any direction that would require a break
in compatibility slows dramatically. The bar for de-facto standards
isn't "great", but merely "<a href="http://www.jwz.org/doc/worse-is-better.html">good enough</a>".
<p>
What this means is that an increasing number of design features in the
software systems we use every day are attributable to <a href="http://www.catb.org/jargon/html/H/hysterical-reasons.html">historical
reasons</a>. That's the terrible legacy of legacy code. The crushing
gravity of installed bases eventually pulls even the best-designed
systems down into a mire of hard-to-learn, hard-to-use arcana.
<p>
A lot of programming languages from the ’90s are feeling that
pressure today, and the result is a number of planned
backward-incompatible major revisions, including <a href="http://docs.python.org/dev/3.0/whatsnew/3.0.html">Python
3000</a>, <a href="http://www.rubyist.net/~matz/slides/rc2003/">Ruby
2</a>, and <a href="http://www.pugscode.org/">Perl 6</a>. I'm going
to go out on a limb and claim that those languages have more users and
larger existing codebases than Haskell does. If they can make
backward-incompatible changes just to clean house, surely a primarily
research-oriented language like Haskell can.
<p>
Don't get me wrong, there are <a href="http://www.joelonsoftware.com/articles/APIWar.html">good
reasons</a> to maintain backward compatibility in a wide variety of
situations. But if you don't have those reasons, why in the world
would you subject your programming language to the mangling, bloating
influence of backward compatibility? While backward compatibility is
a great default position when you don't have any improvements to make,
giving up too much to maintain compatibility is bad for everyone.
<p>
The question is, how does the cost to existing users compare to the
value to <i>all future users</i>? If your language is still growing
in popularity, the number of future users can easily exceed the number
of current users by orders of magnitude. If you don't break
compatibility to fix a problem, you're hurting all the users who will
have to live with the problem for who knows how long, in order to
avoid hurting the few who will have to upgrade now. And if you don't
fix it, someone else will, in a new language, so your users get stuck
in COBOL-support hell eventually anyway. That's if we're lucky. If
we're not lucky, your language will become so popular that the value
of compatibility will outweigh the value of better languages, and your
language will be a drag on progress. It's practically a moral
imperative to fix it <a href="http://glyf.livejournal.com/61347.html">while you still can</a>.
<p>
C++ is an excellent example of a language that has valued backward
compatibility over simplicity and consistency. It's a popular,
practical tool, but few people consider it beautiful, easy to learn,
or easy to work with. As Bjarne Stroustrup <a href="http://www.research.att.com/~bs/hopl-almost-final.pdf">put
it</a>:
<blockquote type="cite">
I consider it reasonably obvious that in the absence of compatibility
requirements, a new language on this model can be much smaller,
simpler, more expressive, and more amenable to tool use than C++,
without loss of performance or restriction of application domains. How
much smaller? Say 10% of the size of C++ in definition and similar in
front-end compiler size. In the "Retrospective" chapter of D&E, I
expressed that idea as "Inside C++, there is a much smaller and
cleaner language struggling to get out". Most of the simplification
would come from generalization — from eliminating the mess of
special cases that makes C++ so hard to handle rather than restriction
or moving work from compile time to run time.
</blockquote>
<p>
What Stroustrup originally said was that "Within C++, there is a much
smaller and cleaner language struggling to get out," which "would
[...] have been an unimportant cult language." I'm not sure I agree
with that last part. Java, for example, is a syntactically similar
language whose designers did decide to give up compatibility with C in
order to achieve greater simplicity and consistency. Even though Java
is virtual-machine-based and unsuitable for a wide variety of systems
programming tasks, its popularity has by <a href="http://www.langpop.com/">some metrics</a> exceeded that of C++.
<p>
What has compatibility bought C++? Java shows that it wasn't a
requirement for popularity. And despite being largely
backward-compatible with C, C++ has had <a href="http://article.gmane.org/gmane.comp.version-control.git/57918">difficulty</a>
supplanting it for many tasks. Indeed, there's more than one place
where a break with backward compatibility might have simplified and
improved C++ at little cost. The <a href="http://en.wikipedia.org/wiki/Modern_C%2B%2B_Design">modern
C++</a> style that has become prominent in recent years bears little
relation to the way C++ was written in the ’90s. How much value
is there, really, in being able to compile both dialects with the same
compilers?
<p>
So that's my position: backward compatibility at the expense of
simplification is only appropriate when you can't gain acceptance any
other way. If a given backward-incompatible improvement isn't going
to cause a fork in the language, its value probably outweighs the
value of compatibility.

<h4>Debugging Science (2008-04-26)</h4>
Computer Science is badly misnamed. For the most part, I don't see
how it's a science at all. If you look at the curriculum of a typical
undergraduate CS program, about half of it looks like a branch of
mathematics. There are classes on abstractions like automata,
algorithms, and computational complexity — areas in which many
important results were worked out by mathematicians long before the
first computers were built. Interspersed with the mathematics are
classes on software engineering; classes about how to build things,
like operating systems, compilers, and networks. There aren't many
labs, because the field isn't too big on experiments. How many people
in CS departments even consider themselves scientists?
<p>
But there's an important component of software engineering that does
qualify as a science: debugging computer programs. When you find a
bug in your program, your mental model of the program has been
invalidated. The program didn't behave as you expected. Your task in
debugging is to find out why. Or, put another way, to find out how
your program actually behaves — just because you see unexpected
results doesn't mean that you understand exactly what your program did
to produce them. Experiment has invalidated your old theory, and
you're about to perform additional experiments to help you find a new
one. Science!
<p>
Debugging is particularly interesting because it's such a simple
example of science. The relevant experiments can be performed rapidly
and without expensive equipment. The measurements are precise. The
problems can be simple, or very complex. It seems like an ideal model
for investigating how science works. If you want to learn something
about effective scientific technique, look at the neat little
microcosm of debugging technique.
<p>
For example, some people are very bad at debugging. These people
typically "guess and check": they guess at what must be going wrong,
"fix" their assumed problem, and then check to see whether the
program's behavior improves. Sometimes you get lucky this way, but if
there are a lot of possibilities, you can spend a long time guessing.
You might even forget what you've already tried and end up testing
potential solutions more than once. Or, worse, you could think you've
tried something you haven't.
<p>
This is poor scientific method, but it's also the form of the
scientific method that you were probably taught in grade school. It
has all the components of a science fair project: problem, hypothesis,
experiment, and conclusion. But if you've ever watched someone debug
this way, you know what a disaster it is.
<p>
The problem is that science as described above is only a
<i>filter</i>. Really, all it describes are the inputs and outputs of
experimentation. Given a situation you don't understand (problem) and
a guess about what is going on (hypothesis), you perform some
procedure (experiment) to determine whether your guess is plausible
(conclusion). This explains how to distinguish the plausible guesses
from the implausible ones, but doesn't provide any guidance about
which guesses you should make, or in what order.
<p>In contrast, here's how effective software engineers debug, in my
experience:
<ol>
<li>They find a very general open question about the problem. They don't jump to conclusions about exactly what might be going wrong, like <i>"should the loop in function <tt>f</tt> in module <tt>Foo</tt> start at 1 rather than 0?"</i>. Instead, they start with more general questions, like <i>"which modules are involved?"</i>.
<li>They come up with an experiment that will give as definitive an
answer as is practical. Sometimes this is a potential fix, but often
it's some more direct type of data-gathering, like tracing,
assertions, or inspection at a breakpoint in a debugger.
<li>They perform the experiment.
<li>They repeat the above steps, narrowing down their questions until
they have an accurate idea of what must be going wrong.
</ol>
<p>
What's going on here? The middle steps could be part of the same
flailing experimental process that even the poorest programmers use.
The important parts are really steps 1 and 4, the identification of
open questions that cover a broad range of possibilities, and the
continual narrowing of those questions as you home in on the answer.
<p>
While the "guess and check" method resembled a science fair project,
this method more closely resembles John R. Platt's description of what
he called "strong inference" in his famous speech on effective
scientific technique, <a href="http://256.com/gray/docs/strong_inference.html">The New Baconians</a>:
<blockquote type="cite">
<ol>
<li> Devising alternative hypotheses;
<li> Devising a crucial experiment (or several of them), with alternative possible outcomes, each of which will, as nearly is possible, exclude one or more of the hypotheses;
<li> Carrying out the experiment so as to get a clean result;
<li> Recycling the procedure, making subhypotheses or sequential hypotheses to refine the possibilities that remain, and so on.
</ol>
</blockquote>
<p>
Where the scientific method you were taught in grade school is a
filter, strong inference is a <i>search</i>. The idea goes back to
Francis Bacon, who described science as tree traversal using the metaphor of "fingerposts which are set up where roads part, to indicate the several directions". Using the
filter of experimentation to distinguish valid explanations from
invalid ones, strong inference describes a method of ordering the space of possible
explanations, and sort of binary-searching (actually, n-ary-searching)
your way down to the correct explanation.
<p>
Platt didn't talk about starting with general open questions, although
he got close with his comment about "subhypotheses". Part of the
reason Platt's speech was so popular is that effective scientific
technique isn't easy to pin down. Like the algorithm you use when
searching through the phone book, debugging and other science may be
easier to do than it is to precisely specify. (Hint: searching the
phone book is sort of an <a href="http://en.wikipedia.org/wiki/Interpolation_search">interpolation
search</a>.) But debugging at least provides more convenient guiding
examples than the slower, more complex sciences.
<p>
So how does this effective debugging procedure work, in detail? The
fundamental technique is to arrange the search space into a tree based
on the generality of the assertions each hypothesis makes. Every
hypothesis makes at least one assertion, which can be viewed as a
boolean question: is the assertion true, or false? Your tree branches
for each of those possibilities. A more general hypothesis like "the
problem is in function f" appears higher in your search tree than one
that makes more specific assertions, like "the problem is in function
f, and f's local variable foo can be null". Then you traverse the
tree, doing an experiment at each level to select the correct branch.
<p>
One variant on this process has been automated in <a href="http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html">git
bisect</a>. When you know that a bug used to be absent from your
code, you can start with a general hypothesis that the bug was
introduced somewhere in a large range of versions, and narrow that
hypothesis down to smaller ranges and eventually a specific version.
Given a test case and an old revision that doesn't contain the bug,
you can literally binary search your way to the exact revision in
which the bug was introduced. From there, you may need to do
additional searching to get down to the specific problem.
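<p>
The search at the core of that is tiny. Here's a Python sketch of the invariant git bisect exploits (my own illustration, not git's actual code): the test flips from passing to failing exactly once along the revision history, so binary search finds the flip point.
<blockquote><pre>def find_first_bad(is_bad, good, bad):
    # Precondition: revision `good` passes the test, revision `bad`
    # fails it, and once the bug appears it stays (no flip-flopping).
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # bug was introduced at or before mid
        else:
            good = mid  # bug was introduced after mid
    return bad          # the first failing revision</pre></blockquote>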
<p>
Sometimes a single experiment can resolve more than one binary
question at a time, so it's useful to look at this process as an n-ary
search through a tree, rather than just a binary search. At any given
level of the tree, you might have branches for more than two
alternative hypotheses.
<p>
Similarly, sometimes you run into a number of equally-general
hypotheses. For example, if you already know that the problem must be
in exactly one of three functions, it might not be clear which one to
test first. One useful ordering for equally-general hypotheses is by
the probability of successful identification of a plausible branch.
If you know that functions <tt>f</tt> and <tt>g</tt> are old and have
been subject to a fair amount of testing, but function <tt>h</tt> is
much newer, you might test <tt>h</tt> first. The idea of arranging
multiple levels of a search tree according to that principle reminds
me of <a href="http://en.wikipedia.org/wiki/Huffman_coding">Huffman
coding</a>; you organize your tree so that the most probable
hypotheses can be investigated in the least time.
<p>
Of course, questions aren't all created equal, and there isn't always
an easy experiment to resolve every question. You might find that the
obtainable evidence supports more than one hypothesis at some level in
your tree. This happens all the time in multithreaded programming,
where complex race conditions and deadlocks can be extremely difficult
to reproduce. If you can't devise an experiment to determine which of
two branches is correct, you have to take both branches, and hope one
of them is invalidated further down. You can do this one branch at a
time: if the branch you choose first is invalidated, you backtrack to
focus on the correct branch. This means that in science, having
multiple camps in support of divergent theories is perhaps a good
thing: it's parallel processing! On the other hand, there's something
to be said for the idea that a single scientist should be able to find
more than one incomplete explanation plausible simultaneously.
<p>
Worse than merely taking the wrong branch first is what happens if you
fail to consider all the possible explanations at some level in your
tree. You could wind up following down some initially plausible but
ultimately incorrect path without even being aware of the alternative.
In debugging, when you realize you've done this, it's not too
dramatic. Maybe you slap yourself on the forehead. But in more
complex subjects like physics or chemistry, major backtracks to
previously unimagined possibilities are what Thomas Kuhn called
<a href="http://en.wikipedia.org/wiki/The_Structure_of_Scientific_Revolutions">"paradigm shifts"</a> — these are upsets like special relativity,
and the abandonment of the phlogiston theory of combustion.
<p>
The fact that Kuhn wrote about paradigm shift in the 1960s, hundreds
of years after Francis Bacon wrote about conditional inductive trees,
shows how our understanding of scientific technique has lagged behind
our ability to use it. It might make sense to use debugging as a
model for teaching science. Students could develop methodological
intuition on a simplified model, rather than trying to accumulate it
over years of much slower work with a harder problem. Of course,
fledgling scientists would have to learn some computer programming to
be able to play with debugging tasks. There's a fair argument that
tomorrow's scientists would do well to pick up a little bit of
computer programming anyway. But, simplified computational models of
physics or chemistry could provide just as nice a model as debugging.
In fact, in computational biology, serious scientific experiments are
already being done on computer models of physics, rather than the real
thing! Why not start with something simpler, for educational
purposes?
<p>
<span style="font-size:80%">
<b>Update</b>: I originally asserted that Francis Bacon, "using modern CS terminology three hundred years ahead of time, called [strong inference] a 'conditional inductive tree'". I was confused — those were Platt's words, not Bacon's.</span>

<h4>PHP and Lua Support for codepad.org (2008-04-13)</h4>
I spent last week making some fairly substantial improvements to the <a href="http://codepad.org">codepad.org</a> back-end. The highlight for the moment is that codepad now speaks PHP and Lua.

<h4>Tree Lists as a Default List Data Structure (2008-03-13)</h4>
I'd like to call your attention to <a
href="http://www.spiteful.com/about/">Tom Kleinpeter</a>'s new <a
href="http://www.spiteful.com/">blog</a>, in which he's been posting
lots of great stuff about his experiences building and scaling
Audiogalaxy and FolderShare. I worked with Tom at both of those
companies, helping him build the p2p clients and backend systems. If
you know me, you know that I think most software engineers are pretty
crappy. Statistically speaking, I probably think you're a pretty
crappy engineer. Tom, on the other hand, is a really good engineer,
one of the best I've worked with. His blog is part of what inspired
me to finally finish setting up this one.
<p> Anyway, in an <a
href="http://www.spiteful.com/2008/02/25/5-more-essentials-for-your-programming-toolbox/">earlier
article</a>, he mentioned <a
href="http://en.wikipedia.org/wiki/Skip_list">skip lists</a> and <a
href="http://en.wikipedia.org/wiki/Unrolled_linked_list">unrolled link
lists</a>, which reminded me of my love for this category of list data
structures.
<p> Most languages that have a built-in list type have one of two
alternatives (and often strangely lack any standard implementation of
the other alternative):
<ul>
<li><a href="http://en.wikipedia.org/wiki/Linked_list">Linked
lists</a> are the popular built-in list type in functional programming
languages like Scheme and Haskell. They allow fast O(1) inserts and
deletes at any point in the list. Unfortunately, lookup is O(n),
which makes many common insert and delete operations (those that don't
start from an iterator in the correct position in the list) O(n) as
well. Memory overhead for linked lists involves at least one link
pointer for every element in the list. All in all, they're an
algebraically fundamental but pragmatically somewhat specialized type of
list.<br/><br/>
<li>Array lists (that is, lists implemented as arrays which are
dynamically resized to fit the length of the list, usually by
doubling/halving the array length when necessary) are popular in many
other dynamic languages, such as Python, Ruby, and Perl, and are the
most popular type of list in the collection libraries of many
languages that don't have built-in list support, including C++, Java,
and C#. Since they require shifting large portions of the array around
in memory, inserts, deletes, and prepends are O(n), but appends and
lookups are O(1) (and cache-friendly). Amortized memory overhead
is lower than for linked lists, since link pointers are not required,
and in the worst case the overhead is equal to that of linked lists.
They're practical for a wide range of applications, but fall down when
you want to insert or delete in the middle of the list (see the quick
timing sketch just after this list).
</ul>
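<p>
You can see the array-list trade-off described above for yourself with a quick (and admittedly unscientific) timing check in Python; absolute numbers vary by machine, but appends come out orders of magnitude cheaper than front inserts:
<blockquote><pre>import timeit

setup = "xs = list(range(100000))"
# Appends amortize to O(1)...
print(timeit.timeit("xs.append(0)", setup=setup, number=10000))
# ...but each front insert shifts the whole array: O(n).
print(timeit.timeit("xs.insert(0, 0)", setup=setup, number=10000))</pre></blockquote>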
<p> Tree-based list data structures — let's call them "tree
lists" — are a compromise between those two options. The basic
strategy for tree lists is to split up your list into segments stored
as leaves in a tree (as either linked lists or array lists). Lookups
in a tree list require tree traversal, but not traversal of the entire
list. Inserts and deletes may require shifting some elements around
in memory, but not many of them. Inserts, deletes, prepends, appends,
and lookups all wind up in the neighborhood of O(log n). That's
acceptable performance for a wide variety of operations.
<p>Skip lists and unrolled link lists, though they use unusual
representations of trees, are both examples of this
strategy. (The major trade-off between them is that by using linked
lists for the leaves, skip lists retain the O(1) insert/delete
performance of linked lists, but at the cost of being less
cache-friendly. Unrolled linked lists make the opposite choice.) Files in many filesystems are also implemented as tree lists, because of another advantage of these structures over array lists: segmenting the list allows for non-contiguous storage, which can help avoid expensive move operations in the heap as well as on disk.
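<p>
To make the segmenting strategy concrete, here's a toy Python sketch of my own (not how BList or any real library is built): a flat sequence of fixed-size chunks, so an insert shifts at most one small chunk. A real tree list arranges the chunks as leaves of a tree, replacing this toy's linear chunk walk with an O(log n) descent.
<blockquote><pre>class ChunkedList:
    # Toy "tree list" with one level of segmenting: an insert shifts
    # at most one small chunk, never the whole list.
    CHUNK = 64

    def __init__(self):
        self.chunks = [[]]

    def _locate(self, i, for_insert=False):
        for ci, chunk in enumerate(self.chunks):
            if i < len(chunk) + (1 if for_insert else 0):
                return ci, i
            i -= len(chunk)
        raise IndexError(i)

    def __getitem__(self, i):
        ci, off = self._locate(i)
        return self.chunks[ci][off]

    def insert(self, i, x):
        ci, off = self._locate(i, for_insert=True)
        chunk = self.chunks[ci]
        chunk.insert(off, x)
        if len(chunk) > self.CHUNK:
            half = len(chunk) // 2  # split a full chunk in two
            self.chunks[ci:ci + 1] = [chunk[:half], chunk[half:]]</pre></blockquote>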
<p><a
href="http://pypi.python.org/pypi/blist/0.9.4">BList</a> is an
implementation of a type of tree list in Python using B+Trees, which have the
advantage that for small lists, they retain the excellent performance
characteristics of small array lists. Python has <a
href="http://www.python.org/dev/peps/pep-3128/">rejected the
proposal</a> that standard Python lists be replaced with BList, and I
can see why that makes sense for Python; drastically changing the
performance characteristics of a basic type used in most existing code
seems like a bad idea. But I think it would be interesting for a new
language to adopt something like BList for its built-in list type.
Tree lists in general seem like a good generic choice for lists, since
from a performance perspective they're less specialized than linked
lists and array lists.

<h4>Scaling at 2am with EC2 (2008-03-07)</h4>
I really like Amazon EC2. Hosting companies everywhere should be learning about virtualization and web services, because this kind of thing is clearly the future of managed hosting.
<p>
This past Tuesday, I posted about my <a href="http://codepad.org/">codepad.org</a> project on <a href="http://news.ycombinator.com/">news.YC</a>. To my surprise, it got a lot of attention. Traffic increased by 1000x. That still amounts to only a modest amount of traffic, but the increase was very sudden, and I'm the only one working on the site. And as it turned out, I had a serious bug in my backend execution servers. I managed to keep the site working most of the time anyway, though, because EC2 saved my bacon.
<p>
First, a couple details about the architecture: codepad.org's backend untrusted-code-execution servers run in heavily firewalled virtual machines on Amazon EC2 instances (which are themselves firewalled and virtual -- it's like turtles all the way down or something.) The web frontend, running on a traditional colocated server, transmits code to the backend and gets output in response. The backend servers don't store any state between requests, so they're easy to replace, and easy to scale horizontally.
<p>
At the beginning of the influx of traffic on Tuesday, I had only one execution server instance running. I quickly discovered that it couldn't keep up with the load, so I launched a couple more instances, and had my web frontend load-balance between them. This is the benefit I expected from EC2: requisitioning new machines takes seconds rather than days, and cloning them from existing servers is effortless.
<p>
Then I discovered that I had a bug. My execution servers would occasionally stop servicing requests for one language or another. At first, when this happened, I would just replace the machine in a hurry, and then continue to hunt for the bug. Because of the firewalling and virtualization involved, just replacing the whole machine was a lot more convenient than trying to swap in a new server process would have been.
<p>
As traffic increased, the lock-ups started happening too frequently for me to replace machines by hand. The first thing I did was to spin up 10 machines, so that load on each machine would be lower, making failures both less frequent and less problematic (since a single-machine failure would only affect 10% of users.) Not long after that, I got a script running that would switch out all 10 machines for new ones every hour. This actually got the site working flawlessly, before I'd even found the bug! It allowed me to get some sleep on Tuesday night.
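<p>
In outline, the swapping script did something like this (a reconstruction for illustration; <tt>launch_instance</tt>, <tt>terminate_instance</tt>, and <tt>update_frontend</tt> are hypothetical stand-ins for the EC2 API calls and the frontend config push, not real library functions):
<blockquote><pre>import time

# launch_instance / terminate_instance / update_frontend are
# hypothetical stand-ins for the EC2 API and the config push.
POOL_SIZE = 10

pool = [launch_instance() for _ in range(POOL_SIZE)]
update_frontend([m.address for m in pool])

while True:
    time.sleep(3600)  # swap out the whole pool every hour
    fresh = [launch_instance() for _ in range(POOL_SIZE)]
    update_frontend([m.address for m in fresh])  # route traffic to the new pool
    for old in pool:
        terminate_instance(old)  # backends are stateless, so this is safe
    pool = fresh</pre></blockquote>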
<p>
Of course, I could have done the same thing at the process level, rather than the machine level. But I've done things like that in the past, and I was surprised by how much easier this was. If you don't write your servers right, switching them out when they fall over becomes a mess of sorting out config files and ports. If they use a lot of memory, you can get seriously stuck, because running them in parallel can cause them to start swapping when you need it least. With machines, there's no chance that they'll step on one another's toes like that. You just start up a new one, change the address in the frontend config, and kill the old one.
<p>
Another benefit of this approach is that the routines I built up in my machine-swapping script are of general utility. Parts of that script are going to be useful in building a system to launch and kill machines automatically in response to fluctuations in load. I'm really curious to find out whether, by using only the number of machines I need at any particular time, I'll be able to make EC2 cheaper than traditional managed hosting, despite slightly higher costs per machine-hour. Is there enough variance (and slow enough variations) in typical web site traffic to do that?

<h4>Murmur and Hash Tables for Scalability (2008-03-03)</h4>
<a href="http://tanjent.livejournal.com/755617.html">Murmur</a> is a new (non-cryptographic) hash function. Tanjent says: "Way faster than <a href="http://www.azillionmonkeys.com/qed/hash.html">Hsieh</a>, better mixing than <a href="http://burtleburtle.net/bob/hash/doobs.html">Jenkins</a>, as simple as <a href="http://www.isthe.com/chongo/tech/comp/fnv/">FNV</a>." He's also made some <a href="http://tanjent.livejournal.com/756375.html">speed and mixing improvements</a> since its initial release.
<p>
This got me thinking about hash-based load distribution, in which the modulus of a hashed identifier is used to route a request to the machine that will be responsible for handling it, so that load is distributed in a way that allows server machines to keep local state across a series of requests (many load balancers call this "affinity"). Under heavy load, fast hash functions with poor mixing can route too many requests to one machine, overwhelming it while the others have plenty of unused capacity. Cryptographic hash functions will eliminate that problem by distributing requests more evenly, but are sometimes too CPU-intensive on the router. Functions like Hsieh and Jenkins are useful in cases like that, and Murmur might be even better.
<p>
I was talking about this with my brother, and he came up with a way to trade away memory for CPU in applications where the number of possible identifiers is small enough: use a very fast hash function as the basis of a large hash table on the router, mapping identifiers to slow-computed hashes which can in turn be used to select servers. The first time you see an identifier, you perform a slower hash, store it in your local hash table, and use the slow-computed hash value to pick a server. Next time you see the same identifier, you'll find the slow-computed hash in your table at the reduced CPU cost of a much quicker hash. If your identifiers are unlimited, though, your hash table will just keep growing, unless you turn it into an LRU cache by culling old entries.
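<p>
A sketch of that idea in Python, with <tt>hashlib</tt>'s SHA-256 standing in for the slow, well-mixed hash and the dict's own fast hashing of the key serving as the quick one:
<blockquote><pre>import hashlib

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
slow_hashes = {}  # identifier -> well-mixed hash, computed once

def route(identifier):
    # The dict lookup costs one fast hash of the identifier; the
    # expensive, well-mixed hash is only computed on a miss.
    h = slow_hashes.get(identifier)
    if h is None:
        digest = hashlib.sha256(identifier.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        slow_hashes[identifier] = h
    return SERVERS[h % len(SERVERS)]</pre></blockquote>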
<p>
Another approach is to generate a random number instead of using a slow-computed hash. The hash function and random number generation are fast, and a pseudo-random number generator should get a very even distribution. An identifier will only map to the same machine for as long as your router stays up, though, so this approach is only appropriate if the cost of having requests sent to the wrong machines is low.
<p>
<b>Update:</b> <a href="http://murmurhash.googlepages.com/">Murmur now has a web site</a>.

<h4>Code Blocks and C-like Lambda Expressions (2006-10-07)</h4>
<p>
A while back, I started using Ruby for some personal software projects. I was immediately impressed with its "code block" syntax. It's a fairly natural Algol-like syntax for lambda expressions, and is used to great effect in the language. I'd like to advocate wider adoption of this kind of syntax, and suggest some improvements.
<p>
Briefly, Ruby functions can take a special "code block" parameter, which appears in their scope as a function called "yield":
<blockquote><pre>def twice()
yield()
yield()
end</pre></blockquote>
<p>
The magic <tt>yield</tt> function is then specified using the code block syntax, when <tt>twice</tt> is called:
<blockquote><pre>twice() do
print "hello"
end</pre></blockquote>
<p>
Output:
<blockquote><pre>hello
hello</pre></blockquote>
<p>
(Ruby also allows the use of the more C-like <tt>{}</tt> delimiters -- which I generally prefer -- in place of <tt>do</tt>/<tt>end</tt>, but <tt>do</tt>/<tt>end</tt> matches the Algol-like syntax of Ruby's built-in constructs, so I prefer it in that context.)
<p>
Like regular function declarations, code blocks can take parameters:
<blockquote><pre>def with(a, b)
yield(a, b)
end
with(7, 3) do |x, y|
print x+y
end</pre></blockquote>
<p>
Output:
<blockquote><pre>10</pre></blockquote>
<p>
Ruby uses this for all kinds of familiar things, both from procedural and functional languages. Here are some examples:
<blockquote><pre><i># a foreach loop over a list</i>
[1, 2, 3].each do |x|
print x
end
<i># mapping</i>
squares = [1, 2, 3].map do |x|
x*x <i># note the handy auto-return here</i>
end
<i># like what some languages call "using" or "with"</i>
File::open('foo.txt') do |fd|
<i># the file will automatically be closed when we exit this scope</i>
print fd.read()
end</pre></blockquote>
<p>
First of all, I think this is a great syntax for lambda expressions in Algol- and C-like languages. I prefer the C-like variant in general, so I'll use that from now on. Also, in imitation of C/C++/Java/etc function declaration, I would pull the <tt>|x, y|</tt> signature definitions out in front of the <tt>{}</tt>s. (I think I like the vertical bars better than overloading <tt>(x, y)</tt> to handle both invocation and declaration the way C does. Recruiting <tt>||</tt> for delimiters strikes me as an okay idea, but there are some good reasons to prefer distinct symbols for open and close -- you could also overload <tt>()</tt>, <tt>[]</tt>, or even <tt><></tt>.) So I think the standard syntax for <tt>(lambda (x y) code)</tt> in C-like languages should be something like <tt>|x, y| { code }</tt>. This has the fun consequence that function declaration could be done like this:
<blockquote><pre>foo = |x, y| {
do_stuff(x*y)
}</pre></blockquote>
<p>
Or, for a statically-typed language, maybe something like:
<blockquote><pre>int func foo = |int x, float y| {
do_stuff(x*y)
}</pre></blockquote>
<p>
Some of the things Ruby does with its code blocks are built-in constructs (like <tt>foreach</tt> and <tt>using</tt>) in other Algol-like languages. This got me thinking: it would be nice to have a language in which all the control structures were expressible as functions that take code blocks. In Ruby (using the <tt>{}</tt> syntax) you could write an "if" function and call it like this:
<blockquote><pre>if (a > b) {
do_stuff()
}</pre></blockquote>
<p>
But not like this:
<blockquote><pre>if (a > b) {
do_stuff()
} else {
other_stuff()
}</pre></blockquote>
Ruby's code block syntax is based on a similar idea in Smalltalk, which is powerful enough to express <tt>if</tt>/<tt>else</tt> (although not in C-like syntax):
<blockquote><pre>(a > b)
ifTrue: [ doStuff ]
ifFalse: [ otherStuff ].</pre></blockquote>
<p>
The key concept employed in the Smalltalk code above is just <i>named parameters</i>. There's no reason named parameters couldn't work with the C-like code block syntax -- instead of setting the magic "yield" variable the way Ruby does, just allow function-type parameters to be declared, say, at the end of the parameter list, and allow parameters to be specified by name in much the same way Ruby and Python already do. It would be cute to write a language in which "if" could be implemented like this:
<blockquote><pre>if = |cond, block, else=nil| {
<i># using logic operators for control flow is ugly</i>
(cond and (block() or true)) or (else and (else() or false))
}</pre></blockquote>
<p>
and called like this:
<blockquote><pre>if (cond) {
do_stuff()
} else {
other_stuff()
}</pre></blockquote>
<p>
One interesting thing about a syntax like that is that it won't allow you to implement a standard <tt>while (cond) { code }</tt> construct, because the condition doesn't work like a parameter to a function: it can't be evaluated at call time, it needs to be re-evaluated each time the loop restarts. Instead, you'd get something like <tt>while {cond} { code }</tt>, which I find I prefer -- <tt>()</tt> and <tt>{}</tt> have clear and separate meanings, so we'd be clarifying a situation that sometimes surprises beginners by using the correct brackets for the semantics!

<h4>Close-block Delimiter Symbols Considered Helpful (2006-09-10)</h4>
<p>
One of the first things one notices about Python is that it has "significant whitespace"; in particular, its syntax relies on indentation to delimit blocks of code. Most languages make indentation optional, but a few, like Python and Haskell, require consistent indentation. Some people immediately hate this feature, and some people immediately love it. My own initial reaction was that this was not a good idea, but I didn't feel strongly about it, and I gave it a chance.
<p>
Now I <em>do</em> feel strongly about it.
<p>
Over time I have come to feel that this is the worst single mistake in the design of the Python language. What's particularly bad about this mistake is that it might be propagated into future languages -- it could conceivably irritate programmers like me for decades to come. So this is my attempt to explain why Python should not be imitated in this respect.
<p>
First, let me separate out two aspects of this feature. The first is <em>required indentation</em>. Required indentation is pretty much okay with me. I don't think it's particularly wise to make your language interact badly with the <a href="http://www.jwz.org/doc/tabs-vs-spaces.html">tabs-vs-spaces quagmire</a>, but who knows? Maybe a language feature like this, in a popular language, could finally end that holy war. Anyway, that's not what I'm talking about here. The second aspect, the one I'm interested in, is <em>doing without a close-block delimiter symbol</em>. (An example of block delimiter symbols, if that phrase isn't self-explanitory: Java's block delimiter symbols are "{" and "}".) Note that this is entirely unrelated to the first aspect -- a language could require consistent indentation just like Python does, and also have an explicit close-block delimiter.
<p>
Oh, you say, but that would be redundant. How horrible and ugly! Well, contrary to popular opinion, Python is not entirely without block delimiter symbols. In fact, every Python block must begin with the open-block delimiter, ":", like this:
<blockquote><pre>if a == b<b>:</b>
do_stuff()</pre></blockquote>
<p>
Compare this to the equivalent Ruby code, which has a close-block delimiter, "end", but no explicit open-block delimiter:
<blockquote><pre>if a == b
do_stuff()
<b>end</b></pre></blockquote>
<p>
So there is no ideological purity involved in Python's abandonment of the close-block delimiter; it already has a redundant block delimiter, it just has the wrong one! Why the "wrong" one? Because, perhaps unlike open-block delimiters, close-block delimiter symbols are <em>useful</em>. They're so useful that even Python can't get along without one: it has a keyword that it uses a lot like a restricted close-block delimiter symbol for a few special cases where it just can't get around the need for such a thing! It's called "pass", and it's used in these cases:
<blockquote><pre>
# I don't care whether this works or not
try:
this_might_fail()
except:
<b>pass</b></pre>
<pre># make this method do nothing in this subclass
def override_method():
<b>pass</b></pre></blockquote>
<p>
Again, some Ruby for comparison:
<blockquote><pre># make this method do nothing in this subclass
def override_method()
<b>end</b></pre></blockquote>
<p>
Okay, so technically, pass isn't a close-block delimiter, it's a no-op. Other lines can follow it at the same indentation level. But it's required for otherwise-empty blocks, and there's no reason for it to exist other than to make clear the existence of such a block, and in practice it is used very much like a close-block delimiter. The reasoning here is interesting. When a block contains lines of code, Python's designers presumably feel that the meaning of the code is clear without a close-block delimiter. But when a block contains no code, it's somehow too weird; one "def" statement directly before another, or an "except" with unindented lines after it would be too confusing. So an indication that the block is empty and over is desirable, even though it's not strictly necessary in order for the program to be parsable.
<p>
I also find it interesting that the case with open-block delimiters is not so weird -- in languages which lack them, as Ruby does, the only consequence is that while some blocks start with something like "if a == b", others start with a more generic symbol, like "begin". Many languages do that anyway, even when they have an open-block delimiter, as with the C/C++/Java "do {} while();".
<p>
Speaking of which, you may have noticed that Python lacks a do-while loop. Why? I think it's <a href="http://mail.python.org/pipermail/python-dev/2006-February/060718.html">because it would be awkward</a>:
<blockquote><pre>do:
some_stuff()
while a > b
foo()</pre></blockquote>
<p>
Because of the lack of a close-block delimiter symbol, "while a > b" above looks like an ordinary Python statement, or the start of a new while loop, after the end of the block. Tellingly, <a href="http://www.python.org/dev/peps/pep-0315/">the proposal for adding a do-while loop to Python</a> suggests that it be integrated with the existing while loop, so that the "while" clause could open a new block, sort of like the way if/else and try/except work. The proposed syntax for the example I gave above would employ "pass" to visually close the block, like this:
<blockquote><pre>do:
some_stuff()
while a > b:
<b>pass</b>
foo()</pre></blockquote>
<p>(This combination of "do-while" and "while" into a single looping construct is actually very appealing for other reasons -- I encourage you to read the reasoning in <a href="http://www.python.org/dev/peps/pep-0315/">the PEP</a> to see why more languages should have a construct like Python's proposed do-while-else.)
<p>
Aside from simple readability, there are some important reasons why close-block delimiter symbols are useful. Let me point out what a big deal that is. The whole question about how to do block-delimiting is usually about readability. The questions are things like "does {} look better than begin/end?", and "is (foo bar) clearer than <foo>bar</foo>?", and "which lines should the delimiters go on?" -- questions about what the code looks like. When you use indentation as your only close-block delimiter, though, there are some problems that go <em>beyond</em> readability. Here are the ones I'm aware of:
<ul>
<li> Many technologies are bad at preserving indentation in text. Pasting Java or Ruby code into a chat message or web page (without &lt;pre&gt;) might make it look uglier than you'd like, but pasting Python code into the same places is likely to <em>destroy information</em>, rendering the code uninterpretable or just subtly incorrect. I see this happen pretty frequently.
<p>
<li> When moving a section of code from one nesting level to another, or changing the indentation scheme of a block of code to match the local convention, close-block delimiter symbols preserve the information about where blocks exist in the moving code section. Indentation is very bad at this, because its addition or removal can be done in a non-atomic way, affecting some lines before others. If you indent by hand, it's easy to lose track, as you move down through the pasted section, of the intended containing block for the next line of code. If you get this wrong in a language like Java, your code may look confusing, but it will behave correctly. In a language like Python, your code can quietly do the wrong thing! (This has actually happened to me.) Getting this right without undue mental effort requires special editor features, but even those can't always save you. My editor happens to have those features, and I still occasionally get bitten by this when someone else's editor interprets tab characters differently from mine, or when a nesting-level change is not a simple copy-and-paste operation (for example, when an enclosing conditional block is being removed, or when two-space indentation is being converted to four-space indentation.)
</ul>
Sure, these are not such horrible problems. They don't irritate me most days. But they do irritate me <em>sometimes</em>, and there's no reason why I should ever have to tolerate them! I get nothing in return. Redundant close-block delimiter symbols have no associated cost, and some tangible benefits. Please, when designing a new language -- even a language that requires consistent indentation -- include a close-block delimiter symbol!