hackerdashery: programming languages

Showing posts with label programming languages. Show all posts

Thursday, May 8, 2008

The Terrible Legacy of Legacy

Out of curiosity, I recently started reading the haskell-prime mailing list. Haskell' is the interim name for the language standard that will supersede Haskell98, like C++0x is to C++03. There's been some discussion on the list about where to break backward compatibility with Haskell98, and which features are worth the trouble. It's interesting to see a language just starting to have to deal with these issues. On one side, you've got researchers and language designers angling to improve the language as much as possible before a new standard is nailed down. On the other are mainly industry users, pushing for less change and more backward compatibility, so they don't have to spend time and money upgrading their codebase or working with old and unsupported tools.

The influence of backward compatibility on software is hard to overestimate. Windows Vista is still binary-compatible with many MS-DOS programs dating back to 1981, and the DOS API was in turn meant to make it easy to port CP/M programs from the late ’70s. Meanwhile, Mac OS X and Linux are both Unix variants, with some core API features dating back to the early ’70s.

The situation with processor instruction sets is similar: the processors in today's PCs (and Macs!) are backward-compatible with Intel's original 8086, which itself was designed so that assembly for the 8008 — the first 8-bit microprocessor, released in 1972 — could be converted automatically.

This means that there are 30-year-old programs which would require very little modification to run on today's operating systems. And there's no reason to expect that in another 30 years we won't still be using systems with an unbroken 60-year-long chain of backward compatibility.

It's not that these technologies were so ingenious that we haven't managed to think of anything better in the intervening decades. Rather, when the quality of a software interface gets good enough to become a widespread standard, the value of any given improvement on that interface is dwarfed by the value of compatibility. Progress in any direction that would require a break in compatibility slows dramatically. The bar for de-facto standards isn't "great", but merely "good enough".

What this means is that an increasing number of design features in the software systems we use every day are attributable to historical reasons. That's the terrible legacy of legacy code. The crushing gravity of installed bases eventually pulls even the best-designed systems down into a mire of hard-to-learn, hard-to-use arcana.

A lot of programming languages from the ’90s are feeling that pressure today, and the result is a number of planned backward-incompatible major revisions, including Python 3000, Ruby 2, and Perl 6. I'm going to go out on a limb and claim that those languages have more users and larger existing codebases than Haskell does. If they can make backward-incompatible changes just to clean house, surely a primarily research-oriented language like Haskell can.

Don't get me wrong, there are good reasons to maintain backward compatibility in a wide variety of situations. But if you don't have those reasons, why in the world would you subject your programming language to the mangling, bloating influence of backward compatibility? While backward compatibility is a great default position when you don't have any improvements to make, giving up too much to maintain compatibility is bad for everyone.

The question is, how does the cost to existing users compare to the value to all future users? If your language is still growing in popularity, the number of future users can easily exceed the number of current users by orders of magnitude. If you don't break compatibility to fix a problem, you're hurting all the users who will have to live with the problem for who knows how long, in order to avoid hurting the few who will have to upgrade now. And if you don't fix it, someone else will, in a new language, so your users get stuck in COBOL-support hell eventually anyway. That's if we're lucky. If we're not lucky, your language will become so popular that the value of compatibility will outweigh the value of better languages, and your language will be a drag on progress. It's practically a moral imperative to fix it while you still can.

C++ is an excellent example of a language that has valued backward compatibility over simplicity and consistency. It's a popular, practical tool, but few people consider it beautiful, easy to learn, or easy to work with. As Bjarne Stroustrup put it:

I consider it reasonably obvious that in the absence of compatibility requirements, a new language on this model can be much smaller, simpler, more expressive, and more amenable to tool use than C++, without loss of performance or restriction of application domains. How much smaller? Say 10% of the size of C++ in definition and similar in front-end compiler size. In the "Retrospective" chapter of D&E, I expressed that idea as "Inside C++, there is a much smaller and cleaner language struggling to get out". Most of the simplification would come from generalization — from eliminating the mess of special cases that makes C++ so hard handle to rather than restriction or moving work from compile time to run time.

What Stroustrup originally said was that "Within C++, there is a much smaller and cleaner language struggling to get out," which "would [...] have been an unimportant cult language." I'm not sure I agree with that last part. Java, for example, is a syntactically similar language whose designers did decide to give up compatibility with C in order to achieve greater simplicity and consistency. Even though Java is virtual-machine-based and unsuitable for a wide variety of systems programming tasks, its popularity has by some metrics exceeded that of C++.

What has compatibility bought C++? Java shows that it wasn't a requirement for popularity. And despite being largely backward-compatible with C, C++ has had difficulty supplanting it for many tasks. Indeed, there's more than one place where a break with backward compatibility might have simplified and improved C++ at little cost. The modern C++ style that has become prominent in recent years bears little relation to the way C++ was written in the ’90s. How much value is there, really, in being able to compile both dialects with the same compilers?

So that's my position: backward compatibility at the expense of simplification is only appropriate when you can't gain acceptance any other way. If a given backward-incompatible improvement isn't going to cause a fork in the language, its value probably outweighs the value of compatibility.

Thursday, March 13, 2008

Tree Lists as a Default List Data Structure

I'd like to call your attention to Tom Kleinpeter's new blog, in which he's been posting lots of great stuff about his experiences building and scaling Audiogalaxy and FolderShare. I worked with Tom at both of those companies, helping him build the p2p clients and backend systems. If you know me, you know that I think most software engineers are pretty crappy. Statistically speaking, I probably think you're a pretty crappy engineer. Tom, on the other hand, is a really good engineer, one of the best I've worked with. His blog is part of what inspired me to finally finish setting up this one.

Anyway, in an earlier article, he mentioned skip lists and unrolled link lists, which reminded me of my love for this category of list data structures.

Most languages that have a built-in list type have one of two alternatives (and often strangely lack any standard implementation of the other alternative):

Linked lists are the popular built-in list type in functional programming languages like Scheme and Haskell. They allow fast O(1) inserts and deletes at any point in the list. Unfortunately, lookup is O(n), which makes many common insert and delete operations (those that don't start from an iterator in the correct position in the list) O(n) as well. Memory overhead for linked lists involves at least one link pointer for every element in the list. All in all, they're an algebraically fundamental but pragmatically somewhat specialized type of list.
Array lists (that is, lists implemented as arrays which are dynamically resized to fit the length of the list, usually by doubling/halving the array length when necessary) are popular in many other dynamic languages, such as Python, Ruby, and Perl, and are the most popular type of list in the collection libraries of many langauges that don't have built-in list support, including C++, Java, and C#. Since they require shifting large portions of the array around in memory, inserts, deletes, and prepends are O(n), but appends and lookups are O(1) (and cache-friendly). Amortized memory overhead is lower than for linked lists, since link pointers are not required, and in the worst case the overhead is equal to that of linked lists. They're practical for a wide range of applications, but fall down when you want to insert or delete in the middle of the list.

Tree-based list data structures — let's call them "tree lists" — are a compromise between those two options. The basic strategy for tree lists is to split up your list into segments stored as leaves in a tree (as either linked lists or array lists). Lookups in a tree list require tree traversal, but not traversal of the entire list. Inserts and deletes may require shifting some elements around in memory, but not many of them. Inserts, deletes, prepends, appends, and lookups all wind up in the neighborhood of O(log n). That's acceptable performance for a wide variety of operations.

Skip lists and unrolled link lists, though they use unusual representations of trees, are both examples of this strategy. (The major trade-off between them is that by using linked lists for the leaves, skip lists retain the O(1) insert/delete performance of linked lists, but at the cost of being less cache-friendly. Unrolled linked lists make the opposite choice.) Files in many filesystems are also implemented as tree lists, because of another advantage of these structures over array lists: segmenting the list allows for non-contiguous storage, which can help avoid expensive move operations in the heap as well as on disk.

BList is an implementation of a type of tree list in Python using B+Trees, which have the advantage that for small lists, they retain the excellent performance characteristics of small array lists. Python has rejected the proposal that standard Python lists be replaced with BList, and I can see why that makes sense for Python; drastically changing the performance characteristics of a basic type used in most existing code seems like a bad idea. But I think it would be interesting for a new language to adopt something like BList for its built-in list type. Tree lists in general seem like a good generic choice for lists, since from a performance perspective they're less specialized than linked lists and array lists.

Saturday, October 7, 2006

Code Blocks and C-like Lambda Expressions

A while back, I started using Ruby for some personal software projects. I was immediately impressed with its "code block" syntax. It's a fairly natural Algol-like syntax for lambda expressions, and is used to great effect in the language. I'd like to advocate wider adoption of this kind of syntax, and suggest some improvements.

Briefly, Ruby functions can take a special "code block" parameter, which appears in their scope as a function called "yield":

def twice()
    yield()
    yield()
end

The magic yield function is then specified using the code block syntax, when twice is called:

twice() do
    print "hello"
end

Output:

hello
hello

(Ruby also allows the use of the more C-like {} delimiters -- which I generally prefer -- in place of do/end, but do/end matches the Algol-like syntax of Ruby's built-in constructs, so I prefer it in that context.)

Like regular function declarations, code blocks can take parameters:

def with(a, b)
    yield(a, b)
end

with(7, 3) do |x, y|
    print x+y
end

Output:

Ruby uses this for all kinds of familiar things, both from procedural and functional languages. Here are some examples:

# a foreach loop over a list
[1, 2, 3].each do |x|
    print x
end

# mapping
squares = [1, 2, 3].map do |x|
    x*x  # note the handy auto-return here
end

# like what some languages call "using" or "with"
File::open('foo.txt') do |fd|
    # the file will automatically be closed when we exit this scope
    print fd.read()
end

First of all, I think this is a great syntax for lambda expressions in Algol- and C-like languages. I prefer the C-like variant in general, so I'll use that from now on. Also, in imitation of C/C++/Java/etc function declaration, I would pull the |x, y| signature definitions out in front of the {}s. (I think I like the vertical bars better than overloading (x, y) to handle both invocation and declaration the way C does. Recruiting || for delimiters strikes me as an okay idea, but there are some good reasons to prefer distinct symbols for open and close -- you could also overload (), [], or even <>.) So I think the standard syntax for (lambda (x y) code) in C-like languages should be something like |x, y| { code }. This has the fun consequence that function declaration could be done like this:

foo = |x, y| {
    do_stuff(x*y)
}

Or, for a statically-typed language, maybe something like:

int func foo = |int x, float y| {
    do_stuff(x*y)
}

Some of the things Ruby does with its code blocks are built-in constructs (like foreach and using) in other Algol-like languages. This got me thinking: it would be nice to have a language in which all the control structures were expressible as functions that take code blocks. In Ruby (using the {} syntax) you could write an "if" function and call it like this:

if (a > b) {
    do_stuff()
}

But not like this:

if (a > b) {
    do_stuff()
} else {
    other_stuff()
}

Ruby's code block syntax is based on a similar idea in Smalltalk, which is powerful enough to express if/else (although not in C-like syntax):

(a > b)
ifTrue: [ doStuff ]
ifFalse: [ otherStuff ].

The key concept employed in the Smalltalk code above is just named parameters. There's no reason named parameters couldn't work with the C-like code block syntax -- instead of setting the magic "yield" variable the way Ruby does, just allow function-type parameters to be declared, say, at the end of the parameter list, and allow parameters to be specified by name in much the same way Ruby and Python already do. It would be cute to write a language in which "if" could be implemented like this:

if = |cond, block, else=nil| {
    # using logic operators for control flow is ugly
    (cond and (block() or true)) or (else and (else() or false))
}

and called like this:

if (cond) {
    do_stuff()
} else {
    other_stuff()
}

One interesting thing about a syntax like that is that it won't allow you to implement a standard while (cond) { code } construct, because the condition doesn't work like a parameter to a function: it can't be evaluated at call time, it needs to be re-evaluated each time the loop restarts. Instead, you'd get something like while {cond} { code }, which I find I prefer -- () and {} have clear and separate meanings, so we'd be clarifying a situation that sometimes surprises beginners by using the correct brackets for the semantics!

Sunday, September 10, 2006

Close-block Delimiter Symbols Considered Helpful

One of the first things one notices about Python is that it has "significant whitespace"; in particular, its syntax relies on indentation to delimit blocks of code. Most languages make indentation optional, but a few, like Python and Haskell, require consistent indentation. Some people immediately hate this feature, and some people immediately love it. My own initial reaction was that this was not a good idea, but I didn't feel strongly about it, and I gave it a chance.

Now I do feel strongly about it.

Over time I have come to feel that this is the worst single mistake in the design of the Python language. What's particularly bad about this mistake is that it might be propagated into future languages -- it could conceivably irritate programmers like me for decades to come. So this is my attempt to explain why Python should not be imitated in this respect.

First, let me separate out two aspects of this feature. The first is required indentation. Required indentation is pretty much okay with me. I don't think it's particularly wise to make your language interact badly with the tabs-vs-spaces quagmire, but who knows? Maybe a language feature like this, in a popular language, could finally end that holy war. Anyway, that's not what I'm talking about here. The second aspect, the one I'm interested in, is doing without a close-block delimiter symbol. (An example of block delimiter symbols, if that phrase isn't self-explanitory: Java's block delimiter symbols are "{" and "}".) Note that this is entirely unrelated to the first aspect -- a language could require consistent indentation just like Python does, and also have an explicit close-block delimiter.

Oh, you say, but that would be redundant. How horrible and ugly! Well, contrary to popular opinion, Python is not entirely without block delimiter symbols. In fact, every Python block must begin with the open-block delimiter, ":", like this:

if a == b:
    do_stuff()

Compare this to the equivalent Ruby code, which has a close-block delimiter, "end", but no explicit open-block delimiter:

if a == b
    do_stuff()
end

So there is no ideological purity involved in Python's abandonment of the close-block delimiter; it already has a redundant block delimiter, it just has the wrong one! Why the "wrong" one? Because, perhaps unlike open-block delimiters, close-block delimiter symbols are useful. They're so useful that even Python can't get along without one: it has a keyword that it uses a lot like a restricted close-block delimiter symbol for a few special cases where it just can't get around the need for such a thing! It's called "pass", and it's used in these cases:

# I don't care whether this works or not
try:
    this_might_fail()
except:
    pass

# make this method do nothing in this subclass
def override_method():
    pass

Again, some Ruby for comparison:

# make this method do nothing in this subclass
def override_method()
end

Okay, so technically, pass isn't a close-block delimiter, it's a no-op. Other lines can follow it at the same indentation level. But it's required for otherwise-empty blocks, and there's no reason for it to exist other than to make clear the existence of such a block, and in practice it is used very much like a close-block delimiter. The reasoning here is interesting. When a block contains lines of code, Python's designers presumably feel that the meaning of the code is clear without a close-block delimiter. But when a block contains no code, it's somehow too weird; one "def" statement directly before another, or an "except" with unindented lines after it would be too confusing. So an indication that the block is empty and over is desirable, even though it's not strictly necessary in order for the program to be parsable.

I also find it interesting that the case with open-block delimiters is not so weird -- in languages which lack them, as Ruby does, the only consequence is that while some blocks start with something like "if a == b", others start with a more generic symbol, like "begin". Many languages do that anyway, even when they have an open-block delimiter, as with the C/C++/Java "do {} while();".

Speaking of which, you may have noticed that Python lacks a do-while loop. Why? I think it's because it would be awkward:

do:
    some_stuff()
while a > b
foo()

Because of the lack of a close-block delimiter symbol, "while a > b" above looks like an ordinary Python statement, or the start of a new while loop, after the end of the block. Tellingly, the proposal for adding a do-while loop to Python suggests that it be integrated with the existing while loop, so that the "while" clause could open a new block, sort of like the way if/else and try/except work. The proposed syntax for the example I gave above would employ "pass" to visually close the block, like this:

do:
    some_stuff()
while a > b:
    pass
foo()

(This combination of "do-while" and "while" into a single looping construct is actually very appealing for other reasons -- I encourage you to read the reasoning in the PEP to see why more languages should have a construct like Python's proposed do-while-else.)

Aside from simple readability, there are some important reasons why close-block delimiter symbols are useful. Let me point out what a big deal that is. The whole question about how to do block-delimiting is usually about readability. The questions are things like "does {} look better than begin/end?", and "is (foo bar) clearer than <foo>bar</foo>?", and "which lines should the delimiters go on?" -- questions about what the code looks like. When you use indentation as your only close-block delimiter, though, there are some problems that go beyond readability. Here are the ones I'm aware of:

Many technologies are bad at preserving indentation in text. Pasting Java or Ruby code into a chat message or web page (without <pre>) might make it look uglier than you'd like, but pasting Python code into the same places is likely to destroy information, rendering the code uninterpretable or just subtly incorrect. I see this happen pretty frequently.
When moving a section of code from one nesting level to another, or changing the indentation scheme of a block of code to match the local convention, close-block delimiter symbols preserve the information about where blocks exist in the moving code section. Indentation is very bad at this, because its addition or removal can be done in a non-atomic way, affecting some lines before others. If you indent by hand, it's easy to lose track, as you move down through the pasted section, of the intended containing block for the next line of code. If you get this wrong in a language like Java, your code may look confusing, but it will behave correctly. In a language like Python, your code can quietly do the wrong thing! (This has actually happened to me.) Getting this right without undue mental effort requires special editor features, but even those can't always save you. My editor happens to have those features, and I still occasionally get bitten by this when someone else's editor interprets tab characters differently from mine, or when a nesting-level change is not a simple copy-and-paste operation (for example, when an enclosing conditional block is being removed, or when two-space indentation is being converted to four-space indentation.)

Sure, these are not such horrible problems. They don't irritate me most days. But they do irritate me sometimes, and there's no reason why I should ever have to tolerate them! I get nothing in return. Redundant close-block delimiter symbols have no associated cost, and some tangible benefits. Please, when designing new a language -- even a language that requires consistent indentation -- include a close-block delimiter symbol!

hackerdashery