Saturday, October 7, 2006

Code Blocks and C-like Lambda Expressions

A while back, I started using Ruby for some personal software projects. I was immediately impressed with its "code block" syntax. It's a fairly natural Algol-like syntax for lambda expressions, and is used to great effect in the language. I'd like to advocate wider adoption of this kind of syntax, and suggest some improvements.

Briefly, Ruby functions can take a special "code block" parameter, which appears in their scope as a function called "yield":

def twice()
    yield()
    yield()
end

The magic yield function is then specified using the code block syntax, when twice is called:

twice() do
    print "hello"
end

Output:

hello
hello

(Ruby also allows the use of the more C-like {} delimiters -- which I generally prefer -- in place of do/end, but do/end matches the Algol-like syntax of Ruby's built-in constructs, so I prefer it in that context.)

Like regular function declarations, code blocks can take parameters:

def with(a, b)
    yield(a, b)
end

with(7, 3) do |x, y|
    print x+y
end

Output:

10

Ruby uses this for all kinds of familiar things, both from procedural and functional languages. Here are some examples:

# a foreach loop over a list
[1, 2, 3].each do |x|
    print x
end

# mapping
squares = [1, 2, 3].map do |x|
    x*x  # note the handy auto-return here
end

# like what some languages call "using" or "with"
File::open('foo.txt') do |fd|
    # the file will automatically be closed when we exit this scope
    print fd.read()
end

First of all, I think this is a great syntax for lambda expressions in Algol- and C-like languages. I prefer the C-like variant in general, so I'll use that from now on. Also, in imitation of C/C++/Java/etc function declaration, I would pull the |x, y| signature definitions out in front of the {}s. (I think I like the vertical bars better than overloading (x, y) to handle both invocation and declaration the way C does. Recruiting || for delimiters strikes me as an okay idea, but there are some good reasons to prefer distinct symbols for open and close -- you could also overload (), [], or even <>.) So I think the standard syntax for (lambda (x y) code) in C-like languages should be something like |x, y| { code }. This has the fun consequence that function declaration could be done like this:

foo = |x, y| {
    do_stuff(x*y)
}

Or, for a statically-typed language, maybe something like:

int func foo = |int x, float y| {
    do_stuff(x*y)
}

Some of the things Ruby does with its code blocks are built-in constructs (like foreach and using) in other Algol-like languages. This got me thinking: it would be nice to have a language in which all the control structures were expressible as functions that take code blocks. In Ruby (using the {} syntax) you could write an "if" function and call it like this:

if (a > b) {
    do_stuff()
}

But not like this:

if (a > b) {
    do_stuff()
} else {
    other_stuff()
}
Ruby's code block syntax is based on a similar idea in Smalltalk, which is powerful enough to express if/else (although not in C-like syntax):
(a > b)
ifTrue: [ doStuff ]
ifFalse: [ otherStuff ].

The key concept employed in the Smalltalk code above is just named parameters. There's no reason named parameters couldn't work with the C-like code block syntax -- instead of setting the magic "yield" variable the way Ruby does, just allow function-type parameters to be declared, say, at the end of the parameter list, and allow parameters to be specified by name in much the same way Ruby and Python already do. It would be cute to write a language in which "if" could be implemented like this:

if = |cond, block, else=nil| {
    # using logic operators for control flow is ugly
    (cond and (block() or true)) or (else and (else() or false))
}

and called like this:

if (cond) {
    do_stuff()
} else {
    other_stuff()
}

One interesting thing about a syntax like that is that it won't allow you to implement a standard while (cond) { code } construct, because the condition doesn't work like a parameter to a function: it can't be evaluated at call time, it needs to be re-evaluated each time the loop restarts. Instead, you'd get something like while {cond} { code }, which I find I prefer -- () and {} have clear and separate meanings, so we'd be clarifying a situation that sometimes surprises beginners by using the correct brackets for the semantics!

Sunday, September 10, 2006

Close-block Delimiter Symbols Considered Helpful

One of the first things one notices about Python is that it has "significant whitespace"; in particular, its syntax relies on indentation to delimit blocks of code. Most languages make indentation optional, but a few, like Python and Haskell, require consistent indentation. Some people immediately hate this feature, and some people immediately love it. My own initial reaction was that this was not a good idea, but I didn't feel strongly about it, and I gave it a chance.

Now I do feel strongly about it.

Over time I have come to feel that this is the worst single mistake in the design of the Python language. What's particularly bad about this mistake is that it might be propagated into future languages -- it could conceivably irritate programmers like me for decades to come. So this is my attempt to explain why Python should not be imitated in this respect.

First, let me separate out two aspects of this feature. The first is required indentation. Required indentation is pretty much okay with me. I don't think it's particularly wise to make your language interact badly with the tabs-vs-spaces quagmire, but who knows? Maybe a language feature like this, in a popular language, could finally end that holy war. Anyway, that's not what I'm talking about here. The second aspect, the one I'm interested in, is doing without a close-block delimiter symbol. (An example of block delimiter symbols, if that phrase isn't self-explanitory: Java's block delimiter symbols are "{" and "}".) Note that this is entirely unrelated to the first aspect -- a language could require consistent indentation just like Python does, and also have an explicit close-block delimiter.

Oh, you say, but that would be redundant. How horrible and ugly! Well, contrary to popular opinion, Python is not entirely without block delimiter symbols. In fact, every Python block must begin with the open-block delimiter, ":", like this:

if a == b:
    do_stuff()

Compare this to the equivalent Ruby code, which has a close-block delimiter, "end", but no explicit open-block delimiter:

if a == b
    do_stuff()
end

So there is no ideological purity involved in Python's abandonment of the close-block delimiter; it already has a redundant block delimiter, it just has the wrong one! Why the "wrong" one? Because, perhaps unlike open-block delimiters, close-block delimiter symbols are useful. They're so useful that even Python can't get along without one: it has a keyword that it uses a lot like a restricted close-block delimiter symbol for a few special cases where it just can't get around the need for such a thing! It's called "pass", and it's used in these cases:

# I don't care whether this works or not
try:
    this_might_fail()
except:
    pass
# make this method do nothing in this subclass
def override_method():
    pass

Again, some Ruby for comparison:

# make this method do nothing in this subclass
def override_method()
end

Okay, so technically, pass isn't a close-block delimiter, it's a no-op. Other lines can follow it at the same indentation level. But it's required for otherwise-empty blocks, and there's no reason for it to exist other than to make clear the existence of such a block, and in practice it is used very much like a close-block delimiter. The reasoning here is interesting. When a block contains lines of code, Python's designers presumably feel that the meaning of the code is clear without a close-block delimiter. But when a block contains no code, it's somehow too weird; one "def" statement directly before another, or an "except" with unindented lines after it would be too confusing. So an indication that the block is empty and over is desirable, even though it's not strictly necessary in order for the program to be parsable.

I also find it interesting that the case with open-block delimiters is not so weird -- in languages which lack them, as Ruby does, the only consequence is that while some blocks start with something like "if a == b", others start with a more generic symbol, like "begin". Many languages do that anyway, even when they have an open-block delimiter, as with the C/C++/Java "do {} while();".

Speaking of which, you may have noticed that Python lacks a do-while loop. Why? I think it's because it would be awkward:

do:
    some_stuff()
while a > b
foo()

Because of the lack of a close-block delimiter symbol, "while a > b" above looks like an ordinary Python statement, or the start of a new while loop, after the end of the block. Tellingly, the proposal for adding a do-while loop to Python suggests that it be integrated with the existing while loop, so that the "while" clause could open a new block, sort of like the way if/else and try/except work. The proposed syntax for the example I gave above would employ "pass" to visually close the block, like this:

do:
    some_stuff()
while a > b:
    pass
foo()

(This combination of "do-while" and "while" into a single looping construct is actually very appealing for other reasons -- I encourage you to read the reasoning in the PEP to see why more languages should have a construct like Python's proposed do-while-else.)

Aside from simple readability, there are some important reasons why close-block delimiter symbols are useful. Let me point out what a big deal that is. The whole question about how to do block-delimiting is usually about readability. The questions are things like "does {} look better than begin/end?", and "is (foo bar) clearer than <foo>bar</foo>?", and "which lines should the delimiters go on?" -- questions about what the code looks like. When you use indentation as your only close-block delimiter, though, there are some problems that go beyond readability. Here are the ones I'm aware of:

  • Many technologies are bad at preserving indentation in text. Pasting Java or Ruby code into a chat message or web page (without <pre>) might make it look uglier than you'd like, but pasting Python code into the same places is likely to destroy information, rendering the code uninterpretable or just subtly incorrect. I see this happen pretty frequently.

  • When moving a section of code from one nesting level to another, or changing the indentation scheme of a block of code to match the local convention, close-block delimiter symbols preserve the information about where blocks exist in the moving code section. Indentation is very bad at this, because its addition or removal can be done in a non-atomic way, affecting some lines before others. If you indent by hand, it's easy to lose track, as you move down through the pasted section, of the intended containing block for the next line of code. If you get this wrong in a language like Java, your code may look confusing, but it will behave correctly. In a language like Python, your code can quietly do the wrong thing! (This has actually happened to me.) Getting this right without undue mental effort requires special editor features, but even those can't always save you. My editor happens to have those features, and I still occasionally get bitten by this when someone else's editor interprets tab characters differently from mine, or when a nesting-level change is not a simple copy-and-paste operation (for example, when an enclosing conditional block is being removed, or when two-space indentation is being converted to four-space indentation.)
Sure, these are not such horrible problems. They don't irritate me most days. But they do irritate me sometimes, and there's no reason why I should ever have to tolerate them! I get nothing in return. Redundant close-block delimiter symbols have no associated cost, and some tangible benefits. Please, when designing new a language -- even a language that requires consistent indentation -- include a close-block delimiter symbol!