Thursday, March 13, 2008

Tree Lists as a Default List Data Structure

I'd like to call your attention to Tom Kleinpeter's new blog, in which he's been posting lots of great stuff about his experiences building and scaling Audiogalaxy and FolderShare. I worked with Tom at both of those companies, helping him build the p2p clients and backend systems. If you know me, you know that I think most software engineers are pretty crappy. Statistically speaking, I probably think you're a pretty crappy engineer. Tom, on the other hand, is a really good engineer, one of the best I've worked with. His blog is part of what inspired me to finally finish setting up this one.

Anyway, in an earlier article, he mentioned skip lists and unrolled link lists, which reminded me of my love for this category of list data structures.

Most languages that have a built-in list type have one of two alternatives (and often strangely lack any standard implementation of the other alternative):

  • Linked lists are the popular built-in list type in functional programming languages like Scheme and Haskell. They allow fast O(1) inserts and deletes at any point in the list. Unfortunately, lookup is O(n), which makes many common insert and delete operations (those that don't start from an iterator in the correct position in the list) O(n) as well. Memory overhead for linked lists involves at least one link pointer for every element in the list. All in all, they're an algebraically fundamental but pragmatically somewhat specialized type of list.

  • Array lists (that is, lists implemented as arrays which are dynamically resized to fit the length of the list, usually by doubling/halving the array length when necessary) are popular in many other dynamic languages, such as Python, Ruby, and Perl, and are the most popular type of list in the collection libraries of many langauges that don't have built-in list support, including C++, Java, and C#. Since they require shifting large portions of the array around in memory, inserts, deletes, and prepends are O(n), but appends and lookups are O(1) (and cache-friendly). Amortized memory overhead is lower than for linked lists, since link pointers are not required, and in the worst case the overhead is equal to that of linked lists. They're practical for a wide range of applications, but fall down when you want to insert or delete in the middle of the list.

Tree-based list data structures — let's call them "tree lists" — are a compromise between those two options. The basic strategy for tree lists is to split up your list into segments stored as leaves in a tree (as either linked lists or array lists). Lookups in a tree list require tree traversal, but not traversal of the entire list. Inserts and deletes may require shifting some elements around in memory, but not many of them. Inserts, deletes, prepends, appends, and lookups all wind up in the neighborhood of O(log n). That's acceptable performance for a wide variety of operations.

Skip lists and unrolled link lists, though they use unusual representations of trees, are both examples of this strategy. (The major trade-off between them is that by using linked lists for the leaves, skip lists retain the O(1) insert/delete performance of linked lists, but at the cost of being less cache-friendly. Unrolled linked lists make the opposite choice.) Files in many filesystems are also implemented as tree lists, because of another advantage of these structures over array lists: segmenting the list allows for non-contiguous storage, which can help avoid expensive move operations in the heap as well as on disk.

BList is an implementation of a type of tree list in Python using B+Trees, which have the advantage that for small lists, they retain the excellent performance characteristics of small array lists. Python has rejected the proposal that standard Python lists be replaced with BList, and I can see why that makes sense for Python; drastically changing the performance characteristics of a basic type used in most existing code seems like a bad idea. But I think it would be interesting for a new language to adopt something like BList for its built-in list type. Tree lists in general seem like a good generic choice for lists, since from a performance perspective they're less specialized than linked lists and array lists.

9 comments:

David R. MacIver said...

You might want to check your entry on linked lists. You say they're commonly used in Haskell and Scheme (which they are), but then go on to list performance characteristics corresponding to mutable linked lists...

With an immutable linked list you need to recreate the head of the list to insert or delete, so inserting or deleting from point k is an O(k) operation.

Keeping a zipper into a linked list corresponds a bit better to your notion of "iterator in the right place", where you have O(1) insert and delete at the current point, but that's not the same thing.

Steven Hazel said...

David:

Good point. Although, I always wonder to what extent implementations of languages with immutable semantics are able to determine when a value will become unreachable, and implement value-replacements like immutable insertions using faster mutation-based methods. I don't know whether that kind of thing is ever done in practice.

Sebastian Sylvan said...

Seems to me like 2-3 finger trees is something you may want to look up. It supports fast access to either end, concatenation etc, and most other operations are O(log n)

Adrian Thurston said...

The language I'm working on for my thesis project will use tree lists as an implementation of the map type. It is an AVL tree with linked nodes.

http://www.cs.queensu.ca/~thurston/colm/

Adrian

David R. MacIver said...

Steven:

They don't, for the most part. Certainly not in instances like this. Most things converted to mutable state will be things like tail call optimisation and uniqueness types. Most of the time (like here!) converting to mutable state would require the compiler fundamentally change the algorithm you're using, which is both hard and fairly dangerous. It also probably loses a lot of the implementation efficiencies gained by using largely immutable datastructures.

(Incidentally, it's been pointed out to me that Scheme linked lists are in fact mutable, even if they're not normally used that way. Haskell ones are very decidedly not).

Also, seconding the use of finger trees. They're a nice structure.

Dan Doel said...

Although a Haskell compiler won't figure out that you're using lists uniquely and turn that into mutable update, it can (theoretically at least) compile certain list usages very efficiently via various fusion rules and other optimizations. For instance:

fac n = product [1..n]

could be fused and inlined into a loop with an accumulator. The work on stream fusion that Don Stewart and company have been doing, or Neil Mitchell's Supero optimizer are attempts to do this sort of stuff (although Supero is more general than just list fusion). In fact, you can see one of his papers for a somewhat more realistic example, where his optimizer turns:

print . length . words =<< getContents

into a loop that is competitive with the handwritten C equivalent.

Of course, setting individual values at random indices isn't the sort of operation that will be nicely fused. If that's really what you need to do, lists probably aren't the right structure.

tom said...

When you use lists (eg in Scheme/Lisp) you most often map a function on each element, i.e. you process the elements in sequence from the beginning the the end. Lists play well with recursive programming.

If you use lists and often end up looking up values, there is a slight chance you're do something wrong. The answer IMHO would not be to change the implementation of lists but (1) to not use lists or (2) to consider a change in your programming style.

rog peppe said...

one characteristic of simple linked lists that you didn't mention is that when they're immutable, you can share substructure (i.e. you've got an inverted tree, with potentially many list heads). this can save much space, and is also wonderful in a multithreaded environment, as you don't have to copy the data structure when passing it to another thread.

michaelw said...

@rog:

You can share structure in immutable trees, too. The advantage over lists is that you can share more: common prefixes, for example.