I'd like to call your attention to Tom Kleinpeter's new blog, in which he's been posting lots of great stuff about his experiences building and scaling Audiogalaxy and FolderShare. I worked with Tom at both of those companies, helping him build the p2p clients and backend systems. If you know me, you know that I think most software engineers are pretty crappy. Statistically speaking, I probably think you're a pretty crappy engineer. Tom, on the other hand, is a really good engineer, one of the best I've worked with. His blog is part of what inspired me to finally finish setting up this one.
Anyway, in an earlier article, he mentioned skip lists and unrolled linked lists, which reminded me of my love for this category of list data structures.
Most languages that have a built-in list type have one of two alternatives (and often strangely lack any standard implementation of the other alternative):
- Linked lists are the popular built-in list type in functional programming languages like Scheme and Haskell. They allow fast O(1) inserts and deletes at any point in the list. Unfortunately, lookup is O(n), which makes many common insert and delete operations (those that don't start from an iterator in the correct position in the list) O(n) as well. Memory overhead for linked lists involves at least one link pointer for every element in the list. All in all, they're an algebraically fundamental but pragmatically somewhat specialized type of list.
- Array lists (that is, lists implemented as arrays which are dynamically resized to fit the length of the list, usually by doubling/halving the array length when necessary) are popular in many dynamic languages, such as Python, Ruby, and Perl, and are the most popular type of list in the collection libraries of many languages that don't have built-in list support, including C++, Java, and C#. Lookups are O(1) and appends are amortized O(1) (and both are cache-friendly), but inserts, deletes, and prepends are O(n), since they require shifting large portions of the array around in memory. Amortized memory overhead is lower than for linked lists, since link pointers are not required, and in the worst case the overhead is equal to that of linked lists. They're practical for a wide range of applications, but fall down when you want to insert or delete in the middle of the list.
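To make the trade-off between those two concrete, here's a minimal sketch in Python. The `LinkedNode` class and the `insert_after`, `nth`, and `to_list` helpers are names I made up for illustration; the array list side just uses Python's built-in `list`, which is an array list.

```python
class LinkedNode:
    """One cell of a singly linked list (hypothetical helper for this sketch)."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def insert_after(node, value):
    """O(1): splice a new cell in after `node`; nothing gets shifted."""
    node.next = LinkedNode(value, node.next)

def nth(head, n):
    """O(n): the only way to reach position n is to walk n links."""
    node = head
    for _ in range(n):
        node = node.next
    return node

def to_list(head):
    """Collect the values into a plain Python list for easy comparison."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

# Linked list: the splice is O(1), but finding the spot is an O(n) walk.
head = LinkedNode("a", LinkedNode("b", LinkedNode("d")))
insert_after(nth(head, 1), "c")

# Array list: finding the spot is O(1), but the insert shifts everything after it.
array = ["a", "b", "d"]
array.insert(2, "c")
```

Both end up holding a, b, c, d; the difference is only in where the O(n) cost shows up.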
Tree-based list data structures — let's call them "tree lists" — are a compromise between those two options. The basic strategy for tree lists is to split up your list into segments stored as leaves in a tree (as either linked lists or array lists). Lookups in a tree list require tree traversal, but not traversal of the entire list. Inserts and deletes may require shifting some elements around in memory, but not many of them. Inserts, deletes, prepends, appends, and lookups all wind up in the neighborhood of O(log n). That's acceptable performance for a wide variety of operations.
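Here's a rough sketch of what I mean by a tree list, again in Python. It's not taken from any particular library: `Node`, `TreeList`, and the `MAX` fan-out are all names and choices I invented for the example, `MAX` is kept tiny so the tree actually grows, and delete is left out to keep it short. Leaves are small array lists, and every node caches how many elements live beneath it so lookups can skip whole subtrees.

```python
MAX = 4  # assumed fan-out: max elements per leaf / children per internal node

class Node:
    """A leaf holds elements in `kids`; an internal node holds child Nodes."""
    def __init__(self, kids, leaf):
        self.kids = kids
        self.leaf = leaf
        self.size = len(kids) if leaf else sum(k.size for k in kids)

    def get(self, i):
        if self.leaf:
            return self.kids[i]
        for child in self.kids:        # skip whole subtrees using cached sizes
            if i < child.size:
                return child.get(i)
            i -= child.size
        raise IndexError(i)

    def insert(self, i, value):
        """Insert before position i. Returns a new right sibling if this node split."""
        self.size += 1
        if self.leaf:
            self.kids.insert(i, value)  # shifts at most MAX elements
        else:
            for pos, child in enumerate(self.kids):
                if i <= child.size:     # <= so appends land in the last child
                    break
                i -= child.size
            sibling = child.insert(i, value)
            if sibling is not None:
                self.kids.insert(pos + 1, sibling)
        if len(self.kids) > MAX:        # overflow: split this node in half
            mid = len(self.kids) // 2
            right = Node(self.kids[mid:], self.leaf)
            self.kids = self.kids[:mid]
            self.size -= right.size
            return right
        return None

class TreeList:
    def __init__(self, items=()):
        self.root = Node([], leaf=True)
        for x in items:
            self.insert(len(self), x)

    def __len__(self):
        return self.root.size

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.root.get(i)

    def insert(self, i, value):
        i = max(0, min(i, len(self)))   # clamp, like list.insert
        sibling = self.root.insert(i, value)
        if sibling is not None:         # root split: the tree grows one level
            self.root = Node([self.root, sibling], leaf=False)

t = TreeList(range(10))
t.insert(5, "x")
t.insert(0, "front")
```

Each operation touches one node per level and does a bounded amount of work in each, which is where the O(log n) comes from. A real implementation would use a much larger fan-out so that small lists fit in a single leaf.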
Skip lists and unrolled linked lists, though they use unusual representations of trees, are both examples of this strategy. (The major trade-off between them is that by using linked lists for the leaves, skip lists retain the O(1) insert/delete performance of linked lists, but at the cost of being less cache-friendly. Unrolled linked lists make the opposite choice.) Files in many filesystems are also implemented as tree lists, because of another advantage of these structures over array lists: segmenting the list allows for non-contiguous storage, which can help avoid expensive move operations in the heap as well as on disk.
BList is an implementation of a type of tree list in Python using B+ trees, which have the advantage that small lists retain the excellent performance characteristics of small array lists. Python has rejected the proposal that standard Python lists be replaced with BList, and I can see why that makes sense for Python; drastically changing the performance characteristics of a basic type used in most existing code seems like a bad idea. But I think it would be interesting for a new language to adopt something like BList for its built-in list type. Tree lists in general seem like a good generic choice for lists, since from a performance perspective they're less specialized than linked lists and array lists.