Let me tell you about what I did yesterday. I set out to write a little HTTP proxy server that would transfer large (> 1GB) files to and from Amazon S3, doing a tiny bit of processing on the beginning of the data as it came through. Sounds pretty simple, right? You're probably thinking that's just a matter of gluing some HTTP libraries to some S3 libraries. I thought so too.
I've been enjoying Haskell lately, so my first thought was to try to write it in Haskell, even if it meant producing my own Haskell S3 library in the process. But I soon discovered that Haskell's HTTP library doesn't have any facility for streaming large amounts of data over HTTP: it reads the entire request or response into a String in memory.
This is an extremely common problem. Library authors often don't consider applications involving large amounts of data, and assume that everything will fit in memory. This is a particularly sad state of affairs in Haskell, because the language itself supports streaming behavior via lazy evaluation — there's no need to even change the API to support streaming in Haskell. Just make your functions do lazy IO, and those Strings can be used to stream data as-needed. (In fact, there's been some work on this in Haskell.)
Note also that reading and writing as you process data is supported by TCP. A lot of people seem to be at least slightly confused about this situation, particularly in the case of receiving data from a remote sender. If the other side of a TCP connection is sending data faster than you can process it, nothing terrible happens. Your kernel's TCP buffer will fill up, it will advertise a 0-sized receiver window, and the remote client will stop being able to send. The remote TCP buffer will fill, and the socket will correctly block, or stop signaling poll(), until it can write again. To the other side, it won't look any different from a slow, bursty network connection.
Anyway, simple string-based APIs are disappointing, but it gets worse. Even when authors do consider the need to deal with large amounts of data, they often get the API wrong.
I eventually gave up on Haskell, and switched to Python. Things started out okay. BaseHTTPServer and urllib2 both let you stream data to and from file-like objects. But the example Python S3 library doesn't provide a stream API, it reads everything into strings. Well, okay, it's just an example. So I took a look at boto.
Boto has a variety of methods, some with names that sound promising. Like get_contents_to_file, that'll take a file-like object and write the data... wait. I'm getting the data from S3. I want to read it from a file-like object, not pass in a file-like object for boto to write to. Boto goes out of its way here to do the wrong thing! Rather than just handing me the wrapped socket object it no doubt has internally, it reads and writes all the data in a loop, converting the read operations I want to do into write operations I'll have to handle. To do any processing on this data, I'd have to construct a file-like object that can handle writes from boto!
Another thing I considered was writing my own S3 interface using PycURL. But PycURL requires you to give it a "write function" callback, which it will call periodically with data until the entire response body has been read. It's the same situation as with boto: you have to handle writes where you would prefer to perform reads. If you wanted to turn that into a readable file-like object, you'd have to run PycURL in a thread, and handle writes by putting data into a queue, and blocking when the queue is full. Then you could read out of the other side of the queue in the main thread.
For some reason this kind of read/write inversion is almost as common as string-based interfaces. Sometimes API users are required to provide a file-like object, and other times it's a callback function. Either way, the user ends up writing code to handle IO as it is performed by the API function, rather than performing the IO themselves. The most convenient read/write direction for an API to provide is almost always the one provided by the underlying IO: the one that allows the user to perform the IO directly, as they need it.
Frustrated with the situation in Python, I considered using Ruby. Ruby's AWS::S3 has a nicer interface than boto, and allows me to stream data reasonably. (It actually does the same kind of read/write inversion that PycURL does, but Ruby's block syntax makes that somewhat less annoying.) And Mongrel — the web server everybody uses to host their Rails apps — can be used as a library for writing HTTP servers, a lot like Python's BaseHTTPServer. Oh, but wait. Mongrel delivers the HTTP request body either as a string, or, if it exceeds some configurable size, as a tempfile. So it'll let me read from a file-like object, but insists on writing the data to an actual file on disk first!
Perhaps Mongrel is trying to be too smart, to anticipate my needs rather than exposing the more generally useful API underneath. In any case, it didn't work for my application.
The expedient solution for me was to hack up boto to make it not deliberately break the functionality I wanted. I'll be submitting a patch, so perhaps one of the five unhelpful APIs I encountered yesterday will someday be fixed. Why does this seem to be so difficult for library authors to get right?