Thursday, May 15, 2008

How Not to Handle Streams

Let me tell you about what I did yesterday. I set out to write a little HTTP proxy server that would transfer large (> 1GB) files to and from Amazon S3, doing a tiny bit of processing on the beginning of the data as it came through. Sounds pretty simple, right? You're probably thinking that's just a matter of gluing some HTTP libraries to some S3 libraries. I thought so too.

I've been enjoying Haskell lately, so my first thought was to try to write it in Haskell, even if it meant producing my own Haskell S3 library in the process. But I soon discovered that Haskell's HTTP library doesn't have any facility for streaming large amounts of data over HTTP: it reads the entire request or response into a String in memory.

This is an extremely common problem. Library authors often don't consider applications involving large amounts of data, and assume that everything will fit in memory. It's a particularly sad state of affairs in Haskell, because the language itself supports streaming via lazy evaluation: there's no need to even change the API. Just make the functions do lazy IO, and those Strings can stream data as needed. (In fact, there's been some work on this in Haskell.)

Note also that TCP itself supports reading and writing data as you process it. A lot of people seem to be at least slightly confused about this, particularly in the case of receiving data from a remote sender. If the other side of a TCP connection is sending data faster than you can process it, nothing terrible happens. Your kernel's TCP buffer will fill up, it will advertise a zero-sized receive window, and the remote side will no longer be able to send. The remote TCP buffer will fill, and the socket will correctly block, or stop signaling poll(), until it can write again. To the other side, it won't look any different from a slow, bursty network connection.
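Concretely, the relay loop this enables can be as simple as the sketch below (Python; source and dest are placeholders for whatever file-like objects you're gluing together):

# Relay data between two file-like objects in fixed-size chunks.
# If dest wraps a TCP socket and the receiver falls behind, write()
# simply blocks until the kernel's send buffer drains; that blocking
# is the flow control, and nothing more is needed.
def pump(source, dest, chunk_size=64 * 1024):
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        dest.write(chunk)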

Anyway, simple string-based APIs are disappointing, but it gets worse. Even when authors do consider the need to deal with large amounts of data, they often get the API wrong.

I eventually gave up on Haskell, and switched to Python. Things started out okay: BaseHTTPServer and urllib2 both let you stream data to and from file-like objects, as in the sketch below. But the example Python S3 library doesn't provide a stream API; it reads everything into strings. Well, okay, it's just an example. So I took a look at boto.
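Here's roughly what that server side looks like, before any S3 handling enters the picture (a sketch only; the upstream URL and chunk size are placeholders):

import BaseHTTPServer
import urllib2

UPSTREAM = 'http://example.com/big-file'   # placeholder URL
CHUNK = 64 * 1024

class ProxyHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # urlopen() returns a file-like object; read it a chunk at a
        # time and hand each chunk to the client as it arrives.
        upstream = urllib2.urlopen(UPSTREAM)
        self.send_response(200)
        self.end_headers()
        while True:
            chunk = upstream.read(CHUNK)
            if not chunk:
                break
            self.wfile.write(chunk)

BaseHTTPServer.HTTPServer(('', 8080), ProxyHandler).serve_forever()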

Boto has a variety of methods, some with names that sound promising. Like get_contents_to_file, that'll take a file-like object and write the data... wait. I'm getting the data from S3. I want to read it from a file-like object, not pass in a file-like object for boto to write to. Boto goes out of its way here to do the wrong thing! Rather than just handing me the wrapped socket object it no doubt has internally, it reads and writes all the data in a loop, converting the read operations I want to do into write operations I'll have to handle. To do any processing on this data, I'd have to construct a file-like object that can handle writes from boto!
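To make get_contents_to_file work for me at all, I'd have to write a shim along these lines (ProcessingSink is a made-up name, and the processing itself is elided):

class ProcessingSink(object):
    # A file-like object for boto to write into: it peeks at the first
    # header_size bytes for processing, then passes everything through
    # to a real destination file-like object.
    def __init__(self, dest, header_size=512):
        self.dest = dest
        self.header_size = header_size
        self.seen = 0

    def write(self, data):
        if self.seen < self.header_size:
            self.process_header(data[:self.header_size - self.seen])
        self.seen += len(data)
        self.dest.write(data)

    def process_header(self, data):
        pass  # the tiny bit of processing would go here

# key.get_contents_to_file(ProcessingSink(downstream)) would then drive it.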

Another thing I considered was writing my own S3 interface using PycURL. But PycURL requires you to give it a "write function" callback, which it will call periodically with data until the entire response body has been read. It's the same situation as with boto: you have to handle writes where you would prefer to perform reads. If you wanted to turn that into a readable file-like object, you'd have to run PycURL in a thread, handling writes by putting data into a queue and blocking when the queue is full. Then you could read out of the other side of the queue in the main thread.
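Sketched out, that workaround looks something like this (read_url is a hypothetical helper; the queue size is arbitrary):

import threading
import Queue
import pycurl

def read_url(url, queue_size=16):
    # Run PycURL in a background thread and turn its write callback
    # into a generator of chunks the caller can read at its own pace.
    q = Queue.Queue(maxsize=queue_size)

    def fetch():
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        # put() blocks when the queue is full, which stalls curl and,
        # via TCP flow control, the remote sender.
        c.setopt(pycurl.WRITEFUNCTION, q.put)
        c.perform()
        c.close()
        q.put(None)   # sentinel: end of response body

    threading.Thread(target=fetch).start()
    while True:
        chunk = q.get()
        if chunk is None:
            break
        yield chunk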

For some reason this kind of read/write inversion is almost as common as string-based interfaces. Sometimes API users are required to provide a file-like object, and other times it's a callback function. Either way, the user ends up writing code to handle IO as it is performed by the API function, rather than performing the IO themselves. The most convenient read/write direction for an API to provide is almost always the one provided by the underlying IO: the one that allows the user to perform the IO directly, as they need it.

Frustrated with the situation in Python, I considered using Ruby. Ruby's AWS::S3 has a nicer interface than boto, and allows me to stream data reasonably. (It actually does the same kind of read/write inversion that PycURL does, but Ruby's block syntax makes that somewhat less annoying.) And Mongrel — the web server everybody uses to host their Rails apps — can be used as a library for writing HTTP servers, a lot like Python's BaseHTTPServer. Oh, but wait. Mongrel delivers the HTTP request body either as a string, or, if it exceeds some configurable size, as a tempfile. So it'll let me read from a file-like object, but insists on writing the data to an actual file on disk first!

Perhaps Mongrel is trying to be too smart, to anticipate my needs rather than exposing the more generally useful API underneath. In any case, it didn't work for my application.

The expedient solution for me was to hack up boto to make it not deliberately break the functionality I wanted. I'll be submitting a patch, so perhaps one of the five unhelpful APIs I encountered yesterday will someday be fixed. Why does this seem to be so difficult for library authors to get right?

12 comments:

Unknown said...

That's where the uncool kids impress me.

$fp = fopen($_REQUEST['url'], 'r', false);
// Your little bit of processing.
fpassthru($fp);

P.S. The Blogger "identity" thing and CAPTCHA seriously suck. Can't you use Disqus?

Steven Hazel said...

Okay, I've set up Disqus. Unfortunately, they don't convert old posts.

Unknown said...

Hey, Mongrel author here. The full streaming fancy HTTP library is available for you to use, just the same way Mongrel does. No need to use the whole server since it seems you are just looking to do some streaming. Take a look at the heavily documented http11_parser stuff in ext.

Additionally, there's a client version of the same parser I wrote in the rfuzz.rubyforge.org project you can also use, and it handles chunked encoding.

Finally, I'm currently working on a new C lib and potential server that unifies the above parsers into a single super nice, correct, and *streamable* API.

For the record, the reason that Mongrel has to do tempfiles or StringIO is that Ruby's broke ass cgi.rb is not stream capable and insists on having the whole thing around to parse mime-types out. Otherwise, the API I wrote is fully capable of doing what you want.

Zed

Unknown said...

Try the lazy bytestring version of the Haskell HTTP library:

http://www.dtek.chalmers.se/~tox/site/http.php4

It's not yet merged into the main HTTP library, and the problem with not closing is a pain to work around.

Nevertheless I've used it to build a web crawler, and to build some analysis tools for video streams (Hogg). It's pretty efficient.

Unknown said...

Have you looked at the TwistedMatrix library for Python? I think it has what you are looking for.

Mads Sülau Jørgensen said...

Clearly you haven't done your homework properly.

from boto import s3
conn = s3.connection.S3Connection('key', 'secret')
bucket = conn.get_bucket('foobar')
key = bucket.get_key('key')

for bytes in key:
    other_fp.write(bytes)

Steven Hazel said...

mads,

Looks like boto beat me to the fix. I was using 1.0a, which lacks that functionality. 1.2a provides the API you mentioned, as well as a more generic read() method.

That puts it ahead of AWS::S3. But, it looks like it still doesn't provide a file-like interface on the sending side. Perhaps the authors aren't aware that the Content-MD5 header is optional?

Mads Sülau Jørgensen said...

Did you look at send_file? As far as I can tell it allows you to hand it a file-like object, which it then reads from, self.BufferSize bytes at a time.

Steven Hazel said...

mads,

That's exactly the read/write inversion I was talking about. I don't want to pass in a file and have boto read from it; I want the key object to allow me to write data as I have it ready.

Steven Hazel said...

Zed,

Thanks for pointing out http11_parser, that might be useful sometime.

I also thought it was interesting that the problem I encountered with Mongrel's API was caused by API problems further down. I look forward to seeing your C lib!

Gabriel said...

Didn't try Perl, huh? My first transfer of a 4+ GB file worked fine.

Lonecrow said...

I feel like the last ASP developer on the planet. S3 sample code and free libraries for every platform under the sun except vbscript.

Yes, there are loads for DOTNET; I have nothing against DOTNET, I just don't like it for web development.

I have dozens of web-based production applications in ASP. It would be nice to be able to add new S3-based features without having to migrate them to dotnet.