hackerdashery

Wednesday, August 27, 2014

P vs. NP and the Computational Complexity Zoo

At long last, my second video is finished!

Hackerdashery #2: P vs. NP and the Computational Complexity Zoo

Sunday, April 21, 2013

Hackerdashery YouTube Channel

I'm back! This time in video.

If you liked my blog back in 2008, you might like my new YouTube channel. I'll be talking about programming and all things tangentially related. Check it out and subscribe for more.

Monday, September 1, 2008

Google Chrome and the Future of Browsers

The big news today in the nerd world is that Google is going to ship an open-source browser, called Google Chrome.

Chrome incorporates some of the same ideas Bram and I wrote about recently: each site runs in its own process, so that memory can be bulk-reclaimed when you navigate away from a domain or close a tab, and it provides a task-manager-like interface that lets you see which site is using all you CPU or RAM. It also uses a more sophisticated Javascript VM called V8, which features precise, incremental garbage collection, and dynamic compilation.

Between these good design ideas and Google's ability (and intention) to test against their enormous, ranked web cache, I think Chrome is pretty much guaranteed to be a substantially better browser than IE, Firefox, and Safari are today.

By treating web apps more like regular applications, Chrome, together with Gears, constitutes a serious bid by Google to grow browsers into decently capable application platforms almost on the scale of operating systems. That means Google is slowly but surely working to erode Microsoft's monopoly. If browsers provide a high-quality, universally popular, OS-independent platform for applications, OS vendors no longer have any particular advantage in the desktop applications market. You don't have to look too far beyond Chrome to see operating systems becoming completely interchangeable commodities, just invisible layers that provide hardware support and enough infrastructure to run a browser and maybe some games.

Maybe Google will finally be able to make good on Marc Andreessen's declaration over a decade ago that the browser would make the operating system irrelevant.

Tuesday, July 15, 2008

codepad vim plugin

Nicolas Weber has written a handy Vim plugin that adds commands to paste the contents of your current buffer into codepad, and optionally run it. You can get it here.

I spend most of my time in Emacs, so if people find this useful, maybe I'll write an Emacs version. Although, I've been considering a switch to Vim. I'm thinking maybe its multi-keystroke commands will be easier on my wrists than the control- and meta- chording I do so much of in Emacs. I can't decide whether breaking with a decade of muscle-memory training will be more or less painful.

Thursday, June 19, 2008

Web Browsers and Memory Fragmentation

I've been following Stuart Parmenter's blog posts about improving memory usage in Firefox 3. The most interesting part has been his excellent work on reducing memory fragmentation. By making pretty pictures of memory fragmentation under various malloc implementations, he was able to identify jemalloc as the best option for Firefox, and the improvement made by switching was pretty impressive.

But I suspect that ultimately it won't be enough. The advent of Ajax is dramatically changing the way browsers use memory. Rather than loading a fairly static DOM and occasionally doing a few small manipulations in Javascript as the site is used, modern web apps mutate the DOM much more extensively, adding and removing sections repeatedly as new data is received from the server. And they're going to be doing that kind of thing more and more.

There's an interesting problem here which is unique to browsers. Javascript doesn't give programmers direct access to pointers, so it's free to move objects around in memory however it pleases as it executes code. Modern Javascript implementations take advantage of that abstraction, and manage memory allocation with generational garbage collectors that compact memory by relocating objects as they operate. But the DOM, which also has to be manipulated by browser code in C or C++, is reference-counted. The reference counter can't migrate objects between regions of memory, so it depends on the underlying malloc implementation to control memory fragmentation. That's not always possible even for a very smart malloc to accomplish. DOMs in particular should be challenging, because they contain large numbers of strings of varying sizes.

Reference counting introduces memory fragmentation problems in any language, but the situation is particularly bad in browsers. Javascript programs and DOMs on different web pages share the same process memory space. The fragmentation caused by one page can interact with the memory used by other pages, and even after a problematic page is closed, the fragmented memory often can't be reclaimed. With ordinary applications, fragmented memory is all neatly reclaimed when the process terminates. Right now, that doesn't work for web apps.

But this is a problem for browsers, not for web apps. Individual web app authors won't be motivated to care if your browser uses a lot of memory and causes your machine to start swapping once or twice a day, especially if the blame is shared among many popular sites. Unless one site is much worse than its peers, high memory usage makes the browser look bad, not the app.

So I think browsers will eventually experience pressure to fix this problem. As the correct metaphor for a web page moves farther from "document" and closer to "application", maybe it makes sense for browsers to act more like operating systems. Local memory for a web page could be allocated in dedicated regions, and could all be bulk-reclaimed when the page is closed.

That model is simple, but it still allows individual pages to fragment their own memory and consume more and more over time. Maybe a better answer is to find a way to use a generational garbage collector on the DOM. The familiar OS model of per-process virtual memory management was designed for running programs written in languages like C and C++, and that's why it's appealing for handling browser DOMs. But Javascript is a high-level, functional, garbage-collected language. If browsers are little operating systems that run Javascript applications, maybe they should operate more like Lisp machine operating systems than like Unix.

Tuesday, June 17, 2008

Private Pastes (and Projects) for codepad.org

I've noticed that some pastes on codepad.org include confidentiality notices in their copyright boilerplate. You own (or your employer owns) the copyright on code you paste on codepad.org, but it is a public forum! Pasted code appears on the "Recent Pastes" page, for example, where anyone can see it, and search engines can index it.

It makes sense that some people might want to use the site without revealing their code to the public, though. To support that kind of usage, I've made it possible to flag a paste as private, so that it won't be publicly linked on the site, and will include a noindex meta tag to advise search engines not to index it. Just in case, I went and flagged all the pastes in the database that included the words "confidential" or "copyright" as private pastes.

While I was at it, I added a similar feature for codepad projects. Now you can set up a private codepad project to collaborate on code that you'd rather not show the whole world.

Since paste URLs are randomly generated, they shouldn't be vulnerable to the kind of URL-guessing attacks SmugMug had problems with earlier this year. Still, these measures aren't really good enough for protecting important trade secrets — especially for projects, where the name might be guessable. If it seems like there's demand for it, I'll consider adding access control lists in the future.

Thursday, May 15, 2008

How Not to Handle Streams

Let me tell you about what I did yesterday. I set out to write a little HTTP proxy server that would transfer large (> 1GB) files to and from Amazon S3, doing a tiny bit of processing on the beginning of the data as it came through. Sounds pretty simple, right? You're probably thinking that's just a matter of gluing some HTTP libraries to some S3 libraries. I thought so too.

I've been enjoying Haskell lately, so my first thought was to try to write it in Haskell, even if it meant producing my own Haskell S3 library in the process. But I soon discovered that Haskell's HTTP library doesn't have any facility for streaming large amounts of data over HTTP: it reads the entire request or response into a String in memory.

This is an extremely common problem. Library authors often don't consider applications involving large amounts of data, and assume that everything will fit in memory. This is a particularly sad state of affairs in Haskell, because the language itself supports streaming behavior via lazy evaluation — there's no need to even change the API to support streaming in Haskell. Just make your functions do lazy IO, and those Strings can be used to stream data as-needed. (In fact, there's been some work on this in Haskell.)

Note also that reading and writing as you process data is supported by TCP. A lot of people seem to be at least slightly confused about this situation, particularly in the case of receiving data from a remote sender. If the other side of a TCP connection is sending data faster than you can process it, nothing terrible happens. Your kernel's TCP buffer will fill up, it will advertise a 0-sized receiver window, and the remote client will stop being able to send. The remote TCP buffer will fill, and the socket will correctly block, or stop signaling poll(), until it can write again. To the other side, it won't look any different from a slow, bursty network connection.

Anyway, simple string-based APIs are disappointing, but it gets worse. Even when authors do consider the need to deal with large amounts of data, they often get the API wrong.

I eventually gave up on Haskell, and switched to Python. Things started out okay. BaseHTTPServer and urllib2 both let you stream data to and from file-like objects. But the example Python S3 library doesn't provide a stream API, it reads everything into strings. Well, okay, it's just an example. So I took a look at boto.

Boto has a variety of methods, some with names that sound promising. Like get_contents_to_file, that'll take a file-like object and write the data... wait. I'm getting the data from S3. I want to read it from a file-like object, not pass in a file-like object for boto to write to. Boto goes out of its way here to do the wrong thing! Rather than just handing me the wrapped socket object it no doubt has internally, it reads and writes all the data in a loop, converting the read operations I want to do into write operations I'll have to handle. To do any processing on this data, I'd have to construct a file-like object that can handle writes from boto!

Another thing I considered was writing my own S3 interface using PycURL. But PycURL requires you to give it a "write function" callback, which it will call periodically with data until the entire response body has been read. It's the same situation as with boto: you have to handle writes where you would prefer to perform reads. If you wanted to turn that into a readable file-like object, you'd have to run PycURL in a thread, and handle writes by putting data into a queue, and blocking when the queue is full. Then you could read out of the other side of the queue in the main thread.

For some reason this kind of read/write inversion is almost as common as string-based interfaces. Sometimes API users are required to provide a file-like object, and other times it's a callback function. Either way, the user ends up writing code to handle IO as it is performed by the API function, rather than performing the IO themselves. The most convenient read/write direction for an API to provide is almost always the one provided by the underlying IO: the one that allows the user to perform the IO directly, as they need it.

Frustrated with the situation in Python, I considered using Ruby. Ruby's AWS::S3 has a nicer interface than boto, and allows me to stream data reasonably. (It actually does the same kind of read/write inversion that PycURL does, but Ruby's block syntax makes that somewhat less annoying.) And Mongrel — the web server everybody uses to host their Rails apps — can be used as a library for writing HTTP servers, a lot like Python's BaseHTTPServer. Oh, but wait. Mongrel delivers the HTTP request body either as a string, or, if it exceeds some configurable size, as a tempfile. So it'll let me read from a file-like object, but insists on writing the data to an actual file on disk first!

Perhaps Mongrel is trying to be too smart, to anticipate my needs rather than exposing the more generally useful API underneath. In any case, it didn't work for my application.

The expedient solution for me was to hack up boto to make it not deliberately break the functionality I wanted. I'll be submitting a patch, so perhaps one of the five unhelpful APIs I encountered yesterday will someday be fixed. Why does this seem to be so difficult for library authors to get right?