The problem with STM: your languages still suck January 2, 2009 | 12:13 pm

So, I’m reading Brandon Werner’s post on STM, about why Software Transactional Memory hasn’t caught on yet. There are three problems with STM that make it a CS Researcher Full Employment Act, allowing many to try and none to succeed in implementing STM for “mainstream” languages like Java, C#, and C++. They are:

1. What happens when you access a piece of transactional memory from outside a transaction? This allows for race conditions to happen, even on transactional memory. Worse yet, these race conditions can often reveal to the programmer the inner workings of the implementation of STM- meaning that it’s now possible to write code that works on one implementation of STM, but not on another.

2. What happens when you access a non-transactional variable or perform IO from within a transaction? If you start a transaction, then call the function launch_missiles, then abort the transaction, what happens- is Moscow still vaporized?

3. The performance implications of transactional memory. As programmers, we’re used to assignments being cheap. Basically, if we see the code:

    x = y;

we assume it gets compiled into code something like:

    movl %eax, (%ebx)

i.e. a single instruction which takes only a clock cycle or two to complete. One of the problems with STM is that it violates this assumption, big time- increasing the cost of a store from a couple of clock cycles to easily hundreds.

This means it’s very easy for the overhead of STM to become the dominant cost in code. Imagine the following case: we’re updating a tree that has about a thousand elements in it, so it’s 10 levels deep. Now, assume the logic overhead of walking down the tree is 10 clock cycles per node, or 100 clock cycles total. The single-threaded mutable version also does 10 writes, at 2 clock cycles apiece, for a total cost of 120 clock cycles. Add in an overhead of, say, 50 clock cycles per STM store (a reasonable amount), and suddenly the cost of updating that tree jumps to over 600 clock cycles- 500 of which are STM overhead. STM is now taking up over 80% of the program’s time and making the program take 5 times as long. This is a non-trivial overhead.
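To make the arithmetic concrete, here’s the same back-of-the-envelope model as runnable code. The cycle counts are the same illustrative numbers as above, not measurements:

```haskell
main :: IO ()
main = do
  let depth       = 10   -- levels walked in a ~1000-element tree
      logicCost   = 10   -- cycles of traversal logic per node
      writeCost   = 2    -- cycles for a plain mutable store
      stmOverhead = 50   -- assumed extra cycles per STM store
      mutable = depth * logicCost + depth * writeCost  -- 120 cycles
      stm     = mutable + depth * stmOverhead          -- 620 cycles
  print (mutable, stm)  -- prints (120,620): roughly a 5x slowdown
```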

I should note that the clock cycle counts I’m throwing around are made up, but they’re of the right order of magnitude.

What’s interesting is that these problems are not so much problems with Software Transactional Memory, per se, but problems with the languages themselves. STM is simply putting requirements on the languages that the languages are not equipped to handle- like segregating transactional side-effects from non-transactional side-effects. Or discouraging heavy use of mutability.

Brandon ends his post with the question “Anyone for Concurrent Haskell?”, and I think it’s useful to see why implementing STM in Java/C#/C++ is so hard that basically no one has done it successfully yet, while implementing STM in Haskell was done over a long weekend by SPJ- and, with one minor niggle (Simon, you should know better than to linearly search a list; use a tree, man), it works just fine.

For example, what happens if you access a transactional variable from outside a transaction? You can’t. The only way to read a transactional variable is to call readTVar, which has the type TVar a -> STM a. The “STM a” part says “Here, you can have the value held in the TVar- but only in a transaction!” The type system guarantees that if you can see the value in a TVar, you’re in a transaction.
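A minimal sketch of that guarantee, using the names from GHC’s stm library (the example itself is mine, not from the post). The commented-out line is exactly the kind of code the type checker refuses:

```haskell
import Control.Concurrent.STM

main :: IO ()
main = do
  tv <- atomically (newTVar (42 :: Int))
  -- v <- readTVar tv  -- type error: readTVar tv :: STM Int, but main is in IO
  v <- atomically (readTVar tv)  -- the only way in is through a transaction
  print v  -- prints 42
```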

Likewise with accessing non-transactional variables or doing IO from within a transaction- all IO, and all the classic (non-transactional) mutable data structures, live, in Haskell, in the IO monad. And you can’t mix the STM monad with the IO monad- you’re either in one or the other. You can build up a closure in one monad to be executed in the other, but the actual code execution is strictly segregated. So the implementation doesn’t have to worry about what happens when you call launch_missiles from within a transaction- since launch_missiles affects the outside world, it has to live in the IO monad, and thus can’t be called from within the STM monad.
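Here’s a sketch of that segregation (launchMissiles and transfer are made-up names for illustration, not anything from the stm library). Again, the commented-out line is what the compiler rejects:

```haskell
import Control.Concurrent.STM

-- A hypothetical side-effecting action; anything that touches the
-- outside world must live in IO.
launchMissiles :: IO ()
launchMissiles = putStrLn "missiles away"

transfer :: TVar Int -> Int -> STM ()
transfer acct n = do
  bal <- readTVar acct
  -- launchMissiles  -- type error: IO () is not STM (), so a transaction
  --                 -- can never fire the missiles, aborted or not
  writeTVar acct (bal - n)

main :: IO ()
main = do
  acct <- atomically (newTVar 100)
  atomically (transfer acct 30)
  bal <- atomically (readTVar acct)
  print bal  -- prints 70
```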

And Haskell discourages mutability. Take the tree example I gave above. In single-threaded Haskell, you’d use a purely applicative tree, instead of a mutable one. When updating the tree, instead of modifying the old nodes, you’d simply allocate new nodes, and create a new tree representing the new state. Since the old nodes are immutable, and thus aren’t going to change out from under you, any node that isn’t changed by the update can be simply shared between the new and old trees- meaning that instead of having to allocate a thousand new nodes, you’d only have to allocate 10 new nodes. And with a good GC system, allocation is cheap- we’re looking at maybe 5 clocks per new node, or a total of 50 extra clock cycles. Single-threaded Haskell is slower than single-threaded imperative, but not by much.

But here’s the kicker- the intuitive, simple implementation of this in Parallel Haskell would be to keep the immutable tree, and just have a single, mutable variable that points to the current tree. If that variable is transactional, and takes 50 clocks to update, the total cost of updating the tree is now just 200 clock cycles (100 logic + 50 allocation + 50 for the single STM update). This is compared to 120 for the single-threaded version, and 600 clock cycles for the lots-of-STM-mutable-variables version. Mostly functional with small amounts of mutability pays a much smaller price for STM than the classical heavily mutable and imperative code.
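The design above can be sketched in a few lines- a purely functional tree whose updates share unchanged subtrees, with one TVar pointing at the current version. (The tree type and insert function are my own minimal illustration, not code from the post.)

```haskell
import Control.Concurrent.STM

-- A purely functional binary search tree: inserting copies only the
-- path from the root down, sharing every untouched subtree.
data Tree = Leaf | Node Tree Int Tree

insert :: Int -> Tree -> Tree
insert x Leaf = Node Leaf x Leaf
insert x t@(Node l v r)
  | x < v     = Node (insert x l) v r  -- new spine node, right subtree shared
  | x > v     = Node l v (insert x r)  -- left subtree shared
  | otherwise = t

toList :: Tree -> [Int]
toList Leaf         = []
toList (Node l v r) = toList l ++ [v] ++ toList r

main :: IO ()
main = do
  -- One transactional variable points at the current (immutable) tree,
  -- so each update pays for a single STM store, not one per node.
  root <- atomically (newTVar Leaf)
  mapM_ (\x -> atomically (readTVar root >>= writeTVar root . insert x))
        [5, 3, 8, 1]
  t <- atomically (readTVar root)
  print (toList t)  -- prints [1,3,5,8]
```

Swapping in a whole new root is also what makes the transaction cheap to validate: the STM machinery only has to track one variable.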

So this is the true barrier to entry to STM: it requires a radical, deep, fundamental change in how people program in order to work. I’m not sure people yet realize just how big a deal this is: all those <PARODY STYLE="Sagan">billions and billions</PARODY> of lines of code written in Java, C, C++, C#, Fortran, Ruby, Python, Perl, etc. are basically garbage at this point. Some small fraction of them might be saved, but the vast majority- coded in a highly imperative way, such that it’s impossible to segregate side-effecting code from non-side-effecting code- will simply have to be junked and rewritten.

And yet, as bad and as expensive as STM plus pure functional is, the alternatives are worse. STM plus imperative imposes huge, unacceptable performance penalties. Classic threads and locks (or threads and synchronized, for you Java programmers) lead to the worst sort of Heisenbugs. True story: I once had to trace down a race condition that, over several thousand tests per second, only manifested itself once every four days on average. The vast majority of the time it’d work just fine- the rate of failure was small enough that it had escaped testing (think about it- you’ve been running the same test, thousands of times a second, for three days, and it hasn’t failed yet- when do you declare that the test looks like it’s passing?), but it was common enough that it was causing our customer’s system to lock up about once a week, so it had to get fixed. Oh, and threads and locks aren’t compositional- you can’t take two hunks of code which are individually correct and combine them to form one large, correct piece of code. But hey, it’s not like code reuse is that important to programmers, is it?

Separate processes (at least conceptually) and message-based concurrency impose a huge cost on fine-grained sharing, and don’t scale up or down well- take a program that uses hundreds of concurrent processes and throw it on a system with thousands of cores, and most of your computing resources won’t be used. Throw the same program with the same hundreds of processes on a single-core system, and watch it bog down hard. To get message passing to scale, you end up having to re-architect and redesign your program every few years to handle the exponentially increasing number of cores machines have. Remember that from here on out, more transistors does not mean faster single-threaded performance; it just means more cores. So, as long as Moore’s law holds out, the number of cores will continue growing exponentially- and, last I heard, Moore’s law was still good for decades. Message passing is still a useful idea (especially for spreading a computation out over multiple different “boxes”, for a vague definition of a box), but it’s not a complete solution. And note that message passing runs into several of the same retrofitting problems STM does (how many libraries depend upon global state for their proper functioning, for example?).

And lastly but not leastly, simply ignoring the whole multi-threading issue and staying single-threaded means that Moore’s law is now done for you- single cores are about as fast as they’ll ever be. You might gain occasional benefits from increased clock speeds, but the old paradigm of having your code double in speed every couple of years is over. Well, except for your competitors, who did adapt to multi-threading- their code is doubling in speed (by doubling the number of cores it can run on) every couple of years. Have a nice day.

STM is the worst possible way to parallelize code, except for all the others. Because the truth is that parallelism- not STM, not functional programming, but parallelism- is both a challenge to our fundamental assumptions about programming, and, unfortunately, no longer a challenge we programmers can avoid facing.