The problem with STM: your languages still suck
January 2, 2009 | 12:13 pm

So, I’m reading Brandon Werner’s post on STM, about why Software Transactional Memory hasn’t caught on yet. There are three problems with STM that make it a CS Researcher Full Employment Act, allowing many to try and none to succeed in implementing STM for “mainstream” languages like Java, C#, and C++. They are:

1. What happens when you access a piece of transactional memory from outside a transaction? This allows for race conditions to happen, even on transactional memory. Worse yet, these race conditions can often reveal to the programmer the inner workings of the implementation of STM- meaning that it’s now possible to write code that works on one implementation of STM, but not on another.

2. What happens when you access a non-transactional variable or perform an IO from within a transaction? If you start a transaction, then call the function launch_missiles, then abort the transaction, what happens- is Moscow still vaporized?

3. The performance implications of transactional memory. As programmers, we’re used to assignments being cheap. Basically, if we see the code:

    x = y;

we assume it gets compiled into code something like:

    movl %eax, (%ebx)

i.e. a single instruction which takes only a clock cycle or two to complete. One of the problems with STM is that it violates this assumption, big time- easily increasing the cost of a store from a couple of clock cycles to hundreds.

Which means that it’s very easy for the overhead of STM to become the dominant cost in code. Imagine the following case: we’re updating a tree that has about a thousand elements in it, so it’s 10 levels deep. Now, assume the logic overhead of walking down the tree is 10 clock cycles per node, or 100 clock cycles total. The single-threaded mutable version also does 10 writes, at 2 clock cycles apiece, for a total cost of 120 clock cycles. Add in an overhead of, say, 50 clock cycles per STM store (a reasonable amount), and suddenly the cost of updating that tree jumps to over 600 clock cycles- 500 of which are the STM overhead, meaning that STM is taking up over 80% of the program time, and making the program take 5 times as long. This is a non-trivial overhead.
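For concreteness, the arithmetic above can be written out directly (using the post’s made-up but order-of-magnitude cycle counts):

```haskell
-- The post's illustrative cycle counts (made up, but order-of-magnitude).
logicCost, numWrites, plainStore, stmOverhead :: Int
logicCost   = 100  -- 10 levels * 10 cycles of traversal logic
numWrites   = 10   -- one store per level
plainStore  = 2    -- cycles for an ordinary store
stmOverhead = 50   -- extra cycles per transactional store

singleThreadedCost, stmCost :: Int
singleThreadedCost = logicCost + numWrites * plainStore                  -- 120
stmCost            = logicCost + numWrites * (plainStore + stmOverhead)  -- 620

main :: IO ()
main = print (singleThreadedCost, stmCost)  -- prints (120,620)
```

So roughly 500 of the 620 cycles- over 80%- are pure STM overhead, which is where the "5 times as long" figure comes from.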

I should note that the clock cycle counts I’m throwing around are made up, but they’re of the right order of magnitude.

What’s interesting is that these problems are not so much problems with Software Transactional Memory, per se, but problems with the languages themselves. STM is simply putting requirements on the languages that the languages are not equipped to handle- like segregating transactional side-effects from non-transactional side-effects. Or discouraging heavy use of mutability.

Brandon ends his post with the question “Anyone for Concurrent Haskell?”, but I think it’s useful to see why implementing STM in Java/C#/C++ is so hard that basically no one has done it successfully yet, while implementing STM in Haskell was done over a long weekend by SPJ- and, with one minor niggle (Simon, you should know better than to linearly search a list; use a tree, man), it works just fine.

For example, what happens if you access a transactional variable from outside a transaction? You can’t. The only way to read a transactional variable is to call readTVar, which has the type TVar a -> STM a. The “STM a” part says “Here, you can have the value held in the tvar- but only in a transaction monad!” The type system guarantees that if you can see the value in a TVar, you’re in a transaction.
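A minimal sketch of what that looks like in practice, using GHC’s standard `stm` library (`atomically` is the only door from `IO` into a transaction):

```haskell
import Control.Concurrent.STM

-- Reading a TVar only typechecks inside the STM monad;
-- the type system won't let this function escape the transaction.
increment :: TVar Int -> STM ()
increment tv = do
  n <- readTVar tv          -- readTVar :: TVar a -> STM a
  writeTVar tv (n + 1)

main :: IO ()
main = do
  tv <- newTVarIO 0
  atomically (increment tv) -- run the transaction from IO
  n <- readTVarIO tv        -- an atomic snapshot read from IO
  print n                   -- prints 1
```

Trying to call `readTVar tv` directly in `main` is simply a type error: there is no non-transactional path to the value.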

Likewise with accessing non-transactional variables or doing IO from within a transaction- all IO, and all the classic (non-transactional) mutable data structures, live (in Haskell) in the IO monad. And you can’t mix the STM monad with the IO monad- you’re either in one or the other. You can build up a closure in one monad to be executed in the other, but the actual code execution is strictly segregated. So the implementation doesn’t have to worry about what happens when you call launch_missiles from within a transaction- since launch_missiles affects the outside world, it has to live in the IO monad, and thus can’t be called from within the STM monad.
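A sketch of that segregation, with a hypothetical `launchMissiles` action: uncommenting the marked line is a compile-time type error, because an `IO ()` action cannot appear inside an `STM` block:

```haskell
import Control.Concurrent.STM

-- A hypothetical side-effecting action; it lives in IO, not STM.
launchMissiles :: IO ()
launchMissiles = putStrLn "missiles away"

-- A pure-STM transaction: move money between two accounts.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  modifyTVar' from (subtract amount)
  modifyTVar' to (+ amount)
  -- launchMissiles  -- won't compile: IO () is not STM ()

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)
  readTVarIO a >>= print  -- prints 70
  readTVarIO b >>= print  -- prints 30
```

If the transaction aborts and retries, nothing observable happened- precisely because nothing observable *can* happen inside `STM`.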

And Haskell discourages mutability. Take the tree example I gave above. In single-threaded Haskell, you’d use a purely applicative tree, instead of a mutable one. When updating the tree, instead of modifying the old nodes, you’d simply allocate new nodes, and create a new tree representing the new state. Since the old nodes are immutable, and thus aren’t going to change out from under you, any node that isn’t changed by the update can be simply shared between the new and old trees- meaning that instead of having to allocate a thousand new nodes, you’d only have to allocate 10 new nodes. And with a good GC system, allocation is cheap- we’re looking at maybe 5 clocks per new node, or a total of 50 extra clock cycles. Single-threaded Haskell is slower than single-threaded imperative, but not by much.

But here’s the kicker- the intuitive, simple implementation of this in Parallel Haskell would be to keep the immutable tree, and just have a single, mutable variable that points to the current tree. If that variable is transactional, and takes 50 clocks to update, the total cost of updating the tree is now just 200 clock cycles (100 logic + 50 allocation + 50 for the single STM update). This is compared to 120 for the single-threaded version, and 600 clock cycles for the lots-of-STM-mutable-variables version. Mostly functional with small amounts of mutability pays a much smaller price for STM than the classical heavily mutable and imperative code.
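The "immutable tree behind one mutable root" pattern looks like this- sketched with `Data.Map` (a balanced persistent tree from the standard `containers` library) standing in for the hand-rolled tree:

```haskell
import Control.Concurrent.STM
import qualified Data.Map as Map

-- One TVar points at an immutable map. An update builds a new map
-- (sharing every untouched node with the old one) and swaps the root,
-- so the transaction pays the STM toll on exactly one location.
insertShared :: TVar (Map.Map Int String) -> Int -> String -> STM ()
insertShared root k v = do
  m <- readTVar root
  writeTVar root (Map.insert k v m)  -- O(log n) fresh nodes, rest shared

main :: IO ()
main = do
  root <- newTVarIO (Map.fromList [(i, show i) | i <- [1 .. 1000 :: Int]])
  atomically (insertShared root 1001 "new")
  m <- readTVarIO root
  print (Map.size m)  -- prints 1001
```

However large the tree, the transaction reads and writes a single `TVar`; all the structural work happens on ordinary immutable heap data.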

So this is the true barrier to entry to STM: it requires a radical, deep, fundamental change in how people program, in order to work. I’m not sure yet people realize just how big a deal this is: all those <PARODY STYLE=”Sagan”>billions and billions</PARODY> of lines of code written in Java, C, C++, C#, Fortran, Ruby, Python, Perl, etc. are basically garbage at this point. Some small fraction of them might be saved, but the vast majority- coded in a highly imperative way, and such that it’s impossible to segregate side-effecting code from non-side-effecting code, will simply have to be junked and rewritten.

And yet, as bad and as expensive as STM plus pure functional is, the alternatives are worse. STM and imperative imposes huge, unacceptable, performance penalties. Classic threads and locks (or threads and synchronized, for you Java programmers) lead to the worst sort of Heisenbugs. True story: I once had to trace down a race condition that, doing several thousand tests per second, only manifested itself once every four days on average. The vast majority of the time it’d work just fine- the rate of failure was small enough it had escaped testing (think about it- you’ve been running the same test, thousands of times a second, for three days, and it hasn’t failed yet- when do you declare that it looks like the test is working?), but it was common enough that it was causing our customer to lock up about once a week, so it had to get fixed. Oh, and threads and locks aren’t compositional- you can’t take two hunks of code which are individually correct, and combine them to form one large, correct, piece of code. But hey, it’s not like code reuse is that important to programmers, is it?

Separate processes (at least conceptually) and message based concurrency imposes a huge cost on fine-grained sharing, and doesn’t scale up or down well- take a program that uses hundreds of concurrent processes and throw it on a system with thousands of cores, and most of your computing resources won’t be used. Throw the same program with the same hundreds of processes on a single-core system, and watch it bog down hard. To get message passing to scale, you end up having to re-architect and redesign your program every few years to handle the exponentially increasing number of cores machines have. Remember that from here on out, more transistors does not mean faster single-threaded performance, it just means more cores. So, as long as Moore’s law holds out, the number of cores will continue growing exponentially- and, last I heard, Moore’s law was still good for decades. Message passing is still a useful idea (especially for spreading a computation out over multiple different “boxes”, for a vague definition of a box), but it’s not a solution. And note that several of the retrofitting problems STM runs into, message passing also runs into (how many libraries depend upon global state for their proper functioning, for example?).

And lastly but not leastly, simply ignoring the whole multi-threaded issue and staying single-threaded means that Moore’s law is now done for you- single cores are about as fast as they’ll ever be. You might gain occasional benefits from increased clock speed, but the old paradigm of having your code double in speed every couple of years is over. Well, except for your competitors, who did adapt to multi-threading- their code is doubling in speed (by getting a doubling in the number of cores they can run on) every couple of years. Have a nice day.

STM is the worst possible way to parallelize code, except for all the others. Unfortunately, the truth is that parallelism- not STM, not functional programming, but parallelism- is both a challenge to our fundamental assumptions about programming, and unfortunately no longer a challenge we programmers can avoid facing.

  • Raoul Duke

    thanks for the opinionated post — this is such an important area for the computing world that it deserves plenty of energy. especially so clueless people like me can try to get up to speed by reading things like this. :-)

  • Jonathan Allen

    > Separate processes (at least conceptually) and message based concurrency imposes a huge cost on fine-grained sharing, and doesn’t scale up or down well- take a program that uses hundreds of concurrent processes and throw it on a system with thousands of cores, and most of your computing resources won’t be used.

    Erlang style processes are a proven technology and can be easily replicated in languages like C#.

  • Ravi

    This post is exactly the reasoning I went through once I finally understood STM in Haskell. I think “Caging the effects monster” (in SPJ’s words) is the most important engineering problem that today’s programmers face. It sounds like I reach the same conclusion as you – tomorrow’s programmers (I’m guessing 5-10 years out) will likely be programming in a pure, monadic language with STM because that’s the only solution we have so far for a compositional programming style with good parallel performance. I think the side benefit will be that caging effects has several other advantages that I went into in my ICFP Experience Report.

    My suspicion is that it won’t be a (default) lazy language because with STM and parallelism, laziness is no longer necessary “to keep you honest” in terms of purity and default laziness (and the associated space leaks) are one of the few thorns I feel in Haskell. That being said, laziness can be very convenient (and is essential if you want to implement new control abstractions without a macro system), so I’d expect it would be a language with easy optional laziness (something like Haskell’s bang patterns in reverse).

  • http://blogs.msdn.com/stmteam Yossi Levanoni

    Excellent post Brian!

    I would only comment that the functional tree with the mutable root has the problem that it allows no more than a single updater, so there goes your happy ride on Moore’s law. This is something that a surprising number of functional proponents like to gloss over: when the number of “sharing choke-points” decreases, as a result of a design methodology that frowns on shared state, those few mutable variables will become contention bottlenecks. My colleague Dave Detlefs likes to say about this that “shared (mutable) state is annoyingly useful”…

    –Yossi

  • Andrew

    the trouble with magazine/opinion pieces on complex technological features is that they spread more confusion than they address. I’m wary that this is the case here.

    Also, there is no need to try and dismiss new but perceived bad technology with negative rhetoric – bad technology will weed itself out. What I’m trying to say is, on a spectrum between clumsy/bullying-dispassionate/technical this article could move a lot to the right without becoming dull to non-researchers.

    Anyway, I’m not a researcher but will attempt to address your points.

    (1)&(2) – what you’re saying here is that programmers must still be aware that they are using a concurrency model, right? That there are still limitations that must be dealt with. The response to this is that we have to have realistic expectations, and simply ask the question (which I don’t claim to answer) “Does STM provide a model that results in real world developers writing less buggy concurrent code?”. As Microsoft are extremely aware, this is a question best answered by empirical study – i.e. give it to a large number of developers and see what happens. You won’t be able to do that with a language like Haskell because of the small number of Haskell programmers.

    (3) STM has a performance impact its true, but we need to take into consideration three things:
    (i) If STM proves to be a better model than traditional locking, we can support it in hardware. Major hardware vendors are looking at this already.
    (ii) Microsoft have done some studies that show that in many application classes, the overhead that STM introduces would be limited to ~1% of run-time. Slowing down that 1% might not have that much of an impact.
    (iii) STM is in its infancy – similarly to garbage collection the basics are getting there. People can start writing programs against the model, and the algorithms can improve independently.

  • Chris K

    And STM is like garbage collection in a big way: It is part of the compiler+runtime. Consider the original garbage collection for Java versus the newest ones. The newest ones are the result of huge amounts of investment of capital into very smart design teams that applied basic research to improve performance. The result of improving the compiler+runtime is that Java now runs very close to c/c++ and faster than most everything else ( http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=all ).

    The STM runtime can become increasingly clever, and more (Moore?) transistors can implement HARDWARE transactional memory (HTM). Sun has already done this. So the runtime uses a hybrid of STM and HTM. So those transactional memory benchmarks can and will only get better and code written in the STM style will benefit from this without having to be rewritten.

  • http://canonical.org/~kragen Kragen Javier Sitaker

    What do you think about Zope? ZODB is a software transactional store (with persistence and network-distribution, even) with a slightly broken model of conflicts, which is more or less transparent, and it has been used in a side-effecting language to build real apps for a long time. Not everybody likes it, but it does more or less work.

    For that matter, what do you think about GemStone, ObjectStore, or even CICS? All of the problems you describe (writing to the transactional store is slow, accessing transactionally-controlled state from outside the transaction is dangerous, I/O from inside a transaction that aborts is wrong) are common to all transaction-based systems. But they haven’t prevented transaction-based systems from becoming very widely used, even before there was a practical pure functional language.

  • Michael Hendricks

    This reminds me of a talk about Concurrency in Clojure. For those who don’t know, Clojure is a Lisp that runs on the JVM. It uses efficient, immutable data structures and STM to permit some interesting concurrency. I’m not sure it can elegantly handle launch_missiles inside a transaction, but for the curious, it’s worth investigating.

  • Yitz Gale

    Yossi Levanoni wrote:

    > the functional tree with the mutable root has the problem
    > that it allows no more than a single updater, so there goes
    > your happy ride on Moore’s law

    Right, I’m not sure how useful that cycles computation is.
    In a functional program, you usually wouldn’t be “updating
    a tree”. You would structure the application differently.

    I think the important point is that the boundary between
    transactions and side-effects is critical in TM.
    The functional approach expresses this boundary
    in a clear and natural way, whereas it is clumsy and complex
    in any imperative TM implementation so far. Throwing more
    hardware at the problem will not make the importance of
    this boundary go away.

    Ravi wrote:

    > default laziness (and the associated space leaks)

    That is an oft-stated misconception. Space leaks
    are no harder to avoid in Haskell than in any other
    language, once you are comfortable with it.

  • Brian

    Chris K: I don’t hold out a lot of hope for hardware STM, because it’ll be too limited to be practical anytime in the near future- and that’s ignoring the fact that the majority of the hardware out there won’t support it at all. But even in hardware that supports STM, there are significant limitations, like strict limits on the total number of (transactional) memory locations that can be read or written, or limits on the length of the transaction (task switches abort outstanding transactions, if I understand things correctly, so if your transaction can’t complete in a single clock tick, it can’t complete). Also, hardware STM only solves the performance problem- it doesn’t solve the problems of mixing STM and non-STM memory and code.

    Kragen: Transactional backing store is a completely different issue. SQL databases have been transactional since forever (to first approximation). But they have different requirements than STM in the program. Writing a block to the disk takes tens of thousands to tens of millions of clock cycles- a couple hundred extra clock cycles here or there is noise, and the performance issue of being transactional isn’t a problem (possibly being replaced by the much, much larger problem of being disk I/O bound). The transactional store is generally only accessed through some discrete interface (SQL or some special library API), making accessing the transactional store in a non-transactional way much more difficult. Likewise, nothing that doesn’t go through that interface is undone by the transaction. Conceptually the transaction is happening “over there”, not “over here”.

  • Greg Mildenhall

    What you’re saying here makes a great deal of sense, but I wonder if there’s a bigger picture? Explicit concurrency is fine for making use of (say) four processors, but for performance to scale reasonably with a hundred cores the important thing will be implicit parallelism rather than explicit concurrency.

    Once a language supports your STM shopping list, you’ve reached a level of declarativity and non-sequentiality* where the data-flow is well enough defined for a smart compiler/runtime to automatically parallelize your code with a reasonable level of efficiency. We don’t have sufficient smarts for that yet, but with the sort of research dollars you’re talking about…

    STM also needs to be pursued, but I think it’s less bang-for-buck – unless STM can be made to work with the inferior legacy languages we use today. (which I too doubt)

    *Sorry Ravi, but I think this brings non-strictness back into the equation.

  • http://pveentjer.wordpress.com Peter Veentjer

    1) It depends on the implementation of the STM whether it allows data to be used without a transaction. An STM could allow ‘objects’ to be detached from a transaction and later be merged (the same behavior Hibernate supports).

    2) Using non-transactional resources gives the same problems as using non-transactional resources under a database transaction (that is the way I look at STMs). So what you say is nothing new.. (it is true that most people don’t think about it or don’t know how it works on a low level).

    3) It also depends on the STM. I’m currently playing with my own Java-based STM implementation to gain new insights, and one of my goals was to reduce the amount of overhead. So I decided to use an object-level granularity and only instrument fields referring to other STM objects (needed for lazy loading). So normal fields are totally untouched and can be optimized like they always can (so reads and writes are also untouched). Only when the STM commits do the dirty objects need to be committed to the STM, and this is where it gets tricky, because this is where contention could cause slow performance. But with smart tricks like special partitioning you could ‘commit’ transactions in parallel as long as they are not in the same spatial partitions (this is just an idea I’m playing with).

    So a lot depends on the STM implementation. And I don’t think that STMs will ever be as fast as low-level concurrency control, but in a lot of cases that isn’t needed, I think. If I had a piece of memory that could do 100,000 transactions/second, this would be sufficient for a lot of server-side problems. It is one of the tools in the toolbox.. it isn’t the holy grail (at least as long as we don’t have hardware support).

  • http://coretalk.net Sandy Klausner

    Brian,

    My responses may feel a little out of context, but you are welcome to contact me directly to learn more about what CubeSystem is.

    > 1. What happens when you access a piece of transactional memory from outside a transaction? This allows for race conditions to happen, even on transactional memory. Worse yet, these race conditions can often reveal to the programmer the inner workings of the implementation of STM- meaning that it’s now possible to write code that works on one implementation of STM, but not on another.

    Sandy: A CubeSystem grove shared memory can only be accessed by a closure that provides four generic mechanisms- one including a “smart” API to enable a host OS application to read and write to grove memory concurrently with Cubicon-driven behavior. The second half of the question can only be handled through the complete standardization of a mechanism like a closure across all domains of practice, using higher-level iconic language abstractions not found in Java, C#, and C++.

    > 2. What happens when you access a non-transactional variable or perform an IO from within a transaction? If you start a transaction, then call the function launch_missiles, then abort the transaction, what happens- is Moscow still vaporized?

    Sandy: A CubeSystem contains all variables within one or more groves that execute within an atomic transaction or realtime Model of Computation (MoC). Inter-grove messaging is handled by a closure. Moscow remains a thriving city because a closure can never activate an external physical port unless a transaction has just been committed. An abort takes the finger off the launch button.

    > 3. The performance implications of transactional memory. As programmers, we’re used to assignments being cheap. Basically, if we see the code:
    >
    > x = y;
    >
    > we assume it gets compiled into code something like:
    >
    > movl %eax, (%ebx)
    >
    > i.e. a single instruction which takes only a clock cycle or two to complete. One of the problems with STM is it violates this assumption, big time- increasing the cost of a store from a pair of clock cycles into hundreds of clock cycles easily.
    >
    > Which means that it’s very easy for the overhead of STM to become the dominant cost in code. Imagine the following case: we’re updating a tree that has about a thousand elements in it, so it’s 10 levels deep. Now, assume the logic overhead of walking down the tree is 10 clock cycles per node, or 100 clock cycles total. The single threaded mutable version also does 10 writes, at 2 clock cycles apiece, for a total cost of 120 clock cycles. Add in an overhead of, say, 50 clock cycles per STM store (a reasonable amount), and suddenly the cost of updating that tree jumps to over 600 clock cycles- 500 of which are the STM overhead, meaning that STM is taking up over 80% of the program time, and making the program take 5 times as long. This is a non-trivial overhead.

    Sandy: A grove contains object composite tree and matrix data structures within a contextual framework. A closure copies and takes soft locks on a specific set of objects. Each copy object maintains a virtual pointer reference to the original object. A closure writes value at the object attribute level to the copy, while a transaction commit writes the entire copy back into the original. An original object maintains a list of locking closures that is used to update sharing copies on an update process. A copy object may either stay in grove server space or transported to a client over the network in a transparent fashion. Therefore, the efficiency of a particular value update is dictated primarily by the number of cycles it takes to maintain and locate a virtual pointer within the shared grove space. This mechanism is built into the Memory Manager and Context Processing modules of the Context Engine and subject to optimization in both software and hardware implementations.

  • Pingback: Enfranchised Mind » Responses to The Problem with STM

  • Pingback: Rick Minerich's Development Wonderland : Discoveries This Week 01/16/2009

  • Kimo Crossman

    Sun: Early Experience with a Commercial Hardware Transactional
    Memory Implementation, ASPLOS’09, March 7–11, 2009

    http://research.sun.com/scalable/pubs/ASPLOS2009-RockHTM.pdf

  • Pingback: Enfranchised Mind » How Should I Burn My Free Time?

  • Brian Hurt

    First off, I just noticed this blog post made reddit (again). Welcome all the new readers.

    Kimo Crossman: The problem with pretty much all the hardware implementations of TM is that they’re limited in how many transactional variables you can update within a transaction. This limitation is necessary to keep chips implementable. But the whole point of TM from the developer’s perspective is that it is composable, in a way that lock-based concurrency isn’t. Needing to worry about the number of transactional variables accessed in a transaction greatly reduces the value of TM, as it requires a violation of opacity- no more black-box transactions- and a greatly increased maintenance cost. So no, hardware-based transactional memory won’t save us. It’s a good idea, as an optimization, or to use as a primitive to build STM on top of. But it’s not a solution.

  • http://houdini-cs.livejournal.com/ Bill Weiss

    I think you’re using Akismet, or something else that is eating my comments. If you could approve the first one I left (look for an email address with a + in it), that would be great. The second one was just commenting that the first still hasn’t shown up.

  • http://www.smokejumperit.com Robert Fischer

    This is the only comment I’ve got from you. Sorry!

  • http://houdini-cs.livejournal.com/ Bill Weiss

    Let me guess, you’re using Akismet?

  • http://www.smokejumperit.com Robert Fischer

    Yup. Took some training, but it seems to be working out pretty well since then. Better than any other solution, at least.

  • http://www.smokejumperit.com Robert Fischer

    It’s also on DZone: vote it up here.

  • http://houdini-cs.livejournal.com/ Bill Weiss

    Well, it’s costing you real comments. I left one that I put some thought into, and it’s gone now. It’s due (as far as I can tell) to a technical failure on their part, not my posts looking particularly “spammy”.

  • http://kotka.de Meikel Brandmeyer

    Since Michael Hendricks mentioned Clojure and the launch_missiles…

    Write your launch-missiles using io!:

    (defn launch-missiles
      "Launch attack on remote targets with everything we have."
      []
      (io!
        (doseq [missile (all-silos)]
          (fire missile))))

    Now whenever you try to launch the missiles inside a transaction,
    Clojure will tell you. Safety switch: “ON”.

  • http://pveentjer.wordpress.com Peter Veentjer

    This is an example of a STM Stack:

    public class Stack<E> {

        private Node<E> head;

        public void push(E item) {
            if (item == null) throw new NullPointerException();
            head = new Node<E>(item, head);
        }

        public E peek() {
            if (head == null)
                return null;

            return head.value;  // peek only looks; it must not remove the top item
        }

        public E pop() {
            if (head == null)
                retry();  // the STM retry primitive: block until another transaction changes the stack

            return removeTopItem();
        }

        private E removeTopItem() {
            Node<E> oldHead = head;
            head = head.parent;
            return oldHead.value;
        }

        public int size() {
            return head == null ? 0 : head.size();
        }

        public boolean isEmpty() {
            return size() == 0;
        }

        public static class Node<E> {
            final E value;
            final Node<E> parent;
            final int size;

            Node(E value, Node<E> prev) {
                this.value = value;
                this.parent = prev;
                this.size = prev == null ? 1 : prev.size + 1;
            }

            int size() {
                return size;
            }
        }
    }

    As you can see, the language doesn’t need to suck if you use an STM.

    You can check http://code.google.com/p/multiverse/ if the layout of the example is f*cked up.

  • Pingback: Helltime for May 1 « I Built His Cage

  • http://canonical.org/~kragen Kragen Javier Sitaker

    You write:

    “Kragen: Transactional backing store is a completely different issue. SQL databases have been transactional since forever (to first approximation)….The transactional store is generally only accessed through some discrete interface (SQL or some special library API), making accessing the transactional store in a non-transactional way much more difficult.”

    That’s why I mentioned ZODB, GemStone, and ObjectStore, for which the transactional store is not accessed through a separate interface; persistent objects act more or less exactly like ordinary non-persistent objects. (I think this is the case for CICS too, which is why I mentioned it, but I don’t have deep enough knowledge of CICS.) I can’t tell if you didn’t read my comment very carefully or if you just didn’t know about these systems.

  • http://www.2n3904.net 2n3904

    Peter, can you explain

    > private E removeTopItem() {
    > Node oldHead = head;
    > head = head.parent;
    > return oldHead.value;
    > }

  • http://www.eboard.co.il/ יד שניה

    I think the important point is that the boundary between
    transactions and side-effects is critical in TM – good!

  • Pingback: STM — Why mainstraims languages don’t profit from STM | cat musings.org >>= /dev/blog

  • Pingback: OCTO talks ! » Le multithreading zen

  • Empty

    So what is the current state of affairs?