
Idle cores to the left of me, race conditions to the right

Monday, June 16, 2008 (tags: common lisp, concurrency, performance)

I think that, over the past year, I've read some thirty articles that start with the solemn or anxious announcement that the future of programming will be multicore, and that one way or another, we will have to get used to it. I'll try not to write another one of those. In fact, it appears that multicore processors are also pretty good single-core processors, and most applications are, for the foreseeable future, comfortably single-threaded. And in a lot of cases, they should be. So is there any pressing reason to take notice of this new-fangled hardware?

[I]t looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!

That is from none other than Donald Knuth, in a recent interview. Nonetheless, I have (belatedly, I'll admit) started to get excited about incorporating the kind of 'multi-threading for the sake of multi-threading' that multiple cores encourage into my programs. Why? This might not be a very convincing reason for the serious, result-oriented programmer, but parallel programming, it turns out, is very amusing. Sure, there are deadlock death-traps, race conditions, and a general dangerous sense of non-determinism. But there is a logic to all of it, and a working parallel program can be a thing of beauty. That kind of beauty, and the complexity that tends to go with it, is what got me into this business in the first place. (It wasn't my love for hunching over keyboards or a deep dislike of sunlight, in any case.)

This infatuation started — predictably enough — with Erlang. Erlang is kind of hip, makes you think about concurrency in a whole new way, and is entirely dissatisfactory as a programming language. I won't go too deeply into that last point, since taking cheap shots at other people's work tends to hurt one's credibility, but the unpolished and limited feel of the language offended my delicate linguistic sensibilities, and prevented me from investing too deeply in it, with the result that I am still (also somewhat grudgingly, perhaps) doing most of my work in Common Lisp.

Common Lisp has no integrated, light-weight, multicore-friendly threads. In fact, it only recently started having widespread support for OS-level threads at all. This support comes, invariably, in the form of shared memory, locks, and semaphores (bordeaux-threads provides a pleasant portable wrapper). While these allow just about any kind of concurrent program to be written, they are fairly tedious and error-prone to work with. CL-MUPROC builds a message-passing system on top of them, which allows a more pleasant programming style. Every CL-MUPROC process is an OS thread though, so you won't want to create thousands of them.
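For those who haven't seen it, here is a minimal sketch of what that lock-based style looks like with bordeaux-threads (the *counter* variable and bump-counter function are illustrative names of mine, not library API):

(defvar *counter* 0)
(defvar *counter-lock* (bt:make-lock "counter"))

(defun bump-counter ()
  ;; Only one thread at a time may hold the lock and touch *counter*.
  (bt:with-lock-held (*counter-lock*)
    (incf *counter*)))

;; Four threads each increment the shared counter a thousand times.
(let ((threads (loop :repeat 4
                     :collect (bt:make-thread
                               (lambda () (dotimes (i 1000) (bump-counter)))))))
  (mapc #'bt:join-thread threads)
  *counter*) ; => 4000

Without the lock, concurrent incf calls could silently lose updates.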

Another approach for getting away from mutexes, one that has been generating a lot of noise recently, is software transactional memory. If that doesn't mean anything to you, I'd recommend watching this video of Simon Peyton Jones explaining the idea. I was surprised to find that there apparently exists a CL-STM (aren't we CLers creative when it comes to naming projects). It seems to have been inactive since early 2007, and as I understand it, STM is rather hard to get right, so I'm not sure I'd risk building something serious on it... but it appears to work. The Clojure people seem to be making good use of STM in any case.

On a related note, SBCL exports a symbol sb-ext:compare-and-swap, which can be used to do small-scale transactional tricks, and is often an order of magnitude faster than the equivalent locking solution. Here is a very simple concurrency-safe stack. The concurrent-update macro is a convenient wrapper that helps use compare-and-swap in a 'transactional' way: it reads the old value, computes a new one, and retries whenever another thread got there first.

(defmacro concurrent-update (place var &body body)
  "Bind the value of PLACE to VAR, compute a new value in BODY, and
atomically swap it in, retrying on contention. Returns the old value."
  `(flet ((action (,var) ,@body))
     (let ((prev ,place))
       (loop :until (eq (sb-ext:compare-and-swap ,place prev (action prev)) prev)
             :do (setf prev ,place))
       prev)))

(defun make-cstack (&rest elements)
  (cons :stack elements))
(defun push-cstack (element stack)
  (concurrent-update (cdr stack) elts
    (cons element elts))
  (values))
(defun pop-cstack (stack)
  ;; concurrent-update returns the old list, whose car is the popped element.
  (car (concurrent-update (cdr stack) elts
         (cdr elts))))

Stacks are represented as cons cells, since compare-and-swap can only be used on a limited set of 'places' (cars, cdrs, svrefs, symbol-values), and not on things like instance or struct slots. Writing a queue is only slightly more complicated. Making popping threads block when there are no elements available, on the other hand, is decidedly tricky. (I think I have a working implementation... it is much faster than the locking variant, but it is so complicated that I'm not convinced it is correct. If you can recommend any good books or papers on low-level concurrency techniques, please drop me an e-mail.)
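As for the queue: here is a sketch of a non-blocking version, reusing the concurrent-update macro from above (the names and the append-based representation are my own choices; appending copies the list on every attempt, so this only makes sense for short queues):

(defun make-cqueue (&rest elements)
  (cons :queue elements))
(defun enqueue-cqueue (element queue)
  ;; Add at the tail; the whole list is re-copied on each retry.
  (concurrent-update (cdr queue) elts
    (append elts (list element)))
  (values))
(defun dequeue-cqueue (queue)
  ;; The old list is returned, so its car is the dequeued element.
  (car (concurrent-update (cdr queue) elts
         (cdr elts))))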

In the book How to Write Parallel Programs, which is old (1992) but still relevant, the authors distinguish three models of concurrency: Message passing (as in Erlang), shared or distributed data structures (as in Java), and 'live' data structures, where computations turn into values after they finish running. These are all fundamentally equivalent, in that a program using one of them can always be transformed to use another, but some programs are much more natural when expressed in the right model. STM, which had not been invented at the time the book was written, would probably qualify as a special case of shared data structures.

One of these models, live data structures, has never been very popular. This might be related to the fact that it offers none of the control-flow advantages that the other models have: you spawn a process, and then later you can read the value it produced, but this only buys you something when there is a direct advantage to computing values in parallel. On single-core machines, there isn't, but on multicores, this might be a very pleasant way to speed up some CPU-bound programs. On first seeing Haskell's par operator, which implements something like this, I was duly impressed. You just pass two computations to a little operator and, under the right circumstances, you get your results twice as fast.

Now, without further ado, I'd like to present my new library: PCall (for parallel call). It implements a thread pool and a few simple operators to get behaviour not unlike that of Haskell's par. As a silly example, it can get two seconds' worth of sleep in only a single second!

(time (let ((a (pexec (sleep 1)))
            (b (pexec (sleep 1))))
        (join a)
        (join b)))

The code behind this is rather humble and simple, but it seems to work well. See the project page for more details and examples.
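To give a rough idea of what such a thread pool boils down to (this is a sketch of the general technique, not PCall's actual code): workers pull closures off a shared queue, execute them, and store the result where a join can find it.

(defstruct task thunk result done-p)

(defvar *task-queue* '())
(defvar *queue-lock* (bt:make-lock))

(defun submit (thunk)
  (let ((task (make-task :thunk thunk)))
    (bt:with-lock-held (*queue-lock*)
      (push task *task-queue*))
    task))

(defun worker-loop ()
  (loop (let ((task (bt:with-lock-held (*queue-lock*)
                      (pop *task-queue*))))
          (if task
              (setf (task-result task) (funcall (task-thunk task))
                    (task-done-p task) t)
              (sleep 0.01))))) ; polling; a condition variable would be cleaner

(defun join-task (task)
  (loop :until (task-done-p task) :do (sleep 0.001))
  (task-result task))

;; Start the pool once, e.g.: (loop :repeat 4 :do (bt:make-thread #'worker-loop))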

This kind of 'local parallelism' makes it relatively easy to verify that the tasks are using their data in a safe way — you typically just make sure that no tasks write to the same location or read anything another task might write.
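For instance, here is a sketch that sums a vector with two tasks reading disjoint halves and writing nothing shared; only the spawning thread combines the results (parallel-sum is a name of mine, built on PCall's pexec and join):

(defun parallel-sum (vector)
  (let* ((mid (floor (length vector) 2))
         ;; Each task only reads its own half of the vector.
         (left (pexec (loop :for i :from 0 :below mid
                            :sum (aref vector i))))
         (right (pexec (loop :for i :from mid :below (length vector)
                             :sum (aref vector i)))))
    (+ (join left) (join right))))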

One issue that keeps coming up with this style of parallelism, an issue that the book I mentioned above also discusses, is finding the right granularity when splitting up tasks. Wrapping up a computation in a function, putting it into a queue, having some thread find it and execute it, and then reading out the results... that is obviously more work than just directly executing the computation. Thus, if the computation is small enough, you are making your program slower by parallelising it.

The solution suggested by the book is to provide a 'granularity knob': some setting that controls the size of the computation chunks that get delegated off to other processes. When mapping some function over a list, you could for example provide a way to control the number of elements each task takes care of, and vary that based on the number of available processors and the amount of computation the given function requires. In some cases, the granularity adjustment could be done automatically by profiling the running code. I guess that is usually more trouble than it is worth, though.
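A sketch of what such a knob could look like on top of PCall (chunked-pmap is a hypothetical helper, not part of the library):

(defun chunked-pmap (fun list chunk-size)
  ;; Hand each task CHUNK-SIZE elements: bigger chunks mean less
  ;; queueing overhead, smaller chunks mean better load balancing.
  (let ((tasks (loop :while list
                     :collect (let ((chunk (loop :repeat chunk-size
                                                 :while list
                                                 :collect (pop list))))
                               (pexec (mapcar fun chunk))))))
    (loop :for task :in tasks :append (join task))))

With chunk-size equal to the list length this degenerates into a sequential mapcar on another thread; with chunk-size 1 it maximises parallelism and overhead at the same time.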

There are also situations where different environments call for completely different approaches. Having a central lock on something works well when there are ten threads occasionally accessing it, but becomes a bottleneck when there are a thousand. Optimistic transactions that fail when another thread interferes with their data have the same problem: They fall apart when there are too many processes. On the other hand, complex schemes that minimise the amount of conflict between threads also carry an overhead, and are often counterproductive when there are only a handful of threads. This is a less pleasant side of concurrent programming: There often is no right solution. In sequential programs, you can weigh program complexity against efficiency, and often feel you've hit some sweet spot. In concurrent programs, the sweet spot on my dual-core system might be completely stupid on both an old single-core system and a flashy sixteen-core server.

Despite this, and the claims by various people that threading is 'just too hard', I rather suspect we'll be able to work it out. As in other complicated fields, it is mostly a matter of finding better abstractions. I hope PCall can provide an abstraction suitable for some problems (though, at this point, I'm not even a hundred percent convinced there are no race conditions left in the library... do help me test it).