Providence Salumu Parallel and Concurrent Programming in Haskell

Chapter 10. Software Transactional Memory

Software transactional memory (STM) is a technique for simplifying concurrent programming by allowing multiple state-changing operations to be grouped together and performed as a single atomic operation. Strictly speaking, “software transactional memory” is an implementation technique, whereas the language construct we are interested in is "atomic blocks." Unfortunately, the former term has stuck, and so the language-level facility is called STM.

STM solves a number of problems that arise with conventional concurrency abstractions, which we describe here through a series of examples. For reference throughout the following sections, the types and operations of the STM interface are:

Control.Concurrent.STM

data STM a -- abstract
instance Monad STM -- among other things

atomically :: STM a -> IO a

data TVar a -- abstract
newTVar   :: a -> STM (TVar a)
readTVar  :: TVar a -> STM a
writeTVar :: TVar a -> a -> STM ()

retry     :: STM a
orElse    :: STM a -> STM a -> STM a

throwSTM  :: Exception e => e -> STM a
catchSTM  :: Exception e => STM a -> (e -> STM a) -> STM a

Running Example: Managing Windows

Imagine a window manager that manages multiple desktops. The user can move windows from one desktop to another, while at the same time, a program can request that its own window move from its current desktop to another desktop. The window manager uses multiple threads: one to listen for input from the user, a set of threads to listen for requests from the programs running in each existing window, and one thread that renders the display to the user.

How should the program represent the state of the display? Let’s assume some abstract types representing desktops and windows respectively:

data Desktop  -- abstract
data Window   -- abstract

A display consists of a number of Desktops, each of which is displaying a set of Windows. To put it another way, a display is a mapping from Desktop to a set of Window objects. The mapping changes over time, so we want to make it mutable, and the state needs to be shared among multiple threads. Hence, following the pattern from “MVar as a Container for Shared State”, we could use a Map stored in an MVar:

type Display = MVar (Map Desktop (Set Window))

This would work, but the MVar is a single point of contention. For example, the rendering thread, which needs to look only at the currently displayed desktop, could be blocked by a window on another desktop that is moving itself. This structure doesn’t allow as much concurrency as we would like.

To allow operations on separate desktops to proceed without impeding each other, perhaps we can have a separate MVar for each desktop:

type Display = Map Desktop (MVar (Set Window))

Unfortunately, this approach also quickly runs into problems. Consider an operation to move a window from one desktop to another:

moveWindow :: Display -> Window -> Desktop -> Desktop -> IO ()
moveWindow disp win a b = do
  wa <- takeMVar ma
  wb <- takeMVar mb
  putMVar ma (Set.delete win wa)
  putMVar mb (Set.insert win wb)
 where
  ma = disp ! a
  mb = disp ! b

Note that we must take both MVars before we can put the results; otherwise, another thread could potentially observe the display in a state in which the window we are moving does not exist. But this raises a problem: what if there is a concurrent call to moveWindow trying to move a window in the opposite direction? Let’s think through what would happen:

  thread 1: moveWindow d w1 a b
  thread 2: moveWindow d w2 b a

Here’s one possible interleaving:

  • Thread 1 takes the MVar for desktop a.
  • Thread 2 takes the MVar for desktop b.
  • Thread 1 tries to take the MVar for desktop b and blocks.
  • Thread 2 tries to take the MVar for desktop a and blocks.

Now we have deadlock: both threads are blocked on each other, and neither can make progress. This is an instance of the classic "Dining Philosophers" problem.

One solution is to impose an ordering on the MVars and require that all agents take MVars in the correct order and release them in the opposite order. That is inconvenient and error-prone, though, and furthermore we have to extend our ordering to any other state that we might need to access concurrently. Large systems written in languages with locks (e.g., operating systems) are often plagued by this problem, and managing the complexity requires building an elaborate infrastructure to detect ordering violations.

Sofware transactional memory provides a way to avoid this deadlock problem without imposing a requirement for ordering on the programmer. To solve the problem using STM, we replace MVar with TVar:

type Display = Map Desktop (TVar (Set Window))

TVar stands for “transactional variable”; it is a mutable variable that can be read or written only within the special monad STM, using the operations readTVar and writeTVar:

readTVar  :: TVar a -> STM a
writeTVar :: TVar a -> a -> STM ()

A computation in the STM monad can be performed in the IO monad, using the atomically function:

atomically :: STM a -> IO a

When an STM computation is performed like this, it is called a transaction because the whole operation takes place atomically with respect to the rest of the program. No other thread can observe an intermediate state in which only some of the operations of the transaction have taken place. The STM computation passed to atomically can be arbitrarily large and can contain any number of TVar operations, but as we shall see later there are performance implications for large transactions.

To implement moveWindow using STM, we first convert all the operations to their STM equivalents, and rename the function to moveWindowSTM to indicate that it is in the STM monad:

windowman.hs

moveWindowSTM :: Display -> Window -> Desktop -> Desktop -> STM ()
moveWindowSTM disp win a b = do
  wa <- readTVar ma
  wb <- readTVar mb
  writeTVar ma (Set.delete win wa)
  writeTVar mb (Set.insert win wb)
 where
  ma = disp ! a
  mb = disp ! b

Then, we wrap this in atomically to make the IO-monad version moveWindow:

moveWindow :: Display -> Window -> Desktop -> Desktop -> IO ()
moveWindow disp win a b = atomically $ moveWindowSTM disp win a b

The code for moveWindowSTM is almost identical to the MVar version, but the behavior is quite different: the sequence of operations inside atomically happens indivisibly as far as the rest of the program is concerned, so the problem we encountered earlier that required taking MVars in the correct order does not occur. What’s more, there is no requirement that we read both TVars before we write them; this would be fine, too:

moveWindowSTM :: Display -> Window -> Desktop -> Desktop -> STM ()
moveWindowSTM disp win a b = do
  wa <- readTVar ma
  writeTVar ma (Set.delete win wa)
  wb <- readTVar mb
  writeTVar mb (Set.insert win wb)
 where
  ma = disp ! a
  mb = disp ! b

So STM is far less error-prone here. The approach also scales to any number of TVars, so we could easily write an operation that moves the windows from all other desktops to the current desktop, for example.

Now suppose that we want to swap two windows, moving window W from desktop A to B, and simultaneously V from B to A. With the MVar representation, we would have to write a special purpose operation to do this, because it has to take the MVars for A and B (in the right order) and then put both MVars back with the new contents. With STM, however, we can express this much more neatly by simply making two calls to moveWindowSTM:

windowman.hs

swapWindows :: Display
            -> Window -> Desktop
            -> Window -> Desktop
            -> IO ()
swapWindows disp w a v b = atomically $ do
  moveWindowSTM disp w a b
  moveWindowSTM disp v b a

This demonstrates the composability of STM operations: any operation of type STM a can be composed with others to form a larger atomic transaction. For this reason, STM operations are usually provided without the atomically wrapper so that clients can compose them as necessary before finally wrapping the entire operation in atomically.

Why is STM a different monad from IO? The STM implementation relies on being able to roll back the effects of a transaction in the event of a conflict with another transaction (and for other reasons, as we shall see shortly). A transaction can be rolled back only if we can track exactly what effects it has, and this would not be possible if arbitrary I/O were allowed inside a transaction—we might have performed some I/O that cannot be undone, like making a noise or launching some missiles. For this reason, the STM monad permits only side effects on TVars, and the STM implementation tracks these effects to ensure the correct transaction semantics. We will discuss the implementation of STM and its performance implications in more detail in “Performance”.

This is an example of using the Haskell type system to enforce a safety invariant. We are guaranteed that every transaction is actually a transaction, because the type system prevents arbitrary side-effects from being performed in the STM monad.

So far, we covered the basic facilities of STM and showed that STM can be used to scale atomicity in a composable way. STM improves the expressibility and robustness of concurrent programs. The benefits of STM in Haskell go further, however. In the following sections, we show how STM can be used to make blocking abstractions compose, and how STM can be used to manage complexity in the presence of failure and interruption.

Blocking

An important part of concurrent programming is dealing with blocking when we need to wait for some condition to be true, or to acquire a particular resource. STM provides an ingenious way to do this with a single operation:

retry :: STM a

The meaning of retry is simply "abandon the current transaction and run it again." An example should help to clarify how retry works. Let’s consider how to implement MVar using STM because takeMVar and putMVar need to be able to block when the MVar is empty or full, respectively.

First the data type: an MVar is always in one of two states; either it is full and contains a value, or it is empty. We model this with a TVar containing Maybe a:[39]

tmvar.hs

newtype TMVar a = TMVar (TVar (Maybe a))

To make an empty TMVar, we simply need a TVar containing Nothing:

newEmptyTMVar :: STM (TMVar a)
newEmptyTMVar = do
  t <- newTVar Nothing
  return (TMVar t)

Now to code takeTMVar, which blocks if the desired variable is empty and returns the content once the variable is set:

takeTMVar :: TMVar a -> STM a
takeTMVar (TMVar t) = do
  m <- readTVar t                       -- 1
  case m of
    Nothing -> retry                    -- 2
    Just a  -> do
      writeTVar t Nothing               -- 3
      return a

1

Read the current contents of the TVar, which we inspect with a case.

2

If the TVar contains Nothing, then the TMVar is empty, so we need to block. The retry operation says, "Run the current transaction again," which will have the desired effect: we keep rerunning the transaction until the TVar no longer contains Nothing and the other case branch is taken. Of course, we don’t really want to blindly rerun the transaction over and over again, making our CPU hot for no good reason. The STM implementation knows that there is no point rerunning the transaction unless something different is likely to happen, and that can be true only if one or more of the TVars that were read by the current transaction have changed. In fact, what happens is that the current thread is blocked until one of the TVars that it is reading is written to, at which point the thread is unblocked again and the transaction is rerun.

3

If the TVar contains Just a, we empty the TMVar by writing Nothing into it and then return the a.

The implementation of putMVar is straightforward:

putTMVar :: TMVar a -> a -> STM ()
putTMVar (TMVar t) a = do
  m <- readTVar t
  case m of
    Nothing -> do
      writeTVar t (Just a)
      return ()
    Just _  -> retry

So now that we have a replacement for MVar built using STM, what can we do with it? Well, STM operations are composable, so we can perform operations on multiple TMVars at the same time:

  atomically $ do
    a <- takeTMVar ta
    b <- takeTMVar tb
    return (a,b)

This STM transaction succeeds when and only when both TMVars are full; otherwise it is blocked. This explains why retry must abandon the whole transaction: if the first takeTMVar succeeds but the second one retries, we do not want the effect of the first takeTMVar to take place.

This example is difficult to program with MVar because taking a single MVar is a side effect that is visible to the rest of the program, and hence cannot be easily undone if the other MVar is empty. One way to implement it is with a third MVar acting as a lock to control access to the other two, but then of course all other clients have to be aware of the locking protocol.

Blocking Until Something Changes

The retry operation allows us to block on arbitrary conditions. As a concrete example, we can use retry to implement the rendering thread in our window manager example. The behavior we want is this:

  • One desktop is designated as having the focus. The focused desktop is the one displayed by the rendering thread.
  • The user may request that the focus be changed at any time.
  • Windows may move around and appear or disappear of their own accord, and the rendering thread must update its display accordingly.

We are supplied with a named function render which handles the business of rendering windows on the display. It should be called whenever the window layout changes:[40]

render :: Set Window -> IO ()

The currently focused desktop is a piece of state that is shared by the rendering thread and some other thread that handles user input. Therefore, we represent that by a TVar:

type UserFocus = TVar Desktop

Next, we define an auxiliary function getWindows that takes the Display and the UserFocus and returns the set of windows to render in the STM monad. The implementation is straightforward: read the current focus and look up the contents of the appropriate desktop in the Display:

windowman.hs

getWindows :: Display -> UserFocus -> STM (Set Window)
getWindows disp focus = do
  desktop <- readTVar focus
  readTVar (disp ! desktop)

Finally, we can implement the rendering thread. The general plan is to repeatedly read the current state with getWindows and call render to render it, but use retry to avoid calling render when nothing has changed. Here is the code:

renderThread :: Display -> UserFocus -> IO ()
renderThread disp focus = do
  wins <- atomically $ getWindows disp focus    -- 1
  loop wins                                     -- 2
 where
  loop wins = do                                -- 3
    render wins                                 -- 4
    next <- atomically $ do
               wins' <- getWindows disp focus   -- 5
               if (wins == wins')               -- 6
                   then retry                   -- 7
                   else return wins'            -- 8
    loop next

1

First, we read the current set of windows to display.

2

We use this as the initial value for the loop.

3

The loop takes the current set of windows as an argument, renders the windows, and then blocks until something changes that requires re-rendering.

4

Each iteration calls render to display the current state and then enters a transaction to read the next state.

5

Inside the transaction, we read the current state.

6

We compare it to the state we just rendered.

7

If the states are the same, then there is no need to do anything, so we call retry.

8

If the states are different, then we return the new state, and the loop iterates with the new state.

The effect of the retry is precisely what we need: it waits until the value read by getWindows could possibly be different, because another thread has successfully completed a transaction that writes to one of the TVars that is read by getWindows. That encompasses both changes to the focus (because the user switched to a different desktop), and changes to the contents of the current desktop (because a window moved, appeared, or disappeared). Furthermore, changes to other desktops can take place without the rendering thread being woken up.

If it weren’t for STM’s retry operation, we’d have to implement this complex logic ourselves, including implementing the signals between threads that modify the state and the rendering thread. This is anti-modular, because operations that modify the state have to know about the observers that need to act on changes. Furthermore, it gives rise to a common source of concurrency bugs: lost wakeups. If we forgot to signal the rendering thread, the display wouldn’t be updated. In this case, the effects are somewhat benign. In a more complex scenario, lost wakeups often lead to deadlocks: the woken thread was supposed to complete an operation on which other threads are waiting.

Merging with STM

Recall that in “Merging” we considered the problem of waiting for any event from a set of possible events. Typically this requires the events to be merged into a single MVar or Chan so that we can wait for the next event using takeMVar or readChan. In turn, this means that the source of each event needs to know which MVar(s) or Chan(s) to send it to, rather than each event being a completely independent entity.

The more general problem of taking either of two MVars requires creating two new threads to take each MVar and put the result into a third MVar. However, even this doesn’t really solve the problem: if we wanted to take at most one of two MVars, then (as far as I am aware) there is no way to do it; you just have to construct your program in a different way so that it doesn’t need to do this.

STM provides a neat solution to both of these problems in the form of an operation that we have not yet introduced:

orElse :: STM a -> STM a -> STM a

The operation orElse a b has the following behavior:

  • First, a is executed. If a returns a result, then the orElse call returns it and ends.
  • If a calls retry instead, a's effects are discarded_ and b is executed instead.

The orElse operator lets us combine two blocking transactions such that one is performed but not both. This is exactly what we need for composing several event sources, or for taking at most one of two MVars (actually TMVars, of course). The latter is coded as follows:

code/tmvar.hs

takeEitherTMVar :: TMVar a -> TMVar b -> STM (Either a b)
takeEitherTMVar ma mb =
  fmap Left (takeTMVar ma)
    `orElse`
  fmap Right (takeTMVar mb)

There are two calls to takeTMVar, with their results wrapped in Left and Right, respectively, composed together with orElse.

One thing to note is that orElse is left-biased: if both TMVars are non-empty, takeEitherTMVar will always return the contents of the first one. Whether this is problematic depends on the application. Be aware that the left-biased nature of orElse can have implications for fairness in some situations.

STM provides two complementary ways to compose blocking operations together: the ordinary monadic bind gives us "and", and orElse gives us "or".

Async Revisited

Recall in “Merging” that we defined waitEither for the Async abstraction by forking two extra threads. STM’s orElse now allows us to define waitEither much more efficiently. Furthermore, the extra flexibility of STM lets us compose Asyncs together in more interesting ways. But first, we need to rewrite the Async implementation in terms of STM, rather than MVar. The translation is straightforward: we just replace MVar with TMVar.

data Async a = Async ThreadId (TMVar (Either SomeException a))

The async function looks familiar, with only an additional atomically to wrap the call to putTMVar in the child thread:

async :: IO a -> IO (Async a)
async action = do
  var <- newEmptyTMVarIO
  t <- forkFinally action (atomically . putTMVar var)
  return (Async t var)

Here we used newEmptyTMVarIO, which is a convenient version of newEmptyTMVar in the IO monad.

The waitCatchSTM function is like waitCatch, but in the STM monad:

waitCatchSTM :: Async a -> STM (Either SomeException a)
waitCatchSTM (Async _ var) = readTMVar var

And we can define waitSTM, the version of waitCatchSTM that re-throws an exception result, in terms of waitCatchSTM:

waitSTM :: Async a -> STM a
waitSTM a = do
  r <- waitCatchSTM a
  case r of
    Left e  -> throwSTM e
    Right a -> return a

Now we can define waitEither by composing two calls to waitSTM using orElse:

waitEither :: Async a -> Async b -> IO (Either a b)
waitEither a b = atomically $
  fmap Left (waitSTM a)
    `orElse`
  fmap Right (waitSTM b)

More generally, we can wait for any number of Asyncs simultaneously. The function waitAny does this by first mapping waitSTM over a list of Asyncs and then composing the calls together by folding them with orElse:

waitAny :: [Async a] -> IO a
waitAny asyncs =
  atomically $ foldr orElse retry $ map waitSTM asyncs

In “Merging” (geturls6.hs), we downloaded several URLs simultaneously and reported the first one to finish by using a version of waitAny that forked a new thread for each Async to wait for. Using the above definition of waitAny with the STM version of Async, we can now solve the same problem without forking a new thread per Async:

geturlsfirst.hs

main :: IO ()
main = do
  let
    download url = do
       r <- getURL url
       return (url, r)

  as <- mapM (async . download) sites

  (url, r) <- waitAny as
  printf "%s was first (%d bytes)\n" url (B.length r)
  mapM_ wait as

The program works as before, creating an Async to download each URL in the list. Then it calls waitAny to get the first result, reports it, and finally waits for the rest to complete.

Implementing Channels with STM

In this section, we’ll implement the Chan type from “MVar as a Building Block: Unbounded Channels” using STM. As we’ll see, using STM to implement Chan is rather less tricky than using MVars, and furthermore we are able to add complex operations that were difficult or impossible using MVars.

The STM version of Chan is called TChan, and the interface we wish to implement is as follows:[41]

data TChan a

newTChan   :: STM (TChan a)
writeTChan :: TChan a -> a -> STM ()
readTChan  :: TChan a -> STM a

This is exactly the same as Chan, except that we renamed Chan to TChan, and all the operations are in the STM monad rather than IO. The full code for the implementation is given next.

TChan.hs: 

data TChan a = TChan (TVar (TVarList a))
                     (TVar (TVarList a))

type TVarList a = TVar (TList a)
data TList a = TNil | TCons a (TVarList a)

newTChan :: STM (TChan a)
newTChan = do
  hole <- newTVar TNil
  read <- newTVar hole
  write <- newTVar hole
  return (TChan read write)

readTChan :: TChan a -> STM a
readTChan (TChan readVar _) = do
  listHead <- readTVar readVar
  head <- readTVar listHead
  case head of
    TNil -> retry
    TCons val tail -> do
        writeTVar readVar tail
        return val

writeTChan :: TChan a -> a -> STM ()
writeTChan (TChan _ writeVar) a = do
  newListEnd <- newTVar TNil
  listEnd <- readTVar writeVar
  writeTVar writeVar newListEnd
  writeTVar listEnd (TCons a newListEnd)

The implementation is similar in structure to the MVar version in “MVar as a Building Block: Unbounded Channels”, so we do not describe it line by line; however, we will point out a few important details:

  • All the operations are in the STM monad, so to use them they need to be wrapped in atomically (but they can also be composed; more about that later).
  • The TList type needs a TNil constructor to indicate an empty list; in the MVar implementation, the empty list was represented implicitly by an empty MVar.
  • Blocking in readTChan is implemented by a call to retry.
  • Nowhere did we have to worry about what happens when a read executes concurrently with a write, because all the operations are atomic.

We now describe three distinct benefits of the STM implementation compared with using MVars.

More Operations Are Possible

In “MVar as a Building Block: Unbounded Channels”, we mentioned the operation unGetChan, which could not be implemented with the desired semantics using MVars. Here is its implementation with STM:

unGetTChan :: TChan a -> a -> STM ()
unGetTChan (TChan readVar _) a = do
   listHead <- readTVar readVar
   newHead <- newTVar (TCons a listHead)
   writeTVar readVar newHead

The obvious implementation does the right thing here. Other operations that were not possible with MVars are straightforward with STM; an example is isEmptyTChan, the MVar version that suffers from the same problem as unGetChan:

isEmptyTChan :: TChan a -> STM Bool
isEmptyTChan (TChan read _write) = do
  listhead <- readTVar read
  head <- readTVar listhead
  case head of
    TNil -> return True
    TCons _ _ -> return False

Composition of Blocking Operations

Because blocking STM computations can be composed together, we can build composite operations like readEitherTChan:

readEitherTChan :: TChan a -> TChan b -> STM (Either a b)

This function reads a value from either of the two TChans passed as arguments, or blocks if they are both empty. Its implementation should look familiar, being similar to takeEitherTMVar:

readEitherTChan :: TChan a -> TChan b -> STM (Either a b)
readEitherTChan a b =
  fmap Left (readTChan a)
    `orElse`
  fmap Right (readTChan b)

Asynchronous Exception Safety

Up until now, we have said nothing about how exceptions in STM behave. The STM monad supports exceptions much like the IO monad, with two operations:

throwSTM  :: Exception e => e -> STM a
catchSTM  :: Exception e => STM a -> (e -> STM a) -> STM a

The throwSTM operation throws an exception, and catchSTM catches exceptions and invokes a handler, just like catch in the IO monad. However, exceptions in STM are different in one vital way: in catchSTM m h, if m raises an exception, then all of its effects are discarded, and then the handler h is invoked. As a degenerate case, if there is no enclosing catchSTM at all, then all of the effects of the transaction are discarded and the exception is propagated out of atomically.

An example should help to demonstrate the motivation for this behavior. Imagine an STM operation readCheck defined as follows:

readCheck :: TChan a -> STM a
readCheck chan = do
  a <- readTChan chan
  checkValue a

Where checkValue is an operation that imposes some extra constraints on the value read from the channel. Now suppose checkValue raises an exception (perhaps accidentally, e.g., divide-by-zero). We would prefer it if the readTChan had not happened because an element of the channel would be lost. Furthermore, we would like readCheck to have this behavior regardless of whether there is an enclosing exception handler or not. Hence catchSTM discards the effects of its first argument in the event of an exception.

The discarding-effects behavior is even more useful in the case of asynchronous exceptions. If an asynchronous exception occurs during an STM transaction, the effects of the transaction are discarded, just as for a synchronous exception. So in most cases, asynchronous exception safety in STM consists of doing absolutely nothing at all. There are no locks to replace, so there is no need for exception handlers or bracket and no need to worry about which critical sections to protect with mask.

The implementation of TChan given earlier is entirely safe with respect to asynchronous exceptions as it stands, and moreover any compositions of these operations are also safe.

STM provides a nice way to write code that is automatically safe with respect to asynchronous exceptions, so it can be useful even for state that is not shared between threads. The only catch is that we have to use STM consistently for all our state, but having made that leap, asynchronous exception safety comes for free.

An Alternative Channel Implementation

In the previous section, we implemented a channel type that was analogous to the MVar-based Chan, in that it has a similar implementation structure and the same basic operations. However, the flexibility of STM gives us more choices in how to construct channels, and in fact if we don’t need dupChan, we can implement a much more efficient channel abstraction.

The key observation is that in STM, an operation can block on any condition whatsoever. This means we can represent the channel contents by any data structure we choose. For example, even a simple list works:

TList.hs

newtype TList a = TList (TVar [a])

newTList :: STM (TList a)
newTList = do
  v  <- newTVar []
  return (TList v)

writeTList :: TList a -> a -> STM ()
writeTList (TList v) a = do
  list <- readTVar v
  writeTVar v (list ++ [a])

readTList :: TList a -> STM a
readTList (TList v) = do
  xs <- readTVar v
  case xs of
    []      -> retry
    (x:xs') -> do
      writeTVar v xs'
      return x

This is a channel abstraction with the same behavior as TChan; readTList blocks when the channel is empty, because it can detect the empty list and call retry.

There is a performance problem with this representation, though. Note that writeTList must add an element to the end of the list, which, using the standard Haskell list datatype, requires an O(n) append operation.

The solution is to use a different queue data structure that supports O(1) enqueue and dequeue operations. There is a folklore technique for representing a queue that has the desired property: the idea is to represent a queue as two lists, xs and ys, where the whole contents of the list is given by xs ++ reverse ys. That is, to take an element from the front we take it from xs, and to add an element to the back we add it to the front of ys; both of these operations are O(1). But what if xs is empty and we need to take an element? In that case, we must reverse ys and let that become the new xs. So while most of the time, taking an element from the front is O(1), occasionally it is O(n). However, we know that each list element is reversed only once, so on average the complexity of both enqueue and dequeue is O(1).[42]

We can use this technique to represent the channel contents. This is the code:

TQueue.hs

data TQueue a = TQueue (TVar [a]) (TVar [a])

newTQueue :: STM (TQueue a)
newTQueue = do
  read  <- newTVar []
  write <- newTVar []
  return (TQueue read write)

writeTQueue :: TQueue a -> a -> STM ()
writeTQueue (TQueue _read write) a = do
  listend <- readTVar write
  writeTVar write (a:listend)

readTQueue :: TQueue a -> STM a
readTQueue (TQueue read write) = do
  xs <- readTVar read
  case xs of
    (x:xs') -> do writeTVar read xs'
                  return x
    [] -> do ys <- readTVar write
             case ys of
               [] -> retry                      -- 1
               _  -> do let (z:zs) = reverse ys -- 2
                        writeTVar write []
                        writeTVar read zs
                        return z

1

If we are reading from the channel and the read list is empty, then we check the write list. If that is also empty, then we block.

2

If the ys list is non-empty, then we must reverse it and make it the new xs list, and then return the first element of the new xs as the value we read from the channel.

+ There is one subtlety here: we must be careful that the reverse is done lazily, which is why we use a let rather than case here. If we were to pattern-match on the result of the reverse strictly, the STM transaction could not complete until the reverse finished (see “Performance”).

Another happy consequence of this representation choice is that we are able to use a separate TVar for each list. This means that in the common case, readers and writers can proceed independently without conflict, which is important if we use this data structure in a parallel setting.

This implementation of channels in STM outperforms both the MVar-based Chan and the TVar-based TChan. A simple benchmark program can be found in chanbench.hs with three different scenarios:

  • Two threads, one reading from and one writing to the channel
  • One thread, writing a large number of values and then reading them
  • One thread, repeatedly writing and then reading a number of values

On my computer, TQueue is about the same as Chan on the first test and wins by about 20% on the second and third test.

Why is TQueue so much faster? The main reason is that the data structure representing the channel contents is much more compact and thus faster to operate on: ordinary linked lists are very cheap in Haskell, whereas operations on TVar and MVar are much more expensive.

Bounded Channels

So far, we have seen one-place channels (MVar and TMVar) and unbounded channels (Chan and TChan), but in practice we often want something between the two. The one-place channel does not allow sufficient concurrency: consider multiple writers with a single reader. If there is a burst of writing activity, most of the writers will block waiting for the reading thread to catch up, and there will be a lot of context switching as the reader services each writer in turn. The unbounded channel has a different pathology: if the reading thread cannot keep up with the writers, the size of the channel will keep growing without bound, and in the worst case we could run out of memory.

Ideally, there should be some limit on the size of the channel so that the channel can absorb bursts of writing activity without the danger that heavy writing will use too much memory.

Fortunately, STM makes it quite straightforward to build a bounded channel. All we need to do is keep track of the current capacity in the channel and arrange that writing to the channel blocks if the channel is currently full. This implementation is based on TQueue:

TBQueue.hs

data TBQueue a = TBQueue (TVar Int) (TVar [a]) (TVar [a]) -- 1

newTBQueue :: Int -> STM (TBQueue a)
newTBQueue size = do
  read  <- newTVar []
  write <- newTVar []
  cap   <- newTVar size
  return (TBQueue cap read write)

writeTBQueue :: TBQueue a -> a -> STM ()
writeTBQueue (TBQueue cap _read write) a = do
  avail <- readTVar cap                         -- 2
  if avail == 0                                 -- 3
     then retry                                 -- 4
     else writeTVar cap (avail - 1)             -- 5
  listend <- readTVar write
  writeTVar write (a:listend)

readTBQueue :: TBQueue a -> STM a
readTBQueue (TBQueue cap read write) = do
  avail <- readTVar cap                         -- 6
  writeTVar cap (avail + 1)
  xs <- readTVar read
  case xs of
    (x:xs') -> do writeTVar read xs'
                  return x
    [] -> do ys <- readTVar write
             case ys of
               [] -> retry
               _  -> do let (z:zs) = reverse ys
                        writeTVar write []
                        writeTVar read zs
                        return z

1

The TBQueue data type is like the TQueue we saw previously but has an extra TVar Int to store the channel’s current capacity.

2

In writeTBQueue, we first read the current capacity.

3

If the capacity is zero, meaning the channel is full,

4

we call retry to block.

5

Otherwise, decrease the capacity by 1, because we are about to add another element.

6

When reading, we always increment the capacity.

In the chanbench.hs channel benchmark, the bounded channel performs almost as well as TQueue in the first test, although it doesn’t do so well in the third test, performing about the same as TChan. The second test, which writes a large number of items to the channel, inevitably fails with TBQueue.

The danger with bounded channels is that it is possible to write a program with a lurking deadlock that is only discovered much later when the program is running in production. This is because the vast majority of the time writeTBQueue does not block, but once in a while, probably under heavy load, the channel fills up and writeTBQueue blocks. If the program depends on writeTBQueue not blocking, it may deadlock. How might we get into this situation? It is the dining philosophers problem again:

thread 1:
  x <- atomically $ readTBQueue q1
  y <- atomically $ readTBQueue q2

thread 2:
  atomically $ writeTBQueue q2 y
  atomically $ writeTBQueue q1 x

This sequence will work perfectly well until q2 becomes full, at which point we get a deadlock. If the communication pattern is obscured by other code, we might not realize there’s a problem.

There’s no silver bullet. The best advice is to test your code thoroughly with a buffer size of 1, because that will tend to expose any deadlocks of this kind during testing. Note that deadlocks will often be detected by the runtime system and result in an exception rather than a hang; see “Detecting Deadlock”.

What Can We Not Do with STM?

STM offers a qualitative improvement over MVar in various ways: composable atomicity, composable blocking, and simpler error handling. Therefore, it is reasonable to ask whether we need MVar at all, and whether there is anything that is harder to accomplish with STM than with MVar.

One unsurprising advantage of MVar is that it is faster than STM. But even though a straightforward comparison of, say, takeMVar against atomically . takeTMVar will show that takeMVar is faster, we should not assume that using MVar will always result in faster code. As we saw in the previous section, we can build a channel using STM that outperforms the MVar-based version, and furthermore is composable.

In fact, MVar does have one other important advantage over STM, which we mentioned earlier: fairness. When multiple threads block on an MVar, they are guaranteed to be woken up in FIFO order, and no single thread can be blocked in takeMVar indefinitely so long as there is a constant supply of putMVars. In contrast, when multiple threads are blocked in STM transactions that depend on a particular TVar, and the TVar is modified by another thread, it is not enough to just wake up one of the blocked transactions—the runtime must wake them all. To see why, consider the following:

do x <- takeTMVar m
   when (x /= 42) retry

A transaction can block on an arbitrary condition, so the runtime doesn’t know whether any individual transaction will be able to make progress after the TVar is changed; it must run the transaction to find out. Hence, when there are multiple transactions that might be unblocked, we have to run them all; after all, they might all be able to continue now. Because the runtime has to run all the blocked transactions, there is no guarantee that threads will be unblocked in FIFO order and no guarantee of fairness.

You might wonder whether we could implement fairness using STM. For example, suppose we want to add fairness to our TMVar implementation. We will need to represent explicitly the queue of blocked takeTMVars, perhaps as a list of TVars, each waiting to receive a value. Conversely, the blocked putTMVars could also be a list of TVars, each with a value to put. In fact, we could represent all the blocked threads by a list of TVar (Maybe a).

So this could be the TMVar data type:

data TMVar a = TMVar (TVar (Maybe a)) (TVar [TVar (Maybe a)])

Now consider how putMVar would work. There are three cases to consider:

The TMVar is empty, and there are no blocked takeTMVars
Store the value in the TMVar and return.
The TMVar is empty, and there are some blocked takeTMVars
Removes the first blocked takeTMVar from the queue and put the value in its TVar.
The TMVar is full
We must create a new TVar containing Just a (the value to be put), add this to the end of the list of blocked putTMVars, and then wait until the TVar contents becomes Nothing.

The last case is the tricky one: we cannot write a transaction that both has a visible effect (adds something to the list) and calls retry, because calling retry abandons any changes to TVars made by the current transaction.

The only way to implement fairness is to abandon composability. We can implement a TMVar with the structure I suggested, but the operations must be in the IO monad, not the STM monad. The trick is to have the STM transaction return an IO action that is executed after the STM transaction completes. I’ll leave the implementation as an exercise for the reader.

In general, the class of operations that STM cannot express are those that involve multi-way communication between threads. The simplest example is a synchronous channel, in which both the reader and the writer must be present simultaneously for the operation to go ahead. We cannot implement this in STM, at least compositionally, for the same reason that we cannot implement TMVar with fairness: the operations need to block and have a visible effect—advertise that there is a blocked thread—simultaneously.

Performance

As with most abstractions, STM has a runtime cost. If we understand the cost model, we can avoid writing code that hits the bad cases. So in this section I’ll give an informal description of the implementation of STM, with enough detail that the reader can understand the cost model.

An STM transaction works by accumulating a log of readTVar and writeTVar operations that have happened so far during the transaction. The log is used in three ways:

  • By storing writeTVar operations in the log rather than applying them to main memory immediately, discarding the effects of a transaction is easy; we just throw away the log. Hence, aborting a transaction has a fixed small cost.
  • Each readTVar must traverse the log to check whether the TVar was written by an earlier writeTVar. Hence, readTVar is an O(n) operation in the length of the log.
  • Because the log contains a record of all the readTVar operations, it can be used to discover the full set of TVars read during the transaction, which we need to know in order to implement retry.

When a transaction reaches the end, the STM implementation compares the log against the contents of memory. If the current contents of memory match the values read by readTVar, the effects of the transaction are committed to memory, and if not, the log is discarded and the transaction runs again from the beginning. This process takes place atomically by locking all the TVars involved in the transaction for the duration. The STM implementation in GHC does not use global locks; only the TVars involved in the transaction are locked during commit, so transactions operating on disjoint sets of TVars can proceed without interference.

There are two important rules of thumb:

  • Never read an unbounded number of TVars in a single transaction because the O(n) performance of readTVar then gives O(n2) for the whole transaction.
  • Try to avoid expensive evaluation inside a transaction because this will cause the transaction to take a long time, increasing the chance that another transaction will modify one or more of the same TVars, causing the current transaction to be re-executed. In the worst case, a long-running transaction re-executes indefinitely because it is repeatedly aborted by shorter transactions.

It is possible that a future STM implementation may use a different data structure to store the log, reducing the readTVar overhead to O(log n) or better (on average), but the likelihood that a long transaction will fail to commit would still be an issue. To avoid that problem, intelligent contention-management is required, which is an area of active research.

The retry operation uses the transaction log to find out which TVars were accessed by the transaction, because changes to any of these TVars must trigger a rerun of the current transaction. Hence, each TVar has a watch list of threads that should be woken up if the TVar is modified, and retry adds the current thread to the watch list of all the TVars read during the current transaction. Hence, retry is O(n) in the number of TVars read during the transaction. When a transaction is committed, if any of the modified TVars has a watch list, then the threads on the list are all woken up.

One other thing to watch out for is composing too many blocking operations together. If we wanted to wait for a list of TMVars to become full, we might be tempted to do this:

atomically $ mapM takeTMVar ts

Imagine that the TMVars all started empty and became full one at a time in the same order as the list ts. Each time a new TMVar becomes full, the transaction wakes up and runs again, going to sleep at the next empty TMVar. We’ll run the transaction from the start, once for every element of ts, so the whole operation is O(n2). If instead, we had written this code:

mapM (atomically . takeTMVar) ts

then it is O(n), although now the semantics are different—it is not a single transaction anymore—but if these semantics are acceptable, then the second form will be much faster.

Summary

To summarize, STM provides several benefits for concurrent programming:

Composable atomicity
You can construct arbitrarily large atomic operations on shared state, which can simplify the implementation of concurrent data structures with fine-grained locking.
Composable blocking
You can build operations that choose between multiple blocking operations, which is very difficult with MVars and other low-level concurrency abstractions.
Robustness in the presence of failure and cancellation
A transaction in progress is aborted if an exception occurs, so STM makes it easy to maintain invariants on state in the presence of exceptions.


[39] The TMVar implementation is available from the Control.Concurrent.STM.TMVar module in the stm package.

[40] We are assuming that the actual window contents are rendered via some separate means, e.g., compositing.

[41] The implementation is available in the module Control.Concurrent.STM.TChan from the stm package.

[42] Technically, the complexity is amortized O(1). For more details on these kinds of data structures, I recommend reading Okasaki’s Purely Functional Data Structures (Cambridge University Press, 1999).

Providence Salumu