In this section, I’ve collected a few tricks and techniques that you might find useful when debugging Concurrent Haskell programs.
The threadStatus
function (from GHC.Conc
) returns the current
state of a thread:
threadStatus :: ThreadId -> IO ThreadStatus
Here, ThreadStatus
is defined as follows:
data ThreadStatus
  = ThreadRunning               -- the thread is currently runnable or running
  | ThreadFinished              -- the thread has finished
  | ThreadBlocked BlockReason   -- the thread is blocked on some resource
  | ThreadDied                  -- the thread received an uncaught exception
  deriving (Eq, Ord, Show)
The BlockReason
type gives more information about why a thread is
blocked and is self-explanatory:
data BlockReason
  = BlockedOnMVar
  | BlockedOnBlackHole
  | BlockedOnException
  | BlockedOnSTM
  | BlockedOnForeignCall
  | BlockedOnOther
  deriving (Eq, Ord, Show)
Here’s an example in GHCi:
> t <- forkIO (threadDelay 3000000)
> GHC.Conc.threadStatus t
ThreadBlocked BlockedOnMVar
> -- wait a few seconds
> GHC.Conc.threadStatus t
ThreadFinished
>
While threadStatus
can be very useful for debugging, don’t use
it for normal control flow in your program. One reason
is that it breaks abstractions. For instance, in the previous example, it showed us
that threadDelay
is implemented using MVar
(at least in this
version of GHC). Another reason is that the result of
threadStatus
is out of date as soon as threadStatus
returns,
because the thread may now be in a different state.
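If you find yourself checking several threads while debugging, a tiny helper along these lines can be handy. This is a sketch of my own (dumpThreadStatus is not a library function); it simply prints the status of a list of threads you have been keeping track of:

import Control.Concurrent (ThreadId)
import GHC.Conc (threadStatus)

-- A small debugging helper: print the current status of threads we
-- have labelled and tracked ourselves.
dumpThreadStatus :: [(String, ThreadId)] -> IO ()
dumpThreadStatus threads = mapM_ report threads
  where
    report (name, tid) = do
      status <- threadStatus tid
      putStrLn (name ++ ": " ++ show status)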
While we should never underestimate the usefulness of adding
putStrLn
calls to our programs to debug them, sometimes this isn’t
quite lightweight enough. putStrLn
can introduce some extra
contention for the stdout
Handle
, which might perturb the
concurrency in the program you’re trying to debug. So in this section,
we’ll look at another way to investigate the behavior of a concurrent
program at runtime.
We’ve used ThreadScope a lot to diagnose performance problems in
this book. ThreadScope generates its graphs from the
information in the .eventlog
file that is produced when we run a
program with the +RTS -l
option. This file is a mine of information
about what was happening behind the scenes when the program ran, and
we can use it for debugging our programs, too.
You may have noticed that ThreadScope identifies threads by their
number. For debugging, it helps a lot to know which thread in the
program corresponds to which thread number; this connection can be
made using labelThread
:
labelThread :: ThreadId -> String -> IO ()   -- defined in GHC.Conc
The labelThread function has no effect on the running of the program, but it causes a special event to be emitted into the event log, associating the label with the thread.
There are also a couple of ways to put your own information in the
eventlog
file:
traceEvent   :: String -> a -> a
traceEventIO :: String -> IO ()
-- defined in Debug.Trace
Here’s a simple program to demonstrate labelThread
and
traceEventIO
in action:
mvar4.hs
import Control.Concurrent
import GHC.Conc (labelThread)
import Debug.Trace (traceEventIO)

main = do
  t <- myThreadId
  labelThread t "main"
  m <- newEmptyMVar
  t <- forkIO $ putMVar m 'a'
  labelThread t "a"
  t <- forkIO $ putMVar m 'b'
  labelThread t "b"
  traceEventIO "before takeMVar"
  takeMVar m
  takeMVar m
This program forks two threads. Each of the threads puts a value into
an MVar
, and then the main thread calls takeMVar
on the MVar
twice.
Compile the program with -eventlog
and run it with +RTS -l
:
$ ghc mvar4.hs -threaded -eventlog
$ ./mvar4 +RTS -l
This generates the file mvar4.eventlog, which is a space-efficient binary representation of the sequence of events that occurred in the runtime system when the program ran. You need a program to display the
contents of a .eventlog
file; ThreadScope of course is one such
tool, but you can also just display the raw event stream using
the ghc-events
program:[64]
$ ghc-events show mvar4.eventlog
As you might expect, there is a lot of implementation detail in the
event stream, but with the help of labelThread
and traceEventIO
, you
can sort through it to find the interesting bits. Note that if you
try this program yourself, you might not see exactly the same event
log; such is the nature of implementation details.
We labeled the main thread "main"
, so searching for main
in the
log finds this section:
912458: cap 0: running thread 3
950678: cap 0: thread 3 has label "main"
953569: cap 0: creating thread 4
956227: cap 0: thread 4 has label "a"
957001: cap 0: creating thread 5
958450: cap 0: thread 5 has label "b"
960835: cap 0: stopping thread 3 (thread yielding)
997067: cap 0: running thread 4
1007167: cap 0: stopping thread 4 (thread finished)
1008066: cap 0: running thread 5
1010022: cap 0: stopping thread 5 (blocked on an MVar)
1045297: cap 0: running thread 3
1064248: cap 0: before takeMVar
1066973: cap 0: waking up thread 5 on cap 0
1067747: cap 0: stopping thread 3 (thread finished)
So from this event log we can see the sequence of actions that happened at runtime, including which threads got blocked when, and some information about why they got blocked. These clues can often be enough to point you to the cause of a problem.
As I mentioned briefly in “Communication: MVars”, the GHC runtime system can
detect when a thread has become deadlocked and send it the
BlockedIndefinitelyOnMVar
exception. How exactly does this work?
Well, in GHC both threads and MVar
s are objects on the heap, just
like other data values. An MVar
that has blocked threads is
represented by a heap object that points to a list of the blocked
threads. Heap objects are managed by the garbage
collector, which traverses the heap starting from the roots to
discover all the live objects. The set of roots consists of the
running threads and the stack associated with each of these threads.
Any thread that is not reachable from the roots is definitely
deadlocked. The runtime system cannot ever find these threads by
following pointers, so they can never become runnable again.
For example, if a thread is blocked in takeMVar
on an MVar
that is
not referenced by any other thread, then both the MVar
that it is
blocked on and the thread itself will be unreachable. When a thread
is found to be unreachable, it is sent the BlockedIndefinitelyOnMVar
exception (there is also a BlockedIndefinitelyOnSTM
exception for
when a thread is blocked in an STM transaction). The exception gives
the thread a chance to clean up any resources it may have been
holding and also allows the program to quit with an error message
rather than hanging in the event of a deadlock.
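To see this in action, here is a minimal sketch of my own (not one of the book's examples): the main thread deliberately blocks on an MVar that nothing will ever fill, and the resulting exception is caught so that the program exits with a message instead of hanging.

import Control.Concurrent
import Control.Exception

-- The main thread blocks on an MVar that can never be filled.  The
-- runtime system detects that the thread is unreachable and delivers
-- BlockedIndefinitelyOnMVar, which we catch to clean up and exit.
main :: IO ()
main = do
  m <- newEmptyMVar :: IO (MVar ())
  takeMVar m `catch` \BlockedIndefinitelyOnMVar ->
    putStrLn "deadlocked: cleaning up and exiting"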
The concept extends to mutual deadlock between a group of threads. Suppose we create two threads that deadlock on each other like this:
a <- newEmptyMVar
b <- newEmptyMVar
forkIO (do takeMVar a; putMVar b ())
forkIO (do takeMVar b; putMVar a ())
...
Then both threads are blocked, each on an MVar
that is reachable
from the other. As far as the garbage collector is concerned, both
threads and the MVar
s a
and b
are unreachable (assuming the
rest of the program does not refer to a
or b
). When there are
multiple unreachable threads, they are all sent the
BlockedIndefinitelyOnMVar
exception at the same time.
This all seems quite reasonable, but you should be aware of some consequences that might not be immediately obvious. Here’s an example:[65]
main = do
  lock <- newEmptyMVar
  complete <- newEmptyMVar
  forkIO $ takeMVar lock `finally` putMVar complete ()
  takeMVar complete
Study the program for a moment and think about what you expect to happen.
The child thread is clearly deadlocked, and so it should receive the
BlockedIndefinitelyOnMVar
exception. This will cause the finally
action to run, which performs putMVar complete ()
, which will in
turn unblock the main thread. However, this is not what happens. At
the point where the child thread is deadlocked, the main thread is also deadlocked. The runtime system has no idea that sending the
exception to the child thread will cause the main thread to become
unblocked, so the behavior when there is a group of deadlocked
threads is to send them all the exception at the same time. Hence the
main thread also receives the BlockedIndefinitelyOnMVar
exception,
and the program prints an error message.
The second consequence is that the runtime can’t always prove that a thread is deadlocked even if it seems obvious to you. Here’s another example:
main = do
  lock <- newEmptyMVar
  forkIO $ do r <- try (takeMVar lock); print (r :: Either SomeException ())
  threadDelay 1000000
  print (lock == lock)
We might expect the child thread to be detected as deadlocked here because it is clear that nothing is ever going to be put into the lock
MVar
. But the child thread never receives an exception, and the
program completes printing True
. The reason the deadlock is not
detected here is that the main thread is holding a reference to the
MVar
lock
because it is used in the (slightly contrived)
expression (lock == lock)
on the last line. Deadlock detection
works using garbage collection, which is necessarily a conservative
approximation to the true future behavior of the program.
Suppose that instead of the last line, we had written this:
if isPrime 43 then return () else putMVar lock ()
Provided that the compiler optimizes away isPrime 43
, we would get a
deadlock exception. You can’t in general know how clever the compiler
is going to be, so you should not rely on deadlock detection for the correct working of your program. Deadlock detection is a debugging
feature; in the event of a deadlock, you get an exception rather than a
silent hang, but you should aim to never have any deadlocks in your
program.
In this section, I’ll cover a few tips and techniques for improving the performance of concurrent programs. The standard principles apply here, just as much as in ordinary sequential programming. When you need to work out which threads are responsible for a performance problem, labelThread and traceEvent can help track down the culprits (see “Event Logging and ThreadScope”).
GHC strives to provide an extremely efficient implementation of threads. This section explores the performance of a couple of very simple concurrent programs to give you a feel for the efficiency of the basic concurrency operations and how to inspect the performance of your programs.
The first program creates 1,000,000 threads, has each of them put a
token into the same MVar
, and then reads the 1,000,000 tokens from
the MVar
:
import Control.Concurrent
import Control.Monad

numThreads = 1000000

main = do
  m <- newEmptyMVar
  replicateM_ numThreads $ forkIO (putMVar m ())
  replicateM_ numThreads $ takeMVar m
This program should give us an indication of the memory overhead for
threads because all the threads will be resident in memory at once.
To find out the memory cost, we can run the program with +RTS -s
(the output is abbreviated slightly here):
$ ./threadperf1 +RTS -s
  1,048,049,144 bytes allocated in the heap
  3,656,054,520 bytes copied during GC
    799,504,400 bytes maximum residency (10 sample(s))
    146,287,144 bytes maximum slop
          1,768 MB total memory in use (0 MB lost due to fragmentation)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.75s  (  0.76s elapsed)
  GC      time    2.21s  (  2.22s elapsed)
  EXIT    time    0.18s  (  0.18s elapsed)
  Total   time    3.14s  (  3.16s elapsed)
So about 1 GB was allocated, although the total memory required by
the program was 1.7 GB. The amount of allocated memory tells us that
threads require approximately 1 KB each, and the extra memory used by the program is due to copying GC overheads. In fact, it is possible to
tune the amount of memory given to a thread when it is allocated,
using the +RTS -k<size>
option; here is the same program using
400-byte threads:
$ ./threadperf1 +RTS -s -k400
    424,081,144 bytes allocated in the heap
  1,587,567,240 bytes copied during GC
    387,551,912 bytes maximum residency (9 sample(s))
     87,195,664 bytes maximum slop
            902 MB total memory in use (0 MB lost due to fragmentation)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.59s  (  0.59s elapsed)
  GC      time    1.60s  (  1.61s elapsed)
  EXIT    time    0.13s  (  0.13s elapsed)
  Total   time    2.32s  (  2.33s elapsed)
A thread will allocate more memory for its stack on demand, so whether
it is actually a good idea to use +RTS -k400
will depend on your
program. In this case, the threads were doing very little before
exiting, so it did help the overall performance.
The second example also creates 1,000,000 threads, but this time we
create a separate MVar
for each thread to put a token into and then
take all the MVar
s in the main thread before exiting:
import Control.Concurrent
import Control.Monad

numThreads = 1000000

main = do
  ms <- replicateM numThreads $ do
          m <- newEmptyMVar
          forkIO (putMVar m ())
          return m
  mapM_ takeMVar ms
This program has quite different performance characteristics:
$ ./threadperf2 +RTS -s
  1,153,017,744 bytes allocated in the heap
    267,061,032 bytes copied during GC
     62,962,152 bytes maximum residency (8 sample(s))
      4,662,808 bytes maximum slop
            121 MB total memory in use (0 MB lost due to fragmentation)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.70s  (  0.72s elapsed)
  GC      time    0.50s  (  0.50s elapsed)
  EXIT    time    0.02s  (  0.02s elapsed)
  Total   time    1.22s  (  1.24s elapsed)
Although it allocated a similar amount of memory, the total memory in use
by the program at any one time was only 121 MB. This is because each
thread can run to completion independently, unlike the previous
example where all the threads were present and blocked on the same
MVar
. So while the main thread is busy creating more threads, the
threads it has already created can run, complete, and be garbage-collected, leaving behind only the MVar
for the main thread to take later.
Note that the GC overheads of this program are much lower than the
first example. The total time gives us a rough indication of the time
it takes to create an MVar
and a thread, and for the thread to run,
put into the MVar
, complete, and be garbage-collected. We did this
1,000,000 times in about 1.2s, so the time per thread is about 1.2
microseconds.
The conclusion is that threads are cheap in GHC, in both creation time and memory overhead. Context-switch performance is also efficient, as it does not require a kernel round-trip, although we haven’t measured that here. The memory used by threads is automatically recovered when the thread completes, and because thread stacks are movable in GHC, you don’t have to worry about memory fragmentation or running out of address space, as you do with OS threads. The number of threads we can have is limited only by the amount of memory.
We covered one trick here: the +RTS -k<size>
option, which tunes the
initial stack size of a thread. If you have a lot of very tiny
threads, it might be worth tweaking this option from its default 1k
to see if it makes any difference.
We’ve encountered shared data structures a few times so far: the phonebook example in “MVar as a Container for Shared State”, the window-manager in Chapter 10, and the semaphore in “Limiting the Number of Threads with a Semaphore”, not to mention various versions of channels. Those examples covered most of the important techniques to use with shared data structures, but we haven’t compared the various choices directly. In this section, I’ll briefly summarize the options for shared state, with a focus on the performance implications of the different choices.
Typically, the best approach when you want some shared state is to take
an existing pure data structure, such as a list or a Map
, and store
it in a mutable container. Not only is this straightforward to
accomplish, but there are a wide range of well-tuned pure data
structures to choose from, and using a pure data structure means that
reads and writes are automatically concurrent.
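For instance, here is a small sketch of my own in the style of the earlier phonebook example (the names are illustrative): a pure Map held in an MVar, with the new value forced before it is stored so that thunks don’t build up inside the container.

import Control.Concurrent.MVar
import qualified Data.Map as Map

type Name        = String
type PhoneNumber = String
type PhoneBook   = MVar (Map.Map Name PhoneNumber)

-- Insert an entry, forcing the new Map before writing it back so that
-- no chain of thunks accumulates inside the MVar.
insert :: PhoneBook -> Name -> PhoneNumber -> IO ()
insert pb name number =
  modifyMVar_ pb $ \m ->
    let m' = Map.insert name number m
    in m' `seq` return m'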
There are a couple of subtle performance issues to be aware of, though. The first is the effect of lazy evaluation when writing a new value into the container, which we covered in “MVar as a Container for Shared State”. The second is the choice of mutable container itself, which exposes some subtle performance trade-offs. There are three choices:
MVar
As we saw earlier, using an MVar to keep a shared counter did not perform well under high contention. This is
shared counter did not perform well under high contention. This is
a consequence of the fairness guarantee that MVar
offers: if a
thread relinquishes an MVar
and there is another thread waiting,
it must then hand over to the waiting thread; it cannot continue
running and take the MVar
again.
TVar
TVar
sometimes performs better than MVar
under
contention and has the advantage of being composable with other STM
operations. However, be aware of the other performance pitfalls
with STM described in “Performance”.
IORef
Using an IORef
together with atomicModifyIORef
is often a good
choice for performance, as we saw in “Limiting the Number of Threads with a Semaphore”. The
main pitfall here is lazy evaluation; getting enough strictness when
using atomicModifyIORef
is quite tricky. This is a good pattern to follow:
b <- atomicModifyIORef ref (\x -> let (a, b) = f x in (a, a `seq` b))
b `seq` return b
The seq
call on the last line forces the second component of the
pair, which itself is a seq
call that forces a
, which in turn
forces the call to f
. All of this ensures that both the value stored
inside the IORef
and the return value are evaluated strictly, and no
chains of thunks are built up.
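As a concrete (hypothetical) instance of this pattern, here is a strict shared counter; incCounter is my own name, and the counting function simply plays the role of f above, returning the new value both as the state and as the result.

import Data.IORef

-- Bump a shared counter and return the new value, forcing both the
-- value stored in the IORef and the returned value.
incCounter :: IORef Int -> IO Int
incCounter ref = do
  b <- atomicModifyIORef ref (\x -> let (a, b) = (x + 1, x + 1)
                                    in (a, a `seq` b))
  b `seq` return b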
GHC has plenty of options to tune the behavior of the runtime system (RTS). For full details, see the GHC User’s Guide. Here, I’ll highlight a few of the options that are good targets for tuning concurrent and parallel programs.
RTS options should be placed between +RTS
and -RTS
, but the -RTS
can be omitted if it would be at the end of the command line.
-N[cores]
(Default: 1) We encountered -N
many times throughout
Part I. But what value should you pass? GHC can
automatically determine the number of processors in your machine if
you use -N
without an argument, but that might not always be the
best choice. The GHC runtime system scales well when it has
exclusive access to the number of processors specified with -N
,
but performance can degrade quite rapidly if there is contention for
some of those cores with other processes on the machine.
Should you include hyperthreaded cores in the count? Anecdotal
evidence suggests that using hyperthreaded cores often gives a small
performance boost, but obviously not as much as a full core. On the
other hand, it might be wise to leave the hyperthreaded cores alone in order
to provide some insulation against any contention arising from
other processes. Be aware that using -N
alone normally includes
hyperthreaded cores.
-qa
Normally, the operating system decides which cores GHC’s OS threads run on and may move them around; -qa uses the OS’s thread-affinity support to pin each OS thread to a particular core, which prevents it from doing so. This can improve
performance or degrade it, depending on the scheduling
behavior of your operating system and the demands of the program.
-A size
(Default: 512k
) This option controls the size of the memory
allocation area for each core. A good rule of thumb is to keep this
around the size of the L2 cache per core on your machine. Cache
sizes vary a lot and are often shared between cores, and sometimes there
is even an L3 cache, too. So setting the -A
value is not an exact
science.
There are two opposing factors at play here: using more memory means we run the garbage collector less, but using less memory means we use the caches more. The sweet spot depends on the characteristics of the program and the hardware, so the only consistent advice is to try various values and see what helps.
-I seconds
-C[seconds]
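Putting a few of these options together, a typical starting point for an eight-core machine might look like the following; myprog and the specific values are only placeholders to tune from, and -rtsopts is needed so the binary accepts the RTS flags at runtime:

$ ghc -O2 -threaded -rtsopts myprog.hs
$ ./myprog +RTS -N8 -A8m -s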
Haskell has a foreign function interface (FFI) that allows Haskell code to call, and be called by, foreign language code (primarily C). Foreign languages also have their own threading models—in C, there are POSIX and Win32 threads, for example—so we need to specify how Concurrent Haskell interacts with the threading models of foreign code.
All of the following assumes the use of GHC’s -threaded
option.
Without -threaded
, the Haskell process uses a single OS thread only,
and multithreaded foreign calls are not supported.
An out-call is a call made from Haskell to a foreign language. At the
present time, the FFI supports only calls to C, so that’s all we
describe here. In the following, we refer to threads in C (i.e., POSIX
or Win32 threads) as “OS threads” to distinguish them from the Haskell
threads created with forkIO
.
As an example, consider making the POSIX C function read()
callable
from Haskell:
foreign import ccall "read"
   c_read :: CInt        -- file descriptor
          -> Ptr Word8   -- buffer for data
          -> CSize       -- size of buffer
          -> IO CSsize   -- bytes read, or -1 on error
This declares a Haskell function c_read
that can be used to call the
C function read()
. Full details on the syntax of foreign
declarations and the relationship between C and Haskell types can be
found in the Haskell
2010 Language Report.
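Assuming the c_read declaration above is in scope, a hypothetical use might look like this; the file descriptor and buffer size are arbitrary choices for illustration:

import Foreign

-- Read up to 4096 bytes from file descriptor 0 (stdin) into a
-- temporary buffer and return the number of bytes read.
readFromStdin :: IO Int
readFromStdin =
  allocaBytes 4096 $ \buf -> do
    n <- c_read 0 buf 4096
    return (fromIntegral n)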
Just as Haskell threads run concurrently with one another, when a
Haskell thread makes a foreign call, that foreign call runs
concurrently with the other Haskell threads, and indeed with any other
active foreign calls. The only way that two C calls can be
running concurrently is if they are running in two separate OS
threads, so that is exactly what happens; if several Haskell threads
call c_read
and they all block waiting for data to be read, there
will be one OS thread per call blocked in read()
.
This has to work even though Haskell threads are not normally mapped one to one with OS threads; in GHC, Haskell threads are lightweight and managed in user space by the runtime system. So to handle concurrent foreign calls, the runtime system has to create more OS threads, and in fact it does this on demand. When a Haskell thread makes a foreign call, another OS thread is created (if necessary), and the responsibility for running the remaining Haskell threads is handed over to the new OS thread, while the current OS thread makes the foreign call.
The implication of this design is that a foreign call may be executed in any OS thread, and subsequent calls may even be executed in different OS threads. In most cases, this isn’t a problem, but sometimes it is; some foreign code must be called by a particular OS thread. There are two situations where this happens:
To handle these requirements, Haskell has a concept of bound
threads. A bound thread is a Haskell thread/OS thread pair that guarantees
that foreign calls made by the Haskell thread always take place in the
associated OS thread. A bound thread is created by forkOS
:
forkOS :: IO () -> IO ThreadId
Care should be taken when calling forkOS
; it creates a complete new
OS thread, so it can be quite expensive. Furthermore, bound threads
are much more expensive than unbound threads. When context-switching
to or from a bound thread, the runtime system has to switch OS
threads, which involves a trip through the operating system and tends
to be very slow. Use bound threads sparingly.
For more details on bound threads, see the documentation for the
Control.Concurrent
module.
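Here is a sketch of the kind of situation where a bound thread is genuinely needed. The foreign functions init_gui and gui_loop are hypothetical stand-ins for a C library that insists on being called from one particular OS thread:

import Control.Concurrent

-- Hypothetical foreign functions that must all run in the same OS thread.
foreign import ccall "init_gui" c_initGUI :: IO ()
foreign import ccall "gui_loop" c_guiLoop :: IO ()

-- Running both calls inside a single bound thread guarantees that they
-- are made from the same OS thread.
startGUI :: IO ThreadId
startGUI = forkOS $ do
  c_initGUI
  c_guiLoop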
There is a common misconception about forkOS
, which is partly a
consequence of its poorly chosen name. Upon seeing a function called
forkOS
, one might jump to the conclusion that you need to use
forkOS
to call a foreign function like read()
and have it run
concurrently with the other Haskell threads. This isn’t the case. As
I mentioned earlier, the GHC runtime system creates more OS threads on
demand for running foreign calls. Moreover, using forkOS
instead of
forkIO
will make your code a lot slower.
The only reason to call forkOS
is to create a bound thread,
and the only reason for wanting bound threads is to work with foreign
libraries that have particular requirements about the OS thread in which a
call is made.
The thread that runs main
in a Haskell program is a bound thread.
This can give rise to a serious performance problem if you use the
main thread heavily; communication between the main thread and other
Haskell threads will be extremely slow. If you notice that your
program runs several times slower when -threaded
is added, this
is the most likely cause.
The best way around this problem is just to create a new thread from
main
and work in that instead.
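A minimal sketch of this workaround follows; realMain is a placeholder for the program’s actual work. The bound main thread only forks an ordinary (unbound) thread and waits for it to finish:

import Control.Concurrent

main :: IO ()
main = do
  done <- newEmptyMVar
  forkIO $ do
    realMain              -- placeholder: the program's actual work
    putMVar done ()
  takeMVar done

realMain :: IO ()
realMain = putStrLn "running on an unbound thread"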
When a Haskell thread is making a foreign call, it cannot receive
asynchronous exceptions. There is no way in general to interrupt a
foreign call, so the runtime system waits until the call returns
before raising the exception. This means that a thread blocked in a
foreign call may be unresponsive to timeouts and interrupts, and
moreover that calling throwTo
will block if the target thread is in
a foreign call.
The trick for working around this limitation is to perform the foreign call in a separate thread. For example:
do
  a <- async $ c_read fd buf size
  r <- wait a
  ...
Now the current thread is blocked in wait
and can be interrupted
by an exception as usual. Note that if an exception is raised it
won’t cancel the read()
call, which will continue in the background.
Don’t be tempted to use withAsync
here because withAsync
will
attempt to kill the thread calling read()
and will block in doing so.
Operations in the standard System.IO
library already work this way
behind the scenes because they delegate blocking operations to a
special IO manager thread. So there’s no need to worry about forking
extra threads when calling standard IO
operations.
In-calls are calls to Haskell functions that have been exposed to
foreign code with a foreign export
declaration. For example, if we have a
function f
of type Int -> IO Int
, we could expose it like this:
foreign export ccall "f" f :: Int -> IO Int
This would create a C function with the following signature:
HsInt f(HsInt);
Here, HsInt
is the C type corresponding to Haskell’s Int
type.
In a multithreaded program, it is entirely possible for f
to be
called by multiple OS threads concurrently. The GHC runtime system
supports this (provided you use -threaded
) with the following
behavior: each call becomes a new bound thread. That is, a
new Haskell thread is created for each call, and the Haskell thread is
bound to the OS thread that made the call. Hence, any further
out-calls made by the Haskell thread will take place in the same OS
thread that made the original in-call. This turns out to be important
for dealing with GUI callbacks. The GUI wants to run in the main OS
thread only, so when it makes a callback into Haskell, we need to
ensure that GUI calls made by the callback happen in the same OS
thread that invoked the callback.