Providence Salumu Well-Typed - The Haskell Consultants: Understanding the Stack

Understanding the Stack

Wednesday, 21 May 2014, by Edsko de Vries.
Filed under coding.

One of our clients sent us a bug report, which consisted of a 700 line Haskell program with a complaint that “it deadlocks”. After studying the code I concluded that, for the sake of the bug report, it could be summarized as

print a bunch of stuff, then crash with a stack overflow

So I wanted to replace the original code with code that did just that: print a bunch of stuff, then crash with a stack overflow. No big deal:

{-# LANGUAGE CPP #-}
module Main (main) where

go :: Int -> IO Int
go n = do
  if (n `rem` THRESHOLD == 0)
    then putChar '.'
    else return ()
  n' <- go (n + 1)
  return (n + n')

main :: IO ()
main = print =<< go 0

The function go prints a dot every THRESHOLD recursive calls; we have a (dummy, since it can never be reached) addition after the recursive call to ensure that go is not tail recursive and will eventually run out of stack space. The THRESHOLD macro variable is there so that we can tweak how quickly the program runs out of stack space, or in other words, how many dots it prints before it crashes. For example, if we compile the code with

ghc -O1 PrintThenCrash.hs -fforce-recomp -DTHRESHOLD=23

it prints 22,126 dots before crashing; if we compile with

ghc -O1 PrintThenCrash.hs -fforce-recomp -DTHRESHOLD=26

it prints 19,483 dots before crashing. But as I was experimenting to find the right value for this parameter, I noticed something very strange. If we compile with

ghc -O1 PrintThenCrash.hs -fforce-recomp -DTHRESHOLD=24

it never crashes! It just prints, and prints, and prints (with GHC 7.6.3). This blog post documents my attempt to understand this behaviour, and explains some aspects of ghc’s runtime, and in particular, its stack, as it goes.

Checking core

After type checking ghc translates Haskell to an intermediate language called core, which is the “real life” version of the more “academic” language System FC. We can ask ghc to output the (optimized) core version of our program with

ghc -O1 PrintThenCrash.hs -fforce-recomp -DTHRESHOLD=24 -dsuppress-all -ddump-simpl

This is useful, because it allows to verify that even with a THRESHOLD value of 24 the optimizer did not somehow manage to make the function tail recursive:

$wa =
  \ ww_s1SW w_s1SY ->
    case remInt# ww_s1SW 24 of _ {
      __DEFAULT ->
        case $wa (+# ww_s1SW 1) w_s1SY of _ { (# ipv_aEw, ipv1_aEx #) ->
        (# ipv_aEw,
           case ipv1_aEx of _ { I# y_axO -> I# (+# ww_s1SW y_axO) } #)
        };
      0 ->
        case $wa5 stdout '.' w_s1SY of _ { (# ipv_aEw, _ #) ->
        case $wa (+# ww_s1SW 1) ipv_aEw of _ { (# ipv2_XET, ipv3_XEV #) ->
        (# ipv2_XET,
           case ipv3_XEV of _ { I# y_axO -> I# (+# ww_s1SW y_axO) } #)
        }
        }
    }

This is a bit difficult to read; don’t worry about the details for now, we will start dealing with low level details soon enough. For now, it’s enough to note that $wa is the translation of go, and that the recursive calls to $wa are not in tail position; the optimizer did not somehow manage to make the function tail recursive. Ok. Then what? Why is this function not crashing with a stack overflow?

Simplifying the problem

If we cannot understand the behaviour of the code by looking at core then we need to drop all the way down to assembly language. However, if we want to have any hope of being able to step through the assembly language we need to simplify that call to putChar. Sadly, replacing it with

foreign import ccall "putchar" c_putchar :: Char -> IO ()

made the problem go away: the program now always crashed. So the strange behaviour of our program had something to do with the the implementation of putChar. The real putChar is more involved that it might seem; it deals with buffering, character encodings, concurrent access to Handles, etc.

Unfortunately, since I had no idea what aspect of the implementation of putChar was causing the behaviour I was seeing, I could think of no other approach than to inline putChar and the functions it calls, and start simplifying it bit by bit until the strange behaviour disappeared.

Many, many hours later I ended up with this code:

hPutChar :: State# RealWorld -> (# State# RealWorld, () #)
hPutChar w0 = case stdout of () -> (# w0, () #)

go :: State# RealWorld -> (# State# RealWorld, () #)
go w0 =
  case maskAsyncExceptions# hPutChar w0 of
    (# w1, _ #) ->
      case go w1 of
        (# w2, () #) -> (# w2, () #)

stdout :: ()
stdout = ()

main :: IO ()
main = IO go

In order to understand this code, you first have to realize that the IO monad is a state monad with the RealWorld as the state:

newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #))

I have written the code with explicit state passing, instead of using the monad, so that the Haskell code is as close as possible to the generated core and beyond.

That out of the way, let’s discuss the actual code. First of all, the real putChar is defined as

putChar = hPutChar stdout

where stdout has type Handle. As it turns out, there are only two aspects of hPutChar that are relevant:

  1. hPutChar does a case analysis on something (actually, of course, it does case analyses on lots of things); as it turns out, all that matters for our example is that it does a case analysis at all, so we have modelled stdout simply by ().

  2. This case analysis happens while asynchronous exceptions have been masked. This happens because a Handle contains an MVar, and most of the I/O operation happens while we have taken the value from this MVar (see withHandle' in GHC.IO.Handle.Internals).

At this point we can simplify no further. Although we are not actually printing any dots anymore, if we run this code it will run forever until you interrupt it manually; but if you remove either the case analysis from hPutChar or the call to maskAsyncExceptions# from go the program will exit with a stack overflow almost immediately. What gives?

Understanding hPutChar

When ghc compiles your program, after lexical analysis and syntax analysis it first removes syntactical sugar and does scope analysis; then, after type checking, it translates your code to core, as we saw. The core gets optimized and then translated to stg (for “Spineless Tagless G-machine”); stg code is very similar to core, but with a handful of additional restrictions; most importantly, all constructor applications must be fully applied (and if not, an explicit lambda must be constructed). Finally, the stg code gets translated to C--, which is a portable assembly language similar in intent to llvm, and the C-- code then is translated to assembly language.

We wrote our code in such a low level way that the core and stg translations are not very interesting; they all look very alike. The C-- code however is a big step down in level. Recall that we simplified hPutChar to

hPutChar :: State# RealWorld -> (# State# RealWorld, () #)
hPutChar w0 = case stdout of () -> (# w0, () #)

The C-- code for this “hPutChar” is (-ddump-opt-cmm):

Main.hPutChar_info:
        if ((Sp + -8) < SpLim) goto c1ae;
        R1 = PicBaseReg + IO.stdout_closure;
        I64[Sp - 8] = PicBaseReg + s19G_info;
        Sp = Sp - 8;
        if (R1 & 7 != 0) goto c1ah;
        jump I64[R1]; // [R1]
    c1ae:
        R1 = PicBaseReg + Main.hPutChar_closure;
        jump (I64[BaseReg - 8]); // [R1]
    c1ah: jump s19G_info; // [R1]

s19G_info:
    R1 = PicBaseReg + GHC.Tuple.()_closure + 1;
    Sp = Sp + 8;
    jump (I64[Sp + 0]); // [R1]

This might look rather frightening, especially when you are used to Haskell, so let’s take it step by step. We will be pushing a value to the stack, so we first check if there is enough room on the stack to do so:

if ((Sp + -8) < SpLim) goto c1ae;

If not, we will call a function from the GHC runtime called __stg_gc_fun, which will extend the stack, if possible. We will come back to this in detail later.

In order to do case analysis on stdout we first have to make sure that it is in weak head normal form (in this case, it will always be, but it might not be in general). In order to do this, we need to call its definition; but before we can do that, we need to do two more things.

First, in general stdout might be a thunk, with free variables in the closure, and hence it needs to know where it can find the values of those free variables. The convention is therefore that when we call a closure, the address of the closure can always be found in register R1:

R1 = PicBaseReg + IO.stdout_closure;

Secondly, when stdout completes, it needs to know what to do next. It does this by looking at the stack: when it finishes, it loads a “continuation address” from the top of the stack and calls it. Hence we need to push this address to the stack so that stdout will be able to find it later:

I64[Sp - 8] = PicBaseReg + s19G_info;
Sp = Sp - 8;

At this point we could call stdout, but actually we can do slightly better. ghc implements an optimization called pointer tagging. Since addresses are always word aligned, the lower 3 bits (or, on 32-bit machines, the lower 2 bits) of addresses are always 0. For datatype constructors ghc uses these bits to encode, as part of the pointer itself, which constructor of the datatype it is (non-zero value) or whether the thunk is not yet in weak head normal form. So we can check by looking at the pointer if there is a point calling stdout at all:

if (R1 & 7 != 0) goto c1ah;

c1ah: jump s19G_info; // [R1]

If the constructor is already in weak head normal form, we call the continuation directly (note that s19G_info is the same address that we pushed to the stack as the continuation address for stdout). Finally, if it turns out that we do still need to evaluate the stdout we call it:

jump I64[R1]; // [R1]

In the continuation we now know that stdout is in weak head normal form; technically speaking, we should now pattern match on it to find out which constructor it is, but of course () only has a single constructor, so we can skip that step. This means that all that is left to do is to call our continuation with the result. We are returning

(# w0, () #)

In general, unboxed tuples are represented as a pair of pointers, either as two consecutive memory locations or, ideally, in two registers. However, real world tokens disappear from generated code, so all we have to return is (). By convention the first few arguments are passed in registers; in this case, that means that we need to load the address of () into register R1 before calling the continuation:

R1 = PicBaseReg + GHC.Tuple.()_closure + 1;
Sp = Sp + 8;
jump (I64[Sp + 0]); // [R1]

The + 1 part is pointer tagging at work: () is already in weak head normal form, and it is the first (and only) constructor of the () type.

Understanding go

At this point you might realize why I said at the start that it would be completely unfeasible to step through the real hPutChar; our simplified version just does a case analysis (on something of unit value, no less) and then returns unit, and it is already complicated enough! What about go?

go :: State# RealWorld -> (# State# RealWorld, () #)
go w0 =
  case maskAsyncExceptions# hPutChar w0 of
    (# w1, _ #) ->
      case go w1 of
        (# w2, () #) -> (# w2, () #)

Thankfully, the code was carefully written to make the translation to lower level code as simple as possible; the C-- translation of go does not introduce any more concepts that we already used in hPutChar:

Main.go_info():
    c1aM:
        if ((Sp + -8) < SpLim) goto c1aO;
        R1 = PicBaseReg + Main.hPutChar_closure + 1;
        I64[Sp - 8] = PicBaseReg + s19O_info;
        Sp = Sp - 8;
        jump stg_maskAsyncExceptions#; // [R1]
    c1aO:
        R1 = PicBaseReg + Main.go_closure;
        jump (I64[BaseReg - 8]); // [R1]

s19O_info:
        I64[Sp + 0] = PicBaseReg + s19N_info;
        jump Main.go_info; // []

s19N_info:
        I64[Sp + 0] = PicBaseReg + s19M_info;
        if (R1 & 7 != 0) goto c1aI;
        jump I64[R1]; // [R1]
    c1aI:
        jump s19M_info; // [R1]

s19M_info:
        R1 = PicBaseReg + GHC.Tuple.()_closure + 1;
        Sp = Sp + 8;
        jump (I64[Sp + 0]); // [R1]

The structure is very similar as before, except that we have two (actually, three) case statements.

  1. First we do a stack overflow check again, as before.

  2. We need to evaluate the scrutinee of the first case statement: the call to the primop maskAsyncExceptions# (the equivalent of mask_ in Haskell-land), which expects its argument in R1. In this case, the argument is hPutChar, except that as before we use pointer tagging (in this case, to indicate that the function has been evaluated and that its arity is 1).

  3. The continuation after we are finished with the scrutinee is s19O_info; this is a second case statement, so we push a second continuation (s19N_info) onto the stack and recursively call go.

  4. The second continuation—which is never reached—makes sure that go indeed returned unit, and then returns unit; if you look at the stg translation of go (-ddump-stg) you will notice that go is actually a triply nested case statement:

case maskAsyncExceptions# [Main.hPutChar w0_s19t] of _ {
  (#,#) ipv_s19x _ ->
      case Main.go ipv_s19x of _ {
        (#,#) ipv2_s19D ipv3_s19B ->
            case ipv3_s19B of _ { () -> (#,#) [ipv2_s19D GHC.Tuple.()]; };
      };
};

Three case statements, hence three continuations (s19O_info, s19N_info, s19M_info).

Running the code

All this is pure theory, so far. Let’s confirm it by actually running our code. We can load up the code into a debugger; make sure to compile the code with -debug to link it against the version of the Haskell RTS that has debugging symbols. Since I am working on OSX Mavericks I will be using lldb; gdb will work in a very similar way.

One of the abstractions that C-- offers over real assembly language is virtual registers. When we translate C-- to assembly these are mapped to machine registers by a register allocator in ghc’s native code generator. On my machine, Sp and SpLim are mapped to rbp and r15, and R1 is mapped to rbx.

Let’s load up our code, set a breakpoint in go, ask it to print Sp and SpLim whenever we hit a breakpoint, and then start the program (which, during my debugging, I called “weird”, which is why you will see many references to “weird” below).

# lldb weird
Current executable set to 'weird' (x86_64).

(lldb) breakpoint set -n Main_go_info
Breakpoint 1: where = weird`Main_go_info, address = 0x0000000100000f98

(lldb) target stop-hook add --one-liner "register read rbp r15"
Stop hook #1 added.

(lldb) run
Process 35298 launched: '/Users/dev/wt/weird/weird' (x86_64)
     rbp = 0x0000000000000000
     r15 = 0x0000000000000000
Process 35298 stopped
* thread #1: tid = 0xebc474, 0x0000000100000f98 weird`Main_go_info, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x0000000100000f98 weird`Main_go_info
weird`Main_go_info:
-> 0x100000f98:  leaq   -0x8(%rbp), %rax
   0x100000f9c:  cmpq   %r15, %rax
   0x100000f9f:  jb     0x100000fc0               ; Main_go_info + 40
   0x100000fa1:  leaq   0xd8818(%rip), %rax       ; Main_hPutChar_closure
     rbp = 0x0000000100405370
     r15 = 0x00000001004050c0

lldb started the program, stopped at the breakpoint, and told us the value of Sp and SpLim, as asked. It also shows us a disassembly of the code. The C-- code started with

if ((Sp + -8) < SpLim) goto c1aO;

which, if you recall, was to check if we might run into a stack overflow. This translates into

leaq   -0x8(%rbp), %rax
cmpq   %r15, %rax
jb     0x100000fc0               ; Main_go_info + 40

in Intel assembly language. The details don’t matter very much, and are beyond the scope of this blog post. If you haven’t seen Intel assembly language, leaq is “load effective address”, cmpq is “compare” and jb “jump if below”. We will not explain such details any further; hopefully you will be able to squint a bit and see that it resembles the C-- rather closely.

Before we do anything else, let’s see what’s on the stack before we start:

(lldb) memory read -format A -count 3 $rbp
0x100405370: 0x000000010009eb68 weird`stg_catch_frame_info
0x100405378: 0x0000000000000000
0x100405380: 0x00000001000dbf72 base_GHCziTopHandler_runIO2_closure + 2

So, the stack frame immediately above us when we start is a catch frame. Catch frames are pushed onto the stack when you use catch, and are used by the runtime to find which exception handler to run when an exception occurs. The catch frame has two additional fields in its payload: the handler to run (in this case, the default top-level handler), and a mask which indicates if asynchronous exceptions have been blocked or not (in this case, they haven’t).

We can use the step command from lldb to start stepping through the execution of the code. We check for stack overflow, find that there is none, and then set things up for the first case statement in go: we push a continuation address to the stack, put the argument to maskAsyncExceptions# in rbx (the machine equivalent of R1), and then call maskAsyncExceptions. At this point the stack therefore looks like

0x100405368: 0x0000000100000f68 weird`s19O_info
0x100405370: 0x000000010009eb68 weird`stg_catch_frame_info
0x100405378: 0x0000000000000000
0x100405380: 0x00000001000dbf72 base_GHCziTopHandler_runIO2_closure + 2

and we can confirm that rbx contains the tagged address of hPutChar

(lldb) image lookup --address $rbx
      Address: weird[0x00000001000d97c1] (weird.__DATA.__data + 1)
      Summary: Main_hPutChar_closure + 1

You can find the implementation of maskAsyncExceptions# as stg_maskAsyncExceptionszh in Exception.cmm in the rts/ directory of the ghc source (“zh” is the z-encoding of “#”). The details are not so important, however. It masks async exceptions by setting a flag in a register, pushes a frame onto the stack to unmask exceptions when we are done, and then calls the function (whose address is in R1). This means that we will end up in hPutChar, at which point the stack looks like

0x100405360: 0x000000010009e540 weird`stg_unmaskAsyncExceptionszh_ret_info
0x100405368: 0x0000000100000f68 weird`s19O_info
0x100405370: 0x000000010009eb68 weird`stg_catch_frame_info
0x100405378: 0x0000000000000000
0x100405380: 0x00000001000dbf72 base_GHCziTopHandler_runIO2_closure + 2

This continues on. hPutChar executes, completes, and runs the contination, which happens to be stg_unmaskAsyncExceptionszh_ret_info; this unmasks asynchronous exceptions, and continues with the continuation above it, which in this case is the original continuation from go. And after some more steps we end up back in go for the next recursive call, and the whole process repeats. However, the stack now looks like

0x100405368: 0x0000000100000f38 weird`s19N_info
0x100405370: 0x000000010009eb68 weird`stg_catch_frame_info
0x100405378: 0x0000000000000000
0x100405380: 0x00000001000dbf72 base_GHCziTopHandler_runIO2_closure + 2
0x100405388: 0x00000001000a46c0 weird`stg_stop_thread_info

Remember that go is not tail recursive, so once this recursive call completes, we still need to call the continuation from the previous invocation. And this repeats; if we run until the breakpoint again (using cont) and print the stack, we will see

0x100405360: 0x0000000100000f38 weird`s19N_info
0x100405368: 0x0000000100000f38 weird`s19N_info
0x100405370: 0x000000010009eb68 weird`stg_catch_frame_info
0x100405378: 0x0000000000000000
0x100405380: 0x00000001000dbf72 base_GHCziTopHandler_runIO2_closure + 2
0x100405388: 0x00000001000a46c0 weird`stg_stop_thread_info

and so on. This is precisely the behaviour that we were expecting: since the function is not tail recursive, we are using up more and more stack space and should eventually run out, and crash with a stack overflow. So why don’t we?

Stack overflow

After precisely 84 recursive calls, the stack looks like

0x1004050d0: 0x0000000100000f38 weird`s19N_info
...
0x100405368: 0x0000000100000f38 weird`s19N_info
0x100405370: 0x000000010009eb68 weird`stg_catch_frame_info
0x100405378: 0x0000000000000000
0x100405380: 0x00000001000dbf72 base_GHCziTopHandler_runIO2_closure + 2

with 84 s19N_info pointers in total. Moreover, the top of the stack (virtual register Sp, real register rbp) is now dangerously close to the stack limit (virtual register SpLim, real register r15):

(lldb) register read rbp r15
     rbp = 0x00000001004050d0
     r15 = 0x00000001004050c0

That is, we have 16 bytes left, or space for two addresses. This means that we have just enough stack space to make it to the entry point for hPutChar; we get there in the same way as before, and the stack now looks like

0x1004050c0: 0x000000010009e540 weird`stg_unmaskAsyncExceptionszh_ret_info
0x1004050c8: 0x0000000100000f68 weird`s19O_info
0x1004050d0: 0x0000000100000f38 weird`s19N_info
...
0x100405368: 0x0000000100000f38 weird`s19N_info
0x100405370: 0x000000010009eb68 weird`stg_catch_frame_info
0x100405378: 0x0000000000000000
0x100405380: 0x00000001000dbf72 base_GHCziTopHandler_runIO2_closure + 2

Note that we pushed two additional addresses onto the stack, which is therefore now full. Remember that hPutChar starts with a check for stack overflow:

if ((Sp + -8) < SpLim) goto c1ae;

c1ae:
    R1 = PicBaseReg + Main.hPutChar_closure;
    jump (I64[BaseReg - 8]); // [R1]

Actually, if we look at the unoptimized C-- minus instead (-ddump-cmm) we will see

c1ae:
    R1 = Main.hPutChar_closure;
    jump stg_gc_fun; // [R1]

instead, which is a lot clearer: we load the address of hPutChar into register R1 and then run the garbage collector. In the optimized C-- code we find the address of the garbage collector somewhere else, but we can verify in lldb that it’s the same thing:

(lldb) memory read -format A -count 1 $r13-8
0x1000e1290: 0x000000010009fcb0 weird`__stg_gc_fun

Either way, the garbage collector runs. It notices that we are out of stack space (as opposed to out of heap space; we carefully avoided any heap allocation in this test code), creates a new, bigger, stack, copies the old stack over, and then calls back into our function. In other words, we end up back at the start of hPutChar, but now the stack looks like:

0x1004fcd30: 0x000000010009e540 weird`stg_unmaskAsyncExceptionszh_ret_info
0x1004fcd38: 0x0000000100000f68 weird`s19O_info
0x1004fcd40: 0x0000000100000f38 weird`s19N_info
...
0x1004fcfd8: 0x0000000100000f38 weird`s19N_info
0x1004fcfe0: 0x000000010009eb68 weird`stg_catch_frame_info
0x1004fcfe8: 0x0000000000000000
0x1004fcff0: 0x00000001000dbf72 base_GHCziTopHandler_runIO2_closure + 2

which is the same stack as before, except at a different location; and, crucially, we have space on the stack again:

(lldb) register read rbp r15
rbp = 0x00000001004fcd30
r15 = 0x00000001004f50c0

so we can continue recursing.

Running out of stack space

Of course, we cannot continue increasing the stack forever; eventually we should reach the maximum stack size (8 MB by default in ghc 7.6). So why don’t we? Well, now that we understand the problem at this level of detail a quick Google search will reveal that this is due to a “bug” in ghc (depending on your definition of bug, I suppose) prior to version 7.8: when a stack overflow happens while asynchronous exceptions are blocked, the stack will be grown no matter what the limitation on the stack size is.

In this example, it just so happens that we run out of stack space when we try to push the continuation address in hPutChar, at which point asynchronous exceptions are indeed masked. Hence, this program will continue growing the stack unpunished, until we run out of machine memory completely. If we remove the case statement from hPutChar (or indeed, if we modify the program in a myriad of other, minor, ways) we would detect the stack overflow outside of the maskAsyncExceptions# and hence we would crash with a stack overflow exception when we reach the maximum stack size.

In the original program that we started with, the value of THRESHOLD determines how many continuation addresses for go are on the stack when we call (the real) putChar. We run out of stack space either while asynchronous exceptions were masked (somewhere deep inside the bowels of the real hPutChar), or outside it, and hence we would run forever or crash almost immediately, respectively. This is clearly not what one would expect, and with GHC 7.8 the problem has been resolved – a stack overflow is now treated like any other asynchronous exception, and the program will crash with a stack overflow as soon as asynchronous exceptions are no longer masked.

Further Reading

If you want to understand ghc’s runtime execution model How to make a fast curry: push/enter vs eval/apply is essential reading, although you can ignore the comparison with push/enter, which is not used in ghc. Faster Laziness Using Dynamic Pointer Tagging explains the pointer tagging trick we discussed above. Apart from these two papers, there are various other references that might be helpful. Be aware however that ghc is constantly evolving and many of these references may no longer precisely match what is in ghc right now.

Postscript

One of the many variations that I played with while tracking this bug down looked like

module Main (main) where

import Control.Exception
import Data.IORef
import System.IO.Unsafe (unsafePerformIO)

counter :: IORef Int
{-# NOINLINE counter #-}
counter = unsafePerformIO $ newIORef 0

hPutChar :: IO ()
hPutChar = mask_ $ do
  n <- readIORef counter
  writeIORef counter (n + 1)

go :: Int -> IO Int
go n =
  case n `rem` 2 of
    0 -> do hPutChar
            go (n + 1)
    _ -> do n' <- go (n + 1)
            return (n + n')

main :: IO ()
main = do  _ <- go 0 ; return ()

When you run this in 7.8 this program will crash with

Weird: internal error: scavenge_stack: weird activation record found on stack: 415597384
    (GHC version 7.8.2 for x86_64_apple_darwin)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
Abort trap: 6

This turns out to be due to an unrelated bug in 7.8.2; see #9045 (and #8866). It has already been fixed and should be released as part of 7.8.3.

Providence Salumu