= I know kung fu: learning STG by example =

The STG machine is an essential part of GHC, the world's leading Haskell compiler. It defines how the Haskell evaluation model should be efficiently implemented on standard hardware. Despite this key role, it is generally poorly understood amongst GHC users. This document aims to provide an overview of the STG machine in its modern, eval/apply-based, pointer-tagged incarnation by a series of simple examples showing how Haskell source code is compiled.

== What is STG, exactly? ==

Haskell code being sucked through GHC has a complex lifecycle. Broadly speaking, it transitions between five representations:

{{{
                                                                     +---------+  LLVM backend
                                                                /--->| LLVM IR |--\
                                                                |    +---------+  | LLVM
                                                                |                  v
+------------+ Desugar  +------+  STGify +-----+  CodeGen +-----+    |   NCG   +----------+
| Parse tree |--------->| Core |-------->| STG |--------->| C-- |----+-------->| Assembly |
+------------+          +------+         +-----+          +-----+    |         +----------+
                                                                |                  ^
                                                                |    +---+         | GCC
                                                     C backend  \--->| C |---------/
                                                                     +---+
}}}

The path from C-- to assembly varies: the three possible backends are C (`-fvia-c`), LLVM (`-fllvm`), and the default backend -- the native code generator (or NCG), which generates assembly directly from the GHC-internal C-- data type.

STG is a simple functional language, rather like the more famous Core language. It differs in the following main respects:

 1. In its current incarnation, it isn't typed in the Haskell sense, though it does know about ''representation'' types
 2. It is in administrative normal form (ANF), in which every subexpression is given a name
 3. Every lambda, constructor application and primitive operator application is eta-expanded
 4. It is annotated with a ton of information that the code generator is interested in knowing

STG expressions can be one of the following:

 1. Atoms (i.e. literals and variables)
 2. `let`-bindings (both recursive and non-recursive) over another expression, where let-bound things are one of:
   * A function value with explicit lambdas
   * An unsaturated application
   * A constructor applied to atoms
   * A thunk (i.e. any expression not fitting into one of the above categories)
 3. Saturated application of a primitive operator to variables
 4. Application of a variable to one or more atoms
 5. Case deconstruction of an expression, where each branch may also be an expression
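To make the ANF restriction concrete, here is a hand-written sketch of how a small definition is reshaped. This is ordinary Haskell written to mimic the shape of STG (it is not actual `-ddump-stg` output, and the names are invented):

{{{
-- Original source:   f g x = g (x + 1) 2
--
-- In STG every argument must be an atom, so non-atomic arguments are
-- let-bound first, each becoming its own heap-allocated closure:
f :: (Int -> Int -> Int) -> Int -> Int
f g x = let arg1 = x + 1   -- a thunk whose only free variable is x
            arg2 = 2       -- in practice shared as a static closure
        in g arg1 arg2
}}}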
The job of the ''STG machine'' is to evaluate these expressions in a way which is efficiently implementable on standard hardware. This document will look at how exactly this is achieved by examining real examples of the C-- code GHC generates for various Haskell expressions.

This document will take a very low-level view of the machine, so if you want to get comfortable with how the STG machine executes at a more abstract level before reading this document, you might want to read the paper [http://research.microsoft.com/en-us/um/people/simonpj/papers/eval-apply/ "How to make a fast curry: push/enter vs. eval/apply"]. It presents the STG machine without reference to an explicit stack or registers, but instead as a transition system. This transition system has also been implemented as a Haskell program called [http://hackage.haskell.org/package/ministg ministg] by [http://ww2.cs.mu.oz.au/~bjpop/ Bernie Pope], for those who wish to see it in action on some simple examples.

== An overview of the STG machine ==

Before we dive in, a note: this document will describe the STG machine as it is implemented on x86-style architectures. I will use the terms "the STG machine" and "the STG machine as implemented on x86 by GHC" interchangeably. The implementation is somewhat different on x64, not least due to the greater number of available registers.

This overview section is rather bare. Readers might be able to fill in any gaps in my explanation by using some of the following sources:

 * [http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/HaskellExecution The Haskell Execution Model]
 * [http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage Storage]
 * [http://research.microsoft.com/en-us/um/people/simonpj/Papers/spineless-tagless-gmachine.ps.gz The Spineless Tagless G-machine] - now sadly rather out of date
 * [http://research.microsoft.com/en-us/um/people/simonpj/papers/ptr-tag/ptr-tagging.pdf Faster laziness through dynamic pointer tagging]

=== Components of the machine ===

In its bare essentials, the STG machine consists of three parts:

 1. The STG registers:
   * There are rather a lot of registers here: more than can be practicably stored in actual available processor registers on most architectures.
   * To deal with the lack of processor registers, most of the STG registers are actually kept in memory, in a block pointed to by a special STG register called the "base register" (or `BaseReg`). To get or set values of registers which are not kept in processor registers, the STG machine generates an instruction to load or store at an address relative to `BaseReg`.
   * The four most important registers are `BaseReg`, the stack pointer (`Sp`), the heap pointer (`Hp`), and the general purpose register `R1`, which is used for intermediate values as well as for returning evaluated values when unwinding the stack. These are the four registers which are assigned actual processor registers when implementing the STG machine on x86.
 2. The STG stack:
   * Stores function arguments and continuations (i.e. the stack frames which are executed when a function returns)
   * Grows downwards in memory
   * The top of the stack is pointed to by the STG register `Sp`, and the stack limit is stored in `SpLim`. There is no frame pointer.
 3. The heap:
   * Used to store many different sorts of heap object: notably functions, thunks and data constructors
   * Grows upwards in memory, towards the stack
   * All allocation occurs using a bump-allocator: the heap pointer is simply incremented by the number of bytes desired (subject to a check that this does not exhaust available memory). The garbage collector is responsible for moving objects out of the area of the heap managed by the bump allocator and into the care of its generational collector.
   * The last address in the bump-allocated part of the heap that has been used is pointed to by the STG register `Hp`, with `HpLim` holding the maximum address available for bump-allocation.

=== Important concepts in the machine ===

Some of the key concepts in the STG machine include ''closures'', ''info tables'' and ''entry code''. We tackle them in reverse order:

 Entry code:: The actual machine code that the STG machine will execute upon "entry". Entry means different things for different heap objects.
   * For ''thunks'', entry is when the thunk is forced by some demand for its value, such as a `case` expression scrutinising it
   * For ''functions'', entry is when the function is applied to as many arguments as are demanded by the arity recorded in its info table
   * For ''continuations'', entry occurs when a value is returned from a nested call, and hence the need arises to consume the value and continue evaluation
 Info table:: A block of memory allocated statically, which contains metadata about a closure. The most important fields for our purposes are the entry code pointer and the arity information (if this is the info table for a thunk, function or partial application)
 Closure:: Essentially a heap-allocated pair of the free variables of some code, and a pointer to its info table (i.e. its info pointer).

For an example of how these parts work together, consider the following code

{{{
my_fun x zs = map (\y -> y + x) zs
}}}

The nested lambda will give rise to all of the above objects. The closure will store a pointer to `x`'s closure (as it is a free variable of the lambda), along with a pointer to an info table. That info table will record information relevant to a function value, such as the fact that it has an arity of 1 (i.e. the single binding for `y`), and the pointer to the entry code for the function `\y -> y + x` itself. This entry code will implement the addition by combining the closure for the free variable `x` (fetched from the lambda's closure) with the closure for `y` passed on the stack.

Upon entry to some code, a pointer to the associated closure is made available in `R1`. That is to say, before entry code is jumped to, `R1` is set up to point to the associated closure, so that the entry code can access free variables (if any). Closures for code which contains no free variables (such as the closures for `True` and `False`, and those for top-level functions such as `(:)` and `id`) are allocated statically by the compiler, in the same manner as info tables are.
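Just to fix ideas, here is a toy Haskell model of how the three concepts relate (the types and field names are invented for illustration; GHC's real info tables are static data emitted by the code generator, not Haskell values):

{{{
data Value = IntV Int | FunV Closure        -- just enough "values" for the sketch

data InfoTable = InfoTable
  { entryCode :: Closure -> Value           -- what runs when the object is entered
  , arity     :: Int                        -- e.g. 1 for the lambda \y -> y + x
  }

data Closure = Closure
  { infoPtr  :: InfoTable                   -- the info pointer
  , freeVars :: [Value]                     -- e.g. the closure for x in my_fun above
  }

-- Entering a closure: follow the info pointer and run the entry code.
enter :: Closure -> Value
enter c = entryCode (infoPtr c) c
}}}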
=== Overview of execution model of the machine ===

This will be covered in more detail in the examples below, so I will use this section to make some general points.

The goal of the STG machine is to reduce the current expression to a value. When it has done so, it:

 1. Stores a tagged pointer to the evaluated closure in the STG register `R1`
 2. Jumps to the entry code of the info table pointed to by the value at the top of the STG stack
   * This may also be called the info table of the ''continuation'' of the expression

The continuation code is responsible for popping its info pointer (and stack-allocated free variables, if any) from the stack before returning.

Arguments are passed on the stack, and are popped by the callee. Upon a jump to the entry code for a function, there are always precisely as many arguments on the stack as the (statically known) arity of that function, and those arguments will be followed by the info pointer of a continuation.
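The return convention can be summarised by a deliberately tiny Haskell model (all names here are invented for illustration; this is not how GHC represents anything):

{{{
type TaggedPtr = Int     -- stands in for the tagged closure pointer held in R1

-- A stack frame is a continuation: its "entry code" consumes the value that
-- was just returned, together with the rest of the stack.
newtype Frame = Frame { enterFrame :: TaggedPtr -> [Frame] -> TaggedPtr }

-- "Returning a value": put it in R1, pop the continuation, jump to its entry code.
returnValue :: TaggedPtr -> [Frame] -> TaggedPtr
returnValue r1 (frame : rest) = enterFrame frame r1 rest
returnValue r1 []             = r1    -- empty stack: evaluation has finished
}}}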
== Saturated application to known functions ==

Handling application in the STG machine is a big topic, and so in this first section we only look at the case of ''saturated'' applications to ''known'' functions - i.e. those functions that the compiler statically knows information such as the entry code pointer and arity for.

=== Example 1: function application with sufficient stack space ===

Application of functions is the bread and butter of the STG machine. Correspondingly, this first Haskell program

{{{
{-# NOINLINE known_fun #-}
known_fun :: a -> a
known_fun x = x

known_app :: () -> Int
known_app _ = known_fun 10
}}}

compiles to very simple C-- code

{{{
#!c
Main_knownzuapp_entry() {
    cl3:
        I32[Sp + 0] = stg_INTLIKE_closure+209;
        jump Main_knownzufun_entry ();
}
}}}

{{{
_Main_knownzuapp_entry:
Lcl3:
        movl L_stg_INTLIKE_closure$non_lazy_ptr,%eax
        addl $209,%eax
        movl %eax,(%ebp)
        jmp _Main_knownzufun_entry
}}}

The STG machine passes arguments to functions on the STG stack, and a pointer to the stack top is stored in the STG register `Sp`. Furthermore, because GHC currently uses the eval/apply variant of the STG machine, exactly as many arguments as the function expects to receive are guaranteed to be present on the stack. Therefore, upon entry to the `known_app` function, we are guaranteed that the STG stack has a pointer to a closure of type `()` on top of it.

In order to call `known_fun`, we just modify the top of the stack to replace that pointer with a pointer to the statically allocated closure for the literal `10`, and then tail-call into the entry code of `known_fun`.

=== Example 2: function application that needs to grow the stack ===

This Haskell code appears to be only a little more complicated than the previous example

{{{
{-# NOINLINE known_fun_2 #-}
known_fun_2 :: a -> a -> a
known_fun_2 x _ = x

known_app_2 :: () -> Int
known_app_2 _ = known_fun_2 10 10
}}}

however, it generates radically different C-- code:

{{{
#!c
Main_knownzuappzu2_entry() {
    clE:
        if (Sp - 4 < SpLim) goto clH;
        I32[Sp + 0] = stg_INTLIKE_closure+209;
        I32[Sp - 4] = stg_INTLIKE_closure+209;
        Sp = Sp - 4;
        jump Main_knownzufunzu2_entry ();
    clH:
        R1 = Main_knownzuappzu2_closure;
        jump stg_gc_fun ();
}
}}}

{{{
_Main_knownzuappzu2_entry:
LclE:
        leal -4(%ebp),%eax
        cmpl 84(%ebx),%eax
        jb LclH
        movl L_stg_INTLIKE_closure$non_lazy_ptr,%eax
        addl $209,%eax
        movl %eax,(%ebp)
        movl L_stg_INTLIKE_closure$non_lazy_ptr,%eax
        addl $209,%eax
        movl %eax,-4(%ebp)
        addl $-4,%ebp
        jmp _Main_knownzufunzu2_entry
LclH:
        movl $_Main_knownzuappzu2_closure,%esi
        jmp *-4(%ebx)
}}}

As before, upon entry the STG stack is guaranteed to have a single closure pointer at its top. However, in order to call into `known_fun_2` we need room for two arguments at the top of the stack, which means that we have to grow the stack by one word before we can make the call.

==== Checking for sufficient stack space ====

First, we check to see if growing the stack would overflow the allocated stack space, by comparing the STG stack pointer register `Sp` with the stack limit register `SpLim`:

{{{
#!c
if (Sp - 4 < SpLim) goto clH;
}}}

(The stack grows downwards, hence the ''subtraction'' of 4 from the current `Sp`). If the stack check fails, we branch to `clH`:

{{{
#!c
clH:
    R1 = Main_knownzuappzu2_closure;
    jump stg_gc_fun ();
}}}

This stores a pointer to the closure of the current function in `R1`, and then jumps into the hand-written garbage collector code to force it to grow the stack. After the stack has been grown, the collector will call back into `Main_knownzuappzu2_entry` by using the information stored in the (statically-allocated) `Main_knownzuappzu2_closure` closure pointed to by `R1`, and the stack check will be run again - hopefully succeeding this time!

==== Making the known call ====

Given that the stack check succeeds, it is easy to make the actual call we are after.
We simply grow the stack by the required amount, and write the two arguments to `known_fun_2` into the top two stack slots (overwriting our own first argument in the process, of course):

{{{
#!c
I32[Sp + 0] = stg_INTLIKE_closure+209;
I32[Sp - 4] = stg_INTLIKE_closure+209;
Sp = Sp - 4;
}}}

A simple tail call to the new function finishes us off:

{{{
#!c
jump Main_knownzufunzu2_entry ();
}}}

== Example 3: Unsaturated applications to known functions ==

Despite describing an undersaturated call, this Haskell code

{{{
{-# NOINLINE known_fun_2 #-}
known_fun_2 :: a -> a -> a
known_fun_2 x _ = x

known_undersaturated_app :: () -> Int -> Int
known_undersaturated_app _ = known_fun_2 10
}}}

compiles to straightforward C-- as follows

{{{
#!c
Main_knownzuundersaturatedzuapp_entry() {
    cmd:
        I32[Sp + 0] = stg_INTLIKE_closure+209;
        jump Main_knownzufunzu2_entry ();
}
}}}

{{{
_Main_knownzuundersaturatedzuapp_entry:
Lcmd:
        movl L_stg_INTLIKE_closure$non_lazy_ptr,%eax
        addl $209,%eax
        movl %eax,(%ebp)
        jmp _Main_knownzufunzu2_entry
}}}

The reason that there is no special magic to deal with undersaturated applications to known functions is simple: GHC simply gives `known_undersaturated_app` an arity of 2, so by the time we jump to the entry code the stack must already contain any arguments required by `known_fun_2`.

== Example 4: Applications to unknown functions ==

We aren't going to tackle oversaturated calls to known functions until we've considered what happens with calls to statically-unknown functions. To see what these look like, we are going to use the following Haskell code

{{{
unknown_app :: (Int -> Int) -> Int -> Int
unknown_app f x = f x
}}}

which compiles to this C-- function

{{{
#!c
Main_unknownzuapp_entry() {
    cnO:
        R1 = I32[Sp + 0];
        Sp = Sp + 4;
        jump stg_ap_p_fast ();
}
}}}

{{{
_Main_unknownzuapp_entry:
Lcn0:
        movl (%ebp),%esi
        addl $4,%ebp
        jmp _stg_ap_p_fast
}}}

Unlike the previous cases we have looked at, we are compiling an application where we don't statically know either the arity or the info pointer of the function being applied. To deal with such cases, the STG machine uses several pre-compiled "generic apply" functions which inspect the info table for the function in question and decide how the available arguments should be applied to it.

=== Dealing with generic application ===

There are three cases the generic apply functions have to deal with:

 1. The function's arity (recorded in the function closure's info table) exactly matches the number of arguments available on the stack
   * This is the best case. In this case, the generic apply function simply makes a tail call into the function's entry code
 2. The function's arity is greater than the number of arguments available on the stack
   * In this case, the generic apply code allocates a PAP (partial application) closure which closes over both the new arguments and the function pointer, and returns that value, in the normal STGish way, to the continuation on the top of the stack
 3. The function's arity is less than the number of arguments available on the stack
   * In this case, a number of arguments matching the arity are pushed on top of the stack, followed by a continuation which uses another of the generic apply functions to apply the remaining arguments. The code for the original function is then entered
   * Eventually the code for the continuation is entered and another generic apply function will be tail-called to deal with the result
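The three cases can be summed up by a small Haskell sketch (the types and names here are invented for illustration only; the real generic apply functions, such as `stg_ap_p_fast`, are machine-generated code in the RTS):

{{{
data Fun   = Fun { funArity :: Int, funCode :: [Value] -> Value }
data Value = Con String [Value]        -- an evaluated data constructor
           | FunV Fun                  -- a function closure
           | PapV Fun [Value]          -- a partial application (PAP)

applyGeneric :: Value -> [Value] -> Value
applyGeneric v [] = v
applyGeneric (FunV f) args =
  case compare (length args) (funArity f) of
    EQ -> funCode f args                                -- 1. arity matches: enter the function
    LT -> PapV f args                                   -- 2. too few arguments: build a PAP
    GT -> let (now, rest) = splitAt (funArity f) args   -- 3. too many: call with 'arity' of them,
          in applyGeneric (funCode f now) rest          --    then apply the result to the rest
applyGeneric (PapV f held) args = applyGeneric (FunV f) (held ++ args)
applyGeneric (Con c _) _ = error ("applied constructor " ++ c ++ " to arguments")
}}}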
Potentially, one generic apply function is required for every "argument pattern". Some example argument patterns are: a single pointer argument (handled by `stg_ap_p`), two pointer arguments (`stg_ap_pp`), three pointer arguments (`stg_ap_ppp`), a single non-pointer word (`stg_ap_n`), and so on.

Because the number of patterns is large (actually unbounded, because functions might be of any arity), GHC only generates generic apply functions for enough patterns that 99.9% of all calls observed in practice have a matching generic apply function. Generic application at larger arities can be simulated by chaining together several smaller generic apply functions, in a similar manner to the way oversaturated function applications are handled.

=== Making the call to the generic application code ===

Let's remind ourselves of the original code:

{{{
#!c
R1 = I32[Sp + 0];
Sp = Sp + 4;
jump stg_ap_p_fast ();
}}}

Knowing about generic apply functions, the call itself is easy to understand. We pop the top of the stack (the pointer to the function `f`) into `R1` and then jump into the generic application code for the case where the stack contains a single pointer argument; that code deals with all the cases for `f` described above.

== Example 5: oversaturated applications to known functions ==

This Haskell code

{{{
{-# NOINLINE known_fun_2 #-}
known_fun_2 :: a -> a -> a
known_fun_2 x _ = x

known_oversat_app :: () -> Int
known_oversat_app _ = known_fun_2 id id 10
}}}

compiles to the following C-- function

{{{
#!c
Main_knownzuoversatzuapp_entry() {
    cmj:
        if (Sp - 12 < SpLim) goto cmm;
        I32[Sp + 0] = stg_INTLIKE_closure+209;
        I32[Sp - 4] = stg_ap_p_info;
        I32[Sp - 8] = base_GHCziBase_id_closure;
        I32[Sp - 12] = base_GHCziBase_id_closure;
        Sp = Sp - 12;
        jump Main_knownzufunzu2_entry ();
    cmm:
        R1 = Main_knownzuoversatzuapp_closure;
        jump stg_gc_fun ();
}
}}}

{{{
_Main_knownzuoversatzuapp_entry:
Lcmj:
        leal -12(%ebp),%eax
        cmpl 84(%ebx),%eax
        jb Lcmm
        movl L_stg_INTLIKE_closure$non_lazy_ptr,%eax
        addl $209,%eax
        movl %eax,(%ebp)
        movl L_stg_ap_p_info$non_lazy_ptr,%eax
        movl %eax,-4(%ebp)
        movl $_base_GHCziBase_id_closure,-8(%ebp)
        movl $_base_GHCziBase_id_closure,-12(%ebp)
        addl $-12,%ebp
        jmp _Main_knownzufunzu2_entry
Lcmm:
        movl $_Main_knownzuoversatzuapp_closure,%esi
        jmp *-4(%ebx)
}}}

As you can see, despite being a call to a known function, this code makes use of the generic apply machinery we discussed in the last section. Let's pick the function apart and see how it works.

First, we do the usual stack check. What differs from the last time we saw this check is that we are not only allocating space for arguments on the stack, but also for a ''continuation''. We set up these new stack entries as follows:

{{{
#!c
I32[Sp + 0] = stg_INTLIKE_closure+209;
I32[Sp - 4] = stg_ap_p_info;
I32[Sp - 8] = base_GHCziBase_id_closure;
I32[Sp - 12] = base_GHCziBase_id_closure;
Sp = Sp - 12;
}}}

i.e. the final stack looks as follows (note that the code overwrites the old pointer to a closure of type `()`):

{{{
/----\    +---------------------------+
| Sp |--->| base_GHCziBase_id_closure |
\----/    +---------------------------+
          | base_GHCziBase_id_closure |
          +---------------------------+
          | stg_ap_p_info             |
          +---------------------------+
          | stg_INTLIKE_closure+209   |
          +---------------------------+
          | ...                       |
}}}

Because `known_fun_2` is of arity 2, when we jump to its entry code, it will only consume the top two arguments from the stack: i.e. the two pointers to `base_GHCziBase_id_closure`. It will then evaluate to some sort of value and transfer control to the entry code for `stg_ap_p_info`.
This is where the magic happens: the entry code for `stg_ap_p_info` will apply the function value that was returned from `known_fun_2` to the (pointer) argument in the "free variable" of its (stack allocated) closure -- and we have arranged that that is `stg_INTLIKE_closure+209`, i.e. the closure for the `Int` literal `10`. This code is shared with the generic application functions for calls to unknown functions, so it will make use of the `stg_ap_p_fast` function we saw before. Finally, control will be transferred back to the caller of `known_oversat_app`, and all will be well.

== Example 6: allocation of thunks and data ==

Something that happens all the time in Haskell is allocation. There are three principal types of thing that get allocated: function closures, thunks, and data. These are all treated pretty much the same in the STG machine for the simple reason that they share many common characteristics:

 * Entry code which the STG machine jumps to, in order to evaluate them
   * Note that for constructors, the entry code is trivial, as they are always already evaluated! In this case, control will be transferred directly back to the caller's continuation.
 * Free variables stored in a closure
   * For data, these "free variables" will be the values in the fields of the particular data constructor
 * Info tables containing various miscellaneous metadata about the heap object, such as function arity

Let us look at how a thunk and a data constructor get allocated in a simple setting:

{{{
build_data :: Int -> Maybe Int
build_data x = Just (x + 1)
}}}

This compiles into the following C--:

{{{
#!c
Main_buildzudata_entry() {
    clE:
        Hp = Hp + 20;
        if (Hp > HpLim) goto clH;
        I32[Hp - 16] = slk_info;
        I32[Hp - 8] = I32[Sp + 0];
        I32[Hp - 4] = base_DataziMaybe_Just_con_info;
        I32[Hp + 0] = Hp - 16;
        R1 = Hp - 2;
        Sp = Sp + 4;
        jump (I32[I32[Sp + 0]]) ();
    clI:
        R1 = Main_buildzudata_closure;
        jump stg_gc_fun ();
    clH:
        HpAlloc = 20;
        goto clI;
}
}}}

{{{
_Main_buildzudata_entry:
LclE:
        addl $20,%edi
        cmpl 92(%ebx),%edi
        ja LclH
        movl $_slk_info,-16(%edi)
        movl (%ebp),%eax
        movl %eax,-8(%edi)
        movl $_base_DataziMaybe_Just_con_info,-4(%edi)
        leal -16(%edi),%eax
        movl %eax,(%edi)
        leal -2(%edi),%esi
        addl $4,%ebp
        movl (%ebp),%eax
        jmp *(%eax)
LclH:
        movl $20,112(%ebx)
LclI:
        movl $_Main_buildzudata_closure,%esi
        jmp *-4(%ebx)
}}}

Let's break this function down slowly.

=== Checking for sufficient heap space ===

Any function that needs to allocate memory might find that the heap has been exhausted. If that happens, it needs to call into the garbage collector in order to get the heap cleaned up and (possibly) enlarged. Hence, the first thing any such function does is check to see if enough memory is available for its purposes:

{{{
#!c
clE:
    Hp = Hp + 20;
    if (Hp > HpLim) goto clH;

    ...

clI:
    R1 = Main_buildzudata_closure;
    jump stg_gc_fun ();
clH:
    HpAlloc = 20;
    goto clI;
}}}

This is simple enough. The function needs to allocate 20 bytes (the data constructor takes up 2 words, and the thunk will take up 3), so it speculatively increments `Hp` and then checks the STG registers `Hp` and `HpLim` (the pointer to the top of the available heap space) against each other. If memory is insufficient (i.e. we have moved `Hp` past the top of the available heap), the code deals with it by setting the `HpAlloc` register to the number of bytes needed and `R1` to the closure for the function in which the heap check failed, before jumping into the hand-written garbage collector code for the cleanup.
The garbage collector will resume execution of the code by using the information from `R1`, after it has freed up enough memory.

Side note: I believe that the line setting `R1` is unnecessary here, because `R1` should anyway always be set to the address of the closure when executing the closure entry code. I could be wrong, though.

=== Performing the actual allocation ===

Once the heap check succeeds, we can enter the body of the function proper. Since `Hp` has already been incremented, we can just construct the new heap objects directly:

{{{
#!c
I32[Hp - 16] = slk_info;
I32[Hp - 8] = I32[Sp + 0];
I32[Hp - 4] = base_DataziMaybe_Just_con_info;
I32[Hp + 0] = Hp - 16;
}}}

So we get something like this:

{{{
          | ...                            |
          +--------------------------------+
          | slk_info                       |<-\   Pointer to thunk info table
          +--------------------------------+  |
          | (undefined)                    |  |
          +--------------------------------+  |
          | x                              |  |   The "x + 1" thunk's free variable
          +--------------------------------+  |
          | base_DataziMaybe_Just_con_info |  |   Pointer to Just info table
/----\    +--------------------------------+  |
| Hp |--->|                                |--/   Free variable of Just constructor
\----/    +--------------------------------+
}}}

The bottom two words are the allocated `Just` value, and the three above that correspond to the `x + 1` closure.

=== Returning an allocated value to the caller ===

Now that we have allocated the data we entered the function in order to construct, we need to return it to the caller. This is achieved by the following code:

{{{
#!c
R1 = Hp - 2;
Sp = Sp + 4;
jump (I32[I32[Sp + 0]]) ();
}}}

To return, the STG machine:

 1. Sets `R1` to the pointer to the result of evaluation
 2. Pops all the arguments to the function from the stack
 3. Jumps to the entry code for the continuation. This is always found at the top of the STG stack, logically below any arguments that were pushed to make the call.

This is indeed exactly what happens here, with two interesting points: pointer tagging, and the double dereference used to find the continuation's entry code. These will be discussed in the next two subsections.

==== Pointer tagging ====

One exciting feature is the code setting `R1`, i.e. `R1 = Hp - 2`. This sets `R1` to point to the `Just` value we just allocated, but simultaneously tags that pointer with the value 2. The fact that the tag is non-zero indicates to users of the pointer that the thing pointed to is already evaluated. Furthermore, because `Maybe` has only two constructors, we are able to use the pointer tags to record which constructor it evaluated to: in this case, the 2 indicates the `Just` constructor.

It is compulsory to tag pointers before jumping to the address of the continuation entry code: the entry code can and will rely on those tags being present!
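The tag arithmetic itself is simple. As a sketch, here it is spelled out in Haskell for a 32-bit machine (`Ptr32` is just a machine word standing in for a heap pointer, not a real GHC type):

{{{
import Data.Bits (complement, (.&.))
import Data.Word (Word32)

type Ptr32 = Word32

tagOf :: Ptr32 -> Word32
tagOf p = p .&. 3                  -- the tag lives in the two low bits

isEvaluated :: Ptr32 -> Bool
isEvaluated p = tagOf p /= 0       -- any non-zero tag means "already in WHNF"

untag :: Ptr32 -> Ptr32
untag p = p .&. complement 3       -- recover the real closure address

-- For Maybe, tag 1 would mean Nothing and tag 2 Just, which is why
-- "R1 = Hp - 2" both points at the freshly built closure (which starts at
-- Hp - 4) and announces that it is a Just.
}}}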
==== `TABLES_NEXT_TO_CODE` ====

Because I have compiled GHC without `TABLES_NEXT_TO_CODE`, the entry code for the continuation is found by dereferencing the pointer to the info table we found at the top of the STG stack - i.e. a double dereference. The layout of heap objects without `TABLES_NEXT_TO_CODE` is as follows:

{{{
       Closure              Info table        Entry code
       +--------------+     +--------------+  +------------+
x ---->| Info pointer |---->| Code pointer |->| ...        |
       +--------------+     +--------------+
       | .. FVs ..    |     | ...          |
}}}

With `TABLES_NEXT_TO_CODE` on, the situation looks more like this:

{{{
       Closure              | .. Info table .. |
       +--------------+     +------------------+
x ---->| Info pointer |---->| .. Entry code .. |
       +--------------+     |                  |
       | .. FVs ..    |
}}}

The `TABLES_NEXT_TO_CODE` optimisation removes the need for that second dereference during the return, because the entry code is always right next to the info table. However, it requires special support from the backend to ensure that data (i.e. the info table) and code are contiguous in memory, so it cannot always be used.

== Example 7: `case` expressions ==

Let us now examine how `case` expressions are handled. Compiling the following Haskell

{{{
case_scrut :: Maybe Int -> Int
case_scrut x = case x of Just x -> x; Nothing -> 10
}}}

produces this C-- code

{{{
#!c
Main_casezuscrut_entry() {
    ccx:
        R1 = I32[Sp + 0];
        I32[Sp + 0] = scj_info;
        if (R1 & 3 != 0) goto ccA;
        jump (I32[I32[R1]]) ();
    ccA:
        jump (I32[scj_info]) ();
}

scj_ret() {
    cct:
        _ccu::I32 = R1 & 3;
        if (_ccu::I32 >= 2) goto ccv;
        R1 = stg_INTLIKE_closure+209;
        Sp = Sp + 4;
        jump (I32[I32[Sp + 0]]) ();
    ccv:
        R1 = I32[R1 + 2];
        Sp = Sp + 4;
        R1 = R1 & (-4);
        jump (I32[I32[R1]]) ();
}
}}}

{{{
_Main_casezuscrut_entry:
Lccx:
        movl (%ebp),%esi
        movl $_scj_info,(%ebp)
        testl $3,%esi
        jne LccA
        movl (%esi),%eax
        jmp *(%eax)
LccA:
        jmp *_scj_info

_scj_ret:
Lcct:
        movl %esi,%eax
        andl $3,%eax
        cmpl $2,%eax
        jae Lccv
        movl L_stg_INTLIKE_closure$non_lazy_ptr,%eax
        leal 209(%eax),%esi
        addl $4,%ebp
        movl (%ebp),%eax
        jmp *(%eax)
Lccv:
        movl 2(%esi),%esi
        addl $4,%ebp
        andl $-4,%esi
        movl (%esi),%eax
        jmp *(%eax)
}}}

Notice that GHC has generated ''two'' functions: `Main_casezuscrut_entry` and `scj_ret` correspond to the code for forcing the argument to the `case`, and to the ''continuation'' of the `case`, respectively. Let's pick them apart and see how they work!

=== Forcing the scrutinee of the `case` ===

When we first call the `case_scrut` function, its entry code begins executing:

{{{
#!c
ccx:
    R1 = I32[Sp + 0];
    I32[Sp + 0] = scj_info;
    if (R1 & 3 != 0) goto ccA;
    jump (I32[I32[R1]]) ();
ccA:
    jump (I32[scj_info]) ();
}}}

This is a function of arity 1 (i.e. with a single argument), so upon entry the machine state looks like this:

{{{
/----\     +--------------------------+
| R1 |---->| Main_casezuscrut_closure |
\----/     +--------------------------+

/----\     +------------+
| Sp |---->| x          |
\----/     +------------+
           | ...        |
}}}

Because this is a top level function, the closure is statically allocated and contains no free variables. However, as discussed previously, the single argument to the function is guaranteed to be present at the top of the stack. The code starts off by saving this argument (the `x`) temporarily into `R1`:

{{{
#!c
ccx:
    R1 = I32[Sp + 0];
}}}

The next thing the code does is overwrite this argument on the stack with a pointer to the info table of the continuation code. This is the code that will be invoked after `x` has been evaluated to WHNF, and which will do the test to decide whether to continue as the `Nothing` or as the `Just` branch of the `case`:

{{{
#!c
I32[Sp + 0] = scj_info;
}}}

As we saw earlier, any time that the STG machine decides that it has a value in its hand, it will continue evaluation by tail-calling the entry code found by dereferencing the info-table pointer at the top of the stack. So by putting the address of our continuation in here, we ensure that the entry code for `scj_info` is executed after `x` becomes a value.

Now, what we need to do is to start the evaluation of `x`. We could just jump into `x`'s entry code and hope for the best, but thanks to GHC's pointer tagging we can sometimes avoid doing this indirect branch. So, instead, we test to see if the `x` pointer has a tag.
If it is tagged, then we know that it is already evaluated and hence we jump directly to the code for the continuation. If it is not tagged, we are forced to make the jump into the entry code for `x`. This choice is embodied by the following code:

{{{
#!c
if (R1 & 3 != 0) goto ccA;
jump (I32[I32[R1]]) ();
ccA:
jump (I32[scj_info]) ();
}}}

Note the test `R1 & 3 != 0`: this reflects the fact that pointer tags are stored in the lower 2 bits of the pointer on 32 bit machines. Another interesting feature is how the `jump` instructions find the entry code: again, we see a dereference of the info pointer because `TABLES_NEXT_TO_CODE` is turned off.

As we saw, the `case` scrutinisation code ended with one of two things happening:

 1. A direct call into the continuation code `scj_ret`, if the scrutinee was already evaluated
 2. A call into the entry code for the scrutinee, if the scrutinee was not evaluated (or it ''was'' evaluated, but the pointer was somehow not tagged with that information)
   * Because we pushed `scj_info` onto the STG stack, control will eventually return to `scj_ret` after the evaluation of `x` has finished

It is now time to examine the continuation code to see what happens after `x` becomes a value.

=== Dealing with the forced scrutinee ===

The continuation code is a little more complicated:

{{{
#!c
cct:
    _ccu::I32 = R1 & 3;
    if (_ccu::I32 >= 2) goto ccv;
    R1 = stg_INTLIKE_closure+209;
    Sp = Sp + 4;
    jump (I32[I32[Sp + 0]]) ();
ccv:
    R1 = I32[R1 + 2];
    Sp = Sp + 4;
    R1 = R1 & (-4);
    jump (I32[I32[R1]]) ();
}}}

Whenever the STG machine finishes evaluating something to a value, it will return that value by jumping to the entry point at the top of the stack. In this case, `R1` is guaranteed to be a (tagged) pointer to the thing that was just evaluated. Because we are scrutinising a `Maybe` type (which has fewer than 4 constructors) the code for the `case` continuation is able to use the tag bits on the returned pointer to decide which of the two branches to take:

{{{
#!c
cct:
    _ccu::I32 = R1 & 3;
    if (_ccu::I32 >= 2) goto ccv;
}}}

If we were scrutinising a data type with more constructors, the tag bits would only tell us that the thing was evaluated, not which constructor it was evaluated to. In that case, we would have to discover the constructor by dereferencing `R1` and consulting the info table it points to.

If the tag was greater than or equal to 2, we go to the `ccv` branch, which deals with what happens if we had a `Just`. In this case, we need to continue by forcing the thunk inside the `Just` and returning that value to our caller, which is what these lines are doing:

{{{
#!c
ccv:
    R1 = I32[R1 + 2];
    Sp = Sp + 4;
    R1 = R1 & (-4);
    jump (I32[I32[R1]]) ();
}}}

To access the thing inside the `Just`, the code assumes that the `R1` pointer is tagged with the 2 that indicates a `Just` constructor, and hence finds the first free variable (stored 4 bytes into the closure) using `I32[R1 + 2]`, which is then saved into `R1`. It pops the address of `scj_info` that was pushed onto the stack in `Main_casezuscrut_entry` by moving `Sp` up 4 bytes (remember that the STG stack grows downwards), and then untags and enters the closure now in `R1`, using the same double-dereference pattern discussed earlier.

There seems to be a small missed opportunity here: the code could check the pointer tag on `R1`, and then return directly if it is set. I imagine that this isn't being done in order to reduce possible code bloat.
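Stepping back, the overall shape of the two generated functions corresponds to splitting the source program into "force the scrutinee" and "continue with the result", which we can mimic in plain Haskell (a conceptual sketch only; the helper `forceThen` is an invented name):

{{{
-- Main_casezuscrut_entry plays the role of forceThen (push the continuation,
-- then force the scrutinee); scj_ret plays the role of continuation.
case_scrut :: Maybe Int -> Int
case_scrut x = forceThen x continuation
  where
    -- the scj_ret continuation: runs once x is in weak head normal form
    continuation Nothing  = 10
    continuation (Just y) = y

    -- stand-in for "evaluate to WHNF, then enter the continuation on the stack"
    forceThen :: Maybe Int -> (Maybe Int -> Int) -> Int
    forceThen v k = v `seq` k v
}}}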
== Example 8: thunks and thunk update ==

You might be wondering how the `x + 1` thunk we saw allocated in a previous section will behave when it is actually forced. To remind you, the thunk we saw was constructed by the following Haskell code:

{{{
build_data :: Int -> Maybe Int
build_data x = Just (x + 1)
}}}

So how does the `x + 1` thunk work? An excellent question! Let's take a look at the C-- for its entry code and find out:

{{{
#!c
slk_entry() {
    cph:
        if (Sp - 12 < SpLim) goto cpj;
        I32[Sp - 8] = stg_upd_frame_info;
        I32[Sp - 4] = R1;
        R1 = I32[R1 + 8];
        I32[Sp - 12] = soN_info;
        Sp = Sp - 12;
        if (R1 & 3 != 0) goto cpk;
        jump (I32[I32[R1]]) ();
    cpj:
        jump stg_gc_enter_1 ();
    cpk:
        jump (I32[soN_info]) ();
}

soN_ret() {
    cp7:
        Hp = Hp + 8;
        if (Hp > HpLim) goto cpc;
        _soL::I32 = I32[R1 + 3] + 1;
        I32[Hp - 4] = ghczmprim_GHCziTypes_Izh_con_info;
        I32[Hp + 0] = _soL::I32;
        R1 = Hp - 3;
        Sp = Sp + 4;
        jump (I32[stg_upd_frame_info]) ();
    cpd:
        jump stg_gc_enter_1 ();
    cpc:
        HpAlloc = 8;
        goto cpd;
}
}}}

{{{
_sst_entry:
Lcph:
        leal -12(%ebp),%eax
        cmpl 84(%ebx),%eax
        jb Lcpj
        movl L_stg_upd_frame_info$non_lazy_ptr,%eax
        movl %eax,-8(%ebp)
        movl %esi,-4(%ebp)
        movl 8(%esi),%esi
        movl $_soN_info,-12(%ebp)
        addl $-12,%ebp
        testl $3,%esi
        jne Lcpk
        movl (%esi),%eax
        jmp *(%eax)
Lcpj:
        jmp *-8(%ebx)
Lcpk:
        jmp *_soN_info

_soN_ret:
Lcp7:
        addl $8,%edi
        cmpl 92(%ebx),%edi
        ja Lcpc
        movl 3(%esi),%eax
        incl %eax
        movl $_ghczmprim_GHCziTypes_Izh_con_info,-4(%edi)
        movl %eax,(%edi)
        leal -3(%edi),%esi
        addl $4,%ebp
        movl L_stg_upd_frame_info$non_lazy_ptr,%eax
        jmp *(%eax)
Lcpc:
        movl $8,112(%ebx)
Lcpd:
        jmp *-8(%ebx)
}}}

The original Haskell code read `x + 1`, but GHC has inlined the actual code for the addition operation on `Int`s, which looks something like:

{{{
plusInt (I# a) (I# b) = I# (a + b)
}}}

The second pattern match (to get `b`) has been performed statically by GHC, obtaining the machine literal 1, which shows up directly in the generated code. Therefore, the code only needs to evaluate and case-decompose the unknown free variable `x` of our closure, to get the `a` argument to `plusInt`.

=== Thunk entry point ===

This evaluation is what is being done by the thunk entry code `slk_entry`. Ignoring the stack check, the C-- begins thusly:

{{{
#!c
I32[Sp - 8] = stg_upd_frame_info;
I32[Sp - 4] = R1;
R1 = I32[R1 + 8];
I32[Sp - 12] = soN_info;
Sp = Sp - 12;
}}}

Remembering that upon entry to the thunk entry code, `R1` points to the thunk's closure, the new stack looks as follows:

{{{
/----\     +----------------------+
| Sp |---->| soN_info             |
\----/     +----------------------+
           | stg_upd_frame_info   |
           +----------------------+      +-------------+
           | (thunk closure ptr)  |----->| slk_info    |
           +----------------------+      +-------------+
           | ...                  |      | (undefined) |
                                         +-------------+
                                         | x           |   Free variable of thunk
                                         +-------------+
}}}

The C-- statement `R1 = I32[R1 + 8]` is pulling out the pointer to the free variable of the thunk (which was set up in `Main_buildzudata_entry`) into `R1`. Finally, the entry code evaluates that free variable (checking the tag bits of the pointer first, as usual):

{{{
#!c
if (R1 & 3 != 0) goto cpk;
jump (I32[I32[R1]]) ();
cpk:
jump (I32[soN_info]) ();
}}}

Because we put `soN_info` at the top of the stack, when evaluation of `x` is complete the STG machine will continue by executing the `soN_ret` code. The most interesting feature of this code is the extra stuff that has been pushed onto the stack below the `soN_info` frame: an info pointer called `stg_upd_frame_info`, and a pointer to the thunk currently being evaluated.

This is all part of the STG machine's thunk update mechanism. When the `soN_ret` continuation returns, it will transfer control ''not'' to the code that forced the thunk, but to some code which overwrites the contents of the current thunk closure with a closure representing an "indirection". The entry code for such an indirection closure is trivial: it immediately returns, in `R1`, a pointer to the value that the `soN_ret` continuation produced. These indirections are the mechanism which ensures that the STG machine never repeats the work of evaluating a thunk: after the first evaluation, any code forcing the thunk jumps into the indirection entry code rather than `slk_entry`.
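The effect of update frames and indirections can be modelled quite directly in Haskell using a mutable reference (a toy sketch, of course -- the real mechanism works by pushing `stg_upd_frame_info` onto the stack and overwriting the thunk's header in place):

{{{
import Data.IORef

-- A heap cell is either an un-entered thunk or an indirection to its value.
data Cell a = Thunk (IO a) | Indirection a

-- "Entering" the cell: on first entry run the thunk's code and overwrite the
-- cell with an indirection; on later entries just follow the indirection.
force :: IORef (Cell a) -> IO a
force ref = do
  cell <- readIORef ref
  case cell of
    Indirection v -> return v            -- no work is ever repeated
    Thunk code    -> do
      v <- code                          -- what slk_entry and soN_ret compute
      writeIORef ref (Indirection v)     -- what the update frame arranges
      return v
}}}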
That being said, let us look at how the continuation responsible for actually finding the value of `x + 1` works.

=== Continuation of the thunk ===

Upon entry to the continuation code, we have the evaluated `x` in `R1`: the continuation now needs to do the addition and allocate an `I#` constructor to hold the result. Because of the allocation, `soN_ret` begins with a heap check. Ignoring that check, we have the following code:

{{{
#!c
_soL::I32 = I32[R1 + 3] + 1;
I32[Hp - 4] = ghczmprim_GHCziTypes_Izh_con_info;
I32[Hp + 0] = _soL::I32;
R1 = Hp - 3;
Sp = Sp + 4;
jump (I32[stg_upd_frame_info]) ();
}}}

This is mostly standard stuff. Because the `R1` pointer is guaranteed to be tagged, and there is only one possible constructor, the tag must be 1 and so the `Int#` value inside the `Int` is pulled out using `I32[R1 + 3]`. This is then put into a newly heap-allocated `I#` constructor, which is returned in `R1` after we pop the `soN_info` pointer from the stack.

The only interesting point is where we return to: rather than dereference `Sp` to find the info pointer at the top of the STG stack, GHC has generated code that takes advantage of the fact that the word at the top of the stack is guaranteed to be `stg_upd_frame_info`. This avoids one pointer dereference.

== Conclusion ==

This document has left out much of the detail of how STG is implemented: notable omissions include CAFs, and the precise behaviour of the garbage collector. Nonetheless, my hope is that it has helped you to gain some more insight into the weird and wonderful way the Haskell evaluation model is implemented.