Performance

Nomen est numen: "to name is to know"

When I got my start working in Haskell (back in the previous millennium), it felt like a kind of miracle that this magic we were writing could be executed at all.

Nowadays, I feel a little cheated if my code doesn't perform within a small factor of C code.

We get from "it runs!" to "it's small and fast!" by measuring, tweaking, and measuring again. And again.

And again.

Measuring time and space

There are two aspects of performance that interest us: time (how long our code takes to run) and space (how much memory it needs while doing so).

Let's start with time.

Computing the length of a list

An old favourite:

module Length where

len0 :: [a] -> Int
len0 (_:xs) = 1 + len0 xs
len0 _ = 0

Yawn, right?

But how long does it take to run?

Measuring time performance

The standard Haskell tool for timing measurement is a package named criterion.

cabal install criterion

criterion makes it extremely easy to get a benchmark up and running.

import Criterion.Main
import Length

main = defaultMain [ bench "len0" $ whnf len0 [0..100000] ]

If we compile this to an executable, we'll have a fully usable benchmark program.
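For example, if we save the benchmark above as LengthBench.hs (a file name chosen here for illustration), we build and run it like any other program:

ghc -O --make LengthBench.hs
./LengthBench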

The moving parts

The defaultMain function accepts a list of benchmarks to run.

It parses a standard set of command line arguments, then runs the benchmarks.

The bench function describes a single benchmark: it pairs a name (used in the report) with an activity to measure.

The whnf function describes how to run a benchmark: apply a function to an argument, and evaluate the result to weak head normal form (WHNF).

How to run a benchmark

criterion provides several ways to run a benchmark.

For pure functions: whnf evaluates a function's result to weak head normal form, while nf evaluates it all the way to normal form.

For impure IO actions: whnfIO and nfIO do the same for the result of an IO action.
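Here is a minimal sketch showing each runner in use (the file name words.txt is hypothetical):

import Criterion.Main

main :: IO ()
main = defaultMain
  [ -- whnf stops at the outermost constructor, so the map is barely evaluated
    bench "map whnf" $ whnf (map (+1)) [1..1000 :: Int]
    -- nf forces every element of the result list
  , bench "map nf"   $ nf (map (+1)) [1..1000 :: Int]
    -- nfIO runs an IO action and forces its result to normal form
  , bench "readFile" $ nfIO (readFile "words.txt")
  ]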

The ideal benchmarking environment, and the reality

Ideally, we'd always benchmark on a completely quiet machine with predictable performance.

There are many reasons why this is not achievable, among them: the operating system schedules other processes alongside ours, power management varies the CPU's clock speed, and caches and virtual memory add noise of their own.

So are we doomed?

A sanity check

While we can't directly observe sources of measurement interference, we can detect when interference has perturbed our measurements.

If the interference is moderate to severe, it will perturb many measurements, and criterion will indicate that its measurements are suspicious.

What do we see?

The output of a criterion run:

warming up
estimating clock resolution...
mean is 2.083073 us (320001 iterations)
found 42185 outliers among 319999 samples (13.2%)
  19877 (6.2%) low severe
  22308 (7.0%) high severe
estimating cost of a clock call...
mean is 57.19379 ns (20 iterations)
found 3 outliers among 20 samples (15.0%)
  3 (15.0%) high mild

benchmarking len0
mean: 1.490665 ms, lb 1.458564 ms, ub 1.531022 ms, ci 0.950
std dev: 183.9797 us, lb 151.8929 us, ub 242.1031 us, ci 0.950

Why scrutinize the clock so closely?

criterion works hard to be fully automatic.

It considers clock resolution: the smallest unit by which the wallclock timer will increment.

Why? If a single run of a benchmark finishes in less than the clock's resolution, criterion must time a batch of many runs to get a measurement it can trust.

It also considers clock cost, i.e. how long it takes to ask the clock the current time, so that this overhead can be accounted for.

Reporting numbers

What about these numbers?

benchmarking len0
mean: 1.490665 ms, lb 1.458564 ms, ub 1.531022 ms, ci 0.950
std dev: 183.9797 us, lb 151.8929 us, ub 242.1031 us, ci 0.950

How come we're giving bounds (lb and ub) on the mean and standard deviation?

Measuring is a noisy business. The bounds are 95% confidence intervals (the ci 0.950 above), estimated by resampling the measurements: we expect the true mean and standard deviation to lie within them with 95% confidence.

Space measurements at a glance

We can get a great start by building our programs with the -rtsopts option. This allows us to pass extra options to GHC's runtime system when we run our compiled program.

Suppose we want to see the space performance of this program.

import Control.Monad (forM_)
import Data.List (sortBy)
import Data.Ord (comparing)
import System.Environment (getArgs)
import qualified Data.Map as M

main = do
  args <- getArgs
  forM_ args $ \f -> do
    ws <- words `fmap` readFile f
    forM_ (sortBy (comparing snd) . M.toList .
           foldl (\m w -> M.insertWith (+) w 1 m) M.empty $ ws) $
      \(w,c) -> putStrLn $ show c ++ "\t" ++ w

We compile it with the following command line:

ghc -O -rtsopts --make WordFreq.hs

GC summary statistics

The +RTS command line option begins a series of RTS options. The RTS will hide these from our program.

The -RTS option ends the series of RTS options. (We can omit it if RTS options are the last thing on the command line.)

./WordFreq foo.txt +RTS -s -RTS

The -s RTS option tells the runtime to print summary statistics from the memory manager.

What do GC summary statistics look like?

   160,394,496 bytes allocated in the heap
   104,813,280 bytes copied during GC
    15,228,592 bytes maximum residency (9 sample(s))
       328,112 bytes maximum slop
        36 MB total memory in use (0 MB lost due to fragmentation)

                  Tot time (elapsed)  Avg pause  Max pause
Gen  0       297 colls,     0 par    0.10s    0.10s     0.0003s    0.0021s
Gen  1         9 colls,     0 par    0.08s    0.10s     0.0113s    0.0350s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.12s  (  0.13s elapsed)
GC      time    0.18s  (  0.20s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.31s  (  0.33s elapsed)

%GC     time      59.1%  (60.7% elapsed)

Alloc rate    1,280,768,615 bytes per MUT second

Productivity  40.9% of total user, 37.6% of total elapsed

Stats, part 1 of 4

Let's break it all down, from top to bottom.

   160,394,496 bytes allocated in the heap
   104,813,280 bytes copied during GC
    15,228,592 bytes maximum residency (9 sample(s))
       328,112 bytes maximum slop
        36 MB total memory in use (0 MB lost due to fragmentation)

Key statistics to look at: bytes allocated in the heap is the total allocated over the whole run (a measure of how much work the allocator did), while maximum residency is the largest amount of live data at any one time, i.e. the program's real memory demand.

Stats, part 2 of 4

Time spent in the garbage collector (Gen 0 collections are frequent, cheap collections of the young generation; Gen 1 collections are rarer, more expensive major ones):

                  Tot time (elapsed)  Avg pause  Max pause
Gen  0       297 colls,     0 par    0.10s    0.10s     0.0003s    0.0021s
Gen  1         9 colls,     0 par    0.08s    0.10s     0.0113s    0.0350s

Stats, part 3 of 4

Where the program spent its time:

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.12s  (  0.13s elapsed)
GC      time    0.18s  (  0.20s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.31s  (  0.33s elapsed)

The two columns give CPU time and elapsed (wall-clock) time; they can differ when we're running on multiple cores. MUT is "mutator" time: time spent running our program itself, as opposed to the garbage collector.

Stats, part 4 of 4

These are really the most useful numbers to look at:

%GC     time      59.1%  (60.7% elapsed)

Alloc rate    1,280,768,615 bytes per MUT second

Productivity  40.9% of total user, 37.6% of total elapsed

There were problems in our code (nearly 60% of our time went to the GC!), but what were they?

Next step: basic heap profiling

Another standard RTS option:

./WordFreq foo.txt +RTS -hT

This generates a file named WordFreq.hp, which contains a heap profile, a time-based snapshot of what was in the heap and when, categorized by data constructor.

We can't easily read a heap profile directly, so we use the hp2ps tool to convert it to a PostScript file (the -c option asks for colour output).

hp2ps -c WordFreq.hp

This will give us WordFreq.ps, which we can open in a suitable viewer.

Heap profiler output

Not so happy looking!

Clearly we're allocating a ton of cons cells, and half a ton of thunks.

Our program with its space leaks fixed

Does this heap profile look better?

What a relief!

It's still a similar shape, but look at the units on the y axis above!

Also, check out the healthier GC summary stats:

%GC     time      33.6%  (32.9% elapsed)

Alloc rate    1,246,782,295 bytes per MUT second

Productivity  66.4% of total user, 63.3% of total elapsed
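The fixed source isn't reproduced above, but here is a sketch of one plausible version of the fix (not necessarily the exact code used, and assuming a containers version that provides Data.Map.Strict): a strict left fold plus a strict map forces each count as it is computed, instead of building a chain of (+) thunks inside the map.

import Control.Monad (forM_)
import Data.List (foldl', sortBy)
import Data.Ord (comparing)
import System.Environment (getArgs)
import qualified Data.Map.Strict as M  -- values are forced on insert

main :: IO ()
main = do
  args <- getArgs
  forM_ args $ \f -> do
    ws <- words `fmap` readFile f
    -- foldl' forces the accumulated map at every step
    forM_ (sortBy (comparing snd) . M.toList $
           foldl' (\m w -> M.insertWith (+) w 1 m) M.empty ws) $
      \(w,c) -> putStrLn $ show c ++ "\t" ++ w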

Full heap profiling

Basic heap profiling is useful, but GHC supports a much richer way to profile our code.

This richer profiling support has a space and time cost, so we don't leave it turned on all the time.

To use it, we must compile both libraries and programs with -prof.

If you're using cabal, see the --enable-library-profiling and --enable-executable-profiling options.

More about full heap profiling

The basics of full heap profiling are similar to what we saw with -hT and hp2ps a moment ago.

The full profiler is a powerful facility, so it's worth reading the profiling chapter of the GHC manual.

In particular, to get much out of the profiler, you'll need to know about cost centres, which are annotated expressions used for book-keeping when profiling.

In many cases, you can simply use the -auto-all option to get GHC to annotate all top-level bindings with cost centres.

You'll also want to use the -P RTS option, which writes a human-readable time and space profile into a file ending with a .prof extension.
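Putting those pieces together, a typical session (using the GHC 7-era flag names mentioned above) might look like this:

ghc -O -prof -auto-all -rtsopts --make WordFreq.hs
./WordFreq foo.txt +RTS -P
less WordFreq.prof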

What about this?

Please enter the following definition into a source file, and load it into ghci.

sum0 :: [Integer] -> Integer
sum0 (x:xs) = x + sum0 xs
sum0 _ = 0

Now please tell me the result of this expression.

sum0 [0..9876543]

Once you're done, please tell me the result of this expression.

sum0 [0..98765432]

Welcome to Core

Given our earlier definition of the function len0, suppose we were to try this on the command line:

ghc -c -ddump-simpl Length.hs

And we'll see GHC dump a transformed version of our code in a language named Core.

Rec {
Length.len0 [Occ=LoopBreaker]
  :: forall a_abp. [a_abp] -> GHC.Types.Int
[GblId, Arity=1]
Length.len0 =
  \ (@ a_aov) (ds_dpn :: [a_aov]) ->
    case ds_dpn of _ {
      [] -> GHC.Types.I# 0;
      : ds1_dpo xs_abq ->
        GHC.Num.+
          @ GHC.Types.Int
          GHC.Num.$fNumInt
          (GHC.Types.I# 1)
          (Length.len0 @ a_aov xs_abq)
    }
end Rec }

What is Core?

Core is also known as System FC, a greatly simplified version of Haskell that is used internally (in abstract form) by GHC.

Real Haskell code is compiled to Core, which is then transformed repeatedly by various optimization passes. These live in the simplifier, roughly the middle of the compilation pipeline.

What we see with -ddump-simpl is a pretty-printed version of the abstract Core representation after the simplifier has finished all of its transformations.

Isn't Core scary?

Let's walk through some Core, for fun.

From the outside in

Rec {
Length.len0 [Occ=LoopBreaker]
  :: forall a_abp. [a_abp] -> GHC.Types.Int
{- ... -}
end Rec }

The Rec { ... } wrapper marks a recursive binding (len0 calls itself), and the explicit forall in the signature shows that GHC has renamed our type variable a to a_abp.

Function annotations

[GblId, Arity=1]

These annotations tell us that len0 is a global identifier (GblId) and takes one argument (Arity=1).

Type application

Length.len0 =
\ (@ a_aov) (ds_dpn :: [a_aov]) ->

The '@' annotation here is a type application: GHC is applying the type a_aov (another renaming of a) to the function.

Type applications are of little real interest to us right here, but at least we know what this notation is (and we'll see it again soon).

Case analysis, part 1

    case ds_dpn of _ {
      [] -> GHC.Types.I# 0;

This looks like regular Haskell. Hooray!

Since that's hardly interesting, let's focus on the right hand side above, namely this expression:

GHC.Types.I# 0

The I# above is the value constructor for the Int type.

This indicates that we are allocating a boxed integer on the heap.
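For reference, Int really is an ordinary algebraic data type; its definition in GHC.Types is a box around a raw, unboxed machine integer:

data Int = I# Int#   -- I# boxes the primitive, unboxed Int#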

Case analysis, part 2

      : ds1_dpo xs_abq ->

Normal pattern matching on the list type's : constructor. In Core, we use prefix notation, since we've eliminated syntactic sugar.

        GHC.Num.+
          @ GHC.Types.Int
          GHC.Num.$fNumInt

We're calling the + operator, applied to the Int type.

GHC.Num.$fNumInt is a dictionary: the record of Num methods at type Int that GHC passes along to resolve the overloaded + here.

In other words, dictionary passing has gone from implicit in Haskell to explicit in Core. This will be really helpful!

The actual parameters to +

Finally, we allocate an integer on the heap.

We'll add it to the result of calling len0 on the second argument to the : constructor, where we're applying the a_aov type again.

          (GHC.Types.I# 1)
          (Length.len0 @ a_aov xs_abq)

Strictness in Core

In System FC, all evaluation is controlled through case expressions. A use of case demands that an expression be evaluated to WHNF, i.e. to the outermost constructor.

Some examples:

-- Haskell:
foo (Bar a b) = {- ... -}

-- Core:
foo wa = case wa of _ { Bar a b -> {- ... -} }

-- Haskell:
{-# LANGUAGE BangPatterns #-}
let !a = 2 + 2 in foo a

-- Core:
case 2 + 2 of a { __DEFAULT -> foo a }

-- Haskell:
a `seq` b

-- Core:
case a of _ { __DEFAULT -> b }

Pop quiz!

Inspect the output of ghc -ddump-simpl and tell me which values are, and which are not, being forcibly evaluated in the definition of sum0.

In return, I'll tell you why we got this error message:

*** Exception: stack overflow

The evaluation stack

There is no such thing as a regular "call stack" in Haskell, no analogue to the stack you're used to thinking of in C or Python or whatever.

When GHC hits a case expression, and must evaluate a possibly thunked expression to WHNF, it uses an internal stack.

This stack has a fixed maximum size, which defaults to 8MB (it can be changed with the -K RTS option).

The size of the stack is fixed to prevent a program that's stuck in an infinite loop from consuming all memory.

Most of the time, if you have a thunk that requires anywhere close to 8MB to evaluate, there's likely a problem in your code.
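For instance, the usual cure for a function like sum0 from earlier is a strict accumulator. Here's a sketch (the name sum1 is invented here): the bang pattern forces the running total at every step, so no chain of (+) thunks can build up.

{-# LANGUAGE BangPatterns #-}

sum1 :: [Integer] -> Integer
sum1 = go 0
  where
    -- the ! on acc makes each addition happen immediately
    go !acc (x:xs) = go (acc + x) xs
    go !acc _      = acc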

The perils of chained thunks

There are a few ways in which chained thunks can cause us harm.

Besides stack overflows, I can think of two more problems off the top of my head.

Please see if you can tell me what those problems are.

The perils of chained thunks

There are a few ways in which chained thunks can cause us harm.

Besides stack overflows, two more problems come to mind: a long chain of thunks occupies heap and keeps alive everything it refers to, so it can leak space; and allocating, traversing, and collecting all of those thunks costs real time in the garbage collector.

So ... thunks are bad?

No, because they enable lazy evaluation.

What's bad is not knowing when lazy or strict evaluation is occurring.

But now that you can read -ddump-simpl output and find those case expressions, you'll be able to tell immediately.

With a little experience, you'll often be able to determine the strictness properties of small Haskell snippets by inspection. (For those times when you can't, -ddump-simpl will still be your friend.)

Pro tips

If you're using GHC 7.2 or newer and want to read simplifier output, consider using options like -dsuppress-all to prevent GHC from annotating the Core.

It makes the dumped Core more readable, but at the cost of information that is sometimes useful.

There are a handful of these suppression options (see the GHC man page), so you can gain finer control over what gets suppressed.

Also, try installing and using the ghc-core tool to automate some of the pain:

cabal install ghc-core
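For instance, running it on the module from earlier pages through the simplified Core with syntax highlighting:

ghc-core Length.hs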

The role of reading Core

I always reach for -ddump-simpl after benchmarking or profiling points me at code that is slower or hungrier than it ought to be.

A couple of minutes with simplifier output will help guide me to the one or two strictness annotations I'm likely to really need.

This saves me from the common newbie mistake of a random splatter of unnecessary strictness annotations, indicating a high level of panic and lack of understanding.

Find out more

We've scratched the surface of some of the tools and techniques you can use, but there's plenty more to learn.
