The techniques we’ve seen in the previous chapters are great for
parallelizing code that uses ordinary Haskell data structures like
lists and Maps, but they don’t work as well for data-parallel
algorithms over large arrays.  That’s because large-scale array
computations demand very good raw sequential performance, which we can
get only by operating on arrays of unboxed data.  We can’t use
Strategies to parallelize operations over unboxed arrays, because they
need lazy data structures (boxed arrays would be suitable, but not
unboxed arrays).  Similarly, Par doesn’t work well here either,
because in Par the data is passed in IVars.
In this chapter, we’re going to see how to write efficient numerical array computations in Haskell and run them in parallel. The library we’re going to use is called Repa, which stands for REgular PArallel arrays.[19] The library provides a range of efficient operations for creating arrays and operating on arrays in parallel.
The Repa package is available on Hackage. If you followed the
instructions for installing the sample code dependencies earlier, then
you should already have it, but if not you can install it with cabal install:
$ cabal install repa
In this chapter, I’m going to use GHCi a lot to illustrate the behavior
of Repa; trying things out in GHCi is a great way to become familiar
with the types and operations that Repa provides.  Because Repa
provides many operations with the same names as Prelude functions
(e.g., map), we usually import Data.Array.Repa with a short module
alias:
> import Data.Array.Repa as Repa
This way, we can refer to Repa’s map function as Repa.map.
Everything in Repa revolves around arrays. A computation in Repa typically consists of computing an array in parallel, perhaps using other arrays as inputs. So we’ll start by looking at the type of arrays, how to build them, and how to work with them.
The Array type has three parameters:
data Array r sh e
The e parameter is the type of the elements, for example Double, Int, or
Word8.  The r parameter is the representation type, which
describes how the array is represented in memory; I’ll come back to
this shortly.  The sh parameter describes the shape of the array;
that is, the number of dimensions it has.
Shapes are built out of two type constructors, Z and :.:
data Z = Z
data tail :. head = tail :. head
The simplest shape, Z, is the shape of an array with no dimensions (i.e., a
scalar), which has a single element.  If we add a dimension, Z
:. Int, we get the shape of an array with a single dimension indexed
by Int, otherwise known as a vector.  Adding another dimension gives
Z :. Int :. Int, the shape of a two-dimensional array, or matrix.
New dimensions are added on the right, and the :. operator
associates left, so when we write Z :. Int :. Int, we really mean
(Z :. Int) :. Int.
The Z and :. symbols are both type constructors and value
constructors, which can be a little confusing at times.  For example, the
data value Z :. 3 has type Z :. Int. The data value form is used
in Repa to mean either “shapes” or “indices.”  For example, Z :. 3
can be either the shape of a three-element vector or the index of the
fourth element of a vector (indices count from zero).
Repa supports only Int-typed indices. A few handy type synonyms are provided for the common shape types:
type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int
Let’s try a few examples.  A simple way to build an array is to use
fromListUnboxed:
fromListUnboxed :: (Shape sh, Unbox a) => sh -> [a] -> Array U sh a
The fromListUnboxed function takes a shape of type sh and a list
of elements of type a, and builds an array of type Array U sh a.
The U is the representation and stands for Unboxed: this array will
contain unboxed elements.  Don’t worry about the Shape and Unbox
type classes. They are just there to ensure that we use only the
appropriate shape constructors (Z and :.) and supported element types, respectively.
Let’s build a 10-element vector of Int and fill it with the numbers
1…10.  We need to pass a shape argument, which will be Z:.10
for a 10-element vector:
> fromListUnboxed (Z :. 10) [1..10]
<interactive>:15:1:
    No instance for (Shape (Z :. head0))
      arising from a use of `fromListUnboxed'
    The type variable `head0' is ambiguous
    Possible fix: add a type signature that fixes these type variable(s)
    Note: there is a potential instance available:
      instance Shape sh => Shape (sh :. Int)
        -- Defined in `Data.Array.Repa.Index'
    Possible fix: add an instance declaration for (Shape (Z :. head0))
    In the expression: fromListUnboxed (Z :. 10) [1 .. 10]
    In an equation for `it': it = fromListUnboxed (Z :. 10) [1 .. 10]
Oops!  This illustrates something that you will probably encounter a
lot when working with Repa: a type error caused by insufficient type
information.  In this case, the integer 10 in Z :. 10 is
overloaded, so we have to say explicitly that we mean Int.  There
are many ways to give GHC the extra bit of information it needs; one
way is to add a type signature to the whole expression, which has type
Array U DIM1 Int:
> fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,3,4,5,6,7,8,9,10])
Similarly, we can make a two-dimensional array, with 3 rows of 5 columns, and fill it with the elements 1 to 15:
> fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int
AUnboxed ((Z :. 3) :. 5) (fromList [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Conceptually, the array we created is this:
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
But internally, the array is stored as a single vector (after all,
computer memory is one-dimensional).  We can see the vector in the
result of the call to fromListUnboxed; it contains the same
elements that we initialized the array with.
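We can observe this flat layout directly with toList, which returns
the elements of an array as a list in their underlying order:

> toList (fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int)
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]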
The shape of the array is there to tell Repa how to interpret the
operations on it.  For example, if we ask for the element at index
Z:.2:.1 in an array with shape Z:.3:.5, we’ll get the element at
position 2 * 5 + 1 in the vector.  We can try it using the !
operator, which extracts an element from an array.  The type of !
is:
(!) :: (Shape sh, Source r e) => Array r sh e -> sh -> e
Let’s get the element at position Z:.2:.1 from our example matrix:
> let arr = fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int
> arr ! (Z:.2:.1)
12
The element at index Z:.2:.1 is therefore 12: two rows down and one column across. As I mentioned earlier, indices count from zero in Repa.
Internally, Repa is using the function toIndex to convert an index
to an Int offset, given a shape:
toIndex :: Shape sh => sh -> sh -> Int
For example:
> toIndex (Z:.3:.5 :: DIM2) (Z:.2:.1 :: DIM2)
11
Because the layout of an array in memory is the same regardless of its shape, we can even change the shape without copying the array:
> reshape (Z:.5:.3) arr ! (Z:.2:.1 :: DIM2)
8
With the shape Z:.5:.3, the index Z:.2:.1 corresponds to the
element at 2 * 3 + 1 = 7, which has value 8.
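We can confirm the arithmetic by asking toIndex with the new shape:

> toIndex (Z:.5:.3 :: DIM2) (Z:.2:.1 :: DIM2)
7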
Here are a couple of other operations on shapes that often come in handy:
rank :: Shape sh => sh -> Int    -- number of dimensions
size :: Shape sh => sh -> Int    -- number of elements
To retrieve the shape of an array, we can use extent:
extent :: (Shape sh, Source r e) => Array r sh e -> sh
> extent arr
(Z :. 3) :. 5
> rank (extent arr)
2
> size (extent arr)
15
We can map a function over an array using Repa’s map function:
Repa.map :: (Shape sh, Source r a) => (a -> b) -> Array r sh a -> Array D sh b
We can see from the type that map returns an array with the
representation D.  The D representation stands for Delayed; this
means that the array has not been computed yet.  A delayed array is
represented by a function from indices to elements.
We can apply map to an array, but there’s no way to print out the
result:
> let a = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> Repa.map (+1) a
<interactive>:26:1:
    No instance for (Show (Array D DIM1 Int))
      arising from a use of `print'
    Possible fix:
      add an instance declaration for (Show (Array D DIM1 Int))
    In a stmt of an interactive GHCi command: print it
As its name suggests, a delayed array is not an array yet.  To turn
it into an array, we have to call a function that allocates the array
and computes the value of each element.  The computeS function does
this for us:
computeS :: (Load r1 sh e, Target r2 e) => Array r1 sh e -> Array r2 sh e
The argument to computeS is an array with a representation that is a
member of the Load class, whereas its result is an array with a
representation that is a member of the Target class.  The most
important instances of these two classes are D and U respectively;
that is, computeS turns a delayed array into a concrete unboxed
array.[20]
Applying computeS to the result of map gives us an unboxed array:
> computeS (Repa.map (+1) a) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [2,3,4,5,6,7,8,9,10,11])
You might be wondering why there is this extra complication—why
doesn’t map just produce a new array?  The answer is that by
representing the result of an array operation as a delayed array, a
sequence of array operations can be performed without ever building
the intermediate arrays; this is an optimization called fusion, and
it’s critical to achieving good performance with Repa.  For example,
if we compose two maps together:
> computeS (Repa.map (+1) (Repa.map (^2) a)) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [2,5,10,17,26,37,50,65,82,101])
The intermediate array between the two maps is not built, and in
fact if we compile this rather than running it in GHCi, provided the
optimization option -O is enabled, it will compile to a single
efficient loop over the input array.
Let’s see how it works.  The fundamental way to get a delayed array is
fromFunction:
fromFunction :: sh -> (sh -> a) -> Array D sh a
The fromFunction operation creates a delayed array.  It takes the
shape of the array and a function that specifies the elements.  For
example, we can make a delayed array that represents the vector of
integers 0 to 9 like this:
> let a = fromFunction (Z :. 10) (\(Z:.i) -> i :: Int)
> :t a
a :: Array D (Z :. Int) Int
Delayed arrays support indexing, just like manifest arrays:
> a ! (Z:.5)
5
Indexing a delayed array works by just calling the function that we
supplied to fromFunction with the given index.
We need to apply computeS to make the delayed array into a manifest array:
> computeS a :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [0,1,2,3,4,5,6,7,8,9])
The computeS function creates the array and for each of the indices
of the array, it calls the function stored in the delayed array to find
the element at that position.
The map function, along with many other operations on arrays, can be
specified in terms of fromFunction.  For example, here is a
definition of map:
> let mymap f a = fromFunction (extent a) (\ix -> f (a ! ix))
> :t mymap
mymap
  :: (Shape sh, Source r e) =>
     (e -> a) -> Array r sh e -> Array D sh a
It works just like the real map:
> computeS (mymap (+1) a) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,3,4,5,6,7,8,9,10])
What happens if we compose two maps together?  The result would
be a delayed array containing a function that indexes into another
delayed array.  So we’re building up a nested function that defines
the array elements, rather than intermediate arrays.  Furthermore,
Repa is carefully engineered so that at compile time the nested
function call is optimized away as far as possible, yielding very
efficient code.
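We can check this interactively by composing two calls to mymap; only
the final computeS allocates an array (here a is the delayed array of
0 to 9 that we built earlier):

> computeS (mymap (+1) (mymap (^2) a)) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,5,10,17,26,37,50,65,82])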
In “Example: Shortest Paths in a Graph”, we looked at an implementation of the Floyd-Warshall algorithm for computing the lengths of shortest paths in a sparse weighted directed graph. Here, we’ll investigate how to code up the algorithm over dense graphs, using Repa.
For reference, here is the pseudocode definition of the algorithm:
shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k = min (shortestPath g i j (k-1))
                           (shortestPath g i k (k-1) + shortestPath g k j (k-1))
We implement this by first computing all the shortest paths for k ==
0, then k == 1, and so on up to the maximum vertex in the graph.
For the dense version, we’re going to use an adjacency matrix; that is, a two-dimensional array indexed by pairs of vertices, where each element is the length of the path between the two vertices. Here is our representation of graphs:
fwdense.hs
type Weight = Int
type Graph r = Array r DIM2 Weight
The implementation of the shortest paths algorithm is as follows:
shortestPaths :: Graph U -> Graph U
shortestPaths g0 = go g0 0
  where
    Z :. _ :. n = extent g0

    go !g !k
      | k == n    = g
      | otherwise = let g' = computeS (fromFunction (Z:.n:.n) sp)
                    in  go g' (k+1)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j)) (g ! (Z:.i:.k) + g ! (Z:.k:.j))
The code is quite readable and somewhat shorter than the sparse version of the algorithm we saw before. However, there are a couple of subtleties that might not be obvious, but are nevertheless important for making the code run quickly:
The loop is written as an explicitly recursive function, go, rather
than something like foldl', even though the latter would lead to
shorter code.  The optimizations in Repa work much better when all the
code is visible to the compiler, and calling out to library functions
can sometimes hide details from GHC and prevent optimizations.  There
are no hard and fast rules here; I experimented with both the explicit
version and the foldl' version, and found the explicit loop faster
(see the sketch after these notes).
Note the bang-patterns on the arguments to go.  This is good practice
for iterative loops like this one and helps Repa to optimize the loop.
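For comparison, here is a sketch of what the foldl' version might look
like (my reconstruction of the alternative just mentioned; it is not
taken from fwdense.hs):

import Data.List (foldl')

-- The same update step as in shortestPaths, but with the iteration
-- over k expressed as a strict left fold instead of explicit recursion.
shortestPathsFold :: Graph U -> Graph U
shortestPathsFold g0 = foldl' step g0 [0 .. n-1]
  where
    Z :. _ :. n = extent g0

    step g k = computeS (fromFunction (Z:.n:.n) sp)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j)) (g ! (Z:.i:.k) + g ! (Z:.k:.j))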
Let’s go ahead and compile the program and try it out on a 500-vertex graph:
> ghc fwdense.hs -O2 -fllvm
[1 of 1] Compiling Main             ( fwdense.hs, fwdense.o )
Linking fwdense ...
> ./fwdense 500 +RTS -s
31250125000
   1,077,772,040 bytes allocated in the heap
      31,516,280 bytes copied during GC
      10,334,312 bytes maximum residency (171 sample(s))
       2,079,424 bytes maximum slop
              32 MB total memory in use (3 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       472 colls,     0 par    0.01s    0.01s     0.0000s    0.0005s
  Gen  1       171 colls,     0 par    0.03s    0.03s     0.0002s    0.0063s
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.46s  (  1.47s elapsed)
  GC      time    0.04s  (  0.04s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.50s  (  1.50s elapsed)
Note that I added a couple of optimization options: -O2 turns up
GHC’s optimizer, and -fllvm enables GHC’s LLVM backend, which
significantly improves the performance of Repa code; on my machine
with this particular example, I see a 40% improvement from
-fllvm.[21]
Now to make the program run in parallel.  To compute an array in parallel, Repa provides a variant of the computeS operation, called computeP:
computeP :: (Monad m, Source r2 e, Target r2 e, Load r1 sh e)
         => Array r1 sh e -> m (Array r2 sh e)
Whereas computeS computes an array sequentially, computeP uses the available cores to compute the array in parallel.  It knows
the size of the array, so it can divide the work equally amongst the
cores.
The type is almost the same as computeS, except that computeP
takes place in a monad. It works with any
monad, and it doesn’t matter which monad is used because the purpose of the monad is only to ensure that computeP
operations are performed in sequence and not nested.  Hence we need
to modify our code so that the go function is in a monad, which
entails a few small changes.  Here is the code:
shortestPaths :: Graph U -> Graph U
shortestPaths g0 = runIdentity $ go g0 0                  -- (1)
  where
    Z :. _ :. n = extent g0

    go !g !k
      | k == n    = return g                              -- (2)
      | otherwise = do
          g' <- computeP (fromFunction (Z:.n:.n) sp)      -- (3)
          go g' (k+1)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j)) (g ! (Z:.i:.k) + g ! (Z:.k:.j))
(1) We need to use a monad, so the Identity monad will do; runIdentity
unwraps it to give the final result.

(2) Remember to return the final array in the monad when the loop
terminates.

(3) Instead of a let binding around computeS, we bind the result of
computeP monadically.
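Incidentally, GHCi runs IO actions directly, so IO is a convenient
choice of monad for trying computeP interactively:

> let a = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> computeP (Repa.map (+1) a) :: IO (Array U DIM1 Int)
AUnboxed (Z :. 10) (fromList [2,3,4,5,6,7,8,9,10,11])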
To run it in parallel, we’ll need to add the -threaded option when
compiling.  Let’s see how it performs:
> ghc -O2 fwdense1 -threaded -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( fwdense1.hs, fwdense1.o )
Linking fwdense1 ...
> ./fwdense1 500 +RTS -s
31250125000
...
  Total   time    1.89s  (  1.91s elapsed)
There’s some overhead for using computeP, which here seems to be
about 27%.  That’s quite high, but we can recoup it by using more
cores.  With four cores:
> ./fwdense1 500 +RTS -s -N4
31250125000
...
  Total   time    2.15s  (  0.57s elapsed)
That equates to a speedup of 2.63 (1.50s/0.57s) over the sequential version, for almost zero effort. Not bad!
Folds are an important class of operations over arrays: they combine
all the elements of an array to produce a single result, such as the
sum of the array or its maximum element.  For example, the function
sumAllS calculates the sum of all the elements in an array:
sumAllS :: (Num a, Shape sh, Source r a, Unbox a, Elt a)
        => Array r sh a -> a
For an array of elements of type a that supports addition (the Num
constraint), sumAllS produces a single result that is the sum of all
the elements:
> let arr = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> sumAllS arr
55
But sometimes we don’t want to fold over the whole array. There are occasions where we need to fold over just one dimension. For example, in the shortest paths example, suppose we wanted to take the resulting matrix of path lengths and find for each vertex the furthest distance we would have to travel from that vertex to any other vertex in the graph.
Our graph may have some nodes that are not connected, and in that case
we represent the distance between them by a special large value called
inf (the value of inf doesn’t matter as long as it is larger than
all the path lengths in the graph).  For the purposes of finding the
maximum distance to other nodes, we’ll ignore nodes that are not
reachable and hence have path length inf.  So the function to
compute the maximum of two path lengths is as follows:
maxDistance :: Weight -> Weight -> Weight
maxDistance x y
  | x == inf  = y
  | y == inf  = x
  | otherwise = max x y
Now we want to fold maxDistance over just one dimension of our
two-dimensional adjacency matrix.  There is a function called foldS
that does just that; here is its type:
foldS :: (Shape sh, Source r a, Elt a, Unbox a)
      => (a -> a -> a)          -- (1)
      -> a                      -- (2)
      -> Array r (sh :. Int) a  -- (3)
      -> Array U sh a           -- (4)
(1) The function to fold.

(2) The unitary value of type a, which should be an identity of the
folding function (for example, zero for addition, or inf for
maxDistance).

(3) The input array.  Note that the shape is sh :. Int: the fold takes
place over the inner (rightmost) dimension of the array.

(4) The output array has shape sh, that is, one dimension fewer than
the input.
The fwdense.hs program has a small test graph of six vertices:
> extent testGraph
(Z :. 6) :. 6
If we use foldS to fold maxDistance over the matrix of shortest
paths, we obtain the maximum distance from each vertex to any other
vertex:
> foldS maxDistance inf (shortestPaths testGraph)
AUnboxed (Z :. 6) (fromList [20,19,31,18,15,21])
And if we fold once more, we’ll find the longest distance between any two nodes (for which a path exists) in the graph:
> foldS maxDistance inf (foldS maxDistance inf (shortestPaths testGraph))
AUnboxed Z (fromList [31])
Note that the result this time is an array with zero dimensions, otherwise known as a scalar.
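To extract the single element from a zero-dimensional array, we can
index it with the empty index Z:

> foldS maxDistance inf (foldS maxDistance inf (shortestPaths testGraph)) ! Z
31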
A function named foldP allows us to fold in parallel:
foldP :: (Shape sh, Source r a, Elt a, Unbox a, Monad m)
      => (a -> a -> a) -> a -> Array r (sh :. Int) a -> m (Array U sh a)
For the same reasons as computeP, foldP is performed in an
arbitrary monad.  The arguments are the same as for foldS.
The function argument used with foldP must be associative.  That
is, the function f must satisfy f x (f y z) == f (f x y) z.  This
is because unlike foldS, foldP doesn’t necessarily fold the
function over the array elements in strict left-to-right order; it
folds different parts of the array in parallel and then combines the
results from those parts using the folding function.
Note that strictly speaking,
although mathematical addition is associative, floating-point addition
is not, due to rounding errors.  However, we tend to ignore this
detail when using foldP because a small amount of nondeterminism in
the floating-point result is normally acceptable.
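As a quick sketch of foldP in use (my own example, assuming the Graph,
Weight, and inf definitions from fwdense.hs; the Identity monad will
do, since any monad works):

import Data.Functor.Identity (runIdentity)

-- Parallel counterpart of the earlier foldS example: the maximum
-- finite distance from each vertex to any other vertex.
maxDistances :: Graph U -> Array U DIM1 Weight
maxDistances g = runIdentity (foldP maxDistance inf g)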
Repa is a great tool for coding image manipulation algorithms, which tend to be naturally parallel and involve a lot of data. In this section, we’ll write a program to rotate an image about its center by a specified number of degrees.
For reading and writing image data, Repa provides an interface to
the DevIL library, which is a cross-platform C library for image
manipulation.  DevIL supports reading and writing various common
image formats, including PNG and JPG.  The library is wrapped by the
Haskell package repa-devil, which provides a convenient Haskell API
to DevIL.  The two operations we’ll be using are readImage and
writeImage:
readImage  :: FilePath -> IL Image
writeImage :: FilePath -> Image -> IL ()
The Image type defines the various in-memory image representations:
data Image = RGBA (Array F DIM3 Word8)
           | RGB  (Array F DIM3 Word8)
           | BGRA (Array F DIM3 Word8)
           | BGR  (Array F DIM3 Word8)
           | Grey (Array F DIM2 Word8)
A color image is represented as a three-dimensional array. The first
two dimensions are the Y and X axes, and the last dimension contains the
three color channels and optionally an alpha channel.  The first four
constructors of Image correspond to different orderings of the color
channels and the presence or not of an alpha channel.  The last
option, Grey, is a grayscale image with one byte per pixel.
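As an aside, individual channels of these arrays are easy to get at
with Repa's slice operation. Here is a sketch of my own (not part of
repa-devil) that extracts channel 0, which is red under the RGB
ordering:

import Data.Array.Repa as Repa
import Data.Array.Repa.Repr.ForeignPtr (F)
import Data.Word (Word8)

-- Fix the innermost (channel) dimension at 0 to obtain a delayed
-- two-dimensional array of red values.
redChannel :: Array F DIM3 Word8 -> Array D DIM2 Word8
redChannel img = slice img (Z :. All :. All :. (0 :: Int))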
Which of these is returned by readImage depends on the type of image
file being read.  For example, a color JPEG image is returned as RGB
data, whereas a PNG image is returned as RGBA.
You may have noticed one unfamiliar aspect to these array types: the
F representation type.  This indicates that the array data is held
in foreign memory; that is, it was allocated by C code.  Apart from
being allocated by C rather than Haskell, the F
representation is identical to U.
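If you do need the pixel data in Haskell-managed memory, one
straightforward way (a sketch of mine using only standard Repa
operations, with the same imports as the previous sketch) is to copy
through a delayed array:

-- Copy a foreign-memory array into an ordinary unboxed array by
-- delaying it and then forcing it into the U representation.
copyToUnboxed :: Array F DIM3 Word8 -> Array U DIM3 Word8
copyToUnboxed = computeS . delay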
Note that readImage and writeImage are in the IL monad.  The
purpose of the IL monad is to ensure that the DevIL library
is initialized properly. This is done by runIL:
runIL :: IL a -> IO a
It’s perfectly fine to have multiple calls to runIL; the library
will be initialized only once.
Our program will take three arguments: the number of degrees to rotate the image by, the input filename, and the output filename:
main :: IO ()
main = do
  [n, f1, f2] <- getArgs
  runIL $ do
    (RGB v) <- readImage f1
    rotated <- computeP $ rotate (read n) v :: IL (Array F DIM3 Word8)
    writeImage f2 (RGB rotated)
Next we’ll write the function rotate, which actually calculates the
rotated image data.  First, we have a decision to make: what should the
size of the rotated image be?  We can either produce a smaller image
than the input, discarding any pixels that fall outside the boundaries
after rotation, or adjust the image size to contain the whole rotated
image, filling in the empty areas with something else (e.g., black).
I'll opt, somewhat arbitrarily, to keep the output
image size the same as the input and fill in the empty areas with
black.  Please feel free to modify the program to do something more
sensible.
rotate :: Double -> Array F DIM3 Word8 -> Array D DIM3 Word8
rotate deg g = fromFunction (Z :. y :. x :. k) f       -- (1)
  where
    sh@(Z :. y :. x :. k) = extent g
    !theta = pi/180 * deg                              -- (2)
    !st = sin theta                                    -- (3)
    !ct = cos theta
    !cy = fromIntegral y / 2 :: Double                 -- (4)
    !cx = fromIntegral x / 2 :: Double

    f (Z :. i :. j :. k)                               -- (5)
      | inShape sh old = g ! old                       -- (6)
      | otherwise      = 0                             -- (7)
      where
        fi = fromIntegral i - cy                       -- (8)
        fj = fromIntegral j - cx
        i' = round (st * fj + ct * fi + cy)            -- (9)
        j' = round (ct * fj - st * fi + cx)
        old = Z :. i' :. j' :. k                       -- (10)
The formula to rotate a point (x, y) by an angle θ about the origin is:

x' = x cos θ − y sin θ
y' = x sin θ + y cos θ

However, we want to rotate our image about its center, whereas the origin of our coordinate system is the upper-left corner. Hence we need to adjust the points to be relative to the center of the image before rotating and adjust them back afterward.
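Concretely (my restatement of what the function f computes), for an
image with center (cx, cy), the pixel at position (j, i) in the new
image is taken from position (j', i') in the old image, where:

j' = cx + (j − cx) cos θ − (i − cy) sin θ
i' = cy + (j − cx) sin θ + (i − cy) cos θ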
(1) We're creating a delayed array, represented by the function f,
with the same shape as the input image g.

(2) Convert the angle by which to rotate the image from degrees to
radians.

(3) Because we'll need the values of sin theta and cos theta several
times, compute them once and share the results.

(4) cy and cx are the coordinates of the center of the image, the
point about which we're rotating.

(5) The function f gives the value of the pixel at index
Z :. i :. j :. k in the new image.

(6) First, we need to check whether the corresponding position in the
old image, old, lies within the bounds of the image.  If the old
position is in bounds, we take the pixel value from there.

(7) If the rotated position in the old image is out of bounds, then we
return zero, giving a black pixel at this position in the new image.

(8) fi and fj are the coordinates of the pixel relative to the center
of the image.

(9) i' and j' are the coordinates of the corresponding pixel in the
old image, found by rotating (fj, fi) about the center.

(10) Finally, old is the index of the pixel in the old image that
supplies the value for this pixel in the new image.
To see the program working, we first need an image to rotate: Figure 5-1.

[Figure 5-1: the input image, wonky.jpg]
Running the program like so results in the straightened image shown in Figure 5-2:

$ ./rotateimage 4 wonky.jpg straight.jpg

[Figure 5-2: the straightened output image, straight.jpg]
Let’s check the performance of the program:
$ rm straight.jpg
$ ./rotateimage 4 wonky.jpg straight.jpg +RTS -s
...
  Total   time    0.69s  (  0.69s elapsed)
And see how much we can gain by running it in parallel, on four cores:
$ ./rotateimage 4 wonky.jpg straight.jpg +RTS -s -N4
...
  Total   time    0.76s  (  0.24s elapsed)
The result is a speedup of 2.88. However, the program spends 0.05s reading and writing the image file (measured by modifying the program to omit the rotation step); if we factor this into the results, the parallel portion of the program achieves a speedup of 3.39.
Repa provides a convenient framework for describing array operations and has some significant benefits:

Fusion means that compositions of array operations are optimized down to single efficient loops, without allocating the intermediate arrays.

Operations like computeP and foldP automatically parallelize using the available cores.
There are a couple of gotchas to bear in mind:

To get the best performance, you may need to help the compiler along with bang-patterns and INLINE pragmas.  A good rule of thumb is to use both of these liberally.  You might also need to use simpler code and fewer library functions so that GHC can see all of your code and optimize it.

Remember to use the -fllvm option if your computer supports it.
There’s much more to Repa that we haven’t covered. For example, Repa has support for stencil convolutions: a common class of image-processing algorithms in which a transformation on each pixel is calculated as some function of the surrounding pixels. For certain kinds of stencil functions that are known at compile time, Repa can generate specialized code that runs extremely fast.
To learn more, take a look at the full Repa documentation on Hackage.
[19] Note that we’re using Repa version 3.2 here; 3.0 had a somewhat different API.
[20] There are other array representations that aren’t covered in this chapter; for more details, see the Repa documentation.
[21] You might not have LLVM installed on your computer, in which case the -fllvm option will not work.  Don’t worry: Repa works perfectly well without it. The code will just be slower.