The techniques we’ve seen in the previous chapters are great for parallelizing code that uses ordinary Haskell data structures like lists and Maps, but they don’t work as well for data-parallel algorithms over large arrays. That’s because large-scale array computations demand very good raw sequential performance, which we can get only by operating on arrays of unboxed data. We can’t use Strategies to parallelize operations over unboxed arrays, because Strategies need lazy data structures (boxed arrays would be suitable, but not unboxed arrays). Similarly, Par doesn’t work well here either, because in Par the data is passed in IVars.
In this chapter, we’re going to see how to write efficient numerical array computations in Haskell and run them in parallel. The library we’re going to use is called Repa, which stands for REgular PArallel arrays.[19] The library provides a range of efficient operations for creating arrays and operating on arrays in parallel.
The Repa package is available on Hackage. If you followed the instructions for installing the sample code dependencies earlier, then you should already have it, but if not you can install it with cabal install:

$ cabal install repa
In this chapter, I’m going to use GHCi a lot to illustrate the behavior of Repa; trying things out in GHCi is a great way to become familiar with the types and operations that Repa provides. Because Repa provides many operations with the same names as Prelude functions (e.g., map), we usually import Data.Array.Repa with a short module alias:

> import Data.Array.Repa as Repa

This way, we can refer to Repa’s map function as Repa.map.
Everything in Repa revolves around arrays. A computation in Repa typically consists of computing an array in parallel, perhaps using other arrays as inputs. So we’ll start by looking at the type of arrays, how to build them, and how to work with them.
The Array type has three parameters:

data Array r sh e

The e parameter is the type of the elements, for example Double, Int, or Word8. The r parameter is the representation type, which describes how the array is represented in memory; I’ll come back to this shortly. The sh parameter describes the shape of the array; that is, the number of dimensions it has.
Shapes are built out of two type constructors, Z and :. :

data Z = Z
data tail :. head = tail :. head
The simplest shape, Z, is the shape of an array with no dimensions (i.e., a scalar), which has a single element. If we add a dimension, Z :. Int, we get the shape of an array with a single dimension indexed by Int, otherwise known as a vector. Adding another dimension gives Z :. Int :. Int, the shape of a two-dimensional array, or matrix. New dimensions are added on the right, and the :. operator associates left, so when we write Z :. Int :. Int, we really mean (Z :. Int) :. Int.
The Z and :. symbols are both type constructors and value constructors, which can be a little confusing at times. For example, the data value Z :. 3 has type Z :. Int. The data value form is used in Repa to mean either “shapes” or “indices.” For example, Z :. 3 can be either the shape of three-element vectors or the index of the fourth element of a vector (indices count from zero).

Repa supports only Int-typed indices. A few handy type synonyms are provided for the common shape types:
type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int
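Since Z and :. do double duty as type and value constructors, it can help to write a shape and an index out with explicit types. These two definitions are my own illustrative examples (assuming import Data.Array.Repa as Repa):

-- A shape and an index, typed with the DIM1/DIM2 synonyms.
vshape :: DIM1      -- the shape of a 10-element vector
vshape = Z :. 10

mindex :: DIM2      -- the index of row 2, column 1 of a matrix
mindex = Z :. 2 :. 1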
Let’s try a few examples. A simple way to build an array is to use fromListUnboxed:

fromListUnboxed :: (Shape sh, Unbox a) => sh -> [a] -> Array U sh a
The fromListUnboxed function takes a shape of type sh and a list of elements of type a, and builds an array of type Array U sh a. The U is the representation and stands for Unboxed: this array will contain unboxed elements. Don’t worry about the Shape and Unbox type classes; they are just there to ensure that we use only the appropriate shape constructors (Z and :.) and supported element types, respectively.

Let’s build a 10-element vector of Int and fill it with the numbers 1…10. We need to pass a shape argument, which will be Z :. 10 for a 10-element vector:
> fromListUnboxed (Z :. 10) [1..10]

<interactive>:15:1:
    No instance for (Shape (Z :. head0))
      arising from a use of `fromListUnboxed'
    The type variable `head0' is ambiguous
    Possible fix: add a type signature that fixes these type variable(s)
    Note: there is a potential instance available:
      instance Shape sh => Shape (sh :. Int)
        -- Defined in `Data.Array.Repa.Index'
    Possible fix:
      add an instance declaration for (Shape (Z :. head0))
    In the expression: fromListUnboxed (Z :. 10) [1 .. 10]
    In an equation for `it': it = fromListUnboxed (Z :. 10) [1 .. 10]
Oops! This illustrates something that you will probably encounter a lot when working with Repa: a type error caused by insufficient type information. In this case, the integer 10 in Z :. 10 is overloaded, so we have to say explicitly that we mean Int. There are many ways to give GHC the extra bit of information it needs; one way is to add a type signature to the whole expression, which has type Array U DIM1 Int:
> fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,3,4,5,6,7,8,9,10])
Similarly, we can make a two-dimensional array with 3 rows of 5 columns and fill it with the elements 1 to 15:

> fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int
AUnboxed ((Z :. 3) :. 5) (fromList [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Conceptually, the array we created is this:

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
But internally, the array is stored as a single vector (after all, computer memory is one-dimensional). We can see the vector in the result of the call to fromListUnboxed; it contains the same elements that we initialized the array with.

The shape of the array is there to tell Repa how to interpret the operations on it. For example, if we ask for the element at index Z:.2:.1 in an array with shape Z:.3:.5, we’ll get the element at position 2 * 5 + 1 in the vector. We can try it using the ! operator, which extracts an element from an array. The type of ! is:
(!) :: (Shape sh, Source r e) => Array r sh e -> sh -> e
Let’s get the element at position Z:.2:.1 from our example matrix:

> let arr = fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int
> arr ! (Z:.2:.1)
12
The element 12 is therefore 2 rows down and 1 column across. As I mentioned earlier, indices count from zero in Repa.
Internally, Repa is using the function toIndex to convert an index to an Int offset, given a shape:

toIndex :: Shape sh => sh -> sh -> Int
For example:
> toIndex (Z:.3:.5 :: DIM2) (Z:.2:.1 :: DIM2)
11
Because the layout of an array in memory is the same regardless of its shape, we can even change the shape without copying the array:
> reshape (Z:.5:.3) arr ! (Z:.2:.1 :: DIM2)
8
With the shape Z:.5:.3, the index Z:.2:.1 corresponds to the element at position 2 * 3 + 1 = 7 in the vector, which has value 8.
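If you want to double-check this index arithmetic outside GHCi, a tiny standalone program does the job (my example):

import Data.Array.Repa as Repa

main :: IO ()
main = do
  print (toIndex (Z:.3:.5 :: DIM2) (Z:.2:.1 :: DIM2))  -- 2 * 5 + 1 = 11
  print (toIndex (Z:.5:.3 :: DIM2) (Z:.2:.1 :: DIM2))  -- 2 * 3 + 1 = 7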
Here are a couple of other operations on shapes that often come in handy:

rank :: Shape sh => sh -> Int    -- number of dimensions
size :: Shape sh => sh -> Int    -- number of elements
To retrieve the shape of an array, we can use extent:

extent :: (Shape sh, Source r e) => Array r sh e -> sh

> extent arr
(Z :. 3) :. 5
> rank (extent arr)
2
> size (extent arr)
15
We can map a function over an array using Repa’s map function:

Repa.map :: (Shape sh, Source r a)
         => (a -> b)
         -> Array r sh a
         -> Array D sh b
We can see from the type that map returns an array with the representation D. The D representation stands for Delayed; this means that the array has not been computed yet. A delayed array is represented by a function from indices to elements.

We can apply map to an array, but there’s no way to print out the result:
> let a = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> Repa.map (+1) a

<interactive>:26:1:
    No instance for (Show (Array D DIM1 Int))
      arising from a use of `print'
    Possible fix:
      add an instance declaration for (Show (Array D DIM1 Int))
    In a stmt of an interactive GHCi command: print it
As its name suggests, a delayed array is not an array yet. To turn it into an array, we have to call a function that allocates the array and computes the value of each element. The computeS function does this for us:

computeS :: (Load r1 sh e, Target r2 e)
         => Array r1 sh e
         -> Array r2 sh e
The argument to computeS is an array with a representation that is a member of the Load class, whereas its result is an array with a representation that is a member of the Target class. The most important instances of these two classes are D and U respectively; that is, computeS turns a delayed array into a concrete unboxed array.[20]
Applying computeS to the result of map gives us an unboxed array:

> computeS (Repa.map (+1) a) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [2,3,4,5,6,7,8,9,10,11])
You might be wondering why there is this extra complication: why doesn’t map just produce a new array? The answer is that by representing the result of an array operation as a delayed array, a sequence of array operations can be performed without ever building the intermediate arrays; this is an optimization called fusion, and it’s critical to achieving good performance with Repa. For example, we can compose two maps together:

> computeS (Repa.map (+1) (Repa.map (^2) a)) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [2,5,10,17,26,37,50,65,82,101])
The intermediate array between the two maps is not built, and in fact if we compile this rather than running it in GHCi, provided the optimization option -O is enabled, it will compile to a single efficient loop over the input array.
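As a concrete illustration of the compiled case, here is a minimal module you could build with ghc -O (my sketch, not from the original text); the two maps fuse with computeS into a single loop:

import Data.Array.Repa as Repa

-- Square each element, then add one; no intermediate array is built.
squarePlusOne :: Array U DIM1 Int -> Array U DIM1 Int
squarePlusOne = computeS . Repa.map (+1) . Repa.map (^2)

main :: IO ()
main = do
  let a = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
  print (toList (squarePlusOne a))  -- [2,5,10,17,26,37,50,65,82,101]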
Let’s see how it works. The fundamental way to get a delayed array is fromFunction:

fromFunction :: sh -> (sh -> a) -> Array D sh a
The fromFunction operation creates a delayed array. It takes the shape of the array and a function that specifies the elements. For example, we can make a delayed array that represents the vector of integers 0 to 9 like this:

> let a = fromFunction (Z :. 10) (\(Z:.i) -> i :: Int)
> :t a
a :: Array D (Z :. Int) Int
Delayed arrays support indexing, just like manifest arrays:
> a ! (Z:.5)
5
Indexing a delayed array works by just calling the function that we supplied to fromFunction with the given index.

We need to apply computeS to make the delayed array into a manifest array:

> computeS a :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [0,1,2,3,4,5,6,7,8,9])
The computeS function creates the array, and for each of the indices of the array, it calls the function stored in the delayed array to find the element at that position.

The map function, along with many other operations on arrays, can be specified in terms of fromFunction. For example, here is a definition of map:

> let mymap f a = fromFunction (extent a) (\ix -> f (a ! ix))
> :t mymap
mymap :: (Shape sh, Source r e) => (e -> a) -> Array r sh e -> Array D sh a
It works just like the real map:

> computeS (mymap (+1) a) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,3,4,5,6,7,8,9,10])
What happens if we compose two maps together? The result would be a delayed array containing a function that indexes into another delayed array. So we’re building up a nested function that defines the array elements, rather than intermediate arrays. Furthermore, Repa is carefully engineered so that at compile time the nested function call is optimized away as far as possible, yielding very efficient code.
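Writing the composition out for our mymap makes this concrete (my illustration):

-- mymap f (mymap g a)
--   = fromFunction (extent a) (\ix -> f (mymap g a ! ix))
--   = fromFunction (extent a) (\ix -> f (g (a ! ix)))
--
-- The result is a single delayed array whose element function applies
-- g and then f; no intermediate array is ever built.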
In “Example: Shortest Paths in a Graph”, we looked at an implementation of the Floyd-Warshall algorithm for computing the lengths of shortest paths in a sparse weighted directed graph. Here, we’ll investigate how to code up the algorithm over dense graphs, using Repa.
For reference, here is the pseudocode definition of the algorithm:
shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k = min (shortestPath g i j (k-1))
                           (shortestPath g i k (k-1) + shortestPath g k j (k-1))
We implement this by first computing all the shortest paths for k == 0, then k == 1, and so on up to the maximum vertex in the graph.
For the dense version, we’re going to use an adjacency matrix; that is, a two-dimensional array indexed by pairs of vertices, where each element is the length of the path between the two vertices. Here is our representation of graphs:
fwdense.hs

type Weight = Int
type Graph r = Array r DIM2 Weight
The implementation of the shortest paths algorithm is as follows:

shortestPaths :: Graph U -> Graph U
shortestPaths g0 = go g0 0
  where
    Z :. _ :. n = extent g0

    go !g !k
      | k == n    = g
      | otherwise =
          let g' = computeS (fromFunction (Z:.n:.n) sp)
          in  go g' (k+1)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j))
                           (g ! (Z:.i:.k) + g ! (Z:.k:.j))
The code is quite readable and somewhat shorter than the sparse version of the algorithm we saw before. However, there are a couple of subtleties that might not be obvious, but are nevertheless important for making the code run quickly:

- The loop is written as an explicitly recursive function, go, rather than something like foldl', even though the latter would lead to shorter code. The optimizations in Repa work much better when all the code is visible to the compiler, and calling out to library functions can sometimes hide details from GHC and prevent optimizations. There are no hard and fast rules here; I experimented with both the explicit version and the foldl' version, and found the explicit loop faster (a sketch of a foldl' variant follows this list).
- Note the bang patterns on the arguments to go. This is good practice for iterative loops like this one and helps Repa to optimize the loop.
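For reference, a foldl' version would look something like this (a sketch, using the same Graph and Weight definitions as above):

import Data.List (foldl')

shortestPathsFold :: Graph U -> Graph U
shortestPathsFold g0 = foldl' step g0 [0 .. n-1]
  where
    Z :. _ :. n = extent g0

    -- one round of the algorithm: relax all paths through vertex k
    step g k = computeS (fromFunction (Z:.n:.n) sp)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j))
                           (g ! (Z:.i:.k) + g ! (Z:.k:.j))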
Let’s go ahead and compile the program and try it out on a 500-vertex graph:
> ghc fwdense.hs -O2 -fllvm
[1 of 1] Compiling Main             ( fwdense.hs, fwdense.o )
Linking fwdense ...
> ./fwdense 500 +RTS -s
31250125000
   1,077,772,040 bytes allocated in the heap
      31,516,280 bytes copied during GC
      10,334,312 bytes maximum residency (171 sample(s))
       2,079,424 bytes maximum slop
              32 MB total memory in use (3 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       472 colls,     0 par    0.01s    0.01s     0.0000s    0.0005s
  Gen  1       171 colls,     0 par    0.03s    0.03s     0.0002s    0.0063s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.46s  (  1.47s elapsed)
  GC      time    0.04s  (  0.04s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.50s  (  1.50s elapsed)
Note that I added a couple of optimization options: -O2 turns up GHC’s optimizer, and -fllvm enables GHC’s LLVM backend, which significantly improves the performance of Repa code; on my machine with this particular example, I see a 40% improvement from -fllvm.[21]
Now to make the program run in parallel. To compute an array in parallel, Repa provides a variant of the computeS operation, called computeP:
computeP :: (Monad m, Source r2 e, Target r2 e, Load r1 sh e)
         => Array r1 sh e
         -> m (Array r2 sh e)
Whereas computeS computes an array sequentially, computeP uses the available cores to compute the array in parallel. It knows the size of the array, so it can divide the work equally amongst the cores.

The type is almost the same as computeS, except that computeP takes place in a monad. It works with any monad, and it doesn’t matter which monad is used, because the purpose of the monad is only to ensure that computeP operations are performed in sequence and not nested. Hence we need to modify our code so that the go function is in a monad, which entails a few small changes. Here is the code:
shortestPaths :: Graph U -> Graph U
shortestPaths g0 = runIdentity $ go g0 0
  where
    Z :. _ :. n = extent g0

    go !g !k
      | k == n    = return g
      | otherwise = do
          g' <- computeP (fromFunction (Z:.n:.n) sp)
          go g' (k+1)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j))
                           (g ! (Z:.i:.k) + g ! (Z:.k:.j))
A few points to note about the changes:

- We need to use a monad, so the Identity monad is as good as any; runIdentity unwraps it at the top level.
- Remember to return the final result, because go now runs in a monad.
- Instead of let, we bind the result of computeP with <- in the monad.
To run it in parallel, we’ll need to add the -threaded option when compiling. Let’s see how it performs:
> ghc -O2 fwdense1 -threaded -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( fwdense1.hs, fwdense1.o )
Linking fwdense1 ...
> ./fwdense1 500 +RTS -s
31250125000
...
  Total   time    1.89s  (  1.91s elapsed)
There’s some overhead for using computeP, which here seems to be about 27%. That’s quite high, but we can recoup it by using more cores. With four cores:
> ./fwdense1 500 +RTS -s -N4
31250125000
...
  Total   time    2.15s  (  0.57s elapsed)
That equates to a speedup of 2.63 over the sequential version, for almost zero effort. Not bad!
Folds are an important class of operations over arrays: they combine all the elements of an array to produce a single result, such as summing the array or finding its maximum element. For example, the function sumAllS calculates the sum of all the elements in an array:
sumAllS :: (Num a, Shape sh, Source r a, Unbox a, Elt a)
        => Array r sh a
        -> a
For an array of elements of type a that supports addition (the Num constraint), sumAllS produces a single result that is the sum of all the elements:

> let arr = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> sumAllS arr
55
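There is also a parallel version, sumAllP, which, like computeP, delivers its result in an arbitrary monad. A minimal sketch of its use (this assumes the Repa 3 API; check the documentation for your version):

import Data.Array.Repa as Repa
import Data.Functor.Identity (runIdentity)

-- Parallel sum over the whole array, run in the Identity monad.
sumPar :: Array U DIM1 Int -> Int
sumPar a = runIdentity (sumAllP a)  -- e.g., 55 for the arr above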
But sometimes we don’t want to fold over the whole array. There are occasions where we need to fold over just one dimension. For example, in the shortest paths example, suppose we wanted to take the resulting matrix of path lengths and find for each vertex the furthest distance we would have to travel from that vertex to any other vertex in the graph.
Our graph may have some nodes that are not connected, and in that case we represent the distance between them by a special large value called inf (the value of inf doesn’t matter as long as it is larger than all the path lengths in the graph). For the purposes of finding the maximum distance to other nodes, we’ll ignore nodes that are not reachable and hence have path length inf. So the function to compute the maximum of two path lengths is as follows:
maxDistance :: Weight -> Weight -> Weight
maxDistance x y
  | x == inf  = y
  | y == inf  = x
  | otherwise = max x y
Now we want to fold maxDistance over just one dimension of our two-dimensional adjacency matrix. There is a function called foldS that does just that; here is its type:
foldS :: (Shape sh, Source r a, Elt a, Unbox a)
      => (a -> a -> a)          -- the function to fold
      -> a                      -- the unit value of the fold
      -> Array r (sh :. Int) a  -- the input array
      -> Array U sh a           -- the output array

The arguments are as follows:

- The function to fold.
- The unit value of type a, used as the starting value for the fold.
- The input array. Note that the shape is sh :. Int: the fold takes place over the innermost dimension of the array.
- The output array has shape sh; that is, one fewer dimension than the input array.
The fwdense.hs program has a small test graph of six vertices:

> extent testGraph
(Z :. 6) :. 6
If we use foldS to fold maxDistance over the matrix of shortest paths, we obtain the maximum distance from each vertex to any other vertex:

> foldS maxDistance inf (shortestPaths testGraph)
AUnboxed (Z :. 6) (fromList [20,19,31,18,15,21])
And if we fold once more, we’ll find the longest distance between any two nodes (for which a path exists) in the graph:

> foldS maxDistance inf (foldS maxDistance inf (shortestPaths testGraph))
AUnboxed Z (fromList [31])
Note that the result this time is an array with zero dimensions, otherwise known as a scalar.
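If you want the result as a plain Weight rather than a zero-dimensional array, you can index it with the empty shape Z (my example, assuming the fwdense.hs definitions):

-- extract the scalar from the result of the double fold
longest :: Array U Z Weight -> Weight
longest r = r ! Z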
A function named foldP allows us to fold in parallel:

foldP :: (Shape sh, Source r a, Elt a, Unbox a, Monad m)
      => (a -> a -> a)
      -> a
      -> Array r (sh :. Int) a
      -> m (Array U sh a)
For the same reasons as computeP, foldP is performed in an arbitrary monad. The arguments are the same as for foldS.
The function argument used with foldP must be associative. That is, the function f must satisfy f x (f y z) == f (f x y) z. This is because unlike foldS, foldP doesn’t necessarily fold the function over the array elements in strict left-to-right order; it folds different parts of the array in parallel and then combines the results from those parts using the folding function.
Note that strictly speaking, although mathematical addition is associative, floating-point addition is not, due to rounding errors. However, we tend to ignore this detail when using foldP, because a small amount of nondeterminism in the floating-point result is normally acceptable.
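Putting the pieces together, the furthest-distance computation can be run in parallel like this (a sketch; maxDistance is associative, with inf acting as its unit, so it is safe to use with foldP):

import Data.Functor.Identity (runIdentity)

-- For each vertex, the furthest distance to any reachable vertex.
furthest :: Graph U -> Array U DIM1 Weight
furthest g = runIdentity (foldP maxDistance inf g)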
Repa is a great tool for coding image manipulation algorithms, which tend to be naturally parallel and involve a lot of data. In this section, we’ll write a program to rotate an image about its center by a specified number of degrees.
For reading and writing image data, Repa provides an interface to the DevIL library, which is a cross-platform C library for image manipulation. DevIL supports reading and writing various common image formats, including PNG and JPG. The library is wrapped by the Haskell package repa-devil, which provides a convenient Haskell API to DevIL. The two operations we’ll be using are readImage and writeImage:

readImage  :: FilePath -> IL Image
writeImage :: FilePath -> Image -> IL ()
where the Image type defines various in-memory image representations:

data Image = RGBA (Array F DIM3 Word8)
           | RGB  (Array F DIM3 Word8)
           | BGRA (Array F DIM3 Word8)
           | BGR  (Array F DIM3 Word8)
           | Grey (Array F DIM2 Word8)
A color image is represented as a three-dimensional array. The first two dimensions are the Y and X axes, and the last dimension contains the three color channels and optionally an alpha channel. The first four constructors of Image correspond to different orderings of the color channels and the presence or absence of an alpha channel. The last option, Grey, is a grayscale image with one byte per pixel.

Which one of these is returned by readImage depends on the type of image file being read. For example, a color JPEG image returns data in RGB format, but a PNG image returns in RGBA format.
You may have noticed one unfamiliar aspect to these array types: the F representation type. This indicates that the array data is held in foreign memory; that is, it was allocated by C code. Apart from being allocated by C rather than Haskell, the F representation is identical to U.
Note that readImage and writeImage are in the IL monad. The purpose of the IL monad is to ensure that the DevIL library is initialized properly. This is done by runIL:

runIL :: IL a -> IO a
It’s perfectly fine to have multiple calls to runIL; the library will be initialized only once.
Our program will take three arguments: the number of degrees to rotate the image by, the input filename, and the output filename, respectively:
main :: IO ()
main = do
  [n, f1, f2] <- getArgs
  runIL $ do
    (RGB v) <- readImage f1
    rotated <- computeP $ rotate (read n) v :: IL (Array F DIM3 Word8)
    writeImage f2 (RGB rotated)
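One caveat: the (RGB v) pattern match in main will fail at runtime if the input file decodes to a different constructor (recall that PNG files come back as RGBA). A small helper like the following makes the failure mode explicit; this is my sketch, not part of the original program:

import Data.Word (Word8)
import Data.Array.Repa (Array, DIM3)
import Data.Array.Repa.Repr.ForeignPtr (F)
import Data.Array.Repa.IO.DevIL (Image(..))

-- Extract the RGB payload, with a clear error for other formats.
toRGB :: Image -> Array F DIM3 Word8
toRGB (RGB a) = a
toRGB _       = error "rotateimage: expected an RGB image (e.g., a color JPEG)"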
Next we’ll write the function rotate, which actually calculates the rotated image data. First, we have a decision to make: what should the size of the rotated image be? We can either produce a smaller image than the input, discarding any pixels that fall outside the boundaries after rotation, or adjust the image size to contain the rotated image, filling in the empty areas with something else (e.g., black). I’ll opt, somewhat arbitrarily, to keep the output image size the same as the input and fill in the empty areas with black. Please feel free to modify the program to do something more sensible.
rotate :: Double -> Array F DIM3 Word8 -> Array D DIM3 Word8
rotate deg g = fromFunction (Z :. y :. x :. k) f
  where
    sh@(Z :. y :. x :. k) = extent g

    !theta = pi/180 * deg

    !st = sin theta
    !ct = cos theta

    !cy = fromIntegral y / 2 :: Double
    !cx = fromIntegral x / 2 :: Double

    f (Z :. i :. j :. k)
      | inShape sh old = g ! old
      | otherwise      = 0
      where
        fi = fromIntegral i - cy
        fj = fromIntegral j - cx

        i' = round (st * fj + ct * fi + cy)
        j' = round (ct * fj - st * fi + cx)

        old = Z :. i' :. j' :. k
The formula to rotate a point (x, y) by an angle θ about the origin is given by:

x' = x cos θ − y sin θ
y' = x sin θ + y cos θ
However, we want to rotate our image about its center, but the origin of our coordinate system is the upper-left corner. Hence we need to adjust the points to be relative to the center of the image before rotating, and adjust them back afterward.
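Folding the centering adjustment into the rotation (my derivation, matching the code above), the source pixel (i', j') for the destination pixel (i, j) is:

i' = sin θ · (j − cx) + cos θ · (i − cy) + cy
j' = cos θ · (j − cx) − sin θ · (i − cy) + cx

where i is the row (Y) coordinate, j is the column (X) coordinate, and (cy, cx) is the center of the image. Note that we map each destination pixel back to a source pixel, which guarantees that every pixel of the output gets a value.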
Here is what each part of rotate is doing:

- We create a delayed array with fromFunction, represented by the function f, with the same shape as the input image.
- theta is the angle to rotate by, converted from degrees to radians.
- Because we’ll need the values of sin theta and cos theta for every pixel, we compute them once here; the bang patterns ensure they are evaluated eagerly rather than once per pixel.
- cy and cx are the coordinates of the center of the image.
- The function f computes the pixel at index Z :. i :. j :. k of the new image.
- First, we check whether the corresponding position in the old image, old, falls within the image, using inShape. If it does, the result is the pixel from the old image at that index.
- If the rotated position in the old image is out of bounds, then we return zero, giving a black pixel at this position in the new image.
- fi and fj are the coordinates of the pixel, adjusted to be relative to the center of the image.
- i' and j' are the coordinates of the corresponding pixel in the old image, obtained by applying the rotation and translating back to corner-based coordinates.
- Finally, old is the index of the pixel in the old image; note that the color channel index k is passed through unchanged.
To see the program working, we first need an image to rotate (Figure 5-1).
Running the program like so results in the straightened image shown in Figure 5-2:
$ ./rotateimage 4 wonky.jpg straight.jpg
Let’s check the performance of the program:
$ rm straight.jpg
$ ./rotateimage 4 wonky.jpg straight.jpg +RTS -s
...
  Total   time    0.69s  (  0.69s elapsed)
And see how much we can gain by running it in parallel, on four cores:
$ ./rotateimage 4 wonky.jpg straight.jpg +RTS -s -N4
...
  Total   time    0.76s  (  0.24s elapsed)
The result is a speedup of 2.88. However, this program spends 0.05s of its time reading and writing the image file (measured by modifying the program to omit the rotation step), and if we factor this into the results, we obtain a speedup for the parallel portion of the program of 3.39.
Repa provides a convenient framework for describing array operations and has some significant benefits:

- Sequences of array operations are fused, so no intermediate arrays are built.
- Parallel operations such as computeP and foldP automatically parallelize using the available cores.
There are a couple of gotchas to bear in mind:

- Getting the best performance can require the judicious use of bang patterns and INLINE pragmas. A good rule of thumb is to use both of these liberally. You might also need to use simpler code and fewer library functions so that GHC can see all of your code and optimize it.
- Don’t forget to compile with the -fllvm option if your computer supports it.
There’s much more to Repa that we haven’t covered. For example, Repa has support for stencil convolutions: a common class of image-processing algorithms in which a transformation on each pixel is calculated as some function of the surrounding pixels. For certain kinds of stencil functions that are known at compile time, Repa can generate specialized code that runs extremely fast.
To learn more, take a look at the full Repa documentation on Hackage.
[19] Note that we’re using Repa version 3.2 here; 3.0 had a somewhat different API.
[20] There are other array representations that aren’t covered in this chapter; for more details, see the Repa documentation.
[21] You might not have LLVM installed on your computer, in which case the -fllvm option will not work. Don’t worry: Repa works perfectly well without it. The code will just be slower.