The techniques we’ve seen in the previous chapters are great for
parallelizing code that uses ordinary Haskell data structures like
lists and Maps, but they don’t work as well for data-parallel
algorithms over large arrays.  That’s because large-scale array
computations demand very good raw sequential performance, which we can
get only by operating on arrays of unboxed data.  We can’t use
Strategies to parallelize operations over unboxed arrays, because they
need lazy data structures (boxed arrays would be suitable, but not
unboxed arrays).  Similarly, Par doesn’t work well here either,
because in Par the data is passed in IVars.
In this chapter, we’re going to see how to write efficient numerical array computations in Haskell and run them in parallel. The library we’re going to use is called Repa, which stands for REgular PArallel arrays.[19] The library provides a range of efficient operations for creating arrays and operating on arrays in parallel.
The Repa package is available on Hackage. If you followed the
instructions for installing the sample code dependencies earlier, then
you should already have it, but if not you can install it with cabal install:
$ cabal install repa
In this chapter, I’m going to use GHCi a lot to illustrate the behavior
of Repa; trying things out in GHCi is a great way to become familiar
with the types and operations that Repa provides.  Because Repa
provides many operations with the same names as Prelude functions
(e.g., map), we usually import Data.Array.Repa with a short module
alias:
> import Data.Array.Repa as Repa
This way, we can refer to Repa’s map function as Repa.map.
Everything in Repa revolves around arrays. A computation in Repa typically consists of computing an array in parallel, perhaps using other arrays as inputs. So we’ll start by looking at the type of arrays, how to build them, and how to work with them.
The Array type has three parameters:
data Array r sh e
The e parameter is the type of the elements, for example Double, Int, or
Word8.  The r parameter is the representation type, which
describes how the array is represented in memory; I’ll come back to
this shortly.  The sh parameter describes the shape of the array;
that is, the number of dimensions it has.
Shapes are built out of two type constructors, Z and :.:
data Z = Z
data tail :. head = tail :. head
The simplest shape, Z, is the shape of an array with no dimensions (i.e., a
scalar), which has a single element.  If we add a dimension, Z
:. Int, we get the shape of an array with a single dimension indexed
by Int, otherwise known as a vector.  Adding another dimension gives
Z :. Int :. Int, the shape of a two-dimensional array, or matrix.
New dimensions are added on the right, and the :. operator
associates left, so when we write Z :. Int :. Int, we really mean
(Z :. Int) :. Int.
The Z and :. symbols are both type constructors and value
constructors, which can be a little confusing at times.  For example, the
data value Z :. 3 has type Z :. Int. The data value form is used
in Repa to mean either “shapes” or “indices.”  For example, Z :. 3
can be either the shape of a three-element vector or the index of the
fourth element of a vector (indices count from zero).
Repa supports only Int-typed indices. A few handy type synonyms are provided for the common shape types:
type DIM0 = Z
type DIM1 = DIM0 :. Int
type DIM2 = DIM1 :. Int
Let’s try a few examples.  A simple way to build an array is to use
fromListUnboxed:
fromListUnboxed :: (Shape sh, Unbox a) => sh -> [a] -> Array U sh a
The fromListUnboxed function takes a shape of type sh and a list
of elements of type a, and builds an array of type Array U sh a.
The U is the representation and stands for Unboxed: this array will
contain unboxed elements.  Don’t worry about the Shape and Unbox
type classes. They are just there to ensure that we use only the
appropriate shape constructors (Z and :.) and supported element types, respectively.
Let’s build a 10-element vector of Int and fill it with the numbers
1…10.  We need to pass a shape argument, which will be Z:.10
for a 10-element vector:
> fromListUnboxed (Z :. 10) [1..10]
<interactive>:15:1:
    No instance for (Shape (Z :. head0))
      arising from a use of `fromListUnboxed'
    The type variable `head0' is ambiguous
    Possible fix: add a type signature that fixes these type variable(s)
    Note: there is a potential instance available:
      instance Shape sh => Shape (sh :. Int)
        -- Defined in `Data.Array.Repa.Index'
    Possible fix: add an instance declaration for (Shape (Z :. head0))
    In the expression: fromListUnboxed (Z :. 10) [1 .. 10]
    In an equation for `it': it = fromListUnboxed (Z :. 10) [1 .. 10]
Oops!  This illustrates something that you will probably encounter a
lot when working with Repa: a type error caused by insufficient type
information.  In this case, the integer 10 in Z :. 10 is
overloaded, so we have to say explicitly that we mean Int.  There
are many ways to give GHC the extra bit of information it needs; one
way is to add a type signature to the whole expression, which has type
Array U DIM1 Int:
> fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,3,4,5,6,7,8,9,10])
Similarly, we can make a two-dimensional array, with 3 rows of 5 columns, and fill it with the elements 1 to 15:
> fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int
AUnboxed ((Z :. 3) :. 5) (fromList [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
Conceptually, the array we created is this:
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
But internally, the array is stored as a single vector (after all,
computer memory is one-dimensional).  We can see the vector in the
result of the call to fromListUnboxed; it contains the same
elements that we initialized the array with.
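We can observe this flat layout directly with toList, which returns
the elements of an array as a list in their underlying order:

> toList (fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int)
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]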
The shape of the array is there to tell Repa how to interpret the
operations on it.  For example, if we ask for the element at index
Z:.2:.1 in an array with shape Z:.3:.5, we’ll get the element at
position 2 * 5 + 1 in the vector.  We can try it using the !
operator, which extracts an element from an array.  The type of !
is:
(!) :: (Shape sh, Source r e) => Array r sh e -> sh -> e
Let’s get the element at position Z:.2:.1 from our example matrix:
> let arr = fromListUnboxed (Z :. 3 :. 5) [1..15] :: Array U DIM2 Int
> arr ! (Z:.2:.1)
12
The element at index Z:.2:.1 is therefore 12: two rows down and one column across. As I mentioned earlier, indices count from zero in Repa.
Internally, Repa is using the function toIndex to convert an index
to an Int offset, given a shape:
toIndex :: Shape sh => sh -> sh -> Int
For example:
> toIndex (Z:.3:.5 :: DIM2) (Z:.2:.1 :: DIM2)
11
Because the layout of an array in memory is the same regardless of its shape, we can even change the shape without copying the array:
> reshape (Z:.5:.3) arr ! (Z:.2:.1 :: DIM2)
8
With the shape Z:.5:.3, the index Z:.2:.1 corresponds to the
element at 2 * 3 + 1 = 7, which has value 8.
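We can confirm the arithmetic by asking toIndex with the new shape:

> toIndex (Z:.5:.3 :: DIM2) (Z:.2:.1 :: DIM2)
7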
Here are a couple of other operations on shapes that often come in handy:
rank :: Shape sh => sh -> Int    -- number of dimensions
size :: Shape sh => sh -> Int    -- number of elements
To retrieve the shape of an array, we can use extent:
extent :: (Shape sh, Source r e) => Array r sh e -> sh
> extent arr
(Z :. 3) :. 5
> rank (extent arr)
2
> size (extent arr)
15
We can map a function over an array using Repa’s map function:
Repa.map :: (Shape sh, Source r a) => (a -> b) -> Array r sh a -> Array D sh b
We can see from the type that map returns an array with the
representation D.  The D representation stands for Delayed; this
means that the array has not been computed yet.  A delayed array is
represented by a function from indices to elements.
We can apply map to an array, but there’s no way to print out the
result:
> let a = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> Repa.map (+1) a
<interactive>:26:1:
    No instance for (Show (Array D DIM1 Int))
      arising from a use of `print'
    Possible fix:
      add an instance declaration for (Show (Array D DIM1 Int))
    In a stmt of an interactive GHCi command: print it
As its name suggests, a delayed array is not an array yet.  To turn
it into an array, we have to call a function that allocates the array
and computes the value of each element.  The computeS function does
this for us:
computeS :: (Load r1 sh e, Target r2 e) => Array r1 sh e -> Array r2 sh e
The argument to computeS is an array with a representation that is a
member of the Load class, whereas its result is an array with a
representation that is a member of the Target class.  The most
important instances of these two classes are D and U respectively;
that is, computeS turns a delayed array into a concrete unboxed
array.[20]
Applying computeS to the result of map gives us an unboxed array:
> computeS (Repa.map (+1) a) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [2,3,4,5,6,7,8,9,10,11])
You might be wondering why there is this extra complication—why
doesn’t map just produce a new array?  The answer is that by
representing the result of an array operation as a delayed array, a
sequence of array operations can be performed without ever building
the intermediate arrays; this is an optimization called fusion, and
it’s critical to achieving good performance with Repa.  For example,
if we compose two maps together:
> computeS (Repa.map (+1) (Repa.map (^2) a)) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [2,5,10,17,26,37,50,65,82,101])
The intermediate array between the two maps is not built, and in
fact if we compile this rather than running it in GHCi, provided the
optimization option -O is enabled, it will compile to a single
efficient loop over the input array.
Let’s see how it works.  The fundamental way to get a delayed array is
fromFunction:
fromFunction :: sh -> (sh -> a) -> Array D sh a
The fromFunction operation creates a delayed array.  It takes the
shape of the array and a function that specifies the elements.  For
example, we can make a delayed array that represents the vector of
integers 0 to 9 like this:
> let a = fromFunction (Z :. 10) (\(Z:.i) -> i :: Int)
> :t a
a :: Array D (Z :. Int) Int
Delayed arrays support indexing, just like manifest arrays:
> a ! (Z:.5)
5
Indexing a delayed array works by just calling the function that we
supplied to fromFunction with the given index.
We need to apply computeS to make the delayed array into a manifest array:
> computeS a :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [0,1,2,3,4,5,6,7,8,9])
The computeS function creates the array and for each of the indices
of the array, it calls the function stored in the delayed array to find
the element at that position.
The map function, along with many other operations on arrays, can be
specified in terms of fromFunction.  For example, here is a
definition of map:
> let mymap f a = fromFunction (extent a) (\ix -> f (a ! ix))
> :t mymap
mymap
  :: (Shape sh, Source r e) =>
     (e -> a) -> Array r sh e -> Array D sh a
It works just like the real map:
> computeS (mymap (+1) a) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,3,4,5,6,7,8,9,10])
What happens if we compose two maps together?  The result would
be a delayed array containing a function that indexes into another
delayed array.  So we’re building up a nested function that defines
the array elements, rather than intermediate arrays.  Furthermore,
Repa is carefully engineered so that at compile time the nested
function call is optimized away as far as possible, yielding very
efficient code.
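We can check this interactively by composing two calls to mymap; only
the final computeS allocates an array (here a is the delayed array of
0 to 9 that we built earlier):

> computeS (mymap (+1) (mymap (^2) a)) :: Array U DIM1 Int
AUnboxed (Z :. 10) (fromList [1,2,5,10,17,26,37,50,65,82])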
In “Example: Shortest Paths in a Graph”, we looked at an implementation of the Floyd-Warshall algorithm for computing the lengths of shortest paths in a sparse weighted directed graph. Here, we’ll investigate how to code up the algorithm over dense graphs, using Repa.
For reference, here is the pseudocode definition of the algorithm:
shortestPath :: Graph -> Vertex -> Vertex -> Vertex -> Weight
shortestPath g i j 0 = weight g i j
shortestPath g i j k = min (shortestPath g i j (k-1))
                           (shortestPath g i k (k-1) + shortestPath g k j (k-1))
We implement this by first computing all the shortest paths for k ==
0, then k == 1, and so on up to the maximum vertex in the graph.
For the dense version, we’re going to use an adjacency matrix; that is, a two-dimensional array indexed by pairs of vertices, where each element is the length of the path between the two vertices. Here is our representation of graphs:
fwdense.hs
type Weight = Int
type Graph r = Array r DIM2 Weight
The implementation of the shortest paths algorithm is as follows:
shortestPaths :: Graph U -> Graph U
shortestPaths g0 = go g0 0
  where
    Z :. _ :. n = extent g0

    go !g !k
      | k == n    = g
      | otherwise = let g' = computeS (fromFunction (Z:.n:.n) sp)
                    in  go g' (k+1)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j)) (g ! (Z:.i:.k) + g ! (Z:.k:.j))
The code is quite readable and somewhat shorter than the sparse version of the algorithm we saw before. However, there are a couple of subtleties that might not be obvious, but are nevertheless important for making the code run quickly:
The loop is written as an explicitly recursive function, go, rather
than something like foldl', even though the latter would lead to
shorter code.  The optimizations in Repa work much better when all the
code is visible to the compiler, and calling out to library functions
can sometimes hide details from GHC and prevent optimizations.  There
are no hard and fast rules here; I experimented with both the explicit
version and the foldl' version, and found the explicit loop faster
(see the sketch after these notes).
Note the bang-patterns on the arguments to go.  This is good practice
for iterative loops like this one and helps Repa to optimize the loop.
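For comparison, here is a sketch of what the foldl' version might look
like (my reconstruction of the alternative just mentioned; it is not
taken from fwdense.hs):

import Data.List (foldl')

-- The same update step as in shortestPaths, but with the iteration
-- over k expressed as a strict left fold instead of explicit recursion.
shortestPathsFold :: Graph U -> Graph U
shortestPathsFold g0 = foldl' step g0 [0 .. n-1]
  where
    Z :. _ :. n = extent g0

    step g k = computeS (fromFunction (Z:.n:.n) sp)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j)) (g ! (Z:.i:.k) + g ! (Z:.k:.j))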
Let’s go ahead and compile the program and try it out on a 500-vertex graph:
> ghc fwdense.hs -O2 -fllvm
[1 of 1] Compiling Main             ( fwdense.hs, fwdense.o )
Linking fwdense ...
> ./fwdense 500 +RTS -s
31250125000
   1,077,772,040 bytes allocated in the heap
      31,516,280 bytes copied during GC
      10,334,312 bytes maximum residency (171 sample(s))
       2,079,424 bytes maximum slop
              32 MB total memory in use (3 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       472 colls,     0 par    0.01s    0.01s     0.0000s    0.0005s
  Gen  1       171 colls,     0 par    0.03s    0.03s     0.0002s    0.0063s
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.46s  (  1.47s elapsed)
  GC      time    0.04s  (  0.04s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.50s  (  1.50s elapsed)
Note that I added a couple of optimization options: -O2 turns up
GHC’s optimizer, and -fllvm enables GHC’s LLVM backend, which
significantly improves the performance of Repa code; on my machine
with this particular example, I see a 40% improvement from
-fllvm.[21]
Now to make the program run in parallel.  To compute an array in parallel, Repa provides a variant of the computeS operation, called computeP:
computeP :: (Monad m, Source r2 e, Target r2 e, Load r1 sh e)
         => Array r1 sh e -> m (Array r2 sh e)
Whereas computeS computes an array sequentially, computeP uses the available cores to compute the array in parallel.  It knows
the size of the array, so it can divide the work equally amongst the
cores.
The type is almost the same as computeS, except that computeP
takes place in a monad. It works with any
monad, and it doesn’t matter which monad is used because the purpose of the monad is only to ensure that computeP
operations are performed in sequence and not nested.  Hence we need
to modify our code so that the go function is in a monad, which
entails a few small changes.  Here is the code:
shortestPaths :: Graph U -> Graph U
shortestPaths g0 = runIdentity $ go g0 0                  -- (1)
  where
    Z :. _ :. n = extent g0

    go !g !k
      | k == n    = return g                              -- (2)
      | otherwise = do
          g' <- computeP (fromFunction (Z:.n:.n) sp)      -- (3)
          go g' (k+1)
      where
        sp (Z:.i:.j) = min (g ! (Z:.i:.j)) (g ! (Z:.i:.k) + g ! (Z:.k:.j))
(1) We need to use a monad, so the Identity monad will do; runIdentity
unwraps it to give the final result.

(2) Remember to return the final array in the monad when the loop
terminates.

(3) Instead of a let binding around computeS, we bind the result of
computeP monadically.
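Incidentally, GHCi runs IO actions directly, so IO is a convenient
choice of monad for trying computeP interactively:

> let a = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> computeP (Repa.map (+1) a) :: IO (Array U DIM1 Int)
AUnboxed (Z :. 10) (fromList [2,3,4,5,6,7,8,9,10,11])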
To run it in parallel, we’ll need to add the -threaded option when
compiling.  Let’s see how it performs:
> ghc -O2 fwdense1 -threaded -fllvm -fforce-recomp
[1 of 1] Compiling Main             ( fwdense1.hs, fwdense1.o )
Linking fwdense1 ...
> ./fwdense1 500 +RTS -s
31250125000
...
  Total   time    1.89s  (  1.91s elapsed)
There’s some overhead for using computeP, which here seems to be
about 27%.  That’s quite high, but we can recoup it by using more
cores.  With four cores:
> ./fwdense1 500 +RTS -s -N4
31250125000
...
  Total   time    2.15s  (  0.57s elapsed)
That equates to a speedup of 2.63 (1.50s/0.57s) over the sequential version, for almost zero effort. Not bad!
Folds are an important class of operations over arrays: they combine
all the elements of an array to produce a single result, such as the
sum of the array or its maximum element.  For example, the function
sumAllS calculates the sum of all the elements in an array:
sumAllS :: (Num a, Shape sh, Source r a, Unbox a, Elt a)
        => Array r sh a -> a
For an array of elements of type a that supports addition (the Num
constraint), sumAllS produces a single result that is the sum of all
the elements:
> let arr = fromListUnboxed (Z :. 10) [1..10] :: Array U DIM1 Int
> sumAllS arr
55
But sometimes we don’t want to fold over the whole array. There are occasions where we need to fold over just one dimension. For example, in the shortest paths example, suppose we wanted to take the resulting matrix of path lengths and find for each vertex the furthest distance we would have to travel from that vertex to any other vertex in the graph.
Our graph may have some nodes that are not connected, and in that case
we represent the distance between them by a special large value called
inf (the value of inf doesn’t matter as long as it is larger than
all the path lengths in the graph).  For the purposes of finding the
maximum distance to other nodes, we’ll ignore nodes that are not
reachable and hence have path length inf.  So the function to
compute the maximum of two path lengths is as follows:
maxDistance :: Weight -> Weight -> Weight
maxDistance x y
  | x == inf  = y
  | y == inf  = x
  | otherwise = max x y
Now we want to fold maxDistance over just one dimension of our
two-dimensional adjacency matrix.  There is a function called foldS
that does just that; here is its type:
foldS :: (Shape sh, Source r a, Elt a, Unbox a)
      => (a -> a -> a)          -- (1)
      -> a                      -- (2)
      -> Array r (sh :. Int) a  -- (3)
      -> Array U sh a           -- (4)
(1) The function to fold.

(2) The unitary value of type a, which should be an identity of the
folding function (for example, zero for addition, or inf for
maxDistance).

(3) The input array.  Note that the shape is sh :. Int: the fold takes
place over the inner (rightmost) dimension of the array.

(4) The output array has shape sh, that is, one dimension fewer than
the input.
The fwdense.hs program has a small test graph of six vertices:
> extent testGraph
(Z :. 6) :. 6
If we use foldS to fold maxDistance over the matrix of shortest
paths, we obtain the maximum distance from each vertex to any other
vertex:
> foldS maxDistance inf (shortestPaths testGraph)
AUnboxed (Z :. 6) (fromList [20,19,31,18,15,21])
And if we fold once more, we’ll find the longest distance between any two nodes (for which a path exists) in the graph:
> foldS maxDistance inf (foldS maxDistance inf (shortestPaths testGraph))
AUnboxed Z (fromList [31])
Note that the result this time is an array with zero dimensions, otherwise known as a scalar.
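To extract the single element from a zero-dimensional array, we can
index it with the empty index Z:

> foldS maxDistance inf (foldS maxDistance inf (shortestPaths testGraph)) ! Z
31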
A function named foldP allows us to fold in parallel:
foldP :: (Shape sh, Source r a, Elt a, Unbox a, Monad m)
      => (a -> a -> a) -> a -> Array r (sh :. Int) a -> m (Array U sh a)
For the same reasons as computeP, foldP is performed in an
arbitrary monad.  The arguments are the same as for foldS.
The function argument used with foldP must be associative.  That
is, the function f must satisfy f x (f y z) == f (f x y) z.  This
is because unlike foldS, foldP doesn’t necessarily fold the
function over the array elements in strict left-to-right order; it
folds different parts of the array in parallel and then combines the
results from those parts using the folding function.
Note that strictly speaking,
although mathematical addition is associative, floating-point addition
is not, due to rounding errors.  However, we tend to ignore this
detail when using foldP because a small amount of nondeterminism in
the floating-point result is normally acceptable.
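As a quick sketch of foldP in use (my own example, assuming the Graph,
Weight, and inf definitions from fwdense.hs; the Identity monad will
do, since any monad works):

import Data.Functor.Identity (runIdentity)

-- Parallel counterpart of the earlier foldS example: the maximum
-- finite distance from each vertex to any other vertex.
maxDistances :: Graph U -> Array U DIM1 Weight
maxDistances g = runIdentity (foldP maxDistance inf g)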
Repa is a great tool for coding image manipulation algorithms, which tend to be naturally parallel and involve a lot of data. In this section, we’ll write a program to rotate an image about its center by a specified number of degrees.
For reading and writing image data, Repa provides an interface to
the DevIL library, which is a cross-platform C library for image
manipulation.  DevIL supports reading and writing various common
image formats, including PNG and JPG.  The library is wrapped by the
Haskell package repa-devil, which provides a convenient Haskell API
to DevIL.  The two operations we’ll be using are readImage and
writeImage:
readImage  :: FilePath -> IL Image
writeImage :: FilePath -> Image -> IL ()
The Image type defines the various in-memory image representations:
data Image = RGBA (Array F DIM3 Word8)
           | RGB  (Array F DIM3 Word8)
           | BGRA (Array F DIM3 Word8)
           | BGR  (Array F DIM3 Word8)
           | Grey (Array F DIM2 Word8)
A color image is represented as a three-dimensional array. The first
two dimensions are the Y and X axes, and the last dimension contains the
three color channels and optionally an alpha channel.  The first four
constructors of Image correspond to different orderings of the color
channels and the presence or not of an alpha channel.  The last
option, Grey, is a grayscale image with one byte per pixel.
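As an aside, individual channels of these arrays are easy to get at
with Repa's slice operation. Here is a sketch of my own (not part of
repa-devil) that extracts channel 0, which is red under the RGB
ordering:

import Data.Array.Repa as Repa
import Data.Array.Repa.Repr.ForeignPtr (F)
import Data.Word (Word8)

-- Fix the innermost (channel) dimension at 0 to obtain a delayed
-- two-dimensional array of red values.
redChannel :: Array F DIM3 Word8 -> Array D DIM2 Word8
redChannel img = slice img (Z :. All :. All :. (0 :: Int))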
Which of these is returned by readImage depends on the type of image
file being read.  For example, a color JPEG image is returned as RGB
data, whereas a PNG image is returned as RGBA.
You may have noticed one unfamiliar aspect to these array types: the
F representation type.  This indicates that the array data is held
in foreign memory; that is, it was allocated by C code.  Apart from
being allocated by C rather than Haskell, the F
representation is identical to U.
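If you do need the pixel data in Haskell-managed memory, one
straightforward way (a sketch of mine using only standard Repa
operations, with the same imports as the previous sketch) is to copy
through a delayed array:

-- Copy a foreign-memory array into an ordinary unboxed array by
-- delaying it and then forcing it into the U representation.
copyToUnboxed :: Array F DIM3 Word8 -> Array U DIM3 Word8
copyToUnboxed = computeS . delay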
Note that readImage and writeImage are in the IL monad.  The
purpose of the IL monad is to ensure that the DevIL library
is initialized properly. This is done by runIL:
runIL :: IL a -> IO a
It’s perfectly fine to have multiple calls to runIL; the library
will be initialized only once.
Our program will take three arguments: the number of degrees to rotate the image by, the input filename, and the output filename:
main :: IO ()
main = do
  [n, f1, f2] <- getArgs
  runIL $ do
    (RGB v) <- readImage f1
    rotated <- computeP $ rotate (read n) v :: IL (Array F DIM3 Word8)
    writeImage f2 (RGB rotated)
Next we’ll write the function rotate, which actually calculates the
rotated image data.  First, we have a decision to make: what should the
size of the rotated image be?  We can either produce a smaller image
than the input, discarding any pixels that fall outside the boundaries
after rotation, or adjust the image size to contain the whole rotated
image, filling in the empty areas with something else (e.g., black).
I'll opt, somewhat arbitrarily, to keep the output
image size the same as the input and fill in the empty areas with
black.  Please feel free to modify the program to do something more
sensible.
rotate :: Double -> Array F DIM3 Word8 -> Array D DIM3 Word8
rotate deg g = fromFunction (Z :. y :. x :. k) f       -- (1)
  where
    sh@(Z :. y :. x :. k) = extent g
    !theta = pi/180 * deg                              -- (2)
    !st = sin theta                                    -- (3)
    !ct = cos theta
    !cy = fromIntegral y / 2 :: Double                 -- (4)
    !cx = fromIntegral x / 2 :: Double

    f (Z :. i :. j :. k)                               -- (5)
      | inShape sh old = g ! old                       -- (6)
      | otherwise      = 0                             -- (7)
      where
        fi = fromIntegral i - cy                       -- (8)
        fj = fromIntegral j - cx
        i' = round (st * fj + ct * fi + cy)            -- (9)
        j' = round (ct * fj - st * fi + cx)
        old = Z :. i' :. j' :. k                       -- (10)
The formula to rotate a point (x, y) by an angle θ about the origin is:

x' = x cos θ − y sin θ
y' = x sin θ + y cos θ

However, we want to rotate our image about its center, whereas the origin of our coordinate system is the upper-left corner. Hence we need to adjust the points to be relative to the center of the image before rotating and adjust them back afterward.
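Concretely (my restatement of what the function f computes), for an
image with center (cx, cy), the pixel at position (j, i) in the new
image is taken from position (j', i') in the old image, where:

j' = cx + (j − cx) cos θ − (i − cy) sin θ
i' = cy + (j − cx) sin θ + (i − cy) cos θ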
(1) We're creating a delayed array, represented by the function f,
with the same shape as the input image g.

(2) Convert the angle by which to rotate the image from degrees to
radians.

(3) Because we'll need the values of sin theta and cos theta several
times, compute them once and share the results.

(4) cy and cx are the coordinates of the center of the image, the
point about which we're rotating.

(5) The function f gives the value of the pixel at index
Z :. i :. j :. k in the new image.

(6) First, we need to check whether the corresponding position in the
old image, old, lies within the bounds of the image.  If the old
position is in bounds, we take the pixel value from there.

(7) If the rotated position in the old image is out of bounds, then we
return zero, giving a black pixel at this position in the new image.

(8) fi and fj are the coordinates of the pixel relative to the center
of the image.

(9) i' and j' are the coordinates of the corresponding pixel in the
old image, found by rotating (fj, fi) about the center.

(10) Finally, old is the index of the pixel in the old image that
supplies the value for this pixel in the new image.
To see the program working, we first need an image to rotate: Figure 5-1.

[Figure 5-1: the input image, wonky.jpg]
Running the program like so results in the straightened image shown in Figure 5-2:

$ ./rotateimage 4 wonky.jpg straight.jpg

[Figure 5-2: the straightened output image, straight.jpg]
Let’s check the performance of the program:
$ rm straight.jpg
$ ./rotateimage 4 wonky.jpg straight.jpg +RTS -s
...
  Total   time    0.69s  (  0.69s elapsed)
And see how much we can gain by running it in parallel, on four cores:
$ ./rotateimage 4 wonky.jpg straight.jpg +RTS -s -N4
...
  Total   time    0.76s  (  0.24s elapsed)
The result is a speedup of 2.88. However, the program spends 0.05s reading and writing the image file (measured by modifying the program to omit the rotation step); if we factor this into the results, the parallel portion of the program achieves a speedup of 3.39.
Repa provides a convenient framework for describing array operations and has some significant benefits:

Fusion means that compositions of array operations are optimized down to single efficient loops, without allocating the intermediate arrays.

Operations like computeP and foldP automatically parallelize using the available cores.
There are a couple of gotchas to bear in mind:

To get the best performance, you may need to help the compiler along with bang-patterns and INLINE pragmas.  A good rule of thumb is to use both of these liberally.  You might also need to use simpler code and fewer library functions so that GHC can see all of your code and optimize it.

Remember to use the -fllvm option if your computer supports it.
There’s much more to Repa that we haven’t covered. For example, Repa has support for stencil convolutions: a common class of image-processing algorithms in which a transformation on each pixel is calculated as some function of the surrounding pixels. For certain kinds of stencil functions that are known at compile time, Repa can generate specialized code that runs extremely fast.
To learn more, take a look at the full Repa documentation on Hackage.
[19] Note that we’re using Repa version 3.2 here; 3.0 had a somewhat different API.
[20] There are other array representations that aren’t covered in this chapter; for more details, see the Repa documentation.
[21] You might not have LLVM installed on your computer, in which case the -fllvm option will not work.  Don’t worry: Repa works perfectly well without it. The code will just be slower.