Generic programming is a powerful way to define a function that works in an analogous way for a class of types. In this article, I describe the latest approach to generic programming that is implemented in GHC. This approach goes back to the paper A Generic Deriving Mechanism for Haskell by José Pedro Magalhães, Atze Dijkstra, Johan Jeuring, and Andres Löh.
This article is a writeup of a Theory Lunch talk I gave on 4 February 2016. As usual, the source of this article is a literate Haskell file, which you can download, load into GHCi, and play with.
Motivation
Parametric polymorphism allows you to write functions that deal with values of any type. An example of such a function is the reverse
function, whose type is [a] -> [a]
. You can apply reverse
to any list, no matter what types the elements have.
However, parametric polymorphism does not allow your functions to depend on the structure of the concrete types that are used in place of type variables. So values of these types are always treated as black boxes. For example, the reverse
function only reorders the elements of the given list. A function of type [a] -> [a]
could also drop elements (like the tail
function does) or duplicate elements (like the cycle
function does), but it could never invent new elements (except for ⊥) or analyze elements.
Now there are situation where a function is suitable for a class of types that share certain properties. For example, the sum
function works for all types that have a notion of binary addition. Haskell uses type classes to support such functions. For example, the Num
class provides the method (+)
, which is used in the definition of sum
, whose type Num a => [a] -> a
contains a respective class constraint.
The methods of a class have to be implemented separately for every type that is an instance of the class. This is reasonable for methods like (+)
, where the implementations for the different instances differ fundamentally. However, it is unfortunate for methods that are implemented in an analogous way for most of the class instances. An example of such a method is (==)
, since there is a canonical way of checking values of algebraic data types for equality. It works by first comparing the outermost data constructors of the two given values and if they match, the individual fields. Only when the data constructors and all the fields match, the two values are considered equal.
For several standard classes, including Eq
, Haskell provides the deriving mechanism to generate instances with default method implementations whose precise functionality depends on the structure of the type. Unfortunately, there is no possibility in standard Haskell to extend this deriving mechanism to user-defined classes. Generic programming is a way out of this problem.
Prerequisites
For generic programming, we need several language extensions. The good thing is that only one of them, DeriveGeneric
, is specific to generic programming. The other ones have uses in other areas as well. Furthermore, DeriveGeneric
is a very small extension. So the generic programming approach we describe here can be considered very lightweight.
We state the full set of necessary extensions with the following pragma:
{-# LANGUAGE DefaultSignatures,
DeriveGeneric,
FlexibleContexts,
TypeFamilies,
TypeOperators #-}
Apart from these language extensions, we need the module GHC.Generics
:
import GHC.Generics
Our running example
As our running example, we pick serialization and deserialization of values. Serialization means converting a value into a bit string, and deserialization means parsing a bit string in order to get back a value.
We introduce a type Bit
for representing bits:
data Bit = O | I deriving (Eq, Show)
Furthermore, we define the class of all types that support serialization and deserialization as follows:
class Serializable a where
put :: a -> [Bit]
get :: [Bit] -> (a, [Bit])
There is a canonical way of serializing values of algebraic data types. It works by first encoding the data constructor of the given value as a sequence of bits and then serializing the individual fields. To show this approach in action, we define an algebraic data type Tree
, which is a type of labeled binary trees:
data Tree a = Leaf | Branch (Tree a) a (Tree a) deriving Show
An instantiation of Serializable
for Tree
that follows the canonical serialization approach can be carried out as follows:
instance Serializable a => Serializable (Tree a) where
put Leaf = [O]
put (Branch left root right) = [I] ++
put left ++
put root ++
put right
get (O : bits) = (Leaf, bits)
get (I : bits) = (Branch left root right, bits''') where
(left, bits') = get bits
(root, bits'') = get bits'
(right, bits''') = get bits''
Of course, it quickly becomes cumbersome to provide such an instance declaration for every algebraic data type that should use the canonical serialization approach. So we want to implement the canonical approach once and for all and make it easily usable for arbitrary types that are amenable to it. Generic programming makes this possible.
Representations
An algebraic data type is essentially a sum of products where the terms “sum” and “product” are understood as follows:
-
A sum is a variant type. In Haskell,
Either
is the canonical type constructor for binary sums, and the empty typeVoid
from thevoid
package is the nullary sum. -
A product is a tuple type. In Haskell,
(,)
is the canonical type constructor for binary products, and()
is the nullary product.
The key idea of generic programming is to map types to representations that make the sum-of-products structure explicit and to implement canonical behavior based on these representations instead of the actual types.
The GHC.Generics
module defines a number of type constructors for constructing representations:
data V1 p
infixr 5 :+:
data (:+:) f g p = L1 (f p) | R1 (g p)
data U1 p = U1
infixr 6 :*:
data (:*:) f g p = f p :*: g p
newtype K1 i a p = K1 { unK1 :: a }
newtype M1 i a f p = M1 { unM1 :: f p }
All of these type constructors take a final parameter p
. This parameter is relevant only when dealing with higher-order classes. In this article, however, we only discuss generic programming with first-order classes. In this case, the parameter p
is ignored. The different type constructors play the following roles:
-
V1
is for the nullary sum. -
(:+:)
is for binary sums. -
U1
is for the nullary product. -
(:*:)
is for binary products. -
K1
is a wrapper for fields of algebraic data types. Its parameteri
used to provide some information about the field at the type level, but is now obsolete. -
M1
is a wrapper for attaching meta information at the type level. Its parameteri
denotes the kind of the language construct the meta information refers to, and its parameterc
provides access to the meta information.
The GHC.Generics
module furthermore introduces a class Generic
, whose instances are the types for which a representation exists. Its definition is as follows:
class Generic a where
type Rep a :: * -> *
from :: a -> (Rep a) p
to :: (Rep a) p -> a
A type Rep a
is the representation of the type a
. The methods from
and to
convert from values of the actual type to values of the representation type and vice versa.
To see all this in action, we make Tree a
an instance of Generic
:
instance Generic (Tree a) where
type Rep (Tree a) =
M1 D D1_Tree (
M1 C C1_Tree_Leaf U1
:+:
M1 C C1_Tree_Branch (
M1 S NoSelector (K1 R (Tree a))
:*:
M1 S NoSelector (K1 R a)
:*:
M1 S NoSelector (K1 R (Tree a))
)
)
from Leaf = M1 (L1 (M1 U1))
from (Branch left root right) = M1 (
R1 (
M1 (
M1 (K1 left)
:*:
M1 (K1 root)
:*:
M1 (K1 right)
))
)
to (M1 (L1 (M1 U1))) = Leaf
to (M1 (
R1 (
M1 (
M1 (K1 left)
:*:
M1 (K1 root)
:*:
M1 (K1 right)
))
)) = Branch left root right
The types D1_Tree
, C1_Tree_Leaf
, and C1_Tree_Branch
are type-level representations of the type constructor Tree
, the data constructor Leaf
, and the data constructor Branch
, respectively. We declare them as empty types:
data D1_Tree
data C1_Tree_Leaf
data C1_Tree_Branch
We need to make these types instances of the classes Datatype
and Constructor
, which are part of GHC.Generics
as well. These classes provide a link between the type-level representations of type and data constructors and the meta information related to them. This meta information particularly covers the identifiers of the type and data constructors, which are needed when implementing canonical implementations for methods like show
and read
. The instance declarations for the Tree
-related types are as follows:
instance Datatype D1_Tree where
datatypeName _ = "Tree"
moduleName _ = "Main"
instance Constructor C1_Tree_Leaf where
conName _ = "Leaf"
instance Constructor C1_Tree_Branch where
conName _ = "Branch"
Instantiating the Generic
class as shown above is obviously an extremely tedious task. However, it is possible to instantiate Generic
completely automatically for any given algebraic data type, using the deriving
syntax. This is what the DeriveGeneric
language extension makes possible.
So instead of making Tree a
an instance of Generic
by hand, as we have done above, we could have declared the Tree
type as follows in the first place:
data Tree a = Leaf | Branch (Tree a) a (Tree a)
deriving (Show, Generic)
Implementing canonical behavior
As mentioned above, we implement canonical behavior based on representations. Let us see how this works in the case of the Serializable
class.
We introduce a new class Serializable'
whose methods provide serialization and deserialization for representation types:
class Serializable' f where
put' :: f p -> [Bit]
get' :: [Bit] -> (f p, [Bit])
We instantiate this class for all the representation types:
instance Serializable' U1 where
put' U1 = []
get' bits = (U1, bits)
instance (Serializable' r, Serializable' s) =>
Serializable' (r :*: s) where
put' (rep1 :*: rep2) = put' rep1 ++ put' rep2
get' bits = (rep1 :*: rep2, bits'') where
(rep1, bits') = get' bits
(rep2, bits'') = get' bits'
instance Serializable' V1 where
put' _ = error "attempt to put a void value"
get' _ = error "attempt to get a void value"
instance (Serializable' r, Serializable' s) =>
Serializable' (r :+: s) where
put' (L1 rep) = O : put' rep
put' (R1 rep) = I : put' rep
get' (O : bits) = let (rep, bits') = get' bits in
(L1 rep, bits')
get' (I : bits) = let (rep, bits') = get' bits in
(R1 rep, bits')
instance Serializable' r => Serializable' (M1 i a r) where
put' (M1 rep) = put' rep
get' bits = (M1 rep, bits') where
(rep, bits') = get' bits
instance Serializable a => Serializable' (K1 i a) where
put' (K1 val) = put val
get' bits = (K1 val, bits') where
(val, bits') = get bits
Note that in the case of K1
, the context mentions Serializable
, not Serializable'
, and the methods put'
and get
call put
and get
, not put'
and get'
. The reason is that the value wrapped in K1
has an ordinary type, not a representation type.
Accessing canonical behavior
We can now apply canonical behavior to ordinary types using the methods from
and to
from the Generic
class. For example, we can implement functions defaultPut
and defaultGet
that provide canonical serialization and deserialization for all instances of Generic
:
defaultPut :: (Generic a, Serializable' (Rep a)) =>
a -> [Bit]
defaultPut = put' . from
defaultGet :: (Generic a, Serializable' (Rep a)) =>
[Bit] -> (a, [Bit])
defaultGet bits = (to rep, bits') where
(rep, bits') = get' bits
We can use these functions in instance declarations for Serializable
. For example, we can make Tree a
an instance of Serializable
in the following way:
instance Serializable a => Serializable (Tree a) where
put = defaultPut
get = defaultGet
Compared to the instance declaration we had initially, this one is a real improvement, since we do not have to implement the desired behavior of put
and get
by hand anymore. However, it still contains boilerplate code in the form of the trivial method declarations. It would be better to establish defaultPut
and defaultGet
as defaults in the class declaration:
class Serializable a where
put :: a -> [Bit]
put = defaultPut
get :: [Bit] -> (a, [Bit])
get = defaultGet
However, this is not possible, since the types of defaultPut
and defaultGet
are less general than the types of put
and get
, as they put additional constraints on the type a
. Luckily, GHC supports the language extension DefaultSignatures
, which allows us to give default implementations that have less general types than the actual methods (and consequently work only for those instances that are compatible with these less general types). Using DefaultSignatures
, we can declare the Serializable
class as follows:
class Serializable a where
put :: a -> [Bit]
default put :: (Generic a, Serializable' (Rep a)) =>
a -> [Bit]
put = defaultPut
get :: [Bit] -> (a, [Bit])
default get :: (Generic a, Serializable' (Rep a)) =>
[Bit] -> (a, [Bit])
get = defaultGet
With this class declaration in place, we can make Tree a
an instance of Serializable
as follows:
instance Serializable a => Serializable (Tree a)
With the minor extension DeriveAnyClass
, which is provided by GHC starting from Version 7.10, we can even use the deriving
keyword to instantiate Serializable
for Tree a
. As a result, we only have to write the following in order to define the Tree
type and make it an instance of Serializable
:
data Tree a = Leaf | Branch (Tree a) a (Tree a)
deriving (Show, Generic, Serializable)
So finally, we can use our own classes like the Haskell standard classes regarding the use of deriving clauses, except that we have to additionally derive an instance declaration for Generic
.
Specialized implementations
Usually, not all instances of a class should or even can be generated by means of generic programming, but some instances have to be crafted by hand. For example, making Int
an instance of Serializable
requires manual work, since Int
is not an algebraic data type.
However, there is no problem with this, since we still have the opportunity to write explicit instance declarations, despite the presence of a generic solution. This is in line with the standard deriving mechanism: you can make use of it, but you are not forced to do so. So we can have the following instance declaration, for example:
instance Serializable Int where
put val = replicate val I ++ [O]
get bits = (length is, bits') where
(is, O : bits') = span (== I) bits
Of course, the serialization approach we use here is not very efficient, but the instance declaration illustrates the point we want to make.