Providence Salumu
In the last blog post, we discussed the high-level API used inside the GHC compiler. This time let’s zoom into the pieces that comprise the frontend of the compilation pipeline, namely:
The following functions are the main passes are that are used inside of the compiler driver functions in HscMain
.
-- Parse a module.
parseModule :: GhcMonad m => ModSummary -> m ParsedModule
-- Typecheck and rename a parsed module.
typecheckModule :: GhcMonad m => ParsedModule -> m TypecheckedModule
-- Desugar a typechecked module.
desugarModule :: GhcMonad m => TypecheckedModule -> m DesugaredModule
-- Extract ModGuts
coreModule :: DesugaredMod m => m -> ModGuts
-- Generated STG bindings from Core module
coreToStg :: DynFlags -> Module -> CoreProgram -> IO [StgBinding]
Each of the these functions manipulates an intermediate form used in the process of compiling a Haskell module into a form that can be compiled into machine code or interpreted.
The frontend syntax for Haskell is fairly large and is partially defined in the Haskell 2010 specification and extended by GHC. The parser is written using two custom tools:
Happy is a parser generator system for Haskell, similar to the tool `yacc’ for C. It works as a preprocessor with it’s own syntax that generates a parse table from two specifications, a lexer file and parser file. The Happy parser consists of a list of a production rules and a monad to running the parser in. Production rules consist of a set of options on the left and generating Haskell expressions on the right with indexed variables ($1
, $2
, …) mapping to the ordered terms on the left (i.e. in the second term term
~ $1
, term
~ $2
).
terms
: term { [$1] }
| term terms { $1 : $2 }
The most complicated part of the lexer is with regards to whitespace-sensitive layout. To handle whitespace the lexer will analyze the line position of certain productions and will output virtual tokens (vbrace, vsemi, etc) which are used to canonicalize the syntax. For example the use of layout rule in do-notation:
main = do
putStrLn "Hello"
putStrLn "Haskell"
Is converted into explicit virtual braces and semicolons at parse time.
main = do { putStrLn "Hello"; putStrLn "Haskell" }
The main entry points are defined in the Parser and resolve into a monad P
which holds state about the source buffer, and filename and location. The resulting AST is called HsSyn and contains position information preserved from the tokens.
parseType :: P (LHsType RdrName)
parseExpression :: P (LHsExpr RdrName)
parseModule :: P (Located (HsModule RdrName))
Frontend syntax in GHC carries position information along with it that can be used in error reporting. Location information is defined by a Located
type which has a single type paramater for the AST element to be “located”. Typically type synonyms follow the convention that a “L” before the type indicates a synonym for a located element. So a located HsExpr
is a LHsExxpr
.
data GenLocated l e = L l e
type Located e = GenLocated SrcSpan e
unLoc :: GenLocated l e -> e
unLoc (L _ e) = e
getLoc :: GenLocated l e -> l
getLoc (L l _) = l
noLoc :: e -> Located e
noLoc e = L noSrcSpan e
Most notably there is a functor instance for fmap
ing over a located element, which modifies the underlying type preserving the location.
instance Functor (GenLocated l) where
fmap f (L l e) = L l (f e)
The frontend syntax (called HsSyn) is spread across several data types which contain all possible syntax elements generated from the parser. These are parameterized over the name type so that transformations (renamer, etc) can alter the structure of bindings and identifiers while preserving the syntax tree structure.
There are many variants of identifiers used in different phases of compilation. In order of generality:
Names are disambiguated from each other by an internal unique numerical key called a Unique , which are generated from a UniqSupply . These are often embedded within the monad transformer MonadUnique
layer of the pass.
After typechecking we deal with Var and Id names which have the property that the Name and Type can always be recovered using the following functions.
varName :: Var -> Name
varType :: Var -> Kind
In addition, many lookup functions use a sum type TyThing
which wraps “things” in scope which have types, such as identifiers and type constructors.
data TyThing
= AnId Id
| AConLike ConLike.ConLike
| ATyCon TyCon
| ACoAxiom (CoAxiom Branched)
Inside the GHC API is a pretty printer which can be used to dump out a pretty printed summary of a large number of the internal types. The structure of the class is similar to Show
typeclass.
class Outputable a where
ppr :: a -> SDoc
pprPrec :: Rational -> a -> SDoc
showPpr :: Outputable a => DynFlags -> a -> String
To simplify our life a bit we can write a wrapper showGhc
which uses the default dynflags to dump out an instance of Outputable
.
showGhc :: (Outputable a) => a -> String
showGhc = showPpr unsafeGlobalDynFlags
Given the following example module we’d like to trace the transformation of the intermediate forms as it’s run through the compiler.
module Example where
data Animal = Cat | Dog
add :: Int -> Int -> Int
add x y = x + y
Using the machinery of the last post we import the module target, load it, and then instead of adding it to an interactive context we explicitly call the functions of the individual passes and dump our their structure to the screen.
main :: IO ()
main = runGhc (Just libdir) $ do
env <- getSession
dflags <- getSessionDynFlags
setSessionDynFlags $ dflags { hscTarget = HscInterpreted }
target <- guessTarget "Example.hs" Nothing
setTargets [target]
load LoadAllTargets
modSum <- getModSummary $ mkModuleName "Example"
pmod <- parseModule modSum -- ModuleSummary
tmod <- typecheckModule pmod -- TypecheckedSource
dmod <- desugarModule tmod -- DesugaredModule
let core = coreModule dmod -- ModGuts
stg <- liftIO $ coreToStg dflags (mg_module core) (mg_binds core)
liftIO $ banner "Parsed Source"
liftIO $ putStrLn $ showGhc ( parsedSource pmod )
liftIO $ banner "Renamed Module"
liftIO $ putStrLn $ showGhc ( tm_renamed_source tmod )
liftIO $ banner "Typechecked Module"
liftIO $ putStrLn $ showGhc ( tm_typechecked_source tmod )
liftIO $ banner "Typed Toplevel Definitions"
liftIO $ putStrLn $ showGhc ( modInfoTyThings (moduleInfo tmod) )
liftIO $ banner "Typed Toplevel Exports"
liftIO $ putStrLn $ showGhc ( modInfoExports (moduleInfo tmod) )
liftIO $ banner "Core Module"
liftIO $ putStrLn $ showGhc ( mg_binds core )
liftIO $ banner "STG"
liftIO $ putStrLn $ showGhc stg
From the module summary we can extract the underlying text stream of the source file Example.hs
. Running the parser over this generates a ParsedSource datatype which contains the HsSyn AST with raw OccNames.
data ParsedModule = ParsedModule
{ pm_mod_summary :: ModSummary
, pm_parsed_source :: ParsedSource
, pm_extra_src_files :: [FilePath]
, pm_annotations :: ApiAnns
}
===============================Parsed Source===============================
module Example where
data Animal = Cat | Dog
add :: Int -> Int -> Int
add x y = x + y
After parsing we run the renamer pass which is responsible for resolving scopes, operator precedences, and unique’ifying all names so that clashes cannot occur. The renamer resolves RdrName
s into Name
s across all syntax elements. In addition it annotates each name with whether they are qualified and global or local to the current module. If a name is not in scope, compilation simply halts here and reports the missing name.
===============================Renamed Module===============================
Just (add :: Int -> Int -> Int
add x y = x + y
data Animal = Cat | Dog,
[import (implicit) Prelude],
Nothing,
Nothing)
The renamed AST is then run through the typechecker which maps Name
s into Id
and enriches them with inferred types using type inference.
data TypecheckedModule = TypecheckedModule
{ tm_parsed_module :: ParsedModule
, tm_renamed_source :: Maybe RenamedSource
, tm_typechecked_source :: TypecheckedSource
, tm_checked_module_info :: ModuleInfo
, tm_internals_ :: (TcRnTypes.TcGblEnv, HscTypes.ModDetails)
}
The resulting structure is a typechecked module which contains the toplevel types of all bindings and all subexpressions.
=============================Typechecked Module=============================
{AbsBinds [] []
{Exports: [add <= add
<>]
Exported types: add :: Int -> Int -> Int
[LclId, Str=DmdType]
Binds: add x y = (+) x y}}
The resulting datatype has a field for ModuleInfo
, which contains various metadata about the typing environment, the names in scope, and typeclasses exported.
data ModuleInfo = ModuleInfo
{ minf_type_env :: HscTypes.TypeEnv
, minf_exports :: NameSet.NameSet
, minf_rdr_env :: Maybe RdrName.GlobalRdrEnv
, minf_instances :: [ClsInst]
, minf_iface :: Maybe ModIface
, minf_safe :: SafeHaskellMode
, minf_modBreaks :: ModBreaks
}
Using the coreModule
we can also construct the gruesomely named ModGuts
datatype which contains an expanded form of all the various exports and types with associated metadata.
data ModGuts = ModGuts
{ mg_module :: !Module,
, mg_boot :: IsBootInterface
, mg_exports :: ![Avail.AvailInfo]
, mg_deps :: !Dependencies
, mg_dir_imps :: !ImportedMods
, mg_used_names :: !NameSet.NameSet
, mg_used_th :: !Bool
, mg_rdr_env :: !RdrName.GlobalRdrEnv
, mg_fix_env :: !FixityEnv
, mg_tcs :: ![TyCon]
, mg_insts :: ![ClsInst]
, mg_fam_insts :: ![FamInst]
, mg_patsyns :: ![PatSyn.PatSyn]
, mg_rules :: ![CoreSyn.CoreRule]
, mg_binds :: !CoreSyn.CoreProgram
, mg_foreign :: !ForeignStubs
, mg_warns :: !Warnings
, mg_anns :: [Annotations.Annotation]
, mg_hpc_info :: !HpcInfo
, mg_modBreaks :: !ModBreaks
, mg_vect_decls :: ![CoreSyn.CoreVect]
, mg_vect_info :: !VectInfo
, mg_inst_env :: InstEnv.InstEnv
, mg_fam_inst_env :: FamInstEnv.FamInstEnv
, mg_safe_haskell :: SafeHaskellMode
, mg_trust_pkg :: Bool
, mg_dependent_files :: [FilePath]
}
The ModuleInfo can be queried for any type of information about types or names at this point. For instance if we were interesed in the types of a global name we could look it up as a TyThing
inside the typing environment.
modInfoTyThings :: ModuleInfo -> [TyThing]
modInfoExports :: ModuleInfo -> [Name]
=========================Typed Toplevel Definitions=========================
[Type constructor ‘Animal’, Data constructor ‘Cat’,
Data constructor ‘Dog’, Identifier ‘add’, Identifier ‘Cat’,
Identifier ‘Dog’]
===========================Typed Toplevel Exports===========================
[Animal, Cat, Dog, add]
The Core language is GHC’s most central data types. Core is a very small, explicitly-typed, variant of System FC. The output of the typecheckModule
function is a datatype CoreModule
which contains the Core program as the typing environment.
data CoreModule = CoreModule
{ cm_module :: !Module
, cm_types :: !HscTypes.TypeEnv
, cm_binds :: CoreSyn.CoreProgram
, cm_safe :: SafeHaskellMode
}
Looking at the core module we see that each name in the program is now annotated with an explicit type and typeclass dictionaries have been inserted.
================================Core Module================================
[add :: Int -> Int -> Int
[LclIdX, Str=DmdType]
add = \ (x :: Int) (y :: Int) -> + @ Int $fNumInt x y]
Core then is translated into a intermediate form that is amenable to various analysis passes that optimize the program for code generation. This form is an abstract machine known a Spineless Tagless G-Machine. In this syntax every closure has an associated Static Reference Table or SRT. More on this later.
====================================STG====================================
[add :: Int -> Int -> Int
[LclIdX, Str=DmdType] =
\r srt:SRT:[r6ry :-> $fNumInt] [x y] + $fNumInt x y;]
We mentioned Core and STG only briefly. The process by which frontend syntax is transformed into Core will be our next topic. The design of Core defines a great deal of aspects about how the language is structured and informs the type system as well as code-generation.