Providence Salumu
Table of Contents
The task of parsing a file, or data of various types, is a common one for programmers. We already learned about Haskell's support for regular expressions back in the section called “Regular expressions in Haskell”. Regular expressions are nice for many tasks, but they rapidly become unwieldy, or cannot be used at all, when dealing with a complex data format. For instance, we cannot use regular expressions to parse source code from most programming languages.
Parsec is a useful parser combinator library, with which we combine small parsing functions to build more sophisticated parsers. Parsec provides some simple parsing functions, as well as functions to tie them all together. It should come as no surprise that this parser library for Haskell is built around the notion of functions.
It's helpful to know where Parsec fits compared to the tools used for parsing in other languages. Parsing is sometimes divided into two stages: lexical analysis (the domain of tools like flex) and parsing itself (performed by programs such as bison). Parsec can perform both lexical analysis and parsing.
Let's jump right in by writing some code for parsing a CSV file. CSV files are often used as a plain text representation of spreadsheets or databases. Each line is a record, and each field in the record is separated from the next by a comma. There are ways of dealing with fields that contain commas, but to start with, we won't worry about it.
This first example is much longer than it really needs to be. We will introduce more Parsec features in a little bit that will shrink the parser down to only four lines!
-- file: ch16/csv1.hs import Text.ParserCombinators.Parsec {- A CSV file contains 0 or more lines, each of which is terminated by the end-of-line character (eol). -} csvFile :: GenParser Char st [[String]] csvFile = do result <- many line eof return result -- Each line contains 1 or more cells, separated by a comma line :: GenParser Char st [String] line = do result <- cells eol -- end of line return result -- Build up a list of cells. Try to parse the first cell, then figure out -- what ends the cell. cells :: GenParser Char st [String] cells = do first <- cellContent next <- remainingCells return (first : next) -- The cell either ends with a comma, indicating that 1 or more cells follow, -- or it doesn't, indicating that we're at the end of the cells for this line remainingCells :: GenParser Char st [String] remainingCells = (char ',' >> cells) -- Found comma? More cells coming <|> (return []) -- No comma? Return [], no more cells -- Each cell contains 0 or more characters, which must not be a comma or -- EOL cellContent :: GenParser Char st String cellContent = many (noneOf ",\n") -- The end of line character is \n eol :: GenParser Char st Char eol = char '\n' parseCSV :: String -> Either ParseError [[String]] parseCSV input = parse csvFile "(unknown)" input
Let's take a look at the code for this example. We didn't use many shortcuts here, so remember that this will get shorter and simpler!
We've built it from the
top down,
so our first function is csvFile
. The type of this
function is GenParser Char st [[String]]
. This
means that the type of the input is a sequence of characters, which is
exactly what a Haskell string is, since String
is
the same as [Char]
. It also means that we will
return a value of type [[String]]
: a list of a list
of strings. The st
can be ignored for now.
Parsec programmers often omit type declarations, since we write so
many small functions. Haskell's type inference can figure it out.
We've listed the types for the first example here so you can get a
better idea of what's going on. You can always use
:t
in ghci to inspect types as well.
The csvFile
uses a do
block. As this
implies, Parsec is a monadic library: it defines its own special
parsing monad[34], GenParser.
We start by running many line
.
many
is a function that takes a function as an
argument. It tries to repeatedly
parse the input using the function passed to it. It gathers up the
results from all that repeated parsing and returns a list of them. So,
here, we are storing the results of parsing all lines in
result
. Then we look for the end-of-file indicator,
called eof
. Finally, we return the
result
. So, a CSV file is made up of many lines,
then the end of file. We can often read out Parsec functions in plain
English just like this.
Now we must answer the question: what is a line? We define the
line
function to do just that. Reading the
function, we can see that a line consists of cells followed by the end
of line character.
So what are cells? We defined them in the cells
function. The cells of a line start with the content of the first
cell, then continue with the content of the remaining cells, if any.
The result is simply the first cell and the remaining cells assembled
into a list.
Let's skip over remainingCells
for a minute and
look at cellContent
. A cell contains any number of
characters, but each character must not be a comma or end of line
character. The noneOf
function matches one item, so
long as it isn't in the list of items that we pass. So, saying
many (noneOf ",\n")
defines a cell the way we want
it.
Back in remainingCells
, we have the first example of
a choice in Parsec. The choice operator is <|>
. This operator
behaves like this: it will first try the parser on the left. If it
consumed no input[35], it will try the parser on the right.
So, in remainingCells
, our task is to come up with
all the cells after the first. Recall that
cellContent
uses noneOf ",\n"
.
So it will not consume the comma or end-of-line character from the
input. If we see a comma after parsing a cell, it means that at least
one more cell follows. Otherwise, we're done. So, our first choice in
remainingCells
is char ','
. This
parser simply matches the passed character in the input. If we found a
comma, we want this function to return the remaining cells on the line.
At this point, the "remaining cells" looks exactly like the start of
the line, so we call cells
recursively to parse
them. If we didn't find a comma, we return the empty list, signifying
no remaining cells on the line.
Finally, we must define what the end-of-line indicator is. We set it
to char '\n'
, which will suit our purposes fine for
now.
At the very end of the program, we define a function
parseCSV
that takes a String
and parses it as a
CSV file. This function is just a shortcut that calls Parsec's
parse
function, filling in a few parameters.
parse
returns Either ParseError
[[String]]
for the CSV file. If there was an error, the
return value will be Left
with the error; otherwise, it will be
Right
with the result.
Now that we understand this code, let's play with it a bit and see what it does.
ghci>
:l csv1.hs
[1 of 1] Compiling Main ( csv1.hs, interpreted ) Ok, modules loaded: Main.ghci>
parseCSV ""
Loading package parsec-2.1.0.0 ... linking ... done. Right []
That makes sense: parsing the empty string returns an empty list. Let's try parsing a single cell.
ghci>
parseCSV "hi"
Left "(unknown)" (line 1, column 3): unexpected end of input expecting "," or "\n"
Look at that. Recall how we defined that each line must end with the end-of-line character, and we didn't give it. Parsec's error message helpfully indicated the line number and column number of the problem, and even told us what it was expecting! Let's give it an end-of-line character and continue experimenting.
ghci>
parseCSV "hi\n"
Right [["hi"]]ghci>
parseCSV "line1\nline2\nline3\n"
Right [["line1"],["line2"],["line3"]]ghci>
parseCSV "cell1,cell2,cell3\n"
Right [["cell1","cell2","cell3"]]ghci>
parseCSV "l1c1,l1c2\nl2c1,l2c2\n"
Right [["l1c1","l1c2"],["l2c1","l2c2"]]ghci>
parseCSV "Hi,\n\n,Hello\n"
Right [["Hi",""],[""],["","Hello"]]
You can see that parseCSV
is doing exactly what we
wanted it to do. It's even handling empty cells and empty lines
properly.
We promised you earlier that we could simplify our CSV parser significantly by using a few Parsec helper functions. There are two that will dramatically simplify this code.
The first tool is the sepBy
function. This
function takes two functions as arguments: the first function
parses some sort of content, while the second function parses a
separator. sepBy
starts by trying to parse
content, then separators, and alternates back and forth until it
can't parse a separator. It returns a list of all the content
that it was able to parse.
The second tool is endBy
. It's similar to
sepBy
, but expects the very last item to be
followed by the separator. That is, it continues parsing until
it can't parse any more content.
So, we can use endBy
to parse lines, since every
line must end with the end-of-line character. We can use
sepBy
to parse cells, since the last cell will not
end with a comma. Take a look at how much simpler our parser is now:
-- file: ch16/csv2.hs import Text.ParserCombinators.Parsec csvFile = endBy line eol line = sepBy cell (char ',') cell = many (noneOf ",\n") eol = char '\n' parseCSV :: String -> Either ParseError [[String]] parseCSV input = parse csvFile "(unknown)" input
This program behaves exactly the same as the first one. We can verify this by using ghci to re-run our examples from the earlier example. We'll get the same result from every one. Yet the program is much shorter and more readable. It won't be long before you can translate Parsec code like this into a file format definition in plain English. As you read over this code, you can see that:
Different operating systems use different characters to mark the
end-of-line. Unix/Linux systems, plus Windows in text mode, use
simply "\n"
.
DOS and Windows systems use "\r\n"
, and Macs
traditionally used "\r"
. We could add in
support for "\n\r"
too, just in case anybody
uses that.
We could easily adapt our example to be able to handle all these types
of line endings in a single file. We would need to make two
modifications: adjust eol
to recognize the different
endings, and adjust the noneOf
pattern in
cell
to ignore \r
.
This must be done carefully. Recall that our earlier definition of
eol
was simply char '\n'
. There
is a parser called string
that we can use to match
the multi-character patterns. Let's start by thinking of how we would
add support for \n\r
.
Our first attempt might look like this:
-- file: ch16/csv3.hs -- This function is not correct! eol = string "\n" <|> string "\n\r"
This isn't quite right. Recall that the <|>
operator always
tries the left alternative first. Looking for the single character
\n
will match both types of line endings, so it will
look to the system that the following line begins with
\r
. Not what we want. Try it in ghci:
ghci>
:m Text.ParserCombinators.Parsec
ghci>
let eol = string "\n" <|> string "\n\r"
Loading package parsec-2.1.0.0 ... linking ... done.ghci>
parse eol "" "\n"
Right "\n"ghci>
parse eol "" "\n\r"
Right "\n"
It may seem like the parser worked for both endings, but actually looking at it this way, we can't tell. If it left something un-parsed, we don't know, because we're not trying to consume anything else from the input. So let's look for the end-of-file after our end of line:
ghci>
parse (eol >> eof) "" "\n\r"
Left (line 2, column 1): unexpected "\r" expecting end of inputghci>
parse (eol >> eof) "" "\n"
Right ()
As expected, we got an error from the \n\r
ending.
So the next temptation may
be to try it this way:
-- file: ch16/csv4.hs -- This function is not correct! eol = string "\n\r" <|> string "\n"
This also isn't right. Recall that <|>
only attempts the option
on the right if the option on the left consumed no input. But by the
time we are able to see if there is a \r
after the
\n
, we've already consumed the
\n
. This time, we fail on the other case in ghci:
ghci>
:m Text.ParserCombinators.Parsec
ghci>
let eol = string "\n\r" <|> string "\n"
Loading package parsec-2.1.0.0 ... linking ... done.ghci>
parse (eol >> eof) "" "\n\r"
Right ()ghci>
parse (eol >> eof) "" "\n"
Left (line 1, column 1): unexpected end of input expecting "\n\r"
We've stumbled upon the lookahead problem. It turns out that, when
writing parsers, it's often very convenient to be able to "look ahead"
at the data that's coming in. Parsec supports this, but before showing
you how to use it, let's see how you would have to write this to get
along without it. You'd have to manually expand all the options after
the \n
like this:
-- file: ch16/csv5.hs eol = do char '\n' char '\r' <|> return '\n'
This function first looks for \n
. If it is found,
then it will look for \r
, consuming it if possible.
Since the return type of char '\r'
is a
Char
, the alternative action is to simply return a
Char
without attempting to parse anything. Parsec
has a function option
that can also express this
idiom as option '\n' (char '\r')
. Let's test this
with ghci.
ghci>
:l csv5.hs
[1 of 1] Compiling Main ( csv5.hs, interpreted ) Ok, modules loaded: Main.ghci>
parse (eol >> eof) "" "\n\r"
Loading package parsec-2.1.0.0 ... linking ... done. Right ()ghci>
parse (eol >> eof) "" "\n"
Right ()
This time, we got the right result! But we could have done it easier with Parsec's lookahead support.
Parsec has a function called try
that is used to express
lookaheads. try
takes one function, a parser. It applies that
parser. If the parser doesn't succeed, try
behaves as if it hadn't
consumed any input at all. So, when you use try
on the left side
of <|>
, Parsec will try the option on the right even if the
left side failed after consuming some input. try
only has an
effect if it is on the left of a <|>
. Keep in mind, though,
that many functions use
<|>
internally. Here's a way to add
expanded end-of-line support to our CSV parser using try
:
-- file: ch16/csv6.hs import Text.ParserCombinators.Parsec csvFile = endBy line eol line = sepBy cell (char ',') cell = many (noneOf ",\n\r") eol = try (string "\n\r") <|> try (string "\r\n") <|> string "\n" <|> string "\r" parseCSV :: String -> Either ParseError [[String]] parseCSV input = parse csvFile "(unknown)" input
Here we put both of the two-character endings first, and run both
tests under try
. Both of them occur to the left of a <|>
, so
they will do the right thing. We could have put string
"\n"
within a try
, but it wouldn't have altered any
behavior since they look at only one character anyway. We can load
this up and test the eol
function in ghci.
ghci>
:l csv6.hs
[1 of 1] Compiling Main ( csv6.hs, interpreted ) Ok, modules loaded: Main.ghci>
parse (eol >> eof) "" "\n\r"
Loading package parsec-2.1.0.0 ... linking ... done. Right ()ghci>
parse (eol >> eof) "" "\n"
Right ()ghci>
parse (eol >> eof) "" "\r\n"
Right ()ghci>
parse (eol >> eof) "" "\r"
Right ()
All four endings were handled properly. You can also test the full CSV parser with some different endings like this:
ghci>
parseCSV "line1\r\nline2\nline3\n\rline4\rline5\n"
Right [["line1"],["line2"],["line3"],["line4"],["line5"]]
As you can see, this program even supports different line endings within a single file.
At the beginning of this chapter, you saw how Parsec could generate error messages that list the location where the error occurred as well as what was expected. As parsers get more complex, the list of what was expected can become cumbersome. Parsec provides a way for you to specify custom error messages in the event of parse failures.
Let's look at what happens when our current CSV parser encounters an error:
ghci>
parseCSV "line1"
Left "(unknown)" (line 1, column 6): unexpected end of input expecting ",", "\n\r", "\r\n", "\n" or "\r"
That's a pretty long, and technical, error message. We could make an
attempt to resolve this by using the monad fail
function like so:
-- file: ch16/csv7.hs eol = try (string "\n\r") <|> try (string "\r\n") <|> string "\n" <|> string "\r" <|> fail "Couldn't find EOL"
Under ghci, we can see the result:
ghci>
:l csv7.hs
[1 of 1] Compiling Main ( csv7.hs, interpreted ) Ok, modules loaded: Main.ghci>
parseCSV "line1"
Loading package parsec-2.1.0.0 ... linking ... done. Left "(unknown)" (line 1, column 6): unexpected end of input expecting ",", "\n\r", "\r\n", "\n" or "\r" Couldn't find EOL
We added to the error result, but didn't really help clean up the
output. Parsec has an <?>
operator that is designed for just these
situations. It is similar to <|>
in that it first tries the
parser on its left. Instead of trying another parser in the event of
a failure, it presents an error message. Here's how we'd use it:
-- file: ch16/csv8.hs eol = try (string "\n\r") <|> try (string "\r\n") <|> string "\n" <|> string "\r" <?> "end of line"
Now, when you generate an error, you'll get more helpful output:
ghci>
:l csv8.hs
[1 of 1] Compiling Main ( csv8.hs, interpreted ) Ok, modules loaded: Main.ghci>
parseCSV "line1"
Loading package parsec-2.1.0.0 ... linking ... done. Left "(unknown)" (line 1, column 6): unexpected end of input expecting "," or end of line
That's pretty helpful! The general rule of thumb is that you put a
human description of what you're looking for to the right of
<?>
.
Our earlier CSV examples have had an important flaw: they weren't able to handle cells that contain a comma. CSV generating programs typically put quotation marks around such data. But then you have another problem: what to do if a cell contains a quotation mark and a comma. In these cases, the embedded quotation marks are doubled up.
Here is a full CSV parser. You can use this from ghci, or if you compile it to a standalone program, it will parse a CSV file on standard input and convert it to a different format on output.
-- file: ch16/csv9.hs import Text.ParserCombinators.Parsec csvFile = endBy line eol line = sepBy cell (char ',') cell = quotedCell <|> many (noneOf ",\n\r") quotedCell = do char '"' content <- many quotedChar char '"' <?> "quote at end of cell" return content quotedChar = noneOf "\"" <|> try (string "\"\"" >> return '"') eol = try (string "\n\r") <|> try (string "\r\n") <|> string "\n" <|> string "\r" <?> "end of line" parseCSV :: String -> Either ParseError [[String]] parseCSV input = parse csvFile "(unknown)" input main = do c <- getContents case parse csvFile "(stdin)" c of Left e -> do putStrLn "Error parsing input:" print e Right r -> mapM_ print r
That's a full-featured CSV parser in just 21 lines of code, plus an
additional 10 lines for the parseCSV
and
main
utility functions.
Let's look at the changes in this program from the previous versions.
First, a cell may now be either a bare cell or a "quoted" cell. We
give the quotedCell
option first, because we want to
follow that path if the first character in a cell is the quote mark.
The quotedCell
begins and ends with a quote mark,
and contains zero or more characters. These characters can't be copied
directly, though, because they may contain embedded, doubled-up, quote
marks themselves. So we define a custom quotedChar
to process them.
When we're processing characters inside a quoted cell, we first say
noneOf "\""
. This will match and return any single
character as long as it's not the quote mark. Otherwise, if it is the
quote mark, we see if we have two of them in a row. If so, we return a
single quote mark to go on our result string.
Notice that try
in quotedChar
on the
right side of <|>
. Recall that I said that
try
only has an effect if it is on the left side of <|>
. This
try
does occur on the left side of a <|>
, but on the left of
one that must be within the implementation of many
.
This try
is important. Let's say we are parsing a
quoted cell, and are getting towards the end of it. There will be
another cell following. So we will expect to see a quote to end the
current cell, followed by a comma. When we hit
quotedChar
, we will fail the
noneOf
test and proceed to the test that looks for
two quotes in a row. We'll also fail that one because we'll have a
quote, then a comma. If we hadn't used try
, we'd crash with an
error at this point, saying that it was expecting the second quote,
because the first quote was already consumed. Since we use try
, this
is properly recognized as not a character that's part of the cell, so
it terminates the many quotedChar
expression as
expected. Lookahead has once again proven very useful, and the fact
that it is so easy to add makes it a remarkable tool in Parsec.
We can test this program with ghci over some quoted cells.
ghci>
:l csv9.hs
[1 of 1] Compiling Main ( csv9.hs, interpreted ) Ok, modules loaded: Main.ghci>
parseCSV "\"This, is, one, big, cell\"\n"
Loading package parsec-2.1.0.0 ... linking ... done. Right [["This, is, one, big, cell"]]ghci>
parseCSV "\"Cell without an end\n"
Left "(unknown)" (line 2, column 1): unexpected end of input expecting "\"\"" or quote at end of cell
Let's run it over a real CSV file. Here's one generated by a spreadsheet program:
"Product","Price" "O'Reilly Socks",10 "Shirt with ""Haskell"" text",20 "Shirt, ""O'Reilly"" version",20 "Haskell Caps",15
Now, we can run this under our test program and watch:
$ runhaskell csv9.hs < test.csv
["Product","Price"]
["O'Reilly Socks","10"]
["Shirt with \"Haskell\" text","20"]
["Shirt, \"O'Reilly\" version","20"]
["Haskell Caps","15"]
Parsec's GenParser monad is an
instance of the MonadPlus typeclass that we
introduced in
the section called “Looking for alternatives”. The value
mzero
represents a parse failure, while
mplus
combines two alternative parses into
one, using (<|>)
.
-- file: ch16/ParsecPlus.hs instance MonadPlus (GenParser tok st) where mzero = fail "mzero" mplus = (<|>)
When we introduced
application/x-www-form-urlencoded
text in the section called “Golfing practice: association lists”, we mentioned that we'd write a
parser for these strings. We can quickly and easily do this
using Parsec.
Each key-value pair is separated by the
&
character.
-- file: ch16/FormParse.hs p_query :: CharParser () [(String, Maybe String)] p_query = p_pair `sepBy` char '&'
Notice that in the type signature, we're using Maybe to represent a value: the HTTP specification is unclear about whether a key must have an associated value, and we'd like to be able to distinguish between “no value” and “empty value”.
-- file: ch16/FormParse.hs p_pair :: CharParser () (String, Maybe String) p_pair = do name <- many1 p_char value <- optionMaybe (char '=' >> many p_char) return (name, value)
The many1
function is similar to
many
: it applies its parser repeatedly,
returning a list of their results. While
many
will succeed and return an empty list
if its parser never succeeds, many1
will
fail if its parser never succeeds, and will otherwise return a
list of at least one element.
The optionMaybe
function modifies the
behaviour of a parser. If the parser fails,
optionMaybe
doesn't fail: it returns
Nothing
. Otherwise, it wraps the parser's
successful result with Just
. This gives us the
ability to distinguish between “no value” and
“empty value”, as we mentioned above.
Individual characters can be encoded in one of several ways.
-- file: ch16/FormParse.hs p_char :: CharParser () Char p_char = oneOf urlBaseChars <|> (char '+' >> return ' ') <|> p_hex urlBaseChars = ['a'..'z']++['A'..'Z']++['0'..'9']++"$-_.!*'()," p_hex :: CharParser () Char p_hex = do char '%' a <- hexDigit b <- hexDigit let ((d, _):_) = readHex [a,b] return . toEnum $ d
Some characters can be represented literally. Spaces are
treated specially, using a +
character. Other
characters must be encoded as a %
character
followed by two hexadecimal digits. The Numeric
module's readHex
parses a hex string as
a number.
ghci>
parseTest p_query "foo=bar&a%21=b+c"
Loading package parsec-2.1.0.0 ... linking ... done. [("foo",Just "bar"),("a!",Just "b c")]
As appealing and readable as this parser is, we can profit from stepping back and taking another look at some of our building blocks.
In many popular languages, people tend to put regular expressions to work for “casual” parsing. They're notoriously tricky for this purpose: hard to write, difficult to debug, nearly incomprehensible after a few months of neglect, and provide no error messages on failure.
If we can write compact Parsec parsers, we'll gain in readability, expressiveness, and error reporting. Our parsers won't be as short as regular expressions, but they'll be close enough to negate much of the temptation of regexps.
A few of our parsers above use do
notation and bind the
result of an intermediate parse to a variable, for later use.
One such function is p_pair
.
-- file: ch16/FormParse.hs p_pair :: CharParser () (String, Maybe String) p_pair = do name <- many1 p_char value <- optionMaybe (char '=' >> many p_char) return (name, value)
We can get rid of the need for explicit variables by using
the liftM2
combinator from
Control.Monad
.
-- file: ch16/FormParse.hs p_pair_app1 = liftM2 (,) (many1 p_char) (optionMaybe (char '=' >> many p_char))
This parser has exactly the same type and behaviour as
p_pair
, but it's one line long. Instead of
writing our parser in a “procedural” style, we've
simply switched to a programming style that emphasises that
we're applying parsers and
combining their results.
We can take this applicative style of writing a parser much further. In most cases, the extra compactness that we will gain will not come at any cost in readability, beyond the initial effort of coming to grips with the idea.
The standard Haskell libraries include a module named
Control.Applicative
, which we already encountered
in the section called “Infix use of fmap”. This module defines a
typeclass named Applicative, which represents an
applicative functor. This is a little bit
more structured than a functor, but a little bit less than a
monad. It also defines Alternative, which is
similar to MonadPlus
As usual, we think that the best way to introduce applicative functors is by putting them to work. In theory, every monad is an applicative functor, but not every applicative functor is a monad. Because applicative functors were added to the standard Haskell libraries long after monads, we often don't get an Applicative instance for free: frequently, we have to declare the monad we're using to be Applicative or Alternative.
To do this for Parsec
, we'll write a small
module that we can import instead of the normal
Parsec
module.
-- file: ch16/ApplicativeParsec.hs module ApplicativeParsec ( module Control.Applicative , module Text.ParserCombinators.Parsec ) where import Control.Applicative import Control.Monad (MonadPlus(..), ap) -- Hide a few names that are provided by Applicative. import Text.ParserCombinators.Parsec hiding (many, optional, (<|>)) -- The Applicative instance for every Monad looks like this. instance Applicative (GenParser s a) where pure = return (<*>) = ap -- The Alternative instance for every MonadPlus looks like this. instance Alternative (GenParser s a) where empty = mzero (<|>) = mplus
For convenience, our module's export section
exports all the names we imported from both the
Applicative
and Parsec
modules.
Because we hid Parsec's version of
(<|>)
when importing, the one that
will be exported is from Control.Applicative
, as we
would like.
We'll begin by rewriting our existing form parser from the
bottom up, beginning with p_hex
, which
parses a hexadecimal escape sequence. Here's the code in normal
do
-notation style.
-- file: ch16/FormApp.hs p_hex :: CharParser () Char p_hex = do char '%' a <- hexDigit b <- hexDigit let ((d, _):_) = readHex [a,b] return . toEnum $ d
And here's our applicative version.
-- file: ch16/FormApp.hs a_hex = hexify <$> (char '%' *> hexDigit) <*> hexDigit where hexify a b = toEnum . fst . head . readHex $ [a,b]
Although the individual parsers are mostly untouched, the
combinators that we're gluing them together with have changed.
The only familiar one is (<$>)
, which
we already know is a synonym for
fmap
.
From our definition of Applicative, we know
that (<*>)
is
ap
.
The remaining unfamiliar combinator is
(*>)
, which applies its first argument,
throws away its result, then applies the second and returns its
result. In other words, it's similar to
(>>)
.
Although the concepts here should mostly be familiar from
our earlier coverage of functors and monads, we'll walk through
this function to explain what's happening. First, to get a grip
on our types, we'll hoist hexify
to the top
level and give it a signature.
-- file: ch16/FormApp.hs hexify :: Char -> Char -> Char hexify a b = toEnum . fst . head . readHex $ [a,b]
Parsec's hexDigit
parser parses a
single hexadecimal digit.
ghci>
:type hexDigit
hexDigit :: CharParser st Char
Therefore, char '%' *> hexDigit
has the same
type, since (*>)
returns the result on
its right. (The CharParser type is nothing
more than a synonym for GenParser Char.)
ghci>
:type char '%' *> hexDigit
char '%' *> hexDigit :: GenParser Char st Char
The expression hexify <$> (char '%' *>
hexDigit)
is a parser that matches a “%”
character followed by hex digit, and whose result is a
function.
ghci>
:type hexify <$> (char '%' *> hexDigit)
hexify <$> (char '%' *> hexDigit) :: GenParser Char st (Char -> Char)
Finally, (<*>)
applies the parser
on its left, then the parser on its right, and applies the
function that's the result of the left parse to the value that's
the result of the right.
If you've been able to follow this, then you understand the
(<*>)
and ap
combinators: (<*>)
is plain old
($)
lifted to applicative functors, and
ap
the same thing lifted to monads.
ghci>
:type ($)
($) :: (a -> b) -> a -> bghci>
:type (<*>)
(<*>) :: (Applicative f) => f (a -> b) -> f a -> f bghci>
:type ap
ap :: (Monad m) => m (a -> b) -> m a -> m b
Next, we'll consider the p_char
parser.
-- file: ch16/FormApp.hs p_char :: CharParser () Char p_char = oneOf urlBaseChars <|> (char '+' >> return ' ') <|> p_hex urlBaseChars = ['a'..'z']++['A'..'Z']++['0'..'9']++"$-_.!*'(),"
This remains almost the same in an applicative style, save for one piece of convenient notation.
-- file: ch16/FormApp.hs a_char = oneOf urlBaseChars <|> (' ' <$ char '+') <|> a_hex
Here, the (<$)
combinator uses the
value on the left if the parser on the right succeeds.
Finally, the equivalent of p_pair_app1
is
almost identical.
-- file: ch16/FormParse.hs p_pair_app1 = liftM2 (,) (many1 p_char) (optionMaybe (char '=' >> many p_char))
All we've changed is the combinator we use for lifting: the
liftA
functions act in the same ways as
their liftM
cousins.
-- file: ch16/FormApp.hs a_pair :: CharParser () (String, Maybe String) a_pair = liftA2 (,) (many1 a_char) (optionMaybe (char '=' *> many a_char))
To give ourselves a better feel for parsing with applicative functors, and to explore a few more corners of Parsec, we'll write a JSON parser that follows the definition in RFC 4627.
At the top level, a JSON value must be either an object or an array.
-- file: ch16/JSONParsec.hs p_text :: CharParser () JValue p_text = spaces *> text <?> "JSON text" where text = JObject <$> p_object <|> JArray <$> p_array
These are structurally similar, with an opening character, followed by one or more items separated by commas, followed by a closing character. We capture this similarity by writing a small helper function.
-- file: ch16/JSONParsec.hs p_series :: Char -> CharParser () a -> Char -> CharParser () [a] p_series left parser right = between (char left <* spaces) (char right) $ (parser <* spaces) `sepBy` (char ',' <* spaces)
Here, we finally have a use for the
(<*)
combinator that we introduced
earlier. We use it to skip over any white space that might
follow certain tokens. With this p_series
function, parsing an array is simple.
-- file: ch16/JSONParsec.hs p_array :: CharParser () (JAry JValue) p_array = JAry <$> p_series '[' p_value ']'
Dealing with a JSON object is hardly more complicated, requiring just a little additional effort to produce a name/value pair for each of the object's fields.
-- file: ch16/JSONParsec.hs p_object :: CharParser () (JObj JValue) p_object = JObj <$> p_series '{' p_field '}' where p_field = (,) <$> (p_string <* char ':' <* spaces) <*> p_value
Parsing an individual value is a matter of calling an existing parser, then wrapping its result with the appropriate JValue constructor.
-- file: ch16/JSONParsec.hs p_value :: CharParser () JValue p_value = value <* spaces where value = JString <$> p_string <|> JNumber <$> p_number <|> JObject <$> p_object <|> JArray <$> p_array <|> JBool <$> p_bool <|> JNull <$ string "null" <?> "JSON value" p_bool :: CharParser () Bool p_bool = True <$ string "true" <|> False <$ string "false"
The choice
combinator allows us to
represent this kind of ladder-of-alternatives as a list. It
returns the result of the first parser to succeed.
-- file: ch16/JSONParsec.hs p_value_choice = value <* spaces where value = choice [ JString <$> p_string , JNumber <$> p_number , JObject <$> p_object , JArray <$> p_array , JBool <$> p_bool , JNull <$ string "null" ] <?> "JSON value"
This leads us to the two most interesting parsers, for numbers and strings. We'll deal with numbers first, since they're simpler.
-- file: ch16/JSONParsec.hs p_number :: CharParser () Double p_number = do s <- getInput case readSigned readFloat s of [(n, s')] -> n <$ setInput s' _ -> empty
Our trick here is to take advantage of Haskell's standard
number parsing library functions, which are defined in the
Numeric
module. The readFloat
function reads an unsigned floating point number, and
readSigned
takes a parser for an unsigned
number and turns it into a parser for possibly signed
numbers.
Since these functions know nothing about Parsec, we have to
work with them specially. Parsec's
getInput
function gives us direct access to
Parsec's unconsumed input stream. If readSigned
readFloat
succeeds, it returns both the parsed number
and the rest of the unparsed input. We then use
setInput
to give this back to Parsec as its
new unconsumed input stream.
Parsing a string isn't difficult, merely detailed.
-- file: ch16/JSONParsec.hs p_string :: CharParser () String p_string = between (char '\"') (char '\"') (many jchar) where jchar = char '\\' *> (p_escape <|> p_unicode) <|> satisfy (`notElem` "\"\\")
We can parse and decode an escape sequence with
the help of the choice
combinator that we
just met.
-- file: ch16/JSONParsec.hs p_escape = choice (zipWith decode "bnfrt\\\"/" "\b\n\f\r\t\\\"/") where decode c r = r <$ char c
Finally, JSON lets us encode a Unicode character in
a string as “\u
” followed by four
hexadecimal digits.
-- file: ch16/JSONParsec.hs p_unicode :: CharParser () Char p_unicode = char 'u' *> (decode <$> count 4 hexDigit) where decode x = toEnum code where ((code,_):_) = readHex x
The only piece of functionality that applicative functors are missing, compared to monads, is the ability to bind a value to a variable, which we need here in order to be able to validate the value we're trying to decode.
This is the one place in our parser that we've needed to use a monadic function. This pattern extends to more complicated parsers, too: only infrequently do we need the extra bit of power that monads offer.
As we write this book, applicative functors are still quite new to Haskell, and people are only beginning to explore the possible uses for them beyond the realm of parsing.
As another example of applicative parsing, we will develop a basic parser for HTTP requests.
-- file: ch16/HttpRequestParser.hs module HttpRequestParser ( HttpRequest(..) , Method(..) , p_request , p_query ) where import ApplicativeParsec import Numeric (readHex) import Control.Monad (liftM4) import System.IO (Handle)
An HTTP request consists of a method, an identifier, a
series of headers, and an optional body. For simplicity, we'll
focus on just two of the six method types specified by the HTTP
1.1 standard. A POST
method has a body; a
GET
has none.
-- file: ch16/HttpRequestParser.hs data Method = Get | Post deriving (Eq, Ord, Show) data HttpRequest = HttpRequest { reqMethod :: Method , reqURL :: String , reqHeaders :: [(String, String)] , reqBody :: Maybe String } deriving (Eq, Show)
Because we're writing in an applicative style, our parser can be both brief and readable. Readable, that is, if you're becoming used to the applicative parsing notation.
-- file: ch16/HttpRequestParser.hs p_request :: CharParser () HttpRequest p_request = q "GET" Get (pure Nothing) <|> q "POST" Post (Just <$> many anyChar) where q name ctor body = liftM4 HttpRequest req url p_headers body where req = ctor <$ string name <* char ' ' url = optional (char '/') *> manyTill notEOL (try $ string " HTTP/1." <* oneOf "01") <* crlf
Briefly, the q
helper
function accepts a method name, the type constructor to apply to
it, and a parser for a request's optional body. The
url
helper does not attempt to validate a
URL, because the HTTP specification does not specify what
characters an URL contain. The function just consumes input
until either the line ends or it reaches a HTTP version
identifier.
The try
combinator has to hold onto input in
case it needs to restore it, so that an alternative parser can
be used. This practice is referred to as
backtracking. Because
try
must save input, it is expensive to
use. Sprinkling a parser with unnecessary uses of
try
is a very effective way to slow it
down, sometimes to the point of unacceptable
performance.
The standard way to avoid the need for backtracking is to tidy up a parser so that we can decide whether it will succeed or fail using only a single token of input. In this case, the two parsers consume the same initial tokens, so we turn them into a single parser.
ghci>
let parser = (++) <$> string "HT" <*> (string "TP" <|> string "ML")
ghci>
parseTest parser "HTTP"
"HTTP"ghci>
parseTest parser "HTML"
"HTML"
Even better, Parsec gives us an improved error message if we feed it non-matching input.
ghci>
parseTest parser "HTXY"
parse error at (line 1, column 3): unexpected "X" expecting "TP" or "ML"
Following the first line of a HTTP request is a series of zero or more headers. A header begins with a field name, followed by a colon, followed by the content. If the lines that follow begin with spaces, they are treated as continuations of the current content.
-- file: ch16/HttpRequestParser.hs p_headers :: CharParser st [(String, String)] p_headers = header `manyTill` crlf where header = liftA2 (,) fieldName (char ':' *> spaces *> contents) contents = liftA2 (++) (many1 notEOL <* crlf) (continuation <|> pure []) continuation = liftA2 (:) (' ' <$ many1 (oneOf " \t")) contents fieldName = (:) <$> letter <*> many fieldChar fieldChar = letter <|> digit <|> oneOf "-_" crlf :: CharParser st () crlf = (() <$ string "\r\n") <|> (() <$ newline) notEOL :: CharParser st Char notEOL = noneOf "\r\n"
Our HTTP request parser is too simple to be useful in real deployments. It is missing vital functionality, and is not resistant to even the most basic denial of service attacks.
1. | Make the parser honour the
|
2. | A popular denial of service attack against naive web servers is simply to send unreasonably long headers. A single header might contain tens or hundreds of megabytes of garbage text, causing a server to run out of memory. Restructure the header parser so that it will fail if any line is longer than 4096 characters. It must fail immediately when this occurs; it cannot wait until the end of a line eventually shows up. |
3. | Add the ability to honour the
|
4. | Another popular attack is to open a
connection and either leave it idle or send data
extremely slowly. Write a wrapper in the IO monad
that will invoke the parser. Use the
|
[34] For more on monads, refer to Chapter 14, Monads
[35] For information on dealing with choices that may consume some input before failing, see the section called “Lookahead”.