Appendix B. Characters, strings, and escaping rules

Table of Contents

Writing character and string literals
International language support
Escaping text
Single-character escape codes
Multiline string literals
ASCII control codes
Control-with-character escapes
Numeric escapes
The zero-width escape sequence

This appendix covers the escaping rules used to represent special, non-printing, and non-ASCII characters in Haskell character and string literals. Haskell's escaping rules follow the pattern established by the C programming language, but expand considerably upon it.

Writing character and string literals

A single character is surrounded by ASCII single quotes, ', and has type Char.

ghci> 'c'
'c'
ghci> :type 'c'
'c' :: Char

A string literal is surrounded by double quotes, ", and has type [Char] (more often written as String).

ghci> "a string literal"
"a string literal"
ghci> :type "a string literal"
"a string literal" :: [Char]

The double-quoted form of a string literal is just syntactic sugar for list notation.

ghci> ['a', ' ', 's', 't', 'r', 'i', 'n', 'g'] == "a string"
True
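
Because a string is an ordinary list of Char values, list functions and list constructors work on it directly. The following is a small sketch of that idea; the names shout, firstWord, and consed are invented for this illustration.

```haskell
import Data.Char (toUpper)

-- A String is a [Char], so map applies a Char function to each element.
shout :: String -> String
shout = map toUpper

-- List functions such as takeWhile work on strings unchanged.
firstWord :: String -> String
firstWord = takeWhile (/= ' ')

-- The cons operator (:) prepends one Char at a time to a string.
consed :: String
consed = 'a' : ' ' : "string"
```

Loaded into ghci, shout "a string" evaluates to "A STRING", and consed is equal to "a string".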

International language support

Haskell uses Unicode internally for its Char data type. Since String is just an alias for [Char], a list of Chars, Unicode is also used to represent strings.

Different Haskell implementations place limitations on the character sets they can accept in source files. GHC allows source files to be written in the UTF-8 encoding of Unicode, so in a source file, you can use UTF-8 literals inside a character or string constant. Do be aware that if you use UTF-8, other Haskell implementations may not be able to parse your source files.
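
As a sketch of what this means in practice, the module below assumes GHC and a source file saved as UTF-8. It writes the same two-character string once with a literal Greek lambda and once with a numeric escape (described later in this appendix), which keeps the source file pure ASCII. The binding names are made up for this example.

```haskell
import Data.Char (ord)

-- Written directly as UTF-8 in the source file (assumes GHC).
lambdaLiteral :: String
lambdaLiteral = "λx"

-- The same string using a numeric escape; the source stays plain ASCII.
lambdaEscaped :: String
lambdaEscaped = "\955x"

-- Each list element is one Unicode code point, not one byte.
codePoints :: [Int]
codePoints = map ord lambdaLiteral
```

The two strings compare equal, and both have length 2 even though the lambda occupies two bytes in the UTF-8 source file. The escaped form is the more portable choice if other implementations must read the file.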

When you run the ghci interpreter interactively, it may not be able to deal with international characters in character or string literals that you enter at the keyboard.

Note

Although Haskell represents characters and strings internally using Unicode, there is no standardised way to do I/O on files that contain Unicode data. In versions of GHC before 6.12, the standard text I/O functions treat text as a sequence of 8-bit characters, and do not perform any character set conversion. Later versions of GHC encode and decode text using the encoding of the current locale by default.

There exist third-party libraries that will convert between the many different encodings used in files and Haskell's internal Unicode representation.
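
As one hedged illustration that needs no third-party library: in GHC 6.12 and later, the System.IO module lets a program choose a handle's encoding explicitly with hSetEncoding. The sketch below assumes such a GHC. The helper names writeUtf8 and readUtf8 are hypothetical, and other Haskell implementations may not provide these functions.

```haskell
import System.IO

-- Hypothetical helper: write a String to a file, encoded as UTF-8.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path text = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h text

-- Hypothetical helper: read a UTF-8 encoded file back into a String.
readUtf8 :: FilePath -> IO String
readUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  contents <- hGetContents h
  -- Force the lazily read contents before withFile closes the handle.
  length contents `seq` return contents
```

Setting the encoding on each handle makes the result independent of the locale in which the program happens to run.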

Escaping text

Some characters must be escaped to be represented inside a character or string literal. For example, a double quote character inside a string literal must be escaped, or else it will be treated as the end of the string.

Single-character escape codes

Haskell uses essentially the same single-character escapes as the C language and many other popular languages.

Table B.1. Single-character escape codes

Escape  Unicode  Character
\0      U+0000   null character
\a      U+0007   alert
\b      U+0008   backspace
\f      U+000C   form feed
\n      U+000A   newline (line feed)
\r      U+000D   carriage return
\t      U+0009   horizontal tab
\v      U+000B   vertical tab
\"      U+0022   double quote
\&      n/a      empty string
\'      U+0027   single quote
\\      U+005C   backslash
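
A few of these escapes in use, as a minimal sketch (the binding names are invented for this example):

```haskell
-- An escaped double quote does not end the string.
quoted :: String
quoted = "She said \"hello\""

-- A single quote must be escaped inside a character literal,
-- and a backslash must be escaped everywhere.
singleQuote, backslash :: Char
singleQuote = '\''
backslash   = '\\'

-- Each escape stands for exactly one character.
tabAndNewline :: String
tabAndNewline = "\t\n"

-- lines splits its argument at the newline character written as \n.
twoLines :: [String]
twoLines = lines "first\nsecond"
```

Here tabAndNewline has length 2, and twoLines is the list ["first","second"].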

Multiline string literals

To write a string literal that spans multiple lines, terminate one line with a backslash, and resume the string with another backslash. An arbitrary amount of whitespace (of any kind) can fill the gap between the two backslashes.

"this is a \
	\long string,\
    \ spanning multiple lines"
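
The gap between the two backslashes contributes no characters to the string, not even the line break. The sketch below (with invented binding names) compares a gapped literal to its one-line equivalent, and shows that a real newline still has to be written as \n.

```haskell
-- A string gap: backslash, whitespace (including the line break), backslash.
gapped :: String
gapped = "this is a \
         \long string, \
         \spanning multiple lines"

-- The same text on one line; the gaps add nothing.
oneLine :: String
oneLine = "this is a long string, spanning multiple lines"

-- To embed an actual line break, write \n before the gap.
withNewline :: String
withNewline = "line one\n\
              \line two"
```

In ghci, gapped == oneLine is True, and lines withNewline gives ["line one","line two"].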

ASCII control codes

Haskell recognises the escaped use of the standard two- and three-letter abbreviations of ASCII control codes.

Table B.2. ASCII control code abbreviations

Escape  Unicode  Meaning
\NUL    U+0000   null character
\SOH    U+0001   start of heading
\STX    U+0002   start of text
\ETX    U+0003   end of text
\EOT    U+0004   end of transmission
\ENQ    U+0005   enquiry
\ACK    U+0006   acknowledge
\BEL    U+0007   bell
\BS     U+0008   backspace
\HT     U+0009   horizontal tab
\LF     U+000A   line feed (newline)
\VT     U+000B   vertical tab
\FF     U+000C   form feed
\CR     U+000D   carriage return
\SO     U+000E   shift out
\SI     U+000F   shift in
\DLE    U+0010   data link escape
\DC1    U+0011   device control 1
\DC2    U+0012   device control 2
\DC3    U+0013   device control 3
\DC4    U+0014   device control 4
\NAK    U+0015   negative acknowledge
\SYN    U+0016   synchronous idle
\ETB    U+0017   end of transmission block
\CAN    U+0018   cancel
\EM     U+0019   end of medium
\SUB    U+001A   substitute
\ESC    U+001B   escape
\FS     U+001C   file separator
\GS     U+001D   group separator
\RS     U+001E   record separator
\US     U+001F   unit separator
\SP     U+0020   space
\DEL    U+007F   delete
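
The named escapes denote the same characters as their single-character counterparts. One subtlety is that \SOH is read as the single character "start of heading", never as \SO followed by the letter H; to get the latter, separate the two with the \& escape from Table B.1. A short sketch, with binding names invented for this example:

```haskell
import Data.Char (ord)

-- The escape character, U+001B.
escapeChar :: Char
escapeChar = '\ESC'

-- Named escapes and single-character escapes are interchangeable.
sameCharacters :: Bool
sameCharacters = "\LF\HT\BEL" == "\n\t\a"

-- \SOH is one character: start of heading.
soh :: String
soh = "\SOH"

-- Shift out followed by the letter H needs \& to keep them apart.
soThenH :: String
soThenH = "\SO\&H"
```

Here soh has length 1, while map ord soThenH gives [14,72].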

Control-with-character escapes

Haskell recognises an alternate notation for control characters, which represents the archaic effect of pressing the control key on a keyboard and chording it with another key. These sequences begin with the characters \^, followed by a symbol or uppercase letter.

Table B.3. Control-with-character escapes

Escape            Unicode                 Meaning
\^@               U+0000                  null character
\^A through \^Z   U+0001 through U+001A   control codes
\^[               U+001B                  escape
\^\               U+001C                  file separator
\^]               U+001D                  group separator
\^^               U+001E                  record separator
\^_               U+001F                  unit separator
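
The table follows a simple pattern: the code of the control character is the code of the key minus 64, the code of @. The sketch below checks a few entries against the abbreviations from Table B.2. The helper ctrl is hypothetical, written only to illustrate the arithmetic, and is meaningful only for the keys @ through _.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper: the control character produced by chording
-- the control key with the given key (valid for '@' through '_').
ctrl :: Char -> Char
ctrl key = chr (ord key - ord '@')

-- Each pair holds two spellings of the same character.
controlPairs :: [(Char, Char)]
controlPairs =
  [ ('\^@', '\NUL')
  , ('\^A', '\SOH')
  , ('\^Z', '\SUB')
  , ('\^[', '\ESC')
  , ('\^_', '\US')
  ]
```

In ghci, all (uncurry (==)) controlPairs is True, and ctrl '[' == '\ESC' also holds.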

Numeric escapes

Haskell allows Unicode characters to be written using numeric escapes. A decimal escape is a backslash followed directly by decimal digits, e.g. \1234. A hexadecimal escape begins with \x, e.g. \xbeef. An octal escape begins with \o, e.g. \o1234.

The maximum value of a numeric escape is \1114111, which may also be written \x10ffff or \o4177777.
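
To make the three notations concrete, the sketch below writes the same character in each base and checks the maximum against maxBound. The binding names are invented for this example.

```haskell
import Data.Char (ord)

-- U+03BB (Greek small letter lambda) in decimal, hexadecimal, and octal.
lambdaForms :: [Char]
lambdaForms = ['\955', '\x3bb', '\o1673']

-- Hexadecimal digits may be written in upper or lower case.
beef :: Int
beef = ord '\xBEEF'

-- The largest Char, written in all three bases.
largest :: [Char]
largest = ['\1114111', '\x10ffff', '\o4177777']
```

All three elements of lambdaForms are the same character, beef is 48879, and every element of largest equals maxBound :: Char.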

The zero-width escape sequence

String literals can contain a zero-width escape sequence, written \&. This is not a real character, as it represents the empty string.

ghci> "\&"
""
ghci> "foo\&bar"
"foobar"

The purpose of this escape sequence is to make it possible to write a numeric escape followed immediately by a regular ASCII digit.

ghci> "\130\&11"
"\130\&11"

Because the zero-width escape sequence represents an empty string, it is not legal in a character literal.
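
The difference that \& makes can be seen by comparing code points. In this sketch (binding names invented for the example), the first string contains character 130 followed by two digit characters, while the second runs all five digits together into a single escape.

```haskell
import Data.Char (ord)

-- Character 130, then the digits '1' and '1'.
separated :: [Int]
separated = map ord "\130\&11"

-- Without \& the digits merge into the single escape \13011.
merged :: [Int]
merged = map ord "\13011"

-- show re-inserts \& so that its output reads back as the same string.
shown :: String
shown = show "\130\&11"
```

Here separated is [130,49,49] and merged is [13011]. The value of shown is the ten characters "\130\&11", including the surrounding double quotes, which matches what ghci printed above.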

Copyright 2007, 2008 Bryan O'Sullivan, Don Stewart, and John Goerzen. This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 License. Icons by Paul Davey aka Mattahan.