lördag 30 maj 2009

Parametrising haskell-src-exts on extensions

I've started working on my GSoC project, and the first task to tackle is to make haskell-src-exts aware of what extensions it should look for when parsing. Most importantly this is due to the fact that some extensions steal syntax, so a valid Haskell 98 program may no longer be valid when some extension is turned on.

The typical example of this is Template Haskell, which steals the $ident syntax to mean a splice, rather than the application of the operator $ to whatever follows. Since the aim of haskell-src-exts is to be able to parse all valid Haskell programs, that should of course include those that don't rely on extensions as well.

As it turns out, there already exists a list of all Haskell extensions to date, both syntactic and semantic. This list resides with Cabal, to allow installation to dictate the proper flags to compilers. Since that list is in some sense the de facto standard list of extensions, I want to use the same, even if some of the items on the list are purely semantic and thus of no consequence to haskell-src-exts. There are two abers with that plan though.
  1. The list does not contain entries for the extensions enabled by HSP and HaRP, which are crucial components in haskell-src-exts (indeed those are the reason I created this library in the first place).

  2. The list is very poorly documented, and I've had to dig deep in user manuals to figure out what some of the extensions actually entail.

To solve the first, I've submitted a patch to Cabal to include two new entries: XmlSyntax and RegularPatterns. I find these names suitably telling for what they actually introduce to the language, as opposed to calling them something like HSP and HARP. Until that patch is actually applied and released, I will keep a local copy of the list.

To solve the second, meaning to be sure I actually understand these listed extensions right, I intend to post a list with my understanding and hope that a kind soul will point out any potential errors that may have arisen. So here goes.

I've categorized this list according to the way I would need to handle the cases in the lexer or parser respectively. The cases that can be handled purely in the lexer, simply because all productions relating to some extension rely on new lexemes, are the easiest to handle. If all cases could be handled in the lexer I would be very happy, but alas. There are cases that rely only on the usual lexemes but put them together in new ways, which means they must be handled in the parser. A few extensions introduce both kinds, and then there are a whole bunch that don't require any action at all.

Purely lexical extensions:
==========================

UnboxedTuples: Enables (# and #) lexemes.

MagicHash: Enables literals and idents ending with # and ##.

RecursiveDo: Enables the mdo keyword.

ImplicitParams: Introduces the ?ident and %ident lexemes, which should govern all
  the parsing cases.

KindSignatures: Introduces * as a special symbol. Note that we need to ensure that it is handled as a normal symbol if this extension is *not* on.

TemplateHaskell: Oh boy. Enables the following lexemes: $(, $ident, [|, |], [d|, [e|, [p|, as well as 'ident and ''ident. But it should be enough to handle on the lexer side, since all productions use one or more of the above.

ForeignFunctionInterface: Enables foreign, safe, unsafe, threadsafe, stdcall, ccall and export as keywords.

XmlSyntax: Enables the <ident, </ident, >, />, <%, %>, <[ and ]> lexemes, as well as the contextually introduced PCDATA lexeme.

The following extensions are also handled completely on the lexer side, but are not yet supported by haskell-src-exts:

Arrows: Enables the following lexemes: proc, -<, >-, -<<, >>-.

NewQualifiedOperators: Uses Prelude.(+) instead of Prelude.+ in prefix mode, and `Prelude.(+)` instead of `Prelude.+` in infix mode. Note that this means we need to parametrize prettyprinting on extensions as well.

TransformListComp: Should be called GeneralizedListComp! Adds the group, by and using keywords.

QuasiQuotes: Enables the [$ident| lexeme. This one will be tricky since we need to introduce a new lexer context that tells us to lex anything between [$ident| and |] as just a string lexeme.

Generics: Introduces {| and |} as lexemes.

UnicodeSyntax: Enables the ::, ->, <-, =>, .. and forall (when enabled) lexemes in Unicode versions. There is a proposal to add unicode versions of * (with KindSignatures) and -<, >-, -<<, >>- (with Arrows). I see no reason why haskell-src-exts couldn't support those out of the box. Supporting these lexemes when parsing is trivial, but it shows once more that we need to parametrize pretty-printing as well. We couldn't hope to support both unicode and non-unicode versions in the same document when printing (parsing is no problem, but I have *no* intention of embellishing the AST to remember what kind of symbol the parser saw for each arrow etc), but I doubt anyone using Unicode versions would mind if we "accidentally" transform their non-unicode variants to unicode as well.
  
ExtensibleRecords: Enables #ident as a lexeme, and the Rec keyword. Also enables trex-record syntax (a = True, b = 1), record update syntax (c = "hello" | r), and corresponding types Rec (a::Bool,b::Int) and contexts r\a => Rec (a :: Int | r).

Purely parsing extensions:
==========================

PatternGuards: Enables all standard statements in guard position, default would be just a single qualifier.

ViewPatterns: Enables patterns of the form (exp -> pat).

ParallelListComp: Enables several | clauses in list comprehensions.

NamedFieldPuns, RecordPuns: Enables C {a} syntax in both expressions and patterns. RecordPuns is deprecated.

RecordWildcards: Enables C {..} syntax in both expressions and patterns.

PackageImports: Allows an optional package name as a String between import and module name.

EmptyDataDecls: Allows data declarations without a body.

TypeOperators: Allows operators in type, data and class declarations. Also allows infix `Either` etc.

GADTs: Allows the where-style declaration of data types. These are already handled in a special way in the parser, so should be easy.

StandaloneDeriving: Parsing only, enables deriving decls.

MultiParamTypeClasses: Parsing only, allow more than one parameter or argument to class and instance declarations respectively.

FunctionalDependencies: Parsing only, add the | a -> b syntax to class heads.

FlexibleInstances: Allow arbitary class assertions in context and head for instance declarations.

FlexibleContexts: Allows contexts of forms other than (C a) or (C (a1 .. an)).

ScopedTypeVariables, PatternSignatures: Allows type signatures to be given to patterns. PatternSignatures is deprecated.
  
BangPatterns: Allows the use of the ! symbol in patterns, marking them as strict.

RestrictedTypeSynonyms: Hugs extension that should probably be deprecated. Adds an in block to type declarations, to give the type a scope.


Extensions that need both lexing and parsing:
===================================

LiberalTypeSynonyms: Enables the forall keyword. Also allows a number of liberal uses of types:

  • forall quantified type synonyms: 
    type Foo = forall a . Show a => a
  • Allows unboxed tuples in type synonyms.
  • Apply type synonyms to forall types: 
    f :: Foo (forall a . a -> a)
    Note that syntactically this is the same as Rank2Types.
  • Applying type synonyms to partially applied type synonyms (not syntactic so we don't need to care).


ExistentialQuantification: Enables the forall keyword. Also allows the use of forall binders to precede data constructor declarations.

TypeFamilies: Enables the family keyword, which alone governs a number of parser rules. But associated types and data types must still be handled in the parser, as well as data and type instances. Also type equality constraints must be handled in the parser.

PolymorphicComponents: Enables the forall keyword. Also allows forall quantified arguments to data constructors.

Rank2Types: Enables the forall keyword. Allows any function to have a forall-quantified argument, including data constructors (hence implies PolymorphicComponents).

RankNTypes: Enables the forall keyword. Allows the use of forall-quantified arguments anywhere that would allow a type, except as arguments to type constructors (hence implies Rank2Types).

ImpredicativeTypes: Enables the forall keyword. Allows the use of forall-quantified arguments as arguments to type constructors.

RegularPatterns: Enables the (|, |) and @: lexemes. Also enables the parsing of regular operators like pat+, pat? etc, hence the need to treat in both.

HereDocuments: I'm honestly not sure I will ever support these, since they need a complete bypass of the layout rule, as well as a weird mix of string and expression lexemes.


Completely separate:
==========================

CPP: We already depend on cpphs, we should probably run that on the document before parsing if this extension is enabled. There is no way we're going to be able to print the document with the CPP stuff included though, so refactoring won't work for CPP-ed documents. At least I don't see any immediate solution... (Hmm, one way could be to follow every possible CPP path through the document, and parse them all. Ouch.)


No action needed:
==========================

NoImplicitPrelude: Purely semantic, determines what symbols are in scope automatically.

PostfixOperators: Purely semantic, uses standard left-sections but gives them a slightly new meaning.

DisambiguateRecordFields: Only about disambiguation of names, which goes beyond parsing.

DeriveDataTypeable: Adds more cases to the normal deriving clause, which the parser makes no attempt to check anyway, that's semantics and not syntax.
 
GeneralizedNewtypeDeriving: Same as the former case, newtype deriving is already allowed by H98, all this does is allow more cases. Semantics, not syntax.

ConstrainedClassMethods: The syntax is already allowed, it's just that in H98 the type class variable and the constrained variable may not be the same variable. We don't check that anyway.

OverlappingInstances, UndecidableInstances, IncoherentInstances, TypeSynonymInstances: These only declare what instances are valid, possibly together, but we don't do any typing validation, and we don't check to see what's a type constructor and what's a data constructor.

OverloadedStrings: This only changes the type of string literals, not their syntax in any way.

RelaxedPolyRec: This only governs what recursive bindings are allowed, we don't check names so won't even see a normal recursion when it happens.

NoMonomorphismRestriction, NoMonoPatBinds: Will only affect the type interpretations of bindings - semantics, not syntax. It should really be MonoPatBinds though, not the negated version, regardless of what GHC uses as default.

ExtendedDefaultRules: Guides the type inference engine to defaults - obviously nothing syntactic.

UnliftedFFITypes: Completely undocumented, but I'm guessing it has to do with allowing things like Int# as corresponding unlifted C types when writing FFI. Not sure if it implies MagicHash or not, presumably not (and I don't think it should).


And that's the end of it. Any and all comments on my categorizations and interpretations of extensions from a syntactic point of view are more than welcome.

So far I've embellished the parser with an extension environment, and implemented the cases listed above as (being implemented and) requiring lexer attention. So far nothing tricky, but the parser will add some interesting cases, in particular the code for validating class and instance heads and contexts for the many different cases that are allowed.

Oh, and the code 

foo :: forall a . a

can mean very different things depending on what extensions are on. Is it the type operator . applied to forall a and a respectively, with forall being a type variable of kind * -> * ...?

Inga kommentarer:

Skicka en kommentar