tisdagen den 11:e augusti 2009

Quick update

Just a quick note since I realise I haven't been too good at writing updates of late...

I'm currently working on a complete revamp of the AST, lexer and parser to allow for exact source info to be kept in the tree, which in turn will allow exact printing of the code as it was read. In the process I've included support for annotated AST trees (and the source info makes use of such labels). The problem with this is that it's a huge amount of work to get it all to a state where I can start testing it. I'm currently bashing away on the parser, which is by far the most difficult component to update, being happy-based and all.

I should add that the "old" simpler AST will still be there, the new complex AST with annotations will live in a different module, so this should all be backwards compatible. There will only be one parser though, which parses to the complex version, and then a simple translation function from annotated to simple. But the stuff exported from the parser will also still be backwards compatible, so stay your pitchforks and torches! :)

söndagen den 26:e juli 2009

Starting on the comment support

After a brief hiatus due to writing a paper, I'm back to work on the project again, and I just released haskell-src-exts-1.1.0. What's interesting in this release is that it contains the first rudimentary support for handling comments, the true goal of this project.

Current support is limited to a single data type for comments distinguishing between single-line and multi-line comments (not sure that's even useful), and added functions that return all comments found while parsing as a list alongside the AST.

One feature I would like to see at this point is the ability to match comments to the AST elements they are attached to (if any) and vice versa, however this is not a trivial task. First of all, the AST as stated doesn't include enough source location information for e.g. expressions to give any reference point for comments. That's just a matter of adding though. But even assuming we have that, imagine the following two functions:

astToComment :: ast -> [Comment] -> [Comment]

commentToAst :: Comment -> ast -> ???

The first issue is with the type 'ast', which needs to be a member of some type class implementing these functions, since we will want to use this for any kind of AST entities e.g. Module, Decl, Exp etc. That's pretty easy to fix though, and it's possible that using SrcLoc as argument for the first one is more sensible anyway.

The second issue is what to return from the first function. How do we know what comments, if any, are actually attached to a given AST element if we only look at that element in isolation? Do we need to pass the entire AST tree as well, for reference?

Third, the ??? is rather problematic. If we start from a comment that we don't know what AST entity it is attached to (if any), we don't even know the type of the thing we want to return. The comment could be attached to a declaration, an expression, a function parameter or whathaveyou. Just think of Haddock comments for a good idea of what I mean.

So, all in all the design space here is far from trivial to navigate. If anyone has any input on this then please speak up!

lördagen den 20:e juni 2009

GSoC status report, week 4

Just a quick note to get something in the HWN. :-)

As promised last week, I've implemented "everything" now, and I have released a series of release candidates for an actual stable version 1.0.0. All help in testing it for bugs will be greatly appreciated!

lördagen den 13:e juni 2009

GSoC status report, week 3

Maybe I can get one of these written before the weekly news go up this time... 

My first milestone for haskell-src-exts is to have full and correct support for (almost) everything code-related. When I'm there, I will release haskell-src-exts 1.0.0. And I'm really quite close now.

On the issue of extensions, haskell-src-exts now supports nearly all extensions implemented by GHC. The two exceptions are UnicodeSyntax and NewQualifiedOperators, both of which are considered low priority. Neither should be hard to fix, but other things take precedence currently. There is also ExtensibleRecords, RestrictedTypeSynonyms and HereDocuments, all Hugs extensions, which are not yet supported. These are also considered less important for now.

On the issue of parametrisation, I've had to restructure quite a bit to enable the proper checks for types and contexts. In particular the fact that contexts are parsed as types, and passed in a ctype, means I've had to implement a two-tier check that they are correct. First to check that the type actually represents a valid context, which also transforms it into a list of assertions. Second to check that the assertions are correct in the context they are used. With FlexibleInstances on then anything goes, but without it there's a difference between contexts in class and instance declarations compared to elsewhere. This two-tier system is now in place, and seems to work just fine. In the process, it also gets rid of the wart that was the TyPred constructor in the Type data type, just like I previously got rid of e.g. Wildcard in the Exp data type. I'll probably write a separate entry about this stuff.

I've also improved the interface a lot, with multiple entry points to the parser and some tidying up of the top-level functions. In particular, parseFile now looks at the LANGUAGE pragmas to see what extensions to look for while parsing the rest.

So what remains for 1.0.0? Two things basically. First there are a few bugs in the trac left to squash, which (with one ugly but uncommon exception) shouldn't be much work. Second, there's that issue of handling operator fixities. That shouldn't be a lot of work either, so I have good hopes of having a release candidate out some time early next week. Stay tuned!

tisdagen den 9:e juni 2009

What's in a forall?

Look at the following module:

module Foo where
type Foo = Int
Is this module syntactically correct? I doubt anyone would say no to that, it's just a standard type synonym declaration, nothing strange. Fully Haskell 98 compliant too.

What about this module then:
module Foo where
type Foo = Bar
Is it syntactically correct? Sure it is, the fact that Bar isn't in scope is completely irrelevant where syntax is concerned. Loading it into GHCi would give a
Not in scope: type constructor or class `Bar'
error, which comes from the name resolver/type inference stage, not from the parsing stage. There's a clear cut line between a syntactic error and a semantic error here.

Let's look at yet another module:
module Foo where
type Foo = forall a . a -> a
Is it syntactically correct? Obviously that depends on your angle - it isn't Haskell 98 due to the explicit forall, but GHC would accept it readily with the proper extension enabled, namely Rank2Types or RankNTypes. But does that mean that this is only syntactically correct when one of those extensions are enabled? Let's see.

With no extension flags given, GHCi refuses the module with the error message
Illegal operator `.' in type `forall a . (a -> a)'
It doesn't recognize the explicit quantification syntax, and instead thinks that . is a type-level operator, which isn't allowed since I don't have TypeOperators on. Just for fun, trying it with TypeOperators enabled makes GHCi complain that the names forall and a are not in scope. I should add that in both these cases, GHCi actually also suggests that I might be looking for something like Rank2Types, since it cleverly recognizes the forall word. At any rate, with no extension given or with only TypeOperators it is clear that we don't get the syntax we're after.

With either Rank2Types or RankNTypes enabled, GHCi obviously correctly recognizes the forall as a quantifier and will accept the program. But those two flags are not the only ones that enable the forall syntax. In fact there are three other flags that do - ExistentialQuantification, LiberalTypeSynonyms and PolymorphicComponents. If we try to load our module into GHCi with either of these enabled, we get a different error message:
Illegal polymorphic or qualified type: 
forall a. a -> a
Apparently GHCi recognized and accepted the syntax of the type declaration, but the type checker balked at it because we haven't enabled sufficiently liberal types. But this is not a syntactic issue, but something else entirely, at least where GHCi is concerned.

I don't fully agree with GHCi here though, because of ExistentialQuantification. For the two other cases - LiberalTypeSynonyms and PolymorphicComponents - I think it's perfectly fine that forall-quantified types are made generally available syntactically, even if the type checker will exclude some programs that try to use them in the wrong ways. Both those extention deal with how and where to allow forall-quantified types, the first as arguments to type constructors and the second as arguments to data constructors. But for ExistentialQuantification, the category of syntactically allowed programs is actually different. What ExistentialQuantification allows us is to write forall quantifiers that precede constructors in data type declarations, e.g:
module Bar where
data Bar = forall a . Bar a
Indeed when I try to run this with no extentions enabled, GHCi complains that forall is not a viable data constructor and suggests that I use ExistentialQuantification. I would definitely consider it a syntactic issue to allow the forall keyword to precede data constructor declarations like this, and hence I don't agree with GHCi.

On a side note, something funny happens when I turn on e.g. PolymorphicComponents which enables forall as a keyword. Then GHCi complains that
Data constructor `Bar' has existential 
type variables, or a context
and suggests I use ExistentialQuantification or GADTs to allow this. Indeed, GADTs are a generalisation of ExistentialQuantification, except that it doesn't enable forall as a keyword. So using any extension that enables the forall keyword, plus GADTs, or just ExistentialQuantification, will make GHCi accept the program.

Going back to types, what about the following module:
module Bar where
import Foo (Foo)
type Bar = (Foo -> Int) -> Int
Syntactically correct? Absolutely, there's nothing strange here at all, fully Haskell 98 compliant. Will GHCi accept it? That depends on Foo. If the type Foo is defined as we did earlier as an explicitly polymorphic type, then Bar is a rank-3 type, which requires the RankNTypes extension. The correctness of Bar doesn't depend on module Bar alone, but on its dependent module Foo. Even if I enable Rank2Types for Bar (and Foo of course, which needs it), I still get an error since this type is actually rank 3. And in fact, if I inline Foo in the definition of Bar to instead write
module Bar where
type Bar = ((forall a . a -> a) -> Int) -> Int
GHCi gives me the exact same error message (except of course it mentions the explicit polymorphic type instead of the equivalent Foo):
Illegal polymorphic or qualified type:
forall a. a -> a
So clearly this last program is also syntactically correct, as long as the forall keyword is enabled, just like the simple one that imported Foo. Neither is type correct without RankNTypes, but both are syntactically correct, assuming explicit quantification is enabled at all.

What I'm trying to get at here with all of this is that it's not easy to draw a line between what constitutes a syntactic error and what is rather a semantic error when it comes to forall-quantified types. There are a number of different stances to take here, and I need to pick one for haskell-src-exts to implement.

The core issue is this: Is there a general syntactic difference between programs accepted with Rank2Types and programs accepted with RankNTypes? It is certainly possible to syntactically detect some programs that would require the full RankNTypes, by inspection of the types used. If a type has a higher rank than allowed, and this can be seen right in the syntax by inspecting the arguments to the function arrows, then it would be possible to rule the program out. But as shown by the import example above, there is no way we can syntactically see the difference between these classes of programs in general, we would need to do actual type analysis (and import chasing) for that. Fundamentally it's simply not a syntactic issue.

The question then becomes, to what extent do we bother to even try to detect the difference for those cases where it is possible? I don't think we should, at all. There may not be a clean cut separation between syntactic and type/semantic issues at all times, but I feel we are best off trying to adhere to what separation we can, and making it as clear cut as possible. The principle of least surprise suggests to me that it would be better to not try to be clever, and instead document clearly what haskell-src-exts considers actual syntax and what is left for other analyses.

The stance I intend to take is thus the following:
  • Allow forall-quantified types in all shapes and forms as long as one of Rank2Types, RankNTypes, LiberalTypeSynonyms and PolymorphicComponents is enabled.

  • Allow forall-quantified data constructor declarations (using normal non-GADT-style syntax) if ExistentialQuantification is enabled. GADTs would not lexically enable the forall keyword, and even when coupled with another extension that does I don't think it makes sense to let it enable this syntax.

  • Don't make any other difference between any of these classes of programs in what is allowed and what isn't.
If you think I am making the wrong choice here, please speak up, I would love to get input on this. And of course, please speak up if you agree with me as well. :-)

lördagen den 30:e maj 2009

Parametrising haskell-src-exts on extensions

I've started working on my GSoC project, and the first task to tackle is to make haskell-src-exts aware of what extensions it should look for when parsing. Most importantly this is due to the fact that some extensions steal syntax, so a valid Haskell 98 program may no longer be valid when some extension is turned on.

The typical example of this is Template Haskell, which steals the $ident syntax to mean a splice, rather than the application of the operator $ to whatever follows. Since the aim of haskell-src-exts is to be able to parse all valid Haskell programs, that should of course include those that don't rely on extensions as well.

As it turns out, there already exists a list of all Haskell extensions to date, both syntactic and semantic. This list resides with Cabal, to allow installation to dictate the proper flags to compilers. Since that list is in some sense the de facto standard list of extensions, I want to use the same, even if some of the items on the list are purely semantic and thus of no consequence to haskell-src-exts. There are two abers with that plan though.
  1. The list does not contain entries for the extensions enabled by HSP and HaRP, which are crucial components in haskell-src-exts (indeed those are the reason I created this library in the first place).

  2. The list is very poorly documented, and I've had to dig deep in user manuals to figure out what some of the extensions actually entail.

To solve the first, I've submitted a patch to Cabal to include two new entries: XmlSyntax and RegularPatterns. I find these names suitably telling for what they actually introduce to the language, as opposed to calling them something like HSP and HARP. Until that patch is actually applied and released, I will keep a local copy of the list.

To solve the second, meaning to be sure I actually understand these listed extensions right, I intend to post a list with my understanding and hope that a kind soul will point out any potential errors that may have arisen. So here goes.

I've categorized this list according to the way I would need to handle the cases in the lexer or parser respectively. The cases that can be handled purely in the lexer, simply because all productions relating to some extension rely on new lexemes, are the easiest to handle. If all cases could be handled in the lexer I would be very happy, but alas. There are cases that rely only on the usual lexemes but put them together in new ways, which means they must be handled in the parser. A few extensions introduce both kinds, and then there are a whole bunch that don't require any action at all.

Purely lexical extensions:
==========================

UnboxedTuples: Enables (# and #) lexemes.

MagicHash: Enables literals and idents ending with # and ##.

RecursiveDo: Enables the mdo keyword.

ImplicitParams: Introduces the ?ident and %ident lexemes, which should govern all
  the parsing cases.

KindSignatures: Introduces * as a special symbol. Note that we need to ensure that it is handled as a normal symbol if this extension is *not* on.

TemplateHaskell: Oh boy. Enables the following lexemes: $(, $ident, [|, |], [d|, [e|, [p|, as well as 'ident and ''ident. But it should be enough to handle on the lexer side, since all productions use one or more of the above.

ForeignFunctionInterface: Enables foreign, safe, unsafe, threadsafe, stdcall, ccall and export as keywords.

XmlSyntax: Enables the <ident, </ident, >, />, <%, %>, <[ and ]> lexemes, as well as the contextually introduced PCDATA lexeme.

The following extensions are also handled completely on the lexer side, but are not yet supported by haskell-src-exts:

Arrows: Enables the following lexemes: proc, -<, >-, -<<, >>-.

NewQualifiedOperators: Uses Prelude.(+) instead of Prelude.+ in prefix mode, and `Prelude.(+)` instead of `Prelude.+` in infix mode. Note that this means we need to parametrize prettyprinting on extensions as well.

TransformListComp: Should be called GeneralizedListComp! Adds the group, by and using keywords.

QuasiQuotes: Enables the [$ident| lexeme. This one will be tricky since we need to introduce a new lexer context that tells us to lex anything between [$ident| and |] as just a string lexeme.

Generics: Introduces {| and |} as lexemes.

UnicodeSyntax: Enables the ::, ->, <-, =>, .. and forall (when enabled) lexemes in Unicode versions. There is a proposal to add unicode versions of * (with KindSignatures) and -<, >-, -<<, >>- (with Arrows). I see no reason why haskell-src-exts couldn't support those out of the box. Supporting these lexemes when parsing is trivial, but it shows once more that we need to parametrize pretty-printing as well. We couldn't hope to support both unicode and non-unicode versions in the same document when printing (parsing is no problem, but I have *no* intention of embellishing the AST to remember what kind of symbol the parser saw for each arrow etc), but I doubt anyone using Unicode versions would mind if we "accidentally" transform their non-unicode variants to unicode as well.
  
ExtensibleRecords: Enables #ident as a lexeme, and the Rec keyword. Also enables trex-record syntax (a = True, b = 1), record update syntax (c = "hello" | r), and corresponding types Rec (a::Bool,b::Int) and contexts r\a => Rec (a :: Int | r).

Purely parsing extensions:
==========================

PatternGuards: Enables all standard statements in guard position, default would be just a single qualifier.

ViewPatterns: Enables patterns of the form (exp -> pat).

ParallelListComp: Enables several | clauses in list comprehensions.

NamedFieldPuns, RecordPuns: Enables C {a} syntax in both expressions and patterns. RecordPuns is deprecated.

RecordWildcards: Enables C {..} syntax in both expressions and patterns.

PackageImports: Allows an optional package name as a String between import and module name.

EmptyDataDecls: Allows data declarations without a body.

TypeOperators: Allows operators in type, data and class declarations. Also allows infix `Either` etc.

GADTs: Allows the where-style declaration of data types. These are already handled in a special way in the parser, so should be easy.

StandaloneDeriving: Parsing only, enables deriving decls.

MultiParamTypeClasses: Parsing only, allow more than one parameter or argument to class and instance declarations respectively.

FunctionalDependencies: Parsing only, add the | a -> b syntax to class heads.

FlexibleInstances: Allow arbitary class assertions in context and head for instance declarations.

FlexibleContexts: Allows contexts of forms other than (C a) or (C (a1 .. an)).

ScopedTypeVariables, PatternSignatures: Allows type signatures to be given to patterns. PatternSignatures is deprecated.
  
BangPatterns: Allows the use of the ! symbol in patterns, marking them as strict.

RestrictedTypeSynonyms: Hugs extension that should probably be deprecated. Adds an in block to type declarations, to give the type a scope.


Extensions that need both lexing and parsing:
===================================

LiberalTypeSynonyms: Enables the forall keyword. Also allows a number of liberal uses of types:

  • forall quantified type synonyms: 
    type Foo = forall a . Show a => a
  • Allows unboxed tuples in type synonyms.
  • Apply type synonyms to forall types: 
    f :: Foo (forall a . a -> a)
    Note that syntactically this is the same as Rank2Types.
  • Applying type synonyms to partially applied type synonyms (not syntactic so we don't need to care).


ExistentialQuantification: Enables the forall keyword. Also allows the use of forall binders to precede data constructor declarations.

TypeFamilies: Enables the family keyword, which alone governs a number of parser rules. But associated types and data types must still be handled in the parser, as well as data and type instances. Also type equality constraints must be handled in the parser.

PolymorphicComponents: Enables the forall keyword. Also allows forall quantified arguments to data constructors.

Rank2Types: Enables the forall keyword. Allows any function to have a forall-quantified argument, including data constructors (hence implies PolymorphicComponents).

RankNTypes: Enables the forall keyword. Allows the use of forall-quantified arguments anywhere that would allow a type, except as arguments to type constructors (hence implies Rank2Types).

ImpredicativeTypes: Enables the forall keyword. Allows the use of forall-quantified arguments as arguments to type constructors.

RegularPatterns: Enables the (|, |) and @: lexemes. Also enables the parsing of regular operators like pat+, pat? etc, hence the need to treat in both.

HereDocuments: I'm honestly not sure I will ever support these, since they need a complete bypass of the layout rule, as well as a weird mix of string and expression lexemes.


Completely separate:
==========================

CPP: We already depend on cpphs, we should probably run that on the document before parsing if this extension is enabled. There is no way we're going to be able to print the document with the CPP stuff included though, so refactoring won't work for CPP-ed documents. At least I don't see any immediate solution... (Hmm, one way could be to follow every possible CPP path through the document, and parse them all. Ouch.)


No action needed:
==========================

NoImplicitPrelude: Purely semantic, determines what symbols are in scope automatically.

PostfixOperators: Purely semantic, uses standard left-sections but gives them a slightly new meaning.

DisambiguateRecordFields: Only about disambiguation of names, which goes beyond parsing.

DeriveDataTypeable: Adds more cases to the normal deriving clause, which the parser makes no attempt to check anyway, that's semantics and not syntax.
 
GeneralizedNewtypeDeriving: Same as the former case, newtype deriving is already allowed by H98, all this does is allow more cases. Semantics, not syntax.

ConstrainedClassMethods: The syntax is already allowed, it's just that in H98 the type class variable and the constrained variable may not be the same variable. We don't check that anyway.

OverlappingInstances, UndecidableInstances, IncoherentInstances, TypeSynonymInstances: These only declare what instances are valid, possibly together, but we don't do any typing validation, and we don't check to see what's a type constructor and what's a data constructor.

OverloadedStrings: This only changes the type of string literals, not their syntax in any way.

RelaxedPolyRec: This only governs what recursive bindings are allowed, we don't check names so won't even see a normal recursion when it happens.

NoMonomorphismRestriction, NoMonoPatBinds: Will only affect the type interpretations of bindings - semantics, not syntax. It should really be MonoPatBinds though, not the negated version, regardless of what GHC uses as default.

ExtendedDefaultRules: Guides the type inference engine to defaults - obviously nothing syntactic.

UnliftedFFITypes: Completely undocumented, but I'm guessing it has to do with allowing things like Int# as corresponding unlifted C types when writing FFI. Not sure if it implies MagicHash or not, presumably not (and I don't think it should).


And that's the end of it. Any and all comments on my categorizations and interpretations of extensions from a syntactic point of view are more than welcome.

So far I've embellished the parser with an extension environment, and implemented the cases listed above as (being implemented and) requiring lexer attention. So far nothing tricky, but the parser will add some interesting cases, in particular the code for validating class and instance heads and contexts for the many different cases that are allowed.

Oh, and the code 

foo :: forall a . a

can mean very different things depending on what extensions are on. Is it the type operator . applied to forall a and a respectively, with forall being a type variable of kind * -> * ...?

torsdagen den 28:e maj 2009

Haskell Platform, I'm in love

Sitting here with my shiny new laptop, I can tell already it's going to increase my productivity multifold. No more waiting 5 minutes or more for haskell-src-exts to compile!

I'm going through the standard motions, first install is Opera of course since I can't stand IE more than the time it takes me to type http://www.opera.com/download . And I need a browser for everything else I want to install.

Second install is GH... no! Halfway through the download I realize my error. No longer will I have to go through the chain of installing GHC, cabal, alex, happy, etc. Haskell Platform, yay! One download, one installer, and I'm ready to roll. A most pleasant experience I must say, and everything seems to work perfectly out of the box on my 64-bit Vista. Way to go, glorious Haskell community!