Sunday, 26 July 2009

Starting on the comment support

After a brief hiatus due to writing a paper, I'm back to work on the project again, and I just released haskell-src-exts-1.1.0. What's interesting in this release is that it contains the first rudimentary support for handling comments, the true goal of this project.

Current support is limited to a single data type for comments, which distinguishes between single-line and multi-line comments (not sure that's even useful), and new parsing functions that return all comments found as a list alongside the AST.

One feature I would like to see at this point is the ability to match comments to the AST elements they are attached to (if any) and vice versa. This is not a trivial task, however. First of all, the AST as it stands doesn't include enough source location information, e.g. for expressions, to give any reference point for comments. That's just a matter of adding it, though. But even assuming we have that, imagine the following two functions:

astToComment :: ast -> [Comment] -> [Comment]

commentToAst :: Comment -> ast -> ???

The first issue is with the type 'ast', which needs to be a member of some type class implementing these functions, since we will want to use this for any kind of AST entities e.g. Module, Decl, Exp etc. That's pretty easy to fix though, and it's possible that using SrcLoc as argument for the first one is more sensible anyway.
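To make the type class idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Located` class, the simplified `SrcLoc` and `Comment` records, the toy `Decl` and `Exp` types, and above all the rule that a comment is "attached" when it starts on the line directly above the node. None of this is the actual haskell-src-exts API.

```haskell
-- Hypothetical types for illustration only; not the haskell-src-exts API.
data SrcLoc = SrcLoc { srcLine :: Int, srcColumn :: Int }
  deriving (Eq, Ord, Show)

data Comment = Comment { commentLoc :: SrcLoc, commentText :: String }
  deriving (Eq, Show)

-- Any AST entity that can report where it starts.
class Located a where
  locOf :: a -> SrcLoc

-- Toy stand-ins for Decl and Exp.
data Decl = Decl SrcLoc String
data Exp  = Var SrcLoc String

instance Located Decl where locOf (Decl l _) = l
instance Located Exp  where locOf (Var l _)  = l

-- Comments that start on the line directly above the given location.
-- This naive notion of "attached" is an assumption, chosen only to
-- make the types concrete.
commentsAt :: SrcLoc -> [Comment] -> [Comment]
commentsAt loc = filter (\c -> srcLine (commentLoc c) + 1 == srcLine loc)

-- The class-based variant is just the SrcLoc-based one composed with locOf.
astToComment :: Located a => a -> [Comment] -> [Comment]
astToComment = commentsAt . locOf
```

Since `astToComment` only ever uses the location, the sketch suggests that the `SrcLoc` version is the more basic of the two.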

The second issue is what to return from the first function. How do we know what comments, if any, are actually attached to a given AST element if we only look at that element in isolation? Do we need to pass the entire AST as well, for reference?

Third, the ??? is rather problematic. If we start from a comment without knowing which AST entity it is attached to (if any), we don't even know the type of the thing we want to return. The comment could be attached to a declaration, an expression, a function parameter or what have you. Just think of Haddock comments for a good idea of what I mean.
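One conceivable way to fill in the ??? is a closed sum type over the kinds of node a comment may belong to, wrapped in `Maybe` for comments that belong to nothing. The sketch below is only that, a sketch: `AstNode`, the flattened list of nodes paired with their start lines, the line-based `Comment`, and the "node starts on the next line" rule are all hypothetical, and the toy `Decl`, `Exp` and `Pat` types stand in for the real ones.

```haskell
import Data.List (find)

-- Toy stand-ins for the real AST types.
data Decl = Decl String deriving (Eq, Show)
data Exp  = Var String  deriving (Eq, Show)
data Pat  = PVar String deriving (Eq, Show)

-- One candidate for the ??? result: a sum over every node kind
-- that a comment could be attached to.
data AstNode
  = DeclNode Decl
  | ExpNode Exp
  | PatNode Pat
  deriving (Eq, Show)

-- Simplified comment: only the line it starts on and its text.
data Comment = Comment { commentLine :: Int, commentText :: String }
  deriving (Eq, Show)

-- A flattened view of the tree: each node paired with its start line.
type Flattened = [(Int, AstNode)]

-- Return the first node starting on the line after the comment, if any.
-- The attachment rule is an assumption made for the sake of the example.
commentToAst :: Comment -> Flattened -> Maybe AstNode
commentToAst c nodes =
  snd <$> find (\(line, _) -> line == commentLine c + 1) nodes
```

The price of this design is that every caller has to pattern match on `AstNode`, and the sum type has to be extended whenever a new kind of node can carry comments.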

So, all in all the design space here is far from trivial to navigate. If anyone has any input on this then please speak up!

3 comments:

  1. You might look at what the DIANA standard did for Ada so many years ago. See "Goos, G., Wulf, W.A., Evans, A., Butler, K.J. (ed.). DIANA: An Intermediate Language For Ada (Revised Edition). Springer-Verlag. (C) 1983. ISBN 3-540-12695-3." (Book 161 in Springer's "Lecture Notes in Computer Science" series.)

    Anyway, quoting page 168:
    "In order properly to reconstruct the source, DIANA must be capable of recording comments. To this end, every DIANA node that has a source position attribute (i.e. all those which correspond to points in the source program) has the additional attribute 'lx_comments : comments;' which is an implementation-dependent type."

    Would you object to storing comments using something like Data.Witness (or Data.OpenWitness) so that they can be encapsulated behind an existential and cracked open only by their producers?

  2. I think the comments that should be returned are those that are between the node and the previous node at the same level of the tree, and between the node and the next node on the same level of the tree.

    This is quite general, so this information can then be refined to meet the specific needs of different applications. Some applications may only be interested in the comments between the node and previous non-comment token. Haddock, on the other hand, sometimes wants all the comments between the nodes. Here's an example:

    data A = B {- | comment for 'C' -} | C

    Here the comment applies to C even though there's a non-comment token between it and C's node.

    Haddock also wants the comments after the node, in most cases.

    I think a good way to implement this is to walk the tree and the token list at the same time, using generics to focus only on the SrcSpans. This procedure would build something like Map SrcSpan ([Comment],[Comment]), which can later be used to implement the interface.

    I don't have a solution for the type of commentToAst.

  3. On second thought, this may not be such a good idea.

    In the example from the previous post, if a comment is put before =, it would be attached to both A and B. And it's going to be annoying to fix up things like that in a later step.
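The map proposed in comment 2 can be sketched for a single level of sibling nodes. This is a simplification under stated assumptions: positions are plain integer offsets rather than line and column pairs, the sibling spans are given in source order, the `SrcSpan` and `Comment` records and the `attach` function are hypothetical, and the generic tree walk is left out entirely. It uses `Data.Map` from the containers package.

```haskell
import qualified Data.Map as Map

-- Hypothetical simplified types: a position is a single offset.
data SrcSpan = SrcSpan { spanStart :: Int, spanEnd :: Int }
  deriving (Eq, Ord, Show)

data Comment = Comment { commentPos :: Int, commentText :: String }
  deriving (Eq, Show)

-- For sibling spans in source order, collect the comments between each
-- node and its previous sibling, and between it and its next sibling.
attach :: [SrcSpan] -> [Comment] -> Map.Map SrcSpan ([Comment], [Comment])
attach spans cs = Map.fromList (zipWith3 entry lowers spans uppers)
  where
    -- End of the previous sibling (or no lower bound for the first node).
    lowers = minBound : map spanEnd spans
    -- Start of the next sibling (or no upper bound for the last node).
    uppers = drop 1 (map spanStart spans) ++ [maxBound]
    entry lo s hi =
      ( s
      , ( [c | c <- cs, commentPos c > lo, commentPos c < spanStart s]
        , [c | c <- cs, commentPos c > spanEnd s, commentPos c < hi]
        )
      )
```

With two siblings and one comment between them, the comment appears both in the "after" list of the first node and in the "before" list of the second. That is a double attachment similar to the one comment 3 worries about, and it would have to be resolved in a later step.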