NB. I'm using this Alex template from Simon Marlow.
I'd like to create a lexer for C-style comments. My current approach creates separate tokens for comment start, comment end, comment middle, and single-line comments:
%wrapper "monad"
tokens :-
<0> $white+ ;
<0> "/*" { mkL LCommentStart `andBegin` comment }
<comment> . { mkL LComment }
<comment> "*/" { mkL LCommentEnd `andBegin` 0 }
<0> "//" .*$ { mkL LSingleLineComment }
data LexemeClass
= LEOF
| LCommentStart
| LComment
| LCommentEnd
| LSingleLineComment
How can I reduce the number of middle tokens? For the input /*blabla*/ I get 8 tokens instead of one!
How can I strip the // part from the single-line comment token?
Is it possible to lex comments without the monad wrapper?
Have a look at this:
http://lpaste.net/107377
Test with something like:
echo "This /* is a */ test" | ./c_comment
which should print:
Right [W "This",CommentStart,CommentBody " is a ",CommentEnd,W "test"]
The key Alex routines you need to use are the following (their approximate signatures are given after the list):
alexGetInput -- gets the current input state
alexSetInput -- sets the current input state
alexGetByte -- returns the next byte and input state
andBegin -- return a token and set the current start code
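For reference, their shapes are roughly as below, as generated by Alex for the monad wrapper (Byte is Word8; exact details can vary between Alex versions):
type AlexAction result = AlexInput -> Int -> Alex result

alexGetInput :: Alex AlexInput
alexSetInput :: AlexInput -> Alex ()
alexGetByte  :: AlexInput -> Maybe (Byte, AlexInput)
andBegin     :: AlexAction result -> Int -> AlexAction result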
Each of the routines commentBegin, commentEnd and commentBody has the following signature:
AlexInput -> Int -> Alex Lexeme
where Lexeme stands for your token type. The AlexInput parameter has the following form (for the monad wrapper):
(AlexPosn, Char, [Bytes], String)
The Int parameter is the length of the match stored in the String field. Therefore the form of most token handlers will be:
handler :: AlexInput -> Int -> Alex Lexeme
handler (pos,_,_,inp) len = ... do something with (take len inp) and pos ...
In general it seems that a handler can ignore the Char and [Bytes] fields.
The handlers commentBegin and commentEnd can ignore both the AlexInput and Int arguments because they just match fixed-length strings.
The commentBody handler works by calling alexGetByte to accumulate the comment body until "*/" is found. As far as I know, C comments cannot be nested, so the comment ends at the first occurrence of "*/".
Note that the first character of the comment body is in the match0 variable. In fact, my code has a bug in it since it will not match "/**/" correctly. It should look at match0 to decide whether to start at loop or loopStar.
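Here is a minimal, untested sketch of that technique (not the code from the paste). It uses a one-byte lookahead instead of the paste's loop/loopStar, which also sidesteps the match0 bug just mentioned, and it assumes ASCII input (each byte decoded with toEnum). The Lexeme shape is assumed from the question's mkL, not taken from the paste:
-- Assumed token shape:
data Lexeme = Lexeme AlexPosn LexemeClass String

-- Accumulate the comment body until "*/" is seen, leaving "*/"
-- unconsumed so that the <comment> "*/" rule still produces LCommentEnd.
commentBody :: AlexInput -> Int -> Alex Lexeme
commentBody (pos, _, _, inp) len = do
  start <- alexGetInput                  -- input state just after the match
  go (reverse (take len inp)) start      -- seed accumulator with the match
  where
    go acc i = case alexGetByte i of
      Nothing -> alexError "unterminated comment"
      Just (b, i')
        -- peek one byte further: stop *before* a "*/" sequence
        | c == '*'
        , Just (b2, _) <- alexGetByte i'
        , toEnum (fromIntegral b2) == '/' -> do
            alexSetInput i               -- commit everything before "*/"
            return (Lexeme pos LComment (reverse acc))
        | otherwise -> go (c : acc) i'
        where c = toEnum (fromIntegral b) :: Char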
You can use the same technique to parse "//" style comments - or any token where a non-greedy match is required.
Another key point is that patterns like $white+ are qualified with a start code:
<0>$white+
This is done so that they are not active while processing comments.
You can use another wrapper, but note that the structure of the AlexInput type may be different -- e.g. for the basic wrapper it is just a 3-tuple: (Char,[Byte],String). Just look at the definition of AlexInput in the generated .hs file.
A final note... accumulating characters using ++ is, of course, rather inefficient. You probably want to use Text (or ByteString) for the accumulator.
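For instance (a generic illustration, not code from the paste), a Data.Text.Lazy.Builder makes each append cheap and defers the conversion to the end:
import Data.List (foldl')
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

-- Append each Char to a Builder (amortized O(1)) instead of using ++,
-- and render the accumulated text once, when the comment ends:
packBody :: [Char] -> TL.Text
packBody cs = B.toLazyText (foldl' (\acc c -> acc <> B.singleton c) mempty cs)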
I'm planning on writing a parser for some language. I'm quite confident that I could cobble together a parser in Parsec without too much hassle, but I thought about including comments in the AST so that I could eventually implement a code formatter.
At first, adding an extra parameter to the AST types seemed like a suitable idea (this is basically what was suggested in this answer). For example, instead of having
data Expr = Add Expr Expr | ...
one would have
data Expr a = Add a (Expr a) (Expr a)
and use a for whatever annotation (e.g. for comments that come after the expression).
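For instance (an illustrative sketch, not the question's code), the annotation slot can then be instantiated per use:
newtype Comment = Comment String

-- Carry trailing comments on every node:
type CommentedExpr = Expr [Comment]

-- Or carry nothing, for phases that ignore comments:
type PlainExpr = Expr ()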
However, there are some not so exciting cases. The language features C-like comments (// ..., /* .. */) and a simple for loop like this:
for (i in 1:10)
{
... // list of statements
}
Now, excluding the body there are at least 10 places where one could put one (or more) comments:
/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/
{ /*K*/
...
In other words, while the for loop could previously be comfortably represented as an identifier (i), two expressions (1 & 10) and a list of statements (the body), we would now have to include at least 10 more parameters or records for annotations.
This gets ugly and confusing quite quickly, so I wondered whether there is a clearly better way to handle this. I'm certainly not the first person wanting to write a code formatter that preserves comments, so there must be a decent solution, or is writing a formatter just that messy?
You can probably capture most of those positions with just two generic comment productions:
Expr -> Comment Expr
Stmt -> Comment Stmt
This seems like it ought to capture comments A, C, F, H, J, and K for sure; possibly also G depending on exactly what your grammar looks like. That only leaves three spots to handle in the for production (maybe four, with one hidden in Range here):
Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
In other words: one before each literal string but the first. Seems not too onerous, ultimately.
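To make that concrete, here is a hedged sketch of an AST shaped by those productions (all constructor names invented for illustration):
type Comments = [String]   -- zero or more comments attached at one spot

data Expr  = Num Int
           | Add Expr Expr
           | CommentedE Comments Expr   -- Expr -> Comment Expr

data Range = Range Expr Expr

data Stmt  = CommentedS Comments Stmt   -- Stmt -> Comment Stmt
           -- "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt:
           | For Comments Expr Comments Range Comments Stmt
           | Block [Stmt]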
I'm attempting to parse permutations of flags. The behavior I want is "one or more flags in any order, without repetition". I'm using the following packages:
megaparsec
parser-combinators
The code I have outputs what I want, but it is too lenient with its input: I don't understand why it accepts multiple occurrences of the same flag. What am I doing wrong here?
pFlags :: Parser [Flag]
pFlags = runPermutation $ f <$>
  toPermutation (optional (GroupFlag <$ char '\'')) <*>
  toPermutation (optional (LeftJustifyFlag <$ char '-'))
where f a b = catMaybes [a, b]
Examples:
"'-" = [GroupFlag, LeftJustifyFlag] -- CORRECT
"-'" = [LeftJustifyFlag, GroupFlag] -- CORRECT
"''''-" = [GroupFlag, LeftJustifyFlag] -- INCORRECT, should fail if there's more than one of the same flag.
Instead of toPermutation with optional, I believe you need to use toPermutationWithDefault, something like this (untested):
toPermutationWithDefault Nothing (Just GroupFlag <$ char '\'')
The reasoning is described in the paper “Parsing Permutation Phrases” (PDF) in §4, “adding optional elements” (emph. added):
Consider, for example […] all permutations of a, b and c. Suppose b can be empty and we want to recognise ac. This can be done in three different ways since the empty b can be recognised before a, after a or after c. Fortunately, it is irrelevant for the result of a parse where exactly the empty b is derived, since order is not important. This allows us to use a strategy similar to the one proposed by Cameron: parse nonempty constituents as they are seen and allow the parser to stop if all remaining elements are optional. When the parser stops the default values are returned for all optional elements that have not been recognised.
To implement this strategy we need to be able to determine whether a parser can derive the empty string and split it into its default value and its non-empty part, i.e. a parser that behaves the same except that it does not recognise the empty string.
That is, the permutation parser needs to know which elements can succeed without consuming input, otherwise it will be too eager to commit to a branch. I don’t know why this would lead to accepting multiples of an element, though; perhaps you’re also missing an eof?
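Putting this together, a sketch of the adjusted parser (untested; Parser and Flag are the question's own types):
import Control.Applicative.Permutations
  (runPermutation, toPermutationWithDefault)
import Data.Maybe (catMaybes)
import Text.Megaparsec (eof, parse)
import Text.Megaparsec.Char (char)

pFlags :: Parser [Flag]
pFlags = runPermutation $ f
  <$> toPermutationWithDefault Nothing (Just GroupFlag       <$ char '\'')
  <*> toPermutationWithDefault Nothing (Just LeftJustifyFlag <$ char '-')
  where f a b = catMaybes [a, b]

-- Requiring end-of-input at the call site makes leftover repeated flags
-- an error rather than silently unconsumed input:
--   parse (pFlags <* eof) "" "''''-"   -- now fails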
I'm making a simple calculator that prints postfix, in order to learn Bison. I could make the postfix part work, but now I need to handle assignments to variables (a-z): for example, a=3+2; should print a32+=;. I'm trying to modify my working postfix code to be able to read a char too.
If I understand correctly, to be able to put different types in $$ I need a union, and to make the nodes I should use a struct, because in my case expr can be an int or a char.
This is my parser:
%code requires{
struct nPtr{
char *val;
int num;
};
}
%union {
int iValue;
char sIndex;
struct nPtr *e;
};
%token PLUS MINUS STAR LPAREN RPAREN NEWLINE DIV ID NUMBER POW EQL
%type <iValue> NUMBER
%type <sIndex> ID
%type <e> expr line
%left PLUS MINUS
%left STAR DIV
%left POW
%left EQL
%%
line : /* empty */
|line expr NEWLINE { printf("=\n%d\n", $2.num); }
expr : LPAREN expr RPAREN { $$.num = $2.num; }
| expr PLUS expr { $$.num = $1.num + $3.num; printf("+"); }
| expr MINUS expr { $$.num = $1.num - $3.num; printf("-"); }
| expr STAR expr { $$.num = $1.num * $3.num; printf("*"); }
| expr DIV expr { $$.num = $1.num / $3.num; printf("/");}
| expr POW expr { $$.num = pow($1.num, $3.num); printf("**");}
| NUMBER { $$.num = $1.num; printf("%d", yylval); }
| ID EQL expr { printf("%c", yylval); }
;
%%
I have this in the lexer to handle "=" and variables:
"=" { return EQL; }
[a-z] { yylval.sIndex = strdup(yytext); return ID; }
I get an error:
warning empty rule for typed nonterminal and no action
The only answer I found here about that is this:
Bison warning: Empty rule for typed nonterminal
It says to just remove the /* empty */ part in:
line: /* empty */
| line expr NEWLINE { printf("=\n%d\n", $2.num); }
When I do that, I get 3 new warnings:
warning 3 nonterminal useless in grammar
warning 10 rules useless in grammar
fatal error: start symbol line does not derive any sentence
I googled and found some solutions that just gave me other problems. For example, when I change:
line:line expr NEWLINE { printf("=\n%d\n", $2.num); }
to
line:expr NEWLINE { printf("=\n%d\n", $2.num); }
Bison accepts it, but when I try to compile the code in Visual Studio I get a lot of errors like:
left of '.e' must have struct/union type
'pow': too few arguments for call
'=': cannot convert from 'int' to 'YYSTYPE'
That's as far as I got. I can't find a simple example similar to my needs. I just want to make expr able to read a char and print it. If someone could check my code and recommend some changes, it would be really appreciated.
It is important to actually understand what you are asking Bison to do, and what it is telling you when it suggests that your instructions are wrong.
Applying a fix from someone else's problem will only work if they made the same mistake as you did. Making random changes based on Google searches is not a very structured way to debug, nor a good way to learn a new tool.
Now, what you said was:
%type <e> expr line
Which means that expr and line both produce a value whose union tag is e. (Tag e is a pointer to a struct nPtr. I doubt that's what you wanted either, but we'll start with the Bison warnings.)
If a non-terminal produces a value, it produces a value in every production. To produce a value, the semantic rule needs to assign a value (of the correct tag-type) to $$. For convenience, if there is no semantic action and $1 has the correct tag-type, then Bison will provide the default rule $$ = $1. (That sentence is not quite correct, but it's a useful approximation.) Many people think that you should not actually rely on this default, but I think it's OK as long as you know that it is happening and have verified that the preconditions hold.
In your case, you have two productions for line. Neither of them assigns a value to $$. This might suggest that you didn't really intend for line to even have a semantic value, and if we look more closely, we'll see that you never attempt to use a semantic value for any instance of the non-terminal line. Bison does not, as far as I know, attempt such a detailed code examination, but it is capable of noting that the production:
line : /* empty */
has no semantic action, and
does not have any symbols on the right-hand side, so no default action can be constructed.
Consequently, it warns you that there is some circumstance in which line will not have its value set. Think of this as the same warning as your C compiler might give you if it notices that a variable you use has never had a value assigned to it.
In this case, as I said, Bison has not noticed that you never use the value of line. It just assumes that you wouldn't have declared a type for the value if you didn't intend to somehow provide a value. You did declare the type, and you never provide a value. So your instructions to Bison were not clear, right?
Now, the solution is not to remove the production. There is no problem with the production, and if you remove it then line is left with only one production:
line : line expr NEWLINE
That's a recursive production, and the recursion can only terminate if there is some other production for line which does not recurse. That would be the production line: %empty, which you just deleted. So now Bison can see that the non-terminal line is useless. It is useless because it can never derive a string consisting only of terminals.
But that's a big problem, because that's the non-terminal your grammar is supposed to recognize. If line can't derive any string, then your grammar cannot recognize anything. Also, because line is useless, and expr is only used by a production in line (other than recursive uses), expr itself can never actually be used. So it, too, becomes useless. (You can think of this as the equivalent of your C compiler complaining that you have unreachable code.)
So then you attempt to fix the problem by making (the only remaining) rule for line non-recursive. Now you have:
line : expr NEWLINE
That's fine, and it makes line useful again. But it does not recognise the language you originally set out to recognise, because it will now only recognise a single line. (That is, an expr followed by a NEWLINE, as the production says.)
I hope you have now figured out that your original problem was not the grammar, which was always correct, but rather the fact that you told Bison that line produced a value, but you did not actually do anything to provide that value. In other words, the solution would have been to not tell Bison that line produces a value. (You could, instead, provide a value for line in the semantic rule associated with every production for line, but why would you do that if you have no intention of ever using that value?)
Now, let's go back to the actual type declarations, to see why the C compiler complains. We can start with a simple one:
expr : NUMBER { $$.num = $1.num; }
$$ refers to the expr itself, which is declared as having type tag e, which is a struct nPtr *. The star is important: it indicates that the semantic value is a pointer. In order to get at a field in a struct from a pointer to the struct, you need to use the -> operator, just like in this simple C example:
struct nPtr the_value; /* The actual struct */
struct nPtr *p = &the_value; /* A pointer to the struct */
p.num = 3; /* Error: p is not a struct */
p->num = 3; /* Fine. p points to a struct */
/* which has a field called num */
(*p).num = 3; /* Also fine, means exactly the same thing. */
/* But don't write this. p->num is much */
/* more readable. */
It's also worth noting that in the C example, we had to make sure that p was initialised to the address of some actual memory. If we hadn't done that and we attempted to assign something to p->num, then we would be trying to poke a value into an undefined address. If you're lucky, that would segfault. Otherwise, it might overwrite anything in your executable.
If any of that was not obvious, please go back to your C textbook and make sure that you understand how C pointers work, before you try to use them.
So, that was the $$.num part. The actual statement was
$$.num = $1.num;
So now let's look at the right-hand side. $1 refers to a NUMBER, which has tag type iValue, which is an int. An int is not a struct and it has no named fields. $$.num = $1.num is going to be processed into something like:
int i = <some value>;
$$.num = i.num;
which is totally meaningless. Here, the compiler will again complain that iValue is not a struct, but for a different reason: on the left-hand side, e was a pointer to a struct (which is not a struct); on the right-hand side iValue is an integer (which is also not a struct).
I hope that gives you some ideas on what you need to do. If in doubt, there are a variety of fully-worked examples in the Bison manual.
In a small DSL, I'm parsing macro definitions, similar to C pre-processor #define directives (here a simplistic example):
_def mymacro(a,b) = a + b / a
When the following call is encountered by the parser
c = mymacro(pow(10,2),3)
it is expanded to
c = pow(10,2) + 3 / pow(10,2)
My current approach is:
wrap the parser in a State monad
when parsing macro definitions, store them in the state, with their body unparsed, i.e. captured as a string (a possible shape for this state is sketched below)
when parsing a macro call, find the definition in the state, replace the arguments in the body text, replace the call with this body and resume the parsing.
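Concretely, the stored definitions might look something like this (a sketch; MacroDef and MacroEnv are invented names, not the question's code):
import Data.Map (Map)
import Data.Text (Text)

-- A macro is stored with its parameter names and its *unparsed* body text.
data MacroDef = MacroDef
  { macroParams :: [Text]
  , macroBody   :: Text
  }

type MacroEnv = Map Text MacroDef   -- keyed by macro name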
Some code from the last step:
macrocallStmt
= do -- capture starting position and content of old input before macro call
oldInput <- getInput
oldPos <- getPosition
-- parse the call
ret <- identifier
symbolCS "="
i <- identifier
args <- parens $ commaSep anyExprStr
-- expand the macro call
us <- get
let inlinedCall = replaceMacroArgs i args ret us
-- set up new input with macro call expanded
remainder <- getInput
let newInput = T.append inlinedCall (T.cons '\n' remainder)
setPosition oldPos
setInput newInput
-- update the expanded input script
modify (updateExpandedInput oldInput newInput)
anyExprStr = fmap praShow expression <|> fmap praShow algexpr
This approach does the job decently. However, it has a number of drawbacks.
Parsing multiple times
Any valid DSL expression can be an argument of a macro call. Therefore, even though I only need their textual representation (to be substituted into the macro body), I need to parse the arguments and then convert them back to strings; simply looking for the next comma wouldn't work. Then the complete, customised macro body is parsed. So in practice, macro arguments get parsed twice (and also show-ed, which has its cost). Moreover, each call requires a new parse of the (almost identical) body. The reason to keep the body unparsed in memory is to allow maximum flexibility: in the body, even DSL keywords could be constructed out of the macro arguments.
Error handling
Because the expanded body is inserted in front of the unconsumed input (replacing the call), the initial and final input can be quite different. In the event of a parse error, the position where the error occurred in the expanded input is available. However, when processing the error, I only have the original, not expanded, input. So the error position won't match.
That is why, in the code snippet above, I use the state to save the expanded input, so that it is available when the parser exits with an error.
This works well, but I noticed that it becomes quite costly, with new Text arrays (the input stream is Text) being allocated for the whole stream at every expansion. Perhaps keeping the expanded input in the state as String, rather than Text, would be cheaper in this case, i.e. when a middle part needs to be replaced?
The reasons for this question are:
I would appreciate suggestions / comments on the two issues described above
Can anyone suggest a better approach altogether?
I have recently started learning Haskell and have been trying my hand at Parsec. However, for the past couple of days I have been stuck with a problem that I have been unable to find the solution to. So what I am trying to do is write a parser that can parse a string like this:
<"apple", "pear", "pineapple", "orange">
The code that I wrote to do that is:
collection :: Parser [String]
collection = char '<' *> (string `sepBy` char ',') <* char '>'
string :: Parser String
string = char '"' *> (many (noneOf ['\"', '\r', '\n', '"'])) <* char '"'
This works fine for me, as it is able to parse the string that I have defined above. Nevertheless, I would now like to enforce the rule that every element in this collection must be unique, and that is where I am having trouble. One of the first results I found when searching on the internet was this one, which suggests using the nub function. Although the problem stated in that question is not the same, it would in theory solve my problem. But what I don't understand is how I can apply this function within a Parser. I have tried adding the nub function to several parts of the code above without any success. Later I also tried doing it the following way:
collection :: Parser [String]
collection = do
char '<'
value <- string `sepBy` char ','
char '>'
return nub value
But this does not work as the type does not match what nub is expecting, which I believe is one of the problems I am struggling with. I am also not entirely sure whether nub is the right way to go. My fear is that I am going in the wrong direction and that I won't be able to solve my problem like this. Is there perhaps something I am missing? Any advice or help anyone could provide would be greatly appreciated.
The Parsec Parser type is an instance of MonadPlus which means that we can always fail (ie cause a parse error) whenever we want. A handy function for this is guard:
guard :: MonadPlus m => Bool -> m ()
This function takes a boolean. If it's true, it returns () and the whole computation (a parse in this case) does not fail. If it's false, the whole thing fails.
So, as long as you don't care about efficiency, here's a reasonable approach: parse the whole list, check for whether all the elements are unique and fail if they aren't.
To do this, the first thing we have to do is write a predicate that checks whether every element of a list is unique. nub does not quite do the right thing: it returns a list with all the duplicates taken out. But if we don't care much about performance, we can use it to check:
allUnique ls = length (nub ls) == length ls
With this predicate in hand, we can write a function unique that wraps any parser that produces a list and ensures that list is unique:
unique parser = do res <- parser
                   guard (allUnique res)
                   return res
Again, if guard is given True, it doesn't affect the rest of the parse. But if it's given False, it will cause an error.
Here's how we could use it:
λ> parse (unique collection) "<interactive>" "<\"apple\",\"pear\",\"pineapple\",\"orange\">"
Right ["apple","pear","pineapple","orange"]
λ> parse (unique collection) "<interactive>" "<\"apple\",\"pear\",\"pineapple\",\"orange\",\"apple\">"
Left "<interactive>" (line 1, column 46):unknown parse error
This does what you want. However, there's a problem: there is no error message supplied. That's not very user friendly! Happily, we can fix this using <?>. This is an operator provided by Parsec that lets us set the error message of a parser.
unique parser = do res <- parser
                   guard (allUnique res) <?> "unique elements"
                   return res
Ahhh, much better:
λ> parse (unique collection) "<interactive>" "<\"apple\",\"pear\",\"pineapple\",\"orange\",\"apple\">"
Left "<interactive>" (line 1, column 46):
expecting unique elements
All this works but, again, it's worth noting that it isn't efficient: it parses the whole list before realizing the elements aren't unique, and nub takes quadratic time. However, this works and it's probably more than good enough for parsing small to medium-sized files, i.e. most things written by hand rather than autogenerated.
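If the quadratic nub check ever becomes a problem, here is a sketch of a faster predicate using Data.Set (it needs an Ord instance, which String has):
import qualified Data.Set as Set

-- O(n log n): a list is duplicate-free iff turning it into a set
-- preserves the element count.
allUnique :: Ord a => [a] -> Bool
allUnique ls = Set.size (Set.fromList ls) == length ls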