Pushing back tokens in Happy and Alex - haskell

I'm parsing a language which has both < and <<. In my Alex definition I've got something that contains something like
tokens :-
"<" { token Lt }
"<<" { token (BinOp Shl) }
so whenever I encounter <<, it gets tokenized as a left shift and not as two less-thans. This is generally a good thing, since I end up throwing out whitespace after tokenization and want to differentiate between 1 < < 2 and 1 << 2. However, there are other times I wish << had been read as two <. For example, I have things like
<<A>::B>
which I want read like
< < A > :: B >
Obviously I can try to adjust my Happy parser rules to accommodate the extra cases, but that scales badly. In other, imperative parser generators, I might try to push back "part" of the token (something like push_back("<") when I encountered << but only needed <).
Has anyone else had such a problem and, if so, how did you deal with it? Are there ways of "pushing back" tokens in Happy? Or should I instead try to keep a whitespace token around? (I'm actually leaning towards that last alternative - although a huge headache, it would let me deal with << by just making sure there is no whitespace between the two <.)

I don’t know how to express this in Happy, but you don’t need a separate “whitespace” token. You can parse < or > as a distinct “angle bracket” token when immediately followed by an operator symbol in the input, with no intervening whitespace.
Then, when you want to parse an operator, you join a sequence of angles and operators into a single token. When you want to treat them as brackets, you just deal with them separately as usual.
So a << b would be tokenised as:
identifier "a"
left angle -- joined with following operator
operator "<"
identifier "b"
When parsing an operator, you concatenate angle tokens with the following operator token, producing a single operator "<<" token.
<<A>::B> would be tokenised as:
left angle
operator "<" -- accepted as bracket
identifier "A"
right angle
operator "::"
identifier "B"
operator ">" -- accepted as bracket
When parsing angle-bracketed terms, you accept both angle tokens and </> operators.
This relies on your grammar not being ambiguous wrt. whether you should parse an operator name or a bracketed thing.
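The joining step described above can be sketched with a toy token type. Everything here (the Token type, the constructor names, joinOperator) is an illustrative stand-in, not actual Alex/Happy output:

```haskell
-- Toy sketch of the joining step; Token and joinOperator are
-- hypothetical names, not part of Alex or Happy.
data Token
  = Ident String
  | Angle Char        -- '<' or '>' immediately followed by an operator char
  | Operator String
  deriving (Eq, Show)

-- When the grammar expects an operator, fuse a (possibly empty) run of
-- angle tokens with the operator token that follows them.
joinOperator :: [Token] -> Maybe (Token, [Token])
joinOperator ts = case span isAngle ts of
  (angles, Operator op : rest) ->
    Just (Operator (map angleChar angles ++ op), rest)
  _ -> Nothing
  where
    isAngle (Angle _)   = True
    isAngle _           = False
    angleChar (Angle c) = c
    angleChar t         = error ("not an angle: " ++ show t)
```

With this, a << b lexed as [Ident "a", Angle '<', Operator "<", Ident "b"] yields the single operator "<<" when an operator is expected, while a bracket-expecting rule can consume the Angle token on its own.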

While I initially went with @Jon's answer, I ended up running into a variety of precedence-related issues (think precedence around expr < expr vs. expr << expr) which caused me a lot of headaches. I recently (and successfully) switched back to lexing << as one token. The solution was twofold:
I bit the bullet and added extra rules for << (where previously I only had rules for <). For the example in the question (<<A>::B>) my rule went from something like
ty_qual_path
    : '<' ty_sum '>' '::' ident
to
ty_qual_path
    : '<' ty_sum '>' '::' ident
    | '<<' ty_sum '>' '::' ident '>' '::' ident
(The actual rule was a bit more involved, but that is not for this answer.)
I found a clever way to deal with tokens that start with > (these would cause problems around things like vector<i32,vector<i32>> where the last >> is lexed as one token): use a threaded lexer (section 2.5.2), exploit the {%% ... } RHS of rules, which lets you reconsider the lookahead token, and add a pushToken facility to my parser monad (this turned out to be quite simple - here is exactly what I did). I then added a dummy rule - something like
gt :: { () }
    : {- empty -}   {%% \tok ->
        case tok of
          Tok ">>"  -> pushToken (Tok ">")  *> pushToken (Tok ">")
          Tok ">="  -> pushToken (Tok "=")  *> pushToken (Tok ">")
          Tok ">>=" -> pushToken (Tok ">=") *> pushToken (Tok ">")
          _         -> pushToken tok
      }
And every time in some other rule I expected a > but there could also be any other token starting with >, I would precede the > token with gt. This has the effect of looking ahead to the next token, which may start with > without being >, and trying to convert that token into one > token and another token for the "rest" of the initial token.
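The pushback mechanism itself can be modelled as a plain stack sitting in front of the lexer. This is only a sketch under assumed types (Tok, PState, and splitGt are stand-ins, not the author's actual parser monad). Note the push order: the "rest" of the split token is pushed first, so the leading > ends up on top and is served to the parser first, mirroring the gt rule above:

```haskell
-- Hypothetical sketch of a token-pushback stack for a threaded lexer.
newtype Tok = Tok String deriving (Eq, Show)

data PState = PState
  { pushed   :: [Tok]   -- tokens pushed back, top of stack first
  , upcoming :: [Tok]   -- what the real lexer would produce next
  }

pushToken :: Tok -> PState -> PState
pushToken t s = s { pushed = t : pushed s }

-- Pushed-back tokens are served before the lexer is consulted.
nextToken :: PState -> (Tok, PState)
nextToken s = case pushed s of
  t : ts -> (t, s { pushed = ts })
  []     -> case upcoming s of
    t : ts -> (t, s { upcoming = ts })
    []     -> (Tok "<eof>", s)

-- The gt trick as a function: split a token starting with '>' into '>'
-- plus a remainder; the remainder is pushed first, '>' ends up on top.
splitGt :: Tok -> PState -> PState
splitGt (Tok ">>")  s = pushToken (Tok ">") (pushToken (Tok ">")  s)
splitGt (Tok ">=")  s = pushToken (Tok ">") (pushToken (Tok "=")  s)
splitGt (Tok ">>=") s = pushToken (Tok ">") (pushToken (Tok ">=") s)
splitGt t           s = pushToken t s
```

After splitGt on a ">=" lookahead, the parser sees ">" first and "=" next, which is exactly what lets a vector<i32,vector<i32>> close with two separate > tokens.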

Related

Recognizing euler's constant (e) only when relevant

I'm learning ANTLR4 to write a parser for a simple language specific to the app developed by the company. So far I've managed to get arithmetic operations, logic operations, and conditional branches working. When tackling variables, though, I ran into a problem. The language defines multiple mathematical constants, such as 'e'. When parsing variables, the parser would recognize the letter e as the constant and not as part of the variable.
Below is a small test grammar I wrote to test this specific case; the euler and letter parser rules are there for visual clarity in the trees below.
grammar Test;
r: str '\r\n' EOF;
str: euler | (letter)* ;
euler: EULER;
letter: LETTER;
EULER: 'e';
LETTER: [a-zA-Z];
Recognition of different strings with this grammar:
"e"
"test"
"qsdf"
"eee"
I thought maybe parser rule precedence had something to do with it, but whatever order the parser rules are in, the output is the same. Swapping the lexer rules allows for correct recognition of "test", but recognizes "e" using the letter rule and not the euler rule. I also thought about defining EULER as:
EULER: ~[a-zA-Z] 'e' ~[a-zA-Z]
but this wouldn't recognize var a=e correctly. Another rule I have in my lexer is ELSE: 'else', which recognizes the 'else' keyword; it works and doesn't conflict with rule EULER. This is because ANTLR recognizes the longest input possible. But then why doesn't it recognize "test" as (r (str (letter t) (letter e) (letter s) (letter t)) \r\n <EOF>), as it does for "qsdf"?
You should not have a lexer rule like LETTER that matches a single letter and then "glue" these letters together in a parser rule. Instead, match a variable (consisting of multiple letters) as a single lexer rule:
EULER: 'e';
VARIABLE: [a-zA-Z]+;
I suggest changing your grammar to this:
grammar Test;
r: str '\n' EOF;
str: euler | WORD ;
euler: EULER;
EULER: 'e';
WORD: [a-zA-Z]+;
It appears you wanted a stand-alone "e" to be an euler element, and any other word to be a letter element, but that's not what you coded. Your grammar is doing exactly what you told it to do: Match every "e" as an EULER token (and therefore an euler element), and any other letter as a LETTER token (and therefore a letter element), and build strs out of those two tokens.
An ANTLR4 lexer tokenizes the input stream, trying to build the longest tokens possible, and processing the tokenization rules in the order you code them. Thus EULER will capture every "e", and LETTER will capture "a"-"d", "f"-"z", and "A"-"Z". An ANTLR4 parser maps the stream of tokens (from the lexer) into elements based on the order of tokens and the rules you code. Since the parser will never get a LETTER token for "e", your str elements will always get chopped apart at the "e"s.
The fix for this is to code a lexer rule that collects sequences of letters that aren't stand-alone "e"s into a LETTER token (or, as @Pavel Ganelin says, a WORD), and to present that to the parser instead of the individual letters. It's a little more complicated than that, though, because you probably want "easy" to be the WORD "easy", not an EULER ("e") followed by the WORD "asy". So, you need to ensure that the "e" starting a string of letters isn't captured as an EULER token. You do that by ensuring that the WORD lexer rule comes before the EULER rule, and that it ignores stand-alone "e"s:
grammar Test;
r: str '\r\n' EOF;
str: euler | word ;
euler: EULER;
word: WORD;
WORD: [a-zA-Z] [a-zA-Z]+ | [a-df-zA-Z]; // two or more letters, or a single letter other than 'e'
EULER: 'e';

ANTLR4 lexer rule ensuring expression does not end with character

I have a syntax where I need to match given the following example:
some-Text->more-Text
From this example, I need ANTLR4 lexer rules that match 'some-Text' and 'more-Text' with one lexer rule, and the '->' with another rule.
I am using the lexer rules shown below as my starting point, but the trouble is, the '-' character is allowed in the NAMEDELEMENT rule, which causes the first NAMEDELEMENT match to become 'some-Text-', which then causes the '->' to not be captured by the EDGE rule.
I'm looking for a way to ensure that the '-' is not captured as the last character in the NAMEDELEMENT rule (or some other alternative that produces the desired result).
EDGE
    : '->'
    ;

NAMEDELEMENT
    : ('a'..'z'|'A'..'Z'|'_'|'#') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')* { _input.LA(1) != '-' && _input.LA(2) != '>' }?
    ;
I'm trying to use the predicate above to look ahead for a sequence of '-' and '>', but it doesn't seem to work. It doesn't seem to do anything at all, actually, as I get the same parsing results both with and without the predicate.
The parser rules are as follows, where I am matching on 'selector' rules:
selector
    : namedelement (edge namedelement)*
    ;

edge
    : EDGE
    ;

namedelement
    : NAMEDELEMENT
    ;
Thanks in advance!
After messing around with this for hours, I have a syntax that works, though I fail to see how it is functionally different from what I posted in the original question.
(I use the uncommented version so that I can put a break point in the generated lexer to ensure that the equality test is evaluating correctly.)
NAMEDELEMENT
    //: [a-zA-Z_#] [a-zA-Z_-]* { String.fromCharCode(this._input.LA(1)) != ">" }?
    : [a-zA-Z_#] [a-zA-Z_-]* { (function(a){
          var c = String.fromCharCode(a._input.LA(1));
          return c != ">";
      })(this)
      }?
    ;
My target language is JavaScript and both the commented and uncommented forms of the predicate work fine.
Try this:
NAMEDELEMENT
    : [a-zA-Z_#] ( '-' {_input.LA(1) != '>'}? | [a-zA-Z0-9_] )*
    ;
Not sure if _input.LA(1) != '>' is OK with the JavaScript runtime, but in Java it properly tokenises "some-->more" into "some-", "->" and "more".

Token collision (??) writing ANTLR4 grammar

I have what I thought a very simple grammar to write:
I want it to allow tokens called facts. These tokens can start with a letter and then contain any mix of letters, digits, % or _
I want to concatenate two facts with a . but the second fact does not have to start with a letter (a digit, % or _ is also valid from the second token on)
Any "subfact" (even the initial one) in the whole fact can be "instantiated" like an array (you will get it by reading my examples)
For example:
Foo
Foo%
Foo_12%
Foo.Bar
Foo.%Bar
Foo.4_Bar
Foo[42]
Foo['instance'].Bar
etc
I tried to write such a grammar but I can't get it working:
grammar Common;
/*
* Parser Rules
*/
fact: INITIALFACT instance? ('.' SUBFACT instance?)*;
instance: '[' (LITERAL | NUMERIC) (',' (LITERAL | NUMERIC))* ']';
/*
* Lexer Rules
*/
INITIALFACT: [a-zA-Z][a-zA-Z0-9%_]*;
SUBFACT: [a-zA-Z%_]+;
ASSIGN: ':=';
LITERAL: ('\'' .*? '\'') | ('"' .*? '"');
NUMERIC: ([1-9][0-9]*)?[0-9]('.'[0-9]+)?;
WS: [ \t\r\n]+ -> skip;
For example, if I tried to parse Foo.Bar, I get: Syntax error line 1 position 4: mismatched input 'Bar' expecting SUBFACT.
I think this is because ANTLR first finds that Bar matches INITIALFACT and stops there. How can I fix this?
If it is relevant, I am using Antlr4cs.

Parsing block comments with Megaparsec using symbols for start and end

I want to parse text similar to this in Haskell using Megaparsec.
# START SKIP
def foo(a,b):
    c = 2*a # Foo
    return a + b
# END SKIP
where # START SKIP and # END SKIP mark the start and end of the block of text to parse.
Compared to skipBlockComment, I want the parser to return the lines between the start and end markers.
This is my parser.
skip :: Parser String
skip = s >> manyTill anyChar e
  where
    s = string "# START SKIP"
    e = string "# END SKIP"
The skip parser works as intended.
To allow for a variable amount of white space within the start and end markers (for example # START SKIP), I've tried the following:
skip' :: Parser String
skip' = s >> manyTill anyChar e
  where
    s = symbol "#" >> symbol "START" >> symbol "SKIP"
    e = symbol "#" >> symbol "END" >> symbol "SKIP"
Using skip' to parse the above text gives the following error.
3:15:
unexpected 'F'
expecting "END", space, or tab
I would like to understand the cause of this error and how I can fix it.
As Alec already commented, the problem is that as soon as e encounters '#', that character counts as consumed. And the way parsec and its derivatives work is that as soon as you've consumed any characters, you're committed to that parsing branch – i.e. the manyTill anyChar alternative is not considered anymore, even though e ultimately fails here.
You can easily request backtracking though, by wrapping the end delimiter in try:
skip' :: Parser String
skip' = s >> manyTill anyChar e
  where
    s = symbol "#" >> symbol "START" >> symbol "SKIP"
    e = try $ symbol "#" >> symbol "END" >> symbol "SKIP"
This will set a “checkpoint” before consuming '#', and when e fails later on (in your example, at "Foo"), it will act as if no characters had matched at all.
In fact, traditional parsec would give the same behaviour for skip as well. It's just that looking for a fixed string, and succeeding only if it matches entirely, is such a common task that megaparsec's string is implemented like try . string, i.e. if the failure occurs within that fixed string then it will always backtrack.
However, compound parsers still don't backtrack by default, like they do in attoparsec. The main reason is that if anything can backtrack to any point, you can't really get a clear point of failure to show in the error message.
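The commit-on-consumption rule can be modelled without any library. In the toy parser below (Parser, item, andThen, orElse, and try' are illustrative stand-ins for the real combinators, not megaparsec's API), a failure carries a flag saying whether input was consumed, and that flag is exactly what decides whether an alternative is tried:

```haskell
-- run gives Right (value, rest) on success; on failure, Left True if input
-- was consumed before failing (committed), Left False if nothing was consumed.
newtype Parser a = Parser { run :: String -> Either Bool (a, String) }

item :: Char -> Parser Char
item c = Parser $ \s -> case s of
  (x:xs) | x == c -> Right (c, xs)
  _               -> Left False            -- failed without consuming

andThen :: Parser a -> Parser b -> Parser (a, b)
andThen p q = Parser $ \s -> case run p s of
  Left e        -> Left e
  Right (a, s') -> case run q s' of
    Left e         -> Left (e || length s' < length s)  -- p already consumed
    Right (b, s'') -> Right ((a, b), s'')

-- Committed choice: the alternative is tried only if p consumed nothing.
orElse :: Parser a -> Parser a -> Parser a
orElse p q = Parser $ \s -> case run p s of
  Left False -> run q s
  other      -> other

-- try: on failure, pretend no input was consumed, enabling backtracking.
try' :: Parser a -> Parser a
try' p = Parser $ \s -> case run p s of
  Left _ -> Left False
  ok     -> ok
```

On input "ac", (item 'a' `andThen` item 'b') `orElse` (item 'a' `andThen` item 'c') fails committed (the first branch consumed 'a'), whereas wrapping the first branch in try' lets the second branch succeed.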

Lexical analysis of string token using Parsec

I have this parser for string parsing using the Haskell Parsec library.
myStringLiteral = lexeme (
    do { str <- between (char '\'')
                        (char '\'' <?> "end of string")
                        (many stringChar)
       ; return (U.replace "''" "'" (foldr (maybe id (:)) "" str))
       }
    <?> "literal string"
  )
Strings in my language are defined as alpha-num characters inside '' (example: 'this is my string'), but these strings can also contain ' inside (in which case the ' must be escaped by another ', e.g. 'this is my string with '' inside of it').
What I need to do is look ahead when ' appears during string parsing and decide whether there is another ' after it or not (if not, it is the end of the string). But I don't know how to do it. Any ideas? Thanks!
If the syntax is as simple as it seems, you can make a special case for the escaped single quote,
escapeOrStringChar :: Parser Char
escapeOrStringChar = try (string "''" >> return '\'') <|> stringChar
and use that in
myStringLiteral = lexeme $ do
    char '\''
    str <- many escapeOrStringChar
    char '\'' <?> "end of string"
    return str
You can use stringLiteral for that.
Parsec deals only with LL(1) languages (details). That means the parser can look at only one symbol at a time. Your language is LL(2). You can write your own FSM for parsing your language, or you can transform the text before parsing to make it LL(1).
In fact, Parsec is designed for syntactic analysis, not lexical analysis. A good approach is to do the lexical analysis with another tool and then use Parsec to parse the sequence of lexemes instead of the sequence of chars.
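One possible shape for the "transform the text before parsing" idea: rewrite the two-character escape '' to a placeholder before parsing and restore it afterwards. This is a hypothetical sketch, not Parsec API; it assumes the input never contains a NUL character, and note it would also rewrite two abutting string delimiters:

```haskell
-- Hypothetical pre-pass: encode the escape '' as NUL so the string
-- grammar becomes LL(1); decode after parsing.  Assumes '\0' never
-- occurs in real input.
encodeQuotes :: String -> String
encodeQuotes ('\'' : '\'' : rest) = '\0' : encodeQuotes rest
encodeQuotes (c : rest)           = c : encodeQuotes rest
encodeQuotes []                   = []

decodeQuotes :: String -> String
decodeQuotes = map (\c -> if c == '\0' then '\'' else c)
```

After encodeQuotes, a single ' unambiguously ends the string, so no lookahead beyond one character is needed.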
