I'm trying to get a simple grammar to work using ANTLR4. Basically a list of keywords separated by ; that can be negated using Not. Something like this, for example:
Not negative keyword;positive
I wrote the following grammar:
grammar input;
input : clauses;
keyword : NOT? WORD;
clauses : keyword (SEPARATOR clauses)?;
fragment N : ('N'|'n') ;
fragment O : ('O'|'o') ;
fragment T : ('T'|'t') ;
fragment SPACE : ' ' ;
SEPARATOR : ';';
NOT : N O T SPACE;
WORD : ~[;]+;
My issue is that in the keyword rule, WORD seems to have more priority than NOT. Not something is recognized as the Not something word instead of a negated something.
For instance, the parse tree I get is this
What I'm trying to achieve is something like this
How can you give an expression more priority over another on ANTLR4? Any tip on fixing this?
Please note that while this grammar is very simple and ANTLR4 may seem unnecessary here, the real grammar I want to write is more complex; I have simplified it here to demonstrate my issue.
Thank you for your time!
You have no explicit whitespace rule, and your WORD rule includes whitespace. Yet you want words to be separated by whitespace. That cannot work. Don't include whitespace in words (that goes against the usual meaning of a word anyway). Instead, specify exactly what a word is (usually a combination of letters and digits that does not start with a digit). Additionally, I would restructure the grammar so that positive and negative are not part of keyword but separate entities. Here I defined them as keywords of their own, but if that is not what you want, replace them with just WORD:
grammar input;
input : clauses EOF;
keyword : NOT? (POSITIVE | NEGATIVE) WORD?;
clauses : keyword (SEPARATOR keyword)*;
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
fragment F: [fF];
fragment G: [gG];
fragment H: [hH];
fragment I: [iI];
fragment J: [jJ];
fragment K: [kK];
fragment L: [lL];
fragment M: [mM];
fragment N: [nN];
fragment O: [oO];
fragment P: [pP];
fragment Q: [qQ];
fragment R: [rR];
fragment S: [sS];
fragment T: [tT];
fragment U: [uU];
fragment V: [vV];
fragment W: [wW];
fragment X: [xX];
fragment Y: [yY];
fragment Z: [zZ];
SEPARATOR : ';';
NOT : N O T;
POSITIVE : P O S I T I V E;
NEGATIVE : N E G A T I V E;
fragment LETTER: DIGIT | LETTER_NO_DIGIT;
fragment LETTER_NO_DIGIT: [a-zA-Z_$\u0080-\uffff];
WORD: LETTER_NO_DIGIT LETTER*;
WHITESPACE: [ \t\f\r\n] -> channel(HIDDEN);
fragment DIGIT: [0-9];
fragment DIGITS: DIGIT+;
which gives you a parse tree for your input in which Not, negative, and keyword are separate tokens.
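To see why the original WORD rule swallowed Not, it helps to recall that ANTLR's lexer takes the rule that matches the longest stretch of input at the current position and uses rule order only to break ties. The following is a rough Python imitation of that behaviour using regular expressions. It is a sketch under that assumption, not ANTLR itself; the helper name tokenize and the two rule tables are made up for this illustration.

```python
import re

def tokenize(rules, text):
    """Imitate the lexer's choice: longest match wins, earliest rule breaks ties."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # whitespace is hidden from the parser
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Original grammar: WORD is "anything but ';'", so it also eats spaces.
original = [
    ("SEPARATOR", r";"),
    ("NOT", r"(?i:not) "),
    ("WORD", r"[^;]+"),
]

# Revised grammar: words contain no whitespace; whitespace has its own rule.
revised = [
    ("SEPARATOR", r";"),
    ("NOT", r"(?i:not)"),
    ("POSITIVE", r"(?i:positive)"),
    ("NEGATIVE", r"(?i:negative)"),
    ("WORD", r"[A-Za-z_$\u0080-\uffff][0-9A-Za-z_$\u0080-\uffff]*"),
    ("WS", r"[ \t\f\r\n]"),
]

text = "Not negative keyword;positive"
print(tokenize(original, text))
# [('WORD', 'Not negative keyword'), ('SEPARATOR', ';'), ('WORD', 'positive')]
print(tokenize(revised, text))
# [('NOT', 'Not'), ('NEGATIVE', 'negative'), ('WORD', 'keyword'),
#  ('SEPARATOR', ';'), ('POSITIVE', 'positive')]
```

With the original rules, WORD matches 20 characters where NOT matches only 4, so NOT never gets a chance. With the revised rules, NOT and WORD both match exactly Not, and the tie goes to NOT because it is listed first.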
Related
I need to match a token that can be combined from two parts:
"string" + any number; e.g. string64, string128, etc.
In the lexer rules I have
STRING: S T R I N G;
NUMERIC_LITERAL:
((DIGIT+ ('.' DIGIT*)?) | ('.' DIGIT+)) (E [-+]? DIGIT+)?
| '0x' HEX_DIGIT+;
In the parser, I defined
type_id_string: STRING NUMERIC_LITERAL;
However, the parser does not match the input and stops, expecting a STRING token.
How do I tell the parser that token has two parts?
BR
You probably have some "identifier" rule like this:
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
which will cause input like string64 to be tokenized as a single ID token and not as STRING and NUMERIC_LITERAL tokens.
Also trying to match these sort of things in a parser rule like:
type_id_string: STRING NUMERIC_LITERAL;
will go wrong when you're discarding white space in the lexer. If the input were "string 64" (string + space + 64), it could possibly be matched by type_id_string, which is probably not what you want.
Either do:
type_id_string
: ID
;
or define these tokens in the lexer:
type_id_string
: TYPE_ID_STRING
;
// Important to match this before the `ID` rule!
TYPE_ID_STRING
: [a-zA-Z]+ [0-9]+
;
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
However, when doing that, input like fubar1 will also become a TYPE_ID_STRING and not an ID!
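The ordering remark can be checked with a small Python imitation of how the lexer picks a rule: the longest match wins, and a tie goes to the rule listed first. This is only a sketch. The function winning_rule is hypothetical, NUMERIC_LITERAL is reduced to plain digits, and TYPE_ID_STRING is assumed to mean one or more letters followed by one or more digits.

```python
import re

def winning_rule(rules, text):
    """Return (rule name, lexeme) for the first token of `text`."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # A later rule only wins if its match is strictly longer.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

ID = ("ID", r"[a-zA-Z_][a-zA-Z0-9_]*")
STRING = ("STRING", r"(?i:string)")
NUMERIC_LITERAL = ("NUMERIC_LITERAL", r"[0-9]+")  # simplified for the example
TYPE_ID_STRING = ("TYPE_ID_STRING", r"[a-zA-Z]+[0-9]+")

# ID matches all 8 characters, STRING only 6, so ID wins despite coming last.
print(winning_rule([STRING, NUMERIC_LITERAL, ID], "string64"))  # ('ID', 'string64')

# Equal length: whichever rule is listed first wins.
print(winning_rule([TYPE_ID_STRING, ID], "string64"))  # ('TYPE_ID_STRING', 'string64')
print(winning_rule([ID, TYPE_ID_STRING], "string64"))  # ('ID', 'string64')

# The downside mentioned above: any letters-then-digits name is captured too.
print(winning_rule([TYPE_ID_STRING, ID], "fubar1"))    # ('TYPE_ID_STRING', 'fubar1')
```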
I need to implement a parser for this type of logic: [image: the specified grammar]
The S character is the initial character of the grammar; L, T, R, V, K, D, F, and E denote nonterminal characters. The terminal character c corresponds to one of the two scalar types specified in the task. The terminal character t corresponds to one of the data types that can be described in the type section.
I created the following grammar:
grammar Parse;
compileString: S+;
S: TYPE L VAR R;
L: T (SEPARATOR|SEPARATOR L);
R: V (SEPARATOR|SEPARATOR R);
V: [a-zA-Z] ([a-zA-Z]| [0-9]|'_')* DEFINITION (D|C);
T: D|C;
TYPE:'type';
VAR:'var';
D: // acceptable data types
'struct'
| 'union'
| 'array'
;
C: 'byte'
|'word' //scalar type
;
SEPARATOR:';';
DEFINITION :':';
WS : [ \t\n\r]+ -> skip ; // whitespaces
But when I try to execute it for the construction: "type byte; var p1:word;", I get the following output:
Tokens:
[#0,0:3='type',<6>,1:0]
[#1,5:9='byte;',<2>,1:5]
[#2,11:13='var',<7>,1:11]
[#3,15:22='p1:word;',<3>,1:15]
[#4,23:22='<EOF>',<-1>,1:23]
Parse Tree:
compileString (
<Error>"type"
<Error>"byte;"
<Error>"var"
<Error>"p1:word;"
)
I do not understand what the problem might be. Debugging was done in VS Code with the ANTLR plugin. I would be grateful for any answer!
In ANTLR lexer rules start with capital letters and parser rules with lower case letters. So all of your rules except compileString are lexer rules.
S: TYPE L VAR R; does not match the input type byte; var p1:word; because there are spaces in it and nothing in the definition of S matches spaces. You're probably thinking that shouldn't matter because you're skipping spaces, but tokens are only skipped between lexer rules not inside of them. So it would work if S were a parser rule, but not as a lexer rule.
The same applies to spaces between the separator and L/R in L and R.
PS: I strongly suggest giving your rules longer names, as the grammar is quite hard to follow. You might also consider using the + operator in L and R instead of recursion.
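The split described above (small tokens in the lexer, structure in parser rules, whitespace dropped between tokens) can be sketched in Python as follows. This is an illustration only, not generated ANTLR code: the names lex and parse, the token names DATA_TYPE and SCALAR, and the shape of the result are assumptions made for the example. Loops stand in for the + operator.

```python
import re

# "Lexer rules": small tokens only. Keywords come before ID, and \b keeps
# a keyword from matching the start of a longer identifier.
TOKEN_SPEC = [
    ("TYPE", r"type\b"),
    ("VAR", r"var\b"),
    ("DATA_TYPE", r"(?:struct|union|array)\b"),
    ("SCALAR", r"(?:byte|word)\b"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9_]*"),
    ("SEPARATOR", r";"),
    ("DEFINITION", r":"),
    ("WS", r"[ \t\n\r]+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        if m.lastgroup != "WS":  # whitespace is dropped between tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(tokens):
    """(TYPE (typeName SEPARATOR)+ VAR (varDecl SEPARATOR)+)+ EOF"""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def expect(*kinds):
        nonlocal pos
        if peek() not in kinds:
            found = tokens[pos] if pos < len(tokens) else "<EOF>"
            raise SyntaxError(f"expected {' or '.join(kinds)}, found {found}")
        pos += 1
        return tokens[pos - 1][1]

    sections = []
    while True:
        expect("TYPE")
        types = []
        while True:  # (typeName SEPARATOR)+
            types.append(expect("DATA_TYPE", "SCALAR"))
            expect("SEPARATOR")
            if peek() not in ("DATA_TYPE", "SCALAR"):
                break
        expect("VAR")
        variables = []
        while True:  # (ID DEFINITION typeName SEPARATOR)+
            name = expect("ID")
            expect("DEFINITION")
            variables.append((name, expect("DATA_TYPE", "SCALAR")))
            expect("SEPARATOR")
            if peek() != "ID":
                break
        sections.append((types, variables))
        if peek() != "TYPE":
            break
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]}")
    return sections

print(parse(lex("type byte; var p1:word;")))
# [(['byte'], [('p1', 'word')])]
```

Because the spaces never reach parse, the input from the question is accepted even though no rule mentions whitespace explicitly.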
I have a (probably) quite simple question. I am writing an ANTLR grammar to evaluate dice expressions. With it, I would like to parse something like 4d6 as well as 4d6d2. The first means "roll 4 six-sided dice" and the second means "roll 4 six-sided dice and drop the 2 lowest". My current grammar is:
grammar Dice;
start : dice ;
dice : NUMBER? DSEPERATOR NUMBER ( KEEP | DROP NUMBER )?;
KEEP : 'k';
DROP : 'd';
DSEPERATOR : [dD];
NUMBER : [0-9]+;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
I seem to be getting a problem with the definitions of DROP and DSEPERATOR, as they both use the letter d. The parser stops after the first NUMBER in the dice expression. What is the workaround here? What do I have to change in my grammar?
The implicit structure being parsed is
e - number of executions
d - type of object
s - number of object sides
o - op to apply*
a - number of op applications*
So, the match rule is:
decode : e d s ( o a )? ;
Consolidating numbers and expanding:
decode : e=NUM type s=NUM ( op a=NUM )? ; // label for convenience
type : dice | .... ;
op : drop | .... ;
dice : D ;
drop : D ;
NUM : [0-9]+ ;
D : 'd' ;
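The idea in the rules above is that the lexer emits a single D token for every d and the parser decides from the position whether it is the dice separator or the drop operator. (In the original grammar both DROP and DSEPERATOR match a lone d, and since ANTLR resolves such a tie in favour of the rule listed first, a lowercase d always becomes DROP.) A minimal Python sketch of the positional decoding follows; the names lex and decode and the result keys are hypothetical, and only the drop operator is handled.

```python
import re

# One NUM rule and one D rule; optional whitespace is skipped before each token.
TOKEN = re.compile(r"\s*(?:(?P<NUM>[0-9]+)|(?P<D>d))")

def lex(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def decode(text):
    """decode : e=NUM D s=NUM ( D a=NUM )? ;"""
    tokens = lex(text)
    kinds = [kind for kind, _ in tokens]
    if kinds == ["NUM", "D", "NUM"]:
        drop = 0
    elif kinds == ["NUM", "D", "NUM", "D", "NUM"]:
        # The second D is the drop operator purely because of where it appears.
        drop = int(tokens[4][1])
    else:
        raise SyntaxError(f"not a dice expression: {text!r}")
    return {"rolls": int(tokens[0][1]), "sides": int(tokens[2][1]), "drop": drop}

print(decode("4d6"))    # {'rolls': 4, 'sides': 6, 'drop': 0}
print(decode("4d6d2"))  # {'rolls': 4, 'sides': 6, 'drop': 2}
```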
I am trying to write a grammar that will match the finite closure pattern for regular expressions (i.e. foo{1,3} matches 1 to 3 appearances of 'o' after the 'fo' prefix).
To be identified as a finite closure, the string {x,y} must not include spaces. For example, { 1, 3} is recognized as a sequence of seven characters.
I have written the following lexer and parser files, but I am not sure if this is the best solution. I am using a lexical mode for the closure pattern, which is activated when a regular expression matches a valid closure expression.
lexer grammar closure_lexer;
@header { using System;
using System.IO; }
@lexer::members{
public static bool guard = true;
public static int LBindex = 0;
}
OTHER : .;
NL : '\r'? '\n' ;
CLOSURE_FLAG : {guard}? {LBindex =InputStream.Index; }
'{' INTEGER ( ',' INTEGER? )? '}'
{ closure_lexer.guard = false;
// Go back to the opening brace
InputStream.Seek(LBindex);
Console.WriteLine("Enter Closure Mode");
Mode(CLOSURE);
} -> skip
;
mode CLOSURE;
LB : '{';
RB : '}' { closure_lexer.guard = true;
Mode(0); Console.WriteLine("Enter Default Mode"); };
COMMA : ',' ;
NUMBER : INTEGER ;
fragment INTEGER : [1-9][0-9]*;
and the parser grammar
parser grammar closure_parser;
@header { using System;
using System.IO; }
options { tokenVocab = closure_lexer; }
compileUnit
: ( other {Console.WriteLine("OTHER: {0}",$other.text);} |
closure {Console.WriteLine("CLOSURE: {0}",$closure.text);} )+
;
other : ( OTHER | NL )+;
closure : LB NUMBER (COMMA NUMBER?)? RB;
Is there a better way to handle this situation?
Thanks in advance
This looks quite complex for such a simple task. You can easily let your lexer match one construct (preferably that without whitespaces, if you usually skip them) and the parser matches the other form. You don't even need lexer modes for that.
Define your closure rule:
CLOSURE
: OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY
;
This rule will not match any form that contains e.g. whitespaces. So, if your lexer does not match CLOSURE you will get all the individual tokens like the curly braces and integers ending up in your parser for matching (where you then can treat them as something different).
NB: doesn't the closure definition also allow {,n} (in the regex flavors that support it, the same as {0,n})? That requires an additional alt in the CLOSURE rule.
And finally a hint: your OTHER rule will probably give you trouble, as it matches any char and is even located before other rules. If you have a wildcard rule, it should be the last in your grammar, matching everything not matched by any other rule.
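The single-token approach can be tried out with a short Python sketch in which one regular expression plays the role of the CLOSURE lexer rule and a catch-all comes last. It is an approximation of what the lexer would do, not ANTLR output; INTEGER is taken from the question's fragment, and the names RULES, MASTER and lex are invented for the example.

```python
import re

INTEGER = r"[1-9][0-9]*"
# '{' INTEGER (',' INTEGER?)? '}' with no room for whitespace anywhere.
CLOSURE = r"\{" + INTEGER + r"(?:,(?:" + INTEGER + r")?)?\}"

# Specific rules first, the wildcard last.
RULES = [("CLOSURE", CLOSURE), ("NL", r"\r?\n"), ("OTHER", r"[\s\S]")]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))

def lex(text):
    """Split text into (rule name, lexeme) pairs."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

print(lex("foo{1,3}"))
# [('OTHER', 'f'), ('OTHER', 'o'), ('OTHER', 'o'), ('CLOSURE', '{1,3}')]

print(lex("{ 1, 3}"))
# seven single-character OTHER tokens, because of the embedded spaces
```

No mode switching or stream rewinding is needed here: a well-formed closure becomes one token, and anything else falls through to the individual characters.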
I'm trying to write an ANTLR4 grammar for a fortran-inspired DSL. I'm having difficulty with the 'ole classic ".op." operators:
if (1.and.1) then
where both "1"s should be interpreted as integers. I looked at the OpenFortranParser for insight, but I can't make sense of it.
Initially, I had suitable definitions for INTEGER and REAL in my lexer. Consequently, the first "1" above always parsed as a REAL, no matter what I tried. I tried moving things into the parser, and got it to the point where I could reliably recognize the ".and." along with numbers around it as appropriately INTEGER or REAL.
if (1.and.1) # INT/INT
if (1..and..1) # REAL/REAL
...etc...
I of course want to recognize variable-names in such statements:
if (a.and.b)
and have an appropriate rule for ID. In the small grammar below, however, any literals in quotes (ex, 'and', 'if', all the single-character numerical suffixes) are not accepted as an ID, and I get an error; any other ID-conforming string is accepted:
if (a.and.b) # errs, as 'b' is valid INTEGER suffix
if (a.and.c) # OK
Any insights into this behavior, or better suggestions on how to parse the .op. operators in fortran would be greatly appreciated -- Thanks!
grammar Foo;
start : ('if' expr | ID)+ ;
DOT : '.' ;
DIGITS: [0-9]+;
ID : [a-zA-Z0-9][a-zA-Z0-9_]* ;
andOp : DOT 'and' DOT ;
SIGN : [+-];
expr
: ID
| expr andOp expr
| numeric
| '(' expr ')'
;
integer : DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
real
: DIGITS DOT DIGITS? (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
| DOT DIGITS (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
;
numeric : integer | real;
EOLN : '\r'? '\n' -> skip;
WS : [ \t]+ -> skip;
To disambiguate DOT, add a lexer rule with a predicate just before the DOT rule.
DIT : DOT { isDIT() }? ;
DOT : '.' ;
Change the 'andOp'
andOp : DIT 'and' DIT ;
Then add a predicate method
@lexer::members {
public boolean isDIT() {
int offset = _tokenStartCharIndex;
String r = _input.getText(Interval.of(offset-4, offset));
String s = _input.getText(Interval.of(offset, offset+4));
if (".and.".equals(s) || ".and.".equals(r)) {
return true;
}
return false;
}
}
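The look-around logic of the predicate is easier to check in isolation. Below is a Python port of the same test, offered as a sketch: is_dit and classify_dots are hypothetical names, and the max(...) guard is an addition for dots that sit fewer than four characters from the start of the input, where the Java version may need a similar bounds check.

```python
def is_dit(text, offset):
    """Return True if the '.' at `offset` opens or closes a '.and.' operator."""
    opening = text[offset:offset + 5]              # dot is the first char of ".and."
    closing = text[max(offset - 4, 0):offset + 1]  # dot is the last char of ".and."
    return opening == ".and." or closing == ".and."

def classify_dots(text):
    """Label every '.' in text as DIT (operator dot) or DOT (plain dot)."""
    return [(i, "DIT" if is_dit(text, i) else "DOT")
            for i, ch in enumerate(text) if ch == "."]

print(classify_dots("1.and.1"))
# [(1, 'DIT'), (5, 'DIT')]  -> INT/INT

print(classify_dots("1..and..1"))
# [(1, 'DOT'), (2, 'DIT'), (6, 'DIT'), (7, 'DOT')]  -> REAL/REAL
```

In the second input the outer dots stay plain DOT tokens, so they remain available to the real rule, while the inner pair is claimed by the operator.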
But, that is not really the source of your current problem. The integer parser rule defines lexer constants effectively outside of the lexer, which is why 'b' is not matched to an ID.
Change it to
integer : INT ;
INT: DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
and the lexer will figure out the rest.