I have the following grammar (minimized for SO):
grammar Hello;
odataIdentifier : identifierLeadingCharacter identifierCharacter*;
identifierLeadingCharacter : Alpha| UNDERSCORE;
identifierCharacter : identifierLeadingCharacter | Digit;
identifierUnreserved : identifierCharacter | (MINUS | DOT | TILDE);
Digit : ZERO_TO_FIVE | [6-9];
ONEHUNDRED_TO_ONEHUNDREDNINETYNINE : '1' Digit Digit; // 100-199
TWOHUNDRED_TO_TWOHUNDREDFOURTYNINE : '2' ZERO_TO_FOUR Digit; // 200-249
TWOHUNDREDFIFTY_TO_TWOHUNDREDFIFTYFIVE : '25' ZERO_TO_FIVE; // 250-255
TEN_TO_NINETYNINE : ONE_TO_NINE Digit; // 10-99
ZERO_TO_ONE : [0-1];
ZERO_TO_TWO : ZERO_TO_ONE | [2];
ZERO_TO_THREE : ZERO_TO_TWO | [3];
ZERO_TO_FOUR : ZERO_TO_THREE | [4];
ZERO_TO_FIVE : ZERO_TO_FOUR | [5];
ONE_TO_TWO : [1-2];
ONE_TO_THREE : ONE_TO_TWO | [3];
ONE_TO_FOUR : ONE_TO_THREE | [4];
ONE_TO_NINE : ONE_TO_FOUR | [5-9];
Alpha : [a-zA-Z];
MINUS : [-];
DOT : '.';
UNDERSCORE : '_';
TILDE : '~';
WS : (' '|'\r'|'\t'|'\u000C'|'\n') -> skip
;
For the input c9 it works fine, but when I have two digits, for example c10, it says:
extraneous input '92' expecting {<EOF>, Digit, Alpha, '_'}
So I guess it parses the 9 and then the 2 and doesn't know whether this should be one TEN_TO_NINETYNINE token or two separate Digit tokens.
I am new to this, so I am wondering whether my analysis is right and how I could fix this.
Your input is resulting in an Alpha token followed by a TEN_TO_NINETYNINE token. While the parser rule identifierLeadingCharacter does allow the Alpha token, the identifierCharacter rule cannot match a TEN_TO_NINETYNINE token.
The input 10 will always produce a TEN_TO_NINETYNINE token rather than two Digit tokens, because the former matches more of the input and lexer rules are greedy.
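If you want to see this for yourself, a quick token dump makes the lexer's decisions visible. Below is a minimal sketch; it assumes the HelloLexer class that ANTLR generates from the grammar above and a reasonably recent ANTLR 4 runtime on the classpath:
import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) {
        // Lex the problematic input and print every token with its type name.
        HelloLexer lexer = new HelloLexer(new ANTLRInputStream("c10"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            System.out.println("'" + t.getText() + "' -> "
                    + HelloLexer.VOCABULARY.getDisplayName(t.getType()));
        }
    }
}
For c10 you should see 'c' reported as Alpha and '10' as a single TEN_TO_NINETYNINE token, which the identifierCharacter rule has no alternative for.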
First of all, thanks a lot for your time.
Practicing a little bit more with antlr4, I made this grammar (below).
Input
The tested input is the following:
text to search query_on:fielda,fieldab fielda:"123" sort_by:+fielda,-fieldabc
This produces the following output, which starts to go wrong at the varname inside the query_on rule:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda,fieldab fielda)))) : "123" sort_by : + fielda, - fieldabc\n)
If I instead surround the commas in the input with spaces:
text to search query_on:fielda , fieldab fielda:"123" sort_by:+fielda , -fieldabc
The output is much closer to my expected output:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda) , (varname fieldab)) (filters (binary_op (varname fielda) : (value "123"))) (sorting_fields sort_by : (sorting_field (sorting_order (asc +)) (varname fielda)) , (sorting_field (sorting_order (desc -)) (varname fieldabc\n))))) <EOF>)
The only failing part is the last \n.
Expected
The expected result is the same as before, but accepting the varname fieldabc and skipping the \n.
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda) , (varname fieldab)) (filters (binary_op (varname fielda) : (value "123"))) (sorting_fields sort_by : (sorting_field (sorting_order (asc +)) (varname fielda)) , (sorting_field (sorting_order (desc -)) (varname fieldabc))))))
Questions
Therefore:
Why is the grammar sensitive to the spaces around a comma?
Similarly, why is the \n character not skipped at the end?
Thanks!
GRAMMAR
grammar SearchEngine;
// Grammar
start: query EOF;
query
: '(' query+ ')'
| query (OR query)+
| expr
;
expr: text_query query_on? filters* sorting_fields?;
text_query: STRING+;
query_on: QUERY_ON ':' varname (',' varname)*;
filters: binary_op+;
binary_op: varname ':' value;
sorting_fields: SORT_BY ':' sorting_field (',' sorting_field)*;
sorting_field: sorting_order varname;
sorting_order: (asc|desc);
asc: '+';
desc: '-';
varname
: FIELDA
| FIELDAB
| FIELDABC
;
value: STRING;
// Lexer rules (tokens)
WHITE_SPACE: [ \t\r\n] -> skip;
OR: O R;
QUERY_ON: Q U E R Y '_' O N;
SORT_BY: S O R T '_' B Y;
FIELDA: F I E L D A;
FIELDAB: F I E L D A B;
FIELDABC: F I E L D A B C;
STRING: ~[ :()+-]+;
// Fragments (not tokens)
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
fragment F: [fF];
fragment G: [gG];
fragment H: [hH];
fragment I: [iI];
fragment J: [jJ];
fragment K: [kK];
fragment L: [lL];
fragment M: [mM];
fragment N: [nN];
fragment O: [oO];
fragment P: [pP];
fragment Q: [qQ];
fragment R: [rR];
fragment S: [sS];
fragment T: [tT];
fragment U: [uU];
fragment V: [vV];
fragment W: [wW];
fragment X: [xX];
fragment Y: [yY];
fragment Z: [zZ];
Your STRING lexer rule accepts tabs, linefeeds and commas. Try:
STRING: ~[ :()+\-,\t\r\n]+;
Note the escaped '-': it still needs to be excluded so that the + and - sorting prefixes remain their own tokens. (Having your WHITE_SPACE rule above STRING won't affect this, because ANTLR's lexer rules select the longest sequence of characters that matches any lexer rule.) This is also why you'll usually see grammars require some sort of delimiter on strings. (The delimiters also distinguish identifiers from string literals in most languages.)
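As a sanity check for both of your questions (the comma and the trailing \n), you can print the tokens the lexer actually produces. A minimal sketch, assuming you regenerate SearchEngineLexer after changing the STRING rule:
import org.antlr.v4.runtime.*;

public class ShowSearchTokens {
    public static void main(String[] args) {
        String input = "text to search query_on:fielda,fieldab fielda:\"123\" "
                + "sort_by:+fielda,-fieldabc\n";
        SearchEngineLexer lexer = new SearchEngineLexer(new ANTLRInputStream(input));
        // getAllTokens() drains the lexer, so tokens skipped by WHITE_SPACE never show up here.
        for (Token t : lexer.getAllTokens()) {
            System.out.println("'" + t.getText() + "' -> "
                    + SearchEngineLexer.VOCABULARY.getDisplayName(t.getType()));
        }
    }
}
With your original STRING rule, fielda,fieldab and the final fieldabc together with the \n each come back as one STRING token rather than the FIELDA/FIELDAB/FIELDABC tokens your varname rule expects; with the amended rule they split as intended.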
I am creating my own language with ANTLR 4 and I would like to create a rule to define variables with their types, for example:
string = "string"
boolean = true
integer = 123
double = 12.3
string = string // reference to variable
Here is my grammar.
// lexer grammar
fragment LETTER : [A-Za-z];
fragment DIGIT : [0-9];
ID : LETTER+;
STRING : '"' ( ~ '"' )* '"' ;
BOOLEAN: ( 'true' | 'false');
INTEGER: DIGIT+ ;
DOUBLE: DIGIT+ ('.' DIGIT+)*;
// parser grammar
program: main EOF;
main: study ;
study : studyBlock (assignVariableBlock)? ;
simpleAssign: name = ID '=' value = (STRING | BOOLEAN | INTEGER | BOOLEAN | ID);
listAssign: name = ID '=' value = listString #listStringAssign;
assign: simpleAssign #simpleVariableAssign
| listAssign #listOfVariableAssign
;
assignVariableBlock: assign+;
key: name = ID '[' value = STRING ']';
listString: '{' STRING (',' STRING)* '}';
studyParameters: (| ( simpleAssign (',' simpleAssign)*) );
studyBlock: 'study' '(' studyParameters ')' ;
When I test with this example, ANTLR displays the following errors:
study(timestamp = "10:30", region = "region", businessDate="2020-03-05", processType="ID")
bool = true
region = "region"
region = region
line 4:7 no viable alternative at input 'bool=true'
line 6:9 no viable alternative at input 'region=region'
How can I fix that?
When I test your grammar and start at the program rule for the given input, I get a parse tree without any errors or warnings.
You either don't start with the correct parser rule, or are testing an old parser and need to generate new classes from your grammar.
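If you are driving the parser from your own main method rather than from a test rig, it would look roughly like the sketch below. The class names are assumptions: if your grammar file is named, say, MyLang.g4, ANTLR generates MyLangLexer and MyLangParser.
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

public class Demo {
    public static void main(String[] args) {
        String src = "study(timestamp = \"10:30\", region = \"region\", businessDate=\"2020-03-05\", processType=\"ID\")\n"
                + "bool = true\n"
                + "region = \"region\"\n"
                + "region = region\n";
        MyLangLexer lexer = new MyLangLexer(new ANTLRInputStream(src));
        MyLangParser parser = new MyLangParser(new CommonTokenStream(lexer));
        // Start at the grammar's entry rule, not at some inner rule.
        ParseTree tree = parser.program();
        System.out.println(tree.toStringTree(parser));
    }
}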
I've been using ANTLR for 3 days. I can parse expressions, write listeners, interpret parse trees... it's a dream come true.
But then I tried to match a literal string 'foo%' and I'm failing. I can find plenty of examples that claim to do this. I have tried them all.
So I created a tiny project to match a literal string. I must be doing something silly.
grammar Test;
clause
: stringLiteral EOF
;
fragment ESCAPED_QUOTE : '\\\'';
stringLiteral : '\'' ( ESCAPED_QUOTE | ~('\n'|'\r') ) + '\'';
Simple test:
public class Test {
@org.junit.Test
public void test() {
String input = "'foo%'";
TestLexer lexer = new TestLexer(new ANTLRInputStream(input));
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
ParseTree clause = parser.clause();
System.out.println(clause.toStringTree(parser));
ParseTreeWalker walker = new ParseTreeWalker();
}
}
The result:
Running com.example.Test
line 1:1 token recognition error at: 'f'
line 1:2 token recognition error at: 'o'
line 1:3 token recognition error at: 'o'
line 1:4 token recognition error at: '%'
line 1:6 no viable alternative at input '<EOF>'
(clause (stringLiteral ' ') <EOF>)
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.128 sec - in com.example.Test
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
The full maven-ized build tree is available for a quick review here
31 lines of code... most of it borrowed from small examples.
$ mvn clean test
Using antlr-4.5.2-1.
fragment rules can only be used by other lexer rules. So, you need to make stringLiteral a lexer rule instead of a parser rule. Just let it start with an upper case letter.
Also, it's better to expand your negated class ~('\n'|'\r') so that it also excludes the quote and the backslash, and to allow the backslash itself to be escaped:
clause
: StringLiteral EOF
;
StringLiteral : '\'' ( Escape | ~('\'' | '\\' | '\n' | '\r') ) + '\'';
fragment Escape : '\\' ( '\'' | '\\' );
I need to handle these sequences: <1>, <1-2>, <3-5 /0.5/>.
In ANTLR v3 I used these rules:
LPOINTY : ('<' REPEAT (PROBABILITY)? '>') => '<' // will consume only '<'
repeatOperator : LPOINTY_OR_ABNF_URI (XML_NM_TOKEN (weightOrProbability'>')?
In ANTLR v4 this operator "=>" is not allowed, so I wrote it like this:
LPOINTY_OR_ABNF_URI // will return only digit, ex: 1, 1-2, 3-5
: '<' REPEAT '>' { setText(getText().substring(1, getText().length() - 1)); }
| '<' REPEAT WS+ { setText(getText().substring(1, getText().length())); }
;
repeatOperator
: LPOINTY_OR_ABNF_URI (WEIGHT_OR_PROBABILITY)? SHARP_BRACKET_RIGHT?
;
where the tokens are:
XML_NM_TOKEN - matches the content of '<..>'
weightOrProbability and WEIGHT_OR_PROBABILITY - match /0.5/
PROBABILITY - matches /0.5/
WS - matches white space
SHARP_BRACKET_RIGHT - matches '>'
Is there a better way to do this? I would like to use lookahead functionality and consume only the first character, like in the old version. Is there a way to do this?
My solution:
REPEAT_OP1
: '<' REPEAT '>' { setText(getText().substring(1, getText().length()-1)); }
;
REPEAT_OP2
: '<' REPEAT { setText(getText().substring(1, getText().length())); }
;
repeatOperator
: REPEAT_OP1
| REPEAT_OP2 WEIGHT_OR_PROBABILITY? SHARP_BRACKET_RIGHT
| REPEAT_OP2 WEIGHT_OR_PROBABILITY? {notifyErrorListeners("Missing closing '>'!");}
;
I am trying to parse a boolean expression of the following type
B1=p & A4=p | A6=p &(~A5=c)
I want a tree that I can use to evaluate the above expression. So I tried this in ANTLR 3 with the example in Antlr parser for and/or logic - how to get expressions between logic operators?
It worked in ANTLR 3. Now I want to do the same thing for ANTLR 4. I came up with the grammar below and it compiles, but I am having trouble writing the Java code.
Start of ANTLR 4 grammar
grammar TestAntlr4;
options {
output = AST;
}
tokens { AND, OR, NOT }
AND : '&';
OR : '|';
NOT : '~';
// parser/production rules start with a lower case letter
parse
: expression EOF! // omit the EOF token
;
expression
: or
;
or
: and (OR^ and)* // make `||` the root
;
and
: not (AND^ not)* // make `&&` the root
;
not
: NOT^ atom // make `~` the root
| atom
;
atom
: ID
| '('! expression ')'! // omit both `(` and `)`
;
// lexer/terminal rules start with an upper case letter
ID
:
(
'a'..'z'
| 'A'..'Z'
| '0'..'9' | ' '
| ('+'|'-'|'*'|'/'|'_')
| '='
)+
;
I have written the Java code (snippet below) to get a tree for the expression "B1=p & A4=p | A6=p &(~A5=c)". I am expecting & with children B1=p and |. The child | operator will have children A4=p and A6=p &(~A5=c). And so on.
Here is that Java code, but I am stuck trying to figure out how I will get the tree. I was able to do this in ANTLR 3.
Java Code
String src = "B1=p & A4=p | A6=p &(~A5=c)";
CharStream stream = (CharStream)(new ANTLRInputStream(src));
TestAntlr4Lexer lexer = new TestAntlr4Lexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestAntlr4Parser parser = new TestAntlr4Parser(tokens);
parser.setBuildParseTree(true);
ParserRuleContext tree = parser.parse();
tree.inspect(parser);
if ( tree.children.size() > 0) {
System.out.println(" **************");
test.getChildren(tree, parser);
}
The getChildren method is below, but it does not seem to extract any tokens.
public void getChildren(ParseTree tree, TestAntlr4Parser parser ) {
for (int i=0; i<tree.getChildCount(); i++){
System.out.println(" Child i= " + i);
System.out.println(" expression = <" + tree.toStringTree(parser) + ">");
if ( tree.getChild(i).getChildCount() != 0 ) {
this.getChildren(tree.getChild(i), parser);
}
}
}
Could someone help me figure out how to write the parser in Java?
The output=AST option was removed in ANTLR 4, as well as the ^ and ! operators you used in the grammar. ANTLR 4 produces parse trees instead of ASTs, so the root of the tree produced by a rule is the rule itself. For example, given the following rule:
and : not (AND not)*;
You will end up with an AndContext tree containing NotContext and TerminalNode children for the not and AND references, respectively. To make it easier to work with the trees, AndContext will contain a generated method not() which returns a list of context objects returned by the invocations of the not rule (return type List<? extends NotContext>). It also contains a generated method AND which returns a list of the TerminalNode instances created for each AND token that was matched.