Comma separated with/without spaces change behaviour. Spaces are skipped though - antlr4

First of all, thanks a lot for your time.
Practicing a little bit more with antlr4, I made this grammar (below).
Input
The tested input is the following:
text to search query_on:fielda,fieldab fielda:"123" sort_by:+fielda,-fieldabc
This produces the following output, which starts to fail at the varname inside the query_on rule:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda,fieldab fielda)))) : "123" sort_by : + fielda, - fieldabc\n)
If I instead put spaces around the commas:
text to search query_on:fielda , fieldab fielda:"123" sort_by:+fielda , -fieldabc
The output is much closer to my expected output:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda) , (varname fieldab)) (filters (binary_op (varname fielda) : (value "123"))) (sorting_fields sort_by : (sorting_field (sorting_order (asc +)) (varname fielda)) , (sorting_field (sorting_order (desc -)) (varname fieldabc\n))))) <EOF>)
The only failing part is the last \n.
Expected
The expected result is the same as before, but with the varname fieldabc accepted and the \n skipped.
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda) , (varname fieldab)) (filters (binary_op (varname fielda) : (value "123"))) (sorting_fields sort_by : (sorting_field (sorting_order (asc +)) (varname fielda)) , (sorting_field (sorting_order (desc -)) (varname fieldabc))))))
Questions
Therefore:
Why is the grammar sensitive to the spaces around a comma?
Similarly, why is the \n character not skipped at the end?
Thanks!
GRAMMAR
grammar SearchEngine;
// Grammar
start: query EOF;
query
: '(' query+ ')'
| query (OR query)+
| expr
;
expr: text_query query_on? filters* sorting_fields?;
text_query: STRING+;
query_on: QUERY_ON ':' varname (',' varname)*;
filters: binary_op+;
binary_op: varname ':' value;
sorting_fields: SORT_BY ':' sorting_field (',' sorting_field)*;
sorting_field: sorting_order varname;
sorting_order: (asc|desc);
asc: '+';
desc: '-';
varname
: FIELDA
| FIELDAB
| FIELDABC
;
value: STRING;
// Lexer rules (tokens)
WHITE_SPACE: [ \t\r\n] -> skip;
OR: O R;
QUERY_ON: Q U E R Y '_' O N;
SORT_BY: S O R T '_' B Y;
FIELDA: F I E L D A;
FIELDAB: F I E L D A B;
FIELDABC: F I E L D A B C;
STRING: ~[ :()+-]+;
// Fragments (not tokens)
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
fragment F: [fF];
fragment G: [gG];
fragment H: [hH];
fragment I: [iI];
fragment J: [jJ];
fragment K: [kK];
fragment L: [lL];
fragment M: [mM];
fragment N: [nN];
fragment O: [oO];
fragment P: [pP];
fragment Q: [qQ];
fragment R: [rR];
fragment S: [sS];
fragment T: [tT];
fragment U: [uU];
fragment V: [vV];
fragment W: [wW];
fragment X: [xX];
fragment Y: [yY];
fragment Z: [zZ];

Your STRING lexer rule accepts commas, tabs and linefeeds. Try:
STRING: ~[ :()+\-,\t\r\n]+;
Note that the hyphen is escaped here: an unescaped - between + and , would be read as a character range. (Having your WHITE_SPACE rule above STRING won't affect this, because ANTLR's lexer always selects the rule that matches the longest sequence of characters; rule order only breaks ties.) This is also why you'll usually see grammars require some sort of delimiter on strings. (The delimiters also distinguish identifiers from string literals in most languages.)
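The longest-match behaviour can be sketched in plain Java, without ANTLR. This is only an illustration under stated assumptions: the class LongestMatchDemo, its methods, and the regular expressions are hypothetical stand-ins for a few of the grammar's lexer rules (COMMA stands for the implicit ',' token), not ANTLR's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {
    // Try every rule at the current position and keep the longest match;
    // on a tie, the rule listed first wins.
    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at " + pos);
            }
            // Mimic "-> skip" for whitespace tokens.
            if (!bestName.equals("WHITE_SPACE")) {
                tokens.add(bestName + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    // Hypothetical regex approximations of some of the grammar's lexer rules.
    static Map<String, Pattern> rules(String stringRegex) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("WHITE_SPACE", Pattern.compile("[ \\t\\r\\n]"));
        rules.put("COMMA", Pattern.compile(","));
        rules.put("FIELDA", Pattern.compile("(?i)fielda"));
        rules.put("FIELDAB", Pattern.compile("(?i)fieldab"));
        rules.put("STRING", Pattern.compile(stringRegex));
        return rules;
    }

    // STRING as in the question: commas, tabs and newlines are allowed.
    static List<String> withOriginalString(String input) {
        return tokenize(input, rules("[^ :()+\\-]+"));
    }

    // STRING with commas, tabs and newlines excluded.
    static List<String> withFixedString(String input) {
        return tokenize(input, rules("[^ :()+\\-,\\t\\r\\n]+"));
    }

    public static void main(String[] args) {
        System.out.println(withOriginalString("fielda,fieldab\n"));
        System.out.println(withFixedString("fielda,fieldab\n"));
    }
}
```

With the original STRING pattern the whole of fielda,fieldab plus the newline comes out as one STRING token; with the narrower pattern the sketch yields FIELDA, COMMA and FIELDAB, and the newline is skipped.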

Related

Antlr string token without a certain character sequence

I'm trying to define a lexer grammar that matches string tokens that don't contain a certain sequence of characters. For instance "AB"
Example of strings I want to capture
""
"asda A rewr A"
"asda A"
"asdas B ad"
but not
"asdas AB fdsdf"
I tried a few things, but I always seem to miss some case.
This could be done with a little mode magic: when you're in the first string mode and you encounter an AB, you push into the second string mode:
lexer grammar MyLexer;
QUOTE : '"' -> more, pushMode(MODE_1);
SPACES : [ \t\r\n]+ -> skip;
mode MODE_1;
STR_1 : '"' -> popMode;
AB : 'AB' -> more, pushMode(MODE_2);
CONTENTS_1 : ~["] -> more;
mode MODE_2;
STR_2 : '"' -> popMode, popMode;
CONTENTS_2 : ~["]+ -> more;
The Java demo:
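import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        String source = "\"\"\n" +
                "\"asda A rewr A\"\n" +
                "\"asdas AB fdsdf\"\n" +
                "\"asda A\"\n" +
                "\"asdas B ad\"\n";
        Lexer lexer = new MyLexer(CharStreams.fromString(source));
        CommonTokenStream stream = new CommonTokenStream(lexer);
        // Tokenize the whole input up front
        stream.fill();
        System.out.println(source);
        // Print each token's symbolic name next to its text
        for (Token t : stream.getTokens()) {
            System.out.printf("%-20s `%s`%n",
                    MyLexer.VOCABULARY.getSymbolicName(t.getType()),
                    t.getText().replace("\n", "\\n"));
        }
    }
}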
will print the following:
""
"asda A rewr A"
"asdas AB fdsdf"
"asda A"
"asdas B ad"
STR_1 `""`
STR_1 `"asda A rewr A"`
STR_2 `"asdas AB fdsdf"`
STR_1 `"asda A"`
STR_1 `"asdas B ad"`

no viable alternative at input ANTLR4?

I am creating my own language with ANTLR 4 and I would like to create a rule to define variables with their types, for example:
string = "string"
boolean = true
integer = 123
double = 12.3
string = string // reference to variable
Here is my grammar.
// lexer grammar
fragment LETTER : [A-Za-z];
fragment DIGIT : [0-9];
ID : LETTER+;
STRING : '"' ( ~ '"' )* '"' ;
BOOLEAN: ( 'true' | 'false');
INTEGER: DIGIT+ ;
DOUBLE: DIGIT+ ('.' DIGIT+)*;
// parser grammar
program: main EOF;
main: study ;
study : studyBlock (assignVariableBlock)? ;
simpleAssign: name = ID '=' value = (STRING | BOOLEAN | INTEGER | BOOLEAN | ID);
listAssign: name = ID '=' value = listString #listStringAssign;
assign: simpleAssign #simpleVariableAssign
| listAssign #listOfVariableAssign
;
assignVariableBlock: assign+;
key: name = ID '[' value = STRING ']';
listString: '{' STRING (',' STRING)* '}';
studyParameters: (| ( simpleAssign (',' simpleAssign)*) );
studyBlock: 'study' '(' studyParameters ')' ;
When I test with this example, ANTLR displays the following errors:
study(timestamp = "10:30", region = "region", businessDate="2020-03-05", processType="ID")
bool = true
region = "region"
region = region
line 4:7 no viable alternative at input 'bool=true'
line 6:9 no viable alternative at input 'region=region'
How can I fix that?
When I test your grammar and start at the program rule for the given input, I get a parse tree without any errors or warnings.
Either you don't start with the correct parser rule, or you are testing an old parser and need to generate new classes from your grammar.

ANTLR matching wrong tokens

mytest.g4
lexer grammar mytest;
fragment HEX: '0' [xX] [0-9a-fA-F]+;
fragment INT: [0-9]+;
fragment WS: [\t ]+;
fragment NL: WS? ('\r'* '\n')+;
INFO: 'InfoFromDb' -> mode(INFO_MODE);
ID: 'ID from database' -> mode(ID_MODE);
mode INFO_MODE;
INFO_INTERMEDIATE: ':' WS*;// -> channel(HIDDEN);
//INFO_DATA: ~[\r\n]+; //(~('\r' | '\n'))+;
INFO_DATA: HEX ',' WS* [A-Za-z]+ WS* INT; //line 24:13 token recognition error at: '0xF8, F'
INFO_END: NL -> mode(DEFAULT_MODE);
mode ID_MODE;
ID_INTERMEDIATE: ':' WS* -> channel(HIDDEN);
ID_DATA: ~[\r\n]+;
// let's throw away the next token, but also use it to exit this mode
ID_END: NL -> skip, pushMode(DEFAULT_MODE);
pars.g4
parser grammar pars;
options {
tokenVocab=mytest;
}
the_id: ID ID_DATA;
info: INFO INFO_INTERMEDIATE INFO_DATA INFO_END;
input
InfoFromDb: 0xF8, FooData 3
ID from database: 0x3, Blah ID: 0, Meta ID: 0, MetaB: 1
When I test the parser rule the_id with the input, it yields a parse tree of:
InfoFromDb
0xF8, FooData 3
\n
ID from database
which just makes no sense.
Similarly hard to understand, the info parser rule yields:
InfoFromDb
<missing INFO_INTERMEDIATE>
: 0xF8, FooData 3
\n
What's going on here? Are lexer rules somehow optional and being ignored? Am I misusing the mode stuff?

ANTLR4 lexer semantic predicate issue

I'm trying to use a semantic predicate in the lexer to look ahead one token but somehow I can't get it right. Here's what I have:
lexer grammar
lexer grammar TLLexer;
DirStart
: { getCharPositionInLine() == 0 }? '#dir'
;
DirEnd
: { getCharPositionInLine() == 0 }? '#end'
;
Cont
: 'contents' [ \t]* -> mode(CNT)
;
WS
: [ \t]+ -> channel(HIDDEN)
;
NL
: '\r'? '\n'
;
mode CNT;
CNT_DirEnd
: '#end' [ \t]* '\n'?
{ System.out.println("--matched end--"); }
;
CNT_LastLine
: ~ '\n'* '\n'
{ _input.LA(1) == CNT_DirEnd }? -> mode(DEFAULT_MODE)
;
CNT_Line
: ~ '\n'* '\n'
;
parser grammar
parser grammar TLParser;
options { tokenVocab = TLLexer; }
dirs
: ( dir
| NL
)*
;
dir
: DirStart Cont
contents
DirEnd
;
contents
: CNT_Line* CNT_LastLine
;
Essentially each line in the stuff in the CNT mode is free-form, but it never begins with #end followed by optional whitespace. Basically I want to keep matching the #end tag in the default lexer mode.
My test input is as follows:
#dir contents
..line..
#end
If I run this in grun I get the following
$ grun TL dirs test.txt
--matched end--
line 3:0 extraneous input '#end\n' expecting {CNT_LastLine, CNT_Line}
So clearly CNT_DirEnd gets matched, but somehow the predicate doesn't detect it.
I know that this particular task doesn't require a semantic predicate, but that's just the part that doesn't work. The actual parser, while it may be written without the predicate, will be a lot less clean if I simply move the matching of the #end tag into the mode CNT.
Thanks,
Kesha.
I think I figured it out. The member _input represents the characters of the original input, thus _input.LA returns characters, not lexer token IDs (is that the correct term?). Either way, the numbers returned by the lexer to the parser have nothing to do with the values returned by _input.LA, hence the predicate fails unless by some weird luck the character value returned by _input.LA(1) is equal to the lexer ID of CNT_DirEnd.
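The mismatch can be shown in isolation with a minimal sketch. Everything here is hypothetical: la mimics a character lookahead that returns -1 at end of input, and CNT_DIR_END stands in for the generated token-type constant (its real value depends on the generated lexer).

```java
public class LookaheadDemo {
    // Hypothetical token type, standing in for the generated constant CNT_DirEnd.
    static final int CNT_DIR_END = 6;

    // Mimics a character lookahead: the character i positions ahead
    // (1-based) as an int, or -1 at end of input.
    static int la(String input, int pos, int i) {
        int index = pos + i - 1;
        return index < input.length() ? (int) input.charAt(index) : -1;
    }

    // The original predicate: compares a character code with a token type.
    static boolean brokenPredicate(String input, int pos) {
        return la(input, pos, 1) == CNT_DIR_END;
    }

    // What was intended: look at the upcoming characters themselves.
    static boolean characterPredicate(String input, int pos) {
        return input.startsWith("#end", pos);
    }

    public static void main(String[] args) {
        String rest = "#end\n";
        System.out.println(la(rest, 0, 1));               // code of '#'
        System.out.println(brokenPredicate(rest, 0));     // character code vs. token type
        System.out.println(characterPredicate(rest, 0));  // compares characters
    }
}
```

Here the lookahead yields the character code of '#' (35), which has no relation to the token-type number, so the first predicate is false even though the upcoming text is #end.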
I modified the lexer as shown below and now it works, even though it is not as elegant as I hoped it would be (maybe someone knows a better way?)
lexer grammar TLLexer;
@lexer::members {
    private static final String END_DIR = "#end";

    private boolean isAtEndDir() {
        StringBuilder sb = new StringBuilder();
        int n = 1;
        int ic;
        // read characters until EOF
        while ((ic = _input.LA(n++)) != -1) {
            char c = (char) ic;
            // we're interested in the next line only
            if (c == '\n') break;
            if (c == '\r') continue;
            sb.append(c);
        }
        // Does the line begin with #end ?
        if (sb.indexOf(END_DIR) != 0) return false;
        // Is the #end followed by whitespace only?
        for (int i = END_DIR.length(); i < sb.length(); i++) {
            switch (sb.charAt(i)) {
                case ' ':
                case '\t':
                    continue;
                default: return false;
            }
        }
        return true;
    }
}
[skipped .. nothing changed in the default mode]
mode CNT;
/* removed CNT_DirEnd */
CNT_LastLine
: ~ '\n'* '\n'
{ isAtEndDir() }? -> mode(DEFAULT_MODE)
;
CNT_Line
: ~ '\n'* '\n'
;
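As one possible simplification, roughly the same check can be written with a regular expression. The sketch below is an assumption-laden illustration: the class EndDirCheck is hypothetical, it works on a plain String and an offset rather than on the lexer's _input, and unlike the helper above it only tolerates a '\r' directly before the line break.

```java
import java.util.regex.Pattern;

public class EndDirCheck {
    // "#end", then only spaces or tabs, then an optional '\r'
    // followed by a newline or the end of the input.
    private static final Pattern END_DIR_LINE =
            Pattern.compile("#end[ \\t]*\\r?(\\n|\\z)");

    // True if the text starting at pos begins with a line that holds
    // only "#end" plus trailing blanks.
    static boolean isAtEndDir(String input, int pos) {
        return END_DIR_LINE.matcher(input)
                .region(pos, input.length())
                .lookingAt();
    }

    public static void main(String[] args) {
        String text = "..line..\n#end\n";
        System.out.println(isAtEndDir(text, 0)); // at "..line.."
        System.out.println(isAtEndDir(text, 9)); // at "#end"
    }
}
```

Inside a lexer, the remaining input would still have to be read through the character lookahead first, so this only shortens the matching part of the helper.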

Antlr4 grammar ambiguity

I have the following grammar (minimized for SO):
grammar Hello;
odataIdentifier : identifierLeadingCharacter identifierCharacter*;
identifierLeadingCharacter : Alpha| UNDERSCORE;
identifierCharacter : identifierLeadingCharacter | Digit;
identifierUnreserved : identifierCharacter | (MINUS | DOT | TILDE);
Digit : ZERO_TO_FIVE |[6-9];
ONEHUNDRED_TO_ONEHUNDREDNINETYNINE : '1' Digit Digit; // 100-199
TWOHUNDRED_TO_TWOHUNDREDFOURTYNINE : '2' ZERO_TO_FOUR Digit; // 200-249
TWOHUNDREDFIFTY_TO_TWOHUNDREDFIFTYFIVE : '25' ZERO_TO_FIVE; // 250-255
TEN_TO_NINETYNINE : ONE_TO_NINE Digit; // 10-99
ZERO_TO_ONE : [0-1];
ZERO_TO_TWO : ZERO_TO_ONE | [2];
ZERO_TO_THREE : ZERO_TO_TWO | [3];
ZERO_TO_FOUR : ZERO_TO_THREE | [4];
ZERO_TO_FIVE : ZERO_TO_FOUR | [5];
ONE_TO_TWO : [1-2];
ONE_TO_THREE : ONE_TO_TWO | [3];
ONE_TO_FOUR : ONE_TO_THREE | [4];
ONE_TO_NINE : ONE_TO_FOUR | [5-9];
Alpha : [a-zA-Z];
MINUS : [-];
DOT : '.';
UNDERSCORE : '_';
TILDE : '~';
WS : (' '|'\r'|'\t'|'\u000C'|'\n') -> skip
;
For input c9 it works fine, but when I have two digits, for example c10, it says:
extraneous input '92' expecting {<EOF>, Digit, Alpha, '_'}
So I guess it parses 9 and parses 2 and doesn't know whether this should be TEN_TO_NINETYNINE or 2 Digit Digit.
I am a noob to this, so I'm wondering whether my analysis is right and how I could alleviate this ...
Your input is resulting in an Alpha token followed by a TEN_TO_NINETYNINE token. While the parser rule identifierLeadingCharacter does allow the Alpha token, the identifierCharacter rule cannot match a TEN_TO_NINETYNINE token.
The input 10 will always produce a TEN_TO_NINETYNINE token rather than two Digit tokens, because the former matches more of the input and the lexer always prefers the longest match.
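To see this without running ANTLR, here is a small sketch that applies the longest-match rule to three of the grammar's token types. The class GreedyDigitsDemo and the regular expressions are hypothetical approximations of Digit, TEN_TO_NINETYNINE and Alpha, not the generated lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDigitsDemo {
    // Regex approximations of three lexer rules, in grammar order.
    static final String[] NAMES = {"Digit", "TEN_TO_NINETYNINE", "Alpha"};
    static final Pattern[] PATTERNS = {
        Pattern.compile("[0-9]"),
        Pattern.compile("[1-9][0-9]"),
        Pattern.compile("[a-zA-Z]")
    };

    // Returns the token types produced when the longest match always wins.
    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = -1;
            int bestEnd = pos;
            for (int r = 0; r < PATTERNS.length; r++) {
                Matcher m = PATTERNS[r].matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = r;
                    bestEnd = m.end();
                }
            }
            if (best < 0) {
                throw new IllegalArgumentException("no rule matches at " + pos);
            }
            types.add(NAMES[best]);
            pos = bestEnd;
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenTypes("c9"));
        System.out.println(tokenTypes("c10"));
    }
}
```

For c9 the sketch produces Alpha followed by Digit, while for c10 it produces Alpha followed by a single TEN_TO_NINETYNINE, which is the token the identifierCharacter rule cannot accept.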
