I'm trying to define a lexer grammar that matches string tokens that don't contain a certain sequence of characters, for instance "AB".
Examples of strings I want to capture:
""
"asda A rewr A"
"asda A"
"asdas B ad"
but not
"asdas AB fdsdf"
I tried a few things, but I always seem to miss some case.
This can be done with a little mode magic: when you're in the first string mode and you encounter an AB, you push into the second string mode:
lexer grammar MyLexer;
QUOTE : '"' -> more, pushMode(MODE_1);
SPACES : [ \t\r\n]+ -> skip;
mode MODE_1;
STR_1 : '"' -> popMode;
AB : 'AB' -> more, pushMode(MODE_2);
CONTENTS_1 : ~["] -> more;
mode MODE_2;
STR_2 : '"' -> popMode, popMode;
CONTENTS_2 : ~["]+ -> more;
The Java demo:
String source = "\"\"\n" +
        "\"asda A rewr A\"\n" +
        "\"asdas AB fdsdf\"\n" +
        "\"asda A\"\n" +
        "\"asdas B ad\"\n";

Lexer lexer = new MyLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill();

System.out.println(source);

for (Token t : stream.getTokens()) {
    System.out.printf("%-20s `%s`%n",
            MyLexer.VOCABULARY.getSymbolicName(t.getType()),
            t.getText().replace("\n", "\\n"));
}
will print the following:
""
"asda A rewr A"
"asdas AB fdsdf"
"asda A"
"asdas B ad"
STR_1 `""`
STR_1 `"asda A rewr A"`
STR_2 `"asdas AB fdsdf"`
STR_1 `"asda A"`
STR_1 `"asdas B ad"`
First of all, thanks a lot for your time.
Practicing a little bit more with ANTLR 4, I made the grammar below.
Input
The tested input is the following:
text to search query_on:fielda,fieldab fielda:"123" sort_by:+fielda,-fieldabc
This produces the following output, which starts to fail at the query_on / varname rules:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda,fieldab fielda)))) : "123" sort_by : + fielda, - fieldabc\n)
If I instead put spaces around the commas:
text to search query_on:fielda , fieldab fielda:"123" sort_by:+fielda , -fieldabc
the output is much closer to my expected output:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda) , (varname fieldab)) (filters (binary_op (varname fielda) : (value "123"))) (sorting_fields sort_by : (sorting_field (sorting_order (asc +)) (varname fielda)) , (sorting_field (sorting_order (desc -)) (varname fieldabc\n))))) <EOF>)
The only failing part is the last \n.
Expected
The expected result is the same as before, but accepting the varname fieldabc and skipping the \n.
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda) , (varname fieldab)) (filters (binary_op (varname fielda) : (value "123"))) (sorting_fields sort_by : (sorting_field (sorting_order (asc +)) (varname fielda)) , (sorting_field (sorting_order (desc -)) (varname fieldabc))))))
Questions
Therefore:
Why is the grammar sensitive to the spaces around a comma?
Similarly, why is the \n char not skipped at the end?
Thanks!
GRAMMAR
grammar SearchEngine;
// Grammar
start: query EOF;
query
: '(' query+ ')'
| query (OR query)+
| expr
;
expr: text_query query_on? filters* sorting_fields?;
text_query: STRING+;
query_on: QUERY_ON ':' varname (',' varname)*;
filters: binary_op+;
binary_op: varname ':' value;
sorting_fields: SORT_BY ':' sorting_field (',' sorting_field)*;
sorting_field: sorting_order varname;
sorting_order: (asc|desc);
asc: '+';
desc: '-';
varname
: FIELDA
| FIELDAB
| FIELDABC
;
value: STRING;
// Lexer rules (tokens)
WHITE_SPACE: [ \t\r\n] -> skip;
OR: O R;
QUERY_ON: Q U E R Y '_' O N;
SORT_BY: S O R T '_' B Y;
FIELDA: F I E L D A;
FIELDAB: F I E L D A B;
FIELDABC: F I E L D A B C;
STRING: ~[ :()+-]+;
// Fragments (not tokens)
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
fragment F: [fF];
fragment G: [gG];
fragment H: [hH];
fragment I: [iI];
fragment J: [jJ];
fragment K: [kK];
fragment L: [lL];
fragment M: [mM];
fragment N: [nN];
fragment O: [oO];
fragment P: [pP];
fragment Q: [qQ];
fragment R: [rR];
fragment S: [sS];
fragment T: [tT];
fragment U: [uU];
fragment V: [vV];
fragment W: [wW];
fragment X: [xX];
fragment Y: [yY];
fragment Z: [zZ];
Your STRING Lexer rule accepts tabs and linefeeds. Try:
STRING: ~[ :()+\-,\t\r\n]+;
(Having your WHITE_SPACE rule above it won't affect this, because ANTLR's lexer rules select the longest sequence of characters that matches any lexer rule.) This is also why you'll usually see grammars require some sort of delimiter on strings. (The delimiters also distinguish identifiers from string literals in most languages.)
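To see the effect, a quick token dump helps (a sketch, assuming the SearchEngineLexer class that ANTLR generates from the grammar above): with the original STRING rule, everything after the ':' comes back as a single STRING token, because the lexer always prefers the longest match.
SearchEngineLexer lexer = new SearchEngineLexer(CharStreams.fromString("query_on:fielda,fieldab\n"));
for (Token t : lexer.getAllTokens()) {
    // with the original STRING rule, the last token printed is a single STRING `fielda,fieldab\n`
    System.out.printf("%-10s `%s`%n",
            SearchEngineLexer.VOCABULARY.getSymbolicName(t.getType()),
            t.getText().replace("\n", "\\n"));
}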
I am creating my own language with ANTLR 4, and I would like to create a rule to define variables with their types, for example:
string = "string"
boolean = true
integer = 123
double = 12.3
string = string // reference to variable
Here is my grammar.
// lexer grammar
fragment LETTER : [A-Za-z];
fragment DIGIT : [0-9];
ID : LETTER+;
STRING : '"' ( ~ '"' )* '"' ;
BOOLEAN: ( 'true' | 'false');
INTEGER: DIGIT+ ;
DOUBLE: DIGIT+ ('.' DIGIT+)*;
// parser grammar
program: main EOF;
main: study ;
study : studyBlock (assignVariableBlock)? ;
simpleAssign: name = ID '=' value = (STRING | BOOLEAN | INTEGER | DOUBLE | ID);
listAssign: name = ID '=' value = listString #listStringAssign;
assign: simpleAssign #simpleVariableAssign
| listAssign #listOfVariableAssign
;
assignVariableBlock: assign+;
key: name = ID '[' value = STRING ']';
listString: '{' STRING (',' STRING)* '}';
studyParameters: (| ( simpleAssign (',' simpleAssign)*) );
studyBlock: 'study' '(' studyParameters ')' ;
When I test with this example, ANTLR displays the following errors:
study(timestamp = "10:30", region = "region", businessDate="2020-03-05", processType="ID")
bool = true
region = "region"
region = region
line 4:7 no viable alternative at input 'bool=true'
line 6:9 no viable alternative at input 'region=region'
How can I fix that?
When I test your grammar and start at the program rule for the given input, I get a valid parse tree, without any errors or warnings.
You either don't start with the correct parser rule, or are testing an old parser and need to generate new classes from your grammar.
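For reference, a test harness along these lines makes both points easy to check (a sketch; MyLangLexer, MyLangParser and input.txt are placeholder names, since the grammar in the question isn't named):
// regenerate the lexer/parser classes after every grammar change before running this
MyLangLexer lexer = new MyLangLexer(CharStreams.fromFileName("input.txt"));
MyLangParser parser = new MyLangParser(new CommonTokenStream(lexer));
ParseTree tree = parser.program();   // start at the top-level `program` rule
System.out.println(tree.toStringTree(parser));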
mytest.g4
lexer grammar mytest;
fragment HEX: '0' [xX] [0-9a-fA-F]+;
fragment INT: [0-9]+;
fragment WS: [\t ]+;
fragment NL: WS? ('\r'* '\n')+;
INFO: 'InfoFromDb' -> mode(INFO_MODE);
ID: 'ID from database' -> mode(ID_MODE);
mode INFO_MODE;
INFO_INTERMEDIATE: ':' WS*;// -> channel(HIDDEN);
//INFO_DATA: ~[\r\n]+; //(~('\r' | '\n'))+;
INFO_DATA: HEX ',' WS* [A-Za-z]+ WS* INT; //line 24:13 token recognition error at: '0xF8, F'
INFO_END: NL -> mode(DEFAULT_MODE);
mode ID_MODE;
ID_INTERMEDIATE: ':' WS* -> channel(HIDDEN);
ID_DATA: ~[\r\n]+;
// let's throw away the next token, but also use it to exit this mode
ID_END: NL -> skip, pushMode(DEFAULT_MODE);
pars.g4
parser grammar pars;
options {
tokenVocab=mytest;
}
the_id: ID ID_DATA;
info: INFO INFO_INTERMEDIATE INFO_DATA INFO_END;
input
InfoFromDb: 0xF8, FooData 3
ID from database: 0x3, Blah ID: 0, Meta ID: 0, MetaB: 1
When I test the parser rule the_id with the input above, it yields a parse tree of:
InfoFromDb
0xF8, FooData 3
\n
ID from database
which just makes no sense to me...
Similarly hard to understand, the info parser rule yields:
InfoFromDb
<missing INFO_INTERMEDIATE>
: 0xF8, FooData 3
\n
What's going on here? Are lexer rules somehow optional and being ignored? Am I misusing the mode stuff?
I'm seeing an "extraneous input" error with input "\aa a" and the following grammar:
Cool.g4
grammar Cool;
import Lex;
expr
: STR_CONST # str_const
;
Lex.g4
lexer grammar Lex;
@lexer::members {
public static boolean initial = true;
public static boolean inString = false;
public static boolean inStringEscape = false;
}
BEGINSTRING: '"' {initial}? {
inString = true;
initial = false;
System.out.println("Entering string");
} -> more;
INSTRINGSTARTESCAPE: '\\' {inString && !inStringEscape}? {
inStringEscape = true;
System.out.println("The next character will be escaped!");
} -> more;
INSTRINGAFTERESCAPE: ~[\n] {inString && inStringEscape}? {
inStringEscape = false;
System.out.println("Escaped a character.");
} -> more;
INSTRINGOTHER: (~[\n\\"])+ {inString && !inStringEscape}? {
System.out.println("Consumed some other characters in the string!");
} -> more;
STR_CONST: '"' {inString && !inStringEscape}? {
inString = false;
initial = true;
System.out.println("Exiting string");
};
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
ID: [a-z][_A-Za-z0-9]*;
Here's the output:
$ grun Cool expr -tree
"\aa a"
Entering string
The next character will be escaped!
Escaped a character.
Consumed some other characters in the string!
Exiting string
line 1:0 extraneous input '"\aa' expecting STR_CONST
(expr "\aa a")
Interestingly, if I remove the ID rule, antlr parses the input fine. Here's the output when I remove the ID rule:
$ grun Cool expr -tree
"\aa a"
Entering string
The next character will be escaped!
Escaped a character.
Consumed some other characters in the string!
Exiting string
(expr "\aa a")
Any idea what might be going on? Why does ANTLR throw an error when ID is one of the lexer rules?
That's a surprisingly complex way to parse strings with escape sequences. Did you print the resulting tokens to see what your lexer produced?
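(If not, a dump along these lines shows exactly which tokens come out; a sketch, assuming the CoolLexer class generated from the combined grammar:)
CoolLexer lexer = new CoolLexer(CharStreams.fromString("\"\\aa a\"\n"));
for (Token t : lexer.getAllTokens()) {
    // print the token type name next to the matched text
    System.out.printf("%-12s `%s`%n",
            CoolLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}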
I recommend a different (and much simpler) approach:
STR_CONST: '"' ('\\"' | .)*? '"';
Then, in your semantic phase, when you post-process your parse tree, examine the matched text to find escape sequences. Convert them to the real characters and print a good error message when an invalid escape sequence is found (something you cannot do when trying to match escape sequences in the lexer).
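A rough sketch of that post-processing step (the method name and the set of accepted escapes are just examples, not something from your grammar):
static String unescape(String strConst) {
    // strip the surrounding quotes first
    String body = strConst.substring(1, strConst.length() - 1);
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < body.length(); i++) {
        char c = body.charAt(i);
        if (c != '\\') {
            out.append(c);
        } else if (i + 1 < body.length()) {
            char next = body.charAt(++i);
            switch (next) {
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case '"':  out.append('"');  break;
                case '\\': out.append('\\'); break;
                default:   throw new IllegalArgumentException("invalid escape: \\" + next);
            }
        } else {
            throw new IllegalArgumentException("dangling backslash at end of string");
        }
    }
    return out.toString();
}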
Copying the answer I received from @sharwell on GitHub:
"Your ID rule is unpredicated, so it matches aa following the \ (aa is longer than the a matched by INSTRINGAFTERESCAPE, so it's preferred even though it's later in the grammar). If you add a println to WS and ID you'll see the strange behavior in the output."
I'm new to ANTLR, so I hope you can explain this to me explicitly.
I have a /* comment */ (BC) lexer rule in ANTLR, and I want it to behave like this:
/* sample */ => BC
/* s
a
m
p
l
e */ => BC
"" => STRING
" " => STRING
"a" => STRING
"hello world \1" => STRING
but I got this:
/* sample */
/* s
a
m
p
l
e */ => BC
""
" "
"a"
"hello world \1" => STRING
It only takes the first /* and the last */, and the same happens with my STRING token. Here's the comment rule:
BC: '/*'.*'*/';
And the STRING rule:
STRING: '"'(~('"')|(' '|'\b'|'\f'|'r'|'\n'|'\t'|'\"'|'\\'))*'"';
Lexer rules are greedy by default, meaning they try to consume the longest matching sequence. So they stop at the last closing delimiter.
To make a rule non-greedy, use, well, non-greedy subrules:
BC: '/*' .*? '*/';
This will stop at the first closing */, which is exactly what you need.
Same with your STRING rule. Read about it in The Definitive ANTLR 4 Reference, page 285.
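Applied to your STRING rule, a non-greedy version could look something like this (a sketch that also keeps an escape alternative in front, so an escaped quote doesn't end the string early):
STRING: '"' ('\\' . | .)*? '"';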
You can also use the following code fragment without non-greedy syntax (a more general solution):
MultilineCommentStart: '/*' -> more, mode(COMMENTS);
mode COMMENTS;
MultilineComment: '*/' -> mode(DEFAULT_MODE);
MultilineCommentNotAsterisk: ~'*'+ -> more;
MultilineCommentAsterisk: '*' -> more;