Extraneous input error when using "lexer rule actions" and "lexer commands" - antlr4

I'm seeing an "extraneous input" error with input "\aa a" and the following grammar:
Cool.g4
grammar Cool;
import Lex;
expr
: STR_CONST # str_const
;
Lex.g4
lexer grammar Lex;
@lexer::members {
public static boolean initial = true;
public static boolean inString = false;
public static boolean inStringEscape = false;
}
BEGINSTRING: '"' {initial}? {
inString = true;
initial = false;
System.out.println("Entering string");
} -> more;
INSTRINGSTARTESCAPE: '\\' {inString && !inStringEscape}? {
inStringEscape = true;
System.out.println("The next character will be escaped!");
} -> more;
INSTRINGAFTERESCAPE: ~[\n] {inString && inStringEscape}? {
inStringEscape = false;
System.out.println("Escaped a character.");
} -> more;
INSTRINGOTHER: (~[\n\\"])+ {inString && !inStringEscape}? {
System.out.println("Consumed some other characters in the string!");
} -> more;
STR_CONST: '"' {inString && !inStringEscape}? {
inString = false;
initial = true;
System.out.println("Exiting string");
};
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
ID: [a-z][_A-Za-z0-9]*;
Here's the output:
$ grun Cool expr -tree
"\aa a"
Entering string
The next character will be escaped!
Escaped a character.
Consumed some other characters in the string!
Exiting string
line 1:0 extraneous input '"\aa' expecting STR_CONST
(expr "\aa a")
Interestingly, if I remove the ID rule, ANTLR parses the input fine. Here's the output when I remove the ID rule:
$ grun Cool expr -tree
"\aa a"
Entering string
The next character will be escaped!
Escaped a character.
Consumed some other characters in the string!
Exiting string
(expr "\aa a")
Any idea what might be going on? Why does ANTLR throw an error when ID is one of the lexer rules?

That's a surprisingly complex way to parse strings with escape sequences. Did you print the resulting tokens to see what your lexer produced?
I recommend a different (and much simpler) approach:
STR_CONST: '"' ('\\"' | .)*? '"';
Then in your semantic phase, when you post process your parse tree, examine the matched text to find escape sequences. Convert them to the real chars and print a good error message, when an invalid escape sequence was found (something you cannot do when trying to match escape sequences in the lexer).
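As a minimal sketch of that post-processing step (a hypothetical helper, not part of the grammar; the set of recognized escapes here is an assumption):

```java
// Hypothetical post-processing helper: takes the raw text of a matched
// string token (surrounding quotes included) and resolves escape sequences.
public class StringUnescaper {
    static String unescape(String raw) {
        // Strip the surrounding double quotes.
        String body = raw.substring(1, raw.length() - 1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i);
                switch (next) {
                    case 'n':  out.append('\n'); break;
                    case 't':  out.append('\t'); break;
                    case '"':  out.append('"');  break;
                    case '\\': out.append('\\'); break;
                    default:
                        // Unknown escape: this is where you could report a
                        // good error message; the sketch keeps the character.
                        out.append(next);
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this split, the lexer rule stays trivial and all escape diagnostics live in ordinary Java code where you can produce precise error messages.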

Copying the answer I received from @sharwell on GitHub.
"Your ID rule is unpredicated, so it matches aa following the \ (aa is longer than the a matched by INSTRINGAFTERESCAPE, so it's preferred even though it's later in the grammar). If you add a println to WS and ID you'll see the strange behavior in the output."
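Following that diagnosis, one possible fix (an untested sketch against the grammar above) is to predicate ID the same way as the string rules, so it can never win inside a string:

```
ID: [a-z][_A-Za-z0-9]* {!inString}?;
```

A trailing lexer predicate is evaluated after the atoms have matched, so inside a string ID would still be attempted but rejected, leaving INSTRINGAFTERESCAPE free to match.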

Related

ANTLR4 String and Comments Lexer

I'm new to ANTLR, so I hope you can explain this to me explicitly.
I have a /* comment */ (BC) lexer rule in ANTLR, and I want it to behave like this:
/* sample */ => BC
/* s
a
m
p
l
e */ => BC
"" => STRING
" " => STRING
"a" => STRING
"hello world \1" => STRING
but I got this:
/* sample */
/* s
a
m
p
l
e */ => BC
""
" "
"a"
"hello world \1" => STRING
it only takes the first /* and the last */, and the same happens with my String token. Here's the code of Comments:
BC: '/*'.*'*/';
And the String:
STRING: '"'(~('"')|(' '|'\b'|'\f'|'r'|'\n'|'\t'|'\"'|'\\'))*'"';
Lexer rules are greedy by default, meaning they try to consume the longest possible matching sequence, so they stop at the last closing delimiter.
To make a rule non-greedy, use, well, non-greedy subrules:
BC: '/*' .*? '*/';
This will stop at the first closing */ which is exactly what you need.
Same with your STRING. Read about it in The Definitive ANTLR4 Reference, page 285.
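The greedy vs. non-greedy distinction isn't specific to ANTLR; as a small illustration (plain Java regular expressions, not ANTLR itself), the greedy quantifier `.*` runs to the last closing delimiter while the reluctant `.*?` stops at the first:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    public static void main(String[] args) {
        String input = "/* a */ x /* b */";

        // Greedy: .* backtracks from the end, so the match spans to the LAST */
        Matcher greedy = Pattern.compile("/\\*.*\\*/").matcher(input);
        greedy.find();
        System.out.println(greedy.group()); // /* a */ x /* b */

        // Reluctant (non-greedy): .*? stops at the FIRST */
        Matcher lazy = Pattern.compile("/\\*.*?\\*/").matcher(input);
        lazy.find();
        System.out.println(lazy.group()); // /* a */
    }
}
```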
Also you can use the following code fragment without the non-greedy syntax (a more general solution):
MultilineCommentStart: '/*' -> more, mode(COMMENTS);
mode COMMENTS;
MultilineComment: '*/' -> mode(DEFAULT_MODE);
MultilineCommentNotAsterisk: ~'*'+ -> more;
MultilineCommentAsterisk: '*' -> more;

ANTLR4 lexer semantic predicate issue

I'm trying to use a semantic predicate in the lexer to look ahead one token but somehow I can't get it right. Here's what I have:
lexer grammar
lexer grammar TLLexer;
DirStart
: { getCharPositionInLine() == 0 }? '#dir'
;
DirEnd
: { getCharPositionInLine() == 0 }? '#end'
;
Cont
: 'contents' [ \t]* -> mode(CNT)
;
WS
: [ \t]+ -> channel(HIDDEN)
;
NL
: '\r'? '\n'
;
mode CNT;
CNT_DirEnd
: '#end' [ \t]* '\n'?
{ System.out.println("--matched end--"); }
;
CNT_LastLine
: ~ '\n'* '\n'
{ _input.LA(1) == CNT_DirEnd }? -> mode(DEFAULT_MODE)
;
CNT_Line
: ~ '\n'* '\n'
;
parser grammar
parser grammar TLParser;
options { tokenVocab = TLLexer; }
dirs
: ( dir
| NL
)*
;
dir
: DirStart Cont
contents
DirEnd
;
contents
: CNT_Line* CNT_LastLine
;
Essentially, each line in the CNT mode is free-form, but it never begins with #end followed by optional whitespace. Basically I want to keep matching the #end tag in the default lexer mode.
My test input is as follows:
#dir contents
..line..
#end
If I run this in grun I get the following
$ grun TL dirs test.txt
--matched end--
line 3:0 extraneous input '#end\n' expecting {CNT_LastLine, CNT_Line}
So clearly CNT_DirEnd gets matched, but somehow the predicate doesn't detect it.
I know that this particular task doesn't require a semantic predicate, but that's just the part that doesn't work. The actual parser, while it may be written without the predicate, will be a lot less clean if I simply move the matching of the #end tag into the mode CNT.
Thanks,
Kesha.
I think I figured it out. The member _input represents the characters of the original input, thus _input.LA returns characters, not lexer token IDs (is that the correct term?). Either way, the numbers returned by the lexer to the parser have nothing to do with the values returned by _input.LA, hence the predicate fails unless by some weird luck the character value returned by _input.LA(1) is equal to the lexer ID of CNT_DirEnd.
I modified the lexer as shown below and now it works, even though it is not as elegant as I hoped it would be (maybe someone knows a better way?)
lexer grammar TLLexer;
@lexer::members {
private static final String END_DIR = "#end";
private boolean isAtEndDir() {
StringBuilder sb = new StringBuilder();
int n = 1;
int ic;
// read characters until EOF
while ((ic = _input.LA(n++)) != -1) {
char c = (char) ic;
// we're interested in the next line only
if (c == '\n') break;
if (c == '\r') continue;
sb.append(c);
}
// Does the line begin with #end ?
if (sb.indexOf(END_DIR) != 0) return false;
// Is the #end followed by whitespace only?
for (int i = END_DIR.length(); i < sb.length(); i++) {
switch (sb.charAt(i)) {
case ' ':
case '\t':
continue;
default: return false;
}
}
return true;
}
}
[skipped .. nothing changed in the default mode]
mode CNT;
/* removed CNT_DirEnd */
CNT_LastLine
: ~ '\n'* '\n'
{ isAtEndDir() }? -> mode(DEFAULT_MODE)
;
CNT_Line
: ~ '\n'* '\n'
;

Groovy script in JMeter: error "expecting anything but ''\n''; got it anyway @ line..." when it contains a closure that uses GString interpolation

I have this groovy script that defines a closure that works properly.
escape = { str ->
str.collect{ ch ->
def escaped = ch
switch (ch) {
case "\"" : escaped = "\\\"" ; break
// other cases omitted for simplicity
}
escaped
}.join()
}
assert escape("\"") == "\\\"" //Success
But when I add another closure that uses some GString interpolation to the script.
escape = { str ->
//Same as above
}
dummy = {
aStr = "abc"
"123${aStr}456"
}
//Compilation fails
I get the error
javax.script.ScriptException: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script650.groovy: 7: expecting anything but ''\n''; got it anyway @ line 7, column 39.
case "\"" : escaped = "\\"" ; break
^
1 error
Even if the added closure is commented out:
escape = { str ->
//Same as above
}
/*dummy = {
aStr = "abc"
"123${aStr}456"
}*/
//Compilation fails
Still fails! What gives?

ANTLR4 lexer rule with #init block

I have this lexer rule defined in my ANTLR v3 grammar file - it matches text in double quotes.
I need to convert it to ANTLR v4. The ANTLR compiler throws the error 'syntax error: mismatched input '@' expecting COLON while matching a lexer rule' (on the @init line). Can a lexer rule contain an @init block? How should this be rewritten?
DOUBLE_QUOTED_CHARACTERS
@init
{
int doubleQuoteMark = input.mark();
int semiColonPos = -1;
}
: ('"' WS* '"') => '"' WS* '"' { $channel = HIDDEN; }
{
RecognitionException re = new RecognitionException("Illegal empty quotes\"\"!", input);
reportError(re);
}
| '"' (options {greedy=false;}: ~('"'))+
('"'|';' { semiColonPos = input.index(); } ('\u0020'|'\t')* ('\n'|'\r'))
{
if (semiColonPos >= 0)
{
input.rewind(doubleQuoteMark);
RecognitionException re = new RecognitionException("Missing closing double quote!", input);
reportError(re);
input.consume();
}
else
{
setText(getText().substring(1, getText().length()-1));
}
}
;
Sample data:
" " -> throws error "Illegal empty quotes!";
"asd -> throws error "Missing closing double quote!"
"text" -> returns text (valid input, content of "...")
I think this is the right way to do this.
DOUBLE_QUOTED_CHARACTERS
:
{
int doubleQuoteMark = input.mark();
int semiColonPos = -1;
}
(
('"' WS* '"') => '"' WS* '"' { $channel = HIDDEN; }
{
RecognitionException re = new RecognitionException("Illegal empty quotes\"\"!", input);
reportError(re);
}
| '"' (options {greedy=false;}: ~('"'))+
('"'|';' { semiColonPos = input.index(); } ('\u0020'|'\t')* ('\n'|'\r'))
{
if (semiColonPos >= 0)
{
input.rewind(doubleQuoteMark);
RecognitionException re = new RecognitionException("Missing closing double quote!", input);
reportError(re);
input.consume();
}
else
{
setText(getText().substring(1, getText().length()-1));
}
}
)
;
There are some other errors in the above as well (such as the WS .. => leftovers from v3 syntactic predicates), but I am not correcting them as part of this answer, just to keep things simple. I took the hint from here.
Just to hedge against that link moving or becoming invalid after some time, here is the text quoted as-is:
Lexer actions can appear anywhere as of 4.2, not just at the end of the outermost alternative. The lexer executes the actions at the appropriate input position, according to the placement of the action within the rule. To execute a single action for a rule that has multiple alternatives, you can enclose the alts in parentheses and put the action afterwards:
END : ('endif'|'end') {System.out.println("found an end");} ;
The action conforms to the syntax of the target language. ANTLR copies the action’s contents into the generated code verbatim; there is no translation of expressions like $x.y as there is in parser actions.
Only actions within the outermost token rule are executed. In other words, if STRING calls ESC_CHAR and ESC_CHAR has an action, that action is not executed when the lexer starts matching in STRING.
I encountered this problem when my .g4 grammar imported a lexer file. Importing grammar files seems to trigger lots of undocumented shortcomings in ANTLR4, so ultimately I had to stop using import.
In my case, once I merged the lexer grammar into the parser grammar (one single .g4 file), my @init and @after parsing errors vanished. I should submit a test case + bug, at least to get this documented. I will update here once I do that.
I vaguely recall 2-3 issues with respect to importing lexer grammar into my parser that triggered undocumented behavior. Much is covered here on stackoverflow.

redefinition token type in parser

I need to implement syntax highlighting for COS (aka MUMPS),
a language that allows constructs of the form:
new (new,set,kill)
set kill=new
where 'new' and 'set' are commands, but can also be variables.
grammar cos;
Command_KILL :( ('k'|'K') | ( ('k'|'K')('i'|'I')('l'|'L')('l'|'L') ) );
Command_NEW :( ('n'|'N') | ( ('n'|'N')('e'|'E')('w'|'W') ) );
Command_SET :( ('s'|'S') | ( ('s'|'S')('e'|'E')('t'|'T') ) );
INT : [0-9]+;
ID : [a-zA-Z][a-zA-Z0-9]*;
Space: ' ';
Equal: '=';
newCommand
: Command_NEW Space ID
;
setCommand
: Command_SET Space ID Space* Equal Space* INT
;
I have a problem when an ID has the same name as a command (NEW, SET, etc.).
According to the Wikipedia page, MUMPS doesn't have reserved words:
Reserved words: None. Since MUMPS interprets source code by context, there is no need for reserved words. You may use the names of language commands as variables.
Lexer rules like Command_KILL function exactly like reserved words: they're designed to make sure no other token is generated when input "kill" is encountered. So token type Command_KILL will always be produced on "kill", even if it's intended to be an identifier. You can keep the command lexer rules if you want, but you'll have to treat them like IDs as well because you just don't know what "kill" refers to based on the token alone.
Making a MUMPS implementation in ANTLR means focusing on token usage and context rather than token types. Consider this grammar:
grammar Example;
document : (expr (EOL|EOF))+;
expr : command=ID Space+ value (Space* COMMA Space* value)* #CallExpr
| command=ID Space+ name=ID Space* Equal Space* value #SetExpr
;
value : ID | INT;
INT : [0-9]+;
ID : [a-zA-Z][a-zA-Z0-9]*;
Space : ' ';
Equal : '=';
EOL : [\r\n]+;
COMMA : ',';
Parser rule expr knows when an ID token is a command based on the layout of the entire line.
If the input tokens are ID ID, then the input is a CallExpr: the first ID is a command name and the second ID is a regular identifier.
If the input tokens are ID ID Equal ID, then the input is a SetExpr: the first ID will be a command (either "set" or something like it), the second ID is the target identifier, and the third ID is the source identifier.
Here's a Java test application followed by a test case similar to the one mentioned in your question.
import java.util.List;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
public class ExampleTest {
public static void main(String[] args) {
ANTLRInputStream input = new ANTLRInputStream(
"new new, set, kill\nset kill = new");
ExampleLexer lexer = new ExampleLexer(input);
ExampleParser parser = new ExampleParser(new CommonTokenStream(lexer));
parser.addParseListener(new ExampleBaseListener() {
@Override
public void exitCallExpr(ExampleParser.CallExprContext ctx) {
System.out.println("Call:");
System.out.printf("\tcommand = %s%n", ctx.command.getText());
List<ExampleParser.ValueContext> values = ctx.value();
if (values != null) {
for (int i = 0, count = values.size(); i < count; ++i) {
ExampleParser.ValueContext value = values.get(i);
System.out.printf("\targ[%d] = %s%n", i,
value.getText());
}
}
}
@Override
public void exitSetExpr(ExampleParser.SetExprContext ctx) {
System.out.println("Set:");
System.out.printf("\tcommand = %s%n", ctx.command.getText());
System.out.printf("\tname = %s%n", ctx.name.getText());
System.out.printf("\tvalue = %s%n", ctx.value().getText());
}
});
parser.document();
}
}
Input
new new, set, kill
set kill = new
Output
Call:
command = new
arg[0] = new
arg[1] = set
arg[2] = kill
Set:
command = set
name = kill
value = new
It's up to the calling code to determine whether a command is valid in a given context. The parser can't reasonably handle this because of MUMPS's loose approach to commands and identifiers. But it's not as bad as it may sound: you'll know which commands function like a call and which function like a set, so you'll be able to test the input from the Listener that ANTLR produces. In the code above, for example, it would be very easy to test whether "set" was the command passed to exitSetExpr.
Some MUMPS syntax may be more difficult to process than this, but the general approach will be the same: let the lexer treat commands and identifiers like IDs, and use the parser rules to determine whether an ID refers to a command or an identifier based on the context of the entire line.
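As a sketch of the kind of check the listener could delegate to (a hypothetical helper, assuming MUMPS-style one-letter command abbreviations as in the Command_SET rule above):

```java
// Hypothetical validation helper for the listener: decide whether an ID
// that appeared in command position is a "set"-style command. MUMPS allows
// commands to be abbreviated to their first letter, case-insensitively.
public class CommandCheck {
    static boolean isSetCommand(String cmd) {
        String c = cmd.toLowerCase();
        return c.equals("set") || c.equals("s");
    }
}
```

Inside exitSetExpr you would call something like isSetCommand(ctx.command.getText()) and report a semantic error when it returns false.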
