Antlr4 match whole input string or bust

I am new to Antlr4 and have been wracking my brain for some days now about a behaviour that I simply don't understand. I have the following combined grammar and expect it to fail and report an error, but it doesn't:
grammar MWE;
parse: cell EOF;
cell: WORD;
WORD: ('a'..'z')+;
If I feed it the input
a4
I expect it to not be able to parse it, because I want it to match the whole input string and not just a part of it, as signified by the EOF. But instead it reports no error (I listen for errors with an error listener implementing the IAntlrErrorListener interface) and gives me the following parse tree:
(parse (cell a) <EOF>)
Why is this?

The lexer's error recovery mechanism, when it reaches input that no lexer rule matches, is to drop a character and continue with the next one. In your case the lexer drops the 4 character, so your parser sees the equivalent of this input:
a
The solution is to instruct the lexer to create a token for the dropped character rather than ignore it, and to pass that token on to the parser, where an error will be reported. The rule takes the following form and is always added as the last rule in the grammar. If you use multiple lexer modes, a rule of this form should appear as the last rule in the default mode as well as the last rule in each extra mode.
ErrChar
: .
;
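Applied to the grammar from the question, a minimal sketch looks like this (the ErrChar rule is the only addition; with it in place the stray 4 reaches the parser as a token that no parser rule accepts, so the error listener fires):
grammar MWE;
parse: cell EOF;
cell: WORD;
WORD: ('a'..'z')+;
ErrChar: . ;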

Related

ANTLR 4: Recognises 'and' but not 'or' without a space

I'm using the ANTLR 4 plugin in IntelliJ, and I have the most bizarre bug. I'll start with the relevant parser/lexer rules:
// Take care of whitespace.
WS : [ \r\t\f\n]+ -> skip;
OTHER: . -> skip;
STRING
: '"' [A-z ]+ '"'
;
evaluate // starting rule.
: textbox? // could be an empty textbox.
;
textbox
: (row '\n')*
;
row
: ability
| ability_list
;
ability
: activated_ability
| triggered_ability
| static_ability
;
triggered_ability
: trigger_words ',' STRING
;
trigger_words
: ('when'|'whenever'|'as') whenever_triggers|'at'
;
whenever_triggers
: triggerer (('or'|'and') triggerer)* // this line has the issue.
;
triggerer
: self
;
self: '~';
I pass it this text: whenever ~ or ~, and it fails on the or, saying line 1:10 mismatched input ' or' expecting {'or', 'and'}. However, if I add a space to the whenever_triggers rule's or string (making it ' or'|'and'), it works fine.
The weirdest thing is that if I try whenever ~ and ~, it works fine even without the rule having a space in the and string. This doesn't change if I make 'and'|'or' a lexer rule either. It's just bizarre. I've confirmed this bug happens when running the 'test rig' in Antlrworks 2, so it's not just an IntelliJ thing.
[Image of the parse tree at the point of the error omitted.]
Alright, you have more or less found the answer yourself, so in this answer I will focus on explaining why the problem occurred in the first place.
First of all, for everyone stumbling upon this question: the problem was that another implicit lexer rule was defined as ' or' (note the leading whitespace). Changing that to 'or' resolved the problem.
But why was that a problem?
In order to understand that, you have to understand what ANTLR does when you write '<something>' in one of your parser rules: when compiling the grammar, it generates a new lexer rule for each of those literals. These implicit lexer rules are created before the lexer rules defined in your grammar. The lexer matches the given input into tokens, and for that it processes the lexer rules one at a time, in the order in which they were declared. Therefore it always starts with the implicit token definitions and then moves on to the topmost "real" lexer rule.
The problem is that the lexer isn't too clever about this process: once it has matched some input with the current lexer rule, it creates the corresponding token and moves on to the remaining input.
As a result, a later lexer rule that would also have matched that input (but as a different token, since it is a different lexer rule) is skipped, so the input may not end up with the expected token type; the lexer rules have, in effect, overridden one another.
In your example the conflicting rules are ' or' (Token 1) and 'or' (Token 2). Each of those implicit lexer rule declarations results in a separate lexer rule, and since the first one got matched, I assume it is declared before the second one.
Now look at your input: whenever ~ or ~. The lexer starts interpreting it, and the first rule it comes across (after the start has been matched, of course) is ' or'. It matches, because there really is a space before the or, so the input is matched as Token 1.
The parser, on the other hand, expects Token 2 at this point, so it complains about the given input (although it is really complaining about the wrong token type). Altering the input to whenever ~or ~ results in the correct interpretation.
Exactly that is the reason why you shouldn't use implicit token definitions in your grammar (unless it is really small). Create a named lexer rule for every literal and start with the most specific rules: rules that match special character sequences (e.g. keywords) should be declared before general lexer rules like ID or STRING. Rules that match any character, in order to prevent the lexer from throwing an error on unrecognized input, have to be declared last, as they would shadow every lexer rule declared after them.
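A minimal sketch of that ordering (the rule names here are only illustrative, not taken from the original grammar):
// specific tokens (keywords) first
WHEN     : 'when';
WHENEVER : 'whenever';
OR       : 'or';
AND      : 'and';
// general rules after the specific ones
STRING   : '"' [a-zA-Z ]+ '"';
WS       : [ \r\t\f\n]+ -> skip;
// catch-all rule last; placed earlier it would shadow the rules below it
OTHER    : . -> skip;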

How to perform Lexer Actions that send an Exception?

I'm new to ANTLR4 and I can't seem to figure out how to get lexer actions to perform properly.
I have a code snippet that looks for input text:
SIZE10 : [a-zA-Z]* {getText().length() <= 10}? ;
I would expect it not to match any combination of letters that is over 10 letters long; however, what it actually does is treat a 10+ letter string as two different tokens instead of nullifying the whole run of 10+ letters. How can I get this action to nullify the whole set of letters?
In addition, where can I go to see all the different token functions I can use (other than getText())? The documentation about lexer actions is really poor. In general, I'm having a hard time figuring out what resources can give me a definitive list of everything in the language. Even an entry point into the source code for me to read would be good at this point. The documentation is too general/basic for me.
EDIT: I've figured out how to send a RuntimeException, but I don't know where to get the elements needed for a proper RecognitionException.
A predicate in a rule directs the recognition process in a way that allows it to match only partial input (as in your case) or to essentially switch off a part of the grammar depending on certain conditions. In your case the SIZE10 rule is matched until the predicate returns false. Everything up to that point is then returned as a match for SIZE10. After that, lexing continues where it left off for the previous token, and if the next character is again a letter, it will again match SIZE10 for as long as the predicate says it is correct. That's a bit different from what you expected (i.e. using the predicate as an all-or-nothing switch).
However, if you instead want to match the full run of letters first and then check whether its length is <= 10, you can do this in a listener. You can hook into the exitSIZE10() event and reject the match by throwing a recognition exception.
For the functions usable in your actions, see the ANTLR API documentation. For instance, here is the one for Token, which shows you other possibilities besides getText(). In your action, consider the context you are in: in a lexer rule you deal with a Token, hence getText() etc. work on the token; in a parser rule you have a ParserRuleContext instead, which also has a getText() function, but there it works differently (collecting the text of all child contexts).
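As a rough sketch of the listener approach (this assumes the length check is moved behind a parser rule, here hypothetically called size10 in a grammar named Limits, so that a generated exit callback exists; ParseCancellationException is just one convenient unchecked exception to abort with):
import org.antlr.v4.runtime.misc.ParseCancellationException;

// Hypothetical names: LimitsBaseListener and LimitsParser would be generated
// from a grammar called Limits containing a parser rule `size10 : SIZE10 ;`.
public class Size10Checker extends LimitsBaseListener {
    @Override
    public void exitSize10(LimitsParser.Size10Context ctx) {
        // Reject the match if the full run of letters is longer than 10 characters.
        if (ctx.getText().length() > 10) {
            throw new ParseCancellationException(
                "identifier longer than 10 characters: " + ctx.getText());
        }
    }
}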

Grok pattern for JMeter

I am trying to parse the log line below:
2015-07-07T17:51:30.091+0530,857,SelectAppointment,Non HTTP response code: java.net.URISyntaxException,FALSE,8917,20,20,0,1,1,byuiepsperflg01
Now I am unable to parse Non HTTP response code: java.net.URISyntaxException into one field. Please help me build the pattern.
This is the pattern I'm using:
%{TIMESTAMP_ISO8601:log_timestamp}\,%{INT:elapsed}\,%{WORD:label}\,%{INT:respons‌ecode}\,%{WORD:responsemessage}\,%{WORD:success}\,%{SPACE:faliusemessage}\,%{INT:‌​bytes}\,%{INT:grpThreads}\,%{INT:allThreads}\,%{INT:Latency}\,%{INT:SampleCount}\‌​,%{INT:ErrorCount}\,%{WORD:Hostname}
If you paste your input and pattern into the grok debugger, it says "Compile ERROR". It might be an SO problem, but you had some weird characters in your pattern ("<200c><200b>").
The trick to building custom patterns is to start at the left side and pull one piece off at a time. With that, you would notice that this partial pattern works:
%{TIMESTAMP_ISO8601:log_timestamp},%{INT:elapsed},%{WORD:label}
but this one returns "No Matches":
%{TIMESTAMP_ISO8601:log_timestamp},%{INT:elapsed},%{WORD:label},%{INT:respons‌​ecode}
because you don't have an integer in that position.
Continue adding fields one at a time until everything you want is matched.
Note that you don't have to escape the commas.
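Following that approach with the sample line above, the finished pattern should end up along these lines; note the %{DATA:responsemessage} used for the free-text Non HTTP response code: ... field, and treat the field names as placeholders:
%{TIMESTAMP_ISO8601:log_timestamp},%{INT:elapsed},%{WORD:label},%{DATA:responsemessage},%{WORD:success},%{INT:bytes},%{INT:grpThreads},%{INT:allThreads},%{INT:Latency},%{INT:SampleCount},%{INT:ErrorCount},%{WORD:Hostname}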

Troubles with returns declaration on the first parser rule in an ANTLR4 grammar

I am using returns for my parser rules, which works for all parser rules except the first one. If the first parser rule in my grammar uses the returns declaration, ANTLR4 complains as follows:
expecting ARG_ACTION while matching a rule
If I add another parser rule above it which does not use "returns", ANTLR does not complain.
Here is a grammar reduced to the problem:
grammar FirstParserRuleReturnIssue;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
aRule returns [String s]: ID { $s = $ID.text; };
I searched for a special role of the first rule that could explain this behaviour, but did not find anything. Is it a bug? Am I missing something?
You need to place parser rules (start with a lowercase letter) before lexer rules (start with an uppercase letter) in your grammar. After encountering a lexer rule, the [ triggers a LEXER_CHAR_SET instead of ARG_ACTION, so the token stream seen by the compiler looks like you're passing a set of characters where the return value should be.
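For reference, a sketch of the reduced grammar with the rules reordered as described (parser rule first, lexer rule after it) compiles without the error:
grammar FirstParserRuleReturnIssue;
aRule returns [String s]: ID { $s = $ID.text; };
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;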

Parsec: Consume all input

One common problem I have with Parsec is that it tends to ignore invalid input if it occurs in the "right" place.
As a concrete example, suppose we have integer :: Parser Int, and I write
expression = sepBy integer (char '+')
(Ignore whitespace issues for a moment.)
This correctly parses something like "123+456+789". However, if I feed it "123+456-789", it merrily ignores the illegal "-" character and the trailing part of the expression; I actually wanted an error message telling me about the invalid input, not just having it silently ignore that part.
I understand why this happens; what I'm not sure about is how to fix it. What is the general method for designing parsers that consume all supplied input and succeed only if all of it is a valid expression?
It's actually pretty simple--just ensure it's followed by eof:
parse (expression <* eof) "<interactive>" "123+456-789"
eof matches the end of the input, even if the input is just a string and not a file.
Obviously, this only makes sense at the top level of your parser.
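A small self-contained sketch of the same idea (the integer parser here is just a stand-in for the one assumed in the question):
import Text.Parsec
import Text.Parsec.String (Parser)

-- Stand-in for the question's integer parser.
integer :: Parser Int
integer = read <$> many1 digit

expression :: Parser [Int]
expression = sepBy integer (char '+')

-- With <* eof, "123+456-789" now yields a parse error at the '-'
-- instead of silently succeeding on the prefix "123+456".
main :: IO ()
main = print (parse (expression <* eof) "<interactive>" "123+456-789")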
