How do I disambiguate an OSC addresses from regular division by a value in ANTLR4? - antlr4

I have a grammar where I recently added syntax for a constant OSC address --- it looks like this
OSCAddressConstant: ('/' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')+)+;
Typical examples might be
/a/b/c
/Handle/SetValue
/1/Volume/Page3
Unfortunately, I discovered rather quickly that simple expressions with division: e.g.
foo = 20/10
now fail with type errors because the parser thinks that the /10 is an OSC address and so we get "integer" "Divide" "OSCAddressConstant"
What is the recommended (and hopefully) simplest way to disambiguate these other than changing the actual syntax of the OSC address, which would be a pity.
Thanks in advance
(NB - I saw a similar question about ambiguity between division and regular expression syntax but I did not understand the solution - there was a reference to the use of #member but it was unclear what to do with it - I've not seen that before and other questions about #member seem to have gone unanswered)

That OSCAddressConstant rule is rather a higher level rule, like a complex identifier, possibly qualified. Such higher level constructs should go into the parser, not the lexer.
Just like you would define a qualified identifier as:
ID: [a-zA-Z][a-zA-Z0-9]*;
DOT: '.';
qualified: ID (DOT ID)?;
you can define your OSC address as:
EID: [a-zA-Z0-9_]+;
DIV: '/';
oscAddressConstant: (DIV EID)+;
The only drawback with this approach is: when you usually ignore whitespaces this syntax will allow constructs like: / abc / 12. But if that's something you do not want handle whitespaces in the semantic phase and throw an error then.

Related

Antlr4 expression with conditional missing separator

I want to parse Smalltalk.
Normally in a sequence of expressions, they need a PERIOD token (.) in between as a separator, like the ';' in java.
An expression alone does not need the PERIOD.
Hence i match this PERIOD in the expressions rule:
expressions : expression (PERIOD expression)*;
And the different sub-rules for the specific expression do not match the PERIOD by themselves.
However, there is one special type of expression, that calls to native libraries:
<primitive: ABC>
And when this is followed by another expression, the PERIOD is surprisingly not needed.
How can such a situation be handled?
Perhaps injecting a PERIOD. From within the "primitive" rule, tell the lexer to inject a PERIOD token next. But how?
Or is there a better solution for this situation?
Frank
Perhaps something like this:
expressions
: start_expression* expression '.'?
;
start_expression
: expression '.'
| pragma
;
expression
: assignment
| pragma
;
assignment
: ID ':=' NUMBER
;
pragma
: '<' ID ':' ID '>'
;

ANTLR sub rule does not match when the parent rule is missing following token

Consider the following simple grammar using ANTLR 4.7.1.
grammar Grammar;
ID: [a-z];
DOT: '.';
LPAREN: '(';
RPAREN: ')';
SEMICOLON: ';';
LT: '<';
GT: '>';
term
: ID LT ID GT LPAREN expr RPAREN # CallExpr
| ID # Id
| LPAREN expr RPAREN # ParenExpr
;
expr
: term DOT? # PrimaryExpr
| expr bop=(GT | LT) expr # BinaryExpr
;
update : expr SEMICOLON ;
When matching the snippet a<b>(c) against the rule expr, ANTLR reports that there is an ambiguity, as the expression can be either a PrimaryExpr, or a BinaryExpr with one of its operands is also a BinaryExpr. That's expected and it is a feature of the language that we develop. Thanks to the priority, the parser prefers the former, which is what we want.
When a<b>(c); is matched against update, everything also works as expected - there is an ambiguity, but the PrimaryExpr has precedence.
However, when I try to match a<b>(c) against update, I expect that apart from reporting missing semicolon, there is the same ambiguity. What happens instead is that only the BinaryExpr rule is matched. This is an issue, because such a snippet can appear in a code being written, and it causes the editor plugin autocomplete (and other features) to work incorrectly. Any pointers to why that happens and how to fix it?
Things I've experimented with, that merely increased the confusion further:
When the BinaryExpr rule is removed, then a<b>(c) matches update through PrimaryExpr (with missing semicolon). How can ANTLR fallback to a different derivation when I remove the one it uses, yet do not report ambiguity?
When DOT? is removed from PrimaryExpr, the issue is gone.
When using custom reporting with update : expr (SEMICOLON | {notifyErrorListeners("Missing ';'");}), the issue is gone.
We can use the third option to fix the issue, it doesn't seem to break anything in our test suite, but I strongly feel I am not fixing the root cause and missing something fundamental which will bite us later.
I've found somewhat similar issue https://github.com/antlr/antlr4/issues/1545, but that one is already fixed.

SML Pattern Matching on char lists

I'm trying to pattern match on char lists in SML. I pass in a char list generated from a string as an argument to the helper function, but I get an error saying "non-constructor applied to argument in pattern". The error goes away if instead of
#"a"::#"b"::#"c"::#"d"::_::nil
I use:
#"a"::_::nil.
Any explanations regarding why this happens would be much appreciated, and work-arounds if any. I'm guessing I could use the substring function to check this specific substring in the original string, but I find pattern matching intriguing and wanted to take a shot. Also, I need specific information in the char list located somewhere later in the string, and I was wondering if my pattern could be:
#"some useless characters"::#"list of characters I want"::#"newline character"
I checked out How to do pattern matching on string in SML? but it didn't help.
fun somefunction(#"a"::#"b"::#"c"::#"d"::_::nil) = print("true\n")
| somefunction(_) = print("false\n")
If you add parentheses around the characters the problem goes away:
fun somefunction((#"a")::(#"b")::(#"c")::(#"d")::_::nil) = print("true\n")
| somefunction(_) = print("false\n")
Then somefunction (explode "abcde") prints true and somefunction (explode "abcdef") prints false.
I'm not quite sure why the SML parser had difficulties parsing the original definition. The error message suggests that is was interpreting # as a function which is applied to strings. The problem doesn't arise simply in pattern matching. SML also has difficulty with an expression like #"a"::#"b"::[]. At first it seems like a precedence problem (of # and ::) but that isn't the issue since #"a"::explode "bc" works as expected (matching your observation of how your definition worked when only one # appeared). I suspect that the problem traces to the fact that characters where added to the language with SML 97. The earlier SML 90 viewed characters as strings of length 1. Perhaps there is some sort of behind-the-scenes kludge with the way the symbol # as a part of character literals was grafted onto the language.

Why is this left-recursive and how do I fix it?

I'm learning ANTLR4 and I'm confused at one point. For a Java-like language, I'm trying to add rules for constructs like member chaining, something like that:
expr1.MethodCall(expr2).MethodCall(expr3);
I'm getting an error, saying that two of my rules are mutually left-recursive:
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| memberAccess
;
memberAccess: expression DOT (methodCall | fieldReference);
I thought I understood why the above rule combination is considered left-recursive: because memberAccess is a candidate of expression and memberAccess starts with an expression.
However, my understanding broke down when I saw (by looking at the Java example) that if I just move the contents of memberAccess to expression, I got no errors from ANTLR4 (even though it still doesn't parse what I want, seems to fall into a loop):
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| expression DOT (methodCall | fieldReference)
;
Why is the first example left-recursive but the second isn't?
And what do I have to do to actually parse the initial line?
The second is left-recursive but not mutually left recursive. ANTLR4 can eliminate left-recursive rules with an inbuilt algorithm. It cannot eliminate mutually left recursive rules. There probably exists an algorithm, but this would hardly preserve actions and semantic predicates.
For some reason, ANTLRWorks 2 was not responding when my grammar had left-recursion, causing me to (erroneously) believe that my grammar was wrong.
Compiling and testing from commandline revealed that the version with immediate left-recursion did, in fact, compile and parse correctly.
(I'm leaving this here in case anyone else is confused by the behavior of the IDE.)

Solving ambiguous input: mismatched input

I have this grammar:
grammar MkSh;
script
: (statement
| targetRule
)*
;
statement
: assignment
;
assignment
: ID '=' STRING
;
targetRule
: TARGET ':' TARGET*
;
ID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
) -> channel(HIDDEN)
;
STRING
: '\"' CHR* '\"'
;
fragment
CHR
: ('a'..'z'|'A'..'Z'|' ')
;
TARGET
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
;
and this input file:
hello="world"
target: CLASSES
When running my parser I'm getting this error:
line 3:6 mismatched input ':' expecting '='
line 3:15 mismatched input ';' expecting '='
Which is because of the parser is taking "target" as an ID instead of a TARGET. I want the parser to choose the rule based on the separator character (':' vs '=').
How can I get that to happen?
(This is my first Antlr project so I'm open to anything.)
First, you need to know that the word target is matched as a ID token and not as a TARGET token, and since you have written the rule ID before TARGET, it will always be recognized as ID by the lexer. Notice that the word target completely complies to both ID and TARGET lexer rule, (I'm going to suppose that you are writing a laguage), meaning that the target which is a keyword can also be used as an id. In the book - "The definitive ANTLR reference" there is a subtitle "Treating Keywords As Identifiers" that deals with exactely these kinds of issues. I suggest you take a look at that. Or if you prefer the quick answer the solution is to use lexer modes. Also would be better to split grammar into parser and lexer grammar.
As #cantSleepNow alludes to, you've defined a token (TARGET) that is a lexical superset of another token (ID), and then told the lexer to only tokenize a string as TARGET if it cannot be tokenized as ID. All made more obscure by the fact that ANTLR lexing rules look like ANTLR parsing rules, though they are really quite different beasts.
(Warning: writing off the top of my head without testing :-)
Your real project might be more complex, but in the possibly simplified example you posted, you could defer distinguishing the two to the parsing phase, instead of distinguishing them in the lexer:
id : TARGET
{ complain if not legal identifier (e.g., contains slashes, etc.) }
;
assignment
: id '=' STRING
;
Seems like that would solve the lexing issue, and allow you to give a more intelligent error message than "syntax error" when a user gets the syntax for ID wrong. The grammar remains ambiguous, but maybe ANTLR roulette will happen to make the choice you prefer in the ambiguous case. Of course, unambiguous grammers tend to make for languages that humans find more readable, and now you can see why the classic makefile syntax requires a newline after an assignment or target rule.

Resources