Localize token for different languages - antlr4

I am developing a new grammar with ANTLR. It supports basic math and boolean expressions such as "4 equals (2 minuses 2)", "true", and "false". All operators are written in natural language. I want to support other natural languages as well. For example, "4 equals 4" is "4 ist 4" in German.
What is the best practice to localize tokens and/or expressions?

In our project we follow this structure. There are files FooLexerBase.g, FooLexerLang1.g, FooLexerLang2.g, and so on. The base grammar defines the common token rules. Language-dependent tokens are not defined in the base, but the base can refer to them. They are defined in the language-specific grammars, each of which imports the base.
So, basically it looks something like this:
FooLexerBase.g:
lexer grammar FooLexerBase;
...
FLOATING_POINT
: DIGIT+ EXPONENT
| DIGIT+ DECIMAL_SEP DIGIT* EXPONENT?
| DECIMAL_SEP DIGIT+ EXPONENT?;
...
DIGIT and EXPONENT are defined in the base, since they are common, while DECIMAL_SEP is language-specific.
For example, FooLexerGerman.g looks like this:
lexer grammar FooLexerGerman;
import base = FooLexerBase;
...
fragment
DECIMAL_SEP: ',';
...
Finally, parser grammar is common for all languages. It is defined this way:
parser grammar FooParser;
options {
tokenVocab = FooLexerBase;
}
...
It is important not to run ANTLR on FooLexerBase itself, but to run it on all the other grammars.
At runtime you build a parser and pass an appropriate lexer as argument to the constructor. I guess it looks more or less the same in any programming language (we use Java).
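To tie this layout back to the operator words in the question, here is a rough sketch of what a language-specific lexer might add. The token names EQUALS, MINUS, TRUE and FALSE and the German spellings are assumptions for illustration, not taken from our project, and the import is written in its plain form without an alias.

```antlr
// FooLexerGerman.g, extended with operator keywords (hypothetical tokens)
lexer grammar FooLexerGerman;
import FooLexerBase;

// Same token names in every language grammar; only the matched text differs.
EQUALS : 'ist';
MINUS  : 'minus';
TRUE   : 'wahr';
FALSE  : 'falsch';

// Language-specific piece referenced by FLOATING_POINT in the base.
fragment DECIMAL_SEP : ',';
```

An English grammar would define the same names with 'equals', 'true', 'false' and a '.' separator, so the shared parser only ever refers to EQUALS, MINUS and so on. It is worth checking the generated .tokens files to confirm that every lexer assigns the same type number to each name, since the parser relies on that.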

Related

How do I disambiguate an OSC address from regular division by a value in ANTLR4?

I have a grammar where I recently added syntax for a constant OSC address --- it looks like this
OSCAddressConstant: ('/' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_')+)+;
Typical examples might be
/a/b/c
/Handle/SetValue
/1/Volume/Page3
Unfortunately, I discovered rather quickly that simple expressions with division: e.g.
foo = 20/10
now fail with type errors because the parser thinks that the /10 is an OSC address and so we get "integer" "Divide" "OSCAddressConstant"
What is the recommended (and hopefully simplest) way to disambiguate these, other than changing the actual syntax of the OSC address, which would be a pity?
Thanks in advance
(NB - I saw a similar question about ambiguity between division and regular expression syntax but I did not understand the solution - there was a reference to the use of @member but it was unclear what to do with it - I've not seen that before and other questions about @member seem to have gone unanswered)
That OSCAddressConstant rule is rather a higher level rule, like a complex identifier, possibly qualified. Such higher level constructs should go into the parser, not the lexer.
Just like you would define a qualified identifier as:
ID: [a-zA-Z][a-zA-Z0-9]*;
DOT: '.';
qualified: ID (DOT ID)?;
you can define your OSC address as:
EID: [a-zA-Z0-9_]+;
DIV: '/';
oscAddressConstant: (DIV EID)+;
The only drawback of this approach is that, if you normally skip whitespace, this syntax also accepts constructs like: / abc / 12. If that is something you do not want, check for whitespace in the semantic phase and report an error there.
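A minimal sketch of how the parser-rule approach could sit inside an expression grammar. The grammar name OscSketch, the rules assignment, expr and segment, and the INT/ID token split are assumptions made for illustration; using separate INT and ID tokens instead of a single EID avoids having two lexer rules that both match 10.

```antlr
grammar OscSketch;

assignment : ID '=' expr EOF ;

expr
    : expr DIV expr          // 20/10 is a division
    | oscAddressConstant     // /a/b/c is an address
    | INT
    | ID
    ;

// One or more '/'-prefixed segments, assembled by the parser, not the lexer.
oscAddressConstant : (DIV segment)+ ;
segment            : INT | ID ;

DIV : '/' ;
INT : [0-9]+ ;
ID  : [a-zA-Z_] [a-zA-Z0-9_]* ;
WS  : [ \t\r\n]+ -> skip ;
```

With this split, foo = 20/10 lexes as ID, '=', INT, DIV, INT and can only be read as a division, while /1/Volume/Page3 starts with DIV and so can only begin an address. Two limitations of the sketch: a segment that starts with a digit and continues with letters (such as 3abc) lexes as two tokens and is not accepted, and an address directly followed by a division is ambiguous, where the loop in oscAddressConstant should simply keep consuming segments.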

antlr 4 lexer rule RULE: '<TAG>'; isn't recognized as token but if fragment rule then recognized

EDIT:
I've been asked if I can provide the full grammar. I cannot and here is the reason why:
I cannot provide my full grammar code because it is homework and I am not allowed to disclose my solution, and I will sadly understand if my question cannot be answered because of this. I am just hoping this is a simple thing that I am just failing to understand from the documentation and that this will be enough for someone who knows antlr4 to know the answer.
This was posted in the original answer but to prevent frustration from possible helpers I now promote it to the top of the post.
Disclaimer: this is homework related.
I am trying to tokenize a piece of text for homework, and almost everything works as expected, except the following:
TIME : '<time>';
This rule used to be in my grammar. When tokenizing the piece of text, I would not see the TIME token, instead I would see a '<time>' token (which I guess Antlr created for me somehow). But when I moved the string itself to a fragment rule and made the TIME rule point to it, like so:
fragment TIME_TAG : '<time>';
.
.
.
TIME : TIME_TAG;
Then I see the TIME token as expected. I've been searching the internet for several hours and couldn't find an answer.
Another thing is that the ATHLETE rule, which is defined as:
ATHLETE : WHITESPACE* '<athlete>' WHITESPACE*;
is also recognized properly and I see the ATHLETE token, but it wasn't recognized when I didn't allow the WHITESPACE* before and after the tag string.
Here is my piece of text:
World Record World Record
[1] <time> 9.86 <athlete> "Carl Lewis" <country> "United
States" <date> 25 August 1991
[2] <time> 9.69 <athlete> "Tyson Gay" <country> "United
States" <date> 20 September 2009
[3] <time> 9.82 <athlete> "Donovan Baily" <country>
"Canada" <date> 27 July 1996
[4] <time> 9.58
<athlete> "Usain Bolt"
<country> "Jamaica" <date> 16 August 2009
[5] <time> 9.79 <athlete> "Maurice Greene" <country>
"United State" <date> 16 June 1999
My task is simply to tokenize it. I am not being given the definitions of tokens, and I am supposed to decide that myself. I think '<sometag>' is pretty obvious, so are '"' wrapped strings, numbers, dates, and square-bracket surrounded enumerations.
Thanks in advance to any help or any useful knowledge.
(This will be something of a challenge, without just doing your homework, but maybe a few comments will set you on your way)
The TIME : '<time>'; rule should work just fine. ANTLR only creates tokens for you for string literals that appear in parser rules. Parser rules begin with a lowercase letter and lexer rules with an uppercase letter, so this wouldn't have been the case with this exact example (perhaps you had a rule name that began with a lowercase letter?).
Note: If you dump your tokens, you'll see the TIME token represented like so:
[@3,5:10='<time>',<'<time>'>,2:4]
This means that ANTLR has recognized it as the TIME token (I suspect this may be the source of the confusion. It's just how ANTLR prints out the TIME token.)
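For contrast, this is roughly the situation in which ANTLR does invent a token for you. The grammar name ImplicitTokenSketch and the rules record and NUMBER are made up for illustration.

```antlr
grammar ImplicitTokenSketch;

// '<time>' is used as a literal inside a parser rule and no lexer rule
// matches exactly that literal, so ANTLR defines an implicit token for it.
record : '<time>' NUMBER ;

NUMBER : [0-9]+ ('.' [0-9]+)? ;
WS     : [ \t\r\n]+ -> skip ;
```

If a lexer rule such as TIME : '<time>'; is added to this grammar, the literal in record refers to TIME instead of an implicit token, although a token dump may still show the literal text as the display name.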
As @kaby76 mentions, we usually skip whitespace or throw it into a hidden channel as we don't want to be explicit in parser rules about everywhere we allow whitespace. Either of those options causes them to be ignored by the parser. A very common Whitespace rule is:
WS: [ \t\r\n]+;
Since you're only tokenizing, you won't need to worry about parser rules.
Adding this Lexer rule will tokenize whitespace into separate tokens for you so you don't need to account for it in rules like ATHLETE.
You'll need to work out the Lexer rules for your content, but perhaps this will help you move forward.
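If you later add parser rules, the two ways of hiding whitespace from the parser mentioned above look roughly like this. It is a sketch of the standard lexer commands; pick one of the two rules, not both.

```antlr
// Option 1: discard whitespace; no token is produced at all.
WS : [ \t\r\n]+ -> skip ;

// Option 2 (instead of option 1): keep the token, but route it to the
// hidden channel so it stays in the token stream without reaching the parser.
// WS : [ \t\r\n]+ -> channel(HIDDEN) ;
```

Either way, rules like ATHLETE no longer need WHITESPACE* around the tag text.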
The following implementation is a split lexer/parser grammar that "tokenizes" your input file. You can combine the two if you like. I generally split my grammars because of constraints with Antlr lexer grammars, such as when you want to "superClass" the lexer.
But, without a clear problem statement, this implementation may not tokenize the input as required. All software must begin with requirements. If none were given in the assignment, then I would state exactly which token types are recognized.
In most languages, whitespace is not included in the set of token types consumed by a parser. Thus, I implemented it with "-> skip", which tells the lexer to not produce a token for the recognized input.
It's also not clear whether input such as "[1]" is to be tokenized as one token or separately. In the following implementation, I produce separate tokens for '[', '1', and ']'.
The use of "fragment" rules is likely unnecessary, so I don't use the feature. A "fragment" rule cannot produce a token by itself, and its name cannot be used in a parser rule. Fragments are useful for reusing a common RHS. You can read more about it here.
FooLexer.g4:
lexer grammar FooLexer;
Athlete : '<athlete>';
Date : '<date>';
Time : '<time>';
Country : '<country>';
StringLiteral : '"' .*? '"';
Stray : [a-zA-Z]+;
OB : '[';
CB : ']';
Number : [0-9.]+;
Ws : [ \t\r\n]+ -> skip;
FooParser.g4:
parser grammar FooParser;
options { tokenVocab = FooLexer; }
start: .* EOF;
Tokens:
$ trparse input.txt | trtokens
Time to parse: 00:00:00.0574154
# tokens per sec = 1219.1850966813781
[@0,0:4='World',<6>,1:0]
[@1,6:11='Record',<6>,1:6]
[@2,13:17='World',<6>,1:13]
[@3,19:24='Record',<6>,1:19]
[@4,27:27='[',<7>,2:0]
[@5,28:28='1',<9>,2:1]
[@6,29:29=']',<8>,2:2]
[@7,31:36='<time>',<3>,2:4]
[@8,38:41='9.86',<9>,2:11]
[@9,43:51='<athlete>',<1>,2:16]
[@10,53:64='"Carl Lewis"',<5>,2:26]
[@11,66:74='<country>',<4>,2:39]
[@12,76:91='"United\r\nStates"',<5>,2:49]
[@13,93:98='<date>',<2>,3:8]
[@14,100:101='25',<9>,3:15]
[@15,103:108='August',<6>,3:18]
[@16,110:113='1991',<9>,3:25]
[@17,116:116='[',<7>,4:0]
[@18,117:117='2',<9>,4:1]
[@19,118:118=']',<8>,4:2]
[@20,120:125='<time>',<3>,4:4]
[@21,127:130='9.69',<9>,4:11]
[@22,132:140='<athlete>',<1>,4:16]
[@23,142:152='"Tyson Gay"',<5>,4:26]
[@24,154:162='<country>',<4>,4:38]
[@25,164:179='"United\r\nStates"',<5>,4:48]
[@26,181:186='<date>',<2>,5:8]
[@27,188:189='20',<9>,5:15]
[@28,191:199='September',<6>,5:18]
[@29,201:204='2009',<9>,5:28]
[@30,207:207='[',<7>,6:0]
[@31,208:208='3',<9>,6:1]
[@32,209:209=']',<8>,6:2]
[@33,211:216='<time>',<3>,6:4]
[@34,218:221='9.82',<9>,6:11]
[@35,223:231='<athlete>',<1>,6:16]
[@36,233:247='"Donovan Baily"',<5>,6:26]
[@37,249:257='<country>',<4>,6:42]
[@38,260:267='"Canada"',<5>,7:0]
[@39,269:274='<date>',<2>,7:9]
[@40,276:277='27',<9>,7:16]
[@41,279:282='July',<6>,7:19]
[@42,284:287='1996',<9>,7:24]
[@43,290:290='[',<7>,8:0]
[@44,291:291='4',<9>,8:1]
[@45,292:292=']',<8>,8:2]
[@46,294:299='<time>',<3>,8:4]
[@47,301:304='9.58',<9>,8:11]
[@48,308:316='<athlete>',<1>,9:1]
[@49,318:329='"Usain Bolt"',<5>,9:11]
[@50,333:341='<country>',<4>,10:1]
[@51,343:351='"Jamaica"',<5>,10:11]
[@52,353:358='<date>',<2>,10:21]
[@53,360:361='16',<9>,10:28]
[@54,363:368='August',<6>,10:31]
[@55,370:373='2009',<9>,10:38]
[@56,378:378='[',<7>,12:0]
[@57,379:379='5',<9>,12:1]
[@58,380:380=']',<8>,12:2]
[@59,382:387='<time>',<3>,12:4]
[@60,389:392='9.79',<9>,12:11]
[@61,394:402='<athlete>',<1>,12:16]
[@62,404:419='"Maurice Greene"',<5>,12:26]
[@63,421:429='<country>',<4>,12:43]
[@64,432:445='"United State"',<5>,13:0]
[@65,447:452='<date>',<2>,13:15]
[@66,454:455='16',<9>,13:22]
[@67,457:460='June',<6>,13:25]
[@68,462:465='1999',<9>,13:30]
[@69,466:465='',<-1>,13:34]

Natty converting from antlr3 to antlr 4

I'm new to ANTLR and have plenty of problems with syntactic predicates.
I've been trying to convert this grammar, which is part of the Natty grammar, so that I can parse it with ANTLR4, but I got really confused about how to change it in a meaningful way.
date_time
: (
(date)=>date (date_time_separator explicit_time)?
| explicit_time (time_date_separator date)?
) -> ^(DATE_TIME date? explicit_time?)
| relative_time -> ^(DATE_TIME relative_time?)
;
Syntactic predicates and rewrite rules are no longer supported in ANTLR4. ANTLR4's parsing algorithm should be powerful enough that syntactic predicates are not needed, and if you want to traverse the parse tree, have a look at these links:
ANTLR4 visitor pattern on simple arithmetic example
https://github.com/antlr/antlr4/blob/master/doc/tree-matching.md
So, the rule you posted would look like this in ANTLR4:
date_time
: date ( date_time_separator explicit_time )?
| explicit_time ( time_date_separator date )?
| relative_time
;
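Since the rewrite operators are gone, one way to still tell the alternatives apart when walking the parse tree is to label them. This is only a sketch; the label names DateFirst, TimeFirst and RelativeOnly are my own and not part of Natty.

```antlr
date_time
    : date ( date_time_separator explicit_time )?   # DateFirst
    | explicit_time ( time_date_separator date )?   # TimeFirst
    | relative_time                                 # RelativeOnly
    ;
```

ANTLR4 then generates a separate context class and listener/visitor method per label (for example visitDateFirst), which is where you would build what ^(DATE_TIME ...) used to build. Note that either all alternatives of a rule are labeled or none of them.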

ANTLR : Have errors with Fragment

The error is:
mismatched input 'elseState' expecting RULE_TOKEN_REF
Can someone explain why I get this error and how to fix it?
Your help will be appreciated.
Fragments are reserved for lexer rule definitions and cannot be used for parser rules; you don't need one in your case.
A fragment is used to split up complex lexer rules and allow reuse without producing a dedicated token, e.g.:
NUMBER : DIGIT+;
ID : LETTER (LETTER|DIGIT)*;
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
In these lexer rules, I don't want LETTER and DIGIT as tokens, but I do want to use and reuse them in other lexer rules (NUMBER and ID), so I mark them as fragments. It makes the lexer more readable and easier to maintain.
You can read more details here: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules
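Your grammar was not posted, so the following is only a guess at the shape of the fix: drop the fragment keyword from the parser rule and keep it for lexer helpers. Every name here (FragmentSketch, ifState, statement, IF, ELSE) is hypothetical except elseState, which comes from your error message.

```antlr
grammar FragmentSketch;

// Parser rules: lowercase names, never marked with 'fragment'.
ifState   : IF ID statement elseState? ;
elseState : ELSE statement ;
statement : ID ';' ;

// Lexer rules: keywords come before ID so they win on equal-length matches.
IF   : 'if' ;
ELSE : 'else' ;
ID   : LETTER (LETTER | DIGIT)* ;
WS   : [ \t\r\n]+ -> skip ;

// Helpers reused by ID; they never become tokens themselves.
fragment LETTER : [a-zA-Z] ;
fragment DIGIT  : [0-9] ;
```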

Why is this left-recursive and how do I fix it?

I'm learning ANTLR4 and I'm confused at one point. For a Java-like language, I'm trying to add rules for constructs like member chaining, something like that:
expr1.MethodCall(expr2).MethodCall(expr3);
I'm getting an error, saying that two of my rules are mutually left-recursive:
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| memberAccess
;
memberAccess: expression DOT (methodCall | fieldReference);
I thought I understood why the above rule combination is considered left-recursive: because memberAccess is a candidate of expression and memberAccess starts with an expression.
However, my understanding broke down when I saw (by looking at the Java example) that if I just move the contents of memberAccess to expression, I got no errors from ANTLR4 (even though it still doesn't parse what I want, seems to fall into a loop):
expression
: literal
| variableReference
| LPAREN expression RPAREN
| statementExpression
| expression DOT (methodCall | fieldReference)
;
Why is the first example left-recursive but the second isn't?
And what do I have to do to actually parse the initial line?
The second is left-recursive, but not mutually left-recursive. ANTLR4 can eliminate direct left recursion with a built-in algorithm, but it cannot eliminate mutually left-recursive rules. An algorithm for that probably exists, but it could hardly preserve actions and semantic predicates.
For some reason, ANTLRWorks 2 was not responding when my grammar had left-recursion, causing me to (erroneously) believe that my grammar was wrong.
Compiling and testing from commandline revealed that the version with immediate left-recursion did, in fact, compile and parse correctly.
(I'm leaving this here in case anyone else is confused by the behavior of the IDE.)
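For reference, a cut-down sketch of the directly left-recursive form that handles the chained call from the question. The grammar name ChainSketch, the argumentList rule and the token definitions are assumptions for illustration, and the call and field alternatives are inlined rather than kept as separate methodCall and fieldReference rules.

```antlr
grammar ChainSketch;

statement : expression ';' EOF ;

// Every left-recursive alternative starts with 'expression' itself,
// so the recursion is direct and ANTLR4 can rewrite it.
expression
    : expression DOT ID LPAREN argumentList? RPAREN   // method call on a receiver
    | expression DOT ID                               // field reference
    | LPAREN expression RPAREN
    | ID
    | INT
    ;

argumentList : expression (COMMA expression)* ;

DOT    : '.' ;
COMMA  : ',' ;
LPAREN : '(' ;
RPAREN : ')' ;
INT    : [0-9]+ ;
ID     : [a-zA-Z_] [a-zA-Z0-9_]* ;
WS     : [ \t\r\n]+ -> skip ;
```

An input such as expr1.MethodCall(expr2).MethodCall(expr3); should come out as nested method-call alternatives, with the innermost expression being expr1.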

Resources