ANTLR4: How to properly create Unicode range lexer rules? - antlr4

In my grammar I'd like variables to be composed of Latin, Cyrillic, and Mandarin characters.
For this purpose I define a lexer rule, like this:
CYRILLIC_RANGE: [\u0400-\u04FF];
This is what I see in the ANTLRWorks 2.1 output when I try to run the expression against my query:
line 1:4 token recognition error at: 'н'
What am I missing?

I'm not sure what you are missing, as this seems to be working for me here. Have you tried the other range syntax? Both of these should be equivalent.
CYRILLIC_RANGE : [\u0400-\u04FF] ;
CYRILLIC_RANGE : '\u0400'..'\u04FF' ;
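As a sanity check (plain Python, independent of ANTLR), the character from the error message does fall inside that range:

```python
# 'н' is U+043D, which lies inside the Cyrillic block U+0400..U+04FF,
# so the range itself should cover it.
ch = 'н'
print(hex(ord(ch)), '\u0400' <= ch <= '\u04FF')  # → 0x43d True
```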


antlr 4 lexer rule RULE: '<TAG>'; isn't recognized as token but if fragment rule then recognized

EDIT:
I've been asked if I can provide the full grammar. I cannot, and here is why:
I cannot provide my full grammar code because it is homework and I am not allowed to disclose my solution, and I will sadly understand if my question cannot be answered because of this. I am just hoping this is a simple thing that I am just failing to understand from the documentation and that this will be enough for someone who knows antlr4 to know the answer.
This was posted lower in the original question, but to avoid frustrating potential helpers I have promoted it to the top of the post.
Disclaimer: this is homework-related.
I am trying to tokenize a piece of text for homework, and almost everything works as expected, except the following:
TIME : '<time>';
This rule used to be in my grammar. When tokenizing the piece of text, I would not see the TIME token; instead I would see a '<time>' token (which I guess ANTLR created for me somehow). But when I moved the string itself to a fragment rule and made the TIME rule point to it, like so:
fragment TIME_TAG : '<time>';
.
.
.
TIME : TIME_TAG;
Then I see the TIME token as expected. I've been searching the internet for several hours and couldn't find an answer.
Another thing that happens involves the ATHLETE rule, which is defined as:
ATHLETE : WHITESPACE* '<athlete>' WHITESPACE*;
It is recognized properly and I see the ATHLETE token, but it wasn't recognized
before I allowed the WHITESPACE* before and after the tag string.
Here is my piece of text:
World Record World Record
[1] <time> 9.86 <athlete> "Carl Lewis" <country> "United
States" <date> 25 August 1991
[2] <time> 9.69 <athlete> "Tyson Gay" <country> "United
States" <date> 20 September 2009
[3] <time> 9.82 <athlete> "Donovan Baily" <country>
"Canada" <date> 27 July 1996
[4] <time> 9.58
<athlete> "Usain Bolt"
<country> "Jamaica" <date> 16 August 2009
[5] <time> 9.79 <athlete> "Maurice Greene" <country>
"United State" <date> 16 June 1999
My task is simply to tokenize it. I am not given the definitions of the tokens; I am supposed to decide them myself. I think '<sometag>' is pretty obvious, and so are quote-wrapped strings, numbers, dates, and square-bracket-surrounded enumerations.
Thanks in advance for any help or useful knowledge.
(This will be something of a challenge, without just doing your homework, but maybe a few comments will set you on your way)
The TIME : '<time>'; rule should work just fine. ANTLR only creates implicit tokens for you from string literals used in parser rules (parser rules begin with lowercase letters and lexer rules with uppercase letters), so that shouldn't have happened with this exact example. Perhaps you had a rule name that began with a lowercase letter?
Note: If you dump your tokens, you'll see the TIME token represented like so:
[#3,5:10='<time>',<'<time>'>,2:4]
This means that ANTLR has recognized it as the TIME token (I suspect this may be the source of the confusion. It's just how ANTLR prints out the TIME token.)
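Those fields are: token index, start:stop character offsets, matched text, token type, and line:column. A small decoder for this display format (an illustrative Python sketch, assuming ANTLR's standard token-string layout):

```python
import re

# ANTLR prints tokens as: [#index,start:stop='text',<type>,line:column]
TOKEN_RE = re.compile(
    r"\[#(\d+),(\d+):(\d+)='(.*)',<(.*?)>,(\d+):(\d+)\]"
)

m = TOKEN_RE.match("[#3,5:10='<time>',<'<time>'>,2:4]")
index, start, stop, text, ttype, line, col = m.groups()
print(text, ttype, f"{line}:{col}")  # → <time> '<time>' 2:4
```

The type field shows the literal '<time>' rather than the name TIME, which is just how ANTLR displays tokens whose type has a literal name.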
As @kaby76 mentions, we usually skip whitespace or send it to a hidden channel, as we don't want to be explicit in parser rules about everywhere we allow whitespace. Either of those options causes the tokens to be ignored by the parser. A very common whitespace rule is:
WS: [ \t\r\n]+;
Since you're only tokenizing, you won't need to worry about parser rules.
Adding this lexer rule will tokenize whitespace into separate tokens for you, so you don't need to account for it in rules like ATHLETE.
You'll need to work out lexer rules for your content, but perhaps this will help you move forward.
The following implementation is a split lexer/parser grammar that "tokenizes" your input file. You can combine the two if you like. I generally split my grammars because of constraints with Antlr lexer grammars, such as when you want to "superClass" the lexer.
But without a clear problem statement, this implementation may not tokenize the input as required. All software must begin with requirements; if none were given in the assignment, then I would state exactly which token types are recognized.
In most languages, whitespace is not included in the set of token types consumed by a parser. Thus, I implemented it with "-> skip", which tells the lexer to not produce a token for the recognized input.
It's also not clear whether input such as "[1]" is to be tokenized as one token or separately. In the following implementation, I produce separate tokens for '[', '1', and ']'.
The use of "fragment" rules is likely unnecessary so I don't include any use of the feature. "fragment" rules cannot be used to produce a token in itself, and the symbol cannot be used in a parser rule. They are useful for reuse of a common RHS. You can read more about it here.
FooLexer.g4:
lexer grammar FooLexer;
Athlete : '<athlete>';
Date : '<date>';
Time : '<time>';
Country : '<country>';
StringLiteral : '"' .*? '"';
Stray : [a-zA-Z]+;
OB : '[';
CB : ']';
Number : [0-9.]+;
Ws : [ \t\r\n]+ -> skip;
FooParser.g4:
parser grammar FooParser;
options { tokenVocab = FooLexer; }
start: .* EOF;
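To get a feel for how these lexer rules carve up the input, here is a rough Python re-based sketch of the same token set (illustrative only, not how ANTLR works internally):

```python
import re

# Regex approximations of the FooLexer rules above, tried in source order.
# Caveat (assumption): Python's alternation picks the first alternative that
# matches, while ANTLR picks the longest match and breaks ties by rule order;
# for this particular token set the two behaviours coincide.
RULES = [
    ("Athlete", r"<athlete>"),
    ("Date", r"<date>"),
    ("Time", r"<time>"),
    ("Country", r"<country>"),
    ("StringLiteral", r'"[^"]*"'),
    ("Stray", r"[a-zA-Z]+"),
    ("OB", r"\["),
    ("CB", r"\]"),
    ("Number", r"[0-9.]+"),
    ("Ws", r"[ \t\r\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES))

def tokenize(text):
    # Drop Ws matches, mimicking the `-> skip` action.
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text)
            if m.lastgroup != "Ws"]

print(tokenize('[1] <time> 9.86 <athlete> "Carl Lewis"'))
```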
Tokens:
$ trparse input.txt | trtokens
Time to parse: 00:00:00.0574154
# tokens per sec = 1219.1850966813781
[#0,0:4='World',<6>,1:0]
[#1,6:11='Record',<6>,1:6]
[#2,13:17='World',<6>,1:13]
[#3,19:24='Record',<6>,1:19]
[#4,27:27='[',<7>,2:0]
[#5,28:28='1',<9>,2:1]
[#6,29:29=']',<8>,2:2]
[#7,31:36='<time>',<3>,2:4]
[#8,38:41='9.86',<9>,2:11]
[#9,43:51='<athlete>',<1>,2:16]
[#10,53:64='"Carl Lewis"',<5>,2:26]
[#11,66:74='<country>',<4>,2:39]
[#12,76:91='"United\r\nStates"',<5>,2:49]
[#13,93:98='<date>',<2>,3:8]
[#14,100:101='25',<9>,3:15]
[#15,103:108='August',<6>,3:18]
[#16,110:113='1991',<9>,3:25]
[#17,116:116='[',<7>,4:0]
[#18,117:117='2',<9>,4:1]
[#19,118:118=']',<8>,4:2]
[#20,120:125='<time>',<3>,4:4]
[#21,127:130='9.69',<9>,4:11]
[#22,132:140='<athlete>',<1>,4:16]
[#23,142:152='"Tyson Gay"',<5>,4:26]
[#24,154:162='<country>',<4>,4:38]
[#25,164:179='"United\r\nStates"',<5>,4:48]
[#26,181:186='<date>',<2>,5:8]
[#27,188:189='20',<9>,5:15]
[#28,191:199='September',<6>,5:18]
[#29,201:204='2009',<9>,5:28]
[#30,207:207='[',<7>,6:0]
[#31,208:208='3',<9>,6:1]
[#32,209:209=']',<8>,6:2]
[#33,211:216='<time>',<3>,6:4]
[#34,218:221='9.82',<9>,6:11]
[#35,223:231='<athlete>',<1>,6:16]
[#36,233:247='"Donovan Baily"',<5>,6:26]
[#37,249:257='<country>',<4>,6:42]
[#38,260:267='"Canada"',<5>,7:0]
[#39,269:274='<date>',<2>,7:9]
[#40,276:277='27',<9>,7:16]
[#41,279:282='July',<6>,7:19]
[#42,284:287='1996',<9>,7:24]
[#43,290:290='[',<7>,8:0]
[#44,291:291='4',<9>,8:1]
[#45,292:292=']',<8>,8:2]
[#46,294:299='<time>',<3>,8:4]
[#47,301:304='9.58',<9>,8:11]
[#48,308:316='<athlete>',<1>,9:1]
[#49,318:329='"Usain Bolt"',<5>,9:11]
[#50,333:341='<country>',<4>,10:1]
[#51,343:351='"Jamaica"',<5>,10:11]
[#52,353:358='<date>',<2>,10:21]
[#53,360:361='16',<9>,10:28]
[#54,363:368='August',<6>,10:31]
[#55,370:373='2009',<9>,10:38]
[#56,378:378='[',<7>,12:0]
[#57,379:379='5',<9>,12:1]
[#58,380:380=']',<8>,12:2]
[#59,382:387='<time>',<3>,12:4]
[#60,389:392='9.79',<9>,12:11]
[#61,394:402='<athlete>',<1>,12:16]
[#62,404:419='"Maurice Greene"',<5>,12:26]
[#63,421:429='<country>',<4>,12:43]
[#64,432:445='"United State"',<5>,13:0]
[#65,447:452='<date>',<2>,13:15]
[#66,454:455='16',<9>,13:22]
[#67,457:460='June',<6>,13:25]
[#68,462:465='1999',<9>,13:30]
[#69,466:465='',<-1>,13:34]

Python ANTLR4 example - Parser doesn't seem to parse correctly

To demonstrate the problem, I'm going to create a simple grammar to merely detect Python-like variables.
I create a virtual environment and install antlr4-python3-runtime in it, as mentioned in "Where can I get the runtime?".
Then, I create a PyVar.g4 file with the following content:
grammar PyVar;
program: IDENTIFIER+;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
NEWLINE: '\n' | '\r\n';
WHITESPACE: [ ]+ -> skip;
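For comparison, the IDENTIFIER rule is the usual identifier pattern and can be tried out in plain Python:

```python
import re

# The IDENTIFIER rule as an equivalent Python regex.
ident = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")
print(bool(ident.fullmatch("my_var2")))  # → True
print(bool(ident.fullmatch("2bad")))     # → False
```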
Now if I test the grammar with grun, I can see that the grammar detects the variables just fine.
Now I'm trying to write a parser in Python to do just that. I generate the Lexer and Parser, using this command:
antlr4 -Dlanguage=Python3 PyVar.g4
And they're generated with no errors.
But when I use the example provided in "How do I run the generated lexer and/or parser?", I get no output.
What am I not doing right?
There are two problems here.
1. The grammar:
In the line where I had,
program: IDENTIFIER+;
the parser will only detect one or more variables and will not detect any newlines. The output you see when running grun is produced by the lexer, which is why newlines are present in the tokens. So I had to replace it with something like this for the parser to detect newlines:
program: (IDENTIFIER | NEWLINE)+;
2. Printing the output of parser
In PyVar.py file, I created a tree with this line:
tree = parser.program()
But creating the tree doesn't print anything by itself, and I didn't know how to print it; the OP's comment on this accepted answer suggests using tree.toStringTree().
Now if we fix those two things, we can see that it works.

ANTLR4 - How to parse content between same string values

I'm trying to write an ANTLR4 parser rule that can match the content between some arbitrary string values that are the same. So far I couldn't find a way to do it.
For example, in the below input, I need a rule to extract Hello and Bye. I'm not interested in extracting xyz though.
TEXT Hello TEXT
TEXT1 Bye TEXT1
TEXT5 xyz TEXT8
As this is very similar to an XML element grammar, I tried an example for the XML parser given in the ANTLR4 XML Grammar, but it parses input like <ABC> ... </XYZ> without error, which is not what I want.
I also tried using semantic predicates without much success.
Could anyone please help with a hint on how to match content that is embedded between same strings?
Thank you!
Satheesh
Not sure how this works out performance-wise, because of the many checks the parser has to do, but you could try something like:
token
 : s=IDENTIFIER WORD* e=IDENTIFIER {$s.text.equals($e.text)}?
 ; // labels renamed: $start is reserved in ANTLR4, and the
   // predicate must compare the tokens' text, not the tokens themselves
The part between the curly braces is a validating semantic predicate. The lexer tokens are self-explanatory, I believe.
The more I think about it, though, it might be better to just tokenize the input and write your own parser that processes the tokens and acts accordingly. It depends, of course, on the complexity of the syntax.
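The "tokenize it and check it yourself" route might look like this (a hypothetical Python sketch, not ANTLR: split each line into words and keep the middle only when the first and last word match):

```python
def extract_matched(lines):
    # Keep the content between identical first and last markers on a line;
    # lines whose markers differ (e.g. TEXT5 ... TEXT8) are discarded.
    results = []
    for line in lines:
        words = line.split()
        if len(words) >= 3 and words[0] == words[-1]:
            results.append(" ".join(words[1:-1]))
    return results

print(extract_matched([
    "TEXT Hello TEXT",
    "TEXT1 Bye TEXT1",
    "TEXT5 xyz TEXT8",
]))  # → ['Hello', 'Bye']
```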

antlr4 token recognition error at: '$'

Trying to build a grammar for the PowerScript language. I split the language into several parts and everything seems to be working except the simple headers. It seems that the $ symbol can't be recognized. Could anyone help me a little? (I'll just copy the small example I'm trying.)
grammar PowerScript;
compilationUnit : Header EOF;
fragment
Header : ID '.' ID;
ID : [a-zA-Z0-9$_]+ ;
test file just contains:
$PBExportHeader$n_logversion.sru
Thanks
The compilationUnit rule is a parser rule. Parser rules cannot refer to lexer fragments. Just remove the fragment qualifier to make Header a proper lexer rule.
Update
ANTLR4 is fully Unicode capable. Just include the characters in standard Unicode escaped form:
ID : ( [a-zA-Z0-9$_] | '\uD83D\uDCB2' )+ ; // Unicode heavy Dollar sign
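For what it's worth, '\uD83D\uDCB2' is the UTF-16 surrogate pair encoding of U+1F4B2 (HEAVY DOLLAR SIGN); the pair can be recomputed in plain Python:

```python
# Recompute the UTF-16 surrogate pair for U+1F4B2 (HEAVY DOLLAR SIGN),
# which lies outside the Basic Multilingual Plane.
cp = ord('\U0001F4B2')
hi = 0xD800 + ((cp - 0x10000) >> 10)   # high surrogate
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF) # low surrogate
print(hex(hi), hex(lo))  # → 0xd83d 0xdcb2
```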

Solving ambiguous input: mismatched input

I have this grammar:
grammar MkSh;
script
: (statement
| targetRule
)*
;
statement
: assignment
;
assignment
: ID '=' STRING
;
targetRule
: TARGET ':' TARGET*
;
ID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
) -> channel(HIDDEN)
;
STRING
: '\"' CHR* '\"'
;
fragment
CHR
: ('a'..'z'|'A'..'Z'|' ')
;
TARGET
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
;
and this input file:
hello="world"
target: CLASSES
When running my parser I'm getting this error:
line 3:6 mismatched input ':' expecting '='
line 3:15 mismatched input ';' expecting '='
This is because the lexer is matching "target" as an ID instead of a TARGET. I want the parser to choose the rule based on the separator character (':' vs '=').
How can I get that to happen?
(This is my first Antlr project so I'm open to anything.)
First, you need to know that the word target is matched as an ID token and not as a TARGET token: both rules match it, and since you have written the ID rule before TARGET, the lexer will always recognize it as ID. Notice that the word target complies with both the ID and the TARGET lexer rules (I'm going to suppose that you are writing a language), meaning that target, which is a keyword, can also be used as an id. The book "The Definitive ANTLR Reference" has a section, "Treating Keywords As Identifiers", that deals with exactly these kinds of issues; I suggest you take a look at that. Or, if you prefer the quick answer, the solution is to use lexer modes. It would also be better to split the grammar into separate parser and lexer grammars.
As @cantSleepNow alludes to, you've defined a token (TARGET) that is a lexical superset of another token (ID), and then told the lexer to tokenize a string as TARGET only if it cannot be tokenized as ID. All of this is made more obscure by the fact that ANTLR lexing rules look like ANTLR parsing rules, though they are really quite different beasts.
(Warning: writing off the top of my head without testing :-)
Your real project might be more complex, but in the possibly simplified example you posted, you could defer distinguishing the two to the parsing phase, instead of distinguishing them in the lexer:
id : TARGET
{ complain if not legal identifier (e.g., contains slashes, etc.) }
;
assignment
: id '=' STRING
;
Seems like that would solve the lexing issue, and it would allow you to give a more intelligent error message than "syntax error" when a user gets the syntax for an ID wrong. The grammar remains ambiguous, but maybe ANTLR roulette will happen to make the choice you prefer in the ambiguous case. Of course, unambiguous grammars tend to make for languages that humans find more readable, and now you can see why the classic makefile syntax requires a newline after an assignment or target rule.
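The "complain if not a legal identifier" action above could boil down to a check like this (an illustrative Python version of the idea; the real action would live in the grammar's target language):

```python
import re

# The ID rule from the grammar as a Python regex.
ID_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def check_identifier(text):
    # A TARGET token used where an ID is required must not contain
    # characters like '-', '/', or '.'.
    if not ID_PATTERN.fullmatch(text):
        raise ValueError(f"{text!r} is not a legal identifier")
    return text

print(check_identifier("hello"))  # → hello
```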
