antlr 4 lexer rule RULE: '<TAG>'; isn't recognized as token but if fragment rule then recognized - antlr4

EDIT:
I've been asked if I can provide the full grammar. I cannot and here is the reason why:
I cannot provide my full grammar code because it is homework and I am not allowed to disclose my solution, and I will sadly understand if my question cannot be answered because of this. I am just hoping this is a simple thing that I am just failing to understand from the documentation and that this will be enough for someone who knows antlr4 to know the answer.
This was posted further down in the original post, but to prevent frustration for potential helpers I have now promoted it to the top.
Disclaimer: this is homework related.
I am trying to tokenize a piece of text for homework, and almost everything works as expected, except the following:
TIME : '<time>';
This rule used to be in my grammar. When tokenizing the piece of text, I would not see the TIME token, instead I would see a '<time>' token (which I guess Antlr created for me somehow). But when I moved the string itself to a fragment rule and made the TIME rule point to it, like so:
fragment TIME_TAG : '<time>';
.
.
.
TIME : TIME_TAG;
Then I see the TIME token as expected. I've been searching the internet for several hours and couldn't find an answer.
Another thing that happens: the ATHLETE rule, which is defined as:
ATHLETE : WHITESPACE* '<athlete>' WHITESPACE*;
is also recognized properly and I see the ATHLETE token, but it was not recognized when I did not allow the WHITESPACE* before and after the tag string.
Here is my piece of text:
World Record World Record
[1] <time> 9.86 <athlete> "Carl Lewis" <country> "United
States" <date> 25 August 1991
[2] <time> 9.69 <athlete> "Tyson Gay" <country> "United
States" <date> 20 September 2009
[3] <time> 9.82 <athlete> "Donovan Baily" <country>
"Canada" <date> 27 July 1996
[4] <time> 9.58
<athlete> "Usain Bolt"
<country> "Jamaica" <date> 16 August 2009
[5] <time> 9.79 <athlete> "Maurice Greene" <country>
"United State" <date> 16 June 1999
My task is simply to tokenize it. I am not given the definitions of the tokens; I am supposed to decide those myself. I think '<sometag>' is a pretty obvious token, and so are double-quote-wrapped strings, numbers, dates, and square-bracket-surrounded enumerations.
Thanks in advance for any help or any useful knowledge.

(This will be something of a challenge, without just doing your homework, but maybe a few comments will set you on your way)
The TIME : '<time>'; rule should work just fine. ANTLR only creates implicit tokens for you from string literals used in parser rules. (Parser rules begin with lowercase letters and lexer rules with uppercase letters, so that could not have happened with this exact example; perhaps you had a rule name that began with a lowercase letter?)
Note: If you dump your tokens, you'll see the TIME token represented like so:
[#3,5:10='<time>',<'<time>'>,2:4]
This means that ANTLR has recognized it as the TIME token. (I suspect this is the source of the confusion; it's just how ANTLR prints out the TIME token.)
As #kaby76 mentions, we usually skip whitespace or send it to a hidden channel, since we don't want to be explicit in parser rules about everywhere we allow whitespace. Either of those options causes it to be ignored by the parser. A very common whitespace rule is:
WS : [ \t\r\n]+;
Since you're only tokenizing, you won't need to worry about parser rules.
Adding this Lexer rule will tokenize whitespace into separate tokens for you so you don't need to account for it in rules like ATHLETE.
You'll need to work out lexer rules for your content, but perhaps this will help you move forward.
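For reference, the hidden-channel variant mentioned above is a one-line change (a minimal sketch; the whitespace tokens still exist in the stream, but the parser ignores them):
WS : [ \t\r\n]+ -> channel(HIDDEN);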

The following implementation is a split lexer/parser grammar that "tokenizes" your input file. You can combine the two if you like. I generally split my grammars because of constraints with ANTLR lexer grammars, such as when you want to set a "superClass" for the lexer.
But, without a clear problem statement, this implementation may not tokenize the input as required. All software must begin with requirements. If none were given in the assignment, then I would state exactly which token types are recognized.
In most languages, whitespace is not included in the set of token types consumed by a parser. Thus, I implemented it with "-> skip", which tells the lexer to not produce a token for the recognized input.
It's also not clear whether input such as "[1]" is to be tokenized as one token or separately. In the following implementation, I produce separate tokens for '[', '1', and ']'.
The use of "fragment" rules is likely unnecessary, so I don't use the feature here. A "fragment" rule cannot produce a token by itself, and its symbol cannot be used in a parser rule. Fragments are useful for reusing a common right-hand side. You can read more about them here.
FooLexer.g4:
lexer grammar FooLexer;
Athlete : '<athlete>';
Date : '<date>';
Time : '<time>';
Country : '<country>';
StringLiteral : '"' .*? '"';
Stray : [a-zA-Z]+;
OB : '[';
CB : ']';
Number : [0-9.]+;
Ws : [ \t\r\n]+ -> skip;
FooParser.g4:
parser grammar FooParser;
options { tokenVocab = FooLexer; }
start: .* EOF;
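If you don't have the trparse/trtokens tools used below, the stock ANTLR TestRig produces a similar token dump (this assumes the antlr4 and grun shell aliases from the ANTLR getting-started documentation):
$ antlr4 FooLexer.g4 FooParser.g4
$ javac Foo*.java
$ grun Foo start -tokens input.txt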
Tokens:
$ trparse input.txt | trtokens
Time to parse: 00:00:00.0574154
# tokens per sec = 1219.1850966813781
[#0,0:4='World',<6>,1:0]
[#1,6:11='Record',<6>,1:6]
[#2,13:17='World',<6>,1:13]
[#3,19:24='Record',<6>,1:19]
[#4,27:27='[',<7>,2:0]
[#5,28:28='1',<9>,2:1]
[#6,29:29=']',<8>,2:2]
[#7,31:36='<time>',<3>,2:4]
[#8,38:41='9.86',<9>,2:11]
[#9,43:51='<athlete>',<1>,2:16]
[#10,53:64='"Carl Lewis"',<5>,2:26]
[#11,66:74='<country>',<4>,2:39]
[#12,76:91='"United\r\nStates"',<5>,2:49]
[#13,93:98='<date>',<2>,3:8]
[#14,100:101='25',<9>,3:15]
[#15,103:108='August',<6>,3:18]
[#16,110:113='1991',<9>,3:25]
[#17,116:116='[',<7>,4:0]
[#18,117:117='2',<9>,4:1]
[#19,118:118=']',<8>,4:2]
[#20,120:125='<time>',<3>,4:4]
[#21,127:130='9.69',<9>,4:11]
[#22,132:140='<athlete>',<1>,4:16]
[#23,142:152='"Tyson Gay"',<5>,4:26]
[#24,154:162='<country>',<4>,4:38]
[#25,164:179='"United\r\nStates"',<5>,4:48]
[#26,181:186='<date>',<2>,5:8]
[#27,188:189='20',<9>,5:15]
[#28,191:199='September',<6>,5:18]
[#29,201:204='2009',<9>,5:28]
[#30,207:207='[',<7>,6:0]
[#31,208:208='3',<9>,6:1]
[#32,209:209=']',<8>,6:2]
[#33,211:216='<time>',<3>,6:4]
[#34,218:221='9.82',<9>,6:11]
[#35,223:231='<athlete>',<1>,6:16]
[#36,233:247='"Donovan Baily"',<5>,6:26]
[#37,249:257='<country>',<4>,6:42]
[#38,260:267='"Canada"',<5>,7:0]
[#39,269:274='<date>',<2>,7:9]
[#40,276:277='27',<9>,7:16]
[#41,279:282='July',<6>,7:19]
[#42,284:287='1996',<9>,7:24]
[#43,290:290='[',<7>,8:0]
[#44,291:291='4',<9>,8:1]
[#45,292:292=']',<8>,8:2]
[#46,294:299='<time>',<3>,8:4]
[#47,301:304='9.58',<9>,8:11]
[#48,308:316='<athlete>',<1>,9:1]
[#49,318:329='"Usain Bolt"',<5>,9:11]
[#50,333:341='<country>',<4>,10:1]
[#51,343:351='"Jamaica"',<5>,10:11]
[#52,353:358='<date>',<2>,10:21]
[#53,360:361='16',<9>,10:28]
[#54,363:368='August',<6>,10:31]
[#55,370:373='2009',<9>,10:38]
[#56,378:378='[',<7>,12:0]
[#57,379:379='5',<9>,12:1]
[#58,380:380=']',<8>,12:2]
[#59,382:387='<time>',<3>,12:4]
[#60,389:392='9.79',<9>,12:11]
[#61,394:402='<athlete>',<1>,12:16]
[#62,404:419='"Maurice Greene"',<5>,12:26]
[#63,421:429='<country>',<4>,12:43]
[#64,432:445='"United State"',<5>,13:0]
[#65,447:452='<date>',<2>,13:15]
[#66,454:455='16',<9>,13:22]
[#67,457:460='June',<6>,13:25]
[#68,462:465='1999',<9>,13:30]
[#69,466:465='',<-1>,13:34]

Related

Finding the start of an expression when the end of the previous one is difficult to express

I've got a file format that looks a little like this:
blockA {
    uniqueName42 -> uniqueName aWord1 anotherWord "Some text"
    anotherUniqueName -> uniqueName23 aWord2
    blockB {
        thing -> anotherThing
    }
}
Lots more blocks with arbitrary nesting levels.
The lines with the arrow in them define relationships between two things. Each relationship has some optional metadata (multi-word quoted or single word unquoted).
The challenge I'm having is that, because there can be an arbitrary number of metadata items in a relationship, my parser is treating anotherUniqueName as a metadata item of the first relationship rather than the start of the second relationship.
You can see this in the resulting parse tree: the parser recognises only one relationshipDeclaration when a second one should start at the StringLiteral anotherUniqueName.
The parser looks a bit like this:
block
: BLOCK LBRACE relationshipDeclaration* RBRACE
;
relationshipDeclaration
: StringLiteral? ARROW StringLiteral StringLiteral*
;
I'm hoping to avoid lexical modes because the fact that these relationships can appear almost anywhere in the file will leave me up to my eyes in NL+ :-(
Would appreciate any ideas on what options I have. Is there a way to look ahead, spot the '->', for example?
Thanks a million.
Your example certainly looks like the NL is what signals the end of a relationshipDeclaration.
If that's the case, then you'll need NLs to be tokens available to your parser rules so the parser can recognize the end.
As you've alluded to, you could potentially use -> to trigger a different Lexer Mode and generate different tokens for content between the -> and the NL and then use those tokens in your parse rule for relationshipDeclaration.
If it's as simple as your snippet indicates, then just capturing RD_StringLiteral tokens in that lexical mode would probably be easier to deal with than handling all the places you might need to allow for NL. This would be pretty simple as lexer modes go (see the sketch after the rule below).
(BTW you can use x+ to get the same effect as x x*)
relationshipDeclaration
: StringLiteral? ARROW RD_StringLiteral+
;
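A minimal sketch of what that lexer mode could look like (modes require a split lexer grammar, and the token shapes here are assumptions rather than taken from your grammar):
lexer grammar RelLexer;
// default mode (only the relevant rules shown): '->' enters the mode
ARROW : '->' -> pushMode(RD);
StringLiteral : [a-zA-Z0-9]+;
LBRACE : '{';
RBRACE : '}';
WS : [ \t\r\n]+ -> skip;
mode RD;
// a newline ends the relationship, so fall back to the default mode
RD_NL : '\r'? '\n' -> popMode, skip;
RD_StringLiteral : '"' .*? '"' | [a-zA-Z0-9]+;
RD_WS : [ \t]+ -> skip;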
I don't think there's a third option for dealing with this.

Arbitrary lookaheads in PLY

I am trying to parse a config, which would be translated to a structured form. This new form requires that comments within the original config be preserved. The parsing tool is PLY. I am running into an issue with my current approach, which I will describe in detail below, with links to code as well. The config file is going to contain multiple config blocks, each of which is of the following format:
<optional comments>
start_of_line request_stmts(one or more)
indent reply_stmts (zero or more)
include_stmts (type 3)(zero or more)
An example config file looks like this.
While I am able to partially parse the config file with the grammar below, I fail to accommodate comments which exist within a block.
For example, a block like the following raises syntax errors, and any comments inside a config block fail to parse:
<optional comments>
start_of_line request_stmts(type 1)(one or more)
indent reply_stmts (type 2)(one or more)
<comments>
include_stmts (type 3)(one or more)(optional)
The parser.out mentions one shift/reduce conflict, which I think arises because, once the reply_stmts are parsed, a comments section that follows could either mark the start of a new block or be comments within the current subblock. The current grammar's parsing result for the example file is:
[['# test comment ', '# more of this', '# does this make sense'], 'DEFAULT', [['x', '=',
'y']], [['y', '=', '1']], ['# Transmode', '# maybe something else', '# comment'],
'/random/location/test.user']
As you might notice, the second config block completely misses the username, request_stmt, and reply_stmt sections.
What I have tried
I have tried moving the comments section around in the grammar, by specifying it before specific blocks or in the statement grammar. In the code link pasted above, the comments section has been specified in the overall statement grammar. Both of these approaches fail to parse comments within a config block.
username : comments username
| username
include_stmt : comments includes
| includes
I have two main questions:
Is there a mistake I am making in my implementation/understanding of LR parsing which, if solved, would let me achieve what I want?
Is there a better way to achieve the same goal than my current approach? (PLY-fu, a different parser, a different grammar)
P.S. I wasn't able to include the actual code in the question; it is mentioned in the comments.
You are correct that the problem is that when the parser sees a comment, it cannot know whether the comment belongs to the same section or whether the previous section is finished. In the former case, the parser needs to shift the comment, while in the latter case it needs to reduce the configuration section.
Since there could be any number of comments, the necessary lookahead could be arbitrarily large, in which case LR parsing wouldn't be possible. But a simple trick can reduce the lookahead to two tokens: just combine consecutive comments into a single token.
Any LR(k) grammar has an equivalent LR(1) grammar. In effect, the LR(1) grammar works by delaying all decisions for k-1 tokens, accumulating those tokens into the parser state. That's a massive increase in grammar size, but it's usually possible to achieve the same effect in other ways, and that's certainly the case here.
The basic idea is that any comment is (temporarily) accumulated into a list of comments. When a non-comment token is encountered, this temporary list is attached to that token.
This can be done either in the lexical scanner or in the parser actions, depending on your inclinations.
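Here is a minimal PLY sketch of the scanner-side variant (the token names and regexes are assumptions, not taken from your config grammar):
import ply.lex as lex

# COMMENT must be declared even though we never emit it
tokens = ('COMMENT', 'NAME')

pending_comments = []  # comments seen since the last real token

def t_COMMENT(t):
    r'\#[^\n]*'
    pending_comments.append(t.value)
    # returning None discards the token; the comment is only remembered

def t_NAME(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    t.comments = pending_comments[:]  # attach the accumulated comments
    pending_comments.clear()
    return t

t_ignore = ' \t\n'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
Every NAME token now carries the comments that preceded it, so the grammar actions can reattach them wherever the structured form is built, and the parser itself never sees a COMMENT token.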
Before attempting all that, you should make sure that retaining comments is really useful to your application. Comments are normally not relevant to the semantics of a program (or configuration file), and it would certainly be much simpler for the lexer to just drop comments into the bit-bucket. If your application will end up reformatting the input, then it will have to retain comments. But if it only needs to extract information from the configuration, putting a lot of effort into handling comments is hard to justify.

ANTLR 4: Recognises 'and' but not 'or' without a space

I'm using the ANTLR 4 plugin in IntelliJ, and I have the most bizarre bug. I'll start with the relevant parser/lexer rules:
// Take care of whitespace.
WS : [ \r\t\f\n]+ -> skip;
OTHER: . -> skip;
STRING
: '"' [A-z ]+ '"'
;
evaluate // starting rule.
: textbox? // could be an empty textbox.
;
textbox
: (row '\n')*
;
row
: ability
| ability_list
;
ability
: activated_ability
| triggered_ability
| static_ability
;
triggered_ability
: trigger_words ',' STRING
;
trigger_words
: ('when'|'whenever'|'as') whenever_triggers|'at'
;
whenever_triggers
: triggerer (('or'|'and') triggerer)* // this line has the issue.
;
triggerer
: self
;
self
: '~'
;
I pass it this text: whenever ~ or ~, and it fails on the or, saying line 1:10 mismatched input ' or' expecting {'or', 'and'}. However, if I add a space to the whenever_triggers rule's or string (making it ' or'|'and'), it works fine.
The weirdest thing is that if I try whenever ~ and ~, it works fine even without the rule having a space in the and string. This doesn't change if I make 'and'|'or' a lexer rule either. It's just bizarre. I've confirmed this bug happens when running the 'test rig' in Antlrworks 2, so it's not just an IntelliJ thing.
Alright, you have found the answer more or less by yourself, so in this answer of mine I will focus on explaining why the problem occurred in the first place.
First of all - for everyone stumbling upon this question - the problem was that he had another implicit lexer rule defined that looked like this ' or' (notice the whitespace). Changing that to 'or' resolved the problem.
But why was that a problem?
In order to understand that, you have to understand what ANTLR does if you write '<something>' in one of your parser rules: when compiling the grammar, it generates a new lexer rule for each of those declarations. These lexer rules are created before the lexer rules defined in your grammar. The lexer matches the given input into tokens, and for that it processes the lexer rules one at a time, in the order they were declared. Therefore it always starts with the implicit token definitions and then moves on to the topmost "real" lexer rule.
The problem is that the lexer isn't too clever about this process: once it has matched some input with the current lexer rule, it creates the respective token and moves on with the trailing input.
As a result, a later lexer rule that would also have matched that input (but as a different token type) is skipped, so the respective input might not get the expected token type; the lexer rules have overwritten one another.
In your example the self-overwriting rules are ' or' (Token 1) and 'or' (Token 2). Each of those implicit lexer rule declarations results in a different lexer rule, and since the first one got matched, I assume it is declared before the second one.
Now look at your input: whenever ~ or ~. The lexer starts interpreting it, and the first rule it comes across (after the start has been matched, of course) is ' or'. It matches, because there really is a space before the or, so the input is matched as Token 1.
The parser, on the other hand, is expecting a Token 2 at this point, so it complains about the given input (although it is really complaining about the wrong token type). Altering the input to whenever ~or ~ would result in the correct interpretation.
Exactly that is the reason why you shouldn't use implicit token definitions in your grammar (unless it is really small). Create a named lexer rule for every literal and start with the most specific rules: rules that match special character sequences (e.g. keywords) should be declared before general lexer rules like ID or STRING. Rules that match any character, to prevent the lexer from throwing an error upon unrecognized input, have to be declared last, as they would overwrite every lexer rule after them.
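Applied to this grammar, an explicit token set could look something like this (a sketch; the rule names are my own, not from the original grammar):
// most specific rules first: keywords and punctuation
WHEN     : 'when';
WHENEVER : 'whenever';
AS       : 'as';
AT       : 'at';
OR       : 'or';
AND      : 'and';
TILDE    : '~';
COMMA    : ',';
NL       : '\n';               // textbox references '\n', so keep it as a token
// general rules after the keywords
STRING   : '"' [A-z ]+ '"';
WS       : [ \r\t\f]+ -> skip; // no '\n' here, since NL is a real token
// the catch-all rule goes last
OTHER    : . -> skip;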

ANTLR : Have errors with Fragment

The error is:
mismatched input 'elseState' expecting RULE_TOKEN_REF
Can someone explain to me why I have this error and how to fix it?
Your help will be appreciated.
Fragments are reserved for lexer rule definitions and are not usable in parser rules; you don't need one in your case.
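Since the original grammar isn't shown, this is only a guess at the shape of the problem, but the error means that after the fragment keyword ANTLR expects a lexer rule name (RULE_TOKEN_REF, i.e. a name starting with an uppercase letter):
// rejected: 'fragment' cannot introduce a parser rule name
// fragment elseState : 'else' ID;
// fine: drop the keyword and it is an ordinary parser rule
elseState : 'else' ID;
ID : [a-zA-Z]+;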
A fragment is used to split complex lexer rules and introduce reusability without producing a dedicated token, e.g.:
NUMBER : DIGIT+;
ID : LETTER (LETTER|DIGIT)*;
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
In these lexer rules, I don't want LETTER and DIGIT as tokens; however, I want to use and reuse them in other lexer rules (NUMBER and ID), so I 'mark' them as fragment. It makes the lexer more readable and easier to maintain.
You can read more details here: https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules

lexer/parser ambiguity

How does a lexer solve this ambiguity?
/*/*/
How is it that it doesn't just say: oh yeah, that's the beginning of a multi-line comment, followed by another multi-line comment?
Wouldn't a greedy lexer just return the following tokens?
/*
/*
/
I'm in the midst of writing a shift-reduce parser for CSS, and yet this simple comment thing is in my way. You can read this question if you want some more background information.
UPDATE
Sorry for leaving this out in the first place. I'm planning to add extensions to the CSS language in this form /* # func ( args, ... ) */ but I don't want to confuse an editor which understands CSS but not this extension comment of mine. That's why the lexer just can't ignore comments.
One way to do it is for the lexer to enter a different internal state on encountering the first /*. For example, flex calls these "start conditions" (matching C-style comments is one of the examples on that page).
The simplest way would probably be to lex the comment as one single token - that is, don't emit a "START COMMENT" token, but instead continue reading in input until you can emit a "COMMENT BLOCK" token that includes the entire /*(anything)*/ bit.
Since comments are not relevant to the actual parsing of executable code, it's fine for them to basically be stripped out by the lexer (or at least, clumped into a single token). You don't care about token matches within a comment.
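If the lexer happens to be ANTLR-based (as elsewhere on this page), that single-token approach is one non-greedy rule (a minimal sketch):
// consume everything up to and including the first '*/'
COMMENT : '/*' .*? '*/';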
In most languages, this is not ambiguous: the first slash and asterisk are consumed to produce the "start of multi-line comment" token. It is followed by a slash, which is plain "content" within the comment, and finally the last two characters are the "end of multi-line comment" token.
Since the first two characters are consumed, the first asterisk cannot also be used to produce an end-of-comment token. I just noted that it could produce a second "start of comment" token... oops, that could be a problem, depending on the amount of context available to the parser.
I speak here of tokens, assuming a parser-level handling of the comments. But the same applies to a lexer, whereby the underlying rule is to start with '/*' and then not stop till '*/' is found. Effectively, a lexer-level handling of the whole comment wouldn't be confused by the second "start of comment".
Since CSS does not support nested comments, your example would typically parse into a single token, COMMENT.
That is, the lexer would see /* as a start-comment marker and then consume everything up to and including a */ sequence.
Use the regexp's algorithm: when you reach the end marker, search backwards from the current location toward the beginning of the string.
if (chars[currentLocation] == '/' && chars[currentLocation - 1] == '*') {
    // we are standing on "*/": scan backwards for the matching "/*"
    for (int i = currentLocation - 2; i >= 0; i--) {
        if (chars[i] == '/' && chars[i + 1] == '*') {
            // chars[i .. currentLocation] is one complete comment
            break;
        }
    }
}
It's like applying the regexp /\*([^*]|\*[^/])*\*/ greedily and bottom-up.
One way to solve this would be to have your lexer return:
/
*
/
*
/
And have your parser deal with it from there. That's what I'd probably do for most programming languages, as the /'s and *'s can also be used for multiplication and other such things, which are all too complicated for the lexer to worry about. The lexer should really just be returning elementary symbols.
If what the token is starts to depend too much on context, what you're looking for may very well be a simpler token.
That being said, CSS is not a programming language, so /'s and *'s can't be overloaded. As far as I know, they can't be used for anything other than comments. So I'd be very tempted to just pass the whole thing along as a comment token unless you have a good reason not to: /\*.*\*/
