The lexer chooses the wrong Token - antlr4

Hi I am new to antrl and have a problem that I am not able to solve during the last days:
I wanted to write a grammar that recognizes this text (in reality I want to parse something different, but for the case of this question I simplified it)
100abc
150100
200def
Here each rows starts with 3 digits, that identifiy the type of the line (header, content, trailer), than 3 characters follow, that are the payload of the line.
I thought I could recogize this with this grammar:
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
header : '100' A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
But this does not work. When the lexer reads the "100" in the second line ("150100") it reads it into one token with 100 as the value and not as three Tokens of type A. So the parser sees a "100" token where it expects an A Token.
I am pretty sure that this happens because the Lexer wants to match the longest phrase for one Token, so it cluster together the '1','0','0'. I found no way to solve this. Putting the Rule A above the parser Rule that contains the string literal "100" did not work. And also factoring the '100' into a fragement as follows did not work.
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
HUNDRED: '100';
header : HUNDRED A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
I also read some other posts like this:
antlr4 mixed fragments in tokens
Lexer, overlapping rule, but want the shorter match
But I did not think, that it solves my problem, or at least I don't see how that could help me.

One of your token definitions is incorrect: A : [a-z|A-Z|0-9]; Don't use a vertical line inside a range [] set. A correct definition is: A : [a-zA-Z0-9];. ANTLR with version >= 4.6 will notify about duplicated chars | inside range set.
As I understand you mixed tokens and rules concept. Tokens defined with UPPER first letter unlike rules that defined with lower case first letter. Your header, content and trailer are tokens, not rules.
So, the final version of correct grammar on my opinion is
grammar Types;
file : Header Content Trailer;
A : [a-zA-Z0-9];
NL: '\r' '\n'? | '\n' | EOF; // Or leave only one type of newline.
Header : '100' A A A NL;
Content: '150' A A A NL;
Trailer: '200' A A A NL;
Your input text will be parsed to (file 100abc\n 150100\n 200def)

Related

How to parse "Leg 1: Jun 25" with ANTLR

I am starting with antlr4 and after following some tutorials, I started to make my own grammar. For now, I wanted to parse a simple input Leg 1: Jun 25.
fragment DIGIT
: [0-9];
fragment MONTH
: [A-Z][a-z][a-z];
DATE
: MONTH ' ' DIGIT+;
LEG_NUMBER
: DIGIT+;
leg
: 'Leg ' LEG_NUMBER ': ' DATE;
But it's no success, I get the following error
line 1:0 mismatched input 'Leg 1' expecting 'Leg '
I don't understand even the output message... Here is the parse tree in IntelliJ ANTLR plugin
The parse tree is showing you that the Lexer has recognized your input as three tokens: a DATE ("Leg 1"), your : (implicitly defined) token, and then another DATE ("Jun 25").
The first thing to understand is that the Lexer will first tokenize your input stream of characters into a stream of tokens. At this point in the processing, parser rules have absolutely no impact. Parser rules match against the stream of tokens (not your input stream of characters).
Since your DATE rule says "Upper case letter, lowercase letter, lowercase letter, space, one or more numbers", then "Leg 1" is a match, and is recognized as a DATE token. The Lexer doesn't know (or care) that your parser rule wants to start by matching "Leg ".
It's always a good idea to run your input through some tool that shows you the token stream so you can validate your Lexer rules. That can either be the grun alias with the -tokens option, or you should be able to view your token stream in the IntelliJ ANTLR plugin (with some experience you'll also recognize that the parse tree diagram is telling you that as well)
One way to fix this would be to tighten up the MONTH fragment:
fragment MONTH
: (
'Jan'
| 'Feb'
| 'Mar'
| 'Apr'
| 'May'
| 'Jun'
| 'Jul'
| 'Aug'
| 'Sep'
| 'Oct'
| 'Nov'
| 'Dec'
)
;
That will prevent "Leg 1" from matching. I'm not recommending that as a good path forward with a "real" grammar, but it does resolve this immediate issue as you start to work with ANTLR.

Parsing formatted strings in Go

The Problem
I have slice of string values wherein each value is formatted based on a template. In my particular case, I am trying to parse Markdown URLs as shown below:
- [What did I just commit?](#what-did-i-just-commit)
- [I wrote the wrong thing in a commit message](#i-wrote-the-wrong-thing-in-a-commit-message)
- [I committed with the wrong name and email configured](#i-committed-with-the-wrong-name-and-email-configured)
- [I want to remove a file from the previous commit](#i-want-to-remove-a-file-from-the-previous-commit)
- [I want to delete or remove my last commit](#i-want-to-delete-or-remove-my-last-commit)
- [Delete/remove arbitrary commit](#deleteremove-arbitrary-commit)
- [I tried to push my amended commit to a remote, but I got an error message](#i-tried-to-push-my-amended-commit-to-a-remote-but-i-got-an-error-message)
- [I accidentally did a hard reset, and I want my changes back](#i-accidentally-did-a-hard-reset-and-i-want-my-changes-back)
What I want to do?
I am looking for ways to parse this into a value of type:
type Entity struct {
Statement string
URL string
}
What have I tried?
As you can see, all the items follow the pattern: - [{{ .Statement }}]({{ .URL }}). I tried using the fmt.Sscanf function to scan each string as:
var statement, url string
fmt.Sscanf(s, "[%s](%s)", &statement, &url)
This results in:
statement = "I"
url = ""
The issue is with the scanner storing space-separated values only. I do not understand why the URL field is not getting populated based on this rule.
How can I get the Markdown values as mentioned above?
EDIT: As suggested by Marc, I will add couple of clarification points:
This is a general purpose question on parsing strings based on a format. In my particular case, a Markdown parser might help me but my intention to learn how to handle such cases in general where a library might not exist.
I have read the official documentation before posting here.
Note: The following solution only works for "simple", non-escaped input markdown links. If this suits your needs, go ahead and use it. For full markdown-compatibility you should use a proper markdown parser such as gopkg.in/russross/blackfriday.v2.
You could use regexp to get the link text and the URL out of a markdown link.
So the general input text is in the form of:
[some text](somelink)
A regular expression that models this:
\[([^\]]+)\]\(([^)]+)\)
Where:
\[ is the literal [
([^\]]+) is for the "some text", it's everything except the closing square brackets
\] is the literal ]
\( is the literal (
([^)]+) is for the "somelink", it's everything except the closing brackets
\) is the literal )
Example:
r := regexp.MustCompile(`\[([^\]]+)\]\(([^)]+)\)`)
inputs := []string{
"[Some text](#some/link)",
"[What did I just commit?](#what-did-i-just-commit)",
"invalid",
}
for _, input := range inputs {
fmt.Println("Parsing:", input)
allSubmatches := r.FindAllStringSubmatch(input, -1)
if len(allSubmatches) == 0 {
fmt.Println(" No match!")
} else {
parts := allSubmatches[0]
fmt.Println(" Text:", parts[1])
fmt.Println(" URL: ", parts[2])
}
}
Output (try it on the Go Playground):
Parsing: [Some text](#some/link)
Text: Some text
URL: #some/link
Parsing: [What did I just commit?](#what-did-i-just-commit)
Text: What did I just commit?
URL: #what-did-i-just-commit
Parsing: invalid
No match!
You could create a simple lexer in pure-Go code for this use case. There's a great talk by Rob Pike from years ago that goes into the design of text/template which would be applicable. The implementation chains together a series of state functions into an overall state machine, and delivers the tokens out through a channel (via Goroutine) for later processing.

Inconsistent token handling in ANTLR4

The ANTLR4 book references a multi-mode example
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsLexer.g4
lexer grammar ModeTagsLexer;
// Default mode rules (the SEA)
OPEN : '<' -> mode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
mode ISLAND;
CLOSE : '>' -> mode(DEFAULT_MODE) ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z]+ ; // match/send ID in tag to parser
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsParser.g4
parser grammar ModeTagsParser;
options { tokenVocab=ModeTagsLexer; } // use tokens from ModeTagsLexer.g4
file: (tag | TEXT)* ;
tag : '<' ID '>'
| '<' '/' ID '>'
;
I'm trying to build on this example, but using the « and » characters for delimiters. If I simply substitute I'm getting error 126
cannot create implicit token for string literal in non-combined grammar: '«'
In fact, this seems to occur as soon as I have the « character in the parser tag rule.
tag : '«' ID '>';
with
OPEN : '«' -> pushMode(ISLAND);
TEXT : ~'«'+;
Is there some antlr foo I'm missing? This is using antlr4-maven-plugin 4.2.
The wiki mentions something along these lines, but the way I read it that's contradicting the example on github and anecdotal experience when using <. See "Redundant String Literals" at https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules
One of the following is happening:
You forgot to update the OPEN rule in ModeTagsLexer.g4 to use the following form:
OPEN : '«' -> mode(ISLAND) ;
You found a bug in ANTLR 4, which should be reported to the issue tracker.
Have you specified the file encoding that ANTLR should use when reading the grammar? It should be okay with European characters less than 255 but...

ANTLR4 : mismatched input

I would like to match the input of the form ::
commit a1b2c3
Author: Michael <michael#test.com>
commit d3g4
Author: David <david#test.com>
Here is the grammar I have written:
grammar commit;
file : commitinfo+;
commitinfo : commitdesc authordesc;
commitdesc : 'commit' COMMITHASH NEWLINE;
authordesc : 'Author:' AUTHORNAME '<' EMAIL '>' NEWLINE;
COMMITHASH : [a-z0-9]+;
AUTHORNAME : [a-zA-Z]+;
EMAIL : [a-zA-Z0-9.#]+;
NEWLINE : '\r'?'\n';
WHITESPACE : [ \t]->skip;
The problem with the above parser is that, for the above input it matches perfectly. But when the input changes to :
commit c1d2
Author: michael <michael#test.com>
it throws an error like :
line 2:8 mismatched input 'michael' expecting AUTHORNAME.
When I print the tokens, it seems the string 'michael' gets matched by the token COMMITHASH instead of AUTHORNAME.
How to fix the above case?
ANTLR4 matches the lexer rules according to the sequence in which they have been written.
'michael' gets matched by the rule COMMITHASH : [a-z0-9]+ ; which appears before the rule AUTHORNAME and hence you are having the error.
I can think of the following options to resolve the issue you are facing :
You can use the 'mode' feature in ANTLR : In ANTLR 4, one lexer mode is active at a time, and the longest non-fragment lexer rule in that mode rule will determine which token is created. Your grammar only includes the default mode, so all the lexer rules are active and hence 'michael' gets matched to COMMITHASH as the length of the token matched is same for COMMITHASH and AUTHORNAME but COMMITHASH appears before AUTHORNAME in the grammar.
You can alter your lexical rules by interchanging the way in which they appear in the grammar. Assuming your COMMITHASH rule always has a numeral matched with it. Put AUTHORNAME before COMMITHASH in the following way :
grammar commit;
...
AUTHORNAME : [a-zA-Z]+;
COMMITHASH : [a-z0-9]+;
...
Note: I strongly feel that your lexer rules are not crisply written. Are you sure that your COMMITHASH rule should be [a-z0-9]+; This would mean a token like 'abhdks' will also get matched by your COMMITHASH rule. But that's a different issue altogether.

ANTLR4: Tree construction

I am extending the baseClass Listener and am attempting to read in some values, however there doesnt seem to be any hierrarchy in the order.
A cut down version of my grammar is as follows:
start: config_options+
config_options: (KEY) EQUALS^ (PATH | ALPHANUM) (' '|'\r'|'\n')* ;
KEY: 'key' ;
EQUALS: '=' ;
ALPHANUM: [0-9a-zA-Z]+ ;
However the parse tree of this implementation is flat at the config_options level (Terminal level) i.e.the rule start has many children of config_options but EQUALS is not the root of subtrees of config_options, all of the TOKENS have the rule config_options as root node. How can I make one of the terminals a root node instead?
In this particular rule I dont want any of the spaces to be captured, I know that there is the -> skip directed for the lexer however there are some cases where I do want the space. i.e. in String '"'(ALPHANUM|' ')'"'
(Note: the ^ does not seem to work)
an example for input is:
key=abcdefg
key=90weata
key=acbefg9
All I want to do is extract the key and value pairs. I would expect that the '=' would be the root and the two children would be the key and the value.
When you generate your grammar, you should be getting a syntax error over the use of the ^ operator, which was removed in ANTLR 4. ANTLR 4 generates parse trees, the roots of which are implicitly defined by the rules in your grammar. In other words, for the grammar you gave above the parse tree nodes will be start and config_options.
The generated config_options rule will return an instance of Config_optionsContext, which contains the following methods:
KEY() returns a TerminalNode for the KEY token.
EQUALS() (same for the EQUALS token)
PATH() (same for the PATH token)
ALPHANUM() (same for the ALPHANUM token)
You can call getSymbol() on a TerminalNode to get the Token instance.

Resources