How to parse "Leg 1: Jun 25" with ANTLR - antlr4

I am starting out with antlr4 and, after following some tutorials, I have started to write my own grammar. For now, I want to parse a simple input: "Leg 1: Jun 25".
fragment DIGIT
: [0-9];
fragment MONTH
: [A-Z][a-z][a-z];
DATE
: MONTH ' ' DIGIT+;
LEG_NUMBER
: DIGIT+;
leg
: 'Leg ' LEG_NUMBER ': ' DATE;
But it doesn't work; I get the following error:
line 1:0 mismatched input 'Leg 1' expecting 'Leg '
I don't even understand the error message... Here is the parse tree in the IntelliJ ANTLR plugin

The parse tree is showing you that the Lexer has recognized your input as three tokens: a DATE ("Leg 1"), your : (implicitly defined) token, and then another DATE ("Jun 25").
The first thing to understand is that the Lexer will first tokenize your input stream of characters into a stream of tokens. At this point in the processing, parser rules have absolutely no impact. Parser rules match against the stream of tokens (not your input stream of characters).
Since your DATE rule says "upper-case letter, lower-case letter, lower-case letter, space, one or more digits", "Leg 1" is a match and is recognized as a DATE token. The Lexer doesn't know (or care) that your parser rule wants to start by matching "Leg ".
It's always a good idea to run your input through a tool that shows you the token stream so you can validate your Lexer rules. That can be the grun alias with the -tokens option, or you should be able to view your token stream in the IntelliJ ANTLR plugin (with some experience you'll also recognize that the parse tree diagram is telling you the same thing).
One way to fix this would be to tighten up the MONTH fragment:
fragment MONTH
: (
'Jan'
| 'Feb'
| 'Mar'
| 'Apr'
| 'May'
| 'Jun'
| 'Jul'
| 'Aug'
| 'Sep'
| 'Oct'
| 'Nov'
| 'Dec'
)
;
That will prevent "Leg 1" from matching. I'm not recommending that as a good path forward with a "real" grammar, but it does resolve this immediate issue as you start to work with ANTLR.
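To make the tokenization step concrete, here is a rough Python sketch of the "longest match wins" behaviour described above, run once with the loose MONTH fragment and once with the tightened one. It is not ANTLR: `build_rules` and `tokenize` are made-up names, and the regexes are hand-translated stand-ins for the lexer rules in the question.

```python
import re

# Hand-translated stand-ins for the MONTH fragment (assumption: plain regexes
# only approximate what the ANTLR lexer does).
LOOSE_MONTH = r"[A-Z][a-z][a-z]"
STRICT_MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"


def build_rules(month):
    """Return (token name, compiled regex) pairs mimicking the grammar."""
    return [
        ("'Leg '", re.compile(r"Leg ")),          # implicit literal token
        ("': '", re.compile(r": ")),              # implicit literal token
        ("DATE", re.compile(month + r" [0-9]+")),
        ("LEG_NUMBER", re.compile(r"[0-9]+")),
    ]


def tokenize(text, rules):
    """Split text into (token name, matched text) pairs, longest match first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if match and (best is None or len(match.group()) > len(best[1])):
                best = (name, match.group())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens


loose = tokenize("Leg 1: Jun 25", build_rules(LOOSE_MONTH))
strict = tokenize("Leg 1: Jun 25", build_rules(STRICT_MONTH))
print([name for name, _ in loose])
print([name for name, _ in strict])
```

With the loose fragment the sketch produces DATE, ': ', DATE (the three tokens the parse tree shows); with the tightened fragment it produces 'Leg ', LEG_NUMBER, ': ', DATE, which is the sequence the leg parser rule expects.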

Related

Splunk rex Search - Unable to tabulate because of NULL

I want to extract "TimesAccessed" from the Message field.
Message: PublicDomainAPI.SaveAsync: progresses = [{"UserGuid":"0a062514-def3-4ae5-9092-asd12easd","CourseId":"c71f6538-e379-447e-aaf3-asd1dasd","Status":"InProgress","UserScore":1,"TotalTime":"0:23:45","TimesAccessed":null,"CompletionDate":null,"LastTimeAccessed":"2022-07-23T09:59:12.191+00:00","SuccessStatus":"Pass","Bookmark":"en","SuspendData":null,"Progress":null,"RegistrationDate":"2022-07-23T09:59:12.191+00:00","RegistrationNumber":1}], total: 1
I used | rex field=Message "\"TimesAccessed\"\:\"(?<TimesAccessed>[^\"]+)"
But I am not getting tabulated results because my data has NULL.
The same works for other fields like
| rex field=Message "\"TotalTime\"\:\"(?<TotalTime>[^\"]+)"
| rex field=Message "\"CourseId\"\:\"(?<CourseId>[^\"]+)"
Checking your regex on regex101 shows that it fails: it requires a literal " after the colon, but there isn't one when the value is null.
This regular expression is both simpler to read, and pulls what you're looking for (without the extraneous comma):
| rex field=Message "TimesAccessed[[:punct:]]+(?<TimesAccessed>[^\",]+)"
Use the [[:punct:]] character class to match any punctuation between the field name and the value you're trying to capture.
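The difference between the two patterns can be checked outside Splunk with a small Python sketch. Two assumptions apply: the message below is a shortened copy of the one in the question, and because Python's re module has no [[:punct:]] class, the explicit set [":] stands in for it. Python also spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# Shortened copy of the Message field from the question (assumption: only the
# fields needed for the demonstration are kept).
message = (
    'PublicDomainAPI.SaveAsync: progresses = [{"Status":"InProgress",'
    '"TotalTime":"0:23:45","TimesAccessed":null,"CompletionDate":null}], total: 1'
)

# Original approach: demands an opening quote right after the colon.
quoted_only = re.compile(r'"TimesAccessed":"(?P<TimesAccessed>[^"]+)')

# Relaxed approach: skip any run of quote/colon characters, then capture up to
# the next quote or comma. [":] is a stand-in for [[:punct:]].
quoted_or_bare = re.compile(r'TimesAccessed[":]+(?P<TimesAccessed>[^",]+)')

print(quoted_only.search(message))
print(quoted_or_bare.search(message).group("TimesAccessed"))
```

The first search finds nothing because null is not quoted; the second captures the bare value null and would equally capture a quoted string value.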

Odd ANTLR4 error handling behavior with very simple (trivial) grammar

Given the below super simple grammar:
ddlStatement
: defineStatement
;
defineStatement
: 'define' tableNameToken=Identifier ';'?
;
and the input "add 1 to bob"
I would expect to get an error. However, the parser matches the "defineStatement" rule with a missing "define" token. The following Listener will fire
@Override
public void exitDefineStatement(DDLParser.DefineStatementContext ctx) {
    log.info(MessageFormat.format("Defining {0}", ctx.tableNameToken.getText()));
}
and log "Defining add".
I can assign 'define' to a variable and test that variable for NULL but that seems like work I shouldn't have to do.
BTW if the grammar becomes more complete - specifically with the addition of alternatives to the ddlStatement rule - error handling works as I would expect.
This is ANTLR's error recovery in action.
In many cases, it's VERY beneficial for ANTLR to either assume a missing token or ignore an extraneous one, if that allows parsing to continue. The missing "define" token should have been reported as an error.
Without this capability, ANTLR would frequently get "stumped" at the first sign of problems. With this, ANTLR is saying "Well, if I assume X, then I can make sense of your input. So I'm assuming X and reporting that as an error so I can continue on."
(Filling a few details to get this to build)
grammar Test;
ddlStatement: defineStatement;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
If I run ANTLR on this and compile the Java output, the following command:
echo "add 1 to bob" | grun Test ddlStatement -gui
yields the error:
line 1:0 missing 'define' at 'add'
and produces the parse tree:
The highlighted node is the error node in the tree.
The reason it stops after "add" is that the input consumed so far (assuming a missing "define") is already a complete ddlStatement.
ANTLR will stop processing input once it has recognized your start rule.
To get it to "pay attention" to the entire input, add an EOF token to your start rule:
grammar Test;
ddlStatement: defineStatement EOF;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
gives these errors:
line 1:0 missing 'define' at 'add'
line 1:4 mismatched input '1' expecting {<EOF>, ';'}
and this tree:
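The effect of the trailing EOF can be mimicked with a toy recogniser in Python. This is only a hand-written illustration of the idea, not how ANTLR implements recovery: `lex`, `peek` and `parse_ddl_statement` are hypothetical names, and the error strings simply imitate the messages shown above without the line and column prefix.

```python
import re


def lex(text):
    # Identifier, Number and ';' tokens; whitespace is dropped like the WS rule.
    return re.findall(r"[a-zA-Z]+|[0-9]+|;", text)


def peek(tokens, pos):
    return tokens[pos] if pos < len(tokens) else "<EOF>"


def parse_ddl_statement(text, require_eof):
    """Toy recogniser for: 'define' Identifier ';'? [EOF]. Returns (name, errors)."""
    tokens = lex(text)
    errors = []
    pos = 0

    # Toy stand-in for recovery: if 'define' is absent, report it and carry on
    # as though it had been present.
    if peek(tokens, pos) == "define":
        pos += 1
    else:
        errors.append(f"missing 'define' at '{peek(tokens, pos)}'")

    name = None
    if peek(tokens, pos).isalpha():
        name = peek(tokens, pos)
        pos += 1
    else:
        errors.append(f"missing Identifier at '{peek(tokens, pos)}'")

    saw_semicolon = peek(tokens, pos) == ";"
    if saw_semicolon:
        pos += 1

    # Without an EOF requirement the rule is complete here and any remaining
    # tokens are silently left unread.
    if require_eof and pos < len(tokens):
        expected = "<EOF>" if saw_semicolon else "{<EOF>, ';'}"
        errors.append(f"mismatched input '{tokens[pos]}' expecting {expected}")
    return name, errors


print(parse_ddl_statement("add 1 to bob", require_eof=False))
print(parse_ddl_statement("add 1 to bob", require_eof=True))
```

Without the EOF requirement the sketch reports only the missing 'define' and returns "add" as the table name; with it, the leftover "1" is reported as well.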

Formatting string in Powershell but only first or specific occurrence of replacement token

I have a regular expression that I use several times in a script, where a single word gets changed but the rest of the expression remains the same. Normally I handle this by just creating a regular expression string with a format like the following example:
# Simple regex looking for exact string match
$regexTemplate = '^{0}$'
# Later on...
$someString = 'hello'
$someString -match ( $regexTemplate -f 'hello' ) # ==> True
However, I've written a more complex expression where I need to insert a variable into the expression template and... well regex syntax and string formatting syntax begin to clash:
$regexTemplate = '(?<=^\w{2}-){0}(?=-\d$)'
$awsRegion = 'us-east-1'
$subRegion = 'east'
$awsRegion -match ( $regexTemplate -f $subRegion ) # ==> Error
Which results in the following error:
InvalidOperation: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.
I know what the issue is: it's treating one of my regex quantifiers as a replacement token. Rather than opt for a string-interpolation approach or replace {0} myself, is there a way I can tell PowerShell/.NET to only replace the 0-indexed token? Or is there another way to achieve the desired output using format strings?
If a string template includes { and/or } characters, you need to double these so they do not interfere with the numbered placeholders.
Try
$regexTemplate = '(?<=^\w{{2}}-){0}(?=-\d$)'
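Python's str.format follows the same brace-doubling convention as the .NET format strings behind PowerShell's -f operator, so the fix can be sanity-checked there. The sketch below is an analogue under that assumption, not PowerShell itself; `regex_template` mirrors the variable from the question.

```python
import re

# Unescaped quantifier braces: {2} is read as "positional argument 2".
try:
    r"(?<=^\w{2}-){0}(?=-\d$)".format("east")
except IndexError as exc:
    print("unescaped template fails:", exc)

# Doubled braces are emitted literally; {0} is the only real placeholder.
regex_template = r"(?<=^\w{{2}}-){0}(?=-\d$)"
pattern = regex_template.format("east")

print(pattern)
print(bool(re.search(pattern, "us-east-1")))
```

After formatting, the doubled braces collapse back to the single-brace quantifier \w{2}, and the resulting pattern matches "east" inside "us-east-1".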

The lexer chooses the wrong Token

Hi, I am new to ANTLR and have a problem that I have not been able to solve over the last few days:
I wanted to write a grammar that recognizes this text (in reality I want to parse something different, but for the case of this question I simplified it)
100abc
150100
200def
Here each row starts with 3 digits that identify the type of the line (header, content, trailer), followed by 3 characters that are the payload of the line.
I thought I could recognize this with this grammar:
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
header : '100' A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
But this does not work. When the lexer reads the "100" in the second line ("150100") it reads it into one token with 100 as the value and not as three Tokens of type A. So the parser sees a "100" token where it expects an A Token.
I am pretty sure that this happens because the Lexer wants to match the longest phrase for one Token, so it clusters the '1','0','0' together. I found no way to solve this. Putting the rule A above the parser rule that contains the string literal "100" did not work. Factoring the '100' into a fragment as follows did not work either.
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
HUNDRED: '100';
header : HUNDRED A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
I also read some other posts like this:
antlr4 mixed fragments in tokens
Lexer, overlapping rule, but want the shorter match
But I don't think they solve my problem, or at least I don't see how they could help me.
One of your token definitions is incorrect: A : [a-z|A-Z|0-9]; Don't use a vertical bar inside a [] character set; there it is a literal | character, not an alternation operator. A correct definition is: A : [a-zA-Z0-9];. ANTLR 4.6 and later will warn about the duplicated | character inside the set.
As I understand it, you have mixed up the concepts of tokens and rules. Tokens (lexer rules) are defined with an upper-case first letter, whereas parser rules are defined with a lower-case first letter. Your header, content and trailer should be tokens, not parser rules.
So, in my opinion, the final version of the correct grammar is:
grammar Types;
file : Header Content Trailer;
A : [a-zA-Z0-9];
NL: '\r' '\n'? | '\n' | EOF; // Or leave only one type of newline.
Header : '100' A A A NL;
Content: '150' A A A NL;
Trailer: '200' A A A NL;
Your input text will be parsed to (file 100abc\n 150100\n 200def)
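The difference between the two grammars can be sketched with a small longest-match tokenizer in Python. This is an approximation rather than ANTLR's real lexer: `QUESTION_RULES`, `ANSWER_RULES` and `tokenize` are made-up names, the regexes are hand-translated from the grammars above, and \Z stands in for EOF.

```python
import re

A = r"[a-zA-Z0-9]"
NL = r"(?:\r\n?|\n|\Z)"  # \Z approximates the EOF alternative

# Question's grammar: '100', '150', '200' act as implicit literal tokens.
QUESTION_RULES = [
    ("'100'", "100"), ("'150'", "150"), ("'200'", "200"),
    ("A", A), ("NL", r"\n"),
]

# Answer's grammar: each whole line is a single lexer token.
ANSWER_RULES = [
    ("A", A), ("NL", NL),
    ("Header", "100" + A + "{3}" + NL),
    ("Content", "150" + A + "{3}" + NL),
    ("Trailer", "200" + A + "{3}" + NL),
]


def tokenize(text, rules):
    """Return token names, always taking the longest match at each position."""
    names = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Zero-length matches are ignored; ties keep the earlier rule.
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        names.append(best_name)
        pos += best_len
    return names


text = "100abc\n150100\n200def\n"
print(tokenize(text, QUESTION_RULES))
print(tokenize(text, ANSWER_RULES))
```

With the question's rules the second line comes out as '150', '100', NL, so the parser never sees the three A tokens it expects. With the answer's rules each line becomes one Header, Content or Trailer token.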

Inconsistent token handling in ANTLR4

The ANTLR4 book references a multi-mode example
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsLexer.g4
lexer grammar ModeTagsLexer;
// Default mode rules (the SEA)
OPEN : '<' -> mode(ISLAND) ; // switch to ISLAND mode
TEXT : ~'<'+ ; // clump all text together
mode ISLAND;
CLOSE : '>' -> mode(DEFAULT_MODE) ; // back to SEA mode
SLASH : '/' ;
ID : [a-zA-Z]+ ; // match/send ID in tag to parser
https://github.com/stfairy/learn-antlr4/blob/master/tpantlr2-code/lexmagic/ModeTagsParser.g4
parser grammar ModeTagsParser;
options { tokenVocab=ModeTagsLexer; } // use tokens from ModeTagsLexer.g4
file: (tag | TEXT)* ;
tag : '<' ID '>'
| '<' '/' ID '>'
;
I'm trying to build on this example, but using the « and » characters for delimiters. If I simply substitute I'm getting error 126
cannot create implicit token for string literal in non-combined grammar: '«'
In fact, this seems to occur as soon as I have the « character in the parser tag rule.
tag : '«' ID '>';
with
OPEN : '«' -> pushMode(ISLAND);
TEXT : ~'«'+;
Is there some antlr foo I'm missing? This is using antlr4-maven-plugin 4.2.
The wiki mentions something along these lines, but the way I read it that's contradicting the example on github and anecdotal experience when using <. See "Redundant String Literals" at https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lexer+Rules
One of the following is happening:
1. You forgot to update the OPEN rule in ModeTagsLexer.g4 to use the following form:
   OPEN : '«' -> mode(ISLAND) ;
2. You found a bug in ANTLR 4, which should be reported to the issue tracker.
Have you specified the file encoding that ANTLR should use when reading the grammar? It should be okay with European characters less than 255 but...
