Antlr4 how to match something that I do not care about

I am working on a parser that works in phases. In each phase it analyzes some data and prepares it for the next phase. The input may already contain data prepared for the next phase, and I do not want to complicate the grammar for phase 1 with constructions from phase 2.
For example, suppose I have a file that looks like this:
some complex constructions that I do not care at this phase about specifics
but I know that they will not have line starting with <
...
...
< line: 0 foo bar
< line: 1 blah blah
more constructions
< line: 0 foo bar
< line: 1 blah blah
What should my grammar look like to parse those lines with a rule like this
line
: SERIALIZE_BRACKET KW_LINE
NUMBER // indent
.*? // content
( EOL | EOF ) // the end of the line
;
and have everything else returned in generic "unknown" blocks? Note that I do not want the data to be skipped: I would like to have access to it, I just do not want to parse it at this phase.
What should my grammar look like?
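To make the desired result concrete, here is a minimal sketch in Python rather than ANTLR. It only illustrates the output shape the question asks for (parsed "line" records interleaved with raw "unknown" blocks that are kept, not skipped). The names LINE_RE and split_blocks are hypothetical, and the exact layout of a "< line: N ..." record is an assumption based on the sample file above.

```python
import re

# Assumed record layout: "< line: <indent> <content>" (taken from the sample file).
LINE_RE = re.compile(r"^< line: (\d+) ?(.*)$")

def split_blocks(text):
    """Return ("line", indent, content) and ("unknown", raw_text) items in input order."""
    items = []
    unknown = []  # consecutive lines that are not parsed at this phase

    def flush():
        # Emit the pending unknown lines as one raw block so nothing is lost.
        if unknown:
            items.append(("unknown", "\n".join(unknown)))
            unknown.clear()

    for raw in text.splitlines():
        m = LINE_RE.match(raw)
        if m:
            flush()
            items.append(("line", int(m.group(1)), m.group(2)))
        else:
            unknown.append(raw)
    flush()
    return items
```

Each unknown block keeps its original text, so a later phase can parse it with its own grammar.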

Related

Python ANTLR4 extraneous input plus tokens removal

I am trying to parse a text file and I want to create a grammar to catch specific text blocks let's say
a) the word 'specificWordA' or 'specWordB' followed by zero or more digits, or
b) the word 'testC' followed by 1 or more digits.
My grammar looks like this:
grammar Hello;
catchExpr : expr+ EOF;
expr : matchAB | matchC;
matchAB : TEXTAB DIGIT*;
matchC : TEXTC DIGIT+;
TEXTAB : ('specificWordA' | 'specWordB') ;
TEXTC : ('testC') ;
DIGIT : NUMBER+ ;
CHARS : ('a'..'z' | 'A'..'Z')+ ;
SPACES : [ \r\t\n] ->skip;
fragment NUMBER: '0'..'9'+ ;
I am using ANTLR4 and I have generated the parser for both Java (to use the TestRig GUI to view the AST) and Python 2 (to provide a custom listener to traverse the tree). My file contains the following text:
specificWordA 11
specWordB specWordB specWordB testC 22 not not testD
testD 11
testC teeeeeeeeeest
testD 2
end here
Could someone please help me with the following questions:
1) Does ANTLR4 create nodes by default for every token I have defined in my grammar? How can I remove them to get a simplified version of the AST (see the image below: there are nodes for every sequence of characters that matches the token CHARS)?
2) Why does "testC teeeeeeeeeest testD 2 end here"
match an expression? My rule is a text block 'testC' followed by at least one digit!
3) When I run my code I get the following messages:
line 3:39 extraneous input 'not' expecting {<EOF>, TEXTAB, 'testC'}
line 7:6 mismatched input 'teeeeeeeeeest' expecting {<EOF>, TEXTAB, 'testC'}
What does extraneous input mean? Do I have to change my grammar, or is it just a warning?
Based on these questions,
ANTLR4 extraneous input
ANTLR4: Extraneous Input error
I cannot figure out what is wrong with my grammar!

ANTLR 4: How do I know if all the input was parsed?

If my input is "ab" and the parse is looking for "a", it recognises "a" as expected but I need the trailing "b" to produce an error. How do I test for this?
The lexer generates an EOF token at the end of the source input. To force processing of all input, require the EOF as part of your main parser rule:
r : a+ EOF ;
a : A ;
b : B ;
A : 'a' ;
B : 'b' ;
The parser, starting from rule r with input 'abaab', will report an error for the unexpected input, in fact two, one for each 'b'. The default parser error strategy will attempt to skip a limited number of consecutive unexpected tokens (one, IIRC) and try to resynchronize with the input token stream. In this case it will succeed in resynchronizing, first with an A token and second with the EOF token.
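As a loose analogy only (this is not ANTLR code), the effect of the trailing EOF resembles the difference between matching a prefix and matching the whole string with a regular expression. In the sketch below, the regex stands in for the rule a+, and the function names are hypothetical.

```python
import re

# Stand-in for the parser rule "a+"; a regex is used purely as an analogy.
RULE = re.compile(r"a+")

def parses_prefix(text):
    # Like a start rule without EOF: succeeds as soon as a valid prefix is found.
    return RULE.match(text) is not None

def parses_all(text):
    # Like "a+ EOF": succeeds only if the whole input is consumed.
    return RULE.fullmatch(text) is not None
```

With the input "ab", parses_prefix accepts it while parses_all rejects it, which is the behaviour the question asks for.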
Optionally, use
Parser.addErrorListener(...) to add your own error reporter (extend BaseErrorListener)
Parser.setErrorHandler(...) to add your own error recovery strategy (extend DefaultErrorStrategy)
If I remember correctly, you can use actions inside your ANTLR grammar, which will look something like:
grammar Expr;
prog: a b;
a: 'a';
b: 'b' {throw new RuntimeException();};
This throws an exception after the parser has matched b. An unchecked exception is used because the generated Java rule method does not declare checked exceptions. Instead of throwing, you can also print out some debugging information.

context sensitive lexing & parsing

I have a set of files to parse that have rather unusual contents. Each line contains the following data (with all the contents of no relevance here removed):
DATA < Alphabetic data> < numeric data>< Period as terminator>
and then we could have text, which may be as follows:
TEXT < Alphanumeric data + punctuation (including period)>
I am having a problem parsing the DATA line, as the numeric data can be of any of the following types:
99.9
99.
.9
Especially to parse data like:
DATA ANACRON99..
The first dot at the end is the decimal point and second is the terminator
A sample of the grammar I tried, copying just the relevant portion is as follows:
file: lines+ EOF
;
lines: data_line
| text_line
;
text_line: TEXT TEXTUALDATA
;
data_line: DATA sensordata
;
sensordata: DATA FLOATVALUE PERIOD
;
TEXT:'TEXT';
DATA: 'DATA' ->mode(SENSORMODE);
TEXTUALDATA: (.)*?
;
mode SENSORMODE;
FLOATVALUE: ([0-9])*('.')([0-9])*
;
WS:[ \t]->skip
;
WS2:[\r\n]
;
PERIOD:'.' ->mode(DEFAULT_MODE)
;
This detects the first period as part of the float data, but completely ignores the second and complains that it was expecting PERIOD but found EOF. What could be a way to solve this? Is there any way to look ahead and at the same time keep track of the last token detected?
Thanks!!
FLOATVALUE can match a single period and it is listed before PERIOD. So, the lexer is matching two FLOATVALUEs in series.
To avoid the ambiguity, change FLOATVALUE to:
FLOATVALUE: ([0-9])+('.')([0-9])+
| ([0-9])+('.')
| ('.')([0-9])+
;
To keep FLOATVALUE from matching your PERIOD, you can swap the order of your lexer rules:
mode SENSORMODE;
PERIOD:'.' ->mode(DEFAULT_MODE)
;
FLOATVALUE: ([0-9])*('.')([0-9])*
;
WS:[ \t]->skip
;
WS2:[\r\n]
;
The ANTLR lexer always returns the token type of the rule with the longest match, and when several rules match the same length it returns the type of the rule listed first. The modified grammar therefore moves PERIOD to the first position.
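The longest-match rule with first-rule tie-breaking can be sketched with a small simulation. This is a Python approximation, not the ANTLR lexer: it ignores lexer modes, the tokenize function and the rule lists are hypothetical, and it relies on the fact that for these particular patterns Python's greedy regex match is also the longest one.

```python
import re

def tokenize(text, rules):
    """Maximal-munch tokenizer: the longest match wins, the earlier rule wins ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater, so the first-listed rule keeps a same-length tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# FLOATVALUE as written in the question: optional digits, a dot, optional digits.
FLOAT = r"[0-9]*\.[0-9]*"
float_first = [("FLOATVALUE", FLOAT), ("PERIOD", r"\.")]
period_first = [("PERIOD", r"\."), ("FLOATVALUE", FLOAT)]
```

For the input "99..", the float_first ordering yields two FLOATVALUE tokens, while period_first yields FLOATVALUE followed by PERIOD, because the lone final dot is a same-length tie that goes to whichever rule is listed first.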

incremental change the pattern in vim

Is it possible to search for a pattern like some_name{number}, for example r001, and then increase the number in each successive search result by one? For example, for this input:
r001
...
r001
...
r001
...
Desired output will be:
r002
...
r003
...
r004
The following replaces each number at the end of a line, throughout the file, with an incrementing number starting from two.
let i=2|g/\v\d+$/s//\=i/|let i=i+1
Output
r2
...
r3
...
r4
...
Breakdown
let i=2 : Initializes variable i to 2
g/\v\d+$/ : a global command to search for numbers at the end of lines
s//\=i/ : a substitute command to replace the search matches with
the contents of i
let i=i+1 : increment i for the next match
All that's left to do is incorporate the commands to pad with zeros:
Vim: padding out lines with a character
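For comparison, the same counter-based replacement, including the zero padding, can be sketched outside Vim. This is a Python illustration of the logic, not a Vim command; the function name renumber and its start and width parameters are hypothetical.

```python
import re

def renumber(text, start=2, width=3):
    """Replace the number at the end of each line with an incrementing, zero-padded counter."""
    counter = start

    def repl(match):
        nonlocal counter
        value = counter
        counter += 1  # next match gets the next number
        return f"{value:0{width}d}"  # pad with zeros to the given width

    # re.MULTILINE makes $ match at the end of every line, like \d+$ under :g in Vim.
    return re.sub(r"\d+$", repl, text, flags=re.MULTILINE)
```

Lines without a trailing number are left untouched, so only the r001 lines from the example change.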

Move lines matched by :g to the top of the file

I have a large text file with several calls to a specific function method_name.
I've matched them using :g/method_name.
How would I move them to the top of the file (with the first match being on the top)?
I tried :g/method_name/normal ddggP but that reverses the order. Is there a better way to directly cut and paste all the matching lines, in order?
Example input file:
method_name 1
foo
method_name 2
bar
method_name 3
baz
Example output file:
method_name 1
method_name 2
method_name 3
foo
bar
baz
How about trying it the other way around: moving the un-matched lines to the bottom:
:v/method_name/normal ddGp
This seems to achieve what you want.
I think you can achieve the desired result by first creating a variable assigned to 0:
:let i=0
And then executing this command:
:g/method_name/exec "m ".i | let i+= 1
It basically calls :m, passing the value of i as the address, and then increments that value by one so it can be used for the next match. Seems to work.
Of course, you can delete the variable when you don't need it anymore:
:unlet i
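The reason the counter preserves the order can be seen in a small model of the same idea. This is a Python sketch, not Vim script: the list stands in for the buffer, insert_at plays the role of the variable i, and the function name is hypothetical.

```python
def move_matches_to_top(lines, needle):
    """Move every line containing needle to the top, keeping the original order."""
    result = list(lines)
    insert_at = 0  # plays the role of the Vim variable i
    for index, line in enumerate(lines):
        if needle in line:
            # Earlier moves only rearranged lines above this one, so the line
            # is still at the same index in result as in the original list.
            result.insert(insert_at, result.pop(index))
            insert_at += 1  # the next match goes one line further down
    return result
```

Always inserting at position 0 instead of insert_at would reproduce the reversed order seen with normal ddggP.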
If the file is really large, the number of matching lines is small, and you don't want to move nearly the entire file around with the v/<pattern>/ m$ solution, you can do this:
Pick any mark you don't care about, say 'k. Now the following key sequence does what you want:
ggmk:g/method_name/ m 'k-1
ggmk marks first line with 'k.
m 'k-1 moves matching line to 1 line before the 'k mark (and mark moves down with the line it is attached to).
This will only move a few matching lines, not the entire file.
Note: this somehow works even if the first line contains the pattern -- and I don't have an explanation for that.
For scripts:
normal ggmk
g/method_name/ m 'k-1
