Optional token at end of file - predicate

I am trying to write a grammar to parse a file where blank lines indicate the end of a block. I have a grammar similar to this, which almost works:
file : block+ EOF;
block : line+ NL;
line : stuff NL;
NL : '\r'? '\n';
This works except that the last block sometimes does not have an extra newline. Is there a good way to make the NL at the end of block optional when I am at the end of the file?
In antlr3, I would have done
block : line+ (NL | (EOF) => /* empty */ )
However, antlr4 does not have syntactic predicates, so I can't do that.
block : line+ NL? ;
should work, but then a block in the middle of the file could also omit its final newline. I don't think that leads to a misparse here, since a block can only be followed by another block: a block without the trailing newline followed by another block looks like one single block, and the parser will greedily combine them. However, it makes the actual structure less clear, and I can certainly imagine more complicated source file formats where this causes a problem.
Is there a good way to solve this?

Try something like this:
file : NL* block (NL+ block)* NL* EOF;
block : line (NL line)*;
line : stuff;
NL : '\r'? '\n';
Or simply append a line break at the end of your input.
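If you go the second route with the grammar from the question (where every line ends in NL and every block ends in one more NL), the input has to end in exactly one blank line. Here is a minimal Python sketch of such a normalization step, assuming the whole file fits in memory. The name ensure_final_blank_line is made up for illustration and is not part of ANTLR.

```python
def ensure_final_blank_line(text: str) -> str:
    """Return text ending in exactly one blank line (hypothetical helper)."""
    # Drop whatever mix of '\r' and '\n' currently ends the input.
    body = text.rstrip("\r\n")
    if not body:
        # Nothing but line breaks: there is no block to close.
        return ""
    # The first '\n' ends the last line, the second one ends the last block.
    return body + "\n\n"
```

You would run this on the file contents before handing them to the lexer. Note that it also collapses several trailing blank lines into one.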

At least one space around parentheses in ANTLR4

I want to require spaces around the parentheses in an IF condition; at least one space is required. But when I use SPACE in the grammar, it throws an error as soon as I use an ELSE block with it. Please help me accomplish this. I have seen many examples, but none of them covers this case.
I only need spaces around the parentheses of the IF condition.
prog: stat_block EOF;
stat_block: OBRACE block CBRACE;
block: (stat (stat)*)?;
stat: expr ';'
| IF condition_block (ELSE stat_block)?
;
expr
: expr SPACE ('*' | '/') SPACE expr
| ID
| INT
| STRING
;
exprList: expr (',' expr)*;
condition_block: SPACE OPAR SPACE expr SPACE CPAR SPACE stat_block;
IF: 'IF';
ELSE: 'ELSE';
OPAR: '(';
CPAR: ')';
OBRACE: '{';
CBRACE: '}';
SPACE: SINGLE_SPACE+;
SINGLE_SPACE: ' ';
ID: [a-zA-Z]+;
INT: [0-9]+;
NEWLINE: '\r'? '\n' -> skip;
WS: [ \t]+ -> skip;
Expected input to parse
IF ( 3 ) { } ELSE { }
Current Input
There's a reason that almost all languages ignore whitespace. If you don't ignore it, then you have to deal with its possible existence in the token stream anywhere it might, or might not, be in ALL of your parser rules.
You can try to include the spaces in the Lexer rules for tokens that you want wrapped in spaces, but may still find surprises.
Suggestion: Instead of -> skip; for your WS rule, use -> channel(HIDDEN); This keeps the tokens in the token stream so you can look for them in your code, but "hides" the whitespace tokens from the parser rules. This also allows ANTLR to get a proper interpretation of your input and build a parse tree that represents it correctly.
If you REALLY want to insist on the spaces before/after, you can write code in a listener that looks before/after the tokens in the input stream to see if you have whitespace, and generate your own error (that can be very specific about your requirement).
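The idea of checking the neighbours of a token after lexing can be sketched without the ANTLR runtime. The Python below is only a stand-in: the regex tokenizer plays the role of a lexer that keeps its WS tokens in the stream (as channel(HIDDEN) would), and missing_space_errors is a hypothetical check that looks at the token before and after each parenthesis. The token names and the message wording are assumptions, and the sketch checks every parenthesis, not only those of an IF condition.

```python
import re

# Stand-in lexer: whitespace is kept as a token instead of being skipped.
TOKEN_RE = re.compile(
    r"(?P<WS>[ \t]+)|(?P<OPAR>\()|(?P<CPAR>\))|(?P<OTHER>[^ \t()]+)"
)

def tokenize(text):
    """Return (kind, text) pairs, including the whitespace tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def missing_space_errors(tokens):
    """Report each parenthesis that lacks whitespace on either side."""
    errors = []
    for i, (kind, _text) in enumerate(tokens):
        if kind not in ("OPAR", "CPAR"):
            continue
        before = tokens[i - 1][0] if i > 0 else None
        after = tokens[i + 1][0] if i + 1 < len(tokens) else None
        if before != "WS":
            errors.append(f"token {i}: expected a space before {kind}")
        if after != "WS":
            errors.append(f"token {i}: expected a space after {kind}")
    return errors
```

With the real runtime you would do the same neighbour lookup on the token stream from inside a listener, which lets the grammar itself stay free of SPACE tokens.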
At least one space is required.
Then you either:
cannot -> skip the WS rule, which turns all spaces and tabs into tokens that your parser rules must handle correctly (which is likely going to become a complete mess in your parser rules!), or
leave WS -> skip as-is, but include the spaces in your parenthesis rules: OPAR : ' ( '; CPAR : ' ) '; (or with tabs as well, if those are allowed)

Antlr4: Skip line when it start with * unless the second char is

In my input, a line starting with * is a comment line unless it starts with *+ or *-. I can ignore the comments but need to get the others.
These are my lexer rules:
WhiteSpaces : [ \t]+;
Newlines : [\r\n]+;
Comment : '*' .*? Newlines -> skip ;
SkipTokens : (WhiteSpaces | Newlines) -> skip;
An example:
* this is a comment line
** another comment line
*+ type value
So, the first two are comment lines, and I can skip them. But I don't know how to define a lexer/parser rule that can catch the last line.
Your SkipTokens lexer rule will never be matched because the rules WhiteSpaces and Newlines are placed before it. See this Q&A for an explanation how the lexer matches tokens: ANTLR Lexer rule only seems to work as part of parser rule, and not part of another lexer rule
For it to work as you expect, do this:
SkipTokens : (WhiteSpaces | Newlines) -> skip;
fragment WhiteSpaces : [ \t]+;
fragment Newlines : [\r\n]+;
To see what a fragment is, check this Q&A: What does "fragment" mean in ANTLR?
Now, for your question. You defined a Comment rule to always end with a line break. This means that there can't be a comment at the end of your input. So you should let a comment either end with a line break or the EOF.
Something like this should do the trick:
COMMENT
: '*' ~[+\-\r\n] ~[\r\n]* // a '*' must be followed by something other than '+', '-' or a line break
| '*' ( [\r\n]+ | EOF ) // a '*' is a valid comment if directly followed by a line break, or the EOF
;
STAR_MINUS
: '*-'
;
STAR_PLUS
: '*+'
;
SPACES
: [ \t\r\n]+ -> skip
;
This, of course, does not require the * to be at the start of the line. If you want that, check out this Q&A: Handle strings starting with whitespaces
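To sanity-check which lines should end up as which token, the same decision can be written down as a small Python sketch. It is line-based, so unlike the lexer rules it does assume the * sits at the very start of the line. The function names classify and significant_lines are made up for illustration.

```python
def classify(line: str) -> str:
    """Name the token kind a line would start with (hypothetical helper)."""
    if line.startswith("*+"):
        return "STAR_PLUS"
    if line.startswith("*-"):
        return "STAR_MINUS"
    if line.startswith("*"):
        # Any other '*' line, including a lone '*', is a comment.
        return "COMMENT"
    return "OTHER"

def significant_lines(text: str) -> list:
    """Drop the comment lines and keep everything else."""
    return [ln for ln in text.splitlines() if classify(ln) != "COMMENT"]
```

Running significant_lines on the three example lines from the question keeps only the "*+ type value" line.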

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse a definition, but I've been struggling to get Antlr to cooperate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {<EOF>, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
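The tie-breaking described above is easy to reproduce outside ANTLR. The following Python sketch is a toy lexer, not ANTLR's implementation: it tries every rule at the current position, keeps the longest match, and on a tie keeps the rule listed first. The rule table mirrors the grammar from the question. COLON is a made-up name for the implicit ':' token.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

# Rules in grammar order; the three *_NAME rules share one pattern on purpose.
RULES = [
    ("TYPE_KEYWORD", re.compile(r"type")),
    ("TYPE_NAME", re.compile(IDENT)),
    ("PROPERTY_NAME", re.compile(IDENT)),
    ("PROPERTY_TYPE", re.compile(IDENT)),
    ("COLON", re.compile(r":")),
    ("NEWLINE", re.compile(r"\r?\n")),
    ("WS", re.compile(r"[ \t]")),
]

def tokenize(text):
    """Longest match wins; among equally long matches the first rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule is never displaced by a tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mirrors WS -> skip
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

Feeding it the sample input yields TYPE_NAME for every identifier and never PROPERTY_NAME or PROPERTY_TYPE, which matches the token dump shown earlier.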

Using getCharPositionInLine() with leading spaces in ANTLR4

I am writing grammar for a script which is based on VBScript.
In the script, the variable assignment is done in the usual manner of i=10 and in addition a variation with: Set i=10
The method calls can be done in several ways along with calling methods on objects, like:
Another(10).Call(20).Chain(30)
I consider 'Set' a keyword in my grammar. However, in some predefined classes, the developer is allowed to name a method 'Set', so there may be calls like (let me mark this as line A):
Another(10).Call(20,30).Set 40,50
my grammar:
definition: body EOF;
body: NL_WS* bodyElement NL_WS*;
bodyElement: statement (NL_WS+ statement)* ;
statement: assignment | chainCall;
assignment: (START_SET)? IDENTIFIER WS? EQUALS WS? (chainCall | VALID_NUMBER) ;
chainCall: methodCall ('.' methodCall)* ;
methodCall: IDENTIFIER WS? LPAREN? WS? argumentList? WS? RPAREN?;
argumentList: VALID_NUMBER (WS? COMMA WS? VALID_NUMBER)* ;
START_SET: 'Set' WS;
VALID_NUMBER: [1-9] NUMBER? ;
IDENTIFIER: LETTER LETTER_OR_DIGIT*;
LETTER: [a-zA-Z_];
NUMBER: [0-9];
LETTER_OR_DIGIT: [a-zA-Z0-9_];
EQUALS: '=' ;
LPAREN: '(';
RPAREN: ')';
COMMA: ',';
NL_WS: WS? NEWLINE WS?;
NEWLINE: [\r\n];
WS: [ \t]+;
This fails in what I have marked as line A (where Set is a method call inside an object):
line 10:24 mismatched input 'Set ' expecting IDENTIFIER
1) I am not able to understand why. My thinking is that since (START_SET)? appears at the beginning of the assignment rule, Set should only be expected at the beginning of a statement, and so the method call at the end should match IDENTIFIER.
2) When I try with getCharPositionInLine, like:
START_SET: {getCharPositionInLine() == 0}? 'Set' WS;
it works fine, but then I have to deal with another problem: there may be leading whitespace before the 'Set' assignment, like:
' Set k=10'
and in such cases, it fails saying:
line 16:8 mismatched input 'k' expecting {<EOF>, '.', NL_WS}
(in this case, I think it matches with chainCall and not assignment which is understandable as it is not the first character in line).
So, is there an alternate method which will be like 'first character in line minus spaces'?
I also tried START_SET: {getCharPositionInLine() == 0}? WS? 'Set' WS;
thinking that the initial WS? will cover the first character in line, but I get the same error.
Any help is appreciated.
I found a way to deal with this issue. I wrote a preprocessor that strips all the leading spaces from each line and then has the result parsed by ANTLR. This way I could use {getCharPositionInLine() == 0} successfully.
Also, this helped in keeping the grammar simpler.
HTH.
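For anyone wanting to try the same approach, such a preprocessor can be very small. This is a minimal Python sketch, not the code I actually used. The name strip_leading_whitespace is made up, and it assumes only spaces and tabs count as leading whitespace.

```python
def strip_leading_whitespace(source: str) -> str:
    """Remove leading spaces and tabs from every line, keeping line endings."""
    # keepends=True preserves each line's original '\n' or '\r\n'.
    return "".join(
        line.lstrip(" \t") for line in source.splitlines(keepends=True)
    )
```

One thing to keep in mind: column numbers in ANTLR's error messages then refer to the stripped text, not to the original file.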

How do I get multi-line string between two braces containing a specific search string?

I'm looking for a quick and easy one-liner to extract all brace-delimited text-blocks containing a search string from a text file. I've just about googled myself crazy on this, but everyone seems to be only posting about getting the text between braces without a search string.
I've got a large text file with contents like this:
blabla
blabla {
blabla
}
blabla
blabla {
blabla
blablaeventblabla
}
blabla
The vast majority of bracketed entries do not contain the search string, which is "event".
What I am trying to extract are all text (especially including multi-line matches) between each set of curly braces, but only if said text also contains the search string. So output like this:
blabla {
blabla
blablaeventblabla
}
My Linux shell is /usr/bin/bash. I've been trying various grep and awk commands, but just can't get it to work:
awk '/{/,/event/,/}/' filepath
grep -iE "/{.*event.*/}" filepath
I was thinking this would be really easy, as it's a common task. What am I missing here?
This gnu-awk should work:
awk -v RS='[^\n]*{|}' 'RT ~ /{/{p=RT} /event/{ print p $0 RT }' file
blabla {
blabla
blablaeventblabla
}
RS='[^\n]*{|}' sets the input record separator to either the text of a line up to and including a {, or a }. RT is the gawk variable that holds the text that actually matched the RS regex for the current record.
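If gawk is not available, the same record-splitting idea can be imitated in Python. This is a sketch of the technique rather than a translation that is guaranteed to agree with gawk in every corner case. re.split with a capturing group returns the records and the separators alternately, which stands in for $0 and RT. The function name blocks_containing is made up.

```python
import re

def blocks_containing(text: str, needle: str) -> list:
    """Return every brace block whose body contains needle."""
    # Even indexes are records ($0), odd indexes are separators (RT).
    parts = re.split(r"([^\n]*\{|\})", text)
    records = parts[0::2]
    terminators = parts[1::2] + [""]  # the last record has an empty RT
    opener = ""
    found = []
    for record, rt in zip(records, terminators):
        if "{" in rt:
            opener = rt  # remember the most recent opening line
        if needle in record:
            found.append(opener + record + rt)
    return found
```

Like the awk one-liner, it does not handle nested braces, and it assumes the search string only occurs inside a block.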
User 999999999999999999999999999999 had a nice answer using sed which I really liked, unfortunately their answer appears to have disappeared for some reason.
Here it is for those who might be interested:
sed '/{/{:1; /}/!{N; b1}; /event/p}; d' filepath
Explanation:
/{/         if the current line contains { then execute the next block
{           start block
:1;         label for code to jump to
/}/!        if the pattern space does not contain } then execute the next block
{           start block
N;          add the next line to the pattern space
b1          jump to label 1
};          end block
/event/p    if the pattern space contains the search string, print it
            (at this point the pattern space contains a full block of lines
            from { to })
};          end block
d           delete the pattern space
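The control flow of that sed script may be easier to follow as an ordinary loop. This Python sketch mirrors the steps above one by one. The function name blocks_with is made up. It also silently drops a block that is still open when the input ends, which is an assumption of the sketch and may differ from what sed does in that case.

```python
def blocks_with(lines, needle):
    """Collect each { ... } run of lines that contains needle."""
    printed = []
    it = iter(lines)
    for line in it:
        if "{" not in line:
            continue  # 'd': lines outside a block are discarded
        pattern_space = line
        complete = "}" in pattern_space
        while not complete:
            nxt = next(it, None)  # 'N': pull in the next input line
            if nxt is None:
                break  # input ended inside a block
            pattern_space += "\n" + nxt
            complete = "}" in pattern_space
        if complete and needle in pattern_space:
            printed.append(pattern_space)  # '/event/p'
    return printed
```

Call it with the file's lines without their trailing newlines, for example the result of str.splitlines().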
Here is a modified version of this gem from 'leu' (thanks, leu, for enlightening us). It does something very similar: it extracts everything between a line beginning with 'DEC::PKCS7[' and a line ending with ']!':
cat file | sed '/^DEC::PKCS7\[/{s///; :1; /\]\!$/!{N; b1;}; s///;};'
Explanation:
/^DEC::PKCS7\[/ # if current line begins with 'DEC::PKCS7[' then execute next block
{ # start block
s///; # remove all up to 'DEC::PKCS7['
:1; # label '1' for code to jump to
/\]\!$/! # if the line does not end with ']!' then execute next block
{ # start block
N; # add next line to pattern space
b1; # jump to label 1
}; # end block
s///; # remove all from ']!' to end of line
}; # end block
Notes:
This works on single-line and multi-line input.
This will have unexpected behavior if ']!' appears in the middle of the input.
This does not answer the question, which is already answered very well.
My intention is just to help with other cases.
