I have an InputStream that contains repeating chunks like this:
fld1:val1
fld2:val2

[A B C D]
[E F]

fld1:val3
fld2:val4

[M N]
[Q S T Y]

fld1:val5
...
I wish to construct a solution where I can parse the fld:val block, skip the blank line separator, then parse the "listy" part, then stop parsing at the next blank line and reset the parser on the same open stream to process the next chunk.

I was thinking I might be able to do this in my override of the exitListy callback in the generated base listener class, by getting access to the parser and calling reset(). Ideally, this would end the call chain to ParseTree t = parser.parse() and let control return to the code immediately following parse(). I experimented with this and, somewhat predictably, got a null pointer exception here:

org.antlr.v4.runtime.Parser.exitRule(Parser.java:639)

I cannot change the format of the input stream, like inserting snip-here markers or anything like that.
(Completely new answer based on comment)
Listeners operate on ParseTrees returned once a parse completes. In your case, it appears you'll be listening on an essentially unending stream and want data back periodically.
I'd highly recommend "The Definitive ANTLR 4 Reference" from Pragmatic Programmers.
There are two very pertinent sections:
"Making Things Happen During the Parse"
"Unbuffered Character and Token Streams"
For your grammar, try something akin to the following "rough draft" (it may not report back exactly when you want, but it should give you an idea to work with):
grammar Streaming;
@parser::members {
    java.util.function.Consumer<MyData> consumer;
    MyData myData = new MyData();

    public StreamingParser(TokenStream input, java.util.function.Consumer<MyData> consumer) {
        this(input);
        this.consumer = consumer;
    }
}
stream : (fldLine+ emptyLine listLine+ emptyLine)* EOF;

fldLine
    : fld=ITEM COLON val=ITEM EOL {
          // record $fld / $val in myData
      };

listLine
    : O_BRACKET (items+=ITEM)* C_BRACKET EOL {
          // record $items in myData
      };

emptyLine
    : EOL {
          consumer.accept(myData);
          myData = new MyData(); // reset myData for the next chunk
      };

O_BRACKET : '[';
C_BRACKET : ']';
EOL       : '\n';
COLON     : ':';
ITEM      : [a-zA-Z][a-zA-Z0-9]*;
SPACE     : ' ' -> skip;
This takes advantage of embedded actions that are described in the first section.
Then the second section describes how to use Unbuffered streams.
Something like this (untested; much lifted directly from the referenced book)
CharStream input = new UnbufferedCharStream(<your stream>);
StreamingLexer lex = new StreamingLexer(input);
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
StreamingParser parser = new StreamingParser(tokens,
// This lambda will handle data reported back when a blank line is encountered
myData -> handle(myData));
// You just want ANTLR reporting back periodically
// not building a giant parse tree
parser.setBuildParseTree(false);
parser.stream(); // won't return until you shut down the input stream
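MyData and handle are left open above; a minimal sketch (the field names here are purely illustrative) could be:

import java.util.*;

public class MyData {
    // one entry per "fld:val" line in the current chunk
    public final Map<String, String> fields = new LinkedHashMap<>();
    // one entry per "[ ... ]" line in the current chunk
    public final List<List<String>> lists = new ArrayList<>();
}

handle(MyData) is then whatever your application wants to do with a completed chunk.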
I have a grammar file written in ANTLR2 syntax and need help understanding how to rewrite some of the parser rules in ANTLR4 syntax. I know ANTLR4 eliminated the need for building ASTs, so I'm not sure what to do with the rules that are AST action translations. ANTLR Tree Construction explains some of the syntax and how to use the # construct, but I'm still unsure how to read these rules and rewrite them.
temp_root :
temp { #temp_root = #([ROOT, "root"], #temp_root); } EOF;
temp :
c:temp_content
{ #temp = #(#([FUNCTION_CALL, "template"], #template), c);
reparent((MyAST)#temp, (MyAST)#c); };
temp_content :
(foo | bar);
foo
{
StringBuilder result = new StringBuilder("");
}
: (c:FOO! { result.append(c.getText()); } )+
{ #foo = #([TEMPLATE_STRING_LITERAL, result.toString()], #foo); };
bar :
BEGIN_BAR! expr END_BAR!
exception
catch [Exception x] {
bar_AST = handleException(x);
};
You cannot manipulate the produced parse tree (at least not with grammar code), so simply remove all the tree rewriting stuff (you may have to adjust consumer code if it relies on a specific tree structure). Also remove the exclamation marks (which denote a token that should not appear in the AST). A surprise is the c:FOO part. I can't remember ever having seen this, but judging from the following action code I guess it's a variable assignment and should be rewritten as c = FOO.
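To illustrate, here is roughly what those rules could look like in ANTLR4 once the tree construction is dropped (a sketch: the imaginary ROOT/FUNCTION_CALL/TEMPLATE_STRING_LITERAL tokens and the reparent call simply go away, and expr, FOO, BEGIN_BAR and END_BAR are assumed to be defined elsewhere in your grammar):

temp_root
    : temp EOF
    ;

temp
    : temp_content
    ;

temp_content
    : foo
    | bar
    ;

foo
locals [StringBuilder result = new StringBuilder()]
    : (c=FOO { $result.append($c.text); })+   // the old c:FOO label becomes c=FOO
    ;

bar
    : BEGIN_BAR expr END_BAR
    ;
    catch [RecognitionException x] { /* e.g. call your handleException(x) here */ }

Whether you still need the StringBuilder at all depends on your consumer; with a parse tree you can also just read the matched FOO tokens from the rule's context object.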
In a small DSL, I'm parsing macro definitions, similarly to #define C pre-processor directives (here a simplistic example):
_def mymacro(a,b) = a + b / a
When the following call is encountered by the parser
c = mymacro(pow(10,2),3)
it is expanded to
c = pow(10,2) + 3 / pow(10,2)
My current approach is:
wrap the parser in a State monad
when parsing macro definitions, store them in the state, with their body unparsed (parse it as a string)
when parsing a macro call, find the definition in the state, replace the arguments in the body text, replace the call with this body and resume the parsing.
Some code from the last step:
macrocallStmt
    = do -- capture starting position and content of old input before macro call
         oldInput <- getInput
         oldPos   <- getPosition
         -- parse the call
         ret  <- identifier
         symbolCS "="
         i    <- identifier
         args <- parens $ commaSep anyExprStr
         -- expand the macro call
         us <- get
         let inlinedCall = replaceMacroArgs i args ret us
         -- set up new input with macro call expanded
         remainder <- getInput
         let newInput = T.append inlinedCall (T.cons '\n' remainder)
         setPosition oldPos
         setInput newInput
         -- update the expanded input script
         modify (updateExpandedInput oldInput newInput)
anyExprStr = fmap praShow expression <|> fmap praShow algexpr
This approach does the job decently. However, it has a number of drawbacks.
Parsing multiple times
Any valid DSL expression can be an argument of the macro call. Therefore, even though I only need their textual representation (to be substituted into the macro body), I need to parse them and then convert them back to strings; simply looking for the next comma wouldn't work. Then the complete, customised macro body is parsed. So in practice, macro arguments get parsed twice (and also show-ed, which has its cost). Moreover, each call requires a new parse of the (almost identical) body. The reason for keeping the body unparsed in memory is to allow maximum flexibility: in the body, even DSL keywords could be constructed out of the macro arguments.
Error handling
Because the expanded body is inserted in front of the unconsumed input (replacing the call), the initial and final input can be quite different. In the event of a parse error, the position where the error occurred in the expanded input is available. However, when processing the error, I only have the original, not expanded, input. So the error position won't match.
That is why, in the code snippet above, I use the state to save the expanded input, so that it is available when the parser exits with an error.
This works well, but I noticed that it becomes quite costly, with new Text arrays (the input stream is Text) being allocated for the whole stream at every expansion. Perhaps keeping the expanded input in the state as String, rather than Text, would be cheaper in this case, i.e. when a middle part needs to be replaced?
The reasons for this question are:
I would appreciate suggestions / comments on the two issues described above
Can anyone suggest a better approach altogether?
NB. I'm using this Alex template from Simon Marlow.
I'd like to create a lexer for C-style comments. My current approach creates separate tokens for the comment start, the comment end, the middle pieces, and one-line comments:
%wrapper "monad"
tokens :-
<0> $white+ ;
<0> "/*" { mkL LCommentStart `andBegin` comment }
<comment> . { mkL LComment }
<comment> "*/" { mkL LCommentEnd `andBegin` 0 }
<0> "//" .*$ { mkL LSingleLineComment }
data LexemeClass
= LEOF
| LCommentStart
| LComment
| LCommentEnd
| LSingleLineComment
How can I reduce the number of middle tokens? For the input /*blabla*/ I get 8 tokens instead of one!
How can I strip the // part from the single-line comment token?
Is it possible to lex comments without the monad wrapper?
Have a look at this:
http://lpaste.net/107377
Test with something like:
echo "This /* is a */ test" | ./c_comment
which should print:
Right [W "This",CommentStart,CommentBody " is a ",CommentEnd,W "test"]
The key alex routines you need to use are:
alexGetInput -- gets the current input state
alexSetInput -- sets the current input state
alexGetByte -- returns the next byte and input state
andBegin -- return a token and set the current start code
Each of the routines commentBegin, commentEnd and commentBody has the following signature:
AlexInput -> Int -> Alex Lexeme
where Lexeme stands for your token type. The AlexInput parameter has the following form (for the monad wrapper):
(AlexPosn, Char, [Bytes], String)
The Int parameter is the length of the match; the matched text sits at the front of the String field. Therefore the form of most token handlers will be:
handler :: AlexInput -> Int -> Alex Lexeme
handler (pos,_,_,inp) len = ... do something with (take len inp) and pos ...
In general it seems that a handler can ignore the Char and [Bytes] fields.
The handlers commentBegin and commentEnd can ignore both the AlexInput and Int arguments because they just match fixed length strings.
The commentBody handler works by calling alexGetByte to accumulate the comment body until "*/" is found. As far as I know C comments may not be nested so the comment ends at the first occurrence of "*/".
Note that the first character of the comment body is in the match0 variable. In fact, my code has a bug in it since it will not match "/**/" correctly. It should look at match0 to decide whether to start at loop or loopStar.
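For reference, such a body handler is roughly along these lines (a simplified sketch, not the paste verbatim; it shares the /**/ caveat just mentioned and assumes a CommentBody constructor on your Lexeme type):

-- Sketch of a comment-body handler; assumes a token type like
--   data Lexeme = ... | CommentBody String | ...
-- It accumulates characters until just before "*/", which is left in the
-- input for the CommentEnd rule to match.
commentBody :: AlexInput -> Int -> Alex Lexeme
commentBody (_, _, _, str) len = do
    inp <- alexGetInput
    go (reverse (take len str)) inp       -- the matched char(s) start the body
  where
    toChar b = toEnum (fromIntegral b)
    peekSlash i = case alexGetByte i of
                    Just (b, _) -> toChar b == '/'
                    Nothing     -> False
    go acc inp = case alexGetByte inp of
        Nothing -> stop acc inp           -- EOF also ends the body
        Just (b, inp')
            | toChar b == '*' && peekSlash inp' -> stop acc inp
            | otherwise                         -> go (toChar b : acc) inp'
    stop acc inp = do
        alexSetInput inp
        return (CommentBody (reverse acc))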
You can use the same technique to parse "//" style comments - or any token where a non-greedy match is required.
Another key point is that patterns like $white+ are qualified with a start code:
<0>$white+
This is done so that they are not active while processing comments.
You can use another wrapper, but note that the structure of the AlexInput type may be different -- e.g. for the basic wrapper it is just a 3-tuple: (Char,[Byte],String). Just look at the definition of AlexInput in the generated .hs file.
A final note... accumulating characters using ++ is, of course, rather inefficient. You probably want to use Text (or ByteString) for the accumulator.
I have to parse parts of a file where I have expressions like:
garbage garbage garbage
BEGIN <something> END
garbage garbage...
Here, I want to parse everything between BEGIN and END, keeping the garbage aside.
I tried to write a parser which has rules like:
rule : BEGIN expr END;
expr : ... ;
which correctly parses my expression if it's the only thing in the file. Sadly, when I kick off the parser at a "BEGIN" in my file, it correctly parses the expression but then tries to fetch more tokens after the "END".
I have read the part about fuzzy grammars in the ANTLR4 book, but this is not what I want, because the result of the parsing will impact the rest of the file (basically, the result of the parsing will produce a set of substitutions to apply to the following text).
What I'm looking for is a way to tell the parser to stop after the "END" keyword. I have tried to override the TokenStream to produce a Token.EOF when END is met, with this modified rule set :
rule : BEGIN expr EOF;
expr : ... ;
with code like:
public Token LT(int k)
{
    Token token = super.LT( k );
    if ( token.getType() == MyParser.END )
    {
        token = new CommonToken(Token.EOF, "");
    }
    return token;
}
but in this case the stream is closed, and I can no longer use it for the rest of the file...
You could make a special mode in your lexer to consume the garbage as a single GARBAGE token. In the following example code, I have GarbageMode as the separate mode, which requires you to explicitly call lexer.setMode(GarbageMode) after creating a new instance of the lexer. The alternative is placing the Garbage and GarbageMode_BEGIN rules in the default mode and moving the remaining rules from the default mode into a new mode, e.g. MainMode.
BEGIN
: 'BEGIN'
;
END
: 'END' -> mode(GarbageMode)
;
mode GarbageMode;
GARBAGE
: .+? (BEGIN | EOF) -> mode(DEFAULT_MODE)
;
GarbageMode_BEGIN
: BEGIN -> type(BEGIN), mode(DEFAULT_MODE)
;
The key to making the above lexer work is overriding the Lexer.emit method to reset the input stream position prior to creating a GARBAGE token. An example of this is available in PositionAdjustingLexer.g4, with the corresponding unit test testPositionAdjustingLexer(). In your case, you'll simply remove the last 5 characters from the GARBAGE token if the text of the token ends with BEGIN.
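An untested sketch of what that override could look like, placed in a @members block of the lexer grammar (the "last 5 characters" being the length of BEGIN):

@members {
@Override
public Token emit() {
    if (getType() == GARBAGE) {
        String text = getText();
        if (text.endsWith("BEGIN")) {
            // Hand the trailing BEGIN back to the input so it is lexed as its own
            // token in DEFAULT_MODE, and keep only the garbage in this token's text.
            getInputStream().seek(getCharIndex() - "BEGIN".length());
            setText(text.substring(0, text.length() - "BEGIN".length()));
        }
    }
    return super.emit();
}
}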
I want to write a simple parser for a nested block syntax, just hierarchical plain-text. For example:
Some regular text.
This is outputted as-is, foo{but THIS
is inside a foo block}.
bar{
Blocks can be multi-line
and baz{nested}
}
What's the simplest way to do this? I've already written 2 working implementations, but they are overly complex. I tried full-text regex matching, and streaming char-by-char analysis.
I have to teach the workings of it to people, so simplicity is paramount. I don't want to introduce a dependency on Lex/Yacc or Flex/Bison (or PEGjs/Jison; actually, this is JavaScript).
The good choices probably boil down as follows:
Given your constraints, it's going to be recursive descent. That's a fine way to go even without constraints.
You can either parse char-by-char (traditional) or write a lexical layer that uses the local string library to scan for { and }. Either way, you might want to return three terminal symbols plus EOF: BLOCK_OF_TEXT, LEFT_BRACE, and RIGHT_BRACE.
char c;

boolean ParseNestedBlocks(InputStream i)
{  if ParseStreamContent(i)
   then { if c=="}" then return false   // stray "}" with no matching "{"
          else return true
        }
   else return false;
}

boolean ParseStreamContent(InputStream i)
{  loop:
      c = GetCharacter(i);
      if c=="}" then return true;
      if c==EOF then return true;
      if c=="{"
         { if ParseStreamContent(i)
              { if c!="}" then return false; }   // nested block never closed
           else return false;
         }
      goto loop
}
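If you want that as actual JavaScript, one possible direct transcription (a sketch reading from a string rather than an InputStream; the helper names are mine):

// Returns true when every "{" has a matching "}" and vice versa; text outside
// and inside blocks is simply skipped over.
function parseNestedBlocks(text) {
    let pos = 0, c = null;
    const next = () => (c = pos < text.length ? text[pos++] : null);

    function parseStreamContent() {
        for (;;) {
            next();
            if (c === "}" || c === null) return true;  // caller decides which is OK
            if (c === "{") {
                if (!parseStreamContent()) return false;
                if (c !== "}") return false;           // nested block never closed
            }
        }
    }

    return parseStreamContent() && c !== "}";          // stray "}" at top level
}

parseNestedBlocks("bar{ multi-line and baz{nested} }");  // true
parseNestedBlocks("foo{unclosed");                       // false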
Recently, I've been using parser combinators for some projects in pure JavaScript. I pulled out the code into a separate project; you can find it here. This approach is similar to the recursive descent parsers that @DigitalRoss suggested, but with a clearer split between code that's specific to your parser and general parser-bookkeeping code.
A parser for your needs (if I understood your requirements correctly) would look something like this:
var open = literal("{"), // matches only '{'
close = literal("}"), // matches only '}'
normalChar = not1(alt(open, close)); // matches any char but '{' and '}'
var form = new Parser(function() {}); // forward declaration for mutual recursion
var block = node('block',
['open', open ],
['body', many0(form)],
['close', close ]);
form.parse = alt(normalChar, block).parse; // set 'form' to its actual value
var parser = many0(form);
and you'd use it like this:
// assuming 'parser' is the parser
var parseResult = parser.parse("abc{def{ghi{}oop}javascript}is great");
The parse result is a syntax tree.
In addition to backtracking, the library also helps you produce nice error messages and thread user state between parser calls. I've found the latter two very useful for generating brace error messages, reporting both the problem and the location of the offending brace tokens when: 1) there's an open brace but no close; 2) there are mismatched brace types, e.g. (...] or {...); 3) there's a close brace without a matching open.