I have to parse parts of a file where I have expressions like:
garbage garbage garbage
BEGIN <something> END
garbage garbage...
Here, I want to parse everything between BEGIN and END, keeping the garbage aside.
I tried to write a parser which has rules like:
rule : BEGIN expr END;
expr : ... ;
which correctly parses my expression if it's the only thing in the file. Sadly, when I invoke the parser upon meeting a "BEGIN" in my file, it correctly parses the expression, but then tries to fetch another token after the "END".
I have read the part about fuzzy grammars in the ANTLR4 book, but this is not what I want, because the result of the parsing affects the rest of the file (basically, the parse produces a set of substitutions to apply to the following text).
What I'm looking for is a way to tell the parser to stop after the "END" keyword. I have tried to override the TokenStream to produce a Token.EOF when END is met, with this modified rule set:
rule : BEGIN expr EOF;
expr : ... ;
with code like:
public Token LT(int k) {
    Token token = super.LT(k);
    if (token.getType() == MyParser.END) {
        token = new CommonToken(Token.EOF, "");
    }
    return token;
}
but in this case, the stream is closed, and I can no longer use it for the rest of the file...
You could make a special mode in your lexer to consume the garbage as a single GARBAGE token. In the following example code, I have GarbageMode as the separate mode, which requires you to explicitly call lexer.setMode(GarbageMode) after creating a new instance of the lexer. The alternative is placing the GARBAGE and GarbageMode_BEGIN rules in the default mode and moving the remaining rules from the default mode into a new mode, e.g. MainMode.
BEGIN
: 'BEGIN'
;
END
: 'END' -> mode(GarbageMode)
;
mode GarbageMode;
GARBAGE
: .+? (BEGIN | EOF) -> mode(DEFAULT_MODE)
;
GarbageMode_BEGIN
: BEGIN -> type(BEGIN), mode(DEFAULT_MODE)
;
The key to making the above lexer work is overriding the Lexer.emit method to reset the input stream position prior to creating a GARBAGE token. An example of this is available in PositionAdjustingLexer.g4, with the corresponding unit test testPositionAdjustingLexer(). In your case, you'll simply remove the last 5 characters from the GARBAGE token if the text of the token ends with BEGIN.
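The trimming step itself is plain string manipulation. Here is a minimal sketch of that idea (the class and method names are hypothetical; in the real emit override you would also rewind the input stream by the same number of characters, as PositionAdjustingLexer.g4 demonstrates):

```java
public class GarbageTrim {
    // Hypothetical helper: if the matched garbage run ends with the
    // BEGIN keyword, drop those 5 characters so the next token can
    // start at the 'B' of BEGIN.
    public static String trimTrailingBegin(String garbage) {
        return garbage.endsWith("BEGIN")
            ? garbage.substring(0, garbage.length() - "BEGIN".length())
            : garbage;
    }

    public static void main(String[] args) {
        System.out.println(trimTrailingBegin("noise noiseBEGIN")); // noise noise
        System.out.println(trimTrailingBegin("only noise"));       // only noise
    }
}
```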
Related
I have an InputStream that contains repeating chunks like this:
fld1:val1
fld2:val2
[A B C D]
[E F]
fld1:val3
fld2:val4
[M N]
[Q S T Y]
fld1:val5
...
I wish to construct a solution where I can parse the fld:val block, skip the blank line separator, then parse the "listy" part, then stop parsing at the next blank line and reset the parser on the same open stream to process the next chunk. I was thinking I might be able to do this in my override of the base listener class's exitListy callback, by getting access to the parser and calling reset(). Ideally, this would end the call chain to ParseTree t = parser.parse() and let control return to the code immediately following parse(). I experimented with this and, somewhat predictably, got a null pointer exception here: org.antlr.v4.runtime.Parser.exitRule(Parser.java:639). I cannot change the format of the input stream, like inserting snip-here markers or anything like that.
(Completely new answer based on comment)
Listeners operate on ParseTrees returned once a parse completes. In your case, it appears you'll be listening on an essentially unending stream and want data back periodically.
I'd highly recommend "The Definitive ANTLR 4 Reference" from Pragmatic Programmers.
There are two very pertinent sections:
"Making Things Happen During the Parse"
"Unbuffered Character and Token Streams"
For your grammar, try something akin to the following "rough draft" (this may not be reporting back exactly when you want, but hopefully it gives you the idea to work with):
grammar Streaming;
@parser::members {
java.util.function.Consumer<MyData> consumer;
MyData myData = new MyData();
public StreamingParser(TokenStream input, java.util.function.Consumer<MyData> consumer) {
this(input);
this.consumer = consumer;
}
}
stream: (fldLine emptyLine listLine emptyLine)* EOF;
fldLine:
fld = ITEM COLON val = ITEM EOL {
// add data to MyDataObject
};
listLine:
O_BRACKET (items+=ITEM)* C_BRACKET {
// add data to MyDataObject
};
emptyLine:
EOL {
consumer.accept(myData);
// reset myData
};
O_BRACKET: '[';
C_BRACKET: ']';
EOL: '\n';
COLON: ':';
ITEM: [a-zA-Z][a-zA-Z0-9]*;
SPACE: ' ' -> skip;
This takes advantage of embedded actions that are described in the first section.
Then the second section describes how to use Unbuffered streams.
Something like this (untested; much lifted directly from the referenced book)
CharStream input = new UnbufferedCharStream(<your stream>);
StreamingLexer lex = new StreamingLexer(input);
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
StreamingParser parser = new StreamingParser(tokens,
// This lambda will handle data reported back when a blank line is encountered
myData -> handle(myData));
// You just want ANTLR reporting back periodically
// not building a giant parse tree
parser.setBuildParseTree(false);
parser.stream(); // won't return until you shut down the input stream
I have a grammar file written in antlr2 syntax and need help understanding how to rewrite some of the parser rules in antlr4 syntax. I know antlr4 eliminated the need for building an AST, so I'm not sure what to do with the rules that are AST action translations. ANTLR Tree Construction explains some of the syntax and how to use the # construct, but I'm still unsure how to read these rules and rewrite them.
temp_root :
temp { #temp_root = #([ROOT, "root"], #temp_root); } EOF;
temp :
c:temp_content
{ #temp = #(#([FUNCTION_CALL, "template"], #template), c);
reparent((MyAST)#temp, (MyAST)#c); };
temp_content :
(foo | bar);
foo :
{
StringBuilder result = new StringBuilder("");
}
: (c:FOO! { result.append(c.getText()); } )+
{ #foo = #([TEMPLATE_STRING_LITERAL, result.toString()], #foo); };
bar :
BEGIN_BAR! expr END_BAR!
exception
catch [Exception x] {
bar_AST = handleException(x);
};
You cannot manipulate the produced parse tree (at least not with grammar code), so simply remove all the tree-rewriting stuff (you may have to adjust consumer code if it relies on a specific tree structure). Also remove the exclamation marks (which denote a token that should not appear in the AST). The c:FOO part is a surprise; I can't remember ever having seen it, but judging from the following action code I guess it's a variable assignment and should be rewritten as c=FOO.
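Applying those points, a rough ANTLR4 skeleton of the rules might look like the following. This is a sketch, not a drop-in replacement: the token names (FOO, BEGIN_BAR, END_BAR) and the expr rule are assumed to carry over from the original grammar, and string building like foo's result would move into a listener or visitor. Note that ANTLR4 still supports a per-rule catch clause for the exception handling in bar.

```antlr
temp_root : temp EOF ;

temp : temp_content ;

temp_content : foo | bar ;

foo : FOO+ ;

bar : BEGIN_BAR expr END_BAR ;
catch[RecognitionException e] { /* e.g. call your handleException(e) here */ }
```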
In a small DSL, I'm parsing macro definitions, similarly to #define C pre-processor directives (here a simplistic example):
_def mymacro(a,b) = a + b / a
When the following call is encountered by the parser
c = mymacro(pow(10,2),3)
it is expanded to
c = pow(10,2) + 3 / pow(10,2)
My current approach is:
wrap the parser in a State monad
when parsing macro definitions, store them in the state, with their body unparsed (parse it as a string)
when parsing a macro call, find the definition in the state, replace the arguments in the body text, replace the call with this body and resume the parsing.
Some code from the last step:
macrocallStmt
= do -- capture starting position and content of old input before macro call
oldInput <- getInput
oldPos <- getPosition
-- parse the call
ret <- identifier
symbolCS "="
i <- identifier
args <- parens $ commaSep anyExprStr
-- expand the macro call
us <- get
let inlinedCall = replaceMacroArgs i args ret us
-- set up new input with macro call expanded
remainder <- getInput
let newInput = T.append inlinedCall (T.cons '\n' remainder)
setPosition oldPos
setInput newInput
-- update the expanded input script
modify (updateExpandedInput oldInput newInput)
anyExprStr = fmap praShow expression <|> fmap praShow algexpr
This approach does the job decently. However, it has a number of drawbacks.
Parsing multiple times
Any valid DSL expression can be an argument of the macro call. Therefore, even though I only need their textual representation (to be replaced into the macro body), I need to parse them and then convert them back to strings; simply looking for the next comma wouldn't work. The complete, customised body is then parsed again, so in practice macro arguments get parsed twice (and also show-ed, which has its cost). Moreover, each call requires a new parse of the (almost identical) body. The reason to keep the body unparsed in memory is to allow maximum flexibility: in the body, even DSL keywords could be constructed out of the macro arguments.
Error handling
Because the expanded body is inserted in front of the unconsumed input (replacing the call), the initial and final input can be quite different. In the event of a parse error, the position where the error occurred in the expanded input is available. However, when processing the error, I only have the original, not expanded, input. So the error position won't match.
That is why, in the code snippet above, I use the state to save the expanded input, so that it is available when the parser exits with an error.
This works well, but I noticed that it becomes quite costly, with new Text arrays (the input stream is Text) being allocated for the whole stream at every expansion. Perhaps keeping the expanded input in the state as String, rather than Text, would be cheaper in this case, i.e. when a middle part needs to be replaced?
The reasons for this question are:
I would appreciate suggestions / comments on the two issues described above
Can anyone suggest a better approach altogether?
So my rule is
/* Addition and subtraction have the lowest precedence. */
additionExp returns [double value]
: m1=multiplyExp {$value = $m1.value;}
( op=AddOp m2=multiplyExp )* {
if($op != null){ // test if matched
if($op.text == "+" ){
$value += $m2.value;
}else{
$value -= $m2.value;
}
}
}
;
AddOp : '+' | '-' ;
My test is 3 + 4, but op.text always returns NULL and never a char.
Does anyone know how I can test for the value of AddOp?
In the example from ANTLR4 Actions and Attributes it should work:
stat: ID '=' INT ';'
{
if ( !$block::symbols.contains($ID.text) ) {
System.err.println("undefined variable: "+$ID.text);
}
}
| block
;
Are you sure $op.text is always null? Your comparison appears to check for $op.text=="+" rather than checking for null.
I always start these answers with a suggestion that you migrate all of your action code to listeners and/or visitors when using ANTLR 4. It will clean up your grammar and greatly simplify long-term maintenance of your code.
This is probably the primary problem here: Comparing String objects in Java should be performed using equals: "+".equals($op.text). Notice that I used this ordering to guarantee that you never get a NullPointerException, even if $op.text is null.
I recommend removing the op= label and referencing $AddOp instead.
When you switch to using listeners and visitors, removing the explicit label will marginally reduce the size of the parse tree.
(Only relevant to advanced users) In some edge cases involving syntax errors, labels may not be assigned while the object still exists in the parse tree. In particular, this can happen when a label is assigned to a rule reference (your op label is assigned to a token reference), and an error appears within the labeled rule. If you reference the context object via the automatically generated methods in the listener/visitor, the instances will be available even when the labels weren't assigned, improving your ability to report details of some errors.
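Putting these points together, a sketch of the rule (not tested against your full grammar) with the op= label removed, an equals-based comparison, and the action moved inside the loop so that each operator is applied as it is matched, rather than only the last one:

```antlr
additionExp returns [double value]
    : m1=multiplyExp {$value = $m1.value;}
      ( AddOp m2=multiplyExp
        {
          if ("+".equals($AddOp.text)) { $value += $m2.value; }
          else                         { $value -= $m2.value; }
        }
      )*
    ;
```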
I want to write a simple parser for a nested block syntax, just hierarchical plain-text. For example:
Some regular text.
This is outputted as-is, foo{but THIS
is inside a foo block}.
bar{
Blocks can be multi-line
and baz{nested}
}
What's the simplest way to do this? I've already written 2 working implementations, but they are overly complex. I tried full-text regex matching, and streaming char-by-char analysis.
I have to teach the workings of it to people, so simplicity is paramount. I don't want to introduce a dependency on Lex/Yacc or Flex/Bison (or PEGjs/Jison, actually, since this is JavaScript).
The good choices probably boil down as follows:
Given your constraints, it's going to be recursive descent. That's a fine way to go even without constraints.
You can either parse char-by-char (traditional) or write a lexical layer that uses the local string library to scan for { and }. Either way, you might want to return three terminal symbols plus EOF: BLOCK_OF_TEXT, LEFT_BRACE, and RIGHT_BRACE.
// JavaScript version of the sketch; assumes getCharacter(i) returns the
// next character of the input, or the sentinel EOF at end of input.
let c;

function parseNestedBlocks(i) {
  if (parseStreamContent(i)) {
    // a '}' surviving to the top level is unbalanced
    return c !== "}";
  }
  return false;
}

function parseStreamContent(i) {
  for (;;) {
    c = getCharacter(i);
    if (c === "}") return true;
    if (c === EOF) return true;
    if (c === "{") {
      if (parseStreamContent(i)) {
        if (c !== "}") return false; // hit EOF inside an open block
      } else return false;
    }
  }
}
Recently, I've been using parser combinators for some projects in pure JavaScript. I pulled out the code into a separate project; you can find it here. This approach is similar to the recursive descent parsers that @DigitalRoss suggested, but with a clearer split between code that's specific to your parser and general parser-bookkeeping code.
A parser for your needs (if I understood your requirements correctly) would look something like this:
var open = literal("{"), // matches only '{'
close = literal("}"), // matches only '}'
normalChar = not1(alt(open, close)); // matches any char but '{' and '}'
var form = new Parser(function() {}); // forward declaration for mutual recursion
var block = node('block',
['open', open ],
['body', many0(form)],
['close', close ]);
form.parse = alt(normalChar, block).parse; // set 'form' to its actual value
var parser = many0(form);
and you'd use it like this:
// assuming 'parser' is the parser
var parseResult = parser.parse("abc{def{ghi{}oop}javascript}is great");
The parse result is a syntax tree.
In addition to backtracking, the library also helps you produce nice error messages and threads user state between parser calls. The latter two I've found very useful for generating brace error messages, reporting both the problem and the location of the offending brace tokens when: 1) there's an open brace but no close; 2) there's mismatched brace types -- i.e. (...] or {...); 3) a close brace without a matching open.