Rewriting AST action translations for ANTLR4

I have a grammar file written in ANTLR2 syntax and need help understanding how to rewrite some of the parser rules in ANTLR4 syntax. I know ANTLR4 eliminated the need for building ASTs, so I'm not sure what to do with the rules that contain AST action translations. ANTLR Tree Construction explains some of the syntax and how to use the # construct, but I'm still unsure how to read these rules and rewrite them.
temp_root :
    temp { #temp_root = #([ROOT, "root"], #temp_root); } EOF;

temp :
    c:temp_content
    { #temp = #(#([FUNCTION_CALL, "template"], #temp), c);
      reparent((MyAST)#temp, (MyAST)#c); };

temp_content :
    (foo | bar);

foo
{
    StringBuilder result = new StringBuilder("");
}
    : (c:FOO! { result.append(c.getText()); })+
      { #foo = #([TEMPLATE_STRING_LITERAL, result.toString()], #foo); };

bar :
    BEGIN_BAR! expr END_BAR!;
    exception
    catch [Exception x] {
        bar_AST = handleException(x);
    }

You cannot manipulate the produced parse tree (at least not with grammar code), so simply remove all the tree-rewriting actions (you may have to adjust consumer code if it relies on a specific tree structure). Also remove the exclamation marks, which denote a token that should not appear in the AST. A surprise is the c:FOO part; I can't remember ever having seen it, but judging from the following action code I guess it's a variable assignment (a label) and should be rewritten as c=FOO.
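Putting that advice together, a bare ANTLR4 version of these rules might look something like the sketch below. This is only an illustration: it assumes FOO, BEGIN_BAR, END_BAR and the other token names are defined as lexer rules elsewhere, and that expr exists as a parser rule.

temp_root : temp EOF ;

temp : temp_content ;

temp_content : foo | bar ;

foo : FOO+ ;

bar : BEGIN_BAR expr END_BAR ;

The string building from foo and the exception handling from bar have no grammar-level equivalent in ANTLR4; they would typically move into a listener or visitor that walks the resulting parse tree, and into an error listener or a custom error strategy, respectively.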


antlr4 grammar to iteratively parse repeating things from a single InputStream

I have an InputStream that contains repeating chunks like this:
fld1:val1
fld2:val2
[A B C D]
[E F]
fld1:val3
fld2:val4
[M N]
[Q S T Y]
fld1:val5
...
I wish to construct a solution where I can parse the fld:val block, skip the blank line separator, then parse the "listy" part, then stop parsing at the next blank line and reset the parser on the same open stream to process the next chunk. I was thinking I might be able to do this in my override of the base listener class's exitListy callback, by getting access to the parser and calling reset(). Ideally, this would end the call chain to ParseTree t = parser.parse() and let control return to the code immediately following parse(). I experimented with this and, somewhat predictably, got a null pointer exception here: org.antlr.v4.runtime.Parser.exitRule(Parser.java:639). I cannot change the format of the input stream, like inserting snip-here markers or anything like that.
(Completely new answer based on comment)
Listeners operate on ParseTrees returned once a parse completes. In your case, it appears, you'll be listening on an essentially unending stream and want data back periodically.
I'd highly recommend "The Definitive ANTLR 4 Reference" from Pragmatic Programmers.
There are two very pertinent sections:
"Making Things Happen During the Parse"
"Unbuffered Character and Token Streams"
For your grammar, try something akin to the following "rough draft" (it may not report back exactly when you want, but it should hopefully give you the idea to work with):
grammar Streaming;

@parser::members {
    java.util.function.Consumer<MyData> consumer;
    MyData myData = new MyData();

    public StreamingParser(TokenStream input, java.util.function.Consumer<MyData> consumer) {
        this(input);
        this.consumer = consumer;
    }
}

// one or more field lines, a blank line, one or more list lines, a blank line, repeated
stream : (fldLine+ emptyLine listLine+ emptyLine)* EOF ;

fldLine : fld=ITEM COLON val=ITEM EOL {
    // add data to the MyData object
};

listLine : O_BRACKET (items+=ITEM)* C_BRACKET EOL {
    // add data to the MyData object
};

emptyLine : EOL {
    consumer.accept(myData);
    // reset myData
};

O_BRACKET : '[' ;
C_BRACKET : ']' ;
EOL       : '\n' ;
COLON     : ':' ;
ITEM      : [a-zA-Z][a-zA-Z0-9]* ;
SPACE     : ' ' -> skip ;
This takes advantage of embedded actions that are described in the first section.
Then the second section describes how to use Unbuffered streams.
Something like this (untested; much lifted directly from the referenced book)
CharStream input = new UnbufferedCharStream(<your stream>);
StreamingLexer lex = new StreamingLexer(input);
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
StreamingParser parser = new StreamingParser(tokens,
// This lambda will handle data reported back when a blank line is encountered
myData -> handle(myData));
// You just want ANTLR reporting back periodically
// not building a giant parse tree
parser.setBuildParseTree(false);
parser.stream(); // won't return until you shut down the input stream
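The draft assumes a MyData holder for one chunk's worth of data. The class below is purely hypothetical (the field names are mine, not from any library), but it shows the shape of what the embedded actions would fill in:

// Hypothetical holder for one chunk of the stream.
public class MyData {
    // fld1 -> val1, fld2 -> val2, ...
    public final java.util.Map<String, String> fields = new java.util.HashMap<>();
    // one inner list per [bracketed] line
    public final java.util.List<java.util.List<String>> lists = new java.util.ArrayList<>();
}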

Bison simple postfix calculator using union with struct

I'm making a simple calculator that prints postfix, to learn Bison. I could make the postfix part work, but now I need to do assignments to variables (a-z); e.g. a=3+2; should print a32+=;. I'm trying to modify my working postfix code to be able to read a char too.
If I understand correctly, to be able to put different types in $$ I need a union, and to make the nodes I should use a struct, because in my case 'expr' can be an int or a char.
This is my parser:
%code requires{
struct nPtr{
char *val;
int num;
};
}
%union {
int iValue;
char sIndex;
struct nPtr *e;
};
%token PLUS MINUS STAR LPAREN RPAREN NEWLINE DIV ID NUMBER POW EQL
%type <iValue> NUMBER
%type <sIndex> ID
%type <e> expr line
%left PLUS MINUS
%left STAR DIV
%left POW
%left EQL
%%
line : /* empty */
|line expr NEWLINE { printf("=\n%d\n", $2.num); }
expr : LPAREN expr RPAREN { $$.num = $2.num; }
| expr PLUS expr { $$.num = $1.num + $3.num; printf("+"); }
| expr MINUS expr { $$.num = $1.num - $3.num; printf("-"); }
| expr STAR expr { $$.num = $1.num * $3.num; printf("*"); }
| expr DIV expr { $$.num = $1.num / $3.num; printf("/");}
| expr POW expr { $$.num = pow($1.num, $3.num); printf("**");}
| NUMBER { $$.num = $1.num; printf("%d", yylval); }
| ID EQL expr { printf("%c", yylval); }
;
%%
I have this in the lex file to handle the "=" and variables:
"=" { return EQL; }
[a-z] { yylval.sIndex = strdup(yytext); return ID; }
I get an error:
warning: empty rule for typed nonterminal, and no action
The only answer I found here about that is this:
Bison warning: Empty rule for typed nonterminal
It says to just remove the /* empty */ part in:
line: /* empty */
| line expr NEWLINE { printf("=\n%d\n", $2.num); }
When I do that I get 3 new warnings:
warning: 3 nonterminals useless in grammar
warning: 10 rules useless in grammar
fatal error: start symbol line does not derive any sentence
I googled and got some solutions that just gave me other problems. For example, when I change:
line:line expr NEWLINE { printf("=\n%d\n", $2.num); }
to
line:expr NEWLINE { printf("=\n%d\n", $2.num); }
Bison accepts that, but when I try to build the code in Visual Studio I get a lot of errors like:
left of '.e' must have struct/union type
'pow': too few arguments for call
'=': cannot convert from 'int' to 'YYSTYPE'
That's as far as I got. I can't find a simple example similar to my needs. I just want to make 'expr' able to read a char and print it. If someone could check my code and recommend some changes, it would be really, really appreciated.
It is important to actually understand what you are asking Bison to do, and what it is telling you when it suggests that your instructions are wrong.
Applying a fix from someone else's problem will only work if they made the same mistake as you did. Just making random changes based on Google searching is neither a very structured way to debug nor to learn a new tool.
Now, what you said was:
%type <e> expr line
Which means that expr and line both produce a value whose union tag is e. (Tag e is a pointer to a struct nPtr. I doubt that's what you wanted either, but we'll start with the Bison warnings.)
If a non-terminal produces a value, it produces a value in every production. To produce a value, the semantic rule needs to assign a value (of the correct tag-type) to $$. For convenience, if there is no semantic action and $1 has the correct tag-type, then Bison will provide the default rule $$ = $1. (That sentence is not quite correct, but it's a useful approximation.) Many people think that you should not actually rely on this default, but I think it's OK as long as you know that it is happening and have verified that the preconditions hold.
In your case, you have two productions for line. Neither of them assigns a value to $$. This might suggest that you didn't really intend for line to even have a semantic value, and if we look more closely, we'll see that you never attempt to use a semantic value for any instance of the non-terminal line. Bison does not, as far as I know, attempt such a detailed code examination, but it is capable of noting that the production:
line : /* empty */
has no semantic action, and
does not have any symbols on the right-hand side, so no default action can be constructed.
Consequently, it warns you that there is some circumstance in which line will not have its value set. Think of this as the same warning as your C compiler might give you if it notices that a variable you use has never had a value assigned to it.
In this case, as I said, Bison has not noticed that you never use value of line. It just assumes that you wouldn't have declared the type of the value if you don't intend to somehow provide a value. You did declare the type, and you never provide a value. So your instructions to Bison were not clear, right?
Now, the solution is not to remove the production. There is no problem with the production, and if you remove it then line is left with only one production:
line : line expr NEWLINE
That's a recursive production, and the recursion can only terminate if there is some other production for line which does not recurse. That would be the production line: %empty, which you just deleted. So now Bison can see that the non-terminal line is useless: it can never derive a string consisting only of terminals.
But that's a big problem, because that's the non-terminal your grammar is supposed to recognize. If line can't derive any string, then your grammar cannot recognize anything. Also, because line is useless, and expr is only used by a production in line (other than recursive uses), expr itself can never actually be used. So it, too, becomes useless. (You can think of this as the equivalent of your C compiler complaining that you have unreachable code.)
So then you attempt to fix the problem by making (the only remaining) rule for line non-recursive. Now you have:
line : expr NEWLINE
That's fine, and it makes line useful again. But it does not recognise the language you originally set out to recognise, because it will now only recognise a single line. (That is, an expr followed by a NEWLINE, as the production says.)
I hope you have now figured out that your original problem was not the grammar, which was always correct, but rather the fact that you told Bison that line produced a value when you did not actually do anything to provide that value. In other words, the solution would have been to not tell Bison that line produces a value. (You could, instead, provide a value for line in the semantic rule associated with every production for line, but why would you do that if you have no intention of ever using that value?)
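Concretely, the declaration would shrink to something like this (keeping tag e for expr for the moment; the next part of the answer questions that choice too):

%type <e> expr    /* 'line' is gone from the list: it produces no value */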
Now, let's go back to the actual type declarations, to see why the C compiler complains. We can start with a simple one:
expr : NUMBER { $$.num = $1.num; }
$$ refers to the expr itself, which is declared as having type tag e, which is a struct nPtr *. The star is important: it indicates that the semantic value is a pointer. In order to get at a field in a struct from a pointer to the struct, you need to use the -> operator, just like in this simple C example:
struct nPtr the_value; /* The actual struct */
struct nPtr *p = &the_value; /* A pointer to the struct */
p.num = 3; /* Error: p is not a struct */
p->num = 3; /* Fine. p points to a struct */
/* which has a field called num */
(*p).num = 3; /* Also fine, means exactly the same thing. */
/* But don't write this. p->num is much */
/* more readable. */
It's also worth noting that in the C example, we had to make sure that p was initialised to the address of some actual memory. If we hadn't done that and we attempted to assign something to p->num, then we would be trying to poke a value into an undefined address. If you're lucky, that would segfault. Otherwise, it might overwrite anything in your executable.
If any of that was not obvious, please go back to your C textbook and make sure that you understand how C pointers work, before you try to use them.
So, that was the $$.num part. The actual statement was
$$.num = $1.num;
So now let's look at the right-hand side. $1 refers to a NUMBER, which has tag type iValue, which is an int. An int is not a struct and has no named fields. $$.num = $1.num is going to be processed into something like:
int i = <some value>;
$$.num = i.num;
which is totally meaningless. Here, the compiler will again complain that iValue is not a struct, but for a different reason: on the left-hand side, e was a pointer to a struct (which is not a struct); on the right-hand side iValue is an integer (which is also not a struct).
I hope that gives you some ideas on what you need to do. If in doubt, there are a variety of fully-worked examples in the Bison manual.
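To make that concrete, here is a minimal sketch of how the pieces could fit together without any struct at all: expr carries a plain int (tag iValue), ID carries a char (tag sIndex), and line carries no value. This is one way to do it, not the only one; the mid-rule action prints the variable name before the expression so that a=3+2 comes out as a32+=.

%{
#include <stdio.h>
#include <math.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "error: %s\n", s); }
%}

%union {
    int iValue;    /* numeric expression values */
    char sIndex;   /* a variable name, 'a'..'z' */
}

%token PLUS MINUS STAR DIV POW LPAREN RPAREN NEWLINE EQL
%token <iValue> NUMBER
%token <sIndex> ID
%type <iValue> expr    /* note: no type for 'line' */

%right EQL
%left PLUS MINUS
%left STAR DIV
%right POW

%%

line : /* empty */
     | line expr NEWLINE        { printf("\nvalue: %d\n", $2); }
     ;

expr : LPAREN expr RPAREN       { $$ = $2; }
     | expr PLUS expr           { $$ = $1 + $3; printf("+"); }
     | expr MINUS expr          { $$ = $1 - $3; printf("-"); }
     | expr STAR expr           { $$ = $1 * $3; printf("*"); }
     | expr DIV expr            { $$ = $1 / $3; printf("/"); }
     | expr POW expr            { $$ = (int)pow($1, $3); printf("**"); }
     | NUMBER                   { $$ = $1; printf("%d", $1); }
     | ID { printf("%c", $1); } /* print the target first */
       EQL expr                 { $$ = $4; printf("="); }
     ;

%%

In the scanner, the corresponding rule stores a single char rather than a strdup'd string (sIndex is a char, so the original yylval.sIndex = strdup(yytext) was also a type error):

[a-z]  { yylval.sIndex = yytext[0]; return ID; }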

Bison/Flex, reduce/reduce, identifier in different production

I am writing a parser in Bison/Flex.
This is part of my code:
I want to implement the assignment production, so that the identifier can be either a boolean_expr or an expr; its type will be checked through a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, I get a reduce/reduce conflict if I include identifier in both term and boolean_expr. Is there any way to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would presumably be something like:
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the grammar shown, that production is ambiguous, because in:
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look up a scanned identifier in the symbol table and use that information to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
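In the Flex scanner, the identifier rule might then look roughly like the sketch below, where Symbol, lookup, and the TY_BOOL tag are hypothetical symbol-table helpers, not anything Flex provides:

[a-zA-Z_][a-zA-Z0-9_]*   {
        /* hypothetical symbol-table lookup */
        Symbol *sym = lookup(yytext);
        if (sym == NULL)          return UNDEFINED_VARIABLE;
        if (sym->type == TY_BOOL) return BOOLEAN_VARIABLE;
        return INTEGER_VARIABLE;
    }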
That will work, but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because now the semantics is not only mixed up with the syntax, it has even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
X(int n = 0) : data_is_available_(n) {}
operator bool() const { return data_is_available_; }
// ...
private:
bool data_is_available_;
// ...
};
X my_x_object();
// ...
if (!my_x_object) {
// This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr can now be any expression:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: sum | sum '<' sum ;
sum: product | sum '+' product ;
product: term | product '*' term ;
term: identifier
    | constant
    | '(' expr ')'
    ;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
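For instance, an action in the sum rule might look like this, where NewBinaryNode is a hypothetical AST-node constructor:

sum : product           { $$ = $1; }
    | sum '+' product   { $$ = NewBinaryNode('+', $1, $3); }
    ;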
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser.) In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}

Is there a way to put Haskell code before the token rules in an Alex source file?

Consider the following, working, Alex source file:
{
module Main (main) where
}
%wrapper "basic"
tokens :-
$white ;
. { rule "!"}
{
type Token = String
rule tok = \s -> tok
main = do
s <- getContents
mapM_ print (alexScanTokens s)
}
I would love to put my helper code closer to the top of the file, before all the rules. I tried doing this:
{
module Main (main) where
}
%wrapper "basic"
{
type Token = String
rule tok = \s -> tok
}
tokens :-
$white ;
. { rule "!"}
{
main = do
s <- getContents
mapM_ print (alexScanTokens s)
}
but got the following error:
test.x:11:2: parse error
(line 11 is the closing curly brace after my helper code)
Is there a way to move my helper code closer to the top of the file?
I also tried putting the helper code in the first block, together with the module Main declaration, but that didn't work because the %wrapper bit generates some import statements that need to appear right after that first block in the generated file.
Quoting from the documentation of Alex:
"The overall layout of an Alex file is:
alex := [ @code ] [ wrapper ] { macrodef } @id ':-' { rule } [ @code ]
At the top of the file, the code fragment is normally used to declare the module name and some imports, and that is all it should do: don't declare any functions or types in the top code fragment, because Alex may need to inject some imports of its own into the generated lexer code, and it does this by adding them directly after this code fragment in the output file."
So, what you are trying to do violates the syntax. It seems that the sole place you can put the definition of the Token datatype is in the final code block.
However, it is possible to have this code in a separate module, if you like, and import it at the top code block.
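For example (a sketch: Tokens is a helper module you would write yourself, not something Alex provides):

-- Tokens.hs
module Tokens (Token, rule) where

type Token = String

rule :: Token -> String -> Token
rule tok = \_ -> tok

and then the Alex source file imports it in the top code block:

{
module Main (main) where
import Tokens
}
%wrapper "basic"
tokens :-
$white ;
. { rule "!" }
{
main = do
  s <- getContents
  mapM_ print (alexScanTokens s)
}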

Simplest nested block parser

I want to write a simple parser for a nested block syntax, just hierarchical plain-text. For example:
Some regular text.
This is outputted as-is, foo{but THIS
is inside a foo block}.
bar{
Blocks can be multi-line
and baz{nested}
}
What's the simplest way to do this? I've already written two working implementations, but they are overly complex. I tried full-text regex matching and streaming char-by-char analysis.
I have to teach the workings of it to people, so simplicity is paramount. I don't want to introduce a dependency on Lex/Yacc (Flex/Bison), or on PEGjs/Jison (actually, this is JavaScript).
The good choices probably boil down as follows:
Given your constraints, it's going to be recursive descent. That's a fine way to go even without constraints.
You can either parse char-by-char (traditional) or write a lexical layer that uses the local string library to scan for { and }. Either way, you might want to return three terminal symbols plus EOF: BLOCK_OF_TEXT, LEFT_BRACE, and RIGHT_BRACE.
char c;

boolean ParseNestedBlocks(InputStream i)
{  if ParseStreamContent(i)
   then { if c=="}" then return false   // stray close brace at the top level
          else return true              // ended cleanly at EOF
        }
   else return false;
}

boolean ParseStreamContent(InputStream i)
{  loop:
      c = GetCharacter(i);
      if c=="}" then return true;
      if c==EOF then return true;
      if c=="{"
      {  if ParseStreamContent(i)
         { if c!="}" return false; }    // nested block never closed
         else return false;
      }
      goto loop
}
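In JavaScript (since that's the target language), a direct transcription of the pseudocode might look like this sketch, where the input is a plain string and EOF is modeled as null:

// Returns true iff every '{' in 'text' has a matching '}'.
function parseNestedBlocks(text) {
  var pos = 0, c = null;
  function next() { c = pos < text.length ? text[pos++] : null; }
  function parseStreamContent() {
    for (;;) {
      next();
      if (c === "}" || c === null) return true;  // end of this level
      if (c === "{") {
        if (!parseStreamContent()) return false; // error deeper down
        if (c !== "}") return false;             // block never closed
      }
    }
  }
  // accept only if we stopped at EOF, not at a stray '}'
  return parseStreamContent() && c !== "}";
}

For example, parseNestedBlocks("bar{ baz{nested} }") returns true, while parseNestedBlocks("foo}") returns false.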
Recently, I've been using parser combinators for some projects in pure JavaScript. I pulled the code out into a separate project; you can find it here. This approach is similar to the recursive-descent parsers that @DigitalRoss suggested, but with a clearer split between code that's specific to your parser and general parser-bookkeeping code.
A parser for your needs (if I understood your requirements correctly) would look something like this:
var open = literal("{"), // matches only '{'
close = literal("}"), // matches only '}'
normalChar = not1(alt(open, close)); // matches any char but '{' and '}'
var form = new Parser(function() {}); // forward declaration for mutual recursion
var block = node('block',
['open', open ],
['body', many0(form)],
['close', close ]);
form.parse = alt(normalChar, block).parse; // set 'form' to its actual value
var parser = many0(form);
and you'd use it like this:
// assuming 'parser' is the parser
var parseResult = parser.parse("abc{def{ghi{}oop}javascript}is great");
The parse result is a syntax tree.
In addition to backtracking, the library also helps you produce nice error messages and threads user state between parser calls. The latter two I've found very useful for generating brace error messages, reporting both the problem and the location of the offending brace tokens when: 1) there's an open brace but no close; 2) there are mismatched brace types, i.e. (...] or {...); 3) there's a close brace without a matching open.
