Antrl visitor on a grammar with compiler directive - antlr4

I try to get compiler directive in a verilog parser which give me the true file name/path and the true current line in the non-preprocessed file.
Verilog language needs a preprocessing pass I have, but during the visit I have to know the current file name (which can't change by the `include directive) and so the true current line in the non-preprocessed file .
The preprocessing part add the verilog directive `line which indicates the current file and line.
Then I send the preprocessed buffer to the antlr Lexer, parse and extract all verilog information with a visitor. I have to keep the verilog compiler `line directive in the verilog grammar description:
Preprocessing_line
: '`line ' Decimal_number String Decimal_number '\n' -> channel(2)
;
Now, I don't know how to get this dedicated channel information at any point in the visitor? The target language for this parser is Python3.

Given that the Preprocessing_line tokens may not have a reliable relation to the parse-tree tokens (different Verilog compilers can be a bit loose about where they inject the reference lines), the easiest solution is to create a temporary index prior to the visitor walk.
That is, after parsing the pre-processed Verilog source, do a quick pass over the entire token stream (BufferedTokenStream#getTokens), picking out the Preprocessing_line tokens, and building a current_line -> original_line index.
Then, in any visited context, examine the underlying token(s) (ParserRuleContext#getStart, #getStop, #getSourceInterval) to find their current_line (Token#getLine)

Related

Arbitrary lookaheads in PLY

I am trying to parse a config, which would translate to a structured form. This new form requires that comments within the original config be preserved. The parsing tool is PLY. I am running into an issue with my current approach which I will describe in detail below, with links to code as well. The config file is going to look contain multiple config blocks, each of which is going to be of the following format
<optional comments>
start_of_line request_stmts(one or more)
indent reply_stmts (zero or more)
include_stmts (type 3)(zero or more)
An example config file looks like this.
While I am able to partially parse the config file with the grammar below, I fail to accomodate comments which would exist within the block.
For example, a block like this raises syntax errors, and any comments in a block of config fail to parse.
<optional comments>
start_of_line request_stmts(type 1)(one or more)
indent reply_stmts (type 2)(one or more)
<comments>
include_stmts (type 3)(one or more)(optional)
The parser.out mentions one shift/reduce conflict which I think arises because once the reply_stmts are parsed, a comments section which follows could mark start of a new block or comments within the subblock. Current grammar parsing result for the example file
[['# test comment ', '# more of this', '# does this make sense'], 'DEFAULT', [['x', '=',
'y']], [['y', '=', '1']], ['# Transmode', '# maybe something else', '# comment'],
'/random/location/test.user']
As you might notice, the second config block complete misses the username, request_stmt, reply_stmt sections.
What I have tried
I have tried moving the comments section around in the grammar, by specifying it before specific blocks or in the statement grammar. In the code link pasted above, the comments section has been specified in the overall statement grammar. Both of these approaches fail to parse comments within a config block.
username : comments username
| username
include_stmt : comments includes
| includes
I have two main questions:
Is there a mistake I am making in the implementation/understanding of LR parsing, solving which I could achieve what I want to ?
Is there a better way to achieve the same goal than my current approach ? (PLY-fu, different parser, different grammar)
P.S Wasn't able to include the actual code in the question, mentioned in the comments
You are correct that the problem is that when the parser sees a comment, it cannot know whether the comment belongs to the same section or whether the previous section is finished. In the former case, the parser needs to shift the comment, while in the latter case it needs to reduce the configuration section.
Since there could be any number of comments, the necessary lookahead could be arbitrarily large, in which case LR parsing wouldn't be possible. But a simple trick can reduce the lookahead to two tokens: just combine consecutive comments into a single token.
Any LR(k) grammar has an equivalent LR(1) grammar. In effect, the LR(1) grammars works by delaying all decisions for k-1 tokens, accumulating these tokens into the parser state. That's a massive increase in grammar size, but it's usually possible to achieve the same effect in other ways, and that's certainly the case here.
The basic idea is that any comment is (temporarily) accumulated into a list of comments. When a non-comment token is encountered, this temporary list is attached to that token.
This can be done either in the lexical scanner or in the parser actions, depending on your inclinations.
Before attempting all that, you should make sure that retaining comments is really useful to your application. Comments are normally not relevant to the semantics of a program (or configuration file), and it would certainly be much simpler for the lexer to just drop comments into the bit-bucket. If your application will end up reformatting the input, then it will have to retain comments. But if it only needs to extract information from the configuration, putting a lot of effort into handling comments is hard to justify.

Converting ANTLR parse trees into string and then reverting it

I am new to ANTLR, and I am digging into it for a project. My work would require me to generate a parse tree from a source code file, convert the parse tree into a string that holds all the information about the parse tree in a somewhat "human-readable" form. Parts of this string (representing the parse tree) will then be modified, and the modified string will have to be converted to a changed source code.
I have found out that the .toStringTree(tree) method can be used in ANTLR to print out the tree in LISP format. Is there a better way to represent the parse tree as a string that holds all information?
Can the string-parse-tree be reverted back to the original source code (in the same language) using ANTLR? If no, are there any tools for this?
Can the string-parse-tree be reverted back to the original source code (in the same language) using ANTLR?
That string does not contain the token types, just the matched text. In other words: you cannot create a parse tree from the output of the ToStringTree. Besides, many ANTLR grammars have lexer rules that skip certain input (white spaces and line breaks, for example), so converting a parse tree back to the original input source is not always possible.
If no, are there any tools for this?
Without a doubt, I suggest you do a search on GitHub. But when you have the parse tree, it is trivial to create a custom tree structure and convert that to JSON.

ANTLR get first production

I'm using ANTLR4 and, in particular, the C grammar available in their repo (grammar). It seems that the grammar hasn't an initial rule, so I was wondering how it's possible to get it. In fact, once initialized the parser, I attach my listener, but I obtain syntax errors since I'm trying to parse two files with different code instructions:
int a;
int foo() { return 0; }
In my example I call the parser with "parser.primaryExpression();" which is the first production of the "g4" file. Is it possible to avoid to call the first production and get it automatically by ANTLR instead?
In addition to #GRosenberg's answer:
Also the rule enum (in the generated parser) contains entries for each rule in the order they appear in the grammar and the first rule has the value 0. However, just because it's the first rule in the grammar doesn't mean that it is the main entry point. Only the grammar author knows what the real entry is and sometimes you might even want to parse only with a subrule, which makes this decision even harder.
ANTLR provides no API to obtain the first rule. However, in the parser as generated, the field
public static final String[] ruleNames = ....;
lists the rulenames in the order of occurrence in the grammar. With reflection, you can access the method.
Beware. Nothing in the Antlr 'spec' defines this ordering. Simply has been true to date.

Antlr4 match whole input string or bust

I am new to Antlr4 and have been wracking my brain for some days now about a behaviour that I simply don't understand. I have the following combined grammar and expect it to fail and report an error, but it doesn't:
grammar MWE;
parse: cell EOF;
cell: WORD;
WORD: ('a'..'z')+;
If I feed it the input
a4
I expect it to not be able to parse it, because I want it to match the whole input string and not just a part of it, as signified by the EOF. But instead it reports no error (I listen for errors with a errorlistener implementing the IAntlrErrorListener interface) and gives me the following parse tree:
(parse (cell a) <EOF>)
Why is this?
The error recovery mechanism when input is reached which no lexer rule matches is to drop a character and continue with the next one. In your case, the lexer is dropping the 4 character, so your parser is seeing the equivalent of this input:
a
The solution is to instruct the lexer to create a token for the dropped character rather than ignore it, and pass that token on to the parser where an error will be reported. In the grammar, this rule takes the following form and is always added as the last rule in the grammar. If you have multiple lexer modes, a rule with this form should appear as the last rule in the default mode as well as the last rule in each extra mode.
ErrChar
: .
;

How to get the filename and line number of a particular JetBrains.ReSharper.Psi.IDeclaredElement?

I want to write a test framework extension for resharper. The docs for this are here: http://confluence.jetbrains.net/display/ReSharper/Test+Framework+Support
One aspect of this is indicating if a particular piece of code is part of a test. The piece of code is represented as a IDeclaredElement.
Is it possible to get the filename and line number of a piece of code represented by a particular IDeclaredElement?
Following up to the response below:
#Evgeny, thanks for the answer, I wonder if you can clarify one point for me.
Suppose the user has this test open in visual studio: https://github.com/fschwiet/DreamNJasmine/blob/master/NJasmine.Tests/SampleTest.cs
Suppose the user right clicks on line 48, the "player.Resume()" expression.
Will the IDeclaredElement tell me specifically they want to run at line 48? Or is it going to give me a IDeclaredElement corresponding to the entire class, and a filename/line number range for the entire class?
I should play with this myself, but I appreciate tapping into what you already know.
Yes.
The "IDeclaredElement" entity is the code symbol (class, method, variable, etc.). It could be loaded from assembly metadata, it could be declared in source code, it could come from source code implicitly.
You can use
var declarations = declaredElement.GetDeclarations()
to get all AST elements which declares it (this could return multiple declarations for partial class, for example)
Then, for any IDeclaration, you can use
var documentRange = declaration.GetDocumentRange()
if (documentRange.IsValid())
Console.WriteLine ("File: {0} Line:{1}",
DocumentManager.GetInstance(declaration.GetSolution()).GetProjectFile(documentRange.Document).Name,
documentRange.Document.GetCoordsByOffset(documentRange.TextRange.StartOffset).Line
);
By the way, which test framework extension are you developing?
Will the IDeclaredElement tell me specifically they want to run at
line 48?
Once more: IDeclaredElement has no positions in the file. Instead, it's declaration have them.
For every declaration (IDeclaration is a regular AST node) there is range in document which covers it. In my previous example, I used TextRange.StartOffset, though you can use TextRange.EndOffset.
If you need more prcise position in the file, please traverse AST tree, and check the coordinates in the document for specific expression/statement

Resources