I am new to ANTLR4. I followed the installation instructions on GitHub and ran the example successfully, so the installation appears to be OK. Next I downloaded a grammar file I wish to use, ran antlr4 on it, and compiled the resulting files with javac. Then I tried an example like this:
C:> grun GrammarName tokens examplefile
This runs for a couple of seconds and returns, but there is no output. I've tried using -tree and -ps, but I get nothing with either. If I supply a bad filename, I get a stream of file-not-found error messages, so it is doing something... but if I supply a random data file, I also get no response. Which suggests to me that my example file is not being seen as a valid example of the grammar in question. But why do I not get an error message?
In essence, my question is how do I get TestRig to supply some feedback about the example file I've supplied?
I've tried reading the manual pages on the antlr.org site but there's too much terminology I'm not familiar with yet.
If you supply tokens as the name of the start rule, that tells grun to not invoke the parser at all and only run the tokenizer. This is generally only useful in combination with the -tokens flag, which prints the tokens. Otherwise the only output you see would be possible tokenization errors.
The options -tree, -ps or -gui display the result of the parser. So if the parser isn't executed, they do nothing at all.
If you want to see the parse tree, you should replace tokens with the name of the rule that you want to use. If you want to see the list of generated tokens, you should add the -tokens flag.
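For example, assuming your grammar's start rule is named program (substitute the actual rule name from your grammar), these two invocations print the token stream and the parse tree respectively:

C:> grun GrammarName tokens -tokens examplefile
C:> grun GrammarName program -tree examplefile

The first only runs the lexer and lists every token it produced; the second runs the parser starting from program and prints the resulting parse tree in LISP-style text form.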
Which suggests to me that my example file is not being seen as a valid example of the grammar in question.
It's the opposite, actually. If grun detects errors, it will print them to the console. So if there is no output, grun did not detect any errors (though when using tokens as the start rule, it only looks for lexical errors, not syntactic ones). When calling grun on valid input without flags such as -tree or -tokens, the expected result is exactly that: no output.
I am trying to parse a config, which will be translated to a structured form. This new form requires that comments within the original config be preserved. The parsing tool is PLY. I am running into an issue with my current approach, which I will describe in detail below, with links to code as well. The config file will contain multiple config blocks, each of the following format:
<optional comments>
start_of_line request_stmts(one or more)
indent reply_stmts (zero or more)
include_stmts (type 3)(zero or more)
An example config file looks like this.
While I am able to partially parse the config file with the grammar below, I fail to accommodate comments that occur within a block.
For example, a block like this raises syntax errors, and any comments in a block of config fail to parse.
<optional comments>
start_of_line request_stmts(type 1)(one or more)
indent reply_stmts (type 2)(one or more)
<comments>
include_stmts (type 3)(one or more)(optional)
The parser.out mentions one shift/reduce conflict, which I think arises because once the reply_stmts are parsed, a comments section that follows could either mark the start of a new block or be comments within the sub-block. Here is the current grammar's parsing result for the example file:
[['# test comment ', '# more of this', '# does this make sense'],
 'DEFAULT',
 [['x', '=', 'y']],
 [['y', '=', '1']],
 ['# Transmode', '# maybe something else', '# comment'],
 '/random/location/test.user']
As you might notice, the second config block completely misses the username, request_stmt, and reply_stmt sections.
What I have tried
I have tried moving the comments section around in the grammar, by specifying it before specific blocks or in the statement grammar. In the code link pasted above, the comments section has been specified in the overall statement grammar. Both of these approaches fail to parse comments within a config block.
username : comments username
| username
include_stmt : comments includes
| includes
I have two main questions:
Is there a mistake I am making in my implementation or understanding of LR parsing, solving which I could achieve what I want?
Is there a better way to achieve the same goal than my current approach? (PLY-fu, a different parser, a different grammar?)
P.S. I wasn't able to include the actual code in the question; it is mentioned in the comments.
You are correct that the problem is that when the parser sees a comment, it cannot know whether the comment belongs to the same section or whether the previous section is finished. In the former case, the parser needs to shift the comment, while in the latter case it needs to reduce the configuration section.
Since there could be any number of comments, the necessary lookahead could be arbitrarily large, in which case LR parsing wouldn't be possible. But a simple trick can reduce the lookahead to two tokens: just combine consecutive comments into a single token.
Any LR(k) grammar has an equivalent LR(1) grammar. In effect, the LR(1) grammar works by delaying all decisions for k-1 tokens, accumulating those tokens into the parser state. That's a massive increase in grammar size, but it's usually possible to achieve the same effect in other ways, and that's certainly the case here.
The basic idea is that any comment is (temporarily) accumulated into a list of comments. When a non-comment token is encountered, this temporary list is attached to that token.
This can be done either in the lexical scanner or in the parser actions, depending on your inclinations.
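As a rough illustration, here is what the scanner-side version might look like in PLY, with a deliberately simplified token set (the token names and rules here are placeholders, not your actual grammar):

import ply.lex as lex

tokens = ('COMMENT', 'NAME', 'EQUALS', 'VALUE')

t_EQUALS = r'='
t_NAME = r'[A-Za-z_][A-Za-z0-9_]*'
t_VALUE = r'[0-9]+'
t_ignore = ' \t'

def t_COMMENT(t):
    r'\#[^\n]*'
    return t  # keep comments instead of discarding them

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)

class CommentFoldingLexer:
    """Wraps a PLY lexer: buffers consecutive COMMENT tokens and attaches
    them to the next non-comment token as tok.comments, so the grammar
    itself never has to mention comments at all."""
    def __init__(self, lexer):
        self.lexer = lexer

    def input(self, data):
        self.lexer.input(data)

    def token(self):
        pending = []
        while True:
            tok = self.lexer.token()
            if tok is None:
                return None  # EOF; trailing comments are dropped here
            if tok.type == 'COMMENT':
                pending.append(tok.value)
                continue
            tok.comments = pending  # comments that preceded this token
            return tok

The parser is then run with parser.parse(data, lexer=CommentFoldingLexer(lex.lex())), and each grammar action can pick the attached comments off the tokens it receives. If you would rather keep comments visible to the grammar, the same wrapper can instead merge a run of COMMENT tokens into a single COMMENT token, which is the two-token-lookahead variant described above.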
Before attempting all that, you should make sure that retaining comments is really useful to your application. Comments are normally not relevant to the semantics of a program (or configuration file), and it would certainly be much simpler for the lexer to just drop comments into the bit-bucket. If your application will end up reformatting the input, then it will have to retain comments. But if it only needs to extract information from the configuration, putting a lot of effort into handling comments is hard to justify.
To demonstrate the problem, I'm going to create a simple grammar to merely detect Python-like variables.
I create a virtual environment and install antlr4-python3-runtime in it, as mentioned in "Where can I get the runtime?":
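The commands were along these lines (the venv name is just what I used):

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install antlr4-python3-runtime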
Then, I create a PyVar.g4 file with the following content:
grammar PyVar;
program: IDENTIFIER+;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
NEWLINE: '\n' | '\r\n';
WHITESPACE: [ ]+ -> skip;
Now if I test the grammar with grun, I can see that the grammar detects the variables just fine:
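Roughly, the test went like this (grun runs on the Java target, so the Java classes have to be generated and compiled first; the input is typed on stdin and terminated with EOF):

$ antlr4 PyVar.g4
$ javac PyVar*.java
$ grun PyVar program -tokens

The token listing that comes back shows each variable as an IDENTIFIER token.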
Now I'm trying to write a parser in Python to do just that. I generate the Lexer and Parser, using this command:
antlr4 -Dlanguage=Python3 PyVar.g4
And they're generated with no errors:
But when I use the example provided in "How do I run the generated lexer and/or parser?", I get no output:
What am I not doing right?
There are two problems here.
1. The grammar:
In the line where I had,
program: IDENTIFIER+;
the parser will only detect one or more variables; it will not detect any newlines. The output you see when running grun is produced by the lexer; that's why newlines are present in the tokens. So, for the parser to match newlines, I had to replace it with something like this:
program: (IDENTIFIER | NEWLINE)+;
2. Printing the output of the parser
In the PyVar.py file, I created a tree with this line:
tree = parser.program()
But that alone doesn't print anything, and I didn't know how to print it; the OP's comment on this accepted answer suggests using tree.toStringTree().
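Putting both fixes together, a minimal PyVar.py could look like this (a sketch; the file name and the command-line handling are my own choices, not from the FAQ):

import sys
from antlr4 import FileStream, CommonTokenStream
from PyVarLexer import PyVarLexer
from PyVarParser import PyVarParser

def main(argv):
    # Tokenize and parse the file named on the command line.
    input_stream = FileStream(argv[1])
    lexer = PyVarLexer(input_stream)
    tokens = CommonTokenStream(lexer)
    parser = PyVarParser(tokens)
    tree = parser.program()  # program is the start rule
    # Print the parse tree in LISP notation, with rule names resolved.
    print(tree.toStringTree(recog=parser))

if __name__ == '__main__':
    main(sys.argv)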
Now if we fix those, we can see that it works:
I am currently playing with the Happy parser generator.
Other parser generators can give nice messages like "unexpected endline, expected 'then'".
With Happy I just get the current tokens and the position of the error.
Can you give me an example of how to get error messages like above?
There is a Happy feature that I have authored for this purpose.
See my blog post: Toward better GHC syntax errors
It was merged in the pull request "RFC: On parse error - show the next possible tokens".
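The gist: with the %errorhandlertype explist directive, Happy hands your %error function the list of token names that would have been acceptable at the point of failure. A minimal sketch of the relevant pieces of a grammar file, assuming a non-monadic parser and a user-defined Token type with a Show instance (the handler's exact signature changes if you use %monad or %lexer):

%errorhandlertype explist
%error { parseError }

%%

-- ... grammar rules go here ...

{
parseError :: ([Token], [String]) -> a
parseError (remaining, expected) = error $
  "unexpected " ++ show (take 1 remaining)
  ++ ", expected one of: " ++ unwords expected
}

From the expected-token names you can then build messages of the "unexpected endline, expected 'then'" variety.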
Generally, from what I've heard, if you want nice parser errors, use Parsec instead of Happy.
I'm using HUnit-Plus via stack test, which I believe makes use of Distribution.TestSuite.
When I get compilation errors, I get file paths and line numbers in the error. This is great because I can just click on the error in my editor and jump straight to the relevant code.
Other times there is no compilation error and instead I get output like this:
### Failure in testFoo: expected: 8
This isn't so great, because every time I have to navigate to the relevant test by hand. Also, it is sometimes ambiguous which assertion failed, so I have to add a string to label the assertion, which becomes repetitious because the string merely restates the content of the assertion (or else is meaningless). With a line number, that wouldn't be a problem.
Is there a way to get this setup to print line numbers and file paths for test failures?
Compilation errors are generated by GHC itself, which gives you line numbers; to my knowledge no test suite has this feature, though it would be a really nice thing to have. What I found quite helpful is hspec-expectations-pretty-diff, which gives nice diff output; I checked, and it also provides the file path and line number!
Also, I see some room for improvement in your test cases: usually a test case in my projects has a string describing the test, so it is rarely ambiguous which test case failed. And you can use the whole power of Haskell to generate this String!
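For example, with plain HUnit (foo here is a stand-in for the function under test), the label can be generated from the test data itself:

import Test.HUnit

-- Stand-in for the real function under test.
foo :: [Int] -> Int
foo = sum

-- One labelled case per input; the label is computed, not hand-written,
-- so a failure report names the exact case without repeating yourself.
testFoo :: Test
testFoo = TestList
  [ TestLabel ("foo " ++ show input) $
      TestCase (assertEqual "unexpected result" expected (foo input))
  | (input, expected) <- [([1, 2, 3], 6), ([], 0), ([4, 4], 8)]
  ]

main :: IO ()
main = runTestTT testFoo >>= print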
I have a program, written in Fortran 90, that is writing an array to a file, but for some reason it is using an asterisk to represent multiple columns:
8*9, 4, 2*9, 4
Later on, when reading from the file, I am getting I/O errors:
lib-4190 : UNRECOVERABLE library error
A numeric input field contains an invalid character.
Encountered during a list-directed READ from unit 10
Fortran unit 10 is connected to a sequential formatted text file:
Does anyone have any idea why this is happening, and whether there is a flag to feed to the compiler to prevent it? I'm using the Cray Fortran compiler, and the write statement looks like this:
write (lun,*) nsf_species(bundle%species(1:bundle%n_prim))
Update:
The line reading in the data file looks like:
read (lun,*) Info(ifile)%alpha_i(1:size)
I have checked to ensure that it is this line that is causing the problem.
This compression of list-directed output is a very useful feature of the Cray Compilation Environment when writing out large amounts of data. This compressed output will, however, not be read in correctly, as you point out (which is less useful).
You can modify this behaviour, not using a compiler flag but by using the "assign" command.
Consider this sample code:
PROGRAM test
IMPLICIT NONE
INTEGER :: u = 10  ! any valid unit number
OPEN(UNIT=u,FILE="f1",FORM="FORMATTED",STATUS="UNKNOWN")
WRITE(u,*) 0,0,0
CLOSE(u)
OPEN(UNIT=u,FILE="f2",FORM="FORMATTED",STATUS="UNKNOWN")
WRITE(u,*) 0,0,0
CLOSE(u)
END PROGRAM test
We first build with CCE and execute. Files f1 and f2 both contain the compressed output form:
$ ftn -o test.x test.F90
$ ./test.x
$ cat f1
3*0
$ cat f2
3*0
Now we will use "assign" to modify the format in file f2. First we need to define a filename to hold the assign information:
$ export FILENV=my_filenenv
Now we use assign to switch off the compressed output for file f2:
$ assign -y on f:f2
Now we rerun the experiment (without needing to recompile):
$ ./test.x
$ cat f1
3*0
$ cat f2
0, 0, 0
There are options to do this for all files, for certain filename patterns or many other cases.
There are other things that assign can do. See "man assign" with PrgEnv-cray loaded for more details.
The write statement is using list-directed formatting (it is still a formatted output statement; "formatted" means "formatted such that a human can read it"), as specified by the * inside the parenthesised part of the statement. The rules for list-directed output give a great deal of freedom to the compiler. Typically, if you actually care about the details of the output, you should provide an explicit format.
One of the rules that does apply is that the resulting output should generally be suitable for list-directed input. But there are some rather surprising rules for what is permitted as input for list-directed formatting. One such feature is that you can specify, in the input text, a repeat count for an input value using the syntax repeat*value; for example, 8*9 stands for eight values of 9.
The compiler has noticed that there are repeated values in the output, so it has used this repeat-count feature.
I don't know why you get an error message when reading the file under list-directed input, as the line you show is a valid input line for list-directed input. Make sure that the line causing the error is actually the line that you show.
A simple workaround is to change the write statement so that it does not use the compressed format, e.g. change it to:
write (lun,'(*(I5))') nsf_species(bundle%species(1:bundle%n_prim))
The '*' allows an arbitrary number of repeats of the specified format and should suppress the compressed output format.
However, if the compiler writes output in this compressed format, then it should be able to read the same compressed format back in. Hopefully the helpdesk will be able to get to the root of why that does not work.