Antlr4: How to exit a grammar rule? - antlr4

So I'm experimenting with Antlr v4, and I'm poking it with some unusual grammar to get a sense of how it works. Here's my current test case:
I'd like a grammar that consists of the letters A, B, C, D in that order. The letters may be repeated. I also group the A's and B's together, and the C's and D's also, to make the grammar more interesting. So strings like these are acceptable inputs:
AAA
ABCD
ACCCDD
But it's not going well. I think what is happening is that Antlr needs a better exit rule for my grammar. It doesn't seem to recognize that after collecting the A's and B's, that the presence of a C means to go to the next rule. Actually it's sort of working, but I get error messages, and the resulting parse tree seems to have null elements in it, like it inserted an extra element where it issued the error message.
Here's an example error message:
line 1:2 extraneous input 'C' expecting {'B', 'A'}
which happens for the input 'ABCD'. So something weird is going on when Antlr sees the C there. Here's the output of the parse tree:
'ABCD': (prog (aOrB (a A) (aOrB (b B) aOrB)) (cOrD (c C) (cOrD (d D) cOrD)) <EOF>)
which you can see has an empty aOrB element there at the end of the first set of elements.
Any idea what is going on? What is Antlr "thinking" here when it issues the error and adds the empty element? And how might I fix this?
OK, here are the gory details.
My grammar:
grammar Abcd;
prog : aOrB cOrD EOF;
aOrB : ( a | b ) aOrB ;
a : 'A'+ ;
b : 'B'+ ;
cOrD : ( c | d ) cOrD ;
c : 'C'+ ;
d : 'D'+ ;
My test program in Java:
package antlrtests;

import antlrtests.grammars.*;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

class AbcdTest {
    private final String[] testVectors = {
        "A", "AABB", "B", "ABCD", "C", "D", };

    public void runTests() {
        for( String test : testVectors )
            simpleTest( test );
    }

    private void simpleTest( String test ) {
        ANTLRInputStream ains = new ANTLRInputStream( test );
        AbcdLexer wpl = new AbcdLexer( ains );
        CommonTokenStream tokens = new CommonTokenStream( wpl );
        AbcdParser wikiParser = new AbcdParser( tokens );
        ParseTree parseTree = wikiParser.prog();
        System.out.println( "'" + test + "': " + parseTree.toStringTree(
            wikiParser ) );
    }
}
And here is the output of my test program. Note that the error messages are jumbled up with the regular output because Antlr prints them on standard error.
run:
line 1:1 no viable alternative at input '<EOF>'
'A': (prog (aOrB (a A) aOrB) cOrD <EOF>)
line 1:4 no viable alternative at input '<EOF>'
'AABB': (prog (aOrB (a A A) (aOrB (b B B) aOrB)) cOrD <EOF>)
'B': (prog (aOrB (b B) aOrB) cOrD <EOF>)
line 1:1 no viable alternative at input '<EOF>'
line 1:2 extraneous input 'C' expecting {'B', 'A'}
line 1:4 no viable alternative at input '<EOF>'
'ABCD': (prog (aOrB (a A) (aOrB (b B) aOrB)) (cOrD (c C) (cOrD (d D) cOrD)) <EOF>)
line 1:0 no viable alternative at input 'C'
line 1:1 no viable alternative at input '<EOF>'
line 1:0 no viable alternative at input 'D'
'C': (prog aOrB (cOrD (c C) cOrD) <EOF>)
line 1:1 no viable alternative at input '<EOF>'
'D': (prog aOrB (cOrD (d D) cOrD) <EOF>)
BUILD SUCCESSFUL (total time: 0 seconds)
Any help is much appreciated.

Is this not what you're after?
prog : 'A'* 'B'* 'C'* 'D'* EOF;
The following rule of your grammar matches an infinitely long series of A and B tokens because the tail-recursive aOrB reference is not optional. Your grammar will either throw a StackOverflowError (with the Java target) if the input starts with sufficiently many A and/or B characters, or encounter a syntax error if it does not.
aOrB : ( a | b ) aOrB ;
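To make the point about the mandatory recursive reference concrete, here is a rough hand-written model of that rule in Python. It is not how ANTLR's prediction or error recovery works; the function names are made up for illustration, and the second function assumes a hypothetical variant of the rule in which the trailing aOrB is optional.

```python
def match_run(text, pos, letter):
    """Match one or more copies of `letter` at pos; return the new position or None."""
    end = pos
    while end < len(text) and text[end] == letter:
        end += 1
    return end if end > pos else None


def a_or_b_mandatory(text, pos):
    """Model of: aOrB : ( a | b ) aOrB ;  (the recursive reference is required)."""
    nxt = match_run(text, pos, "A")
    if nxt is None:
        nxt = match_run(text, pos, "B")
    if nxt is None:
        return None  # neither a nor b can start here, so the rule fails
    # Having matched a or b, the rule still demands another aOrB.
    return a_or_b_mandatory(text, nxt)


def a_or_b_optional(text, pos):
    """Model of a hypothetical variant: aOrB : ( a | b ) aOrB? ;"""
    nxt = match_run(text, pos, "A")
    if nxt is None:
        nxt = match_run(text, pos, "B")
    if nxt is None:
        return None
    rest = a_or_b_optional(text, nxt)
    return nxt if rest is None else rest
```

In this model a_or_b_mandatory can never succeed on finite input: every successful match of a or b is followed by one more required call, and the last call always fails. The optional variant stops cleanly at the first character that is neither A nor B.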
If you want to maintain the groupings, you could use this grammar instead. I only made changes to the aOrB and cOrD rules. Since the a rule matches a sequence of A tokens, the aOrB rule uses a? instead of a* (only one instance of a could ever appear, and the entire series of A tokens would be its children).
grammar Abcd;
prog : aOrB cOrD EOF;
aOrB : a? b?;
a : 'A'+ ;
b : 'B'+ ;
cOrD : c? d?;
c : 'C'+ ;
d : 'D'+ ;
Here is another grammar that matches the same language (but produces a different parse tree) showing other options for the *, +, and ? quantifiers. I wouldn't recommend using this grammar, but you should look over it very carefully to understand what each choice is doing and understand why it matches exactly the same input as the grammar I gave above.
grammar Abcd;
prog : aOrB cOrD? EOF;
aOrB : a* b;
a : 'A' ;
b : 'B'* ;
cOrD : (c d* | d+);
c : 'C'+ ;
d : 'D' ;
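One way to convince yourself that the two grammars accept the same input is to brute-force short strings. The sketch below is only an approximation of the grammars: I transcribed each one by hand into a Python regular expression (my own translation, not something ANTLR produces), so it checks language membership only and says nothing about the shape of the parse trees.

```python
import itertools
import re

# First grammar:  aOrB : a? b? ;  cOrD : c? d? ;  a : 'A'+ ;  b : 'B'+ ; ...
GROUPED = re.compile(r"(?:A+)?(?:B+)?(?:C+)?(?:D+)?")

# Second grammar: prog : aOrB cOrD? EOF ;  aOrB : a* b ;  b : 'B'* ;
#                 cOrD : (c d* | d+) ;  a : 'A' ;  c : 'C'+ ;  d : 'D' ;
ALTERNATE = re.compile(r"A*(?:B*)(?:C+D*|D+)?")


def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in itertools.product(alphabet, repeat=n):
            yield "".join(letters)


def disagreements(max_len=6):
    """Return the strings on which the two transcriptions give different answers."""
    return [
        s
        for s in all_strings("ABCD", max_len)
        if bool(GROUPED.fullmatch(s)) != bool(ALTERNATE.fullmatch(s))
    ]
```

An empty result from disagreements() is evidence rather than proof, since only strings up to the chosen length are tried. Note that both transcriptions also accept the empty string, because every group is optional.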

Did you realize that your aOrB rule doesn't enforce any ordering of a's and b's? Likewise your cOrD rule.

Related

How to translate ABNF to LBNF

Context
I'm trying to generate a parser for BCP47 Language-Tag values, which are specified in ABNF (Augmented Backus–Naur form). I'm doing this in Haskell and would like to use the robust BNFC tool-chain, which expects LBNF (Labeled Backus–Naur form). I've searched for tooling to do this conversion automatically and could find none, so I'm basically attempting to write an LBNF for it using the ABNF as reference.
Attempted so far
I've done a lot of searching, and I think this question may be useful, but I can't get bnfc to accept any use of ε, it always spits out a syntax error at that character. For example,
Convert every option [ E ] to a fresh non-terminal X and add
X = ε | E.
-- ABNF option:
-- foo = [ E ]
-- Fresh X
Foo. Foo ::= X ;
-- add
X. X ::= ε | E ;
E. E ::= "e" ;
syntax error at line 8, column 10 due to lexer error
Giving up on that, I tried to get something even simpler working:
language = 2*ALPHA
I could not.
I've seen some BNF documentation (sorry I lost the link now) with an example for digits that looked like:
number ::= digit
number ::= number digit
This makes sense to me, so I tried the following:
LanguageISO2. Language ::= ALPHA ALPHA ;
token ALPHA ( letter ) ;
This fails to parse "en", but does parse "e n". It's clear why, but what is the right way to do what I'm intending?
I can make things kind of work by abusing token,
LanguageISO2. Language ::= ALPHA_TWO ;
token ALPHA_TWO ( letter letter ) ;
But this will quickly get out of hand as I handle 3*ALPHA and 5*8ALPHA, etc.
Specific Question
Could someone convert the following to LBNF so I can see the right approach to these things?
langtag = (language
["-" script]
["-" region]
*("-" variant))
language = (2*3ALPHA [ extlang ])
extlang = *3("-" 3ALPHA) ; reserved for future use
script = 4ALPHA ; ISO 15924 code
region = 2ALPHA ; ISO 3166 code
/ 3DIGIT ; UN M.49 code
variant = 5*8alphanum ; registered variants
/ (DIGIT 3alphanum)
alphanum = (ALPHA / DIGIT) ; letters and numbers
Thanks very much in advance.
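For comparison while working on the LBNF, the ABNF fragment above can be transcribed fairly mechanically into a character-level regular expression. This is not LBNF and does not answer the BNFC question; it is only a sketch showing how the bounded repetitions (2*3ALPHA, *3(...), 5*8alphanum) expand, and it can serve as a reference to test an eventual grammar against. All names below (LANGTAG, parse_langtag, and so on) are my own.

```python
import re

ALPHA = "[A-Za-z]"
DIGIT = "[0-9]"
ALPHANUM = "[A-Za-z0-9]"

EXTLANG = rf"(?:-{ALPHA}{{3}}){{0,3}}"   # extlang  = *3("-" 3ALPHA)
LANGUAGE = rf"{ALPHA}{{2,3}}{EXTLANG}"    # language = (2*3ALPHA [ extlang ])
SCRIPT = rf"{ALPHA}{{4}}"                 # script   = 4ALPHA
REGION = rf"(?:{ALPHA}{{2}}|{DIGIT}{{3}})"                   # 2ALPHA / 3DIGIT
VARIANT = rf"(?:{ALPHANUM}{{5,8}}|{DIGIT}{ALPHANUM}{{3}})"   # 5*8alphanum / (DIGIT 3alphanum)

# langtag = (language ["-" script] ["-" region] *("-" variant))
LANGTAG = re.compile(
    rf"(?P<language>{LANGUAGE})"
    rf"(?:-(?P<script>{SCRIPT}))?"
    rf"(?:-(?P<region>{REGION}))?"
    rf"(?P<variants>(?:-{VARIANT})*)"
)


def parse_langtag(tag):
    """Return the parts of `tag` as a dict, or None if it does not match."""
    m = LANGTAG.fullmatch(tag)
    if m is None:
        return None
    variants = [v for v in m.group("variants").split("-") if v]
    return {
        "language": m.group("language"),
        "script": m.group("script"),
        "region": m.group("region"),
        "variants": variants,
    }
```

The regex engine's backtracking resolves the overlaps between subtags (for example, "-1996" is first tried as a 3DIGIT region and then accepted as a variant). A grammar with a separate lexer does not get that for free, which is part of why the token-level approach gets awkward.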

How to correctly extend an ANTLR4 grammar?

I have a requirement where I want to extend an existing grammar A with additions defined in grammar B to produce a grammar C.
I have already tried importing grammar A in B, but that selects only certain things defined in grammar A. My guess is that the unused content of A in B is skipped while generating classes. This makes sense as the requirement is not to inherit but intermix/ merge/ combine the two grammars.
Just for understanding (the original grammar is huge), an example:
File : A.g4:
grammar A;
keywords
: X
| Y
| Z
;
X: 'X';
Y: 'Y';
Z: 'Z';
File : B.g4:
grammar B;
keywords
: A
| B
| C
;
A: 'A';
B: 'B';
C: 'C';
File : C.g4:
grammar C;
keywords
: X
| Y
| Z
| A
| B
| C
;
X: 'X';
Y: 'Y';
Z: 'Z';
A: 'A';
B: 'B';
C: 'C';
Note: I do not have the option to manipulate the grammar A directly, but I want to retain all the functionality in grammar A along with the additional rules/ keywords etc. defined in grammar B as shown above.
Any help will be much appreciated. Thanks.
Grammar import might not work as you expect it to work. Rules in the importing grammar take precedence over same named rules in the imported grammar. Thus you cannot override an existing rule in your main grammar. See also the description in the ANTLR4 repo:
Think of import as more like a smart include statement (which does not include rules that are already defined).
However, it should be possible to override a rule in a second imported grammar. In your case, you would not define the keywords rule in your main grammar (I assume this is C). Import the grammars A and B in reverse order if you want B's keywords rule to take precedence over the one in A.
import B, A;
This is also demonstrated in the image from this Markdown file:
The rule r from grammar G2 is ignored, since it is imported last, so G3 kinda "overrides" it.
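The "smart include" behaviour described above can be modelled as "the first definition of a rule name wins". The Python sketch below is a deliberately simplified model, not ANTLR's implementation: it ignores nested imports, token vocabularies and lexer rule ordering, and the function and dictionary names are invented for the example.

```python
def resolve_rules(main_rules, imports):
    """Merge rule tables so that the first definition of each name is kept."""
    resolved = dict(main_rules)  # rules in the importing grammar win
    for grammar in imports:      # then the imports, in the order listed
        for name, body in grammar.items():
            resolved.setdefault(name, body)  # skip names already defined
    return resolved


# Rule tables mirroring A.g4 and B.g4 from the question.
GRAMMAR_A = {"keywords": "X | Y | Z", "X": "'X'", "Y": "'Y'", "Z": "'Z'"}
GRAMMAR_B = {"keywords": "A | B | C", "A": "'A'", "B": "'B'", "C": "'C'"}

# "import B, A;" with no keywords rule in the main grammar: B's version is kept.
b_first = resolve_rules({}, [GRAMMAR_B, GRAMMAR_A])

# Defining keywords in the main grammar overrides both imported versions.
combined = resolve_rules(
    {"keywords": "X | Y | Z | A | B | C"}, [GRAMMAR_A, GRAMMAR_B]
)
```

Under this model, import order alone only selects one of the two keywords rules. To get the union shown in C.g4, the combined keywords rule would have to be written out in the main grammar while the token rules are still pulled in from A and B.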

G-machine, (non-)strict contexts - why case expressions need special treatment

I'm currently reading Implementing functional languages: a tutorial by SPJ and the (sub)chapter I'll be referring to in this question is 3.8.7 (page 136).
The first remark there is that a reader following the tutorial has not yet implemented C scheme compilation (that is, of expressions appearing in non-strict contexts) of ECase expressions.
The solution proposed is to transform a Core program so that ECase expressions simply never appear in non-strict contexts. Specifically, each such occurrence creates a new supercombinator with exactly one variable whose body corresponds to the original ECase expression, and the occurrence itself is replaced with a call to that supercombinator.
Below I present a (slightly modified) example of such a transformation from the tutorial:
t a b = Pack{2,1} ;
f x = Pack{2,2} (case t x 7 6 of
<1> -> 1;
<2> -> 2) Pack{1,0} ;
main = f 3
== transformed into ==>
t a b = Pack{2,1} ;
f x = Pack{2,2} ($Case1 (t x 7 6)) Pack{1,0} ;
$Case1 x = case x of
<1> -> 1;
<2> -> 2 ;
main = f 3
I implemented this solution and it works like a charm, that is, the output is Pack{2,2} 2 Pack{1,0}.
However, what I don't understand is: why all that trouble? I hope it's not just me, but my first thought for solving the problem was to just implement compilation of ECase expressions in the C scheme. I did it by mimicking the rule for compilation in the E scheme (page 134 of the tutorial; I present that rule here for completeness). So I used
E[[case e of alts]] p = E[[e]] p ++ [Casejump D[[alts]] p]
and wrote
C[[case e of alts]] p = C[[e]] p ++ [Eval] ++ [Casejump D[[alts]] p]
I added [Eval] because Casejump needs an argument on top of the stack in weak head normal form (WHNF) and C scheme doesn't guarantee that, as opposed to E scheme.
But then the output changes to the enigmatic Pack{2,2} 2 6.
The same applies when I use the same rule as for E scheme, i.e.
C[[case e of alts]] p = E[[e]] p ++ [Casejump D[[alts]] p]
So I guess that my "obvious" solution is inherently wrong - and I can see that from outputs. But I'm having trouble stating formal arguments as to why that approach was bound to fail.
Can someone provide me with such argument/proof or some intuition as to why the naive approach doesn't work?
The purpose of the C scheme is to not perform any computation, but just delay everything until an EVAL happens (which it might or might not). What are you doing in your proposed code generation for case? You're calling EVAL! And the whole purpose of C is to not call EVAL on anything, so you've now evaluated something prematurely.
The only way you could generate code directly for case in the C scheme would be to add some new instruction to perform the case analysis once it's evaluated.
But we (Thomas Johnsson and I) decided it was simpler to just lift out such expressions. The exact historical details are lost in time though. :)
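A loose analogy in Python may help with the intuition. It is not G-machine code and does not explain the specific wrong output in the question; it only contrasts "do the case analysis while building the structure" with "build an unevaluated application and let a later EVAL force it". The Thunk class and the function names are made up for this sketch.

```python
class Thunk:
    """A delayed computation that runs at most once, when forced."""

    def __init__(self, compute):
        self._compute = compute
        self._done = False
        self._value = None

    def force(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
            self._compute = None
        return self._value


log = []  # records when the scrutinee is actually evaluated


def scrutinee():
    log.append("scrutinee evaluated")
    return 2  # pretend the scrutinee evaluates to constructor tag 2


def case_on(tag):
    # Stand-in for: case x of <1> -> 1; <2> -> 2
    return {1: 1, 2: 2}[tag]


def build_eager():
    # Analogue of emitting Eval in a non-strict context:
    # the case runs while the data structure is still being built.
    return ("Pack{2,2}", case_on(scrutinee()), "Pack{1,0}")


def build_lazy():
    # Analogue of lifting the case into $Case1:
    # only an unevaluated application is stored in the structure.
    return ("Pack{2,2}", Thunk(lambda: case_on(scrutinee())), "Pack{1,0}")
```

With build_lazy nothing is logged until the middle field is forced, which is the behaviour a non-strict context is supposed to have. With build_eager the scrutinee is evaluated immediately, whether or not anyone ever looks at that field.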

How can I write a grammar that matches "x by y by z of a"?

I'm designing a low-punctuation language in which I want to support the declaration of arrays using the following syntax:
512 by 512 of 255 // a 512x512 array filled with 255
100 of 0 // a 100-element array filled with 0
expr1 by expr2 by expr3 ... by exprN of exprFill
These array declarations are just one kind of expression among many.
I'm having a hard time figuring out how to write the grammar rules. I've simplified my grammar down to the simplest thing that reproduces my trouble:
grammar Dimensions;
program
: expression EOF
;
expression
: expression (BY expression)* OF expression
| INT
;
BY : 'by';
OF : 'of';
INT : [0-9]+;
WHITESPACE : [ \t\n\r]+ -> skip;
When I feed in 10 of 1, I get the parse I want.
When I feed in 20 by 10 of 1, the middle expression non-terminal slurps up the 10 of 1, leaving nothing to match the rule's OF expression.
And I get the following warning:
line 2:0 mismatched input '<EOF>' expecting 'of'
The parse I'd like to see is
(program (expression (expression 20) by (expression 10) of (expression 1)) <EOF>)
Is there a way I can reformulate my grammar to achieve this? I feel that what I need is right-association across both BY and OF, but I don't know how to express this across two operators.
After some non-intellectual experimentation, I came up with some productions that seem to generate my desired parse:
expression
:<assoc=right> expression (BY expression)+ OF expression
|<assoc=right> expression OF expression
| INT
;
I don't know if there's a way I can express it with just one production.
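To pin down the parse I'm after independently of ANTLR, here is a small hand-written recursive-descent sketch in Python. It makes a simplifying assumption that the grammar above does not: each dimension is a plain INT rather than an arbitrary expression, and only the fill value may itself be another array declaration. The tuple shape and the function names are invented for the example, and comments (//) are not handled.

```python
import re

# One token per match: an integer, or one of the keywords 'by' / 'of'.
TOKEN = re.compile(r"\s*(?:(\d+)|(by|of)\b)")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1) or m.group(2))
        pos = m.end()
    return tokens


def parse_expression(tokens, pos=0):
    """Parse INT, or INT (by INT)* of expression. Return (tree, next_pos)."""
    if pos >= len(tokens) or not tokens[pos].isdigit():
        raise SyntaxError(f"expected INT at token {pos}")
    dims = [tokens[pos]]
    pos += 1
    while pos < len(tokens) and tokens[pos] == "by":
        if pos + 1 >= len(tokens) or not tokens[pos + 1].isdigit():
            raise SyntaxError(f"expected INT after 'by' at token {pos}")
        dims.append(tokens[pos + 1])
        pos += 2
    if pos < len(tokens) and tokens[pos] == "of":
        # The fill is parsed recursively, so 'of' nests to the right.
        fill, pos = parse_expression(tokens, pos + 1)
        return ("array", dims, fill), pos
    if len(dims) > 1:
        raise SyntaxError("expected 'of' after the dimensions")
    return dims[0], pos


def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_expression(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

Here the dimensions are collected greedily before the parser looks for "of", so the 10 in 20 by 10 of 1 can never swallow the trailing of 1. Whether that restriction on dimensions is acceptable depends on how general the dimension expressions need to be.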

Prolog importing facts from a formatted text file

I have the following input in a text file input.txt
atom1,atom2,atom3
relation(atom1 ,[10,5,2])
relation(atom2 ,[3,10,2])
relation(atom3 ,[6,5,10])
The first line contains the list of atoms used in the relation predicates in the file, and each remaining line is a relation predicate whose values follow the order of that list. For example, relation(atom1, [10,5,2]) means atom1 has a relation value of 10 with the first atom, 5 with the second and 2 with the third.
I need to read this file and add the relation values for each atom separately. For example, these are the relation values that will be added for atom1:
assert(relation(atom1, atom1,10)).
assert(relation(atom1, atom2, 5)).
assert(relation(atom1, atom3, 2)).
I have read some Prolog I/O tutorials and seen some recommendations to use DCGs, but I'm a beginner Prolog programmer and am having trouble choosing a method for solving the problem. So I'm here to ask for help from experienced Prolog programmers.
Since you didn't state which Prolog you're using, here is a snippet written in SWI-Prolog. I have tried to flag non-ISO builtins with references to the SWI-Prolog docs.
parse_input :-
    open('input.txt', read, S),
    parse_line(S, atoms(Atoms)),
    repeat,
    (   parse_line(S, a_struct(relation(A, L)))
    ->  store(Atoms, A, L), fail
    ;   true
    ),
    close(S).

:- meta_predicate(parse_line(+, //)).
parse_line(S, Grammar) :-
    % see http://www.swi-prolog.org/pldoc/doc_for?object=read_line_to_codes/2
    read_line_to_codes(S, L),
    L \= end_of_file,
    phrase(Grammar, L).

% match any sequence
% note - clauses order is mandatory
star([]) --> [].
star([C|Cs]) --> [C], star(Cs).

% --- DCGs ---
% comma sep atoms
atoms(R) -->
    star(S),
    (   ",",
        {atom_codes(A, S), R = [A|As]},
        atoms(As)
    ;   {atom_codes(A, S), R = [A]}
    ).

% parse a struct X,
% but it's far easier to use a builtin :)
% see http://www.swi-prolog.org/pldoc/doc_for?object=atom_to_term/3
a_struct(X, Cs, []) :-
    atom_codes(A, Cs),
    atom_to_term(A, X, []).

% storage handler
:- dynamic(relation/3).
store(Atoms, A, L) :-
    nth1(I, L, W),
    nth1(I, Atoms, B),
    assertz(relation(A, B, W)).

With the sample input.txt, I get
?- parse_input.
true .
?- listing(relation).
:- dynamic relation/3.
relation(atom1, atom1, 10).
relation(atom1, atom2, 5).
relation(atom1, atom3, 2).
relation(atom2, atom1, 3).
relation(atom2, atom2, 10).
relation(atom2, atom3, 2).
relation(atom3, atom1, 6).
relation(atom3, atom2, 5).
relation(atom3, atom3, 10).
HTH
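If you want to cross-check the expected facts outside Prolog, the same expansion can be sketched in a few lines of Python. This is only an illustration of the file format as described in the question; the regular expression and the function name are my own, and unlike the nth1-based store above it raises an error when a list has the wrong length.

```python
import re

# Matches lines such as: relation(atom1 ,[10,5,2])
RELATION_LINE = re.compile(r"relation\(\s*(\w+)\s*,\s*\[([^\]]*)\]\s*\)")


def parse_relations(text):
    """Expand the file contents into (source, target, value) triples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    atoms = [name.strip() for name in lines[0].split(",")]
    facts = []
    for line in lines[1:]:
        m = RELATION_LINE.fullmatch(line)
        if m is None:
            raise ValueError(f"unrecognised line: {line!r}")
        source = m.group(1)
        values = [int(v) for v in m.group(2).split(",")]
        if len(values) != len(atoms):
            raise ValueError(f"expected {len(atoms)} values in: {line!r}")
        # Pair each value with the atom at the same position on the first line.
        facts.extend((source, target, v) for target, v in zip(atoms, values))
    return facts


SAMPLE = """atom1,atom2,atom3
relation(atom1 ,[10,5,2])
relation(atom2 ,[3,10,2])
relation(atom3 ,[6,5,10])
"""
```

Calling parse_relations(SAMPLE) should produce the same nine triples, in the same order, as the listing(relation) output above.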
