How to translate ABNF to LBNF - haskell

Context
I'm trying to generate a parser for BCP47 Language-Tag values, which are specified in ABNF (Augmented Backus–Naur form). I'm doing this in Haskell and would like to use the robust BNFC tool-chain, which expects LBNF (Labeled Backus–Naur form). I've searched for tooling to do this conversion automatically and could find none, so I'm basically attempting to write an LBNF for it using the ABNF as reference.
Attempted so far
I've done a lot of searching, and I think this question may be useful, but I can't get bnfc to accept any use of ε, it always spits out a syntax error at that character. For example,
Convert every option [ E ] to a fresh non-terminal X and add
X = ε | E.
-- ABNF option:
-- foo = [ E ]
-- Fresh X
Foo. Foo ::= X ;
-- add
X. X ::= ε | E ;
E. E ::= "e" ;
syntax error at line 8, column 10 due to lexer error
Giving up on that, I tried to get something even simpler working:
language = 2*ALPHA
I could not.
I've seen some BNF documentation (sorry I lost the link now) with an example for digits that looked like:
number ::= digit
number ::= number digit
This makes sense to me, so I tried the following:
LanguageISO2. Language ::= ALPHA ALPHA ;
token ALPHA ( letter ) ;
The fails to parse "en", but does parse "e n". It's clear why, but what is the right way to do what I'm intending?
I can make things kind of work by abusing token,
LanguageISO2. Language ::= ALPHA_TWO ;
token ALPHA_TWO ( letter letter ) ;
But this will quickly get out of hand as I handle 3*ALPHA and 5*8ALPHA, etc.
Specific Question
Could someone convert the following to LBNF so I can see the right approach to these things?
langtag = (language
["-" script]
["-" region]
*("-" variant))
language = (2*3ALPHA [ extlang ])
extlang = *3("-" 3ALPHA) ; reserved for future use
script = 4ALPHA ; ISO 15924 code
region = 2ALPHA ; ISO 3166 code
/ 3DIGIT ; UN M.49 code
variant = 5*8alphanum ; registered variants
/ (DIGIT 3alphanum)
alphanum = (ALPHA / DIGIT) ; letters and numbers
Thanks very much in advance.

Related

ANTLR4 different precedence in two seemingly equivalent grammars

The following test grammars differ only in that the first alternative of the rule 'expr' is either specified inline or refers to another rule 'notExpression' with just the same definition. But this grammars produce different trees parsing this: '! a & b'. Why?
I really want the grammar to produce the first result (with NOT associated with identifier, not with AND expression) but still need to have 'expr' to reference 'notExpression' in my real grammar. What do I have to change?
grammar test;
s: expr ';' <EOF>;
expr:
NOT expr
| left=expr AND right=expr
| identifier
;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree one
grammar test;
s: expr ';' <EOF>;
expr:
notExpression
| left=expr AND right=expr
| identifier
;
notExpression: NOT expr;
identifier: LETTER (LETTER)*;
WS : ' '+ ->skip;
NOT: '!';
AND: '&';
LETTER: 'A'..'z';
Tree two
I kind of got an answer to the second part of my question, which still do not quite give me a satisfaction because using this approach in real elaborate grammar is going to be ugly. As to the first part (WHY) I still have no idea, so more answers are welcome.
Anyway, to fix precedence in presence of referenced rule the 'notExpression' rule can be modified as follows:
notExpression: NOT (identifier|expr);
Which produces the tree different from both shown in original question, but at least the NOT does acquire higher precedence.
Parse tree

Where can BangPatterns appear

From GHC user guide it seems like most Pat can be PBangPat, but there are a few exceptions. e.g. top-level bangs in a module (like !main) aren't allowed and x : !xs fails to parse x : (!xs) parses thanks #chi. What is the formal specification about where the bangs can be added? I've looked into some chapters of the user guide and the Report but found nothing.
There is no accepted formal specification for BangPatterns since they are not a part of any Haskell Report. The closest thing we have to a specification is the User's Guide along with the haskell-prime proposal it links to.
Both of those sources explicitly mention that a bang pattern is not allowed at the top level of a module.
As for x : !xs, the User's Guide has this to say about the syntax of bang patterns:
We add a single new production to the syntax of patterns:
pat ::= !pat
It should be read in conjunction with the Haskell 2010 Report:
pat ::= lpat qconop pat
| lpat
lpat ::= apat
| - (integer | float)
| gcon apat_1 ... apat_k
apat ::= var [ # apat]
| ...
| ( pat )
| ...
According to these rules x : !xs actually should parse (since !xs is a pat, the whole thing is lpat qconop pat). So either the User's Guide (and the haskell-prime proposal) is wrong or GHC is wrong on this point.
I believe that in practice the syntax accepted by GHC is "anything that looks like a valid expression" including interpreting (!x) as a section of the operator !. For example (! Just x) is accepted as a pattern but (! ! x) is not.

Express a rule with ANTLR4

I must define a rule which expresses the following statement: {x in y | x > 0}.
For the first part of that comprehension "x in y", i have the subrule:
FIRSTPART: Name "in" Name
, whereas Name can be everything.
My problem is that I do not want a greedy behaviour. So, it should parse until the "|" sign and then stop. Since I am new in ANTLR4, I do not know how to achieve that.
best regards,
Normally, the lexer/parser rules should represent the allowable syntax of the source input stream.
The evaluation (and consequences) of how the source matches any rule or subrule is a matter of semantics -- whether the input matches a particular subrule and whether that should control how the rule is finally evaluated.
Normally, semantics are implemented as part of the tree-walker analysis. You can use alternate subrule lables (#inExpr, etc) to create easily distinguishable tree nodes for analysis purposes:
comprehension : LBrace expression property? RBrace ;
expression : ....
| Name In Name #inExpr
| Name BinOp Name #binExpr
| ....
;
property : Provided expression ;
BinOp : GT | LT | GTE | .... ;
Provided : '|' ;
In : 'in' ;

How can I write a grammar that matches "x by y by z of a"?

I'm designing a low-punctuation language in which I want to support the declaration of arrays using the following syntax:
512 by 512 of 255 // a 512x512 array filled with 255
100 of 0 // a 100-element array filled with 0
expr1 by expr2 by expr3 ... by exprN of exprFill
These array declarations are just one kind of expression among many.
I'm having a hard time figuring out how to write the grammar rules. I've simplified my grammar down to the simplest thing that reproduces my trouble:
grammar Dimensions;
program
: expression EOF
;
expression
: expression (BY expression)* OF expression
| INT
;
BY : 'by';
OF : 'of';
INT : [0-9]+;
WHITESPACE : [ \t\n\r]+ -> skip;
When I feed in 10 of 1, I get the parse I want:
When I feed in 20 by 10 of 1, the middle expression non-terminal slurps up the 10 of 1, leaving nothing left to match the rule's OF expression:
And I get the following warning:
line 2:0 mismatched input '<EOF>' expecting 'of'
The parse I'd like to see is
(program (expression (expression 20) by (expression 10) of (expression 1)) <EOF>)
Is there a way I can reformulate my grammar to achieve this? I feel that what I need is right-association across both BY and OF, but I don't know how to express this across two operators.
After some non-intellectual experimentation, I came up with some productions that seem to generate my desired parse:
expression
:<assoc=right> expression (BY expression)+ OF expression
|<assoc=right> expression OF expression
| INT
;
I don't know if there's a way I can express it with just one production.

Bison/Flex, reduce/reduce, identifier in different production

I am doing a parser in bison/flex.
This is part of my code:
I want to implement the assignment production, so the identifier can be both boolean_expr or expr, its type will be checked by a symbol table.
So it allows something like:
int a = 1;
boolean b = true;
if(b) ...
However, it is reduce/reduce if I include identifier in both term and boolean_expr, any solution to solve this problem?
Essentially, what you are trying to do is to inject semantic rules (type information) into your syntax. That's possible, but it is not easy. More importantly, it's rarely a good idea. It's almost always best if syntax and semantics are well delineated.
All the same, as presented your grammar is unambiguous and LALR(1). However, the latter feature is fragile, and you will have difficulty maintaining it as you complete the grammar.
For example, you don't include your assignment syntax in your question, but it would
assignment: identifier '=' expr
| identifier '=' boolean_expr
;
Unlike the rest of the part of the grammar shown, that production is ambiguous, because:
x = y
without knowing anything about y, y could be reduced to either term or boolean_expr.
A possibly more interesting example is the addition of parentheses to the grammar. The obvious way of doing that would be to add two productions:
term: '(' expr ')'
boolean_expr: '(' boolean_expr ')'
The resulting grammar is not ambiguous, but it is no longer LALR(1). Consider the two following declarations:
boolean x = (y) < 7
boolean x = (y)
In the first one, y must be an int so that (y) can be reduced to a term; in the second one y must be boolean so that (y) can be reduced to a boolean_expr. There is no ambiguity; once the < is seen (or not), it is entirely clear which reduction to choose. But < is not the lookahead token, and in fact it could be arbitrarily distant from y:
boolean x = ((((((((((((((((((((((y...
So the resulting unambiguous grammar is not LALR(k) for any k.
One way you could solve the problem would be to inject the type information at the lexical level, by giving the scanner access to the symbol table. Then the scanner could look a scanned identifier token in the symbol table and use the information in the symbol table to decide between one of three token types (or more, if you have more datatypes): undefined_variable, integer_variable, and boolean_variable. Then you would have, for example:
declaration: "int" undefined_variable '=' expr
| "boolean" undefined_variable '=' boolean_expr
;
term: integer_variable
| ...
;
boolean_expr: boolean_variable
| ...
;
That will work but it should be obvious that this is not scalable: every time you add a type, you'll have to extend both the grammar and the lexical description, because the now the semantics is not only mixed up with the syntax, it has even gotten intermingled with the lexical analysis. Once you let semantics out of its box, it tends to contaminate everything.
There are languages for which this really is the most convenient solution: C parsing, for example, is much easier if typedef names and identifier names are distinguished so that you can tell whether (t)*x is a cast or a multiplication. (But it doesn't work so easily for C++, which has much more complicated name lookup rules, and also much more need for semantic analysis in order to find the correct parse.)
But, honestly, I'd suggest that you do not use C -- and much less C++ -- as a model of how to design a language. Languages which are hard for compilers to parse are also hard for human beings to parse. The "most vexing parse" continues to be a regular source of pain for C++ newcomers, and even sometimes trips up relatively experienced programmers:
class X {
public:
X(int n = 0) : data_is_available_(n) {}
operator bool() const { return data_is_available_; }
// ...
private:
bool data_is_available_;
// ...
};
X my_x_object();
// ...
if (!x) {
// This code is unreachable. Can you see why?
}
In short, you're best off with a language which can be parsed into an AST without any semantic information at all. Once the parser has produced the AST, you can do semantic analyses in separate passes, one of which will check type constraints. That's far and away the cleanest solution. Without explicit typing, the grammar is slightly simplified, because an expr now can be any expr:
expr: conjunction | expr "or" conjunction ;
conjunction: comparison | conjunction "and" comparison ;
comparison: product | product '<' product ;
product: factor | product '*' factor ;
factor: term | factor '+' term ;
term: identifier
| constant
| '(' expr ')'
;
Each action in the above would simply create a new AST node and set $$ to the new node. At the end of the parse, the AST is walked to verify that all exprs have the correct type.
If that seems like overkill for your project, you can do the semantic checks in the reduction actions, effectively intermingling the AST walk with the parse. That might seem convenient for immediate evaluation, but it also requires including explicit type information in the parser's semantic type, which adds unnecessary overhead (and, as mentioned, the inelegance of letting semantics interfere with the parser.) In that case, every action would look something like this:
expr : expr '+' expr { CheckArithmeticCompatibility($1, $3);
$$ = NewArithmeticNode('+', $1, $3);
}

Resources