Mixing two languages

I am writing a grammar for a small meta language. That language should include code blocks of another language (e.g., JavaScript, C, or the like). I would like to treat these code blocks as plain strings that are printed out unchanged. My language is based on C/Java syntax and uses { } for code blocks, but I would also like to use { } for the code blocks of the embedded language. Here is some example code:
// my language
modul Abc {
input x: string;
otherLang {
// this is now a code block from the second
// language, which I do not want to analyze
// It might itself contain { } like
if (something) {
abc = "string";
}
}
}
How would I reuse { and } for those different uses without mixing them up with the ones from the embedded language?

An interesting way to do this is to use mode recursion. ANTLR internally maintains a mode stack, so a lexer mode can push itself again for every nested opening brace and pop once for every closing brace.
Although a bit verbose, the recursive mode makes it possible to handle things, such as comments and escaped characters, that could otherwise throw off the nesting.
One thing to be aware of is that rules with the more command concatenate their matched content into the token produced by the next rule that does not use more. The following example uses the virtual token OTHER_END to provide semantic clarity and to avoid confusion with an ordinary RPAREN token.
tokens {
OTHER_END
}
otherLang : OTHER_BEG OTHER_END+ ; // multiple 'end's dependent on nesting
OTHER_BEG : 'otherLang' LPAREN -> pushMode(Other) ;
LPAREN : LParen ;
RPAREN : RParen ;
WS : [ \t\r\n] -> skip;
mode Other ;
// handle special cases here
O_RPAREN : RParen -> type(OTHER_END), popMode() ;
O_LPAREN : LParen -> more, pushMode(Other) ;
O_STUFF : . -> more ;
fragment LParen : '{' ;
fragment RParen : '}' ;
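To make the push/pop bookkeeping concrete, here is a minimal hand-written C++ sketch of the same idea. It is not ANTLR-generated code: captureBlock is a hypothetical helper, and a plain depth counter stands in for the height of the mode stack. Unlike the grammar above, it returns the whole block as one string instead of emitting one OTHER_END per closing brace, and it deliberately ignores braces inside strings and comments, which is exactly the kind of special case the Other mode would have to handle.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration, not part of ANTLR).
// Copies the text between the '{' at index `open` and its matching '}' into
// `out`. Returns false if `open` is not a '{' or the block is never closed.
bool captureBlock(const std::string& src, std::size_t open, std::string& out) {
    if (open >= src.size() || src[open] != '{') {
        return false;
    }
    std::size_t depth = 0;  // plays the role of the mode stack height
    for (std::size_t i = open; i < src.size(); ++i) {
        if (src[i] == '{') {
            ++depth;        // comparable to pushMode(Other)
        } else if (src[i] == '}') {
            --depth;        // comparable to popMode()
            if (depth == 0) {
                out = src.substr(open + 1, i - open - 1);
                return true;
            }
        }
    }
    return false;           // input ended before the block was closed
}
```

Calling captureBlock on the position of the first { after otherLang yields the embedded code unchanged, including any nested { } pairs.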

Related

Is there a parser equivalent of 'fragment' marking in ANTLR4?

Is there a way to tell ANTLR4 to inline the parser rule?
It seems reasonable to have such a feature. After reading the book on ANTLR ("The Definitive ANTLR 4 Reference") I haven't found such a possibility, but changes might have been introduced in the 4 years since the book was released, so I guess it is better to ask here.
Consider the following piece of grammar:
file: ( item | class_decl )*;
class_decl: 'class' class_name '{' type_decl* data_decl* code_decl* '}';
type_decl: 'typedef' ('bool'|'int'|'real') type_name;
const_decl: 'const' type_name const_name;
var_decl: 'var' type_name var_name;
...
fragment item: type_decl | data_decl | code_decl;
fragment data_decl: const_decl | var_decl;
fragment code_decl: function_decl | procedure_decl;
fragment class_name: ID;
fragment type_name: ID;
fragment const_name: ID;
fragment var_name: ID;
The rules marked as fragment are there for clarity, documentation and reusability. From a syntax point of view, however, it is (for example) really a var_decl that is the direct element of file or class_decl, and I'd like that to be reflected in the contexts created by the parser. All the intermediate contexts created for item, data_decl, etc. are superfluous: they needlessly take up space and bind the visitor to the organizational structure of the grammar instead of its actual meaning.
To sum up - I'd expect ANTLR to turn the above grammar into the following before generation of a parser:
file: ( type_decl | const_decl | var_decl | function_decl | procedure_decl | class_decl )*;
class_decl: 'class' ID '{' type_decl* ( const_decl | var_decl )* ( function_decl | procedure_decl )* '}';
type_decl: 'typedef' ('bool'|'int'|'real') ID;
const_decl: 'const' ID ID;
var_decl: 'var' ID ID;
...

No, there is no such thing for parser rules. You could raise an issue/RFE in ANTLR's GitHub repo for such a thing: https://github.com/antlr/antlr4/issues

You can use rule element labels. They provide similar functionality but are more restricted (applicable only to a single token or rule):
file: ( item | class_decl )*;
class_decl: 'class' class_name=ID '{' type_decl* data_decl* code_decl* '}';
type_decl: 'typedef' ('bool'|'int'|'real') type_name=ID;
const_decl: 'const' type_name=ID const_name=ID;
var_decl: 'var' type_name=ID var_name=ID;
...
item: type_decl | data_decl | code_decl;
data_decl: const_decl | var_decl;
code_decl: function_decl | procedure_decl;

How to define a token which is all those characters in set A, except those in sub-set B?

In RFC2616 (HTTP/1.1) the definition of a 'token' in section '2.2 Basic Rules' is given as:
token = 1*<any CHAR except CTLs or separators>
From that section, I've got the following fragments, and now I want to define 'TOKEN':
lexer grammar AcceptEncoding;
TOKEN: /* (CHAR excluding (CTRL | SEPARATORS)) */
fragment CHAR: [\u0000-\u007f];
fragment CTRL: [\u0000-\u001f] | '\u007f';
fragment SEPARATORS: [()<>@,;:\\"/\[\]?={}] | SP | HT;
fragment SP: ' ';
fragment HT: '\t';
How do I approximate my hypothetical 'excluding' operator for the definition of TOKEN?

There is no set/range math in ANTLR. You can only combine several sets/ranges via the OR operator. A typical rule for a number of disjoint ranges looks like:
fragment LETTER_WHEN_UNQUOTED:
'0'..'9'
| 'A'..'Z'
| '$'
| '_'
| '\u0080'..'\uffff'
;

One approach is to 'do the math' on the sets of characters, so that we can define lexical rules which only ever combine characters:
lexer grammar RFC2616;
TOKEN: (DIGIT | UPALPHA | LOALPHA | NON_SEPARATORS)+;
/*
* split up ASCII 0-127 into 'atoms' of
* relevance per '2.2 Basic Rules'. Regions
* not requiring to be referenced are not
* given a name.
*/
// [\u0000-\u0008]; /* (control chars) */
fragment HT: '\u0009'; /* (tab) */
fragment LF: '\u000a'; /* (LF) */
// [\u000b-\u000c]; /* (control chars) */
fragment CR: '\u000d'; /* (CR) */
// [\u000e-\u001f]; /* (control chars) */
fragment SP: '\u0020'; /* (space) */
// [\u0021-\u002f]; /* !"#$%&'()*+,-./ */
fragment DIGIT: [\u0030-\u0039]; /* 0123456789 */
// [\u003a-\u0040]; /* :;<=>?@ */
fragment UPALPHA: [\u0041-\u005a]; /* ABCDEFGHIJKLMNOPQRSTUVWXYZ */
// [\u005b-\u0060]; /* [\]^_` */
fragment LOALPHA: [\u0061-\u007a]; /* abcdefghijklmnopqrstuvwxyz */
// [\u007b-\u007e]; /* {|}~ */
// '\u007f'; /* (del) */
/*
* Considering 'all relevant gaps' and the characters we
* cannot use per RFC 2616 Section 2.2 Basic Rules definition
* of 'separators', what does that leave us with?
* (manually determined)
*/
fragment SEPARATORS: [()<>@,;:\\"/\[\]?={}];
fragment NON_SEPARATORS: [!#$%&'*+\-.^_`|~];
I don't find this approach especially satisfying. Another rule in RFC 2616 wants to be defined like:
TEXT: <any OCTET except CTLs, but including LWS>
qdtext = <any TEXT except <">>
This would force me to further refactor up my expedient 'SEPARATORS' token, above, like:
fragment QUOT: '"';
fragment SEPARATORS_OTHER_THAN_QUOT: [()<>@,;:\\/\[\]?={}];
fragment SEPARATORS: SEPARATORS_OTHER_THAN_QUOT | QUOT;
fragment LWS: SP | HT;
TEXT: DIGIT | UPALPHA | LOALPHA | LWS | SEPARATORS | NON_SEPARATORS;
QDTEXT: DIGIT | UPALPHA | LOALPHA | LWS | SEPARATORS_OTHER_THAN_QUOT | NON_SEPARATORS;
Perhaps this is part of the work of writing a lexer, and can't be avoided, but it feels more like solving the problem the wrong way!
(NB: I won't be marking this answer as 'correct'.)

Spurred on by the answer from @mike-lischke (because LETTER_WHEN_UNQUOTED still felt really wrong), I hunted for the surely common treatment of quoted string literals in other grammars. In Terence Parr's own Java 1.6 ANTLR3 grammar (er, not properly served as text/plain) (via ANTLR3 Grammar List), he reaches for a 'match any character other than' tilde operator ~ in a lexer rule:
STRINGLITERAL
: '"'
( EscapeSequence
| ~( '\\' | '"' | '\r' | '\n' )
)*
'"'
;
// Copyright (c) 2007-2008 Terence Parr and possibly Yang Jiang.
NOTE: the above code is licenced under a BSD licence, but I am not re-distributing this fragment under the BSD license (since this post itself is under CC-BY-SA). Instead, I am using it within the terms of 'fair use' as I understand them.
So the ~ gives me an option to express: 'all those characters in Unicode, except those in Set B'. "Annoying I don't get to choose the set which is excluded from", I thought. But then I realised
TOOHIGH: [\u007f-\uffff];
TOKEN: (~( TOOHIGH | SP | HT | CTRL | SEPARATORS ))+
... should be fine. Although, in practice, ANTLR4 doesn't 'like' lexer sub-rules appearing in 'sets', and only handles sets of literals, so that ultimately becomes:
TOKEN:
/* this is given in '2.2 Basic Rules' as:
*
* token = 1*<any CHAR except CTLs or separators>
*
* which I am reducing down to:
* any character in ASCII 0-127 but _excluding_
* CTRL (0-31,127)
* SEPARATORS
* space (32)
* and tab (9) (which is a CTRL character anyhow)
*/
( ~( [\u0000-\u001f] | '\u007f' /*CTRL,HT*/ | [()<>@,;:\\"/\[\]?={}] /*SEPARATORS*/ | '\u0020' /*SP*/ | [\u0080-\uffff] /*NON_ASCII*/))+
;
The trick was to express the set I do want (Unicode 0-127) in terms of excluding the set I don't want (Unicode 128+).
This is much more succinct than my other answer. If it actually works, I'll mark it as correct.
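The same 'set A minus set B' reasoning can be cross-checked outside the grammar. The following is a minimal C++ sketch, assuming the separator list from RFC 2616 section 2.2; isTokenChar and isToken are hypothetical names chosen for illustration, not part of any library.

```cpp
#include <string>

// Hypothetical helper: true if c may appear in an RFC 2616 'token'.
// token = 1*<any CHAR except CTLs or separators>
bool isTokenChar(unsigned char c) {
    // Separators as listed in RFC 2616 section 2.2, including SP and HT.
    static const std::string separators = "()<>@,;:\\\"/[]?={} \t";
    if (c > 127) {
        return false;  // not a CHAR (US-ASCII 0-127 only)
    }
    if (c < 32 || c == 127) {
        return false;  // CTL
    }
    return separators.find(static_cast<char>(c)) == std::string::npos;
}

// Hypothetical helper: true if s is a non-empty run of token characters.
bool isToken(const std::string& s) {
    if (s.empty()) {
        return false;  // 1* means at least one character
    }
    for (unsigned char c : s) {
        if (!isTokenChar(c)) {
            return false;
        }
    }
    return true;
}
```

A helper like this is handy for spot-checking which characters the lexer rule is supposed to accept before trusting the character set written into the grammar.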

how to define a var in C++

This code compiles perfectly:
if ( args.Length() > 0 ) {
if ( args[0]->IsString() ) {
String::Utf8Value szQMN( args[0]->ToString() ) ;
printf( "(cc)>>>> qmn is [%s].\n", (const char*)(* szQMN) ) ;
} ;
} ;
But this one does not :
if ( args.Length() > 0 ) {
if ( args[0]->IsString() ) {
String::Utf8Value szQMN( args[0]->ToString() ) ; // <<<< (A)
} ;
} ;
printf( "(cc)>>>> qmn is [%s].\n", (const char*)(* szQMN) ) ; // <<<< (B)
The error says: "error C2065: 'szQMN' : undeclared identifier" on line (B).
This means to me that the sentence marked (A) is a definition at the same time as an assignment, right?
And the compiler decides it is "conditionally" defined as it is within two "IF's"?
My question is: how to move the declaration out of the two "IF's"?
That way I can also give it a default value in case an IF fails.
If I write this line out of the two "IF's"
String::Utf8Value szQMN ("") ;
... then I get the error :
cannot convert argument 1 from 'const char [1]' to 'v8::Handle<v8::Value>'
Any ideas?

This means to me that the sentence marked (A) is a definition at the same time as an assignment, right?
Technically it is a constructor call that creates a variable and initializes it.
Also note that automatic variables exist only until the end of the scope (usually a block inside {} brackets). That is why your second code example does not compile.
if (condition)
{
int x = 5;
}
x = 6; // error, x does not exist anymore
My question is : how to move the declaration out of the two "IF's"?
String::Utf8Value szQMN ("");
This is a constructor call of the String::Utf8Value class. From the error message, it takes a parameter of type v8::Handle<v8::Value>. Without knowing what this is, I cannot tell you how to call it. You wanted to pass "", which is of type const char[1] (and decays to const char*), and the compiler is telling you that it does not take that parameter.
Edit:
From the link that DeepBlackDwarf provided in the comment, this is how you create a Utf8Value from a string:
std::string something("hello world");
Handle<Value> something_else = String::New( something.c_str() );
So in your case you would do:
String::Utf8Value szQMN (String::New(""));

A definition inside the if blocks is only visible inside those blocks.
So at line (B), the variable has already gone out of scope.
If you need the variable both inside and outside the if blocks, declare it in the enclosing scope, before the first if:
String::Utf8Value szQMN( String::New("") ) ;

Antlr4 left-recursive rule appears to produce right-associative parse

The following grammar illustrates the issue:
// test Antlr4 left recursion associativity
grammar LRA;
@parser::members {
public static void main(String[] ignored) throws Exception{
final LRALexer lexer = new LRALexer(new ANTLRInputStream(System.in));
final LRAParser parser = new LRAParser(new CommonTokenStream(lexer));
parser.setTrace(true);
parser.file();
}
}
ID: [A-Za-z_] ([A-Za-z_]|[0-9])*;
CMA: ',';
SMC: ';';
UNK: . -> skip;
file: punctuated EOF;
punctuated
: punctuated cma punctuated
| punctuated smc punctuated
| expression
;
cma: CMA;
smc: SMC;
expression: id;
id: ID;
Given input "a,b,c" I get the listener event trace output
( 'a' ) ( ',' ( 'b' ) ( ',' ( 'c' ) ) )
where ( represents enter punctuated, ) represents exit punctuated, and all other rules are omitted for brevity and clarity.
By inspection, this order of listener events represents a right-associative parse.
Common practice, and The Definitive Antlr 4 Reference, lead me to expect a left-associative parse, corresponding to the following listener event trace
( 'a' ) ( ',' ( 'b' ) ) ( ',' ( 'c' ) )
Is there something wrong with my grammar, my expectations, my interpretation of the listener events, or something else?

I would consider the workaround described above to be an adequate answer. The generated parser needs to pass a precedence parameter to a recursive call, and since the precedence is associated with a token, the token has to be directly available in the recursive rule so Antlr can find its precedence.
The working grammar looks like this:
// test Antlr4 left recursion associativity
grammar LRA;
@parser::members {
public static void main(String[] ignored) throws Exception{
final LRALexer lexer = new LRALexer(new ANTLRInputStream(System.in));
final LRAParser parser = new LRAParser(new CommonTokenStream(lexer));
parser.setTrace(true);
parser.file();
}
}
ID: [A-Za-z_] ([A-Za-z_]|[0-9])*;
CMA: ',';
SMC: ';';
UNK: . -> skip;
file: punctuated EOF;
punctuated
: punctuated CMA punctuated
| punctuated SMC punctuated
| expression
;
expression: id;
id: ID;
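To make the difference between the two traces in the question concrete, here is a small C++ sketch that groups the operands of "a,b,c" both ways. groupLeft and groupRight are hypothetical helpers written for illustration; they only print the shape of the tree and have nothing to do with the code ANTLR generates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: nest operands the way a left-associative parse would.
std::string groupLeft(const std::vector<std::string>& operands, char op) {
    if (operands.empty()) {
        return "";
    }
    std::string tree = operands[0];
    for (std::size_t i = 1; i < operands.size(); ++i) {
        tree = "(" + tree + op + operands[i] + ")";  // earlier result goes left
    }
    return tree;
}

// Hypothetical helper: nest operands the way a right-associative parse would.
std::string groupRight(const std::vector<std::string>& operands, char op) {
    if (operands.empty()) {
        return "";
    }
    std::string tree = operands.back();
    for (std::size_t i = operands.size() - 1; i-- > 0; ) {
        tree = "(" + operands[i] + op + tree + ")";  // later result goes right
    }
    return tree;
}
```

For the operands a, b, c and the operator ',', groupLeft produces ((a,b),c), the shape the question expected, while groupRight produces (a,(b,c)), the shape the first grammar produced.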

can Flex return a string match to bison

I'm writing a Bison/Flex program to convert LaTeX into MathML. At the moment, dealing with functions (e.g. \sqrt, \frac, etc.) works like this, with a token for every function:
\\frac {return FUNC_FRAC;}
and passes the token FUNC_FRAC back to bison, which plays its part in the description of this subtree:
function: FUNC_FRAC LBRACE atom RBRACE LBRACE atom RBRACE {$$ = "<mfrac>" + $3 + $6 + "</mfrac>";}
But this means that I need to define and juggle a potentially unlimited number of tokens. What I would like to do is something like this, which doesn't work as written. In flex:
\\[A-Za-z]+[0-9]* {return the-matched-string;}
and in bison:
function: "\frac" LBRACE atom RBRACE LBRACE atom RBRACE {$$ = "<mfrac>" + $3 + $6 + "</mfrac>";}

Flex should return the abstract token type to Bison.
You can find the lexeme (the string matched) in Flex in the variable:
yytext
And so you can do:
{id} { yylval.strval=strdup(yytext); return(TOK_ID); }
And so forth. As far as I recall, yylval corresponds to the Bison %union (or whatever you use to carry a value alongside the token type), so I might have in Bison:
%union {
char *strval;
int intval;
node node_val;
}
Returning anything other than a token type will break the automaton in Bison. Your Bison actions can then access the value, for example:
id_production: TOK_ID
{
$<node_val>$ = create_id_node($<strval>1);
xfree($<strval>1); // func makes a copy, so we are cool.
}
And so on. Any more explanation than this and I will probably start repeating documentation. Things to consult:
Dragon Book (as always)
Modern Compiler Implementation in C (great for getting started)
Bison docs
Flex docs
Good Luck
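Applied to the LaTeX case from the question, the idea is one token type for all commands, with the command name carried as the semantic value. The C++ sketch below only models that flow and is not flex or bison output: Token, lexCommand and mathmlTag are hypothetical names, the text field plays the role of yylval.strval, and the msqrt entry is an assumed second mapping next to the mfrac one from the question.

```cpp
#include <map>
#include <string>

// Hypothetical model: one token kind (FUNC) covers every backslash command.
enum TokenKind { FUNC, OTHER };

struct Token {
    TokenKind kind;    // what the lexer would return to the parser
    std::string text;  // plays the role of yylval.strval
};

// What the flex action would do: keep the lexeme and return a fixed kind.
Token lexCommand(const std::string& lexeme) {
    if (lexeme.size() > 1 && lexeme[0] == '\\') {
        return Token{FUNC, lexeme};
    }
    return Token{OTHER, lexeme};
}

// What the bison action would do: choose the MathML tag from the value.
// Returns an empty string for non-commands and unknown commands.
std::string mathmlTag(const Token& tok) {
    static const std::map<std::string, std::string> tags = {
        {"\\frac", "mfrac"},
        {"\\sqrt", "msqrt"},
    };
    if (tok.kind != FUNC) {
        return "";
    }
    std::map<std::string, std::string>::const_iterator it = tags.find(tok.text);
    return it == tags.end() ? "" : it->second;
}
```

One limitation of a single FUNC token is that the grammar can no longer distinguish commands by arity (\frac takes two braced arguments, \sqrt one), so that check would have to move into the action or into a small number of token types grouped by argument count.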
