How to wrap composite PsiElement on top of Lexer recognized elements in Grammar-Kit - dsl

I am new to writing IntelliJ plugins. I have started writing an IntelliJ plugin for one of our custom languages. I am following the tutorial on the official IntelliJ site, and I also downloaded the Grammar-Kit repo from GitHub to understand the code base.
In my plugin I am using JFlex for the lexer and a BNF grammar for the parser. I am having difficulty implementing a lexer that sends the proper tokens to the parser.
What I see in the Grammar-Kit repo is that the lexer (for BNF) is pretty simple: it only recognizes strings, numbers, ids, white space, comments and special characters ('(', '*' etc.). It does not recognize BNF keywords like 'private', 'external', 'meta' etc.
Now, when I look at the PSI tree of an example BNF file, say for the line
private myRule ::= '(' myExpression ')' ';'
it looks like this:
BnfFile:Dummy.bnf(0,43)
  BNF_RULE:myRule(0,43)
    BNF_MODIFIER(0,7)
      PsiElement(id)('private')(0,7)
    PsiWhiteSpace(' ')(7,8)
    PsiElement(id)('myRule')(8,14)
    PsiWhiteSpace(' ')(14,15)
    PsiElement(::=)('::=')(15,18)
    PsiWhiteSpace(' ')(18,19)
    BNF_SEQUENCE: '(' myExpression ')' ';'(19,43)
      BNF_STRING_LITERAL_EXPRESSION: '('(19,22)
        PsiElement(string)(''('')(19,22)
      PsiWhiteSpace(' ')(22,23)
      BNF_REFERENCE_OR_TOKEN: myExpression(23,35)
        PsiElement(id)('myExpression')(23,35)
      PsiWhiteSpace(' ')(35,36)
      BNF_STRING_LITERAL_EXPRESSION: ')'(36,39)
        PsiElement(string)('')'')(36,39)
      PsiWhiteSpace(' ')(39,40)
      BNF_STRING_LITERAL_EXPRESSION: ';'(40,43)
        PsiElement(string)('';'')(40,43)
What I see here is that 'private' is recognized as a PsiElement(id) by the lexer, and afterwards some code wraps it in a BnfModifier object, which is declared in the BNF file. The same goes for BNF_SEQUENCE, BNF_REFERENCE_OR_TOKEN, BNF_STRING_LITERAL_EXPRESSION etc.: none of them is recognized by the lexer, yet some code wraps them in objects declared in the BNF file.
I want to understand how this wrapping is done on top of the tokens recognized by the lexer.
It will help me recognize keywords in our DSL, highlight them differently, auto-complete them etc.
Thanks,
Subhojit
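As a rough illustration of the wrapping asked about above: a parser generated from a grammar typically walks the flat token stream and, whenever a rule matches, marks the matched range as a composite node. The Python sketch below is not Grammar-Kit code; the `Token` and `Composite` classes, the `KEYWORDS` set and the `lex` and `parse_rule` helpers are all hypothetical. It assumes the keyword is matched by its text rather than by a dedicated token type, which would be consistent with the tree above, where 'private' is still a plain id token inside BNF_MODIFIER.

```python
from dataclasses import dataclass, field

# Hypothetical keyword set; the toy lexer below knows nothing about it.
KEYWORDS = {"private", "external", "meta"}


@dataclass
class Token:
    type: str   # token type assigned by the lexer, e.g. "id"
    text: str


@dataclass
class Composite:
    type: str                                   # node type assigned by the parser
    children: list = field(default_factory=list)


def lex(source):
    """Toy lexer: every word is an 'id', '::=' gets its own token type."""
    return [Token("::=" if w == "::=" else "id", w) for w in source.split()]


def parse_rule(tokens):
    """Wrap a flat token list into RULE(MODIFIER(id)?, id, '::=', ...)."""
    rule = Composite("RULE")
    pos = 0
    # The parser, not the lexer, decides that this id is a modifier,
    # by comparing the token text against the keyword set.
    if tokens[pos].type == "id" and tokens[pos].text in KEYWORDS:
        rule.children.append(Composite("MODIFIER", [tokens[pos]]))
        pos += 1
    rule.children.extend(tokens[pos:])
    return rule


tree = parse_rule(lex("private myRule ::= myExpression"))
```

In this sketch the leaf keeps its lexer type (id) and only gains a MODIFIER parent, mirroring the BNF_MODIFIER node in the tree above.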

Related

Simple Xtext example generates grammar that Antlr4 doesn't like - who's to blame?

While using XText, I have come across a problem and I am not sure if Antlr4 or XText is at fault or if I'm just missing something. I understand that Antlr4 is not supported by Xtext, but it seems like this particular case should not cause a problem.
Here is a simple Xtext file:
grammar com.github.jsculley.antlr4.Test with org.eclipse.xtext.common.Terminals
generate test "http://www.github.com/jsculley/antlr4/test"
aRule:
name=STRING
;
STRING is defined in the XText rule from org.eclipse.xtext.common.Terminals:
terminal STRING :
'"' ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|'"') )* '"' |
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'"
;
The generated Antlr grammar has the following rule:
RULE_STRING : ('"' ('\\' .|~(('\\'|'"')))* '"'|'\'' ('\\' .|~(('\\'|'\'')))* '\'');
The Antlr 3.5.2 tool has no problem with this rule, but the Antlr4 tool spits out the following errors:
error(50): InternalTest.g:102:29: syntax error: '(' came as a complete surprise to me while looking for lexer rule element
error(50): InternalTest.g:102:62: syntax error: '(' came as a complete surprise to me while looking for lexer rule element
error(50): InternalTest.g:102:74: syntax error: mismatched input ')' expecting SEMI while matching a lexer rule
error(50): InternalTest.g:106:25: syntax error: '(' came as a complete surprise to me while looking for lexer rule element
error(50): InternalTest.g:106:36: syntax error: mismatched input ')' expecting SEMI while matching a lexer rule
Antlr4 doesn't like the extra (and seemingly unnecessary) sets of parentheses around the group after each '~' operator. So the question is: is Xtext generating a bad grammar, or is Antlr4 not handling a valid construct?
Xtext generates an ANTLR 3.x grammar, and ANTLR 3 and ANTLR 4 grammars are incompatible.
It seems that ANTLR 4 does not handle parentheses correctly; see the issue "Parser issues mutual left recursion error when the left-recursive part of a rule is in parenthesis".
So just remove the redundant parentheses, and ANTLR 4 should generate a parser fully compatible with the ANTLR 3 one. I ported a PL/SQL grammar from ANTLR 3 to ANTLR 4. Moreover, ANTLR 4 has a more powerful parsing algorithm than the previous version.

antlr4 token recognition error at: '$'

I am trying to build a grammar for the PowerScript language. I split the language into several parts and everything seems to be working except for the simple headers. It seems that the $ symbol can't be recognized. Could anyone help me a little? (I just copied the small example I'm trying.)
grammar PowerScript;
compilationUnit : Header EOF;
fragment
Header : ID '.' ID;
ID : [a-zA-Z0-9$_]+ ;
The test file contains just:
$PBExportHeader$n_logversion.sru
Thanks
The compilationUnit rule is a parser rule. Parser rules cannot refer to lexer fragments. Just remove the fragment qualifier to make Header a proper lexer rule.
Update
Antlr4 is fully Unicode capable. Just include the characters in standard Unicode encoding form:
ID : ( [a-zA-Z0-9$_] | '\uD83D\uDCB2' )+ ; // Unicode heavy Dollar sign
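The two escapes in that rule are the UTF-16 surrogate pair for a single code point, U+1F4B2 (HEAVY DOLLAR SIGN). A small Python check, independent of ANTLR, shows the correspondence:

```python
# U+1F4B2 HEAVY DOLLAR SIGN lies outside the Basic Multilingual Plane,
# so UTF-16 encodes it as two 16-bit code units (a surrogate pair).
heavy_dollar = "\U0001F4B2"

utf16_hex = heavy_dollar.encode("utf-16-be").hex()

# Split the hex string into 16-bit code units.
code_units = [utf16_hex[i:i + 4] for i in range(0, len(utf16_hex), 4)]
print(code_units)  # ['d83d', 'dcb2']
```

Note that the plain '$' in the test file is U+0024, which the set [a-zA-Z0-9$_] already covers.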

Solving ambiguous input: mismatched input

I have this grammar:
grammar MkSh;
script
: (statement
| targetRule
)*
;
statement
: assignment
;
assignment
: ID '=' STRING
;
targetRule
: TARGET ':' TARGET*
;
ID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS
: ( ' '
| '\t'
| '\r'
| '\n'
) -> channel(HIDDEN)
;
STRING
: '\"' CHR* '\"'
;
fragment
CHR
: ('a'..'z'|'A'..'Z'|' ')
;
TARGET
: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'/'|'.')+
;
and this input file:
hello="world"
target: CLASSES
When running my parser I'm getting this error:
line 3:6 mismatched input ':' expecting '='
line 3:15 mismatched input ';' expecting '='
This is because the parser is taking "target" as an ID instead of a TARGET. I want the parser to choose the rule based on the separator character (':' vs '=').
How can I get that to happen?
(This is my first Antlr project so I'm open to anything.)
First, you need to know that the word target is matched as an ID token and not as a TARGET token: since you have written the ID rule before TARGET, it will always be recognized as ID by the lexer. Notice that the word target fully matches both the ID and the TARGET lexer rules (I'm going to suppose that you are writing a language), meaning that target, which is a keyword, can also be used as an id. In the book "The Definitive ANTLR Reference" there is a section, "Treating Keywords As Identifiers", that deals with exactly these kinds of issues. I suggest you take a look at that. Or, if you prefer the quick answer, the solution is to use lexer modes. It would also be better to split the grammar into separate parser and lexer grammars.
As @cantSleepNow alludes to, you've defined a token (TARGET) that is a lexical superset of another token (ID), and then told the lexer to only tokenize a string as TARGET if it cannot be tokenized as ID. All made more obscure by the fact that ANTLR lexing rules look like ANTLR parsing rules, though they are really quite different beasts.
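The tie-break described in the two answers above can be reproduced outside ANTLR. The following Python sketch is a simplified model, not ANTLR's implementation: it assumes the lexer picks the longest match at the current position and, on a tie, the rule listed first. The patterns are transcribed from the ID and TARGET rules in the question; the `next_token` helper is hypothetical.

```python
import re

# Rules in the order they appear in the grammar: ID before TARGET.
RULES = [
    ("ID", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("TARGET", re.compile(r"[A-Za-z0-9_\-/.]+")),
]


def next_token(text, pos=0, rules=RULES):
    """Return (rule_name, lexeme) for the longest match at pos.

    On equal length the earlier rule wins, because a later rule only
    replaces the current best when its match is strictly longer.
    """
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


print(next_token("target: CLASSES"))   # ('ID', 'target')
print(next_token("src/main.c: all"))   # ('TARGET', 'src/main.c')
```

Both rules match all six characters of target, so the earlier rule (ID) wins; TARGET only wins when it can match something strictly longer than ID can.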
(Warning: writing off the top of my head without testing :-)
Your real project might be more complex, but in the possibly simplified example you posted, you could defer distinguishing the two to the parsing phase, instead of distinguishing them in the lexer:
id : TARGET
{ complain if not legal identifier (e.g., contains slashes, etc.) }
;
assignment
: id '=' STRING
;
Seems like that would solve the lexing issue, and allow you to give a more intelligent error message than "syntax error" when a user gets the syntax for ID wrong. The grammar remains ambiguous, but maybe ANTLR roulette will happen to make the choice you prefer in the ambiguous case. Of course, unambiguous grammars tend to make for languages that humans find more readable, and now you can see why the classic makefile syntax requires a newline after an assignment or target rule.
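The "complain if not legal identifier" action in the answer above is left as pseudocode. One way the check itself could look, sketched in Python rather than as an ANTLR action: lex everything as the broader TARGET shape, then validate the text wherever the parser requires an identifier. The function name `check_identifier` and the error wording are hypothetical; the pattern is the ID rule from the question.

```python
import re

# Same shape as the ID lexer rule in the question.
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _is_legal_char(ch):
    """True for ASCII letters, ASCII digits and underscore."""
    return ch.isascii() and (ch.isalnum() or ch == "_")


def check_identifier(lexeme):
    """Return None if lexeme is a legal identifier, else an error message."""
    if ID_PATTERN.fullmatch(lexeme):
        return None
    bad = sorted({ch for ch in lexeme if not _is_legal_char(ch)})
    if bad:
        return (f"identifier '{lexeme}' contains illegal character(s): "
                + " ".join(bad))
    # Only legal characters but no match: the first character is a digit.
    return f"identifier '{lexeme}' must not start with a digit"
```

For example, `check_identifier("src/main.c")` reports the offending characters `.` and `/`, which is more specific than a generic "mismatched input" message.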

why whitespace not allowed between two keywords/constants written in xtext file for DSL

White space between IF and ( is not allowed. For example, IF( works, but IF ( causes a parser error.
The Rule is:
Condition returns ResultExpression:
'IF' '(' condition=BooleanExpression ')' '{' then=ResultExpressionRhs '}'
(=> 'ELSE' '{' else=ResultExpression '}')?;
It's hard to tell what's going on from just this minimal grammar snippet.
Please check your xtext file for the following things:
A proper hidden clause that includes the WS
A keyword 'IF(' that may have been introduced by accident
Warnings when executing the workflow.
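To make the first check concrete: a parser rule such as 'IF' '(' only tolerates white space between the two keywords if white-space tokens are hidden from it. The Python sketch below models that idea with hypothetical names (`tokenize`, `starts_with`); it is an illustration of the principle, not of how Xtext is implemented.

```python
import re

TOKEN = re.compile(r"(?P<WS>\s+)|(?P<KW>IF|ELSE)|(?P<SYM>[(){}])")


def tokenize(text):
    """Split text into (type, lexeme) pairs; raises on unknown input."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def starts_with(tokens, expected, hidden=()):
    """Check whether the visible tokens begin with the expected lexemes."""
    visible = [lexeme for kind, lexeme in tokens if kind not in hidden]
    return visible[:len(expected)] == expected


tokens = tokenize("IF (")
print(starts_with(tokens, ["IF", "("]))                  # False: WS is in the way
print(starts_with(tokens, ["IF", "("], hidden=("WS",)))  # True
```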

ANTLR4. How to create properly unicode range lexer rules?

In my grammar I'd like variables to be composed of Latin, Cyrillic and Mandarin characters.
For this purpose I defined a lexer rule like this:
CYRILLIC_RANGE: [\u0400–\u04FF];
This is what I see in my ANTLRWorks 2.1 output when I try to run an expression against my query:
line 1:4 token recognition error at: 'н'
What am I missing?
I'm not sure what you are missing, as this seems to be working for me here. Have you tried the other range syntax? Both of these should be equivalent.
CYRILLIC_RANGE : [\u0400-\u04FF] ;
CYRILLIC_RANGE : '\u0400'..'\u04FF' ;
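One thing worth checking, assuming the rule in the question is reproduced character for character: the separator between \u0400 and \u04FF there appears to be an en dash (U+2013), not the ASCII hyphen used in the answer. If that is what the grammar file contains, the brackets would plausibly describe a set of individual characters instead of a range. The Python regex sketch below only illustrates that difference; it is an analogy, not a statement about how ANTLR parses the set.

```python
import re

EN_DASH = "\u2013"

# With an en dash the class lists three separate characters:
# U+0400, U+2013 and U+04FF.
as_typed = re.compile("[\u0400" + EN_DASH + "\u04FF]")

# With an ASCII hyphen the class is the range U+0400..U+04FF.
with_hyphen = re.compile("[\u0400-\u04FF]")

letter = "\u043D"  # CYRILLIC SMALL LETTER EN, the character from the error

print(bool(as_typed.fullmatch(letter)))     # False
print(bool(with_hyphen.fullmatch(letter)))  # True
```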
