I'm new to ANTLR; I'm trying to make a simple grammar but I can't succeed.
I would like to parse this kind of file:
BEGIN HEADER
CharacterSet "CP1252"
END HEADER
BEGIN DSJOB
test "val"
END DSJOB
BEGIN DSJOB
test "val2"
END DS
JOB
I'm using this kind of grammar :
grammar Hello;
dsxFile : headerDeclaration? jobDeclaration* EOF;
headerDeclaration : 'BEGIN HEADER' param* 'END HEADER';
jobDeclaration : 'BEGIN DSJOB' subJobDeclaration* param* 'END DSJOB';
subJobDeclaration : 'BEGIN DSSUBJOB' param* 'END DSSUBJOB';
headParam
: ( 'CharacterSet'
| 'name'
) StringLiteral
;
// ANNOTATIONS
param : PNAME PVALUE;
PNAME :StringCharacters;
PVALUE :StringCharacters;
// STATEMENTS / BLOCKS
//block
// : '{' blockStatement* '}';
// LEXER
// Keywords
ABSTRACT : 'abstract';
ASSERT : 'assert';
BOOLEAN : 'boolean';
BREAK : 'break';
BYTE : 'byte';
CASE : 'case';
CATCH : 'catch';
CHAR : 'char';
CLASS : 'class';
CONST : 'const';
CONTINUE : 'continue';
DEFAULT : 'default';
DO : 'do';
DOUBLE : 'double';
ELSE : 'else';
ENUM : 'enum';
EXTENDS : 'extends';
FINAL : 'final';
FINALLY : 'finally';
FLOAT : 'float';
FOR : 'for';
IF : 'if';
GOTO : 'goto';
IMPLEMENTS : 'implements';
IMPORT : 'import';
INSTANCEOF : 'instanceof';
INT : 'int';
INTERFACE : 'interface';
LONG : 'long';
NATIVE : 'native';
NEW : 'new';
PACKAGE : 'package';
PRIVATE : 'private';
PROTECTED : 'protected';
PUBLIC : 'public';
RETURN : 'return';
SHORT : 'short';
STATIC : 'static';
STRICTFP : 'strictfp';
SUPER : 'super';
SWITCH : 'switch';
SYNCHRONIZED : 'synchronized';
THIS : 'this';
THROW : 'throw';
THROWS : 'throws';
TRANSIENT : 'transient';
TRY : 'try';
VOID : 'void';
VOLATILE : 'volatile';
WHILE : 'while';
// Boolean Literals
BooleanLiteral : 'true' | 'false';
// Character Literals
fragment
SingleCharacter
: ~['\\] ;
// String Literals
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
;
// Separators
LPAREN : '(';
RPAREN : ')';
LBRACE : '{';
RBRACE : '}';
LBRACK : '[';
RBRACK : ']';
SEMI : ';';
COMMA : ',';
DOT : '.';
// Operators
ASSIGN : '=';
GT : '>';
LT : '<';
BANG : '!';
TILDE : '~';
QUESTION : '?';
COLON : ':';
EQUAL : '==';
LE : '<=';
GE : '>=';
NOTEQUAL : '!=';
AND : '&&';
OR : '||';
INC : '++';
DEC : '--';
ADD : '+';
SUB : '-';
MUL : '*';
DIV : '/';
BITAND : '&';
BITOR : '|';
CARET : '^';
MOD : '%';
ADD_ASSIGN : '+=';
SUB_ASSIGN : '-=';
MUL_ASSIGN : '*=';
DIV_ASSIGN : '/=';
AND_ASSIGN : '&=';
OR_ASSIGN : '|=';
XOR_ASSIGN : '^=';
MOD_ASSIGN : '%=';
LSHIFT_ASSIGN : '<<=';
RSHIFT_ASSIGN : '>>=';
URSHIFT_ASSIGN : '>>>=';
//
// Additional symbols not defined in the lexical specification
//
AT : '@';
ELLIPSIS : '...';
//
// Whitespace and comments
//
WS : [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
But i'm still getting this issue :
line 1:0 mismatched input 'BEGIN HEADER\r\n\tCharacterSet ' expecting
{, 'BEGIN HEADER', 'BEGIN DSJOB'} (dsxFile BEGIN
HEADER\r\n\tCharacterSet "CP1252" \r\nEND HEADER\r\nBEGIN
DSJOB\r\n\ttest "val" \r\nEND DSJOB)
Can someone explain to me what this means? It seems it can't skip \r and \t.
Thanks for your help guys !
The problem is that your input is not tokenised as you expect. This is because the lexer matches as much input as possible. So if you look at the PNAME rule:
PNAME : StringCharacters;
fragment StringCharacter
: ~["\\]
;
then you will notice that the input "BEGIN HEADER\n CharacterSet " matches that rule.
This is what the error message:
mismatched input 'BEGIN HEADER\r\n\tCharacterSet ' expecting {, 'BEGIN HEADER', 'BEGIN DSJOB'}
is telling: the token 'BEGIN HEADER\r\n\tCharacterSet ' is found, while the parser expects one of the tokens 'BEGIN HEADER' or 'BEGIN DSJOB'.
You will probably need to add spaces, tabs and line breaks to that class: ~["\\ \t\r\n] (but that is for you to decide)
Also, the lexer operates independently from the parser (the parser has no influence on what tokens are produced). The lexer simply tries to match as much characters as possible, and whenever there are two (or more) rules that match the same characters, the rule defined first "wins". Given this logic, then from the following rules:
PNAME : StringCharacters;
PVALUE : StringCharacters;
it is apparent that the rule PVALUE will never be matched (only PNAME, since that one is defined first).
Here's how you could parse your example input:
grammar Hello;
// Simplified grammar: the keywords are separate tokens, so the skipped WS rule
// can consume the whitespace and line breaks between them.
dsxFile : headerDeclaration? jobDeclaration* EOF;
headerDeclaration : BEGIN HEADER param* END HEADER;
jobDeclaration : BEGIN DSJOB subJobDeclaration* param* END DSJOB;
subJobDeclaration : BEGIN DSSUBJOB param* END DSSUBJOB;
param : PNAME pvalue;
pvalue : STRING /* other alternatives here? */;
STRING : '"' ~["\r\n]* '"';
BEGIN : 'BEGIN';
END : 'END';
HEADER : 'HEADER';
DSJOB : 'DSJOB';
DSSUBJOB : 'DSSUBJOB';
WS : [ \t\r\n\u000C]+ -> skip;
COMMENT : '/*' .*? '*/' -> skip;
LINE_COMMENT : '//' ~[\r\n]* -> skip;
// Be sure to put this rule _after_ the rules BEGIN, END, HEADER, ...
// otherwise this rule will match those keywords instead
PNAME : ~["\\ \t\r\n]+;
Of course you'll need to change it to suit your needs exactly, but it's a start.
Related
My goal is to generate parser that could handle following code with named function parameters and nested function calls
fnCallY(namedArgStr = "xxx", namedArgZ=fnCallZ(namedArg="www"))
G4 lang file:
val : type_string
| function_call
;
function_call : function_name=ID arguments='('argument? (',' argument)* ')';
argument : name=ID '=' value=val ;
ID : [a-zA-Z_][a-zA-Z0-9_]*;
type_string : LITERAL;
fragment ESCAPED_QUOTE : '\\"';
LITERAL : '"' ( ESCAPED_QUOTE | ~('\n'|'\r') )*? '"'
| '\'' ( ESCAPED_QUOTE | ~('\n'|'\r') )*? '\'';
@Override
public void exitFunction_call(Test.Function_callContext ctx) {
List<Test.ArgumentContext> argument = ctx.argument();
for (Test.ArgumentContext arg : argument) {
Token name = arg.name;
Test.ValContext value = arg.value;
if (value.type_literal() == null || value.function_call() == null) {
throw new RuntimeException("Could not parse argument value");
}
}
}
arg.name holds correct data, but I cannot make the parser parse the part after =.
The parser is recognizing the argument values.
(It's really valuable to learn the grun command line utility as it can test the grammar and tree structure without involving any of your own code)
This condition would appear to be your problem:
if (value.type_literal() == null || value.function_call() == null)
One or the other will always be null, so this will fail.
if (value.type_literal() == null && value.function_call() == null)
is probably what you want.
Hello I am having two files in my xtext editor, the first one containing all definitions and the second one containing the executed recipe. The Grammar looks like this:
ServiceAutomationProgram:
('package' name=QualifiedName ';')?
imports+=ServiceAutomationImport*
definitions+=Definition*;
ServiceAutomationImport:
'import' importedNamespace=QualifiedNameWithWildcard ';';
Definition:
'define' ( TypeDefinition | ServiceDefinition |
SubRecipeDefinition | RecipeDefinition) ';';
TypeDefinition:
'quantity' name=ID ;
SubRecipeDefinition:
'subrecipe' name=ID '('( subRecipeParameters+=ServiceParameterDefinition (','
subRecipeParameters+=ServiceParameterDefinition)*)? ')' '{'
recipeSteps+=RecipeStep*
'}';
RecipeDefinition:
'recipe' name=ID '{' recipeSteps+=RecipeStep* '}';
RecipeStep:
(ServiceInvocation | SubRecipeInvocation) ';';
SubRecipeInvocation:
name=ID 'subrecipe' calledSubrecipe=[SubRecipeDefinition] '('( parameters+=ServiceInvocationParameter (',' parameters+=ServiceInvocationParameter)* )?')'
;
ServiceInvocation:
name=ID 'service' service=[ServiceDefinition]
'(' (parameters+=ServiceInvocationParameter (',' parameters+=ServiceInvocationParameter)*)? ')'
;
ServiceInvocationParameter:
ServiceEngineeringQuantityParameter | SubRecipeParameter
;
ServiceEngineeringQuantityParameter:
parameterName=[ServiceParameterDefinition] value=Amount;
ServiceDefinition:
'service' name=ID ('inputs' serviceInputs+=ServiceParameterDefinition (','
serviceInputs+=ServiceParameterDefinition)*)?;
ServiceParameterDefinition:
name=ID ':' (parameterType=[TypeDefinition]);
;
SubRecipeParameter:
parameterName=[ServiceParameterDefinition]
;
QualifiedNameWithWildcard:
QualifiedName '.*'?;
QualifiedName:
ID ('.' ID)*;
Amount:
INT ;
....
definitionfile file.mydsl:
define quantity Temperature;
define service Heater inputs SetTemperature:Temperature;
define subrecipe sub_recursive() {
Heating1 service Heater(SetTemperature 10);
};
....
recipefile secondsfile.mydsl:
define recipe Main {
sub1 subrecipe sub_recursive();
};
.....
In my generator file which looks like this:
override void doGenerate(Resource resource, IFileSystemAccess2 fsa, IGeneratorContext context) {
for (e : resource.allContents. toIterable.filter (RecipeDefinition)){
e.class;//just for demonstration add breakpoint here and //traverse down the tree
}
}
I need as an example the information RecipeDefinition.recipesteps.subrecipeinvocation.calledsubrecipe.recipesteps.serviceinvocation.service.name which is not accessible (null) So some of the very deep buried information gets lost (maybe due to lazylinking?).
To make the project executable also add to the scopeprovider:
public IScope getScope(EObject context, EReference reference) {
if (context instanceof ServiceInvocationParameter
&& reference == MyDslPackage.Literals.SERVICE_INVOCATION_PARAMETER__PARAMETER_NAME) {
ServiceInvocationParameter invocationParameter = (ServiceInvocationParameter) context;
List<ServiceParameterDefinition> candidates = new ArrayList<>();
if(invocationParameter.eContainer() instanceof ServiceInvocation) {
ServiceInvocation serviceCall = (ServiceInvocation) invocationParameter.eContainer();
ServiceDefinition calledService = serviceCall.getService();
candidates.addAll(calledService.getServiceInputs());
if(serviceCall.eContainer() instanceof SubRecipeDefinition) {
SubRecipeDefinition subRecipeCall=(SubRecipeDefinition) serviceCall.eContainer();
candidates.addAll(subRecipeCall.getSubRecipeParameters());
}
return Scopes.scopeFor(candidates);
}
else if(invocationParameter.eContainer() instanceof SubRecipeInvocation) {
SubRecipeInvocation serviceCall = (SubRecipeInvocation) invocationParameter.eContainer();
SubRecipeDefinition calledSub = serviceCall.getCalledSubrecipe();
candidates.addAll(calledSub.getSubRecipeParameters());
return Scopes.scopeFor(candidates);
}
}return super.getScope(context, reference);
}
When I put everything in the same file it works, as it does the first time it is executed after launching the runtime, but afterwards (when doGenerate is triggered by saving in the editor) some information is missing. Any idea how to get to the missing information? Thanks a lot!
Wondering why the expression "setvalue(2#)" is happily accepted by the lexer (and parser) given my grammar/visitor. I am sure I am doing something wrong.
Below is a small sample that should illustrate the problem.
Any help is much appreciated.
grammar ExpressionEvaluator;
parse
: block EOF
;
block
: stat*
;
stat
: assignment
;
assignment
: SETVALUE OPAR expr CPAR
;
expr
: atom #atomExpr
;
atom
: OPAR expr CPAR #parExpr
| (INT | FLOAT) #numberAtom
| ID #idAtom
;
OPAR : '(';
CPAR : ')';
SETVALUE : 'setvalue';
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
INT
: [0-9]+
;
FLOAT
: [0-9]+ '.' [0-9]*
| '.' [0-9]+
;
STRING
: '"' (~["\r\n] | '""')* '"'
;
SPACE
: [ \t\r\n] -> skip
;
Code snippet:
public override object VisitParse(ExpressionEvaluatorParser.ParseContext context)
{
return this.Visit(context.block());
}
public override object VisitAssignment(ExpressionEvaluatorParser.AssignmentContext context)
{
// TODO - Set ID Value
return Convert.ToDouble(this.Visit(context.expr()));
}
public override object VisitIdAtom(ExpressionEvaluatorParser.IdAtomContext context)
{
string id = context.GetText();
// TODO - Lookup ID value
return id;
}
public override object VisitNumberAtom(ExpressionEvaluatorParser.NumberAtomContext context)
{
return Convert.ToDouble(context.GetText());
}
public override object VisitParExpr(ExpressionEvaluatorParser.ParExprContext context)
{
return this.Visit(context.expr());
}
The # character actually isn't matching anything at all. When the lexer reaches that character, the following happen in order:
The lexer determines that no lexer rule can match the # character.
The lexer reports an error regarding the failure.
The lexer calls _input.consume() to skip past the bad character.
To ensure errors are reported as easily as possible, always add the following rule as the last rule in your lexer.
ErrChar
: .
;
The parser will report an error when it reaches an ErrChar, so you won't need to add an error listener to the lexer.
Basically I want to find, in a file, using ANTLR, every expression as defined :
WORD.WORD
for example : "end.beginning" matches
For the time being the file can have hundreds and hundreds of lines and a complexe structure.
Is there a way to skip everything (every character?) that does not match the pattern described above, without making a grammar that fully represents the file?
So far this is my grammar but i don't know what to do next.
grammar Dep;
program
:
dependencies
;
dependencies
:
(
dependency
)*
;
dependency
:
identifier
DOT
identifier
;
identifier
:
INDENTIFIER
;
DOT : '.' ;
INDENTIFIER
:
[a-zA-Z_] [a-zA-Z0-9_]*
;
OTHER
:
. -> skip
;
The way you're doing it now, the dependency rule would also match the tokens 'end', '.', 'beginning' from the input:
end
#####
.
#####
beginning
because the line breaks and '#'s are being skipped from the token stream.
If that is not what you want, i.e. you'd like to match "end.beginning" without any char in between, you should make a single lexer rule of it, and match that rule in your parser:
grammar Dep;
program
: DEPENDENCY* EOF
;
DEPENDENCY
: [a-zA-Z_] [a-zA-Z0-9_]* '.' [a-zA-Z_] [a-zA-Z0-9_]*
;
OTHER
: . -> skip
;
Then you could use a tree listener to do something useful with your DEPENDENCY's:
public class Main {
public static void main(String[] args) throws Exception {
String input = "### end.beginning ### end ### foo.bar mu foo.x";
DepLexer lexer = new DepLexer(new ANTLRInputStream(input));
DepParser parser = new DepParser(new CommonTokenStream(lexer));
ParseTreeWalker.DEFAULT.walk(new DepBaseListener(){
@Override
public void enterProgram(@NotNull DepParser.ProgramContext ctx) {
for (TerminalNode node : ctx.DEPENDENCY()) {
System.out.println("node=" + node.getText());
}
}
}, parser.program());
}
}
which would print:
node=end.beginning
node=foo.bar
node=foo.x
I am looking for a way to tokenize a '#' being the first non-whitespace, non-comment character of a line (this is exactly the same as the standard C++ preprocessing directives requirement). Notice the first non-whitespace requirement implying the # can be preceded by whitespaces and multiline comments such as (using C++ preprocessing directives as examples):
/* */ /* abc */ #define getit(x,y) #x x##y
and
/*
can be preceded by multiline comment spreading across >1 lines
123 */ /* abc */# /* */define xyz(a) #a
The '#' could be preceded and followed by multiline comments spanning >1 lines and whitespaces. Other '#' can appear in the line as operators so being the first effective character in the line is the key requirement.
How do we tokenize the first effective # character ?
I tried this
FIRSTHASH: {getCharPositionInLine() == 0}? ('/*' .*? '*/' | [ \t\f])* '#';
But this is buggy since an input like this
/* */other line
/* S*/ /*SS*/#
is wrongly considered as 2 tokens ( 1 big comment + a single '#'). i.e. the .*? consumed the 2 two */ incorrectly causing the 2 lines combined as 1 comment. (Is it possible to replace the .*? inside multiline comment by something explicitly excludes */?)
I'd lex it without the restraint and check during parsing phase or even after parsing.
It may be not conforming to the grammar to put the '#' elsewhere but it doesn't invalidate the parsing => move it to a later phase where it can be detected more easily!
If you really want to do it "early" (i.e. not after parsing), do it during the parsing phase.
The lexing doesn't depend on it (i.e. unlike strings or comments), so there's no point in doing it during the lexing phase.
Here's a sample in C#.
It checks all 3 defines (first two are ok, the third is not ok).
public class hashTest
{
public static void test()
{
var sample = File.ReadAllText(@"ANTLR\unrelated\hashTest.txt");
var sampleStream = new Antlr4.Runtime.AntlrInputStream(sample);
var lexer = new hashLex(input: sampleStream);
var tokenStream = new CommonTokenStream(tokenSource: lexer);
var parser = new hashParse(input: tokenStream);
var result = parser.compileUnit();
var visitor = new HashVisitor(tokenStream: tokenStream);
var visitResult = visitor.Visit(result);
}
}
public class HashVisitor : hashParseBaseVisitor<object>
{
private readonly CommonTokenStream tokenStream;
public HashVisitor(CommonTokenStream tokenStream)
{
this.tokenStream = tokenStream;
}
public override object VisitPreproc(hashParse.PreprocContext context)
{
;
var startSymbol = context.PreProcStart().Symbol;
var tokenIndex = startSymbol.TokenIndex;
var startLine = startSymbol.Line;
var previousTokens_reversed = tokenStream.GetTokens(0, tokenIndex - 1).Reverse();
var ok = true;
var allowedTypes = new[] { hashLex.RangeComment, hashLex.WS, };
foreach (var token in previousTokens_reversed)
{
if (token.Line < startLine)
break;
if (allowedTypes.Contains(token.Type) == false)
{
ok = false;
break;
}
;
}
if (!ok)
{
; // handle error
}
return base.VisitPreproc(context);
}
}
The lexer:
lexer grammar hashLex;
PreProcStart : Hash -> pushMode(PRE_PROC_MODE)
;
Identifier:
Identifier_
;
LParen : '(';
RParen : ')';
WS
: WS_-> channel(HIDDEN)
;
LineComment
: '//'
~('\r'|'\n')*
(LineBreak|EOF)
-> channel(HIDDEN)
;
RangeComment
: '/*'
.*?
'*/'
-> channel(HIDDEN)
;
mode PRE_PROC_MODE;
PreProcIdentifier : Identifier_;
PreProcHash : Hash;
PreProcEnd :
(EOF|LineBreak) -> popMode
;
PreProcWS : [ \t]+ -> channel(HIDDEN)
;
PreProcLParen : '(';
PreProcRParen : ')';
PreProcRangeComment
: '/*'
(~('\r' | '\n'))*?
'*/'
-> channel(HIDDEN)
;
fragment LineBreak
: '\r' '\n'?
| '\n'
;
fragment Identifier_:
[a-zA-Z]+
;
fragment Hash : '#'
;
fragment WS_
: [ \t\r\n]+
;
The parser:
parser grammar hashParse;
options { tokenVocab=hashLex; }
compileUnit
: (allKindOfStuff | preproc)*
EOF
;
preproc : PreProcStart .*? PreProcEnd
;
allKindOfStuff
: Identifier
| LParen
| RParen
;
The sample:
/*
can be preceded by multiline comment spreading across >1 lines
123 */ /* abc */# /* */define xyz(a) #a
/* def */# /* */define xyz(a) #a
some other code // #
illegal #define a b