Want to parse same structure

Want to parse same structure - antlr4

I would like to make ANTLR4 parse this:
FSM
name type String
state type State
Relation
name type String
And i am using this grammar :
grammar Generator;
classToGenerate:
name=Name NL
(attributes NL)+
classToGenerate| EOF;
attributes: attribute=Name WS 'type' WS type=Name;
Name: ('A'..'Z' | 'a'..'z')+ ;
WS: (' ' | '\t')+;
NL: '\r'? '\n';
I would like to read successfully, i don't know why, but each time i run my program, i get this error :
line 6:18 no viable alternative at input '<EOF>'
Any fix?

The trailing EOF is messing things up for you. Try creating a separate rule that matches the EOF token, preceded by one or more classToGenerate (the parse rule in my example):
grammar Generator;
parse
: classToGenerate+ EOF
;
classToGenerate
: name=Name NL (attributes NL)+
;
attributes
: attribute=Name WS 'type' WS type=Name
;
Name: ('A'..'Z' | 'a'..'z')+ ;
WS: (' ' | '\t')+;
NL: '\r'? '\n';
And do you really need to keep the spaces and line breaks? You could let the lexer discard them, which makes your grammar a whole lot easier to read:
grammar Generator;
parse
: classToGenerate+ EOF
;
classToGenerate
: name=Name attributes+
;
attributes
: attribute=Name 'type' type=Name
;
Name : [a-zA-Z]+;
Spaces : [ \t\r\n] -> skip;

Related

ANTLR4 ambiguity - how to solve

I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa" with values like "bbbb" an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".

There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens. No matter what any parser rule is trying to match. The lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by a large 2nd token that'd match this: a=5;\naa bbbb. This is how ANTLR's lexer works: match as much characters as possible (no matter what the parser rule is trying to match).

I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;

antlr4 all words except the operators

grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR expression)+
| expression (NOT expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).

The lexer will use the first rule that can match the given input. Only if that rule can't match it, it will try the next one.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case becaue otherwise once invoked the WORD rule will consume all following input.
As WORD is specified last, input will only be tried to be consumed by it if all previous lexer rules (all that have been defined above in the source code) have failed.
EDIT: If you don't want your WORD rule to match any input then you just have to modify the rule I provided. But the essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input as long as you got the order in the source code right.

Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
Maybe you want to use escape sequences in QUOTEDWORD, then look in this example how to do this.
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)

ANTLR4 tells me: mismatched input 'little' expecting {'big', 'little'}

I have the following simple grammar:
grammar TestG;
p : pDecl+ ;
pDecl : endianDecl
| dTDecl
;
endianType : E_BIG
| E_LITTLE
;
endianDecl : 'endian' '=' endianType ';' ;
dTDecl : 'dT' '[' STRING ']' '=' ID ';' ;
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
E_BIG : 'big' ;
E_LITTLE : 'little' ;
When I run grun TestG p and input the following:
endian = little;
I get the following:
line 1:9 mismatched input 'little' expecting {'big', 'little'}
What have I done wrong?

Because your lexer rule for ID precedes that for E_LITTLE, your 'little' input is being lexed as an ID.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<ID>,1:9] <== see here it's being lexed as an ID
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'little' expecting {'big', 'little'}
Moving the these lexer tokens above ID like so:
STRING: '"'.*?'"' ; //Embedded quotes?
COMMENT: '#' .*? [\n\r] -> skip ; // Discard comments for now
E_BIG : 'big' ;
E_LITTLE : 'little' ;
ID : [a-zA-Z][a-zA-Z0-9_]* ;
WS : [ \t\n\r]+ -> skip ;
INT : ('0x')?[0-9]+ ; // How to handle 0xDD and ensure non zero?
yields the correct output from your test input.
[#0,0:5='endian',<'endian'>,1:0]
[#1,7:7='=',<'='>,1:7]
[#2,9:14='little',<'little'>,1:9] <== see here being lexed correctly
[#3,15:15=';',<';'>,1:15]
[#4,18:17='<EOF>',<EOF>,2:0]
Remember, for lexer tokens, the longest match wins, but in the case of a tie, the one that appears FIRST wins. This is why you want your more specific lexer tokens at the top of the lexer token list, and the more general ones (like identifiers, strings, etc.) farther down.

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
when I try to parse - empty line I receive error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is from type CODE and includes the whole line.
What I am doing wrong?

ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
You rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.

ANTLRWorks, whitespaces, memory leaks, and crashing

I wanted to try this tool, antlr, so that I could eventually arrive to parse some code and refactor it. I tried some small grammars, everything was ok, so I took the next step and started parsing a sort of simple C#.
The good news: it takes like 10 minutes to understand the basics.
The extrememly bad news: it takes hours to understand how to parse two spaces instead of just one. Really. This things hates whitespaces, and has no shame in telling you that. Honestly I started to think it was unable to parse them, but then something went the right way... Or at least I thought so.
Now the problem of spaces comes after the fact that ANTLRWorks tries to allocate half a GB of ram and cannot really parse anything.
The grammar is not very hard, being I a beginner:
grammar newEmptyCombinedGrammar;
TokenEndCmd : ';' ;
TokenGlobImport : 'import' ;
TokenGlobNamespace : 'namespace' ;
TokenClass : 'class' ;
TokenSepFloat : ',' ;
TokenSepNamespace : '.' ;
fragment TokenEmptyString : '' ;
TokenUnderscore : '_' ;
TokenArgsSep : ',' ;
TokenArgsOpen : '(' ;
TokenArgsClose : ')' ;
TokenBlockOpen : '{' ;
TokenBlockClose : '}' ;
// --------------------
Digit : [0-9] ;
numberInt : Digit+ ;
numberFloat : numberInt TokenSepFloat numberInt ;
WordCI : [a-zA-Z]+ ;
WordUP : [A-Z]+ ;
WordLW : [a-z]+ ;
// -----------------
keyword : (WordCI | TokenUnderscore+) (numberInt | WordCI | TokenUnderscore)* ;
// ---------------------
spaces : (' ' | '\t')+ ;
spaceLNs : (' ' | '\t' | '\r' | '\n')+ ;
spacesOpt : spaces* ;
spaceLNsOpt : spaceLNs* ;
// ---------------------
// tipo "System" o "System.Net.Socket"
namepaceNameComposited : keyword (TokenSepNamespace keyword)* ;
// import System; import System.IO;
globImport : TokenGlobImport spaces namepaceNameComposited spacesOpt TokenEndCmd ;
// class class1 {}
namespaceClass : TokenClass spaces keyword spaceLNsOpt TokenBlockOpen spaceLNsOpt TokenBlockClose ;
// "namespace ns1 {}", "namespace ns1.sns2{}"
globNamespace : TokenGlobNamespace spaces namepaceNameComposited spaceLNsOpt TokenBlockOpen spaceLNsOpt namespaceClass spaceLNsOpt TokenBlockClose ;
globFile : (globImport | spaceLNsOpt)* (globNamespace | spaceLNsOpt)* ;
but still when globFile or globNamespace are added the ide starts to allocate memory like there's no tomorrow, and that's obviously a problem.
So
-is this way of capturing the whitespaces right? (I don't want to skip them, that's the point)
-is the memory leaking for a recursion I don't see?
The code that this thing is able to parse is something like:
import System;
namespace aNamespace{
class aClass{
}
}
globFile is the main rule, by the way.

You should define a lexer token to treat whitespaces the way you need it to. If you want a group of consecutive space or tab characters to form a single token, use a definition like the following. In this case, you would reference whitespace in the parser rules as Whitespace (required) or Whitespace? (optional).
// ANTLR 3:
Whitespace : (' ' | '\t')+;
// ANTLR 4:
Whitespace : [ \t]+;
If you want every individual whitespace character to be its own token, use something like the following. In this case, you would reference whitespace in the parser rules as Whitespace+ (required) or Whitespace* (optional).
// ANTLR 3:
Whitespace : ' ' | '\t';
// ANTLR 4:
Whitespace : [ \t];
The question about memory leaks probably belongs on the ANTLRWorks issue tracker.
ANTLRWorks 1 issue tracker: https://github.com/antlr/antlrworks/issues
ANTLRWorks 2 issue tracker: https://bitbucket.org/sharwell/antlrworks2/issues

The problem is effectively the last rule
globFile : (globImport | spaceLNsOpt)* (globNamespace | spaceLNsOpt)* ;
I changed it like this:
globFile : (globImport spaceLNsOpt)* (globNamespace spaceLNsOpt)* ;
and it seems that adding a EOF apparently helps:
globFile : (globImport spaceLNsOpt)* (globNamespace spaceLNsOpt)* EOF ;
but this is not sufficient, the rule cannot function in any case.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Want to parse same structure - antlr4

Related

ANTLR4 ambiguity - how to solve

antlr4 all words except the operators

ANTLR4 tells me: mismatched input 'little' expecting {'big', 'little'}

antlr tokenizer starts with the last token

ANTLRWorks, whitespaces, memory leaks, and crashing

Categories

Resources