ANTLRWorks, whitespaces, memory leaks, and crashing

I wanted to try ANTLR so that I could eventually parse some code and refactor it. I tried some small grammars and everything was fine, so I took the next step and started parsing a sort of simplified C#.
The good news: it takes about 10 minutes to understand the basics.
The extremely bad news: it takes hours to understand how to parse two spaces instead of just one. Really. This thing hates whitespace and has no shame in telling you so. Honestly, I started to think it was unable to parse it at all, but then something went the right way... or at least I thought so.
Now the whitespace problem takes second place to the fact that ANTLRWorks tries to allocate half a GB of RAM and cannot really parse anything.
The grammar is not very complicated, since I am a beginner:
grammar newEmptyCombinedGrammar;
TokenEndCmd : ';' ;
TokenGlobImport : 'import' ;
TokenGlobNamespace : 'namespace' ;
TokenClass : 'class' ;
TokenSepFloat : ',' ;
TokenSepNamespace : '.' ;
fragment TokenEmptyString : '' ;
TokenUnderscore : '_' ;
TokenArgsSep : ',' ;
TokenArgsOpen : '(' ;
TokenArgsClose : ')' ;
TokenBlockOpen : '{' ;
TokenBlockClose : '}' ;
// --------------------
Digit : [0-9] ;
numberInt : Digit+ ;
numberFloat : numberInt TokenSepFloat numberInt ;
WordCI : [a-zA-Z]+ ;
WordUP : [A-Z]+ ;
WordLW : [a-z]+ ;
// -----------------
keyword : (WordCI | TokenUnderscore+) (numberInt | WordCI | TokenUnderscore)* ;
// ---------------------
spaces : (' ' | '\t')+ ;
spaceLNs : (' ' | '\t' | '\r' | '\n')+ ;
spacesOpt : spaces* ;
spaceLNsOpt : spaceLNs* ;
// ---------------------
// e.g. "System" or "System.Net.Socket"
namepaceNameComposited : keyword (TokenSepNamespace keyword)* ;
// import System; import System.IO;
globImport : TokenGlobImport spaces namepaceNameComposited spacesOpt TokenEndCmd ;
// class class1 {}
namespaceClass : TokenClass spaces keyword spaceLNsOpt TokenBlockOpen spaceLNsOpt TokenBlockClose ;
// "namespace ns1 {}", "namespace ns1.sns2{}"
globNamespace : TokenGlobNamespace spaces namepaceNameComposited spaceLNsOpt TokenBlockOpen spaceLNsOpt namespaceClass spaceLNsOpt TokenBlockClose ;
globFile : (globImport | spaceLNsOpt)* (globNamespace | spaceLNsOpt)* ;
But as soon as globFile or globNamespace is added, the IDE starts allocating memory like there's no tomorrow, which is obviously a problem.
So:
- Is this the right way to capture whitespace? (I don't want to skip it, that's the point.)
- Is the memory leak caused by a recursion I'm not seeing?
The code this grammar is supposed to parse looks something like:
import System;
namespace aNamespace{
class aClass{
}
}
globFile is the main rule, by the way.

You should define a lexer token that treats whitespace the way you need. If you want a group of consecutive space or tab characters to form a single token, use a definition like the following. In this case, you would reference whitespace in the parser rules as Whitespace (required) or Whitespace? (optional).
// ANTLR 3:
Whitespace : (' ' | '\t')+;
// ANTLR 4:
Whitespace : [ \t]+;
If you want every individual whitespace character to be its own token, use something like the following. In this case, you would reference whitespace in the parser rules as Whitespace+ (required) or Whitespace* (optional).
// ANTLR 3:
Whitespace : ' ' | '\t';
// ANTLR 4:
Whitespace : [ \t];
The question about memory leaks probably belongs on the ANTLRWorks issue tracker.
ANTLRWorks 1 issue tracker: https://github.com/antlr/antlrworks/issues
ANTLRWorks 2 issue tracker: https://bitbucket.org/sharwell/antlrworks2/issues

The problem is indeed the last rule:
globFile : (globImport | spaceLNsOpt)* (globNamespace | spaceLNsOpt)* ;
I changed it like this:
globFile : (globImport spaceLNsOpt)* (globNamespace spaceLNsOpt)* ;
and adding an EOF seems to help:
globFile : (globImport spaceLNsOpt)* (globNamespace spaceLNsOpt)* EOF ;
but this is still not sufficient; the rule does not work in either form.
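One plausible reason the original rule misbehaves (an assumption, since the thread never states the cause): spaceLNsOpt can match nothing at all, so a loop such as (globImport | spaceLNsOpt)* can keep "succeeding" without consuming any input. The Python sketch below is not ANTLR code; match_spaces_opt and star are hypothetical helpers that only illustrate how a starred loop over an empty-matching body stalls.

```python
def match_spaces_opt(text, pos):
    """Stand-in for spaceLNsOpt: consume zero or more whitespace characters.

    Always succeeds, even when there is nothing to consume.
    """
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def star(body, text, pos):
    """Naive (body)* loop with a guard against empty matches.

    Returns (end position, iterations run, stalled). Without the guard,
    an empty-matching body would make this loop spin forever.
    """
    iterations = 0
    while True:
        new_pos = body(text, pos)
        if new_pos is None:  # body failed: the loop ends normally
            return pos, iterations, False
        iterations += 1
        if new_pos == pos:  # body matched the empty string: no progress
            return pos, iterations, True
        pos = new_pos


print(star(match_spaces_opt, "  x", 0))  # (2, 2, True)
```

In a loop of the form (globImport spaceLNsOpt)*, every iteration has to consume at least one globImport, so this kind of stall cannot occur there.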

Related

ANTLR4 ambiguity - how to solve

I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa" with values like "bbbb" an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".

There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens, no matter what any parser rule is trying to match. The lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by a large 2nd token that'd match this: a=5;\naa bbbb. This is how ANTLR's lexer works: it matches as many characters as possible (no matter what the parser rule is trying to match).
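To make the tokenisation concrete, here is a rough Python imitation of the "longest match wins" behaviour. It is only a sketch: tokenize is a hypothetical helper, the rule tables are hand-written regex approximations of the grammar above (with the character class simplified to $#_.), and real ANTLR lexers are generated very differently.

```python
import re


def tokenize(rules, text, skip=("WS",)):
    """Toy lexer: the longest match wins; on a tie, the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in skip:  # imitate '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Approximation of the lexer rules in the question.
original = [
    ("PERCENT", r"%"),
    ("EQ", r"="),
    ("SEMI", r";"),
    ("WS", r"[ \t\n\r\f]+"),
    ("DecimalConstant", r"[0-9]+"),
    ("Identifier", r"[a-zA-Z$#_.][a-zA-Z0-9$#_.]*"),
]
# The same rules plus the catch-all lexer rule InputText : ~('%')+ ;
with_input_text = original + [("InputText", r"[^%]+")]

print(tokenize(original, "aa bbbb"))
print(tokenize(with_input_text, "%a=5;\naa bbbb"))
```

The first call yields two Identifier tokens, because the space is skipped before the parser ever gets involved. The second yields a '%' token followed by one InputText token holding everything else.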

I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;

Problem matching single digits when integers are defined as tokens

I'm having a problem trying to get a grammar working. Here is the simplified version. The language I'm trying to parse has expressions like these:
testing1(2342);
testing2(idfor2);
testing3(4654);
testing4[1..n];
testing5[0..1];
testing6(7);
testing7(1);
testing8(o);
testing9(n);
The problem arises when I introduce the rules for the [1..n] or [0..1] expressions. The grammar file (one of the many variations I've tried):
grammar test;
tests
: test* ;
test
: call
| declaration ;
call
: callName '(' callParameter ')' ';' ;
callName : Identifier ;
callParameter : Identifier | Integer ;
declaration
: declarationName '[' declarationParams ']' ';' ;
declarationName : Identifier ;
declarationParams
: decMin '..' decMax ;
decMin : '0' | '1' ;
decMax : '1' | 'n' ;
Integer : [0-9]+ ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
WS : [ \t\r\n]+ -> skip ;
When I parse the sample with this grammar, it fails on testing7(1); and testing9(n);. The inputs are matched as decMin or decMax instead of Integer or Identifier:
line 8:9 mismatched input '1' expecting {Integer, Identifier}
line 10:9 mismatched input 'n' expecting {Integer, Identifier}
I've tried many variations but I can't make it work.

I think your problem comes from not defining lexer rules that clearly express what you want.
When you added this rule:
decMin : '0' | '1' ;
you in fact created an unnamed lexer rule that matches '0' and another one that matches '1':
UNNAMED_0_RULE : '0';
UNNAMED_1_RULE : '1';
And your parser rule became:
decMin : UNNAMED_0_RULE | UNNAMED_1_RULE ;
The problem: now, when the lexer tokenizes
testing7(1);
the parser no longer sees
callName '(' callParameter ')' ';'
it sees
callName '(' UNNAMED_1_RULE ')' ';'
and it doesn't understand that.
That is because the lexer runs before the parser, independently of what the parser rules expect.
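The effect can be imitated with a small Python sketch. It assumes, as far as I know is the case for combined grammars, that literals used in parser rules become implicit lexer rules that are tried before the explicit ones. tokenize is a hypothetical helper, and the rule table is a hand-written subset of the question's grammar.

```python
import re


def tokenize(rules, text, skip=("WS",)):
    """Toy lexer: the longest match wins; on a tie, the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in skip:  # imitate '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


rules = [
    # Implicit tokens created by the literals in the parser rules.
    ("'('", r"\("), ("')'", r"\)"), ("';'", r";"),
    ("'0'", r"0"), ("'1'", r"1"), ("'n'", r"n"),
    # Explicit lexer rules from the grammar.
    ("Integer", r"[0-9]+"),
    ("Identifier", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("WS", r"[ \t\r\n]+"),
]

print(tokenize(rules, "testing7(1);"))  # the 1 becomes the '1' token
print(tokenize(rules, "testing6(7);"))  # the 7 becomes an Integer token
```

Note that 10 would still come out as an Integer, because the longer match beats the single-character literal. Only the bare inputs 0, 1 and n are captured by the implicit tokens.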
To solve your problem, define your lexer rules carefully. It would probably look like this:
grammar test;
/*---------------- PARSER ----------------*/
tests
: test*
;
test
: call
| declaration
;
call
: callName '(' callParameter ')' ';'
;
callName
: identifier
;
callParameter
: identifier
| integer
;
declaration
: declarationName '[' declarationParams ']' ';'
;
declarationName
: identifier
;
declarationParams
: decMin '..' decMax
;
decMin
: INTEGER_ZERO
| INTEGER_ONE
;
decMax
: INTEGER_ONE
| LETTER_N
;
integer
: (INTEGER_ZERO | INTEGER_ONE | INTEGER_OTHERS)+
;
identifier
: LETTER_N
| IDENTIFIER
;
/*---------------- LEXER ----------------*/
LETTER_N: N;
IDENTIFIER
: [a-zA-Z_][a-zA-Z0-9_]*
;
WS
: [ \t\r\n]+ -> skip
;
INTEGER_ZERO: '0';
INTEGER_ONE: '1';
INTEGER_OTHERS: '2'..'9';
fragment N: [nN];
I just tested this grammar and it works.
The drawback is that it splits your integers at the lexer step (1245 becomes the tokens 1 2 4 5, and the parser rule integer then joins them back together).
I think it would be better to be less precise and simply write:
decMin: integer | identifier;
But then it depends on what you do with your grammar...

Want to parse same structure

I would like to make ANTLR4 parse this:
FSM
name type String
state type State
Relation
name type String
And I am using this grammar:
grammar Generator;
classToGenerate:
name=Name NL
(attributes NL)+
classToGenerate| EOF;
attributes: attribute=Name WS 'type' WS type=Name;
Name: ('A'..'Z' | 'a'..'z')+ ;
WS: (' ' | '\t')+;
NL: '\r'? '\n';
I would like this input to parse successfully, but each time I run my program I get this error, and I don't know why:
line 6:18 no viable alternative at input '<EOF>'
Any fix?

The trailing EOF is messing things up for you. Try creating a separate rule that matches the EOF token, preceded by one or more classToGenerate (the parse rule in my example):
grammar Generator;
parse
: classToGenerate+ EOF
;
classToGenerate
: name=Name NL (attributes NL)+
;
attributes
: attribute=Name WS 'type' WS type=Name
;
Name: ('A'..'Z' | 'a'..'z')+ ;
WS: (' ' | '\t')+;
NL: '\r'? '\n';
And do you really need to keep the spaces and line breaks? You could let the lexer discard them, which makes your grammar a whole lot easier to read:
grammar Generator;
parse
: classToGenerate+ EOF
;
classToGenerate
: name=Name attributes+
;
attributes
: attribute=Name 'type' type=Name
;
Name : [a-zA-Z]+;
Spaces : [ \t\r\n] -> skip;

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
When I try to parse the input empty line, I receive this error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is of type CODE and includes the whole line.
What am I doing wrong?

ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
Your rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.
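Both rules from this answer (longest match first, then rule order on a tie) can be seen in a small Python imitation. This is a sketch only: tokenize is a hypothetical helper and the rule table is a regex approximation of the Aligner lexer, listed in the same order as in the grammar.

```python
import re


def tokenize(rules, text, skip=()):
    """Toy lexer: the longest match wins; on a tie, the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in skip:  # imitate '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


aligner_rules = [
    ("COLON", r":"), ("MINUS", r"-"), ("PLUS", r"\+"),
    ("KW_EMPTY", r"empty"), ("KW_LINE", r"line"),
    ("NUMBER", r"[0-9]+"),
    ("EOL", r"\r?\n"),
    ("SPACING", r"[ \t\f]"),
    ("CODE", r"[^\n\r]+"),
]

# CODE matches 10 characters, KW_EMPTY only 5, so CODE wins.
print(tokenize(aligner_rules, "empty line", skip=("SPACING",)))
# Here both match 5 characters, so the earlier rule KW_EMPTY wins.
print(tokenize(aligner_rules, "empty", skip=("SPACING",)))
```

The keyword rules only ever get a chance when nothing follows them on the line, which is why CODE needs to be narrowed.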

Is there a language-agnostic way to do simple predicates in the parser?

Goal
I want to reduce (or eliminate) the Java-specific actions and predicates in my parser. Perhaps it isn't possible, but I wanted to ask here just in case there's some ANTLR4 feature I've missed. (The language itself is third-party, so I don't have control over that.)
Simplified example
The predicates I want to use are mostly exact (or perhaps case-insensitive) string-matching. I could make big parallel sets of parser rules, but I'd rather not since the real-life example is considerably more convoluted.
Suppose I'm given something like:
isWidget(int) : "Whether it is a widget" : 4 ;
ownerFirstName(string) : "john" ;
ownerLastName(string) : "This is the last-name of the owner" : "doe" ;
I want the parser to look at the default-value (the last item on the line, like 4, "john" or "doe") and parse it based on the earlier type (int), (string), (string).
main
: stmt SEMIC (stmt SEMIC)* EOF
;
stmt
: propname=IDENTIFIER LPAREN datatype=IDENTIFIER RPAREN (COLON description=QUOTSTRING)? COLON df=defaultVal
;
defaultVal
: QUOTSTRING //TODO only this alt if datatype=string
| NUM //TODO only this alt if datatype=int
;
fragment Letter : 'a'..'z' | 'A'..'Z' ;
fragment Digit : '0'..'9' ;
fragment Underscore : '_' ;
SEMIC : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
IDENTIFIER : (Letter|Underscore) (Letter|Underscore|Digit)* ;
QUOTSTRING : '"' ~('"' |'\n' | '\r' | '\u2029' | '\u2028')* '"' ;
NUM : Digit+ ;
WS : [ \t\n\r]+ -> skip ;
I know I can do it with predicates and rule inputs, but then I'm crossing the line from a language-agnostic grammar to one with embedded Java code.

Your parser should handle things like the following without a problem:
isWidget(int) : "Whether it is a widget" : "foo" ;
In other words, do not add a predicate that would fail in this case, or you will lose the ability to report sane error messages. Instead, use a language-specific listener or visitor implementation after the parse is complete to report a semantic error if the type of the default value does not match the declared type.
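As a rough illustration of the "check after the parse" idea, here is a Python sketch. In a real setup this logic would live in a listener or visitor over the ANTLR parse tree; the Stmt class, default_kind and check are hypothetical stand-ins for whatever you would read off the stmt nodes.

```python
from dataclasses import dataclass


@dataclass
class Stmt:
    """Hypothetical stand-in for the data carried by one parsed stmt node."""
    propname: str
    datatype: str
    default: str  # raw token text, e.g. '4' or '"john"'


def default_kind(text):
    """Classify a default value by its token shape (QUOTSTRING vs NUM)."""
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return "string"
    if text.isascii() and text.isdigit():
        return "int"
    return "unknown"


def check(stmts):
    """Return one semantic error message per mismatched default value."""
    errors = []
    for s in stmts:
        kind = default_kind(s.default)
        if kind != s.datatype:
            errors.append(
                f"{s.propname}: declared {s.datatype}, "
                f"but default {s.default} is a {kind}"
            )
    return errors


stmts = [
    Stmt("isWidget", "int", '"foo"'),
    Stmt("ownerFirstName", "string", '"john"'),
]
for message in check(stmts):
    print(message)
```

The grammar stays free of embedded code, and the mismatch is reported as a readable semantic error instead of a syntax error.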
