Are character classes allowed in ANTLR4? - antlr4

Are character classes supported in ANTLR 4 lexers? I saw some examples that looked like this is OK:
LITERAL: [a-zA-z]+;
but what I found is that it matches the string "OR[" with the opening bracket. Using ranges worked:
LITERAL: ('a'..'z' | 'A'..'Z')+;
and only identified "OR" as the LITERAL. Here is an example:
grammar Test;
@members {
private void log(String msg) {
System.out.println(msg);
}
}
parse
: expr EOF
;
expr
: atom {log("atom(" + $atom.text + ")");}
| l=expr OR r=expr {log("IOR:left(" + $l.text + ") right(" + $r.text + "}");}
| (OR '[' la=atom ra=atom ']') {log("POR:left(" + $la.text + ") right(" + $ra.text + "}");}
;
atom
: LITERAL
;
OR : O R ;
LITERAL: [a-zA-z]+;
//LITERAL: ('a'..'z' | 'A'..'Z')+;
SPACE
: [ \t\r\n] -> skip
;
fragment O: ('o'|'O');
fragment R: ('r'|'R');
When given the input "OR [ cat dog ]" it parses correctly, but "OR[ cat dog ]" does not.

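You can use character sets in ANTLR 4 lexers, but the ranges are case sensitive. You used [a-zA-z] where I believe you meant [a-zA-Z]. The range A-z covers every character from 'A' to 'z', and that span includes '[' as well as a few other punctuation characters. LITERAL therefore matches the three characters "OR[", which beats the two-character OR match because the lexer prefers the longest match.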
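To see which extra characters the A-z range lets in, here is a minimal Java sketch. The class and method names are made up for illustration; it walks the range and collects everything that is not a letter.

```java
public class CharRangeDemo {
    // Collects every non-letter character that falls inside the range 'A'..'z'.
    static String nonLettersInRange() {
        StringBuilder sb = new StringBuilder();
        for (char c = 'A'; c <= 'z'; c++) {
            if (!Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the six characters that sit between 'Z' and 'a': [\]^_`
        System.out.println(nonLettersInRange());
    }
}
```

The opening bracket is the first of those six characters, which is why "OR[" was swallowed by LITERAL.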

Related

ANTLR Grammar to distinguish words, alphanumeric and numbers

I am still working my way through ANTLR and would appreciate any support for an enhanced version of this grammar.
Here is an input string:
SYS [ErrorCode is not Available] : Transaction ID:
d9d1211e-d273-40e1-bdd0-e4c9a8036ef3 . This can be ignored safely to:
map To Not availble : works in progress
Expected Parser Output:
words -> SYS
specials -> [
words -> ErrorCode
words -> is
....
alphanumeric -> d9d1211e-d273-40e1-bdd0-e4c9a8036ef3
...
The ANTLR grammar I have come up with so far:
grammar Expressions;
expression
:
| numbers? specials? words (numbers? specials? words)*
| numbers words specials
| specials words numbers
| specials numbers words
| words specials numbers
| words numbers specials
| specials specials? (specials specials? )*
| words words? (words words?)*
| numbers numbers? (numbers numbers?)*
;
words
: CHARACTERS
;
numbers
: NUMBERS
;
specials
: AND
| OR
| EQUALS
| ASSIGN
| GT
| LT
| GTE
| LTE
| NOTEQUALS
| NOT
| PLUS
| MINUS
| IF
| COLON
| TLB
| TRB
| FLB
| FRB
| DOT
;
AND : '&&' ;
OR : '||' ;
EQUALS : '==' ;
ASSIGN : '=' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
NOTEQUALS : '!=' ;
NOT : '!' ;
PLUS : '+' ;
MINUS : '-' ;
IF : 'if' ;
COLON : ':' ;
TLB : '[' ;
TRB : ']' ;
FLB : ')' ;
FRB : '(' ;
DOT : '.' ;
CHARACTERS
: [a-zA-Z] [a-zA-Z]*
;
NUMBERS
: [0-9]+
| ([0-9]+)? '.' ([0-9])+
;
WS : [ \t\r\n]+ -> skip
;
I wrote this simple Go program to check whether a string contains any digits.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	someString := "ID:8e038845-bd81-4218-9769-8406241fbb34 Operation is failed java.core.CoreRuntimeException: java.core.CoreRuntimeException: The JDBC connection information provided is incomplete"
	words := strings.Fields(someString)
	var tokens []string
	var x int
	for _, j := range words {
		if HasDigit(j) {
			dynamic := "$" + strconv.Itoa(x)
			tokens = append(tokens, dynamic)
			x++
		} else {
			tokens = append(tokens, j)
		}
	}
	var tokenized string
	tokenized = strings.Join(tokens, " ")
	fmt.Println(tokenized)
}

func HasDigit(s string) bool {
	for _, r := range s {
		if '0' <= r && r <= '9' {
			return true
		}
	}
	return false
}

How to use the reserved words inside the string in ANTLR4?

I am a newbie to ANTLR4 and language compilers. I am building a language compiler using ANTLR4 with Java, and I have a small problem with parsing strings: reserved words are matched as keyword tokens instead of as part of a string. For example, IF is a keyword token in my lexer, so how can I use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed to share the full grammar, but I think I have shared enough information. As I understand it, words are formed from LETTER and symbol characters. One workaround would be to write the word rule like this:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens, and I don't want to add each one individually. Is there a nicer way to accomplish this?
[Image: valid expression]
[Image: error expression]
How to make the parser ignore the keywords inside the string?
Your sample grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[@0,0:1='if',<IF>,1:0]
[@1,3:6=''if'',<STRING_LITERAL>,1:3]
[@2,8:12='endif',<ENDIF>,1:8]
[@3,14:13='<EOF>',<EOF>,2:0]
(Perhaps you inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase, ANTLR attempts to match your lexer rules against the input. If more than one rule matches at the current position in the input stream, it follows two rules to determine a "winner":
1. If a rule matches a longer sequence of input characters, then that rule will be used.
2. If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play: when the lexer reaches the ', only STRING_LITERAL can match, and it consumes the characters 'if' in full. Your IF lexer rule is never a candidate at that point.
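The two tie-breaking rules above can be sketched in plain Java. This is only an approximation built on java.util.regex, not ANTLR's actual lexer. The rule table is a hypothetical subset of the grammar in question, and the WORD rule is an assumption added purely to show a tie (it is not in the original lexer).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchDemo {
    // Hypothetical rule table; insertion order stands in for grammar order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("IF", Pattern.compile("(?i)if"));
        RULES.put("ENDIF", Pattern.compile("(?i)endif"));
        RULES.put("WORD", Pattern.compile("[A-Za-z]+")); // assumed rule, for the tie
        RULES.put("STRING_LITERAL", Pattern.compile("'(''|[^'])*'"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestRule.equals("WS")) { // mimic "-> skip"
                tokens.add(bestRule + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [IF:if, WORD:iffy, STRING_LITERAL:'if']
        System.out.println(tokenize("if iffy 'if'"));
    }
}
```

Here "if" ties between IF and WORD, so the earlier rule (IF) wins. "iffy" goes to WORD because it is the longer match. The quoted 'if' is consumed whole by STRING_LITERAL because no keyword rule can start at a quote character.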
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);

Why won't my ANTLR rules produce a nice parse tree?

I'm trying to create a grammar that would help me parse a string like this:
[Hello:/c=0.3//a=hi/] [what:/c=0.4/] [are:/c=0.6//a=is/]
This is my grammar:
grammar MyGrammar;
WS: [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
sentence: WORD+;
WORD: '[' WORD_DESCRIPTOR ']';
WORD_DESCRIPTOR: WORD_IDENTIFIER ':' WORD_FEATURES_DESCRIPTORS;
WORD_IDENTIFIER: STRING;
WORD_FEATURES_DESCRIPTORS: WORD_FEATURE_DESCRIPTOR+;
WORD_FEATURE_DESCRIPTOR: '/' WORD_FEATURE_IDENTIFIER '=' WORD_FEATURE_VALUE '/';
WORD_FEATURE_IDENTIFIER:
C_FEATURE | A_FEATURE
;
C_FEATURE: 'c';
A_FEATURE: 'a';
WORD_FEATURE_VALUE: STRING | NUMBER;
fragment LETTER : LOWER | UPPER ;
fragment LOWER : 'a'..'z' ;
fragment UPPER : 'A'..'Z' ;
fragment DIGIT : '0'..'9' ;
fragment INTEGER: DIGIT+ ;
fragment NUMBER: INTEGER (DOT INTEGER)? ;
fragment STRING: LETTER+ ;
fragment DOT: '.' ;
The problem is that the parse tree has only one level.
What am I doing wrong?
Your parse tree shows up the way it does because all tokens are leaf nodes, and all parser rules are internal nodes. Since you only have a single parser rule (sentence) and the rest are all tokens, this is the parse tree:
sentence
/ | | \
/ | | \
WORD WORD WORD WORD ...
You should see tokens as the atoms that your language is built from. Once you find yourself writing a token as an alternative of other tokens, such as TOKEN : TOKEN_A | TOKEN_B;, it is often better defined as a parser rule: token : TOKEN_A | TOKEN_B;.
Try something like this instead:
sentence : word+ EOF;
word : '[' word_descriptor ']';
word_descriptor : word_identifier ':' word_feature_descriptors;
word_identifier : STRING;
word_feature_descriptors : word_feature_descriptor+;
word_feature_descriptor : '/' word_feature_identifier '=' word_feature_value '/';
word_feature_value : STRING | NUMBER;
word_feature_identifier : C_FEATURE | A_FEATURE;
C_FEATURE : 'c';
A_FEATURE : 'a';
NUMBER : INTEGER (DOT INTEGER)?;
STRING : LETTER+ ;
WS : [ \t\r\n]+ -> skip; // skip spaces, tabs, newlines
fragment LETTER : LOWER | UPPER;
fragment LOWER : [a-z];
fragment UPPER : [A-Z];
fragment DIGIT : [0-9];
fragment INTEGER : DIGIT+;
fragment DOT : '.';
which will create the following parse tree for your input:

Can I make my ANTLR4 Lexer discard a character from the input stream?

I'm working on parsing PDF streams. In section 7.3.4.2 on literal string objects, the PDF Reference says that a backslash within a literal string that isn't followed by an end-of-line character, one to three octal digits, or one of the characters "nrtbf()\" should be ignored. Is there a way to get the recover method in my lexer to ignore a backslash in this situation?
Here is my simplified parser:
parser grammar PdfStreamParser;
options { tokenVocab=PdfStreamLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
Here's the lexer. (Based on the advice in this answer I'm not using a separate string mode):
lexer grammar PdfStreamLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
LITERAL_STRING: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | LITERAL_STRING )* ')' ;
// Escape sequences that can occur within a LITERAL_STRING
fragment ESCAPE_SEQUENCE
: '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? )
;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ; // Hexadecimal data enclosed in angle brackets
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
I can override the recover method in the PdfStreamLexer class and get notified when the LexerNoViableAltException occurs, but I'm not sure how to (or if it's possible to) ignore the backslash and continue on with the LITERAL_STRING tokenization.
To be able to skip part of the string, you'll need to use lexical modes. Here's a quick demo:
lexer grammar DemoLexer;
STRING_OPEN
: '(' -> pushMode(STRING_MODE)
;
SPACES
: [ \t\r\n] -> skip
;
OTHER
: .
;
mode STRING_MODE;
STRING_CLOSE
: ')' -> popMode
;
ESCAPE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;
STRING_PART
: ~[\\()]
;
NESTED_STRING_OPEN
: '(' -> type(STRING_OPEN), pushMode(STRING_MODE)
;
IGNORED_ESCAPE
: '\\' . -> skip
;
which can be used in the parser as follows:
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
parse
: ( string | OTHER )* EOF
;
string
: STRING_OPEN ( ESCAPE | STRING_PART | string )* STRING_CLOSE
;
If you now parse the string FU(abc(def)\#\))BAR, you will get the following parse tree:
As you can see, the \) is left in the tree, but \# is omitted.
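As a point of comparison, the backslash rule from the question can also be applied to a string body after lexing. The following is a minimal Java sketch; the class and method names are hypothetical. Note one difference from the demo grammar: IGNORED_ESCAPE skips the backslash together with the character after it, whereas this sketch drops only the backslash, which is how the question describes the intended behaviour. Recognized escapes are copied through unchanged rather than decoded.

```java
public class PdfEscapeDemo {
    // Characters that may legally follow a backslash: the named escapes,
    // an end-of-line character, or the first digit of an octal code.
    private static final String ESCAPE_STARTERS = "nrtbf()\\\r\n01234567";

    // Removes each backslash that does not start a recognized escape.
    static String dropIgnoredBackslashes(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            boolean startsEscape = i + 1 < s.length()
                    && ESCAPE_STARTERS.indexOf(s.charAt(i + 1)) >= 0;
            if (startsEscape) {
                // Keep the backslash and the character after it as a pair.
                out.append(c).append(s.charAt(i + 1));
                i++;
            }
            // Otherwise emit nothing: the lone backslash is dropped and the
            // next character is handled by the following loop iteration.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints abc(def)#\)
        System.out.println(dropIgnoredBackslashes("abc(def)\\#\\)"));
    }
}
```

A post-processing step like this keeps the lexer grammar simple, at the cost of scanning each string body a second time.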

Parsing quoted string with escape chars

I'm having a problem using ANTLR4 to parse a list of lines in the following format:
* this is a string
* "first" this is "quoted"
* this is "quoted with \" "
I want to build a parse tree like
(list
(line * (value (string this is a string)))
(line * (value (parameter first) (string this is) (parameter quoted)))
(line * (value (string this is) (parameter quoted with " )))
)
I have an antlr4 grammar of this format
grammar List;
list : line+;
line : '*' (WS)+ value* NEWLINE;
value : string
| parameter
;
string : ((WORD) (WS)*)+;
parameter : '"'((WORD) (WS)*)+ '"';
WORD : (~'\n')+;
WS : '\t' | ' ';
NEWLINE : '\n';
But this is failing in the first character recognition of '*' itself, which baffles me.
line 1:0 mismatched input '* this is a string' expecting '*'
The problem is that your lexer is too greedy. The rule
WORD : (~'\n')+;
matches almost everything. This causes the lexer to produce the following tokens for your input:
token 1: WORD (* this is a string)
token 2: NEWLINE
token 3: WORD (* "first" this is "quoted")
token 4: NEWLINE
token 5: WORD (* this is "quoted with \" ")
Yes, that is correct: only WORD and NEWLINE tokens. ANTLR's lexer tries to construct tokens with as many characters as possible; it does not "listen" to what the parser is trying to match.
The error message:
line 1:0 mismatched input '* this is a string' expecting '*'
is telling you this: on line 1, index 0, a token with the text '* this is a string' (type WORD) was encountered, but the parser was trying to match the token '*'.
Try something like this instead:
grammar List;
parse
: NEWLINE* list* NEWLINE* EOF
;
list
: item (NEWLINE item)*
;
item
: '*' (STRING | WORD)*
;
BULLET : '*';
STRING : '"' (~[\\"] | '\\' [\\"])* '"';
WORD : ~[ \t\r\n"*]+;
NEWLINE : '\r'? '\n' | '\r';
SPACE : [ \t]+ -> skip;
which parses your example input as follows:
(parse
(list
(item
* this is a string) \n
(item
* "first" this is "quoted") \n
(item
* this is "quoted with \" "))
\n
<EOF>)
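To see why the original WORD rule swallowed each line whole, here is a small Java sketch that uses regular expressions as stand-ins for the two rules that ever win in the original grammar. The '*' and WS rules are left out as a simplification: on the sample input, WORD always matches at least as many characters as they do. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyWordDemo {
    // Regex stand-ins for: WORD : (~'\n')+;  and  NEWLINE : '\n';
    static final Pattern TOKEN =
            Pattern.compile("(?<WORD>[^\\n]+)|(?<NEWLINE>\\n)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String word = m.group("WORD");
            tokens.add(word != null ? "WORD(" + word + ")" : "NEWLINE");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [WORD(* this is a string), NEWLINE, WORD(* "first"), NEWLINE]
        System.out.println(tokenize("* this is a string\n* \"first\"\n"));
    }
}
```

The bullet never appears as its own token, which matches the "mismatched input" error reported for the original grammar.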

Resources