Parse a string using ANTLR4

Example: (CHGA/B234A/B231
String:
a) Designator: 3 LETTERS
b) Message number (OPTIONAL): 1 to 4 LETTERS, followed by A SLASH (/) followed by 1 to 4 LETTERS, followed by 3 NUMBERS indicating the serial number.
c) Reference data (OPTIONAL): 1 to 4 LETTERS, followed by A SLASH (/) followed by 1 to 4 LETTERS, followed by 3 NUMBERS indicating the serial number.
Result:
CHG
A/B234
A/B231
In grammar file:
/*
* Parser Rules
*/
tipo3: designador idmensaje? idmensaje?;
designador: PARENTHESIS CHG;
idmensaje: LETTER4 SLASH LETTER4 DIGIT3;
/*
* Lexer Rules
*/
CHG : 'CHG' ;
fragment DIGIT : [0-9] ;
fragment LETTER : [a-zA-Z] ;
SLASH : '/' ;
PARENTHESIS : '(' ;
DIGIT3 : DIGIT DIGIT DIGIT ;
LETTER4 : LETTER LETTER? LETTER? LETTER? ;
But when testing the tipo3 rule, it gives me the following message:
line 1:1 missing 'CHG' at 'CHGA'
How can I parse that string in ANTLR4?

When you're confused why a certain parser rule is not being matched, always start with the lexer. Dump what tokens your lexer is producing on the stdout. Here's how you can do that:
// I've placed your grammar in a file called T.g4 (hence the name `TLexer`)
// Requires: import org.antlr.v4.runtime.*;
String source = "(CHGA/B234A/B231";
TLexer lexer = new TLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill(); // force the lexer to tokenize the whole input
for (Token t : stream.getTokens()) {
    // print the token type's name next to the matched text
    System.out.printf("%-20s `%s`%n",
            TLexer.VOCABULARY.getSymbolicName(t.getType()),
            t.getText().replace("\n", "\\n"));
}
If you run the Java code above, this will be printed:
PARENTHESIS          `(`
LETTER4              `CHGA`
SLASH                `/`
LETTER4              `B`
DIGIT3               `234`
LETTER4              `A`
SLASH                `/`
LETTER4              `B`
DIGIT3               `231`
EOF                  `<EOF>`
As you can see, CHGA becomes a single LETTER4, not a CHG + LETTER4 token. Try changing LETTER4 into LETTER4 : LETTER; and re-test. Now you'll get the expected result.
In your current grammar CHGA will always become a single LETTER4. This is just how ANTLR works (the lexer tries to consume as many chars for a single rule as possible). You cannot change this.
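To see why CHGA ends up as one token, here is a minimal sketch of the "longest match wins, first rule wins on a tie" strategy in plain Java. It is not ANTLR's implementation: the class name MaximalMunchDemo and the regular expressions are assumptions that only approximate the lexer rules from the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: imitates the lexer's rule selection.
final class MaximalMunchDemo {
    // Rules in grammar order; each regex approximates one lexer rule.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("CHG", Pattern.compile("CHG"));
        RULES.put("SLASH", Pattern.compile("/"));
        RULES.put("PARENTHESIS", Pattern.compile("\\("));
        RULES.put("DIGIT3", Pattern.compile("[0-9]{3}"));
        RULES.put("LETTER4", Pattern.compile("[a-zA-Z]{1,4}"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() only matches at the start of the region.
                // The strict '>' keeps the earlier rule when lengths are equal.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = rule.getKey();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }
}
```

Calling MaximalMunchDemo.tokenize("(CHGA/B234A/B231") yields LETTER4:CHGA as the second token, because LETTER4 can match four characters at that position while CHG can only match three. For the input "CHG" alone both rules match three characters, and CHG wins because it is listed first.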
What you could do is move the construction of the multi-letter rule to the parser instead of the lexer:
tipo3 : designador idmensaje? idmensaje?;
designador : PARENTHESIS CHG;
idmensaje : letter4 SLASH letter4 DIGIT3;
letter4 : LETTER LETTER? LETTER? LETTER?
| CHG
;
CHG : 'CHG' ;
LETTER : [a-zA-Z] ;
SLASH : '/';
PARENTHESIS : '(';
DIGIT3 : DIGIT DIGIT DIGIT;
fragment DIGIT : [0-9];

Related

ANTLR4 ambiguity - how to solve

I would like to solve the following ambiguity:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input* EOF;
input
: '%' statement
| inputText
;
inputText
: ~('%')+
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;
Sample input:
%a=5;
aa bbbb
As soon as I put a space after "aa" with values like "bbbb" an ambiguity is created.
In fact I want inputText to contain the full string "aa bbbb".

There is no ambiguity. The input aa bbbb will always be tokenised as 2 Identifier tokens, no matter what any parser rule is trying to match, because the lexer operates independently from the parser.
Also, the rule:
inputText
: ~('%')+
;
does not match one or more characters other than '%'.
Inside parser rules, the ~ negates tokens, not characters. So ~'%' inside a parser rule will match any token, other than a '%' token. Inside the lexer, ~'%' matches any character other than '%'.
But creating a lexer rule like this:
InputText
: ~('%')+
;
will cause your example input to be tokenised as a single '%' token, followed by a large 2nd token that'd match this: a=5;\naa bbbb. This is how ANTLR's lexer works: match as many characters as possible (no matter what the parser rule is trying to match).
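The effect of such a lexer rule can be imitated with a character-level regular expression. This is only a sketch in plain Java, not ANTLR code: the class name NegatedSetDemo is made up, the pattern [^%]+ stands in for ~('%')+, and the other rules of the grammar are left out because the long InputText match would beat them here anyway.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR.
final class NegatedSetDemo {
    // "%" stands for the implicit '%' token,
    // "[^%]+" approximates the lexer rule InputText : ~('%')+ ;
    private static final Pattern TOKEN = Pattern.compile("%|[^%]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group()); // each match is one token's text
        }
        return tokens;
    }
}
```

For the sample input "%a=5;\naa bbbb" this returns two entries: "%" and "a=5;\naa bbbb". The statement after the '%' is swallowed by the catch-all token, so the statement rule never gets the Identifier, '=' and DecimalConstant tokens it needs.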

I found the solution:
grammar test;
WS : (' ' | '\t' | '\n' | '\r' | '\f')+ -> skip;
program
:
input EOF;
input
: inputText ('%' statement inputText)*
;
inputText
: ~('%')*
;
statement
: Identifier '=' DecimalConstant ';'
;
DecimalConstant
: [0-9]+
;
Identifier
: Letter LetterOrDigit*
;
fragment
Letter
: [a-zA-Z$##_.]
;
fragment
LetterOrDigit
: [a-zA-Z0-9$##_.]
;

antlr4 all words except the operators

grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR expression)+
| expression (NOT expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).

The lexer picks the rule that matches the most input; when several rules match the same amount of input, the one defined first in the grammar wins.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case, because otherwise the WORD rule, once invoked, will consume all following input.
As WORD is specified last, it will only consume input when none of the lexer rules defined above it can match.
EDIT: If you don't want your WORD rule to match arbitrary input, you just have to modify the rule I provided. The essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input, as long as you get their order in the grammar right.

Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' ; // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
If you want to support escape sequences in QUOTEDWORD, see this example for how to do it.
This grammar allows you:
- to have quoted words interpreted as string literals (preserving all spaces within)
- to have multiple words separated by whitespace (which is ignored)

antlr tokenizer starts with the last token

I have the following grammar:
grammar Aligner;
line
: emptyLine
| codeLine
;
emptyLine
: ( KW_EMPTY KW_LINE )?
( EOL | EOF )
;
codeLine
: KW_LINE COLON
indent
CODE
( EOL | EOF )
;
indent
: absolute_indent
| relative_indent
;
absolute_indent
: NUMBER
;
relative_indent
: sign NUMBER
;
sign
: PLUS
| MINUS
;
COLON: ':';
MINUS: '-';
PLUS: '+';
KW_EMPTY: 'empty';
KW_LINE: 'line';
NUMBER
: DIGIT+
;
EOL
: ('\n' | '\r\n')
;
SPACING
: LINE_WS -> skip
;
CODE
: (~('\n' | '\r'))+
;
fragment
DIGIT
: '0'..'9'
;
fragment
LINE_WS
: ' '
| '\t'
| '\u000C'
;
When I try to parse the input empty line, I receive the error: line 1:0 no viable alternative at input 'empty line'. When I debug what is going on, the very first token is of type CODE and includes the whole line.
What am I doing wrong?

ANTLR will try to match the longest possible token. When two lexer rules match the same string of a given length, the first rule that appears in the grammar wins.
Your rule CODE is basically a catch-all: it will match whole lines of text. So here ANTLR has the choice of matching empty line as one single token of type CODE, and as no other rule can produce a token of length 10, the CODE rule will consume the whole line.
You should rewrite the CODE rule to make it match only what you mean by a code. Right now it's way too broad.

How to disambiguate text with ANTLR4 that is sometimes two tokens, and sometime a third one?

I have a problem with an ANTLR4 grammar. I need to parse a text that contains 6 alphanumeric (AN) characters.
Based on the context of the text, it can represent:
- a 6-AN identifier (flight reservation number - PNR - which looks like 7B22MS, or JPN92Y, or similar),
- airline designator (two letters) + flight number (four digits), e.g. LH1856.
The problem is that if I create lexer rules that parse airline, number and PNR identifier like this:
Airline : 'A'..'Z''A'..'Z';
FlNum : ('0'..'9')('0'..'9')('0'..'9')('0'..'9');
PNR : ('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9');
then the PNR rule always wins and eats all the tokens that match its pattern.
How can I change this so that the Airline and FlNum will be parsed, if the context of the grammar needs them?

How about this:
AirlineAndFlNm : 'A'..'Z' 'A'..'Z' ('0'..'9')('0'..'9')('0'..'9')('0'..'9');
PNR : ('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9')('A'..'Z'|'0'..'9');
or more readable:
AirlineAndFlNm : LETTER LETTER DIGIT DIGIT DIGIT DIGIT ;
PNR : AlphaNum AlphaNum AlphaNum AlphaNum AlphaNum AlphaNum;
// fragments can only be used by other rules, will never create a token on their own
fragment LETTER: 'A'..'Z';
fragment DIGIT : '0'..'9';
fragment AlphaNum: LETTER | DIGIT ;
It should be easy to separate AirlineAndFlNm afterwards.
And since AirlineAndFlNm is placed before PNR, it will match if it can.
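Separating the combined token afterwards could look roughly like the sketch below, for example inside a listener or visitor that receives the token text. The class FlightDesignator and its field names are made up for illustration and are not part of the answer or of ANTLR.

```java
// Hypothetical value class: splits the text of an AirlineAndFlNm token.
final class FlightDesignator {
    final String airline;      // two upper-case letters, e.g. "LH"
    final String flightNumber; // four digits, e.g. "1856"

    private FlightDesignator(String airline, String flightNumber) {
        this.airline = airline;
        this.flightNumber = flightNumber;
    }

    static FlightDesignator parse(String tokenText) {
        // Same shape as the lexer rule: LETTER LETTER DIGIT DIGIT DIGIT DIGIT
        if (!tokenText.matches("[A-Z]{2}[0-9]{4}")) {
            throw new IllegalArgumentException("not an airline + flight number: " + tokenText);
        }
        return new FlightDesignator(tokenText.substring(0, 2), tokenText.substring(2));
    }
}
```

FlightDesignator.parse("LH1856") gives airline "LH" and flightNumber "1856", while a text such as "7B22MS" is rejected. The approach has one limit: a PNR that happens to consist of two letters followed by four digits is also tokenised as AirlineAndFlNm, so the parser rule that expects a PNR would have to accept both token types.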

Let the lexer tokenize single characters and promote these rules to parser rules:
// parser rules
airline
: LETTER LETTER
;
fl_num
: DIGIT DIGIT DIGIT DIGIT
;
pnr
: alpha_num alpha_num alpha_num alpha_num alpha_num alpha_num
;
alpha_num
: DIGIT
| LETTER
;
// lexer rules
DIGIT
: [0-9]
;
LETTER
: [A-Z]
;

ANTLR 4 lexer tokens inside other tokens

I have the following grammar for ANTLR 4:
grammar Pattern;
//parser rules
parse : string LBRACK CHAR DASH CHAR RBRACK ;
string : (CHAR | DASH)+ ;
//lexer rules
DASH : '-' ;
LBRACK : '[' ;
RBRACK : ']' ;
CHAR : [A-Za-z0-9] ;
And I'm trying to parse the following string
ab-cd[0-9]
The code parses out the ab-cd on the left which will be treated as a literal string in my application. It then parses out [0-9] as a character set which in this case will translate to any digit. My grammar works for me except I don't like to have (CHAR | DASH)+ as a parser rule when it's simply being treated as a token. I would rather the lexer create a STRING token and give me the following tokens:
"ab-cd" "[" "0" "-" "9" "]"
instead of these
"ab" "-" "cd" "[" "0" "-" "9" "]"
I have looked at other examples, but haven't been able to figure it out. Usually other examples have quotes around such string literals or they have whitespace to help delimit the input. I'd like to avoid both. Can this be accomplished with lexer rules or do I need to continue to handle it in the parser rules like I'm doing?

In ANTLR 4, you can use lexer modes for this.
STRING : [a-z-]+;
LBRACK : '[' -> pushMode(CharSet);
mode CharSet;
DASH : '-';
NUMBER : [0-9]+;
RBRACK : ']' -> popMode;
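After matching a [ character, the lexer operates in mode CharSet until a ] character is reached and the popMode command is executed. Note that modes are only allowed in lexer grammars, so these rules have to live in a separate lexer grammar rather than in the combined grammar.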
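How the mode switch changes tokenisation can be shown with a small hand-written lexer in plain Java. This is a simplified sketch and not what ANTLR generates: the class ModeLexerDemo and the Mode enum are assumptions, and the mode stack is modelled with an ordinary Deque.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of ANTLR: imitates pushMode/popMode.
final class ModeLexerDemo {
    enum Mode { DEFAULT, CHAR_SET }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            int start = pos;
            if (modes.peek() == Mode.DEFAULT) {
                if (c == '[') {                      // LBRACK : '[' -> pushMode(CharSet);
                    tokens.add("LBRACK:[");
                    modes.push(Mode.CHAR_SET);
                    pos++;
                } else if (isStringChar(c)) {        // STRING : [a-z-]+;
                    while (pos < input.length() && isStringChar(input.charAt(pos))) pos++;
                    tokens.add("STRING:" + input.substring(start, pos));
                } else {
                    throw new IllegalArgumentException("unexpected '" + c + "' at " + pos);
                }
            } else {
                if (c == ']') {                      // RBRACK : ']' -> popMode;
                    tokens.add("RBRACK:]");
                    modes.pop();
                    pos++;
                } else if (c == '-') {               // DASH : '-';
                    tokens.add("DASH:-");
                    pos++;
                } else if (isDigit(c)) {             // NUMBER : [0-9]+;
                    while (pos < input.length() && isDigit(input.charAt(pos))) pos++;
                    tokens.add("NUMBER:" + input.substring(start, pos));
                } else {
                    throw new IllegalArgumentException("unexpected '" + c + "' at " + pos);
                }
            }
        }
        return tokens;
    }

    private static boolean isStringChar(char c) { return (c >= 'a' && c <= 'z') || c == '-'; }
    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
}
```

ModeLexerDemo.tokenize("ab-cd[0-9]") returns STRING:ab-cd, LBRACK:[, NUMBER:0, DASH:-, NUMBER:9, RBRACK:]. The dash is part of STRING in the default mode but a separate DASH token inside the brackets, which is the behaviour the question asks for.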
