lexing multiline define statements with antlr4

I am trying to write a lexer to do preprocessing, which can handle multi-line #define statements. For example, consider the following input, where the multi-line definition is terminated by a subsequent empty line (which may contain whitespace, though):
aa(bb);
#define XX pqr
#define YY pqr \
+abc
class p(XX,YY,zz);
endclass
The first step is to tokenize the input stream so that, for any definition, the value is obtained as one string token; e.g., for YY, I am trying to get "pqr+abc" as its string value. I have written the following lexer rules for tokenizing:
DEF: '#define' -> pushMode(def_mode);
ID: Letter (Letter | DecDigit)* ;
COMMENT : '/*' .*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT : '//' ~('\n'|'\r')* NL -> channel(HIDDEN);
WS: ( ' ' |'\t' | NL )+ -> channel(HIDDEN) ;
SEMICOLN: ';' ;
COMMA: ',' ;
OB: '(' ;
CB: ')' ;
PLUS: '+' ;
fragment NL : '\r'? '\n' ;
fragment DecDigit: '0'..'9' ;
fragment Letter: 'A'..'Z' | 'a'..'z' | '_' ;
mode def_mode;
STR2: '\r'? '\n' -> popMode;
STR1: ~('\n'|'\r')* '\r'? '\n' ;
The above lexer gives the following tokens for the #define lines:
[#10,26:32='#define',<2>,6:0]
[#11,33:42=' YY pqr \\n',<13>,6:7]
[#12,43:51=' +abc\n',<13>,7:0]
[#13,52:52='\n',<12>,8:0]
[#14,53:57='class',<3>,9:0]
The above tokens are obtained only if there is an "empty" line after the #define lines. If there is some whitespace in that line, i.e. it is not really empty, then the mode is not exited. Here are the tokens when that line contains whitespace:
[#10,26:32='#define',<2>,6:0]
[#11,33:42=' YY pqr \\n',<13>,6:7]
[#12,43:51=' +abc\n',<13>,7:0]
[#13,52:55=' \n',<13>,8:0]
[#14,56:74='class p(XX,YY,zz);\n',<13>,9:0]
[#15,75:83='endclass\n',<13>,10:0]
[#16,84:84='\n',<12>,11:0]
[#17,85:84='<EOF>',<-1>,12:0]
Also, the lexer is not joining the two continued lines into one token. How can I fix these errors?

I am not an expert (one grammar per year) and not a fan of modes, and I don't like to do too much processing in the lexer; the parser has many more capabilities. Look at this:
grammar Question;
/* Parsing preprocessor #define */
program
: statement+
;
statement
: aClass
| function
| preprocessor
;
aClass
: 'class'
classDef // classBody
'endclass'
;
classDef
: ID '(' list ')' ';'
;
function
: ID '(' list ')' ';'
;
preprocessor
: DEFINE ID replacement
{System.out.println($DEFINE.text + " value=" + $ID.text + " -> replaced by " + $replacement.text);}
;
replacement
: expr+
;
expr
: ID
| ID '+' ID
;
list
: ID ( ',' ID )*
;
ID : LETTER ALPHAMERIC* ;
DEFINE
: '#' 'define' ;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT
: '//' ~('\r' | '\n')* -> channel(HIDDEN) ;
WS : [ \t\r\n]+ -> channel(HIDDEN) ; // keep spaces in $<rule>.text
//WS : [ \t\r\n]+ -> skip ;
CONTINUATION
// if you want to keep the exact value including NL :
// : '\\' '\r'? '\n' -> channel(HIDDEN) ;
// to discard the continuation character :
: '\\' '\r'? '\n' -> skip ; // ignored as in Ruby, concatenates two lines
fragment LETTER : [a-zA-Z_] ;
fragment DIGIT : [0-9] ;
fragment ALPHAMERIC : LETTER | DIGIT ;
With the input data.txt:
/* function
call */
aa(bb);
#define XX pqr
#define YY long replacement value
// multiline :
#define ZZ stu \
+abc
class p(XX,YY,zz);
endclass
#define WW vwx \
+def
// preceding line contains 10 spaces
$ hexdump -C data.txt
...
000000b0 2b 64 65 66 0a 20 20 20 20 20 20 20 20 20 20 0a |+def.          .|
000000c0 2f 2f 20 70 72 65 63 65 64 69 6e 67 20 6c 69 6e |// preceding lin|
000000d0 65 20 63 6f 6e 74 61 69 6e 73 20 31 30 20 73 70 |e contains 10 sp|
000000e0 61 63 65 73 |aces|
000000e4
the output is:
$ grun Question program -tokens data.txt
[#0,0:26='/* function\n call */',<COMMENT>,channel=1,1:0]
[#1,27:27='\n',<WS>,channel=1,2:15]
[#2,28:29='aa',<ID>,3:0]
...
[#8,36:42='#define',<DEFINE>,4:0]
[#9,43:43=' ',<WS>,channel=1,4:7]
[#10,44:45='XX',<ID>,4:8]
[#11,46:46=' ',<WS>,channel=1,4:10]
[#12,47:49='pqr',<ID>,4:11]
[#13,50:50='\n',<WS>,channel=1,4:14]
...
#define value=XX -> replaced by pqr
#define value=YY -> replaced by long replacement value
#define value=ZZ -> replaced by stu +abc
#define value=WW -> replaced by vwx +def
If you use the skip version of WS, you'll have pqr+abc but also longreplacementvalue. I leave this point to your astuteness.
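If you do want to stay with the original mode-based lexer, here is a rough, untested sketch of a def_mode that tackles the two reported problems (continued lines not joined, and the mode not popping on a whitespace-only line). The rule names are placeholders, and the macro name plus the literal backslash-newline still end up inside the token text, so getting exactly "pqr+abc" would still need a custom action or post-processing:
mode def_mode;
// absorb '\'-newline continuations into the body token itself,
// so that "pqr \" followed by "+abc" comes back as one token
DEF_BODY : ( ~[\\\r\n] | '\\' '\r'? '\n' )+ ;
// pop at the end of the (possibly continued) line, also swallowing
// any following blank or whitespace-only lines
DEF_END : '\r'? '\n' ( [ \t]* '\r'? '\n' )* -> popMode ;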

Related

Unexpected parser behaviour when adding an option (symbol: '1' | '2';)

The grammar below matches inputs 1 and 2 but not 3:
ма́ма жо 1a
ра́ма ж 1a
хлеб м 1c
grammar Hello;
entry
: headword WS definition EOF
;
headword
: LETTER (LETTER | STRESS_MARK | '-')*
;
definition
: main_symbol WS index_number index_letter
;
main_symbol
: 'жо'
| 'ж'
;
index_number
: '1'
;
index_letter
: 'a'
| 'b'
| 'c'
| 'd'
| 'e'
| 'f'
;
WS : [ \t] ;
LETTER : [а-яА-ЯёЁ] ;
STRESS_MARK : [\u0300\u0301] ;
Obviously, no. 3 is not matched because 'м' is not a valid main_symbol. Now if I add 'м' to main_symbol like this:
main_symbol
: 'жо'
| 'ж'
| 'м'
;
Test no. 3 will pass, but this also makes 1 and 2 fail. Why?
I think it's already answered at https://stackoverflow.com/a/69416290/10109396
The following parser rule creates 3 anonymous lexer rules: 'жо', 'ж', and 'м'.
main_symbol
: 'жо'
| 'ж'
| 'м'
;
They occur before the LETTER lexer rule, so the ANTLR4 lexer prefers matching them over LETTER. You can see this in the grun -tokens output: 'м' is a 'м' token instead of a LETTER token.
root#antlr:~# grun Hello entry -tree -tokens
ма́ма жо 1a
line 1:11 token recognition error at: '\n'
[#0,0:0='м',<'м'>,1:0]
[#1,1:1='а',<LETTER>,1:1]
[#2,2:2='́',<STRESS_MARK>,1:2]
[#3,3:3='м',<'м'>,1:3]
[#4,4:4='а',<LETTER>,1:4]
[#5,5:5=' ',<WS>,1:5]
[#6,6:7='жо',<'жо'>,1:6]
[#7,8:8=' ',<WS>,1:8]
[#8,9:9='1',<'1'>,1:9]
[#9,10:10='a',<'a'>,1:10]
[#10,12:11='<EOF>',<EOF>,2:0]
line 1:0 extraneous input 'м' expecting LETTER
line 1:3 missing WS at 'м'
line 1:4 extraneous input 'а' expecting WS
line 1:6 mismatched input 'жо' expecting {'-', WS, LETTER, STRESS_MARK}
(entry (headword м а ́) <missing WS> (definition (main_symbol м) а (index_number жо 1) (index_letter a)) <EOF>)
The solution is to add a parser rule like the following and use it instead of the LETTER lexer rule.
letter
: LETTER
| 'жо'
| 'ж'
| 'м'
;
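With that rule in place, the rest of the grammar refers to letter instead of LETTER. A sketch, assuming headword is the only rule that used LETTER directly:
headword
: letter ( letter | STRESS_MARK | '-' )*
;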

Antlr4 float number

I am trying to use ANTLR4 to parse input from the user but am having a hard time.
I want to get a list of numbers. Here is part of my grammar:
number
: DEC
| FLOAT
| HEX
| BIN
;
FLOAT : DIGIT? '.' DIGIT*;
DEC : DIGIT+ ;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+ -> skip;
When the input is 1 .2 3.2, I correctly get 1, .2 and 3.2.
But if the input is 1.2.3, it is incorrectly recognized as the two numbers 1.2 and .3.
How can I change the grammar to fix this?
The FLOAT rule seems wrong. I have updated the number and FLOAT definitions. The code below works only for single numbers.
number
: (FLOAT | DEC | HEX | BIN) EOF
;
FLOAT : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
DEC : DIGIT+;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+ -> skip;
In most complex grammars there are other tokens (signs, parentheses, etc.), so tokens are naturally separated and whitespace can simply be skipped. However, your grammar has only numbers, and I cannot separate the tokens while skipping spaces, so I discarded the whitespace skip and made WS an explicit token. The code below handles many numbers and fails on input like 1.2.3, because 1.2.3 is tokenized as FLOAT 1.2 immediately followed by FLOAT .3 with no WS token in between, which the numbers rule rejects. Test it with numbers and don't process the WS tokens.
numbers
: WS? number (WS number)* WS? EOF;
number
: (FLOAT | DEC | HEX | BIN)
;
FLOAT : DIGIT+ '.' DIGIT*
| '.' DIGIT+
;
DEC : DIGIT+;
HEX : '0' [xX] ([A-Fa-f] | DIGIT)+ ;
BIN : '0' [bB] [01]+ ;
fragment ALPHA: [a-zA-Z_];
fragment DIGIT : [0-9];
WS : [ ,\t\r\n]+;
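To illustrate the point above about other tokens: in a grammar that also has operator and parenthesis tokens (hypothetical syntax, not part of the question), two numbers can never be adjacent in valid input, so WS can simply be skipped again; an input like 1.2.3 (FLOAT 1.2 immediately followed by FLOAT .3) then can never form a single valid expression:
// hypothetical syntax; FLOAT, DEC, HEX and BIN as defined above
expr
: expr ('*' | '/') expr
| expr ('+' | '-') expr
| '(' expr ')'
| number
;
number
: FLOAT | DEC | HEX | BIN
;
WS : [ \t\r\n]+ -> skip ;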

Can I make my ANTLR4 Lexer discard a character from the input stream?

I'm working on parsing PDF streams. In section 7.3.4.2 on literal string objects, the PDF Reference says that a backslash within a literal string that isn't followed by an end-of-line character, one to three octal digits, or one of the characters "nrtbf()\" should be ignored. Is there a way to get the recover method in my lexer to ignore a backslash in this situation?
Here is my simplified parser:
parser grammar PdfStreamParser;
options { tokenVocab=PdfStreamLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
Here's the lexer. (Based on the advice in this answer I'm not using a separate string mode):
lexer grammar PdfStreamLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
LITERAL_STRING: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | LITERAL_STRING )* ')' ;
// Escape sequences that can occur within a LITERAL_STRING
fragment ESCAPE_SEQUENCE
: '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? )
;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ; // Hexadecimal data enclosed in angle brackets
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
I can override the recover method in the PdfStreamLexer class and get notified when the LexerNoViableAltException occurs, but I'm not sure how to (or if it's possible to) ignore the backslash and continue on with the LITERAL_STRING tokenization.
To be able to skip part of the string, you'll need to use lexical modes. Here's a quick demo:
lexer grammar DemoLexer;
STRING_OPEN
: '(' -> pushMode(STRING_MODE)
;
SPACES
: [ \t\r\n] -> skip
;
OTHER
: .
;
mode STRING_MODE;
STRING_CLOSE
: ')' -> popMode
;
ESCAPE
: '\\' ( [nrtbf()\\] | [0-7] [0-7]? [0-7]? ) // one to three octal digits, per PDF 7.3.4.2
;
STRING_PART
: ~[\\()]
;
NESTED_STRING_OPEN
: '(' -> type(STRING_OPEN), pushMode(STRING_MODE)
;
IGNORED_ESCAPE
: '\\' . -> skip
;
which can be used in the parser as follows:
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
parse
: ( string | OTHER )* EOF
;
string
: STRING_OPEN ( ESCAPE | STRING_PART | string )* STRING_CLOSE
;
If you now parse the string FU(abc(def)\#\))BAR, you will get a parse tree in which the \) is kept but the \# is omitted.
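Grafting the same idea back onto the original PdfStreamLexer could look roughly like this (an untested sketch; the mode and token names are made up). One deliberate difference from the demo: only the stray backslash itself is skipped, so the character after it is kept, which is what the quoted section 7.3.4.2 asks for:
// sketch: replaces the single LITERAL_STRING rule in PdfStreamLexer
LITERAL_STRING_OPEN : '(' -> pushMode(LITERAL_STRING_MODE) ;
mode LITERAL_STRING_MODE;
LITERAL_STRING_CLOSE : ')' -> popMode ;
NESTED_STRING_OPEN : '(' -> type(LITERAL_STRING_OPEN), pushMode(LITERAL_STRING_MODE) ;
STRING_ESCAPE : '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? ) ;
STRING_CHARS : ~[\\()]+ ;
STRAY_BACKSLASH : '\\' -> skip ; // drop only the backslash; the next character is kept
The parser's string rule is then built from these tokens, as in the demo, e.g. string : LITERAL_STRING_OPEN ( STRING_ESCAPE | STRING_CHARS | string )* LITERAL_STRING_CLOSE ;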

how to use a sequence for tokenization but not return it as part of the result

I would like to use a rule or sequence as a separator to tokenize a file, but not return the separator.
I tried using -> channel(hidden), but that messes up the parsing.
I have a grammar such that
grammar test;
file
: l1 l2? l3
;
l1
: 'L1:' STRING_LITERAL '\n'
;
l2
: 'L2:'(NUMBER)+ '\n'
;
l3
:'L3:' WORD|NUMBER '\n'
;
NUMBER : [0-9]+ ;
STRING_LITERAL : '"' (~["\\\r\n] | EscapeSequence)* '"';
WORD : ('a'..'z' | 'A'..'Z')+;
fragment EscapeSequence
: '\\' [btnfr"'\\]
| '\\' ([0-3]? [0-7])? [0-7]
;
and an input file like
L1: "SO LONG"
L2: 42
L3: FISH
I'd like to not return L1: L2: and L3: but do return "SO LONG" 42 and FISH
I get the tokens I'm looking for but I also get \n L1: L2: and L3:
Also, I noticed that if I set the l1 rule to l1 : (~["\\r\n])* ; I can match up to the end of the line without a problem, but I get every word as a separate token. This makes sense to me, but is there a way to get the whole line as a single token?
If you want to be able to use these L1: tokens inside the parser, then there's no way to remove them; I don't see a real use-case for that anyway. But I don't see why you can't just skip (or hide) these tokens in the lexer. This seems to work just fine:
parse
: NL* line ( NL+ line )* NL* EOF
;
line
: l1
| l2
| l3
;
l1 : STRING_LITERAL;
l2 : NUMBER+;
l3 : ( WORD | NUMBER );
NUMBER : [0-9]+;
STRING_LITERAL : '"' ( ~["\\\r\n] | EscapeSequence )* '"';
WORD : [a-zA-Z]+;
IGNORED
: 'L' [0-9] ':' -> skip
;
SPACES
: [ \t]+ -> skip
;
NL
: '\r'? '\n'
;
fragment EscapeSequence
: '\\' [btnfr"'\\]
| '\\' ([0-3]? [0-7])? [0-7]
;
resulting in a parse tree where each line rule contains just the wanted token ("SO LONG", 42 and FISH), with the L1:/L2:/L3: markers and the spaces gone.
[...] so I should be able to do something like if (parser.l1() == "SO LONG") then do something
That is not how ANTLR works. The parser produces a parse tree (containing all the tokens you defined), and that parse tree can then be used to extract values, either by walking it manually or by using ANTLR's listener (or visitor) classes: https://github.com/antlr/antlr4/blob/master/doc/listeners.md
This is my suggestion: do not skip the line break and L1: tokens in the lexer, and use a listener or visitor to retrieve the data from your parse tree.

Not Able to Recognize Strings and Characters in ANTLR

In my ANTLR code, we should be able to recognize strings, characters, hexadecimal numbers, etc.
However, when I test it like this:
grun A1_lexer tokens -tokens test.txt
With my test.txt file containing a simple string such as "pineapple", it is unable to recognize the tokens.
In my lexer, I define the following helper tokens:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9'] ;
fragment Digit: ['0'-'9'] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
And I define the following tokens:
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Id: Alpha Alpha_num* ;
I run it like this:
grun A1_lexer tokens -tokens test.txt
And it outputs this:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really wondering what the problem is and how I could fix it.
Thanks.
UPDATE 1:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
I have updated the code and got rid of the unnecessary single quotes in my Char classification. However, I get the same output as before.
UPDATE 2:
Even when I make the suggested changes, I still get the same error. I believed the problem was that I was not recompiling, but I am. These are the steps I take to recompile:
antlr4 A1_lexer.g4
javac A1_lexer*.java
chmod a+x build.sh
./build.sh
grun A1_lexer tokens -tokens test.txt
With my build.sh file looking like this:
#!/bin/bash
FILE="A1_lexer"
ANTLR=$(echo $CLASSPATH | tr ':' '\n' | grep -m 1 "antlr-4.7.1-complete.jar")
java -jar $ANTLR $FILE.g4
javac $FILE*.java
Even when I recompile, my ANTLR code is still unable to recognize the tokens.
My code is also now like this:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
UPDATE 3:
Grammar:
program
:'class Program {'field_decl* method_decl*'}'
field_decl
: type (id | id'['int_literal']') ( ',' id | id'['int_literal']')*';'
| type id '=' literal ';'
method_decl
: (type | 'void') id'('( (type id) ( ','type id)*)? ')'block
block
: '{'var_decl* statement*'}'
var_decl
: type id(','id)* ';'
type
: 'int'
| 'boolean'
statement
: location assign_op expr';'
| method_call';'
| 'if ('expr')' block ('else' block )?
| 'switch' expr '{'('case' literal ':' statement*)+'}'
| 'while (' expr ')' statement
| 'return' ( expr )? ';'
| 'break ;'
| 'continue ;'
| block
assign_op
: '='
| '+='
| '-='
method_call
: method_name '(' (expr ( ',' expr )*)? ')'
| 'callout (' string_literal ( ',' callout_arg )* ')'
method_name
: id
location
: id
| id '[' expr ']'
expr
: location
| method_call
| literal
| expr bin_op expr
| '-' expr
| '!' expr
| '(' expr ')'
callout_arg
: expr
| string_literal
bin_op
: arith_op
| rel_op
| eq_op
| cond_op
arith_op
: '+'
| '-'
| '*'
| '/'
| '%'
rel_op
: '<'
| '>'
| '<='
| '>='
eq_op
: '=='
| '!='
cond_op
: '&&'
| '||'
literal
: int_literal
| char_literal
| bool_literal
id
: alpha alpha_num*
alpha
: ['a'-'z''A'-'Z''_']
alpha_num
: alpha
| digit
digit
: ['0'-'9']
hex_digit
: digit
| ['a'-'f''A'-'F']
int_literal
: decimal_literal
| hex_literal
decimal_literal
: digit+
hex_literal
: '0x' hex_digit+
bool_literal
: 'true'
| 'false'
char_literal
: '‘'char'’'
string_literal
: '“'char*'”'
test.txt :
"pineapple"
A1_lexer:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
What I Write in Terminal:
grun A1_lexer tokens -tokens test.txt
Output in Terminal:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really not sure why this is happening.
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9']
['a'-'z'] doesn't mean "a to z", it means "a single quote, or a, or a single quote to a single quote, or z, or a single quote", which simplifies to just "a single quote, a or z". What you want is just [a-z] without the quotes and the same applies to the other character classes as well - except that they also contain spaces, so it's "single quote, A, single quote, space to space, single quote, Z, or single quote" etc. Also you don't need to "or" character classes, you can just write everything in one character class like this: [a-zA-Z0-9] (like you already did for the Alpha rule).
The same applies to the Digit rule as well.
Note that it's a bit unusual to only allow these specific characters inside quotes. Usually you'd allow everything that isn't an unescaped quote or an invalid escape sequence. But of course that all depends on the language you're parsing.
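For reference, pulling this advice together into a minimal, self-contained lexer grammar (a sketch only; the WS rule is an addition, because the posted lexer has no rule for spaces or for the newline at the end of test.txt, which would otherwise still be reported as token recognition errors):
lexer grammar A1_lexer;
Char_literal : Single_quote Char Single_quote ;
String_literal : Double_quote Char* Double_quote ;
Decimal_literal : Digit+ ;
Id : Alpha Alpha_num* ;
WS : [ \t\r\n]+ -> skip ; // added: whitespace is otherwise a token recognition error
fragment Alpha : [a-zA-Z_] ;
fragment Digit : [0-9] ;
fragment Char : [a-zA-Z0-9] ;
fragment Alpha_num : Alpha | Digit ;
fragment Single_quote : '\'' ;
fragment Double_quote : '"' ;
fragment Hex_digit : Digit | [a-fA-F] ;
With this, grun A1_lexer tokens -tokens test.txt should report "pineapple" (with its quotes) as a single String_literal token.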
