ANTLR Grammar to get a Sentence as Single Token

I am trying to parse a PlantUML sequence diagram using an ANTLR grammar.
I am able to generate an AST that meets my needs.
The only issue I am facing is extracting the activity name.
PlantUML SequenceDiagram grammar:
grammar SequenceDiagram;
uml:
'#startuml'
('autonumber')?
('hide footbox')?
(NEWLINE | sequence_diagram)
'#enduml'
;
sequence_diagram:
(node | relationship | NEWLINE)*
;
node:
('actor' | 'entity') ident 'as' ident;
relationship:
action
(NEWLINE* note)?
;
action:
left=ident
arrow
right=ident
':'
lable=ident
;
note:
'note' direction ':' ident
;
ident:
IDENT;
label:
ident (ident)+ ~(NEWLINE);
direction:
('left'|'right');
arrow:
('->'|'-->'|'<-'|'<--');
IDENT : NONDIGIT ( DIGIT | NONDIGIT )*;
NEWLINE : [\r\n]+ -> skip ;
COMMENT :
('/' '/' .*? '\n' | '/*' .*? '*/') -> channel(HIDDEN)
;
WS : [ ]+ -> skip ; // toss out whitespace
fragment NONDIGIT : [_a-zA-Z];
fragment DIGIT : [0-9];
fragment UNSIGNED_INTEGER : DIGIT+;
Sample SequenceDiagram code:
#startuml
actor Alice as al
entity Bob as b
Alice -> Bob: Authentication_Request
Bob --> Alice: Authentication_Response
Alice -> Bob: Another_Authentication_Request
Alice <-- Bob: Another_Authentication_Response
note right: example_note
#enduml
Generated AST: (image not preserved)
Note that the labels (Authentication_Request, Authentication_Response, etc.) are each a single word; that is my workaround.
I would like to write them with spaces instead: "Authentication Request", "Authentication Response", etc.
I am unable to figure out how to get each such label as a single token.
Any help would be appreciated.
Edit 1:
How do I extract the description part in the actor and usecase declarations?
I need to extract Chef, "Food Critic", "Eat Food", ..., "Drink", ..., Test1 from:
package Professional {
actor Chef as c
actor "Food Critic" as fc
}
package Restaurant {
usecase "Eat Food" as UC1
usecase "Pay for Food" as UC2
usecase "Drink" as UC3
usecase "Review" as UC4
usecase Test1
}
SOLUTION for the above edit:
node:
('actor' | 'usecase') (ident | description) 'as'? ident?;
description:
DESCRIPTION;
DESCRIPTION: '"' ~["]* '"';

Perhaps use the ':' and EOL as delimiters. Looking at the PlantUML site, this seems to be how labels are used (at least for sequence diagrams).
You'd need to drop the ':' part of your action rule (and strip the leading ':' when using the text of the LABEL token). You could avoid this with a lexer mode, but that seems like overkill.
The plantUML site includes this example:
#startuml
Alice -> Alice: This is a signal to self.\nIt also demonstrates\nmultiline \ntext
#enduml
So you'll need to be pretty flexible about what you accept in the LABEL token; it's not just one or more IDENTs. I'm using a rule that picks up everything from the ':' until the EOL.
action:
left=ident
arrow
right=ident
LABEL
;
LABEL: ':' ~[\r\n]*;
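To see what the text of such a LABEL token looks like, and what stripping the leading ':' involves, here is a minimal Python sketch. It is not part of the original answer: the regular expression only mirrors the lexer rule above, and label_text is a hypothetical helper name.

```python
import re
from typing import Optional

# Mirrors the lexer rule LABEL: ':' ~[\r\n]*  (a colon, then everything up to the end of the line).
LABEL = re.compile(r":[^\r\n]*")

def label_text(line: str) -> Optional[str]:
    """Return the label of one sequence-diagram line without the leading ':' (hypothetical helper)."""
    match = LABEL.search(line)
    if match is None:
        return None  # the line has no label at all
    # The token text still starts with ':'; drop it and trim the surrounding spaces.
    return match.group(0)[1:].strip()

print(label_text("Alice -> Bob: Authentication Request"))
print(label_text("actor Alice as al"))
```

One thing to watch, assuming the rest of the question's grammar stays unchanged: the note rule also uses ':' followed by ident, and a LABEL rule like this would match that text too, because the lexer prefers the longest match. The note rule would probably need to use LABEL as well.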

Related

How to define ANTLR Parser Rule for concatenated tokens

I need to match a token that can be combined from two parts:
"string" + any number; e.g. string64, string128, etc.
In the lexer rules I have
STRING: S T R I N G;
NUMERIC_LITERAL:
((DIGIT+ ('.' DIGIT*)?) | ('.' DIGIT+)) (E [-+]? DIGIT+)?
| '0x' HEX_DIGIT+;
In the parser, I defined
type_id_string: STRING NUMERIC_LITERAL;
However, the parser does not match and stops, expecting a STRING token.
How do I tell the parser that token has two parts?
BR

You probably have some "identifier" rule like this:
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
which will cause input like string64 to be tokenized as a single ID token and not as a STRING token followed by a NUMERIC_LITERAL token.
Also, trying to match this sort of thing in a parser rule like:
type_id_string: STRING NUMERIC_LITERAL;
will go wrong when you're discarding whitespace in the lexer. If the input were "string 64" (string + space + 64), it could then be matched by type_id_string, which is probably not what you want.
Either do:
type_id_string
: ID
;
or define a dedicated token in the lexer:
type_id_string
: TYPE_ID_STRING
;
// Important to match this before the `ID` rule!
TYPE_ID_STRING
: [a-zA-Z]+ [0-9]+
;
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
However, when doing that, input like fubar1 will also become a TYPE_ID_STRING and not an ID!
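The effect of rule order and match length can be checked with a small Python model of how an ANTLR lexer chooses a token: the longest match wins, and on a tie the rule listed first wins. This is only a sketch under assumptions: first_token is a hypothetical helper, NUMERIC_LITERAL is simplified to plain integers, and TYPE_ID_STRING is assumed to mean one or more letters followed by digits.

```python
import re

def first_token(rules, text):
    """Choose a token the way an ANTLR lexer does: longest match, earlier rule on a tie."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # The strict '>' keeps the earlier rule when two matches have the same length.
        if m and len(m.group(0)) > best_len:
            best_name, best_len = name, len(m.group(0))
    return best_name, text[:best_len]

ORIGINAL = [
    ("STRING", r"[sS][tT][rR][iI][nN][gG]"),
    ("NUMERIC_LITERAL", r"[0-9]+"),            # simplified: integers only
    ("ID", r"[a-zA-Z_][a-zA-Z0-9_]*"),
]
# Assumption: TYPE_ID_STRING accepts one or more letters followed by digits.
WITH_TYPE_ID = [("TYPE_ID_STRING", r"[a-zA-Z]+[0-9]+")] + ORIGINAL

for word in ("string64", "fubar1", "fubar"):
    print(word, first_token(ORIGINAL, word)[0], first_token(WITH_TYPE_ID, word)[0])
```

With the original rules, string64 comes out as one ID token, because ID matches all eight characters while STRING matches only six. With TYPE_ID_STRING listed first, both string64 and fubar1 become TYPE_ID_STRING, and plain fubar stays an ID.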

Defining Antlr lexer rule with termination condition

I need to parse two tokens that are separated by '2/'. Both tokens consist of alphanumeric characters and have no fixed length.
Examples: Abcd34D22/ERTD34D or ABCD2/DEF
Desired output: TOKEN1 = 'Abcd34D2', SEPARATOR = '2/', TOKEN2 = 'ERTD34D'
I would like to know if there is a way to define a lexer rule for TOKEN1 that resolves the ambiguity, so that a 2 followed by / is matched as SEPARATOR. Below are sample token definitions for illustration.
fragment ALPHANUM: [0-9A-Za-z];
fragment SLASH: '/';
TOKEN1 : (ALPHANUM)+;
SEPARATOR : '2' SLASH -> mode(TOKEN2_MODE);
mode TOKEN2_MODE;
TOKEN2 : (ALPHANUM)+;

AFAIK, you'll have to use a predicate, which means you'll have to add some target-specific code to your grammar. If your target language is Java, you could do something like this:
TOKEN1
: TOKEN1_ATOM+
;
fragment TOKEN1_ATOM
: [013-9A-Za-z] // match a single alpha-num except '2'
| '2' {_input.LA(1) != '/'}? // only match `2` if there's no '/' after it
;
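Outside ANTLR, the same idea can be tried quickly with a regular expression, where a negative lookahead stands in for the predicate. The following Python sketch is not part of the original answer: the names TOKEN1, SEPARATOR and TOKEN2 are reused from the grammar, split_tokens is a hypothetical helper, and a backtracking regex is only an approximation of what the lexer does.

```python
import re

# TOKEN1_ATOM: any alphanumeric except '2', or a '2' that is not followed by '/'.
# The negative lookahead (?!/) plays the role of the predicate {_input.LA(1) != '/'}?.
TOKEN1 = r"(?:[013-9A-Za-z]|2(?!/))+"
SEPARATOR = r"2/"
TOKEN2 = r"[0-9A-Za-z]+"
PATTERN = re.compile(f"({TOKEN1})({SEPARATOR})({TOKEN2})")

def split_tokens(text):
    """Return (TOKEN1, SEPARATOR, TOKEN2) for the input, or None if it does not fit."""
    m = PATTERN.fullmatch(text)
    return m.groups() if m else None

print(split_tokens("Abcd34D22/ERTD34D"))
print(split_tokens("ABCD2/DEF"))
```

For the first example the '2' at the end of Abcd34D2 is kept in TOKEN1 because it is followed by another '2', while the next '2' is followed by '/' and is left for SEPARATOR.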

ANTLR4 rules priority

I'm trying to get a simple grammar to work using ANTLR4. Basically a list of keywords separated by ; that can be negated using Not. Something like this, for example:
Not negative keyword;positive
I wrote the following grammar:
grammar input;
input : clauses;
keyword : NOT? WORD;
clauses : keyword (SEPARATOR clauses)?;
fragment N : ('N'|'n') ;
fragment O : ('O'|'o') ;
fragment T : ('T'|'t') ;
fragment SPACE : ' ' ;
SEPARATOR : ';';
NOT : N O T SPACE;
WORD : ~[;]+;
My issue is that in the keyword rule, WORD seems to take priority over NOT: "Not something" is recognized as the single word "Not something" instead of a negated "something".
For instance, the parse tree I get is this: (image not preserved)
What I'm trying to achieve is something like this: (image not preserved)
How can you give one rule priority over another in ANTLR4? Any tip on fixing this?
Please note that while this grammar is very simple and ANTLR4 can seem unnecessary here, the true grammar I want to make is more complex and I have just simplified it here to demonstrate my issue.
Thank you for your time!

You have no explicit whitespace rule, and your WORD rule includes whitespace. Yet you want words to be separated by whitespace. That cannot work. Don't include whitespace in words (that's against the usual meaning of a word anyway). Instead, specify exactly what a word is (usually a combination of letters and digits that does not start with a digit). Additionally, I would restructure the grammar so that positive and negative are not part of keyword but separate entities. Here I defined them as keywords of their own, but if that is not what you want, replace them with just WORD:
grammar input;
input : clauses EOF;
keyword : NOT? (POSITIVE | NEGATIVE) WORD?;
clauses : keyword (SEPARATOR keyword)*;
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
fragment F: [fF];
fragment G: [gG];
fragment H: [hH];
fragment I: [iI];
fragment J: [jJ];
fragment K: [kK];
fragment L: [lL];
fragment M: [mM];
fragment N: [nN];
fragment O: [oO];
fragment P: [pP];
fragment Q: [qQ];
fragment R: [rR];
fragment S: [sS];
fragment T: [tT];
fragment U: [uU];
fragment V: [vV];
fragment W: [wW];
fragment X: [xX];
fragment Y: [yY];
fragment Z: [zZ];
SEPARATOR : ';';
NOT : N O T;
POSITIVE : P O S I T I V E;
NEGATIVE : N E G A T I V E;
fragment LETTER: DIGIT | LETTER_NO_DIGIT;
fragment LETTER_NO_DIGIT: [a-zA-Z_$\u0080-\uffff];
WORD: LETTER_NO_DIGIT LETTER*;
WHITESPACE: [ \t\f\r\n] -> channel(HIDDEN);
fragment DIGIT: [0-9];
fragment DIGITS: DIGIT+;
which gives you this parse tree for your input: (image not preserved)
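Why the original WORD rule swallowed everything can be checked with a rough Python model of the lexer: at each position the longest match wins, and on a tie the rule listed first wins. This is a sketch only; tokenize is a hypothetical helper, and the regular expressions are hand translations of the lexer rules from the question and from the grammar above.

```python
import re

def tokenize(rules, text):
    """Longest match wins; on a tie the rule listed first wins (as in an ANTLR lexer)."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        name, end = best
        if name != "WHITESPACE":  # hidden channel: the parser never sees it
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens

QUESTION_RULES = [
    ("SEPARATOR", r";"),
    ("NOT", r"[Nn][Oo][Tt] "),
    ("WORD", r"[^;]+"),
]
ANSWER_RULES = [
    ("SEPARATOR", r";"),
    ("NOT", r"[Nn][Oo][Tt]"),
    ("POSITIVE", r"[Pp][Oo][Ss][Ii][Tt][Ii][Vv][Ee]"),
    ("NEGATIVE", r"[Nn][Ee][Gg][Aa][Tt][Ii][Vv][Ee]"),
    ("WORD", r"[a-zA-Z_$\u0080-\uffff][0-9a-zA-Z_$\u0080-\uffff]*"),
    ("WHITESPACE", r"[ \t\f\r\n]"),
]

source = "Not negative keyword;positive"
print(tokenize(QUESTION_RULES, source))
print(tokenize(ANSWER_RULES, source))
```

With the question's rules, WORD matches the whole of "Not negative keyword" (20 characters) while NOT matches only "Not " (4 characters), so WORD wins. With the rules above, Not, negative and keyword come out as separate NOT, NEGATIVE and WORD tokens.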

Token collision (??) writing ANTLR4 grammar

I have what I thought was a very simple grammar to write:
I want it to allow tokens called facts. These tokens start with a letter, which can be followed by any of these: letter, digit, % or _
I want to concatenate two facts with a . but the second fact does not have to start with a letter (a digit, % or _ is also valid at the start of the second token)
Any "subfact" (even the initial one) in the whole fact can be "instantiated" like an array (you will get it by reading my examples)
For example:
Foo
Foo%
Foo_12%
Foo.Bar
Foo.%Bar
Foo.4_Bar
Foo[42]
Foo['instance'].Bar
etc
I tried to write such a grammar but I can't get it working:
grammar Common;
/*
* Parser Rules
*/
fact: INITIALFACT instance? ('.' SUBFACT instance?)*;
instance: '[' (LITERAL | NUMERIC) (',' (LITERAL | NUMERIC))* ']';
/*
* Lexer Rules
*/
INITIALFACT: [a-zA-Z][a-zA-Z0-9%_]*;
SUBFACT: [a-zA-Z%_]+;
ASSIGN: ':=';
LITERAL: ('\'' .*? '\'') | ('"' .*? '"');
NUMERIC: ([1-9][0-9]*)?[0-9]('.'[0-9]+)?;
WS: [ \t\r\n]+ -> skip;
For example, if I try to parse Foo.Bar, I get: Syntax error line 1 position 4: mismatched input 'Bar' expecting SUBFACT.
I think this is because ANTLR first finds that Bar matches INITIALFACT and stops there. How can I fix this?
If it is relevant, I am using Antlr4cs.

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointers greatly appreciated,
TIA - Alex

Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.
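The tie-break described above can be reproduced with a few lines of Python. This is a sketch under assumptions, not ANTLR itself: winning_rule is a hypothetical helper, and the two regular expressions are hand translations of the IdWord and FigurativeConstant rules from the question.

```python
import re

ID_WORD = ("IdWord", r"[a-zA-Z0-9](?:-*_*[A-Za-z0-9]+)*")
FIGURATIVE = ("FigurativeConstant", r"ZEROES|ZERO|SPACES|SPACE")

def winning_rule(rules, text):
    """Return the name of the rule a longest-match, first-listed-wins lexer would pick."""
    candidates = []
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m:
            candidates.append((name, len(m.group(0))))
    # max() returns the first maximal item, so the earlier rule wins a tie.
    return max(candidates, key=lambda c: c[1])[0] if candidates else None

print(winning_rule([ID_WORD, FIGURATIVE], "SPACES"))  # order used in the question
print(winning_rule([FIGURATIVE, ID_WORD], "SPACES"))  # FigurativeConstant listed first
```

Both rules match all six characters of SPACES, so only the order decides. A longer input such as SPACESHIP is still an IdWord in either order, because there the IdWord match is longer.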
