Matching the finite closure pattern ({x,y}) of regular expressions - antlr4

I am trying to write a grammar that will match the finite closure pattern of regular expressions (i.e. foo{1,3} matches 1 to 3 'o' appearances after the 'fo' prefix).
To be identified as a finite closure, the string {x,y} must not include spaces; for example, { 1, 3} is recognized as a sequence of seven ordinary characters.
I have written the following lexer and parser files, but I am not sure if this is the best solution. I am using a lexical mode for the closure pattern, which is activated when a lexer rule matches a valid closure expression.
lexer grammar closure_lexer;
@header { using System;
          using System.IO; }
@lexer::members {
    public static bool guard = true;
    public static int LBindex = 0;
}
OTHER : .;
NL : '\r'? '\n' ;
CLOSURE_FLAG : {guard}? {LBindex = InputStream.Index; }
    '{' INTEGER ( ',' INTEGER? )? '}'
    { closure_lexer.guard = false;
      // Go back to the opening brace
      InputStream.Seek(LBindex);
      Console.WriteLine("Enter Closure Mode");
      Mode(CLOSURE);
    } -> skip
    ;
mode CLOSURE;
LB : '{';
RB : '}' { closure_lexer.guard = true;
Mode(0); Console.WriteLine("Enter Default Mode"); };
COMMA : ',' ;
NUMBER : INTEGER ;
fragment INTEGER : [1-9][0-9]*;
and the parser grammar
parser grammar closure_parser;
@header { using System;
          using System.IO; }
options { tokenVocab = closure_lexer; }
compileUnit
: ( other {Console.WriteLine("OTHER: {0}",$other.text);} |
closure {Console.WriteLine("CLOSURE: {0}",$closure.text);} )+
;
other : ( OTHER | NL )+;
closure : LB NUMBER (COMMA NUMBER?)? RB;
Is there a better way to handle this situation?
Thanks in advance

This looks quite complex for such a simple task. You can easily let your lexer match one construct (preferably the one without whitespace, if you usually skip whitespace) and let the parser match the other form. You don't even need lexer modes for that.
Define your closure rule:
CLOSURE
: OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY
;
This rule will not match any form that contains, e.g., whitespace. So, if your lexer does not match CLOSURE, you will get all the individual tokens (the curly braces, integers and commas) ending up in your parser for matching, where you can then treat them as something different.
NB: doesn't the closure definition also allow {,n} (equivalent to {0,n})? That requires an additional alt in the CLOSURE rule.
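For example, a sketch of the rule with that extra alternative (reusing the OPEN_CURLY, CLOSE_CURLY, COMMA and INTEGER tokens from above):
CLOSURE
 : OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY   // {n}, {n,} and {n,m}
 | OPEN_CURLY COMMA INTEGER CLOSE_CURLY               // {,m}
 ;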
And finally a hint: your OTHER rule will probably give you trouble, as it matches any character and is even located before the other rules. If you have a wildcard rule, it should be the last one in your grammar, matching everything not matched by any other rule.
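Applied to the lexer in the question, that simply means keeping OTHER as the very last rule of the default mode, for example:
NL    : '\r'? '\n' ;
// CLOSURE_FLAG (or a CLOSURE rule as above) and any other specific rules go here
OTHER : . ;   // last resort: matches any single character not matched by the rules above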

Related

How to define ANTLR Parser Rule for concatenated tokens

I need to match a token that can be combined from two parts:
"string" + any number; e.g. string64, string128, etc.
In the lexer rules I have
STRING: S T R I N G;
NUMERIC_LITERAL:
((DIGIT+ ('.' DIGIT*)?) | ('.' DIGIT+)) (E [-+]? DIGIT+)?
| '0x' HEX_DIGIT+;
In the parser, I defined
type_id_string: STRING NUMERIC_LITERAL;
However, the parser does not match and stops, expecting a STRING token.
How do I tell the parser that the token has two parts?
BR
You probably have some "identifier" rule like this:
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
which will cause input like string64 to be tokenized as an ID token and not as a STRING and NUMERIC_LITERAL tokens.
Also, trying to match these sorts of things in a parser rule like:
type_id_string: STRING NUMERIC_LITERAL;
will go wrong when you're discarding white spaces in the lexer. If the input were "string 64" (string + space + 64), it could possibly be matched by type_id_string, which is probably not what you want.
Either do:
type_id_string
: ID
;
or define these tokens in the lexer:
type_id_string
 : TYPE_ID_STRING
 ;
// Important to match this before the `ID` rule!
TYPE_ID_STRING
: [a-zA-Z]+ [0-9]+
;
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
However, when doing that, input like fubar1 will also become a TYPE_ID_STRING and not an ID!
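If both token types can occur where an identifier is expected, one possible workaround (a sketch, not part of the original answer) is to accept either token in the parser:
identifier
 : ID
 | TYPE_ID_STRING
 ;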

ANTLR4 lexer rule ensuring expression does not end with character

I have a syntax where I need to match given the following example:
some-Text->more-Text
From this example, I need ANTLR4 lexer rules that would match 'some-Text' and 'more-Text' into one lexer rule, and the '->' as another rule.
I am using the lexer rules shown below as my starting point. The trouble is that the '-' character is allowed in the NAMEDELEMENT rule, which causes the first NAMEDELEMENT match to become 'some-Text-', which in turn causes the '->' to not be captured by the EDGE rule.
I'm looking for a way to ensure that the '-' is not captured as the last character in the NAMEDELEMENT rule (or some other alternative that produces the desired result).
EDGE
: '->'
;
NAMEDELEMENT
: ('a'..'z'|'A'..'Z'|'_'|'#') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')* { _input.LA(1) != '-' && _input.LA(2) != '>' }?
;
I'm trying to use the predicate above to look ahead for a sequence of '-' and '>', but it doesn't seem to work. It doesn't seem to do anything at all, actually, as I get the same parsing results both with and without the predicate.
The parser rules are as follows, where I am matching on 'selector' rules:
selector
: namedelement (edge namedelement)*
;
edge
: EDGE
;
namedelement
: NAMEDELEMENT
;
Thanks in advance!
After messing around with this for hours, I have a syntax that works, though I fail to see how it is functionally any different than what I posted in the original question.
(I use the uncommented version so that I can put a break point in the generated lexer to ensure that the equality test is evaluating correctly.)
NAMEDELEMENT
 //: [a-zA-Z_#] [a-zA-Z_-]* { String.fromCharCode(this._input.LA(1)) != ">" }?
 : [a-zA-Z_#] [a-zA-Z_-]* { (function(a){
       var c = String.fromCharCode(a._input.LA(1));
       return c != ">";
     })(this)
   }?
 ;
My target language is JavaScript and both the commented and uncommented forms of the predicate work fine.
Try this:
NAMEDELEMENT
: [a-zA-Z_#] ( '-' {_input.LA(1) != '>'}? | [a-zA-Z0-9_] )*
;
Not sure if _input.LA(1) != '>' is OK with the JavaScript runtime, but in Java it properly tokenises "some-->more" into "some-", "->" and "more".
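If the Java-style action is rejected by the JavaScript target, a possible adaptation (untested; it reuses the asker's own String.fromCharCode predicate) would be:
NAMEDELEMENT
 : [a-zA-Z_#] ( '-' { String.fromCharCode(this._input.LA(1)) != ">" }? | [a-zA-Z0-9_] )*
 ;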

How will I implement the lexing of strings using ocamllex?

I am new to the concept of lexing and am trying to write a lexer in ocaml to read the following example input:
(blue, 4, dog, 15)
Basically the input is a list of any random string or integer. I have found many examples for int based inputs as most of them model a calculator, but have not found any guidance through examples or the documentation for lexing strings. Here is what I have so far as my lexer:
(* File lexer.mll *)
{
open Parser
}
rule lexer_main = parse
[' ' '\r' '\t'] { lexer_main lexbuf } (* skip blanks *)
| ['0'-'9']+ as lxm { INT(int_of_string lxm) }
| '(' { LPAREN }
| ')' { RPAREN }
| ',' { COMMA }
| eof { EOF }
| _ { syntax_error "couldn't identify the token" }
As you can see, I am missing the ability to parse strings. I am aware that a string can be represented in the form ['a'-'z'], so would it be as simple as ['a'-'z'] { STRING }?
Thanks for your help.
The notation ['a'-'z'] represents a single character, not a string. So a string is more or less a sequence of one or more of those. I have a fear that this is an assignment, so I'll just say that you can extend a pattern for a single character into a pattern for a sequence of the same kind of character using the same technique you're using for INT.
However, I wonder whether you really want your strings to be so restrictive. Are they really required to consist of alphabetic characters only?
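For reference, the kind of extension being hinted at is an extra alternative in lexer_main, roughly like this (a sketch only; it assumes the parser declares a STRING token that carries the matched text):
| ['a'-'z']+ as lxm { STRING(lxm) }   (* one or more lowercase letters, by analogy with the INT case *)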

Parsing fortran-style .op. operators

I'm trying to write an ANTLR4 grammar for a fortran-inspired DSL. I'm having difficulty with the 'ole classic ".op." operators:
if (1.and.1) then
where both "1"s should be intepreted as integer. I looked at the OpenFortranParser for insight, but I can't make sense out of it.
Initially, I had suitable definitions for INTEGER and REAL in my lexer. Consequently, the first "1" above always parsed as a REAL, no matter what I tried. I tried moving things into the parser, and got it to the point where I could reliably recognize the ".and." along with numbers around it as appropriately INTEGER or REAL.
if (1.and.1) # INT/INT
if (1..and..1) # REAL/REAL
...etc...
I of course want to recognize variable-names in such statements:
if (a.and.b)
and have an appropriate rule for ID. In the small grammar below, however, any literals in quotes (e.g. 'and', 'if', and all the single-character numeric suffixes) are not accepted as an ID, and I get an error; any other ID-conforming string is accepted:
if (a.and.b) # errs, as 'b' is valid INTEGER suffix
if (a.and.c) # OK
Any insights into this behavior, or better suggestions on how to parse the .op. operators in fortran would be greatly appreciated -- Thanks!
grammar Foo;
start : ('if' expr | ID)+ ;
DOT : '.' ;
DIGITS: [0-9]+;
ID : [a-zA-Z0-9][a-zA-Z0-9_]* ;
andOp : DOT 'and' DOT ;
SIGN : [+-];
expr
: ID
| expr andOp expr
| numeric
| '(' expr ')'
;
integer : DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
real
: DIGITS DOT DIGITS? (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
| DOT DIGITS (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
;
numeric : integer | real;
EOLN : '\r'? '\n' -> skip;
WS : [ \t]+ -> skip;
To disambiguate DOT, add a lexer rule with a predicate just before the DOT rule.
DIT : DOT { isDIT() }? ;
DOT : '.' ;
Change the 'andOp' rule:
andOp : DIT 'and' DIT ;
Then add a predicate method
@lexer::members {
    public boolean isDIT() {
        int offset = _tokenStartCharIndex;
        String r = _input.getText(Interval.of(offset - 4, offset));
        String s = _input.getText(Interval.of(offset, offset + 4));
        if (".and.".equals(s) || ".and.".equals(r)) {
            return true;
        }
        return false;
    }
}
But that is not really the source of your current problem. The integer parser rule effectively defines lexer constants outside of the lexer, which is why 'b' is not matched as an ID.
Change it to
integer : INT ;
INT: DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;
and the lexer will figure out the rest.

No viable alternative at input ' '

I know this question has been asked before, but I haven't found any solution to my specific problem. I am using Antlr4 with the C# target and I have the following lexer rules:
INT : [0-9]+
;
LETTER : [a-zA-Z_]+
;
WS : [ \t\r\n\u000C]+ -> skip
;
LineComment
: '#' ~[\r\n]* -> skip
;
Those are all my lexer rules; there are also many parser rules, which I will not post here since I don't think they are relevant.
The problem I have is that whitespace does not get skipped. When I inspect the token stream after the lexer has run over my input, the whitespace is still in there and therefore causes an error. The input I use is relatively basic:
"fd 100"
It parses completely until it reaches this parser rule:
noSignFactor
: ':' ident #NoSignFactorArg
| integer #NoSignFactorInt
| float #NoSignFactorFloat
| BOOLEAN #NoSignFactorBool
| '(' expr ')' #NoSignFactorExpr
| 'not' factor #NoSignFactorNot
;
integer : INT #IntegerInt
;
Start by separating your grammar into a separate lexer grammar and parser grammar. For example, if you have a combined grammar Foo, create the following:
Create a file FooLexer.g4, and move all of the lexer rules from Foo.g4 into FooLexer.g4.
Create a file FooParser.g4, and move all of the parser rules from Foo.g4 into FooParser.g4.
Include the following option in FooParser.g4:
options {
tokenVocab=FooLexer;
}
This separation will ensure that your parser isn't silently creating lexer rules for you. In a combined grammar, using a literal such as 'not' in a parser rule will create a lexer rule for you if one does not already exist. When this happens, it's easy to lose track of what kinds of tokens your lexer is able to produce. When you use a separate lexer grammar, you will need to explicitly declare a rule like the following in order to use 'not' in a parser rule.
NOT : 'not';
This should solve the whitespace problem if you have (perhaps unintentionally) included the literal ' ' somewhere in a parser rule.
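Put together, the split for this grammar might look roughly as follows (only a few rules shown; file and rule names follow the steps above and are illustrative):
// FooLexer.g4
lexer grammar FooLexer;
NOT : 'not' ;
INT : [0-9]+ ;
LETTER : [a-zA-Z_]+ ;
WS : [ \t\r\n\u000C]+ -> skip ;
LineComment : '#' ~[\r\n]* -> skip ;

// FooParser.g4
parser grammar FooParser;
options { tokenVocab = FooLexer; }
integer : INT #IntegerInt ;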
