My input text is an expression written in prefix notation as a JSON array, where array[0]
is the operator and every item after that is an operand. Arrays can be nested, so if one of the array items is itself an array, I have to evaluate that first.
For example:
["+", 2, 3]
["+", ["+", 1, 1], 3]
Any suggestions on how to create a grammar for this representation?
// parser rules:
expression
    : NUMBER
    | '[' OPERATOR (',' expression)* ']'
    ;
// lexer rules:
QUOTE: '"';
NUMBER: [0-9]+; // not that simple
OPERATOR_PLUS: '+';
OPERATOR_MINUS: '-';
...
OPERATOR: QUOTE (OPERATOR_PLUS | OPERATOR_MINUS | ...) QUOTE;
Something like that. Think about the lexer rules and the recursive parser rule.
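Since the input is already valid JSON, it can also help to prototype the evaluation before writing a grammar: parse with a JSON library and recurse into nested arrays, exactly as the question describes. A minimal Python sketch (the OPS table and function name are illustrative, not part of any grammar):

```python
import json
import operator

# Illustrative operator table; extend with whatever operators you support.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def evaluate(expr):
    """Evaluate a prefix-notation expression: [op, operand, ...].
    Operands may themselves be nested arrays, so we recurse first."""
    if not isinstance(expr, list):
        return expr  # a plain number
    op, *operands = expr
    fn = OPS[op]
    values = [evaluate(o) for o in operands]  # evaluate nested arrays first
    result = values[0]
    for v in values[1:]:
        result = fn(result, v)
    return result

print(evaluate(json.loads('["+", 2, 3]')))            # 5
print(evaluate(json.loads('["+", ["+", 1, 1], 3]')))  # 5
```

The recursion in evaluate mirrors the recursive expression rule in the grammar sketch above.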
I need to match a token that is composed of two parts:
"string" + any number; e.g. string64, string128, etc.
In the lexer rules I have
STRING: S T R I N G;
NUMERIC_LITERAL:
((DIGIT+ ('.' DIGIT*)?) | ('.' DIGIT+)) (E [-+]? DIGIT+)?
| '0x' HEX_DIGIT+;
In the parser, I defined
type_id_string: STRING NUMERIC_LITERAL;
However, the parser doesn't match and stops, expecting a STRING token.
How do I tell the parser that the token has two parts?
You probably have some "identifier" rule like this:
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
which will cause input like string64 to be tokenized as a single ID token, not as STRING and NUMERIC_LITERAL tokens.
Also trying to match these sort of things in a parser rule like:
type_id_string: STRING NUMERIC_LITERAL;
will go wrong when you're discarding white spaces in the lexer. If the input were then "string 64" (string + space + 64), it could possibly still be matched by type_id_string, which is probably not what you want.
Either do:
type_id_string
: ID
;
or define these tokens in the lexer:
type_id_string
 : TYPE_ID_STRING
 ;
// Important to match this before the `ID` rule!
TYPE_ID_STRING
 : [a-zA-Z]+ [0-9]+
 ;
ID
: [a-zA-Z_] [a-zA-Z0-9_]*
;
However, when doing that, input like fubar1 will also become a TYPE_ID_STRING and not an ID!
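The behavior described here — the lexer preferring the longest match, with rule order breaking ties — can be simulated outside ANTLR. A small Python sketch of a first-match-wins tokenizer (the rule names mirror the grammar above; the tokenizer itself is illustrative):

```python
import re

# Rule order matters: TYPE_ID_STRING is tried before ID,
# mirroring the lexer-rule ordering in the answer above.
RULES = [
    ("TYPE_ID_STRING", re.compile(r"[a-zA-Z]+[0-9]+")),
    ("ID",             re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
    ("NUMBER",         re.compile(r"[0-9]+")),
    ("WS",             re.compile(r"\s+")),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                if name != "WS":  # skip whitespace, as the grammar does
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens

print(tokenize("string64"))   # [('TYPE_ID_STRING', 'string64')]
print(tokenize("fubar1"))     # [('TYPE_ID_STRING', 'fubar1')]  -- the caveat above
print(tokenize("string 64"))  # [('ID', 'string'), ('NUMBER', '64')]
```

The last line shows why matching the pair in a parser rule is fragile: with whitespace skipped, "string 64" produces two separate tokens.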
first="harry"
last="potter"
print(first, first.title())
print(f"Full name: {first.title()} {last.title()}")
print("Full name: {0.title()} {1.title()}".format(first, last))
The first two statements work fine, which means the 'str' object does have a title() method.
The third print statement gives an error. Why is that?
The str.format() syntax is different from f-string syntax. In particular, while f-strings essentially let you put any expression between the brackets, str.format() is considerably more limited. Per the documentation:
The grammar for a replacement field is as follows:
replacement_field ::= "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name ::= arg_name ("." attribute_name | "[" element_index "]")*
arg_name ::= [identifier | digit+]
attribute_name ::= identifier
element_index ::= digit+ | index_string
index_string ::= <any source character except "]"> +
conversion ::= "r" | "s" | "a"
format_spec ::= <described in the next section>
You'll note that, while attribute names (via the dot operator .) and indices (via square-brackets []) - in other words, values - are valid, actual method calls (or any other expressions) are not. I hypothesize this is because str.format() does not actually execute the text, but just swaps in an object that already exists.
Actual f-strings (your second example) share a similar syntax to the str.format() method, in that they use curly-brackets {} to indicate the areas to replace, but according to the PEP that introduced them,
F-strings provide a way to embed expressions inside string literals, using a minimal syntax. It should be noted that an f-string is really an expression evaluated at run time, not a constant value.
This is clearly different (more complex) than str.format(), which is more of a simple text replacement - an f-string is an expression and is executed as such, and allows full expressions inside its brackets (in fact, you can even nest f-strings inside each other, which is fun).
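The difference is easy to demonstrate side by side: the same method call that works inside an f-string fails inside str.format(), because format parses "title()" as an attribute name and effectively does a getattr lookup for an attribute literally named "title()":

```python
first = "harry"

# f-string: the braces hold a real expression, evaluated at run time.
print(f"{first.title()}")  # Harry

# str.format(): "title()" is taken as an attribute name, so this is
# roughly getattr(first, "title()"), and no such attribute exists.
try:
    print("{0.title()}".format(first))
except AttributeError as e:
    print(e)
```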
str.format() passes the string object into the corresponding placeholder, and with '.' you can access the string's attributes. That is why {0.title()} looks for an attribute literally named title() on the string object and finds nothing.
But if you use
print("Full name: {0.title} {1.title}".format(first, last))
>> Full name: <built-in method title of str object at 0x7f5e42d09630> <built-in method title of str object at 0x7f5e42d096b0>
Here you can see that you can access the built-in method object of the string.
If you want to use title() with format(), use it like this:
print("Full name: {0} {1}".format(first.title(), last.title()))
>> Full name: Harry Potter
I have what I thought was a very simple grammar to write:
I want it to allow a token called fact. These tokens start with a letter, followed by any number of: letters, digits, % or _
Two facts can be concatenated with a ., but the second fact does not have to start with a letter (a digit, % or _ is also valid from the second fact on)
Any "subfact" (even the initial one) in the whole fact can be "instantiated" like an array (you will get it by reading my examples)
For example:
Foo
Foo%
Foo_12%
Foo.Bar
Foo.%Bar
Foo.4_Bar
Foo[42]
Foo['instance'].Bar
etc
I tried to write such a grammar, but I can't get it working:
grammar Common;
/*
* Parser Rules
*/
fact: INITIALFACT instance? ('.' SUBFACT instance?)*;
instance: '[' (LITERAL | NUMERIC) (',' (LITERAL | NUMERIC))* ']';
/*
* Lexer Rules
*/
INITIALFACT: [a-zA-Z][a-zA-Z0-9%_]*;
SUBFACT: [a-zA-Z%_]+;
ASSIGN: ':=';
LITERAL: ('\'' .*? '\'') | ('"' .*? '"');
NUMERIC: ([1-9][0-9]*)?[0-9]('.'[0-9]+)?;
WS: [ \t\r\n]+ -> skip;
For example, if I tried to parse Foo.Bar, I get: Syntax error line 1 position 4: mismatched input 'Bar' expecting SUBFACT.
I think this is because ANTLR finds that Bar matches INITIALFACT and stops there. How can I fix this?
If it is relevant, I am using Antlr4cs.
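The diagnosis in the question is plausible: both lexer rules match Bar with the same length, and ANTLR resolves such ties in favor of the rule defined first, which is INITIALFACT here. A quick Python check of the two patterns against the same input (the regexes simply transcribe the lexer rules; this is a sketch, not ANTLR itself):

```python
import re

# Transcriptions of the two lexer rules from the grammar above.
INITIALFACT = re.compile(r"[a-zA-Z][a-zA-Z0-9%_]*")
SUBFACT = re.compile(r"[a-zA-Z%_]+")

text = "Bar"
matches = {
    "INITIALFACT": INITIALFACT.match(text).group(),
    "SUBFACT": SUBFACT.match(text).group(),
}
# Both rules match all of 'Bar' (a tie), so ANTLR emits the
# first-defined token type, INITIALFACT, and the parser then
# fails because it expects SUBFACT after the '.'.
print(matches)
```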
I am trying to write a grammar that will match the finite closure pattern for regular expressions (i.e. foo{1,3} matches 1 to 3 'o' appearances after the 'fo' prefix).
To identify the string {x,y} as a finite closure, it must not include spaces; for example, { 1, 3} is recognized as a sequence of seven characters.
I have written the following lexer and parser files, but I am not sure if this is the best solution. I am using a lexical mode for the closure pattern, which is activated when a regular expression matches a valid closure expression.
lexer grammar closure_lexer;
@header { using System;
using System.IO; }
@lexer::members {
public static bool guard = true;
public static int LBindex = 0;
}
OTHER : .;
NL : '\r'? '\n' ;
CLOSURE_FLAG : {guard}? {LBindex =InputStream.Index; }
'{' INTEGER ( ',' INTEGER? )? '}'
{ closure_lexer.guard = false;
// Go back to the opening brace
InputStream.Seek(LBindex);
Console.WriteLine("Enter Closure Mode");
Mode(CLOSURE);
} -> skip
;
mode CLOSURE;
LB : '{';
RB : '}' { closure_lexer.guard = true;
Mode(0); Console.WriteLine("Enter Default Mode"); };
COMMA : ',' ;
NUMBER : INTEGER ;
fragment INTEGER : [1-9][0-9]*;
and the parser grammar
parser grammar closure_parser;
@header { using System;
using System.IO; }
options { tokenVocab = closure_lexer; }
compileUnit
: ( other {Console.WriteLine("OTHER: {0}",$other.text);} |
closure {Console.WriteLine("CLOSURE: {0}",$closure.text);} )+
;
other : ( OTHER | NL )+;
closure : LB NUMBER (COMMA NUMBER?)? RB;
Is there a better way to handle this situation?
This looks quite complex for such a simple task. You can easily let your lexer match one construct (preferably that without whitespaces, if you usually skip them) and the parser matches the other form. You don't even need lexer modes for that.
Define your closure rule:
CLOSURE
: OPEN_CURLY INTEGER (COMMA INTEGER?)? CLOSE_CURLY
;
This rule will not match any form that contains, e.g., whitespace. So, if your lexer does not match CLOSURE, all the individual tokens like the curly braces and integers end up in your parser for matching (where you can then treat them as something different).
NB: doesn't the closure definition also allow {,n} (same as {0,n})? That requires an additional alt in the CLOSURE rule.
And finally a hint: your OTHER rule will probably give you trouble, as it matches any char and is even located before other rules. If you have a wildcard rule, it should be the last one in your grammar, matching everything not matched by any other rule.
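The no-whitespace requirement can be checked against Python's own regex engine, which applies the same convention: {1,3} is a quantifier only when written without spaces, while { 1, 3} falls back to a run of literal characters (a sketch to illustrate the distinction, not the grammar itself):

```python
import re

# foo{1,3}: quantifier form -- 'fo' followed by 1 to 3 'o's.
pattern = re.compile(r"foo{1,3}")
print(bool(pattern.fullmatch("foo")))     # True  (1 trailing 'o')
print(bool(pattern.fullmatch("foooo")))   # True  (3 trailing 'o's)
print(bool(pattern.fullmatch("fooooo")))  # False (4 trailing 'o's)

# With spaces inside the braces, they are literal characters,
# not a quantifier -- the "sequence of seven characters" case.
literal = re.compile(r"fo{ 1, 3}")
print(bool(literal.fullmatch("fo{ 1, 3}")))  # True
```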
Below is a simple grammar for ANTLR v4. When walked, this grammar produces the error message
line 1:14 mismatched input '<EOF>' expecting DimensionName
for trivial input such as "sdarsfd integer" (without quotation marks).
SO has mentions of similar errors, and a bug was perhaps filed in the 4.3 timeframe.
I have been using ANTLR 4.5.
Any help/pointer/solution?
/**
A simple parser for a dimension declaration
*/
grammar Simple;
definition : dim;
dim : DimensionName DataType;
DimensionName : LETTER (LETTER)*; // greedy
DataType: 'integer' | 'decimal';
LETTER : [a-zA-Z];
DIGIT : [0-9];
WS: [ \t\n\r]+ -> skip;
You just have to switch the two lexer rules DataType and DimensionName
...
DataType: 'integer' | 'decimal';
DimensionName : LETTER (LETTER)*; // greedy
...
As DimensionName matches any sequence of letters, 'integer' is typed as a DimensionName instead of a DataType. For "sdarsfd integer", the lexer produces two DimensionName tokens, so the dim rule cannot be matched. By switching the two lexer rules, the lexer produces a DimensionName token followed by a DataType token, which matches the dim rule.
Also, you can define LETTER and DIGIT as fragment:
fragment LETTER : [a-zA-Z];
fragment DIGIT : [0-9];
Unless you want them to be matched as independent tokens (in your grammar, "1" would be typed as a DIGIT token, for example).
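The effect of rule ordering can be simulated with a small first-match-wins tokenizer in Python (illustrative; ANTLR actually prefers the longest match and uses rule order only to break ties, which is exactly the case for 'integer' since both rules match its full seven characters):

```python
import re

def tokenize(text, rules):
    """First-match-wins tokenizer; on a tie, rule order decides,
    as it does in an ANTLR lexer."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m:
                if name != "WS":  # skip whitespace, as the grammar does
                    tokens.append(name)
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens

# Original ordering: DimensionName shadows the 'integer' keyword.
broken = [("DimensionName", r"[a-zA-Z]+"), ("DataType", r"integer|decimal"), ("WS", r"\s+")]
# Fixed ordering: DataType is tried first and wins the tie.
fixed  = [("DataType", r"integer|decimal"), ("DimensionName", r"[a-zA-Z]+"), ("WS", r"\s+")]

print(tokenize("sdarsfd integer", broken))  # ['DimensionName', 'DimensionName']
print(tokenize("sdarsfd integer", fixed))   # ['DimensionName', 'DataType']
```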