Antlr deep rule set performance issue


I've tried making a grammar that understands expression priorities for a C#-like language:
var a = expression0.expression1(expression2 + expression3() * expression4)
When correctly prioritized, this becomes:
var a = (expression0.expression1)(expression2 + ((expression3()) * expression4))
To achieve that, I've sorted the expressions into rules by priority. Here's the relevant excerpt from my grammar:
expression: assignmentExpression;
assignmentExpression
: equalityExpression ASSIGNMENT assignmentExpression
| equalityExpression
;
equalityExpression
: logicalExpression equals equalityExpression
| logicalExpression notEquals equalityExpression
| logicalExpression
;
logicalExpression
: relationalExpression and logicalExpression
| relationalExpression or logicalExpression
| relationalExpression
;
relationalExpression
: addExpression greaterThan relationalExpression
| addExpression lessThan relationalExpression
| addExpression greaterOrEquals relationalExpression
| addExpression lessOrEquals relationalExpression
| addExpression
;
addExpression
: multiplyExpression add addExpression
| multiplyExpression subtract addExpression
| multiplyExpression
;
multiplyExpression
: conversionExpression multiply multiplyExpression
| conversionExpression divide multiplyExpression
| conversionExpression
;
conversionExpression
: unaryExpression AS conversionExpression
| unaryExpression
;
unaryExpression
: subtract unaryExpression
| add unaryExpression
| ifExpression
;
ifExpression
: IF PAREN expression ENDPAREN expression (ELSE expression)?
| instantiationExpression
;
instantiationExpression
: NEW invocationExpression
| invocationExpression
;
invocationExpression
: reachExpression PAREN arguments? ENDPAREN
| reachExpression
;
reachExpression
: primeExpression DOT primeExpression
| primeExpression
;
Using it on the following code:
{ int a = 7 }
Produces the following output:
expression
assignmentExpression
equalityExpression
logicalExpression
relationalExpression
addExpression
multiplyExpression
conversionExpression
unaryExpression
ifExpression
instantiationExpression
invocationExpression
reachExpression
primeExpression
blockExpression
"{"
statement
variableDeclaration
variableType
type
identifier
"int"
identifier
"a"
variableInitialization
"="
expression
assignmentExpression
equalityExpression
logicalExpression
relationalExpression
addExpression
multiplyExpression
conversionExpression
unaryExpression
ifExpression
instantiationExpression
invocationExpression
reachExpression
primeExpression
literal
integer
"7"
"}"
It probably has problems, but I can't really test it, because this kind of grammar incurs a huge performance penalty. My previous grammar, which parsed expressions left to right, took a couple of milliseconds to parse; this one takes two whole minutes.
I've tracked it down to ParserATNSimulator.AdaptivePredict, and deeper inside, the issue is a crazy deep stack of
ParserATNSimulator.Closure
ParserATNSimulator.ClosureCheckingStopState
ParserATNSimulator.Closure_
Instead of going down the rabbit hole once to reach the correct expression (the number literal 7), it seems to go all the way down once for every rule level. So instead of being O(n), it takes something like O(n^n) just to get to that damn "7".
Is this an issue with Antlr, my grammar or just the way it is?
EDIT:
OK, I've managed to eliminate the performance issue by rewriting my grammar in this style:
assignmentExpression: equalityExpression (ASSIGNMENT assignmentExpression)?;
But I still don't understand why this happens or how this change solved the issue.
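A rough way to picture the blow-up described above is a toy model. This is only a sketch: ANTLR 4's adaptive prediction is not a naive backtracker, and the function and parameter names below are made up for illustration. It counts how often the innermost rule is reached when every level has several alternatives that all start with the same sub-rule, compared with the factored form `next (OP rule)?` from the edit.

```python
def count_leaf_visits(levels, alternatives, factored):
    """Count visits to the innermost rule while parsing a single literal.

    levels       -- number of nested precedence rules (hypothetical parameter)
    alternatives -- alternatives per rule, all starting with the same sub-rule
    factored     -- True for the 'next (OP rule)?' style, False for 'next OP rule | next'
    """
    visits = 0

    def parse(level):
        nonlocal visits
        if level == levels:
            visits += 1          # reached the literal, e.g. "7"
            return
        if factored:
            parse(level + 1)     # shared prefix is parsed exactly once
            return               # no operator follows, so the optional part is skipped
        for _ in range(alternatives):
            # Every alternative re-parses the shared prefix; all but the last
            # one then fail because no operator follows, and we backtrack.
            parse(level + 1)

    parse(0)
    return visits


print(count_leaf_visits(7, 3, factored=False))  # 3**7 = 2187
print(count_leaf_visits(7, 3, factored=True))   # 1
```

In this model the cost grows as alternatives ** levels for the unfactored style and stays constant for the factored one, which matches the direction of the speed-up seen after the rewrite, though not necessarily its exact cause.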

Related

Condition from matches in two column Excel

I am working on an Excel sheet like this.
I would like to fill in the United Price column automatically by matching values between the columns of the two tables (Tool and tools) and taking the price from the second table.
This is the result I want:
| Tool  | United Price |
|:------|:------------:|
| Axe   | 5,9          |
| Axe   | 5,9          |
| Hoe   | 9,1          |
| Drill | 7,8          |
| Hoe   | 9,1          |
| Hoe   | 9,1          |
| Drill | 7,8          |
I tried to use VLOOKUP(A2; E2:F4; 2; FALSE), but it doesn't work.

I think you want to use a Lookup function in the United Price cells. I’d suggest making both of them tables. From the image it just looks like loose cells but with tables you can use structured references to make the formulas cleaner and easier to maintain.

Try:
=VLOOKUP(A2; $E$2:$F$4; 2; FALSE)
The $ signs make the lookup range absolute, so it stays fixed at E2:F4 when the formula is filled down instead of shifting with each row.
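To see why the anchored range matters, here is a small Python simulation of filling the formula down. It is only a sketch: the row order of the lookup table and the helper name `vlookup` are assumptions, since the original sheet is only visible in the image.

```python
# Hypothetical lookup table in sheet rows 2..4 (columns E:F); the row order is assumed.
LOOKUP_ROWS = {2: ("Axe", 5.9), 3: ("Hoe", 9.1), 4: ("Drill", 7.8)}


def vlookup(key, first_row, last_row):
    """Exact-match lookup over a row range; None stands in for #N/A."""
    for row in range(first_row, last_row + 1):
        entry = LOOKUP_ROWS.get(row)
        if entry is not None and entry[0] == key:
            return entry[1]
    return None


tools = ["Axe", "Axe", "Hoe", "Drill", "Hoe", "Hoe", "Drill"]  # column A, rows 2..8

# Relative reference E2:F4: the range slides down one row per filled row.
relative = [vlookup(tool, 2 + i, 4 + i) for i, tool in enumerate(tools)]

# Absolute reference $E$2:$F$4: the range always covers rows 2..4.
absolute = [vlookup(tool, 2, 4) for tool in tools]

print(relative)  # [5.9, None, None, None, None, None, None]
print(absolute)  # [5.9, 5.9, 9.1, 7.8, 9.1, 9.1, 7.8]
```

With the relative range, only the first row still sees the whole lookup table; every later row looks at a window that has slid partly or entirely past it.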

Greek Dvorak keyboard layout

Greetings from Greece :D
In my effort to learn the English Dvorak layout, which I like, by the way, I tried to find a Greek Dvorak layout. Since none is available online, I tried to create one from scratch with Ukelele (I'm a Mac user). The problem is with a single Greek character: the tone mark.
On its own, it is just this: ΄
I can assign that character in Ukelele and it works fine.
The next part is what I can't get working.
When needed, this character is combined with the vowels of the alphabet, so the result I normally get is: ά (or ό, ή, έ, etc.).
Ukelele doesn't recognize this combination and I can't attach the tone to the letters, so I get this instead: ΄α, ΄ο, etc.
Any possible solutions?
Thank you for your time and help!

The thing that you need to do is create a "dead key" state. You can create one in Ukelele with the "Create" button in the tool bar.
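As a rough illustration of what a dead-key state does, here is a minimal Python sketch. It is not Ukelele's implementation; the function name `type_keys` and the fallback behaviour for letters that cannot take the tone are assumptions. Pressing the tone key outputs nothing and only arms a pending state, and the next key decides what is actually typed.

```python
import unicodedata

TONOS = "\u0384"            # the standalone tone character ΄
COMBINING_ACUTE = "\u0301"  # combining mark used to build the accented vowel


def type_keys(keys):
    """Simulate typing a key sequence on a layout where ΄ is a dead key."""
    output = []
    pending = False
    for key in keys:
        if key == TONOS and not pending:
            pending = True          # dead key: emit nothing yet
            continue
        if pending:
            pending = False
            composed = unicodedata.normalize("NFC", key + COMBINING_ACUTE)
            if len(composed) == 1:
                output.append(composed)      # e.g. ΄ then α gives ά
            else:
                output.append(TONOS + key)   # no accented form: emit both (assumed fallback)
            continue
        output.append(key)
    if pending:
        output.append(TONOS)        # tone key pressed last with nothing after it
    return "".join(output)


print(type_keys("\u03ba\u0384\u03b1\u03bb\u0384\u03bf\u03c2"))  # κάλός
```

In Ukelele, the equivalent is to make ΄ enter a dead-key state and then define the output of each vowel within that state.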
You might be interested in a lightly customized layout that I created by combining the Dvorak layout with the results of a paper I found online that proposed a new Greek keyboard layout (called ΝΕΠ).
+---+---+---+---+---+---+---+---+---+---+
| ' | , | . | π | υ | φ | γ | δ | ρ | λ |
+---+---+---+---+---+---+---+---+---+---+
| α | ο | ε | ς | ι | ΄ | η | τ | ν | σ |
+---+---+---+---+---+---+---+---+---+---+
| ; | ψ | ξ | κ | χ | β | μ | θ | ω | ζ |
+---+---+---+---+---+---+---+---+---+---+

Parsing key value IPV6 pair

I have the following input format to parse: IP:FE80:CD00::211E:729C
After parsing, I want the key to be IP: and the value to be FE80:CD00::211E:729C
I have defined the following grammar:
grammar IPV6;
keyValue : KEY ip_v6_address;
ip_v6_address
: h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| '::' h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| h16? '::' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| ((h16 ':')? h16)? '::' h16 ':' h16 ':' h16 ':' ls32
| (((h16 ':')? h16 ':')? h16)? '::' h16 ':' h16 ':' ls32
| ((((h16 ':')? h16 ':')? h16 ':')? h16)? '::' h16 ':' ls32
| (((((h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16)? '::' ls32
| ((((((h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16)? '::' h16
;
h16
: hexdig hexdig hexdig hexdig
| hexdig hexdig hexdig
| hexdig hexdig
| hexdig
;
hexdig
: digit
| (A | B | C | D | E | F)
;
ls32
: h16 ':' h16
| ip_v4_address
;
ip_v4_address
: dec_octet '.' dec_octet '.' dec_octet '.' dec_octet
;
dec_octet
: digit
| non_zero_digit digit
| D1 digit digit
| D2 (D0 | D1 | D2 | D3 | D4) digit
| D2 D5 (D0 | D1 | D2 | D3 | D4 | D5)
;
digit
: D0
| non_zero_digit
;
non_zero_digit
: D1
| D2
| D3
| D4
| D5
| D6
| D7
| D8
| D9
;
D0 : '0';
D1 : '1';
D2 : '2';
D3 : '3';
D4 : '4';
D5 : '5';
D6 : '6';
D7 : '7';
D8 : '8';
D9 : '9';
A : 'a'|'A';
B : 'b'|'B';
C : 'c'|'C';
D : 'd'|'D';
E : 'e'|'E';
F : 'f'|'F';
KEY: '['? STRING SPACE* STRING']'?':';
fragment SPACE : ' ';
fragment STRING: [a-zA-Z0-9/._-]+;
WS : [ \t\r\n] + -> skip;
Run against the example above, this grammar gives me the following tokens and parse tree:
[TOKENS]
KEY 'IP:'
KEY 'FE80:'
KEY 'CD00:'
':' ':'
KEY '211E:'
D7 '7'
D2 '2'
D9 '9'
C 'C'
EOF '<EOF>'
[PARSE-TREE]
line 1:3 mismatched input 'FE80:' expecting {'::', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', A, B, C, D, E, F}
(keyValue IP:
(ip_v6_address FE80: CD00: : 211E: 7 2 9 C))
I want key-value pairs as output, and I am not sure my grammar is correct. The problem I am facing is that the separator ':' can also occur in the value.
Any pointers on how to fix the grammar?

It doesn't work because of overlapping lexer rules (multiple rules matching the same input).
The F char from FE80: is not being tokenised as a hex digit (the F lexer rule). However, the entire chunk FE80: is being tokenised as a KEY token.
You must realise that the lexer operates independently from the parser. The parser might be trying to match a certain token, but the lexer does not "listen" to this. The lexer follows 2 very simple rules:
1. try to match as many characters as possible for a single token
2. when two or more rules match the same characters, the rule defined first "wins"
Because of these rules, the input F is tokenised as an F token, but input like FE80: is tokenised as a KEY token.
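The two lexer rules above can be mimicked with a small Python tokenizer. This is only a sketch of the principle, not ANTLR's lexer: the rule names and the simplified rule list are made up for illustration, with the KEY pattern transcribed from the question's grammar.

```python
import re

# Rule order matters only for ties; the names are hypothetical simplifications.
RULES = [
    ("DOUBLE_COLON", re.compile(r"::")),
    ("COLON", re.compile(r":")),
    ("DIGIT", re.compile(r"[0-9]")),
    ("HEX_LETTER", re.compile(r"[a-fA-F]")),
    # KEY: '['? STRING SPACE* STRING ']'? ':' from the question's grammar
    ("KEY", re.compile(r"\[?[a-zA-Z0-9/._-]+ *[a-zA-Z0-9/._-]+\]?:")),
]


def tokenize(text):
    """Longest match wins; on equal length the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (rule name, end offset)
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie in length.
            if match and match.end() > pos and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


for token in tokenize("IP:FE80:CD00::211E:729C"):
    print(token)
```

The output reproduces the token dump from the question: IP:, FE80:, CD00: and 211E: all come out as KEY because that rule matches the most characters, and only the trailing 729C, which has no colon after it, is split into single hex digits.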
The solution is to move the construction of a KEY from the lexer to a key parser rule as shown below:
grammar IPV6;
key_value
: key ':' ip_v6_address
;
key
: '[' string ']'
| string
;
ip_v6_address
: h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| '::' h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| h16? '::' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| ((h16 ':')? h16)? '::' h16 ':' h16 ':' h16 ':' ls32
| (((h16 ':')? h16 ':')? h16)? '::' h16 ':' h16 ':' ls32
| ((((h16 ':')? h16 ':')? h16 ':')? h16)? '::' h16 ':' ls32
| (((((h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16)? '::' ls32
| ((((((h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16)? '::' h16
;
h16
: hexdig hexdig hexdig hexdig
| hexdig hexdig hexdig
| hexdig hexdig
| hexdig
;
hexdig
: digit
| (A | B | C | D | E | F)
;
ls32
: h16 ':' h16
| ip_v4_address
;
ip_v4_address
: dec_octet '.' dec_octet '.' dec_octet '.' dec_octet
;
dec_octet
: digit
| non_zero_digit digit
| D1 digit digit
| D2 (D0 | D1 | D2 | D3 | D4) digit
| D2 D5 (D0 | D1 | D2 | D3 | D4 | D5)
;
digit
: D0
| non_zero_digit
;
non_zero_digit
: D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9
;
string
: (STRING_ATOM | hexdig)+
;
D0 : '0';
D1 : '1';
D2 : '2';
D3 : '3';
D4 : '4';
D5 : '5';
D6 : '6';
D7 : '7';
D8 : '8';
D9 : '9';
A : [aA];
B : [bB];
C : [cC];
D : [dD];
E : [eE];
F : [fF];
STRING_ATOM : [g-zG-Z/._-];
WS : [ \t\r\n] + -> skip;
resulting in the following parse tree:

antlr4: Grammar ambiguity, left-recursion, both?

My grammar, shown below, does not compile. The returned error (from the antlr4 maven plugin) is:
[INFO] --- antlr4-maven-plugin:4.3:antlr4 (default-cli) @ beebell ---
[INFO] ANTLR 4: Processing source directory /Users/kodecharlie/workspace/beebell/src/main/antlr4
[INFO] Processing grammar: DateRange.g4
org\antlr\v4\parse\GrammarTreeVisitor.g: node from line 13:87 mismatched tree node: startTime expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from after line 13:87 mismatched tree node: RULE expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from line 13:87 mismatched tree node: startTime expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from after line 13:87 mismatched tree node: RULE expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from line 13:87 mismatched tree node: startTime expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from after line 13:87 mismatched tree node: RULE expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from line 13:87 mismatched tree node: startTime expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from after line 13:87 mismatched tree node: RULE expecting <UP>
[ERROR] error(20): internal error: Rule HOUR undefined
[ERROR] error(20): internal error: Rule MINUTE undefined
[ERROR] error(20): internal error: Rule SECOND undefined
[ERROR] error(20): internal error: Rule HOUR undefined
[ERROR] error(20): internal error: Rule MINUTE undefined
I can see how the grammar might be confused -- e.g., whether 2 digits are a MINUTE, SECOND, or HOUR (or maybe the start of a year). But a few articles suggest this error results from left-recursion.
Can you tell what's going on?
Thanks. Here's the grammar:
grammar DateRange;
range : startDate (THRU endDate)? | 'Every' LONG_DAY 'from' startDate THRU endDate ;
startDate : dateTime ;
endDate : dateTime ;
dateTime : GMTOFF | SHRT_MDY | YYYYMMDD | (WEEK_DAY)? LONG_MDY ;
// Dates.
GMTOFF : YYYYMMDD 'T' HOUR ':' MINUTE ':' SECOND ('-'|'+') HOUR ':' MINUTE ;
YYYYMMDD : YEAR '-' MOY '-' DOM ;
SHRT_MDY : MOY ('/' | '-') DOM ('/' | '-') YEAR ;
LONG_MDY : (SHRT_MNTH '.'? | LONG_MNTH) WS DOM ','? (WS YEAR (','? WS TIMESPAN)? | WS startTime)? ;
YEAR : DIGIT DIGIT DIGIT DIGIT ; // year
MOY : (DIGIT | DIGIT DIGIT) ; // month of year.
DOM : (DIGIT | DIGIT DIGIT) ; // day of month.
TIMESPAN : startTime (WS THRU WS endTime)? ;
// Time-of-day.
startTime : TOD ;
endTime : TOD ;
TOD : NOON | HOUR2 (':' MINUTE)? WS? MERIDIAN ;
NOON : 'noon' ;
HOUR2 : (DIGIT | DIGIT DIGIT) ;
MERIDIAN : 'AM' | 'am' | 'PM' | 'pm' ;
// 24-hour clock. Sanity-check range in listener.
HOUR : DIGIT DIGIT ;
MINUTE : DIGIT DIGIT ;
SECOND : DIGIT DIGIT ;
// Range verb.
THRU : WS ('-'|'to') WS -> skip ;
// Weekdays.
WEEK_DAY : (SHRT_DAY | LONG_DAY) ','? WS ;
SHRT_DAY : 'Sun' | 'Mon' | 'Tue' | 'Wed' | 'Thu' | 'Fri' | 'Sat' -> skip ;
LONG_DAY : 'Sunday' | 'Monday' | 'Tuesday' | 'Wednesday' | 'Thursday' | 'Friday' | 'Saturday' -> skip ;
// Months.
SHRT_MNTH : 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ;
LONG_MNTH : 'January' | 'February' | 'March' | 'April' | 'May' | 'June' | 'July' | 'August' | 'September' | 'October' | 'November' | 'December' ;
DIGIT : [0-9] ;
WS : [ \t\r\n]+ -> skip ;

I resolved this issue by setting up a unique lexer rule for each sequence of digits (of length 1, 2, 3, or 4) and turning everything that gives those digits a meaning (year, month, hour, and so on) into parser rules. I also simplified several rules, in effect trying to make the rule alternatives more straightforward. Anyway, here is the final result, which does compile:
grammar DateRange;
range : 'Every' WS longDay WS 'from' WS startDate THRU endDate
| startDate THRU endDate
| startDate
;
startDate : dateTime ;
endDate : dateTime ;
dateTime : utc
| shrtMdy
| yyyymmdd
| longMdy
| weekDay ','? WS longMdy
;
// Dates.
utc : yyyymmdd 'T' hour ':' minute ':' second ('-'|'+') hour ':' minute ;
yyyymmdd : year '-' moy '-' dom ;
shrtMdy : moy ('/' | '-') dom ('/' | '-') year ;
longMdy : longMonth WS dom ','? optYearAndOrTime?
| shrtMonth '.'? WS dom ','? optYearAndOrTime?
;
optYearAndOrTime : WS year ','? WS timespan
| WS year
| WS timespan
;
fragment DIGIT : [0-9] ;
ONE_DIGIT : DIGIT ;
TWO_DIGITS : DIGIT ONE_DIGIT ;
THREE_DIGITS : DIGIT TWO_DIGITS ;
FOUR_DIGITS : DIGIT THREE_DIGITS ;
year : FOUR_DIGITS ; // year
moy : ONE_DIGIT | TWO_DIGITS ; // month of year.
dom : ONE_DIGIT | TWO_DIGITS ; // day of month.
timespan : (tod THRU tod) | tod ;
// Time-of-day.
tod : noon | (hour2 (':' minute)? WS? meridian?) ;
noon : 'noon' ;
hour2 : ONE_DIGIT | TWO_DIGITS ;
meridian : ('AM' | 'am' | 'PM' | 'pm' | 'a.m.' | 'p.m.') ;
// 24-hour clock. Sanity-check range in listener.
hour : TWO_DIGITS ;
minute : TWO_DIGITS ;
second : TWO_DIGITS ; // we do not use seconds.
// Range verb.
THRU : WS? ('-'|'–'|'to') WS? ;
// Weekdays.
weekDay : shrtDay | longDay ;
shrtDay : 'Sun' | 'Mon' | 'Tue' | 'Wed' | 'Thu' | 'Fri' | 'Sat' ;
longDay : 'Sunday' | 'Monday' | 'Tuesday' | 'Wednesday' | 'Thursday' | 'Friday' | 'Saturday' ;
// Months.
shrtMonth : 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ;
longMonth : 'January' | 'February' | 'March' | 'April' | 'May' | 'June' | 'July' | 'August' | 'September' | 'October' | 'November' | 'December' ;
WS : ~[a-zA-Z0-9,.:]+ ;
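The idea behind the digit rules in this grammar can be sketched in Python: the lexer classifies a run of digits only by its length, and the "parser" decides from context whether it is a year, month or day. This is an illustration only; the function names and the OTHER token are assumptions, and only the yyyymmdd rule is modelled.

```python
import re

DIGIT_TOKENS = {1: "ONE_DIGIT", 2: "TWO_DIGITS", 3: "THREE_DIGITS", 4: "FOUR_DIGITS"}
TOKEN_RE = re.compile(r"(?P<digits>[0-9]{1,4})|(?P<other>[^0-9])")


def lex(text):
    """Classify digit runs purely by length; the lexer attaches no meaning."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup == "digits":
            tokens.append((DIGIT_TOKENS[len(match.group())], match.group()))
        else:
            tokens.append(("OTHER", match.group()))
    return tokens


def parse_yyyymmdd(text):
    """Parser-level rule: yyyymmdd : year '-' moy '-' dom ;"""
    tokens = lex(text)
    if len(tokens) != 5:
        return None
    kinds = [kind for kind, _ in tokens]
    values = [value for _, value in tokens]
    short = ("ONE_DIGIT", "TWO_DIGITS")
    if values[1] != "-" or values[3] != "-":
        return None
    if kinds[0] != "FOUR_DIGITS" or kinds[2] not in short or kinds[4] not in short:
        return None
    return {"year": values[0], "moy": values[2], "dom": values[4]}


print(lex("2014-07-04"))
print(parse_yyyymmdd("2014-07-04"))  # {'year': '2014', 'moy': '07', 'dom': '04'}
```

Because no two lexer rules compete for the same two-digit input any more, the question of whether "07" is a month, a day or a minute is answered by its position in a parser rule rather than by the lexer.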

CQL: comparing two column values

I have a TABLE like this:
id | expected | current
------+----------+--------
123 | 25 | 15
234 | 26 | 26
345 | 37 | 37
Now I want to select all ids where current is equal to expected. In SQL I would do something like this:
SELECT id FROM myTable WHERE current = expected;
But in CQL it seems to be invalid. My cqlsh returns this:
no viable alternative at input 'current'
Is there a valid CQL query to achieve this?
Edited
According to the CQL-Docs it should work but it doesn't... This is what the doc says:
<selectWhereClause> ::= <relation> ( "AND" <relation> )*
| <term> "IN" "(" <term> ( "," <term> )* ")"
<relation> ::= <term> <relationOperator> <term>
<relationOperator> ::= "=" | "<" | ">" | "<=" | ">="
<term> ::= "KEY"
| <identifier>
| <stringLiteral>
| <integer>
| <float>
| <uuid>
;

I used the wrong docs. I'm using CQL3 and read the docs for CQL2.
The correct doc says:
<where-clause> ::= <relation> ( AND <relation> )*
<relation> ::= <identifier> <op> <term>
| '(' <identifier> (',' <identifier>)* ')' <op> '(' <term> (',' <term>)* ')'
| <identifier> IN '(' ( <term> ( ',' <term>)* )? ')'
| TOKEN '(' <identifier> ( ',' <identifer>)* ')' <op> <term>
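So it is not a valid query: in CQL3 the left-hand side of a relation must be a column identifier and the right-hand side a <term>, so one column cannot be compared with another.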
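Since the comparison cannot be written in the WHERE clause, one possible workaround is to fetch the rows and compare the two columns on the client. This is not part of the original answer. The sketch below uses plain dictionaries in place of a real driver's result set, and the function name is made up for illustration.

```python
# Rows as they might come back from "SELECT id, expected, current FROM myTable"
# (hard-coded here; no database driver is involved in this sketch).
rows = [
    {"id": 123, "expected": 25, "current": 15},
    {"id": 234, "expected": 26, "current": 26},
    {"id": 345, "expected": 37, "current": 37},
]


def ids_where_current_equals_expected(result_rows):
    """Client-side equivalent of WHERE current = expected."""
    return [row["id"] for row in result_rows if row["current"] == row["expected"]]


print(ids_where_current_equals_expected(rows))  # [234, 345]
```

Filtering on the client means reading every candidate row, so this only suits tables or partitions small enough to scan.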
