Parsing key value IPV6 pair - antlr4

I have the following input to parse: IP:FE80:CD00::211E:729C
After parsing, I want the key to be IP: and the value to be FE80:CD00::211E:729C
I have defined the following grammar:
grammar IPV6;
keyValue : KEY ip_v6_address;
ip_v6_address
: h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| '::' h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| h16? '::' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| ((h16 ':')? h16)? '::' h16 ':' h16 ':' h16 ':' ls32
| (((h16 ':')? h16 ':')? h16)? '::' h16 ':' h16 ':' ls32
| ((((h16 ':')? h16 ':')? h16 ':')? h16)? '::' h16 ':' ls32
| (((((h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16)? '::' ls32
| ((((((h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16)? '::' h16
;
h16
: hexdig hexdig hexdig hexdig
| hexdig hexdig hexdig
| hexdig hexdig
| hexdig
;
hexdig
: digit
| (A | B | C | D | E | F)
;
ls32
: h16 ':' h16
| ip_v4_address
;
ip_v4_address
: dec_octet '.' dec_octet '.' dec_octet '.' dec_octet
;
dec_octet
: digit
| non_zero_digit digit
| D1 digit digit
| D2 (D0 | D1 | D2 | D3 | D4) digit
| D2 D5 (D0 | D1 | D2 | D3 | D4 | D5)
;
digit
: D0
| non_zero_digit
;
non_zero_digit
: D1
| D2
| D3
| D4
| D5
| D6
| D7
| D8
| D9
;
D0 : '0';
D1 : '1';
D2 : '2';
D3 : '3';
D4 : '4';
D5 : '5';
D6 : '6';
D7 : '7';
D8 : '8';
D9 : '9';
A : 'a'|'A';
B : 'b'|'B';
C : 'c'|'C';
D : 'd'|'D';
E : 'e'|'E';
F : 'f'|'F';
KEY: '['? STRING SPACE* STRING']'?':';
fragment SPACE : ' ';
fragment STRING: [a-zA-Z0-9/._-]+;
WS : [ \t\r\n] + -> skip;
Running the above grammar against the example input gives me the following tokens:
[TOKENS]
KEY 'IP:'
KEY 'FE80:'
KEY 'CD00:'
':' ':'
KEY '211E:'
D7 '7'
D2 '2'
D9 '9'
C 'C'
EOF '<EOF>'
[PARSE-TREE]
line 1:3 mismatched input 'FE80:' expecting {'::', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', A, B, C, D, E, F}
(keyValue IP:
(ip_v6_address FE80: CD00: : 211E: 7 2 9 C))
I want key-value pairs as output, and I am not sure whether my grammar is correct. The problem I am facing is that the separator ':' can also occur in the value.
Any pointers on how to fix the grammar?

It doesn't work because of overlapping lexer rules (multiple rules matching the same input).
The F char from FE80: is not being tokenised as a hex digit (the F lexer rule). Instead, the entire chunk FE80: is being tokenised as a KEY token.
You must realise that the lexer operates independently from the parser. The parser might be trying to match a certain token, but the lexer does not "listen" to this. The lexer follows 2 very simple rules:
1. try to match as many characters as possible for a single token
2. when two or more rules match the same characters, the rule defined first "wins"
Because of these rules, the input F on its own is tokenised as an F token, but input like FE80: is tokenised as a KEY token.
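The two lexer rules can be imitated with a small Python sketch. This is only a toy model, not ANTLR's lexer: the names RULES and tokenize are made up for illustration, the digit and hex-letter tokens are collapsed into DIGIT and HEXLETTER, and the KEY regex is a hand translation of the KEY rule from the question.

```python
import re

# Simplified stand-ins for the lexer rules in the question, in definition order.
# (Assumption: DIGIT and HEXLETTER replace the individual D0..D9 and A..F rules.)
RULES = [
    ("COLON2", re.compile(r"::")),
    ("COLON", re.compile(r":")),
    ("DIGIT", re.compile(r"[0-9]")),
    ("HEXLETTER", re.compile(r"[a-fA-F]")),
    ("KEY", re.compile(r"\[?[a-zA-Z0-9/._-]+ *[a-zA-Z0-9/._-]+\]?:")),
]


def tokenize(text):
    """Split text into (rule name, matched text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Rule 1: the longest match wins.
            # Rule 2: strict '>' keeps the earlier rule on a tie.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(tokenize("F"))
print(tokenize("FE80:"))
print(tokenize("IP:FE80:CD00::211E:729C"))
```

Running it on the full example input reproduces the shape of the token list from the question: the chunks ending in ':' become KEY tokens, and only the trailing 729C is split into single characters.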
The solution is to move the construction of a KEY from the lexer to a key parser rule as shown below:
grammar IPV6;
key_value
: key ':' ip_v6_address
;
key
: '[' string ']'
| string
;
ip_v6_address
: h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| '::' h16 ':' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| h16? '::' h16 ':' h16 ':' h16 ':' h16 ':' ls32
| ((h16 ':')? h16)? '::' h16 ':' h16 ':' h16 ':' ls32
| (((h16 ':')? h16 ':')? h16)? '::' h16 ':' h16 ':' ls32
| ((((h16 ':')? h16 ':')? h16 ':')? h16)? '::' h16 ':' ls32
| (((((h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16)? '::' ls32
| ((((((h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16 ':')? h16)? '::' h16
;
h16
: hexdig hexdig hexdig hexdig
| hexdig hexdig hexdig
| hexdig hexdig
| hexdig
;
hexdig
: digit
| (A | B | C | D | E | F)
;
ls32
: h16 ':' h16
| ip_v4_address
;
ip_v4_address
: dec_octet '.' dec_octet '.' dec_octet '.' dec_octet
;
dec_octet
: digit
| non_zero_digit digit
| D1 digit digit
| D2 (D0 | D1 | D2 | D3 | D4) digit
| D2 D5 (D0 | D1 | D2 | D3 | D4 | D5)
;
digit
: D0
| non_zero_digit
;
non_zero_digit
: D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9
;
string
: (STRING_ATOM | hexdig)+
;
D0 : '0';
D1 : '1';
D2 : '2';
D3 : '3';
D4 : '4';
D5 : '5';
D6 : '6';
D7 : '7';
D8 : '8';
D9 : '9';
A : [aA];
B : [bB];
C : [cC];
D : [dD];
E : [eE];
F : [fF];
STRING_ATOM : [g-zG-Z/._-];
WS : [ \t\r\n] + -> skip;
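This results in a parse tree in which key_value contains the key, the ':' separator and an ip_v6_address subtree (the parse tree image is not reproduced here).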
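As a cross-check outside ANTLR, the same key/value split can be sketched in a few lines of Python. This is a different technique from the grammar above, shown only for comparison: it assumes the key itself never contains ':' (which holds for the STRING character set in the question), and split_key_value is a hypothetical helper name. The standard-library ipaddress module does the IPv6 validation.

```python
import ipaddress


def split_key_value(text):
    """Split 'KEY:IPV6' on the first ':' and validate the address part."""
    key, separator, value = text.partition(":")
    if not separator or not key:
        raise ValueError("expected input of the form KEY:ADDRESS")
    # IPv6Address raises ipaddress.AddressValueError (a ValueError) on bad input.
    return key, ipaddress.IPv6Address(value)


key, address = split_key_value("IP:FE80:CD00::211E:729C")
print(key, address)
```

Only the first ':' is treated as the separator, so the colons inside the address stay part of the value.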


Antlr deep rule set performance issue

I've tried making a grammar that understands expression priorities for a C#-like language:
var a = expression0.expression1(expression2 + expression3() * expression4)
When correctly prioritized becomes:
var a = (expression0.expression1)(expression2 + ((expression3()) * expression4))
To achieve that, I've sorted the expressions into rules by priority. Here's the relevant excerpt from my grammar:
expression: assignmentExpression;
assignmentExpression
: equalityExpression ASSIGNMENT assignmentExpression
| equalityExpression
;
equalityExpression
: logicalExpression equals equalityExpression
| logicalExpression notEquals equalityExpression
| logicalExpression
;
logicalExpression
: relationalExpression and logicalExpression
| relationalExpression or logicalExpression
| relationalExpression
;
relationalExpression
: addExpression greaterThan relationalExpression
| addExpression lessThan relationalExpression
| addExpression greaterOrEquals relationalExpression
| addExpression lessOrEquals relationalExpression
| addExpression
;
addExpression
: multiplyExpression add addExpression
| multiplyExpression subtract addExpression
| multiplyExpression
;
multiplyExpression
: conversionExpression multiply multiplyExpression
| conversionExpression divide multiplyExpression
| conversionExpression
;
conversionExpression
: unaryExpression AS conversionExpression
| unaryExpression
;
unaryExpression
: subtract unaryExpression
| add unaryExpression
| ifExpression
;
ifExpression
: IF PAREN expression ENDPAREN expression (ELSE expression)?
| instantiationExpression
;
instantiationExpression
: NEW invocationExpression
| invocationExpression
;
invocationExpression
: reachExpression PAREN arguments? ENDPAREN
| reachExpression
;
reachExpression
: primeExpression DOT primeExpression
| primeExpression
;
Using it on the following code:
{ int a = 7 }
Produces the following output:
expression
assignmentExpression
equalityExpression
logicalExpression
relationalExpression
addExpression
multiplyExpression
conversionExpression
unaryExpression
ifExpression
instantiationExpression
invocationExpression
reachExpression
primeExpression
blockExpression
"{"
statement
variableDeclaration
variableType
type
identifier
"int"
identifier
"a"
variableInitialization
"="
expression
assignmentExpression
equalityExpression
logicalExpression
relationalExpression
addExpression
multiplyExpression
conversionExpression
unaryExpression
ifExpression
instantiationExpression
invocationExpression
reachExpression
primeExpression
literal
integer
"7"
"}"
It probably has problems, but I can't really test it, since this kind of grammar incurs a crazy performance penalty. My previous grammar that parsed expressions left to right took a couple of milliseconds to parse, and this one takes 2 whole minutes.
I've tracked it down to ParserATNSimulator.AdaptivePredict, and deeper inside, the issue is a crazy deep stack of
ParserATNSimulator.Closure
ParserATNSimulator.ClosureCheckingStopState
ParserATNSimulator.Closure_
Instead of going down the rabbit hole once to get to the correct expression (number literal: 7), it seems to go all the way down once for every rule level. So instead of this being O(n), it takes O(n^n) just to get to that damn "7".
Is this an issue with Antlr, my grammar or just the way it is?
EDIT:
OK, I've managed to eliminate the performance issue by rewriting my grammar in this style:
assignmentExpression: equalityExpression (ASSIGNMENT assignmentExpression)?;
But I still don't understand why this happens or how this change solved the issue.
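One way to build intuition for the difference between the two grammar styles is a toy cost model in Python. It is not how ANTLR's adaptive prediction is implemented; it assumes a naive backtracking parser with two alternatives per level, and the function names are invented for illustration. In the original style each level parses the shared prefix once per alternative, while in the rewritten style the prefix is parsed once and the optional tail simply does not match.

```python
def visits_with_common_prefix(levels):
    """Count visits to the innermost rule when every level has the shape
    'next OP self | next' and the input contains no operator at all."""
    visits = 0

    def parse(level):
        nonlocal visits
        if level == levels:
            visits += 1  # reached the literal
            return
        parse(level + 1)  # alternative 1: parse the prefix, then fail on OP
        parse(level + 1)  # alternative 2: parse the very same prefix again

    parse(0)
    return visits


def visits_left_factored(levels):
    """Same count for the rewritten shape 'next (OP self)?'."""
    visits = 0

    def parse(level):
        nonlocal visits
        if level == levels:
            visits += 1
            return
        parse(level + 1)  # shared prefix parsed once; the optional tail is skipped

    parse(0)
    return visits


for depth in (4, 8, 13):
    print(depth, visits_with_common_prefix(depth), visits_left_factored(depth))
```

In this model the work doubles with every extra rule level in the first style and stays constant in the second, which matches the direction of the slowdown described above, although the real numbers depend on ANTLR's internals.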

Excel - Formula to calculate difference between columns with blank cells

I have an Excel sheet with values similar to the table below.
-------------------------------------
| A | B | C | D | E | F |
-------------------------------------
| 95| | 98| 96| 95| |
-------------------------------------
| 96| 95| | 92| 91| |
-------------------------------------
| 93| | 92| 98| 94| |
-------------------------------------
| 92| 98| | 95| 92| |
-------------------------------------
| 95| | 99| 92| 98| |
-------------------------------------
The formula for F1 should be =(B1-A1)+(C1-B1)+(D1-C1)+(E1-D1)
However, some cells are blank. So, if the cell is blank, it should take the next cell.
e.g., F1 should be =(C1-A1)+(D1-C1)+(E1-D1)
and F2 should be =(B2-A2)+(D2-B2)+(E2-D2)
and so on...
Is there a formula to automate this?
The formula:
= (B1-A1) + (C1-B1) + (D1-C1) + (E1-D1)
can also be written as:
= B1 - A1 + C1 - B1 + D1 - C1 + E1 - D1
or
= - A1 + (B1 - B1) + (C1 - C1) + (D1 - D1) + E1
where only the first and last values remain, since all the others cancel out, leaving this formula:
= - A1 + E1
So the formula then becomes the last non-blank value minus the first non-blank value.
Try this formula:
= INDEX( $A1:$E1, 0, AGGREGATE( 14, 6, COLUMN(1:1) / ( $A1:$E1 <> "" ), 1 ))
- INDEX( $A1:$E1, 0, AGGREGATE( 15, 6, COLUMN(1:1) / ( $A1:$E1 <> "") ,1 ))
See these pages for further explanations on the Worksheet Functions used:
AGGREGATE function, INDEX function.
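The cancellation argument can be checked with a short Python sketch on the sample rows from the question. Blank cells are represented as None, and both function names are made up for this illustration; this is a verification of the arithmetic, not a replacement for the worksheet formula.

```python
def sum_of_steps(row):
    """Add up the differences between consecutive non-blank cells."""
    values = [v for v in row if v is not None]
    return sum(b - a for a, b in zip(values, values[1:]))


def last_minus_first(row):
    """Last non-blank value minus first non-blank value."""
    values = [v for v in row if v is not None]
    return values[-1] - values[0]


rows = [
    [95, None, 98, 96, 95],
    [96, 95, None, 92, 91],
    [93, None, 92, 98, 94],
    [92, 98, None, 95, 92],
    [95, None, 99, 92, 98],
]

for row in rows:
    print(sum_of_steps(row), last_minus_first(row))
```

Both columns of the printed output agree for every row, because each intermediate value is added once and subtracted once.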

Using Sed multiple search pattern printing specific lines

sed command usage with multiple patterns
I am using the sed command to search for multiple patterns.
The command works and prints the lines where it finds matches.
However, I need to do two more things. Here is the command I use:
sed -r '/pattern1|pattern2/!d' filename
A - Print the line containing the first pattern, then print not only the line matching the second pattern but also a number of lines below it. I would like to specify the number of lines to print below the second pattern.
B - Print the line containing the first pattern and then only a certain number of lines below the second pattern, omitting the line that contains the second pattern itself.
In short, I need to control the number of lines printed below my second search pattern, and optionally omit the line containing that pattern.
Sample input:
Hostname1
section1
a
section2
a
c
d
Hostname2
section1
a
section2
x
y
d
Desired output:
hostname1
section2
a
c
hostname2
section2
x
y
# Create test file
(
cat << EOF
Hostname1
section1
a
section2
a
c
d
Hostname2
section1
a
section2
x
y
d
EOF
) > filename
# transformation
cat filename | grep -v "^ *$" | sed -e "s/\(Hostname\)/==\1/g" | sed -e "s/\(section\)/=\1/g" | tr '\n' '|' | tr '=' '\n' | sed -r '/Hostname1|Hostname2|section2/!d' | cut -d"|" -f-3 | tr '|' '\n' | grep -v "^ *$" | sed -e "s/\(Hostname\)/\n\1/g"
Explanation
# Step 1: transform each section into one line, with "|" as the delimiter:
cat filename | grep -v "^ *$" | sed -e "s/\(Hostname\)/==\1/g" | sed -e "s/\(section\)/=\1/g" | tr '\n' '|' | tr '=' '\n'
#Hostname1|
#section1|a|
#section2|a|c|d|
#
#Hostname2|
#section1|a|
#section2|x|y|d|
# Step 2: keep only the first n+1 fields ( cut -d"|" -f-3 ):
cat filename | grep -v "^ *$" | sed -e "s/\(Hostname\)/==\1/g" | sed -e "s/\(section\)/=\1/g" | tr '\n' '|' | tr '=' '\n' | sed -r '/Hostname1|Hostname2|section2/!d' | cut -d"|" -f-3
#Hostname1|
#section2|a|c
#Hostname2|
#section2|x|y
# Step 3: transform to the wanted format:
cat filename | grep -v "^ *$" | sed -e "s/\(Hostname\)/==\1/g" | sed -e "s/\(section\)/=\1/g" | tr '\n' '|' | tr '=' '\n' | sed -r '/Hostname1|Hostname2|section2/!d' | cut -d"|" -f-3 | tr '|' '\n' | grep -v "^ *$" | sed -e "s/\(Hostname\)/\n\1/g"
#Hostname1
#section2
#a
#c
#
#Hostname2
#section2
#x
#y
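For comparison, the requested behaviour (print the first pattern, then a chosen number of lines below the second pattern, optionally without the matching line) can also be sketched as a small Python function. This is not the sed pipeline above but an alternative under the same assumptions about the input; the function name extract and its parameters are hypothetical.

```python
import re


def extract(lines, first_pat, second_pat, n, include_match=True):
    """Keep lines matching first_pat, plus n lines after each second_pat match.

    include_match controls whether the second_pat line itself is kept.
    """
    first_re = re.compile(first_pat)
    second_re = re.compile(second_pat)
    out = []
    remaining = 0  # how many lines below the last second_pat match are still wanted
    for line in lines:
        line = line.rstrip("\n")
        if first_re.search(line):
            out.append(line)
            remaining = 0
        elif second_re.search(line):
            if include_match:
                out.append(line)
            remaining = n
        elif remaining > 0:
            out.append(line)
            remaining -= 1
    return out


sample = ["Hostname1", "section1", "a", "section2", "a", "c", "d",
          "Hostname2", "section1", "a", "section2", "x", "y", "d"]

print("\n".join(extract(sample, r"^Hostname", r"^section2$", 2)))
print("\n".join(extract(sample, r"^Hostname", r"^section2$", 2, include_match=False)))
```

The first call covers case A from the question and the second call covers case B, with n set to 2 in both.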

antlr4: Grammar ambiguity, left-recursion, both?

My grammar, shown below, does not compile. The returned error (from the antlr4 maven plugin) is:
[INFO] --- antlr4-maven-plugin:4.3:antlr4 (default-cli) # beebell ---
[INFO] ANTLR 4: Processing source directory /Users/kodecharlie/workspace/beebell/src/main/antlr4
[INFO] Processing grammar: DateRange.g4
org\antlr\v4\parse\GrammarTreeVisitor.g: node from line 13:87 mismatched tree node: startTime expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from after line 13:87 mismatched tree node: RULE expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from line 13:87 mismatched tree node: startTime expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from after line 13:87 mismatched tree node: RULE expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from line 13:87 mismatched tree node: startTime expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from after line 13:87 mismatched tree node: RULE expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from line 13:87 mismatched tree node: startTime expecting <UP>
org\antlr\v4\parse\GrammarTreeVisitor.g: node from after line 13:87 mismatched tree node: RULE expecting <UP>
[ERROR] error(20): internal error: Rule HOUR undefined
[ERROR] error(20): internal error: Rule MINUTE undefined
[ERROR] error(20): internal error: Rule SECOND undefined
[ERROR] error(20): internal error: Rule HOUR undefined
[ERROR] error(20): internal error: Rule MINUTE undefined
I can see how the grammar might be confused, e.g., about whether 2 digits is a MINUTE, SECOND, or HOUR (or maybe the start of a year). But a few articles suggest this error results from left-recursion.
Can you tell what's going on?
Thanks. Here's the grammar:
grammar DateRange;
range : startDate (THRU endDate)? | 'Every' LONG_DAY 'from' startDate THRU endDate ;
startDate : dateTime ;
endDate : dateTime ;
dateTime : GMTOFF | SHRT_MDY | YYYYMMDD | (WEEK_DAY)? LONG_MDY ;
// Dates.
GMTOFF : YYYYMMDD 'T' HOUR ':' MINUTE ':' SECOND ('-'|'+') HOUR ':' MINUTE ;
YYYYMMDD : YEAR '-' MOY '-' DOM ;
SHRT_MDY : MOY ('/' | '-') DOM ('/' | '-') YEAR ;
LONG_MDY : (SHRT_MNTH '.'? | LONG_MNTH) WS DOM ','? (WS YEAR (','? WS TIMESPAN)? | WS startTime)? ;
YEAR : DIGIT DIGIT DIGIT DIGIT ; // year
MOY : (DIGIT | DIGIT DIGIT) ; // month of year.
DOM : (DIGIT | DIGIT DIGIT) ; // day of month.
TIMESPAN : startTime (WS THRU WS endTime)? ;
// Time-of-day.
startTime : TOD ;
endTime : TOD ;
TOD : NOON | HOUR2 (':' MINUTE)? WS? MERIDIAN ;
NOON : 'noon' ;
HOUR2 : (DIGIT | DIGIT DIGIT) ;
MERIDIAN : 'AM' | 'am' | 'PM' | 'pm' ;
// 24-hour clock. Sanity-check range in listener.
HOUR : DIGIT DIGIT ;
MINUTE : DIGIT DIGIT ;
SECOND : DIGIT DIGIT ;
// Range verb.
THRU : WS ('-'|'to') WS -> skip ;
// Weekdays.
WEEK_DAY : (SHRT_DAY | LONG_DAY) ','? WS ;
SHRT_DAY : 'Sun' | 'Mon' | 'Tue' | 'Wed' | 'Thu' | 'Fri' | 'Sat' -> skip ;
LONG_DAY : 'Sunday' | 'Monday' | 'Tuesday' | 'Wednesday' | 'Thursday' | 'Friday' | 'Saturday' -> skip ;
// Months.
SHRT_MNTH : 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ;
LONG_MNTH : 'January' | 'February' | 'March' | 'April' | 'May' | 'June' | 'July' | 'August' | 'September' | 'October' | 'November' | 'December' ;
DIGIT : [0-9] ;
WS : [ \t\r\n]+ -> skip ;
I resolved this issue by setting up a unique lexer rule for each sequence of digits (of length 1, 2, 3, or 4) and turning the date and time rules into parser rules. I also simplified several rules, in effect trying to make the production rule alternatives more straightforward. Anyway, here is the final result, which does compile:
grammar DateRange;
range : 'Every' WS longDay WS 'from' WS startDate THRU endDate
| startDate THRU endDate
| startDate
;
startDate : dateTime ;
endDate : dateTime ;
dateTime : utc
| shrtMdy
| yyyymmdd
| longMdy
| weekDay ','? WS longMdy
;
// Dates.
utc : yyyymmdd 'T' hour ':' minute ':' second ('-'|'+') hour ':' minute ;
yyyymmdd : year '-' moy '-' dom ;
shrtMdy : moy ('/' | '-') dom ('/' | '-') year ;
longMdy : longMonth WS dom ','? optYearAndOrTime?
| shrtMonth '.'? WS dom ','? optYearAndOrTime?
;
optYearAndOrTime : WS year ','? WS timespan
| WS year
| WS timespan
;
fragment DIGIT : [0-9] ;
ONE_DIGIT : DIGIT ;
TWO_DIGITS : DIGIT ONE_DIGIT ;
THREE_DIGITS : DIGIT TWO_DIGITS ;
FOUR_DIGITS : DIGIT THREE_DIGITS ;
year : FOUR_DIGITS ; // year
moy : ONE_DIGIT | TWO_DIGITS ; // month of year.
dom : ONE_DIGIT | TWO_DIGITS ; // day of month.
timespan : (tod THRU tod) | tod ;
// Time-of-day.
tod : noon | (hour2 (':' minute)? WS? meridian?) ;
noon : 'noon' ;
hour2 : ONE_DIGIT | TWO_DIGITS ;
meridian : ('AM' | 'am' | 'PM' | 'pm' | 'a.m.' | 'p.m.') ;
// 24-hour clock. Sanity-check range in listener.
hour : TWO_DIGITS ;
minute : TWO_DIGITS ;
second : TWO_DIGITS ; // we do not use seconds.
// Range verb.
THRU : WS? ('-'|'–'|'to') WS? ;
// Weekdays.
weekDay : shrtDay | longDay ;
shrtDay : 'Sun' | 'Mon' | 'Tue' | 'Wed' | 'Thu' | 'Fri' | 'Sat' ;
longDay : 'Sunday' | 'Monday' | 'Tuesday' | 'Wednesday' | 'Thursday' | 'Friday' | 'Saturday' ;
// Months.
shrtMonth : 'Jan' | 'Feb' | 'Mar' | 'Apr' | 'May' | 'Jun' | 'Jul' | 'Aug' | 'Sep' | 'Oct' | 'Nov' | 'Dec' ;
longMonth : 'January' | 'February' | 'March' | 'April' | 'May' | 'June' | 'July' | 'August' | 'September' | 'October' | 'November' | 'December' ;
WS : ~[a-zA-Z0-9,.:]+ ;
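The idea behind the digit rules can be sketched in Python: the lexer only classifies a run of digits by its length, and the meaning (year, month, day) is decided later from the position in the rule. This is a rough model, not the generated ANTLR code; the functions lex and parse_yyyymmdd and the NAMES table are invented for illustration, and only the yyyymmdd rule is covered.

```python
import re

# Token names by digit-run length, mirroring ONE_DIGIT .. FOUR_DIGITS in the grammar.
NAMES = {1: "ONE_DIGIT", 2: "TWO_DIGITS", 3: "THREE_DIGITS", 4: "FOUR_DIGITS"}


def lex(text):
    """Classify digit runs purely by length; other characters are their own token."""
    tokens = []
    for match in re.finditer(r"[0-9]{1,4}|\S", text):
        chunk = match.group()
        kind = NAMES[len(chunk)] if chunk[0] in "0123456789" else chunk
        tokens.append((kind, chunk))
    return tokens


def parse_yyyymmdd(tokens):
    """Mimic 'yyyymmdd : year '-' moy '-' dom': meaning comes from position."""
    kinds = [kind for kind, _ in tokens]
    short = ("ONE_DIGIT", "TWO_DIGITS")
    if (len(tokens) == 5 and kinds[0] == "FOUR_DIGITS" and kinds[1] == "-"
            and kinds[2] in short and kinds[3] == "-" and kinds[4] in short):
        return {"year": int(tokens[0][1]),
                "moy": int(tokens[2][1]),
                "dom": int(tokens[4][1])}
    return None  # not a yyyymmdd date


print(lex("2014-07-04"))
print(parse_yyyymmdd(lex("2014-07-04")))
```

Because no two token names describe the same text, the lexer never has to guess whether two digits are an hour, a minute or a day.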

CQL: comparing two column values

I have a TABLE like this:
id | expected | current
------+----------+--------
123 | 25 | 15
234 | 26 | 26
345 | 37 | 37
Now I want to select all ids where current is equal to expected. In SQL I would do something like this:
SELECT id FROM myTable WHERE current = expected;
But in CQL it seems to be invalid. My cqlsh returns this:
no viable alternative at input 'current'
Is there a valid CQL query to achieve this ?
Edited
According to the CQL-Docs it should work but it doesn't... This is what the doc says:
<selectWhereClause> ::= <relation> ( "AND" <relation> )*
| <term> "IN" "(" <term> ( "," <term> )* ")"
<relation> ::= <term> <relationOperator> <term>
<relationOperator> ::= "=" | "<" | ">" | "<=" | ">="
<term> ::= "KEY"
| <identifier>
| <stringLiteral>
| <integer>
| <float>
| <uuid>
;
I used the wrong docs. I'm using CQL3 and read the docs for CQL2.
The correct doc says:
<where-clause> ::= <relation> ( AND <relation> )*
<relation> ::= <identifier> <op> <term>
| '(' <identifier> (',' <identifier>)* ')' <op> '(' <term> (',' <term>)* ')'
| <identifier> IN '(' ( <term> ( ',' <term>)* )? ')'
| TOKEN '(' <identifier> ( ',' <identifer>)* ')' <op> <term>
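So the left-hand side of a relation must be an identifier and the right-hand side a term, not another column. Comparing current with expected is therefore not a valid CQL3 query.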
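Since the comparison cannot be expressed in the WHERE clause, one possible workaround is to fetch the rows and compare the two columns on the client. The Python sketch below assumes the rows have already been read by whatever driver is in use and are available as dictionaries; the function name ids_where_current_matches is hypothetical and no driver API is shown.

```python
def ids_where_current_matches(rows):
    """Return the ids of rows whose 'current' value equals their 'expected' value."""
    return [row["id"] for row in rows if row["current"] == row["expected"]]


# Sample data copied from the table in the question.
rows = [
    {"id": 123, "expected": 25, "current": 15},
    {"id": 234, "expected": 26, "current": 26},
    {"id": 345, "expected": 37, "current": 37},
]

print(ids_where_current_matches(rows))
```

This reads every row, so it is only reasonable for small tables or for a partition that has already been narrowed down by a valid WHERE clause.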
