antlr4 empty alternative not working as expected

I'd presumed that in general the rule
rule: ( something ? ) ;
could generally be expressed as alternation with nothing, with identical semantics
rule: ( something | ) ; <-- empty alt here
(provided of course 'something' is a single item or bracketed to make it so). It seems obviously correct, but antlr4 isn't having it. This code does what I expect:
version 1, works
opt_cursor_into_spec :
( cursor_into_spec ? )
;
cursor_into_spec :
INTO
sident ( COMMA sident ) *
;
but this doesn't; failing to parse the input:
version 2, fails
opt_cursor_into_spec : // this rule's changed
cursor_into_spec
|
// empty alt
;
cursor_into_spec : // this is the same
INTO
sident ( COMMA sident ) *
;
Here's part of the diagnostics trace on version 2, note the [***]
consume [#1,8:11='crsr',<483>,2:6] rule regular_ident
exit regular_ident, LT(1)=<EOF>
exit sident, LT(1)=<EOF>
exit cic_cursor_name, LT(1)=<EOF>
exit cursor_ident_clause, LT(1)=<EOF>
enter opt_cursor_into_spec, LT(1)=<EOF>
line 4:0 no viable alternative at input '<EOF>' [***]
exit opt_cursor_into_spec, LT(1)=<EOF>
exit fetch_statement, LT(1)=<EOF>
exit sql_item, LT(1)=<EOF>
enter opt_sql_separators, LT(1)=<EOF>
exit opt_sql_separators, LT(1)=<EOF>
exit sql_items, LT(1)=<EOF>
This is odd: at [***] it claims there is no viable alternative, yet the line before says it has entered opt_cursor_into_spec, and this rule has the empty alternative, which surely always matches. One can always match the empty string, I thought?
So is my assumption of this equivalence...
( x ? ) === ( x | <<<nothing>>> )
...incorrect, or what?
This Q isn't about code, but about my understanding of semantics. If anyone thinks these should do the same, I'll try to post reproducible code.
Edit: More confused now. A stripped down grammar didn't reproduce. Something about the end of file was suspicious as the input to parse is just fetch a and it seems to get parsed in full according to the diagnostics trace, then fails. Hmm. I added an explicit EOF to the starting rule, so (a bit simplified)
sql_items : sql_item * ; // ORIGINAL
became
sql_items : sql_item * EOF; // NEW
And both (x? and x|<<<nothing>>>) suddenly work for NEW. Previously only x? worked for ORIGINAL.
Surely adding an EOF test cannot cause a previously unsuccessful parse to succeed, can it?
Edit 2 (struck, see Edit 3): on reflection, adding EOF to the grammar can of course cause a previously successful parse to fail, as an input can be well-formed at the start but malformed as a whole (i.e. imagine parsing an expression 2 + 3 £$%&: the start is valid but overall it's crud), but that's apparently not what's happening here.
Edit 3: edit 2 struck as it was misleading and unhelpful.

In version 1, the rule opt_cursor_into_spec matches iff the rule cursor_into_spec matches. In version 2, the rule opt_cursor_into_spec will always be matched. So the semantics of the grammar, specifically due to rules that have opt_cursor_into_spec as an element, will differ.
Likely, in version 2, you are getting a compile-time warning about a rule that can match anything. Do not ignore that warning unless you really understand its cause and effect.

Related

Odd ANTLR4 error handling behavior with very simple (trivial) grammar

Given the below super simple grammar:
ddlStatement
: defineStatement
;
defineStatement
: 'define' tableNameToken=Identifier ';'?
;
and the input "add 1 to bob"
I would expect to get an error. However, the parser matches the "defineStatement" rule with a missing "define" token. The following Listener will fire
@Override
public void exitDefineStatement(DDLParser.DefineStatementContext ctx) {
log.info(MessageFormat.format("Defining {0}", ctx.tableNameToken.getText()));
}
and log "Defining add".
I can assign 'define' to a variable and test that variable for NULL but that seems like work I shouldn't have to do.
BTW if the grammar becomes more complete - specifically with the addition of alternatives to the ddlStatement rule - error handling works as I would expect.
This is ANTLR's error recovery in action.
In many cases, it's VERY beneficial for ANTLR to assume either a missing token, or ignore a token, if it allows parsing to continue. The missing "define" token should have been reported as an error.
Without this capability, ANTLR would frequently get "stumped" at the first sign of problems. With it, ANTLR is saying "Well, if I assume X, then I can make sense of your input. So I'm assuming X and reporting that as an error so I can continue on."
(Filling a few details to get this to build)
grammar Test;
ddlStatement: defineStatement;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
If I run ANTLR on this and compile the Java output, the following command:
echo "add 1 to bob" | grun Test ddlStatement -gui
yields the error:
line 1:0 missing 'define' at 'add'
and produces a parse tree (image not included) in which the highlighted node is the error node.
The reason it stops after "add" is that the input (assuming a missing "define") would be a complete ddlStatement.
ANTLR will stop processing input once it has recognized your start rule.
To get it to "pay attention" to the entire input, add an EOF token to your start rule:
grammar Test;
ddlStatement: defineStatement EOF;
defineStatement: 'define' tableNameToken = Identifier ';'?;
Identifier: [a-zA-Z]+;
Number: [0-9]+;
WS: [ \r\n\t]+ -> skip;
gives these errors:
line 1:0 missing 'define' at 'add'
line 1:4 mismatched input '1' expecting {<EOF>, ';'}
and a second parse tree (image not included).
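The difference the trailing EOF makes can also be sketched outside ANTLR. The following is a hand-written toy recognizer in Python, not ANTLR-generated code: the names tokenize and parse_define and the require_eof flag are invented for illustration, and the recovery logic is a rough imitation of "assume the missing token and carry on", not ANTLR's actual strategy.

```python
import re


def tokenize(text):
    # Keep identifiers, numbers and ';'; whitespace is skipped.
    return re.findall(r"[A-Za-z]+|[0-9]+|;", text)


def parse_define(tokens, require_eof):
    """Toy recognizer for: 'define' Identifier ';'? optionally followed by EOF."""
    errors = []
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "<EOF>"

    # Recovery by assuming a missing token: report it, then carry on.
    if peek() == "define":
        pos += 1
    else:
        errors.append(f"missing 'define' at '{peek()}'")

    name = None
    if peek().isalpha():
        name = peek()
        pos += 1
    else:
        errors.append(f"missing Identifier at '{peek()}'")

    if peek() == ";":
        pos += 1

    # Without this check the recognizer is satisfied and never looks further.
    if require_eof and peek() != "<EOF>":
        errors.append(f"mismatched input '{peek()}' expecting <EOF>")
    return name, errors


if __name__ == "__main__":
    tokens = tokenize("add 1 to bob")
    print(parse_define(tokens, require_eof=False))
    print(parse_define(tokens, require_eof=True))
```

With require_eof=False the recognizer reports only the missing 'define' and silently ignores "1 to bob"; with require_eof=True the leftover input is reported as well.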

Formatting string in Powershell but only first or specific occurrence of replacement token

I have a regular expression that I use several times in a script, where a single word gets changed but the rest of the expression remains the same. Normally I handle this by just creating a regular expression string with a format like the following example:
# Simple regex looking for exact string match
$regexTemplate = '^{0}$'
# Later on...
$someString = 'hello'
$someString -match ( $regexTemplate -f 'hello' ) # ==> True
However, I've written a more complex expression where I need to insert a variable into the expression template and... well regex syntax and string formatting syntax begin to clash:
$regexTemplate = '(?<=^\w{2}-){0}(?=-\d$)'
$awsRegion = 'us-east-1'
$subRegion = 'east'
$awsRegion -match ( $regexTemplate -f $subRegion ) # ==> Error
Which results in the following error:
InvalidOperation: Error formatting a string: Index (zero based) must be greater than or equal to zero and less than the size of the argument list.
I know what the issue is, it's seeing one of my expression quantifiers as a replacement token. Rather than opt for a string-interpolation approach or replace {0} myself, is there a way I can tell PowerShell/.NET to only replace the 0-indexed token? Or is there another way to achieve the desired output using format strings?
If a string template includes { and/or } characters, you need to double these so they do not interfere with the numbered placeholders.
Try
$regexTemplate = '(?<=^\w{{2}}-){0}(?=-\d$)'
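Python's str.format happens to use the same doubling rule for literal braces, so the idea can be checked there as an analogy. This is a sketch in a different language, not the PowerShell -f operator itself; build_pattern is a hypothetical helper name.

```python
import re


def build_pattern(template, word):
    # str.format treats {0} as a placeholder and {{ }} as literal braces.
    return template.format(re.escape(word))


# Quantifier braces doubled, placeholder left as {0}.
escaped_template = r"(?<=^\w{{2}}-){0}(?=-\d$)"
pattern = build_pattern(escaped_template, "east")
matched = re.search(pattern, "us-east-1") is not None

# The undoubled template fails because {2} is read as argument index 2.
try:
    build_pattern(r"(?<=^\w{2}-){0}(?=-\d$)", "east")
    undoubled_failed = False
except IndexError:
    undoubled_failed = True

if __name__ == "__main__":
    print(pattern, matched, undoubled_failed)
```

After formatting, the doubled braces collapse back to single ones, so the regex engine sees the intended {2} quantifier.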

invalid input syntax for type numeric: " "

I'm getting this message in Redshift: invalid input syntax for type numeric: " " , even after trying to implement the advice found in SO.
I am trying to convert text to number.
In my inner join, I try to make sure that the text being processed is first converted to null when there is an empty string, like so:
nullif(trim(atl.original_pricev::text),'') as original_price
... I noticed from a related post on coalesce that you have to convert the value to text before you can try and nullif it.
Then in the outer join, I test to see that there's a limited set of acceptable characters and if this test is met I try to do the to_number conversion:
,case
when regexp_instr(trim(atl.original_price),'[^0-9.$,]')=0
then to_number(atl.original_price,'FM999999999D00')
else null
end as original_price2
At this point I get the above error and unfortunately I can't see the details in datagrip to get the offending value.
So my questions are:
1. I notice that there is an empty space in my error message: invalid input syntax for type numeric: " ". Does this error have the exact same meaning as invalid input syntax for type numeric: '', which is what I see in similar posts?
2. Of course: what am I doing wrong?
Thanks!
It's hard to know for sure without some data and the complete code to try and reproduce the example, but as some have mentioned in the comments the most likely cause is the to_number() function you are using.
In the earlier code fragment you are converting original_price to text (string) and then using nullif() to turn an empty string ('') into NULL. Calling the to_number() function on an empty string will give you the error described.
Without the full SQL statement it's not clear why you're putting the nullif() function around original_price in the "inner join", or whether the CASE statement is really in an outer join clause or one of the columns returned by the query. However, you could perhaps wrap the nullif() in coalesce() to substitute a value that can be converted to a number, e.g. '0.00', instead of NULL.
Sorry I couldn't share real data. I spent the weekend testing small sets to try and trap the error. I found that the error was caused by the input string having no numbers, which is permitted by my regex filter:
when regexp_instr(trim(atl.original_price),'[^0-9.$,]')=0 .
I wrongly expected that a non-numeric string like "$" would evaluate to NULL and that the to_number function would then return NULL. But from experimenting, it seems that it needs at least one digit somewhere in the string. Otherwise it reduces the string argument to an empty string prior to running the to_number formatting and chokes.
For example select to_number(trim('$1'::text),'FM999999999999D00') will evaluate to 1 but select to_number(trim('$A'::text),'FM999999999999D00') will throw the empty string error.
My fix was to add an additional regex to my initial filter:
and regexp_instr(atl.original_price2,'[0-9]')>0 .
This ensures that at least one number will be in the string and after that the empty string error went away.
Hope my learning experience helps someone else.
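The two-part filter described above (no disallowed characters, and at least one digit) can be sketched in Python to make the logic easy to test on sample strings. This only imitates the filtering idea; it does not reproduce Redshift's to_number format handling, and to_number_or_none is a hypothetical helper name.

```python
import re
from decimal import Decimal, InvalidOperation


def to_number_or_none(raw):
    """Return a Decimal for price-like text, or None when it is not convertible."""
    if raw is None:
        return None
    text = raw.strip()
    # First filter: reject anything outside digits, '.', '$' and ','.
    if re.search(r"[^0-9.$,]", text):
        return None
    # Second filter: require at least one digit, so "$" or "" is rejected.
    if not re.search(r"[0-9]", text):
        return None
    cleaned = text.replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # e.g. "1.2.3" passes both filters but is still not a number.
        return None


if __name__ == "__main__":
    for sample in ["$1", "$", " ", "$A", "1,234.50"]:
        print(repr(sample), to_number_or_none(sample))
```

The second check is the one that was missing from the original CASE expression: strings such as "$" or " " pass the first filter but contain nothing to convert.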

Grok regex with escaped “[“, “(“, and “)” chars problems

Elastic newbie here - working with a new 5.5 install. I have a log line that looks like so:
[2015/10/01#19:48:22.785-0400] P-4780 T-2208 I DBUTIL : (451) prostrct create session begin for timk519 on CON:.
I have the following regex:
\[%{DATE:date}#%{TIME:time}-(?<gmtoffset>\d{4})\]\s*(?<procid>P-[0-9]+)\s*(?<threadid>T-[0-9]+)\s*(?<msgtype>[ifIF])\s*(?<processtype>[a-zA-Z]+)\s*(?<usernumber>[0-9]+|[:])\s*\((?<msgnum>[0-9]+|[\-]+)\)\s*%{GREEDYDATA:message}
When I try it in the kibana grok debugger it doesn't work and I get the following error:
GrokDebugger: [parse_exception] [pattern_definitions] property isn't a map, but of type [java.lang.String], with { header={ processor_type="grok" & property_name="pattern_definitions" } }
This appears to be due to the \[ at the start of the line. If I replace the leading \[ with a period ".", I get this:
.%{DATE:date}#%{TIME:time}-(?<gmtoffset>\d{4})\]\s*(?<procid>P-[0-9]+)\s*(?<threadid>T-[0-9]+)\s*(?<msgtype>[ifIF])\s*(?<processtype>[a-zA-Z]+)\s*(?<usernumber>[0-9]+|[:])\s*\((?<msgnum>[0-9]+|[\-]+)\)\s*%{GREEDYDATA:message}
the grok debugger and https://grokdebug.herokuapp.com/ are good with this pattern.
When I put this regex into logstash, it fails to recognize the msgnum (451) part of the line because of the escaped parens \( and \) around the msgnum field, and as a result fails to recognize the line as a legal string.
Am I escaping something incorrectly? Is this a bug?
UPDATE 2017-07-21
I got around the issue with escaping ( and ) by putting them in [(] and [)]. I haven't figured out a way to solve matching the leading [ yet.
UPDATE 2017-07-24
The answer below was an epic catch and I've used that to create the following custom patterns:
DBTIME %{TIME}[-+]\d{4}
DBTIMESTAMP %{YEAR}/%{MONTHNUM}/%{MONTHDAY}#%{DBTIME}
which I've implemented in my grok statement like so:
\[%{DBTIMESTAMP:dbdatetime}\]\s*%{PROCESSID:processid}\s*%{DBTHREADID:threadid}\s*%{DBMSGTYPE:msgtype}\s*%{PROCESSTYPE:processtype}?\s*%{USERNUMBER:usernumber}?\s*:\s*[(]%{MSGNUMBER:msgnumber}[)].\s*%{GREEDYDATA:eventmessage}\s*\r
I then use the date filter to turn the dbdatetime into a @timestamp setting, and now the regex matches the incoming log stream which is what I want. Thx!
The devil is in the detail and the error is not apparent at first. The reason the Grok Debugger fails is because of your use of the DATE pattern. This pattern resolves like this:
DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
MONTHNUM and MONTHDAY both match at most two digits, which means they are matching the 15 in your year. This is why the pattern does not work: \[%{DATE} does not actually match (it is missing the 20). Why does the pattern .%{DATE} work, though? Because the dot is not matching the [; it is matching the 0 of the year.
How to fix this? Use a custom pattern to match the date. Something like this works:
\[(?<date>%{YEAR}/%{MONTHNUM}/%{MONTHDAY})#%{TIME:time}-(?<gmtoffset>\d{4})\]\s*(?<procid>P-[0-9]+)\s*(?<threadid>T-[0-9]+)\s*(?<msgtype>[ifIF])\s*(?<processtype>[a-zA-Z]+)\s*(?<usernumber>[0-9]+|[:])\s*\((?<msgnum>[0-9]+|[\-]+)\)\s*%{GREEDYDATA:message}
This will return the following output:
{
"date": "2015/10/01",
"msgnum": "451",
"procid": "P-4780",
"processtype": "DBUTIL",
"message": "prostrct create session begin for timk519 on CON:.",
"threadid": "T-2208",
"usernumber": ":",
"gmtoffset": "0400",
"time": "19:48:22.785",
"msgtype": "I"
}
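The explanation above can be checked with plain regular expressions. The sketch below uses Python's re module with simplified approximations of the grok MONTHDAY, MONTHNUM and YEAR patterns, written from memory rather than copied from the grok pattern files, so treat the exact sub-patterns as assumptions.

```python
import re

# Simplified stand-ins for the stock grok patterns (assumed, not copied verbatim).
MONTHDAY = r"(?:0[1-9]|[12][0-9]|3[01]|[1-9])"
MONTHNUM = r"(?:0?[1-9]|1[0-2])"
YEAR = r"(?:\d\d){1,2}"
DATE_US = MONTHNUM + r"[/-]" + MONTHDAY + r"[/-]" + YEAR
DATE_EU = MONTHDAY + r"[./-]" + MONTHNUM + r"[./-]" + YEAR
DATE = "(?:" + DATE_US + "|" + DATE_EU + ")"

LINE = ("[2015/10/01#19:48:22.785-0400] P-4780 T-2208 I DBUTIL : (451) "
        "prostrct create session begin for timk519 on CON:.")

# A literal '[' followed by DATE: no match, since "2015/..." is not a DATE.
bracket_match = re.search(r"\[" + DATE, LINE)

# '.' followed by DATE: matches, but the dot consumes the '0' of "2015".
dot_match = re.search(r"." + DATE, LINE)

# Custom year/month/day order, as in the suggested fix.
custom_match = re.search(
    r"\[(?P<date>" + YEAR + "/" + MONTHNUM + "/" + MONTHDAY + ")#", LINE
)

if __name__ == "__main__":
    print(bracket_match)
    print(dot_match.group())
    print(custom_match.group("date"))
```

The dot version only appears to work: it captures 15/10/01 as a day/month/year date, which is why a custom year-first pattern is the safer fix.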

how to handle conditionally existing components in action code?

This is another problem I am facing while migrating from antlr3 to antlr4. This problem is with the java action code for handling conditional components of rules. One example is shown below.
The following grammar+code worked in antlr3. Here, if the unary operator is not present, then a value of '0' is returned, and the java code checks for this value and takes appropriate action.
exprUnary returns [Expr e]
: (unaryOp)? e1=exprAtom
{if($unaryOp.i==0) $e = $e1.e;
else $e = new ExprUnary($unaryOp.i, $e1.e);
}
;
unaryOp returns [int i]
: '-' {$i = 1;}
| '~' {$i = 2;}
;
In antlr4, this code results in a null pointer exception during a run, because 'unaryOp' is 'null' if it is not present. But if I change the code like below, then antlr generation itself reports an error:
if($unaryOp==null) ...
java org.antlr.v4.Tool try.g4
error(67): missing attribute access on rule reference 'unaryOp' in '$unaryOp'
How should the action be coded for antlr4?
Another example of this situation is in if-then-[else] - here $s2 is null in antlr4:
ifStmt returns [Stmt s]
: 'if' '(' e=cond ')' s1=stmt ('else' s2=stmt)?
{$s = new StmtIf($e.e, $s1.s, $s2.s);}
;
NOTE: question 16392152 provides a solution to this question with listeners, but I am not using listeners, my requirement is for this to be handled in the action code.
There are at least two potential ways to correct this:
1. The "ANTLR 4" way to do it is to create a listener or visitor instead of placing the Java code inside of actions embedded in the grammar itself. This is the only way I would even consider solving the problem in my own grammars.
2. If you still use an embedded action, the most efficient way to check if the item exists or not is to access the ctx property, e.g. $unaryOp.ctx. This property resolves to the UnaryOpContext you were assuming would be accessible by $unaryOp by itself.
ANTLR expects you to access an attribute. Try its text attribute instead: $unaryOp.text==null
