Is this just a flawed grammar? - antlr4

I was looking through a grammar for focal and found someone had defined their numbers as follows:
number
: mantissa ('e' signed_)?
;
mantissa
: signed_
| (signed_ '.')
| ('.' signed_)
| (signed_ '.' signed_)
;
signed_
: PLUSMIN? INTEGER
;
PLUSMIN
: '+'
| '-'
;
I was curious because I thought this would mean that, for example, 1.-1 would get identified as a number by the grammar rather than subtraction. Would a branch with unsigned_ be worth it to prevent this issue? I guess this is more of a question for the author, but are there any benefits to structuring it this way (besides the obvious avoiding floats vs ints)?

It’s not necessarily flawed.
It does appear that it will recognize 1.-1 as a mantissa. However, that doesn’t mean that some post-parse validation doesn’t catch this problem.
It would be flawed if there’s an alternative, valid interpretation of 1.-1.
Sometimes, it’s just useful to recognize an invalid construct and produce a parse tree for “the only way to interpret this input”, and then you can detect it in a listener and give the user an error message that might be more meaningful than the default message that ANTLR would produce.
And, then again, it could also just be an oversight.
The `signed_` rule on the other hand, being:
signed_ : PLUSMIN? INTEGER;
Instead of
signed_ : PLUSMIN? INTEGER+;
does make this grammar somewhat suspect as a good example to work from.

Your analysis looks correct to me:
1.-1 is recognized as a number
a branch with unsigned_ could fix it
Calling it "flawed" sounds like a value judgement, which doesn't seem relevant.
If it were for my own use, I would prefer to:
recognize 0.-4 as an invalid number
recognize -.4 as a valid number
So I would prefer something like:
number
: signed_float ('e' signed_integer)?
;
signed_float
: PLUSMIN? unsigned_float
;
unsigned_float
: integer
| (integer '.')
| ('.' integer)
| (integer '.' integer)
;
signed_integer
: PLUSMIN? unsigned_integer
;
PLUSMIN
: '+'
| '-'
;
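
To make the difference between the two grammars concrete, here is a rough sketch that transcribes each one into a Python regular expression. It assumes INTEGER (and the undefined integer / unsigned_integer rules) mean "one or more decimal digits", which neither grammar above actually states. The names ORIGINAL, REVISED and accepts are made up for this illustration.

```python
import re

# Original grammar: every INTEGER inside the mantissa may carry its own sign.
# Assumption: INTEGER is one or more decimal digits.
ORIGINAL = re.compile(
    r"(?:[+-]?\d+\.[+-]?\d+|[+-]?\d+\.|\.[+-]?\d+|[+-]?\d+)(?:e[+-]?\d+)?"
)

# Revised grammar: a single optional sign in front of an unsigned mantissa.
REVISED = re.compile(
    r"[+-]?(?:\d+\.\d+|\d+\.|\.\d+|\d+)(?:e[+-]?\d+)?"
)

def accepts(pattern, text):
    """Return True if the whole text is a number according to the pattern."""
    return pattern.fullmatch(text) is not None

for sample in ["1.-1", "0.-4", "-.4", "1e-5"]:
    print(sample, accepts(ORIGINAL, sample), accepts(REVISED, sample))
```

Under that assumption, the original form accepts 1.-1 and rejects -.4, while the revised form does the opposite.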


SyntaxError: Unexpected number in JSON at position 182 [duplicate]

I'm importing some JSON files into my Parse.com project, and I keep getting the error "invalid key:value pair".
It states that there is an unexpected "8".
Here's an example of my JSON:
{
"Manufacturer":"Manufacturer",
"Model":"THIS IS A STRING",
"Description":"",
"ItemNumber":"Number12345",
"UPC":083456789012,
"Cost":"$0.00",
"DealerPrice":" $0.00 ",
"MSRP":" $0.00 ",
}
If I update the JSON by either removing the 0 from "UPC":083456789012, or converting it to "UPC":"083456789012", it becomes valid.
Can JSON really not accept an integer that begins with 0, or is there a way around the problem?
A leading 0 indicates an octal number in JavaScript. An octal number cannot contain an 8; therefore, that number is invalid.
Moreover, JSON doesn't (officially) support octal numbers, so formally the JSON is invalid, even if the number would not contain an 8. Some parsers do support it though, which may lead to some confusion. Other parsers will recognize it as an invalid sequence and will throw an error, although the exact explanation they give may differ.
Solution: If you have a number, don't ever store it with leading zeroes. If you have a value that needs to have a leading zero, don't treat it as a number, but as a string. Store it with quotes around it.
In this case, you've got a UPC which needs to be 12 digits long and may contain leading zeroes. I think the best way to store it is as a string.
It is debatable, though. If you treat it as a barcode, seeing the leading 0 as an integral part of it, then string makes sense. Other types of barcodes can even contain alphabetic characters.
On the other hand, a UPC is a number, and the fact that it's left-padded with zeroes to 12 digits could be seen as a display property. Actually, if you left-pad it to 13 digits by adding an extra 0, you've got an EAN code, because EAN is a superset of UPC.
If you have a monetary amount, you might display it as € 7.30, while you store it as 7.3, so it could also make sense to store a product code as a number.
But that decision is up to you. I can only advise you to use a string, which is my personal preference for these codes. If you choose a number, you'll have to remove the leading 0 to make it work.
One of the more confusing parts of JavaScript is that if a number starts with a 0 that isn't immediately followed by a ., it represents an octal, not a decimal.
JSON borrows from JavaScript syntax but avoids confusing features, so it simply bans numbers with leading zeros outright (unless the zero is immediately followed by a .).
Even if this weren't the case, there would be no reason to expect the 0 to still be in the number after parsing, since 02 and 2 are just different representations of the same number (if you force decimal).
If the leading zero is important to your data, then you probably have a string and not a number.
"UPC":"083456789012"
A product code is an identifier, not something you do maths with. It should be a string.
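
As a quick check of the advice above, here is a small sketch using Python's standard json module, which follows the same number grammar. The helper name try_parse is made up for this example.

```python
import json

def try_parse(text):
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

unquoted = '{"UPC": 083456789012}'    # leading zero on a number: rejected
quoted = '{"UPC": "083456789012"}'    # same digits as a string: accepted

print(try_parse(unquoted))
print(try_parse(quoted))
```

The quoted form keeps the leading zero intact, which is what a 12-digit identifier needs.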
Formally, it is because JSON uses DecimalIntegerLiteral in its JSONNumber production:
JSONNumber ::
-_opt DecimalIntegerLiteral JSONFraction_opt ExponentPart_opt
And DecimalIntegerLiteral may only start with 0 if it is 0:
DecimalIntegerLiteral ::
0
NonZeroDigit DecimalDigits_opt
The rationale behind this is probably:
In the JSON Grammar - to reuse constructs from the main ECMAScript grammar.
In the main ECMAScript grammar - to make it easier to distinguish DecimalIntegerLiteral from HexIntegerLiteral and OctalIntegerLiteral, and from OctalIntegerLiteral in particular.
See these productions:
HexIntegerLiteral ::
0x HexDigit
0X HexDigit
HexIntegerLiteral HexDigit
...
OctalIntegerLiteral ::
0 OctalDigit
OctalIntegerLiteral OctalDigit
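
The JSONNumber production quoted above can be approximated with a regular expression. This is only a sketch: it assumes JSONFraction is a dot followed by one or more digits and ExponentPart is e or E with an optional sign, and the names JSON_NUMBER and is_json_number are invented for the example.

```python
import re

# -opt DecimalIntegerLiteral JSONFraction_opt ExponentPart_opt
# DecimalIntegerLiteral is either a lone 0 or a non-zero digit plus more digits.
JSON_NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

def is_json_number(text):
    """Return True if the whole text matches the JSONNumber sketch."""
    return JSON_NUMBER.fullmatch(text) is not None

for sample in ["0", "0.5", "83456789012", "083456789012", "02"]:
    print(sample, is_json_number(sample))
```

The two samples with a leading zero followed by more digits fail, because after a leading 0 the only allowed continuations are a fraction or an exponent.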
The UPC should be in string format. In the future you may also get other types of codes, such as GS1-128 or other string-based product identification codes. Set your DB column to be a string.
If an integer starts with 0 in JavaScript, it is considered to be the octal (base 8) value of the integer instead of the decimal (base 10) value. For example:
var a = 065; //Octal Value
var b = 53; //Decimal Value
a == b; //true
I think the easiest way to send your number in JSON is to send it as a string.
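
The arithmetic behind the 065 example can be checked by hand: 6 × 8 + 5 = 53. A minimal sketch in Python, which requires the base to be named explicitly rather than inferring it from the leading zero:

```python
# Read the digit string "065" as base 8: 6 * 8 + 5
octal_value = int("065", 8)
decimal_value = 53

print(octal_value == decimal_value)
```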

ANTLR4 lexer rule ensuring expression does not end with character

I have a syntax where I need to match given the following example:
some-Text->more-Text
From this example, I need ANTLR4 lexer rules that would match 'some-Text' and 'more-Text' into one lexer rule, and the '->' as another rule.
I am using the lexer rules shown below as my starting point, but the trouble is, the '-' character is allowed in the NAMEDELEMENT rule, which causes the first NAMEDELEMENT match to become 'some-Text-', which then causes the '->' to not be captured by the EDGE rule.
I'm looking for a way to ensure that the '-' is not captured as the last character in the NAMEDELEMENT rule (or some other alternative that produces the desired result).
EDGE
: '->'
;
NAMEDELEMENT
: ('a'..'z'|'A'..'Z'|'_'|'#') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-')* { _input.LA(1) != '-' && _input.LA(2) != '>' }?
;
I'm trying to use the predicate above to look ahead for a sequence of '-' and '>', but it doesn't seem to work. It doesn't seem to do anything at all, actually, as I get the same parsing results both with and without the predicate.
The parser rules are as follows, where I am matching on 'selector' rules:
selector
: namedelement (edge namedelement)*
;
edge
: EDGE
;
namedelement
: NAMEDELEMENT
;
Thanks in advance!
After messing around with this for hours, I have a syntax that works, though I fail to see how it is functionally any different than what I posted in the original question.
(I use the uncommented version so that I can put a break point in the generated lexer to ensure that the equality test is evaluating correctly.)
NAMEDELEMENT
//: [a-zA-Z_#] [a-zA-Z_-]* { String.fromCharCode(this._input.LA(1)) != ">" }?
: [a-zA-Z_#] [a-zA-Z_-]* { (function(a){
var c = String.fromCharCode(a._input.LA(1));
return c != ">";
})(this)
}?
;
My target language is JavaScript and both the commented and uncommented forms of the predicate work fine.
Try this:
NAMEDELEMENT
: [a-zA-Z_#] ( '-' {_input.LA(1) != '>'}? | [a-zA-Z0-9_] )*
;
Not sure if _input.LA(1) != '>' is OK with the JavaScript runtime, but in Java it properly tokenises "some-->more" into "some-", "->" and "more".
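
The idea behind the rule above (accept a '-' only when the next character is not '>') can be tried outside ANTLR with a regular-expression negative lookahead. This is a rough sketch in Python, not the ANTLR runtime. The names TOKEN and tokenize are made up, and Python's ordered alternation only approximates ANTLR's longest-match lexing.

```python
import re

# EDGE is '->'; NAMEDELEMENT may contain '-' only if it is not followed by '>'.
TOKEN = re.compile(
    r"(?P<EDGE>->)"
    r"|(?P<NAMEDELEMENT>[a-zA-Z_#](?:-(?!>)|[a-zA-Z0-9_])*)"
)

def tokenize(text):
    """Split text into (token type, text) pairs, failing on unknown input."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("some-Text->more-Text"))
print(tokenize("some-->more"))
```

The second input reproduces the split described above: "some-", "->" and "more".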

Token Aliases in Antlr

I have rules that look something like this:
INTEGER : [0-9]+;
field3 : INTEGER COMMA INTEGER;
In the parsed tree I get a List called INTEGER with two elements.
I would rather find a way for each of the elements to be named.
But if I do this:
INTEGER : [0-9]+;
DOS : INTEGER;
UNO : INTEGER;
field3 : UNO COMMA DOS;
I still get the array of INTEGERs.
Am I doing it right and I just need to dig deeper to figure out what is wrong?
Is there some kind of syntax to alias INTEGER as UNO just for this command (that is actually what I would prefer)?
Just use labeling to identify the subterms:
field : a=INTEGER COMMA b=INTEGER;
The FieldContext class will be generated with two additional class fields:
TerminalNode a;
TerminalNode b;
The corresponding INTEGER instances will be assigned to these fields. So, no aliasing is actually required in most cases.
However, there can be valid reasons to change the named type of a token, and this is typically handled in the lexer through the use of modes, actions, and predicates. For example, using modes, if INTEGER alternates between UNO and DOS types:
lexer grammar UD ;
UNO : INT -> mode(two);
mode two;
DOS : INT -> mode(DEFAULT_MODE);
fragment INT : [0-9]+ ;
When to do the mode switch and whether a different specific approach might be more appropriate will depend on details not provided yet.

Token recognition order

My full grammar results in an incarnation of the dreaded "no viable alternative", but anyway, maybe a solution to the problem I'm seeing with this trimmed-down version can help me understand what's going on.
grammar NOVIA;
WS : [ \t\r\n]+ -> skip ; // whitespace rule -> toss it out
T_INITIALIZE : 'INITIALIZE' ;
T_REPLACING : 'REPLACING' ;
T_ALPHABETIC : 'ALPHABETIC' ;
T_ALPHANUMERIC : 'ALPHANUMERIC' ;
T_BY : 'BY' ;
IdWord : IdLetter IdSeparatorAndLetter* ;
IdLetter : [a-zA-Z0-9];
IdSeparatorAndLetter : ([\-]* [_]* [A-Za-z0-9]+);
FigurativeConstant :
'ZEROES' | 'ZERO' | 'SPACES' | 'SPACE'
;
statement : initStatement ;
initStatement : T_INITIALIZE identifier+ T_REPLACING (T_ALPHABETIC | T_ALPHANUMERIC) T_BY (literal | identifier) ;
literal : FigurativeConstant ;
identifier : IdWord ;
and the following input
INITIALIZE ABC REPLACING ALPHANUMERIC BY SPACES
results in
(statement (initStatement INITIALIZE (identifier ABC) REPLACING ALPHANUMERIC BY (identifier SPACES)))
I would have expected to see SPACES being recognized as "literal", not "identifier".
Any and all pointers greatly appreciated,
TIA - Alex
Every string that might match the FigurativeConstant rule will also match the IdWord rule. Because the IdWord rule is listed first and the match length is the same with either rule, the Lexer issues an IdWord token, not a FigurativeConstant token.
List the FigurativeConstant rule first and you will get the result you were expecting.
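
The rule described above (longest match wins, and on a tie the rule listed first wins) can be imitated in a few lines of Python. This is a simplified model of the lexer's choice for a single token, not ANTLR itself. The names lex_one, ID_WORD and FIGURATIVE are invented, and the patterns are hand translations of the two rules from the question.

```python
import re

def lex_one(text, rules):
    """Pick the rule with the longest match; earlier rules win ties."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when lengths are equal.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name

ID_WORD = ("IdWord", r"[a-zA-Z0-9](?:-*_*[A-Za-z0-9]+)*")
FIGURATIVE = ("FigurativeConstant", r"ZEROES|ZERO|SPACES|SPACE")

print(lex_one("SPACES", [ID_WORD, FIGURATIVE]))     # order from the question
print(lex_one("SPACES", [FIGURATIVE, ID_WORD]))     # suggested order
print(lex_one("SPACESHIP", [FIGURATIVE, ID_WORD]))  # longer match still wins
```

Reordering only changes the outcome for ties, so an identifier such as SPACESHIP is still lexed as an IdWord.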
As a matter of style, the order in which you are listing your rules obscures the significance of their order, particularly for the necessary POV of the Lexer and Parser. Take a look at the grammars in the antlr/grammars-v4 repository as examples -- typically, for a combined grammar, parser on top and a top-down ordering. I would even hazard a guess that others might have answered sooner had your grammar been easier to read.

Perl compare operators and stringified "numbers"

I've been working a lot with Perl lately, but I still don't really know how <, >, >=, <=, ne, gt, etc. work on stringified "numbers". By "number" I mean something like: '1.4.5.6.7.8.0'
Correct me if I'm wrong, but the following returns true:
if ('1.4.5' > '8.7.8');
because both will be coerced to true (not an empty string).
But how do ne, gt, and the other string operators work on such numbers?
Basically, I'm trying to compare version numbers of the following form:
1.3.4.0.2
I can make a numerical comparison of each number, but first I would rather know how the string comparison operators behave on such strings.
Thanks,
First: Please use warnings all the time. You would have realized the following at once:
$ perl -wle 'print 1 unless "1.4.5" > "8.7.8"'
Argument "8.7.8" isn't numeric in numeric gt (>) at -e line 1.
Argument "1.4.5" isn't numeric in numeric gt (>) at -e line 1.
Perl v5.9.0 came distributed with version. And this module makes it very easy to compare version numbers:
use warnings;
use version;
my ($small, $large) = (version->parse('1.4.5'), version->parse('8.7.8'));
print "larger\n" if $small > $large;
print "smaller\n" if $small < $large;
A string comparison will only work if every number between the dots has the same length. A string comparison has no knowledge of numbers and will simply compare dots and digits character by character (as they are both characters in a string).
There is a CPAN module that does exactly what you are looking for: Sort::Versions
When you compare strings using numerical relation operators <, >, etc., Perl issues a warning if you use warnings. However, Perl will still attempt to convert the strings into numbers. If the string starts with digits, Perl will use these, otherwise the string equates to 0. In your example comparing '1.4.5' and '8.7.8' has the same effect as comparing numbers 1.4 and 8.7.
But for ne, gt, etc. it really doesn't matter if your strings consist of numbers or anything else (including dots). Therefore:
print "greater" if '2.3.4' gt '10.1.2'; # prints 'greater' because '2' > '1' stringwise
print "greater" if '02.3.4' gt '10.1.2'; # prints nothing because '0' < '1' stringwise
Therefore you can use neither >, <, etc. nor gt, lt, etc. for version comparison. You have to choose a different approach, such as those proposed in the other answers.
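
The same pitfall, and the component-by-component comparison mentioned in the question, can be sketched in Python, where string comparison is also character by character. The helper name version_key is made up for this illustration, and it assumes every component is a plain non-negative integer.

```python
def version_key(version):
    """Split '1.3.4.0.2' into a tuple of ints: (1, 3, 4, 0, 2)."""
    return tuple(int(part) for part in version.split("."))

# Character-wise: '2' sorts after '1', so the "wrong" version looks greater.
print("2.3.4" > "10.1.2")

# Component-wise: 2 < 10, so 2.3.4 is the smaller version.
print(version_key("2.3.4") > version_key("10.1.2"))
```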
Not sure on the overhead of this, but you might try Sort::Naturally. And particularly, the ncmp operator.
As @tent pointed out, @SebastianStumpf's solution is close, but not quite right because:
>perl -Mversion -e 'my @n = ( "1.10", "1.9" ); print "$n[0] is " . ( version->parse($n[0]) > version->parse($n[1]) ? "larger" : "smaller" ) . " than $n[1]\n";'
1.10 is smaller than 1.9
Luckily this is easily solved following the hint in version's documentation:
The leading 'v' is now strongly recommended for clarity, and will
throw a warning in a future release if omitted.
>perl -Mversion -e 'my @n = ( "1.10", "1.9" ); print "$n[0] is " . ( version->parse("v$n[0]") > version->parse("v$n[1]") ? "larger" : "smaller" ) . " than $n[1]\n";'
1.10 is larger than 1.9
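
The two outputs above differ because the same text can be read either as one decimal number or as dot-separated integer components. Here is a small sketch of both readings in Python. It only models the idea and is not how the version module is implemented; the names as_decimal and as_dotted are invented.

```python
from decimal import Decimal

def as_decimal(version):
    """Read the whole string as one decimal number (1.10 equals 1.1)."""
    return Decimal(version)

def as_dotted(version):
    """Read the string as dot-separated integer components."""
    return tuple(int(part) for part in version.split("."))

print(as_decimal("1.10") > as_decimal("1.9"))  # decimal reading
print(as_dotted("1.10") > as_dotted("1.9"))    # dotted reading
```

Under the decimal reading 1.10 is the smaller value, and under the dotted reading it is the larger one, matching the two results shown above.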
