PetitParser not distributive? - pharo

Are rules in PetitParser distributive?
I had the following rules:
integerLiteral --> hexIntegerLiteral / octalIntegerLiteral / decimalIntegerLiteral
hexIntegerLiteral --> hexNumeral , (integerTypeSuffix optional)
octalIntegerLiteral --> octalNumeral , (integerTypeSuffix optional)
decimalIntegerLiteral --> decimalNumeral , (integerTypeSuffix optional)
If I change them to:
integerLiteral --> (hexIntegerLiteral / octalIntegerLiteral / decimalIntegerLiteral) , (integerTypeSuffix optional)
hexIntegerLiteral --> hexNumeral
octalIntegerLiteral --> octalNumeral
decimalIntegerLiteral --> decimalNumeral
then 0777L is no longer parsed. It should match octalNumeral , (integerTypeSuffix optional), or in the new version octalIntegerLiteral , (integerTypeSuffix optional), but that isn't happening.

Yes, ordered choice in PetitParser is distributive. Some context is missing from your example, so I can't tell why it doesn't work for you.
The PetitParser optimizer performs the change you suggest automatically. The rewrite rule (in a slightly more general form) is defined as:
PPOptimizer>>#postfixChoice
    <optimize>
    | before prefix body1 body2 postfix after |
    before := PPListPattern any.
    prefix := PPPattern any.
    body1 := PPListPattern any.
    body2 := PPListPattern any.
    postfix := PPPattern any.
    after := PPListPattern any.
    rewriter
        replace: before / (prefix , body1) / (prefix , body2) / after
        with: before / (prefix , (body1 / body2)) / after.
    rewriter
        replace: before / (body1 , postfix) / (body2 , postfix) / after
        with: before / ((body1 / body2) , postfix) / after
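To see the postfix rewrite at work on the question's rules, here is a minimal sketch with toy ordered-choice combinators in JavaScript. It is not PetitParser's API: the helpers re, seq, choice and optional are hypothetical, and the numeral patterns are simplified assumptions because the asker's real definitions are not shown. With an optional suffix, both forms of the grammar accept 0777L.

```javascript
// Toy PEG combinators: each parser maps (input, pos) to the new position, or -1 on failure.
const re = (pattern) => {
  const sticky = new RegExp(pattern.source, "y"); // sticky: match only at lastIndex
  return (s, i) => {
    sticky.lastIndex = i;
    const m = sticky.exec(s);
    return m ? i + m[0].length : -1;
  };
};
const seq = (...ps) => (s, i) => {
  for (const p of ps) {
    i = p(s, i);
    if (i < 0) return -1;
  }
  return i;
};
const choice = (...ps) => (s, i) => {
  for (const p of ps) {
    const j = p(s, i);
    if (j >= 0) return j; // ordered choice: the first alternative that succeeds wins
  }
  return -1;
};
const optional = (p) => (s, i) => {
  const j = p(s, i);
  return j >= 0 ? j : i;
};

// Assumed, simplified numerals (hypothetical stand-ins for the real rules).
const hexNumeral = re(/0[xX][0-9a-fA-F]+/);
const octalNumeral = re(/0[0-7]+/);
const decimalNumeral = re(/0|[1-9][0-9]*/);
const suffix = optional(re(/[lL]/));

// (a , s) / (b , s) / (c , s)
const original = choice(
  seq(hexNumeral, suffix),
  seq(octalNumeral, suffix),
  seq(decimalNumeral, suffix)
);
// (a / b / c) , s
const factored = seq(choice(hexNumeral, octalNumeral, decimalNumeral), suffix);

const parsesFully = (p, s) => p(s, 0) === s.length;
for (const input of ["0777L", "0x1F", "42l", "0"]) {
  console.log(input, parsesFully(original, input), parsesFully(factored, input));
}
```

If the factored form rejects 0777L in the real grammar, the cause is likely in the parts of the grammar that were not posted, for example how the numeral rules themselves are defined or ordered.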

Related

Comma separated with/without spaces change behaviour. Spaces are skipped though

First of all, thanks a lot for your time.
Practicing a little bit more with antlr4, I made this grammar (below).
Input
The tested input is the following:
text to search query_on:fielda,fieldab fielda:"123" sort_by:+fielda,-fieldabc
This produces the following output, which starts to fail at the varname inside the query_on rule:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda,fieldab fielda)))) : "123" sort_by : + fielda, - fieldabc\n)
If I instead put spaces around the commas in the input:
text to search query_on:fielda , fieldab fielda:"123" sort_by:+fielda , -fieldabc
The output is much closer to my expected output:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda) , (varname fieldab)) (filters (binary_op (varname fielda) : (value "123"))) (sorting_fields sort_by : (sorting_field (sorting_order (asc +)) (varname fielda)) , (sorting_field (sorting_order (desc -)) (varname fieldabc\n))))) <EOF>)
The only failing part is the last \n.
Expected
The expected result is the same as before, but accepting the varname fieldabc and skipping the \n:
(start (query (expr (text_query text to search) (query_on query_on : (varname fielda) , (varname fieldab)) (filters (binary_op (varname fielda) : (value "123"))) (sorting_fields sort_by : (sorting_field (sorting_order (asc +)) (varname fielda)) , (sorting_field (sorting_order (desc -)) (varname fieldabc))))))
Questions
Therefore:
Why is the grammar sensitive to the spaces around a comma?
Similarly, why is the \n character not skipped at the end?
Thanks!
GRAMMAR
grammar SearchEngine;
// Grammar
start: query EOF;
query
: '(' query+ ')'
| query (OR query)+
| expr
;
expr: text_query query_on? filters* sorting_fields?;
text_query: STRING+;
query_on: QUERY_ON ':' varname (',' varname)*;
filters: binary_op+;
binary_op: varname ':' value;
sorting_fields: SORT_BY ':' sorting_field (',' sorting_field)*;
sorting_field: sorting_order varname;
sorting_order: (asc|desc);
asc: '+';
desc: '-';
varname
: FIELDA
| FIELDAB
| FIELDABC
;
value: STRING;
// Lexer rules (tokens)
WHITE_SPACE: [ \t\r\n] -> skip;
OR: O R;
QUERY_ON: Q U E R Y '_' O N;
SORT_BY: S O R T '_' B Y;
FIELDA: F I E L D A;
FIELDAB: F I E L D A B;
FIELDABC: F I E L D A B C;
STRING: ~[ :()+-]+;
// Fragments (not tokens)
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
fragment D: [dD];
fragment E: [eE];
fragment F: [fF];
fragment G: [gG];
fragment H: [hH];
fragment I: [iI];
fragment J: [jJ];
fragment K: [kK];
fragment L: [lL];
fragment M: [mM];
fragment N: [nN];
fragment O: [oO];
fragment P: [pP];
fragment Q: [qQ];
fragment R: [rR];
fragment S: [sS];
fragment T: [tT];
fragment U: [uU];
fragment V: [vV];
fragment W: [wW];
fragment X: [xX];
fragment Y: [yY];
fragment Z: [zZ];
Your STRING lexer rule accepts commas, tabs and linefeeds, so it swallows fielda,fieldab and the trailing \n as part of a single token. Try:
STRING: ~[ :()+,\t\r\n-]+;
(Having your WHITE_SPACE rule above it won't affect this, because ANTLR's lexer selects the rule that matches the longest sequence of characters; rule order only breaks ties. The - is kept last in the set so that it is read as a literal hyphen rather than as a range.) This is also why you'll usually see grammars require some sort of delimiter around strings. (In most languages, the delimiters also distinguish identifiers from string literals.)
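The longest-match behaviour described above can be imitated with a small tokenizer. This is only a sketch in JavaScript, not ANTLR itself: tokenize and the rule tables are hypothetical, the rule set is cut down to a few tokens, and the "i" flag stands in for the case-insensitive fragments. It compares the original STRING rule with the tightened one.

```javascript
// Maximal-munch tokenizer: at each position the rule with the longest match wins;
// ties go to the rule listed first (the same tie-break ANTLR uses).
function tokenize(input, rules) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let best = null;
    for (const [name, pattern] of rules) {
      const sticky = new RegExp(pattern.source, "iy"); // match only at pos, ignore case
      sticky.lastIndex = pos;
      const m = sticky.exec(input);
      if (m && m[0].length > 0 && (!best || m[0].length > best.text.length)) {
        best = { name, text: m[0] };
      }
    }
    if (!best) throw new Error(`token recognition error at: '${input[pos]}'`);
    // WHITE_SPACE is skipped, as with "-> skip" in the grammar.
    if (best.name !== "WHITE_SPACE") tokens.push(`${best.name}(${JSON.stringify(best.text)})`);
    pos += best.text.length;
  }
  return tokens;
}

const shared = [
  ["WHITE_SPACE", /[ \t\r\n]/],
  ["FIELDA", /fielda/],
  ["FIELDAB", /fieldab/],
  ["COMMA", /,/],
];
const before = [...shared, ["STRING", /[^ :()+\-]+/]];         // original rule
const after = [...shared, ["STRING", /[^ :()+,\t\r\n\-]+/]];   // tightened rule

console.log(tokenize("fielda,fieldab\n", before)); // one STRING token, newline included
console.log(tokenize("fielda,fieldab\n", after));  // FIELDA, COMMA, FIELDAB
```

With the original rule, STRING matches the whole of fielda,fieldab plus the newline, which is longer than anything FIELDA can match, so the keyword tokens never appear.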

Regular expression (Sass variables)

(Node.js) I have to match all Sass variables in a file, but a file can contain both variables and mixins. I need to update my regular expression so that it does not match variables in a mixin directive or in (nested) mixin / function content.
So only:
$test: true;
$white: #fff !default;
$sizes: (25: 0.25rem, 50: 0.5rem) !default;
Regular expression: /\$([^:]*)\s*:\s*([^;]*)\s*;/g
https://regex101.com/r/oRuWjS/1
$test: true;
$white: #fff !default;
$sizes: (
25: 0.25rem,
50: 0.5rem
) !default;
@mixin parent ($first, $second: "") {
.#{$first} {
@content;
}
}
Using
/@mixin[^(]*\([\s\S]*?^}$|\$([^:]*?)\s*:\s*([^;]*?)\s*;/gm
you can match everything from @mixin up to the } that stands alone on a line and skip that match; otherwise, collect the captured groups. See the regex demo.
var s = "$test: true;\n\n$white: #fff !default;\n\n$sizes: (\n 25: 0.25rem,\n 50: \n0.5rem\n) !default;\n\n@mixin parent ($first, $second: \"\") {\n .#{$first} {\n \n@content;\n }\n}\n";
var rx = /@mixin[^(]*\([\s\S]*?^}$|\$([^:]*?)\s*:\s*([^;]*?)\s*;/gm;
var m, res = [];
while (m = rx.exec(s)) {
    // Group 1 is only set when the variable alternative matched.
    if (m[1]) {
        res.push([m[1], m[2]]);
    }
}
console.log(res);
Details
@mixin[^(]*\([\s\S]*?^}$ - the alternative that will be skipped:
@mixin - a literal substring
[^(]* - 0+ chars other than (
\( - a (
[\s\S]*? - any 0+ chars as few as possible
^}$ - a } that is on a separate line
| - or
\$ - a $ char
([^:]*?) - Group 1: 0+ chars other than : as few as possible
\s*:\s* - a : enclosed with 0+ whitespaces
([^;]*?) - Group 2: 0+ chars other than ; as few as possible
\s*; - 0+ whitespaces followed with ;.

non-fragment lexer rule x can match the empty string

What's wrong with the following ANTLR lexer rule?
I get this warning:
warning(146): MySQL.g4:5685:0: non-fragment lexer rule VERSION_COMMENT_TAIL can match the empty string
The relevant source code:
VERSION_COMMENT_TAIL:
{ VERSION_MATCHED == False }? // One level of block comment nesting is allowed for version comments.
((ML_COMMENT_HEAD MULTILINE_COMMENT) | . )*? ML_COMMENT_END { self.setType(MULTILINE_COMMENT); }
| { self.setType(VERSION_COMMENT); IN_VERSION_COMMENT = True; }
;
Are you trying to convert my ANTLR3 grammar for MySQL to ANTLR4? Remove all the comment rules in the lexer and insert this instead:
// There are 3 types of block comments:
// /* ... */ - The standard multi line comment.
// /*! ... */ - A comment used to mask code for other clients. In MySQL the content is handled as normal code.
// /*!12345 ... */ - Same as the previous one except code is only used when the given number is a lower value
// than the current server version (specifying so the minimum server version the code can run with).
VERSION_COMMENT_START: ('/*!' DIGITS) (
{checkVersion(getText())}? // Will set inVersionComment if the number matches.
| .*? '*/'
) -> channel(HIDDEN)
;
// inVersionComment is a variable in the base lexer.
MYSQL_COMMENT_START: '/*!' { inVersionComment = true; setChannel(HIDDEN); };
VERSION_COMMENT_END: '*/' {inVersionComment}? { inVersionComment = false; setChannel(HIDDEN); };
BLOCK_COMMENT: '/*' ~[!] .*? '*/' -> channel(HIDDEN);
POUND_COMMENT: '#' ~([\n\r])* -> channel(HIDDEN);
DASHDASH_COMMENT: DOUBLE_DASH ([ \t] (~[\n\r])* | LINEBREAK | EOF) -> channel(HIDDEN);
You need a local inVersionComment member and a function checkVersion() in your lexer (I have it in the base lexer from which the generated lexer derives) which returns true or false, depending on whether the current server version is equal to or higher than the given version.
And for your question: you cannot have actions in alternatives. Actions can only appear at the end of an entire rule. This differs from ANTLR3.
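The checkVersion() helper mentioned in the answer is not shown. The following JavaScript sketch is one way it could behave, under loud assumptions: the class name VersionCommentState, the serverVersion field and the numeric encoding (50708 for 5.7.8) are illustrative only and do not come from the answerer's base lexer.

```javascript
// Hypothetical stand-in for the base-lexer state the answer describes.
class VersionCommentState {
  constructor(serverVersion) {
    this.serverVersion = serverVersion; // assumed encoding, e.g. 50708 for 5.7.8
    this.inVersionComment = false;
  }

  // text is the prefix matched so far, e.g. "/*!50708".
  checkVersion(text) {
    const match = /^\/\*!(\d+)$/.exec(text);
    if (!match) return false;
    const required = parseInt(match[1], 10);
    if (this.serverVersion >= required) {
      this.inVersionComment = true; // the comment body is then lexed as normal code
      return true;
    }
    return false; // the lexer rule falls through to the ".*? '*/'" alternative
  }
}

const state = new VersionCommentState(50708);
console.log(state.checkVersion("/*!50700"), state.inVersionComment);
console.log(new VersionCommentState(50708).checkVersion("/*!80000"));
```

In a real ANTLR target, this logic would sit in the superclass of the generated lexer so that the predicate in VERSION_COMMENT_START can call it.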

antlr literal string matching: what am I doing wrong?

I've been using antlr for 3 days. I can parse expressions, write Listeners, interpret parse trees... it's a dream come true.
But then I tried to match a literal string 'foo%' and I'm failing. I can find plenty of examples that claim to do this. I have tried them all.
So I created a tiny project to match a literal string. I must be doing something silly.
grammar Test;
clause
: stringLiteral EOF
;
fragment ESCAPED_QUOTE : '\\\'';
stringLiteral : '\'' ( ESCAPED_QUOTE | ~('\n'|'\r') ) + '\'';
Simple test:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class Test {
    @org.junit.Test
    public void test() {
        String input = "'foo%'";
        TestLexer lexer = new TestLexer(new ANTLRInputStream(input));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        ParseTree clause = parser.clause();
        System.out.println(clause.toStringTree(parser));
        ParseTreeWalker walker = new ParseTreeWalker();
    }
}
The result:
Running com.example.Test
line 1:1 token recognition error at: 'f'
line 1:2 token recognition error at: 'o'
line 1:3 token recognition error at: 'o'
line 1:4 token recognition error at: '%'
line 1:6 no viable alternative at input '<EOF>'
(clause (stringLiteral ' ') <EOF>)
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.128 sec - in com.example.Test
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
The full maven-ized build tree is available for a quick review here
31 lines of code... most of it borrowed from small examples.
$ mvn clean test
Using antlr-4.5.2-1.
Fragment rules can only be used by other lexer rules, so you need to make stringLiteral a lexer rule instead of a parser rule. Just let its name start with an upper-case letter.
Also, it's better to extend your negated class ~('\n'|'\r') to exclude the backslash and the quote, and you might want to allow the backslash itself to be escaped:
clause
: StringLiteral EOF
;
StringLiteral : '\'' ( Escape | ~('\'' | '\\' | '\n' | '\r') ) + '\'';
fragment Escape : '\\' ( '\'' | '\\' );
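To check which inputs the revised StringLiteral rule accepts, the same pattern can be written as a JavaScript regular expression. This is only a sketch for experimenting outside ANTLR; the sample inputs are made up.

```javascript
// Regex counterpart of:
//   StringLiteral : '\'' ( Escape | ~('\'' | '\\' | '\n' | '\r') )+ '\'';
//   fragment Escape : '\\' ( '\'' | '\\' );
const stringLiteral = /^'(?:\\['\\]|[^'\\\n\r])+'$/;

const samples = [
  "'foo%'",           // plain literal
  "'it\\'s'",         // escaped quote inside the literal
  "'unterminated",    // missing closing quote
  "'bad \\q escape'", // backslash followed by a character that cannot be escaped
];
for (const s of samples) {
  console.log(s, stringLiteral.test(s));
}
```

The first two samples match and the last two do not. Because the negated class excludes the backslash, a backslash can only appear as the start of an Escape.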

Antlr4 grammar ambiguity

I have the following grammar ( minimized for SO)
grammar Hello;
odataIdentifier : identifierLeadingCharacter identifierCharacter*;
identifierLeadingCharacter : Alpha| UNDERSCORE;
identifierCharacter : identifierLeadingCharacter | Digit;
identifierUnreserved : identifierCharacter | (MINUS | DOT | TILDE);
Digit : ZERO_TO_FIVE |[6-9];
ONEHUNDRED_TO_ONEHUNDREDNINETYNINE : '1' Digit Digit; // 100-199
TWOHUNDRED_TO_TWOHUNDREDFOURTYNINE : '2' ZERO_TO_FOUR Digit; // 200-249
TWOHUNDREDFIFTY_TO_TWOHUNDREDFIFTYFIVE : '25' ZERO_TO_FIVE; // 250-255
TEN_TO_NINETYNINE : ONE_TO_NINE Digit; // 10-99
ZERO_TO_ONE : [0-1];
ZERO_TO_TWO : ZERO_TO_ONE | [2];
ZERO_TO_THREE : ZERO_TO_TWO | [3];
ZERO_TO_FOUR : ZERO_TO_THREE | [4];
ZERO_TO_FIVE : ZERO_TO_FOUR | [5];
ONE_TO_TWO : [1-2];
ONE_TO_THREE : ONE_TO_TWO | [3];
ONE_TO_FOUR : ONE_TO_THREE | [4];
ONE_TO_NINE : ONE_TO_FOUR | [5-9];
Alpha : [a-zA-Z];
MINUS : [-];
DOT : '.';
UNDERSCORE : '_';
TILDE : '~';
WS : (' '|'\r'|'\t'|'\u000C'|'\n') -> skip
;
For input c9 it works fine, but when I have two digits, for example c10, it says:
extraneous input '92' expecting {<EOF>, Digit, Alpha, '_'}
So I guess it parses 9 and parses 2 and doesn't know whether this should be TEN_TO_NINETYNINE or two Digit tokens.
I am new to this, so I'm wondering whether my analysis is right and how I could fix this.
Your input is resulting in an Alpha token followed by a TEN_TO_NINETYNINE token. While the parser rule identifierLeadingCharacter does allow the Alpha token, the identifierCharacter rule cannot match a TEN_TO_NINETYNINE token.
The input 10 will always produce a TEN_TO_NINETYNINE token rather than two Digit tokens, because the former matches more of the input and lexer rules are greedy.
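The token sequence described above can be reproduced with a few lines of JavaScript. This is a rough sketch of the longest-match rule, not ANTLR's implementation: tokenNames is a hypothetical helper, and only three of the grammar's lexer rules are modelled, with their fragments-like sub-rules inlined.

```javascript
// Three of the lexer rules, in grammar order; the "y" flag anchors each match at lastIndex.
const rules = [
  ["Digit", /[0-9]/y],
  ["TEN_TO_NINETYNINE", /[1-9][0-9]/y],
  ["Alpha", /[a-zA-Z]/y],
];

// Longest match wins; on a tie the earlier rule wins.
function tokenNames(input) {
  const names = [];
  for (let pos = 0; pos < input.length; ) {
    let best = null;
    for (const [name, pattern] of rules) {
      pattern.lastIndex = pos;
      const m = pattern.exec(input);
      if (m && (!best || m[0].length > best.length)) {
        best = { name, length: m[0].length };
      }
    }
    if (!best) throw new Error(`token recognition error at: '${input[pos]}'`);
    names.push(best.name);
    pos += best.length;
  }
  return names;
}

console.log(tokenNames("c9").join(" "));  // Alpha Digit
console.log(tokenNames("c10").join(" ")); // Alpha TEN_TO_NINETYNINE
```

Since the parser rule identifierCharacter only accepts Digit, Alpha and UNDERSCORE tokens, the second token sequence cannot be parsed as an odataIdentifier.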

Resources