ANTLR4: Tree construction

I am extending the base Listener class and am attempting to read in some values, but there doesn't seem to be any hierarchy in the order.
A cut down version of my grammar is as follows:
start: config_options+ ;
config_options: (KEY) EQUALS^ (PATH | ALPHANUM) (' '|'\r'|'\n')* ;
KEY: 'key' ;
EQUALS: '=' ;
ALPHANUM: [0-9a-zA-Z]+ ;
However, the parse tree for this grammar is flat at the config_options level (the terminal level). That is, the rule start has many config_options children, but EQUALS is not the root of a subtree within config_options; all of the tokens have the rule config_options as their root node. How can I make one of the terminals a root node instead?
In this particular rule I don't want any of the spaces to be captured. I know that there is the -> skip directive for the lexer, but there are some cases where I do want the space, e.g. in String: '"'(ALPHANUM|' ')'"'
(Note: the ^ does not seem to work.)
An example input is:
key=abcdefg
key=90weata
key=acbefg9
All I want to do is extract the key and value pairs. I would expect that the '=' would be the root and the two children would be the key and the value.

When you generate code from your grammar, you should get a syntax error for the ^ operator, which was removed in ANTLR 4. ANTLR 4 generates parse trees, the roots of which are implicitly defined by the rules in your grammar. In other words, for the grammar you gave above, the parse tree nodes will be start and config_options, and the tokens will be leaves beneath them.
The generated config_options rule will return an instance of Config_optionsContext, which contains the following methods:
KEY() returns a TerminalNode for the KEY token.
EQUALS() (same for the EQUALS token)
PATH() (same for the PATH token)
ALPHANUM() (same for the ALPHANUM token)
You can call getSymbol() on a TerminalNode to get the Token instance.

Related

UTF16 stored string doesn't match once retrieved back from CoreData

So I am using CoreStore to save a string identifier in CoreData.
The string may have some Swedish UTF16 characters. Inspecting from the debugger console:
> po identifier
"/EXTERNAL/Gemensam RUN/FileCloud Test/Test folder åäö/Test with Swedish characters - åäö.xlsx"
Immediately after being saved back to CoreData:
>po record
<File: 0x281e140a0> (entity: File; id: 0xdcac6620f1e9eb63 <x-coredata://BA0168AF-92CE-4AC2-A934-1020E41C5C20/File/p615>; data: {
// ...
identifier = "filecloud.test#run.se#files.runcloud.se/EXTERNAL/Gemensam RUN/FileCloud Test/Test folder \U00e5\U00e4\U00f6/Test with Swedish characters - \U00e5\U00e4\U00f6.xlsx";
// ...
})
It looks as if the UTF16 string has been stored as a UTF8 one, but it is still a valid one, since:
> po record.identifier == identifier
true
The problem comes later, when I try to retrieve the record using the same UTF16 Swedish identifier string as the original above: it no longer matches.
CoreStore.fetchOne(From<Record>().where(\.identifier == identifier)) // Fails
How could I convert identifier to a representation that would match the stored CoreData value?
Update
Even more strange, a hardcoded identifier does succeed:
CoreStore.fetchOne(From<Record>().where(\.identifier == "filecloud.test#run.se#files.runcloud.se/EXTERNAL/Gemensam RUN/FileCloud Test/Test folder åäö/Test with Swedish characters - åäö.xlsx")) // Works
And identifier and this hardcoded string do match:
po identifier == "filecloud.test#run.se#files.runcloud.se/EXTERNAL/Gemensam RUN/FileCloud Test/Test folder åäö/Test with Swedish characters - åäö.xlsx"
true
But using identifier instead of the hardcoded one doesn't.
Update 2
Comparing .unicodeScalars of identifier and the hardcoded string does show that they are indeed different:

CoreData does save and return strings exactly as they were given.
The issue when retrieving values that contain complex characters is that CoreData (and most probably SQLite behind it) does not consider my two strings equal, because their grapheme clusters are composed of different Unicode scalars. Both strings are valid and compare equal in Swift, but not in CoreData when used as values to retrieve objects.
I did not find a proper way to convert between these representations in Swift, so my workaround was to recreate the process that led to the original representation in the first place. This involved first creating a URL out of the string and then letting the FileProvider framework produce the same representation by calling persistentIdentifierForItem(at: url)!.rawValue. I then use this value to retrieve my saved object.
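The effect described above can be reproduced outside Swift. The following is a minimal sketch in JavaScript (not the asker's code), assuming the two identifiers differ only in Unicode normalization form, i.e. precomposed versus decomposed characters; the thread does not show the actual scalars, so that is an assumption. Swift's == compares strings by canonical equivalence, which is why the two compare equal there, while a plain code-unit comparison does not.

```javascript
// Two spellings of "åäö": precomposed scalars vs. base letter + combining mark.
// Which form each of the asker's strings really uses is an assumption.
const composed = "\u00e5\u00e4\u00f6";
const decomposed = "a\u030aa\u0308o\u0308";

// Code-unit comparison treats them as different strings.
const rawEqual = composed === decomposed;

// Normalizing both sides to the same form (NFC here) makes them match.
const nfcEqual = composed.normalize("NFC") === decomposed.normalize("NFC");

// Number of Unicode scalars in each spelling.
const scalarCounts = [[...composed].length, [...decomposed].length];

console.log(rawEqual, nfcEqual, scalarCounts);
```

In Swift, Foundation exposes precomposedStringWithCanonicalMapping and decomposedStringWithCanonicalMapping on String for the same kind of normalization, though whether either reproduces the FileProvider identifier exactly is not confirmed here.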

Node.JS - if a string includes some strings with any characters in between

I am testing a string that contains an identifier for the type of device that submitted the string. The device type identifier will be something like "123456**FF789000AB", where each * denotes a position at which any character could appear. I run a series of functions to parse additional data and set variables based on the type of device submitting the data. Currently, I have the following statement:
if (payload[4].includes("02010612FF590080BC")) { function(topic, payload, intpl)};
The string tested in the includes() test will always start with 020106, but the next two characters could be anything. Is there a quick regex I could throw in the includes function, or should I organize the test in a different way?

To match the "020106**FF590080BC" pattern, where * can be anything, you can use RegExp.test() and the regular expression /020106..FF590080BC/:
if (/020106..FF590080BC/.test(payload[4])) { ... }
Also, if you require that the pattern must match the beginning of the string:
if (/^020106..FF590080BC/.test(payload[4])) { ... }
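As a small sketch of the test above, the pattern can be defined once and wrapped in a helper. The name isTargetDevice and the sample strings are made up for illustration. Note that . matches any character except a line terminator, so this assumes the identifier never contains a newline.

```javascript
// Hypothetical helper; the name and the sample strings are illustrative only.
const DEVICE_PATTERN = /020106..FF590080BC/;

function isTargetDevice(field) {
  // No "g" flag on the pattern, so test() keeps no state between calls.
  return DEVICE_PATTERN.test(field);
}

const samples = [
  "02010612FF590080BC",     // exact form from the question
  "XX020106ABFF590080BCYY", // pattern embedded in a longer string
  "0201061FF590080BC",      // only one wildcard character: should not match
];
const results = samples.map(isTargetDevice);
console.log(results);
```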

The lexer chooses the wrong Token

Hi, I am new to ANTLR and have a problem that I have not been able to solve over the last few days:
I want to write a grammar that recognizes this text (in reality I want to parse something different, but I have simplified it for this question):
100abc
150100
200def
Here each row starts with 3 digits that identify the type of the line (header, content, trailer), followed by 3 characters that are the payload of the line.
I thought I could recognize this with this grammar:
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
header : '100' A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
But this does not work. When the lexer reads the "100" in the second line ("150100"), it reads it as one token with the value 100 and not as three tokens of type A. So the parser sees a '100' token where it expects an A token.
I am pretty sure that this happens because the lexer tries to match the longest possible text for one token, so it clusters the '1', '0', '0' together. I found no way to solve this. Putting the rule A above the parser rule that contains the string literal '100' did not work. Factoring the '100' out into its own rule as follows did not work either.
grammar Types;
file : header content trailer;
A : [a-z|A-Z|0-9];
NL: '\n';
HUNDRED: '100';
header : HUNDRED A A A NL;
content: '150' A A A NL;
trailer: '200' A A A NL;
I also read some other posts like this:
antlr4 mixed fragments in tokens
Lexer, overlapping rule, but want the shorter match
But I don't think they solve my problem, or at least I don't see how they could help me.

One of your token definitions is incorrect: A : [a-z|A-Z|0-9]; Don't use a vertical bar inside a [] character set, because there it is a literal | character rather than an alternation. A correct definition is: A : [a-zA-Z0-9];. ANTLR versions >= 4.6 will warn about the duplicated | characters inside the set.
As I understand it, you have mixed up the concepts of tokens and rules. Token names start with an upper-case letter, whereas parser rule names start with a lower-case letter. Your header, content and trailer should be tokens, not parser rules.
So, in my opinion, the final version of the correct grammar is
grammar Types;
file : Header Content Trailer;
A : [a-zA-Z0-9];
NL: '\r' '\n'? | '\n' | EOF; // Or leave only one type of newline.
Header : '100' A A A NL;
Content: '150' A A A NL;
Trailer: '200' A A A NL;
Your input text will be parsed to (file 100abc\n 150100\n 200def)
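To see why the original grammar fails and the whole-line token version works, here is a toy longest-match lexer in JavaScript. It is only a sketch of the "longest match wins, ties go to the earliest rule" idea, not ANTLR's implementation; the function tokenize, the rule tables and the regular expressions standing in for the lexer rules are all made up for illustration, and the EOF alternative of NL is left out.

```javascript
// Toy maximal-munch lexer: at each position every rule is tried, the longest
// match wins, and ties go to the rule listed first. Illustrative only.
function tokenize(rules, input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let best = null;
    for (const { name, pattern } of rules) {
      const re = new RegExp(pattern.source, "y"); // sticky: match only at pos
      re.lastIndex = pos;
      const m = re.exec(input);
      if (m && m[0].length > 0 && (best === null || m[0].length > best.text.length)) {
        best = { name, text: m[0] };
      }
    }
    if (best === null) throw new Error(`no rule matches at offset ${pos}`);
    tokens.push(best);
    pos += best.text.length;
  }
  return tokens;
}

// Stand-in for the question's grammar: the literals '100', '150', '200' are tokens.
const literalRules = [
  { name: "'100'", pattern: /100/ },
  { name: "'150'", pattern: /150/ },
  { name: "'200'", pattern: /200/ },
  { name: "A", pattern: /[a-zA-Z0-9]/ },
  { name: "NL", pattern: /\n/ },
];

// Stand-in for the answer's grammar: each whole line is one token.
const wholeLineRules = [
  { name: "A", pattern: /[a-zA-Z0-9]/ },
  { name: "NL", pattern: /\n/ },
  { name: "Header", pattern: /100[a-zA-Z0-9]{3}\n/ },
  { name: "Content", pattern: /150[a-zA-Z0-9]{3}\n/ },
  { name: "Trailer", pattern: /200[a-zA-Z0-9]{3}\n/ },
];

const input = "100abc\n150100\n200def\n";
const literalNames = tokenize(literalRules, input).map((t) => t.name).join(" ");
const wholeLineNames = tokenize(wholeLineRules, input).map((t) => t.name).join(" ");
console.log(literalNames);
console.log(wholeLineNames);
```

With the first rule table, the payload "100" on the second line comes out as a single '100' token rather than three A tokens, which is the behaviour described in the question. With the second table, each line is longer than any single A match, so the whole line wins.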

Antlr4 no rule index for labelled rules

For the grammar snippet from Java.g4,
statement
: block # blockStmt
| 'if' parExpression statement ('else' statement)? # ifStmt
| 'for' '(' forControl ')' statement # forStmt
| 'while' parExpression statement # whileStmt
;
All the alternatives are labelled.
I can get all StatementContext objects using this method
Trees.getAllRuleNodes(root,JavaParser.Rule_statement);
But if I am only interested in getting the IfStmtContext objects, how can I use the above method without resorting to something like this?
for (ParseTree tree : statementContextList)
{
    if (tree instanceof IfStmtContext)
    {
        // add to a list
    }
}
The generated JavaParser doesn't create rule indexes for labelled alternatives.
Do I have to customize the grammar in some way to make them indexed?
Or is there another way to do this?
My code should be fast, so I need to remove as many iterations and conditions as possible, and to get rid of the instanceof checks wherever I can.

how to handle conditionally existing components in action code?

This is another problem I am facing while migrating from antlr3 to antlr4. This problem is with the java action code for handling conditional components of rules. One example is shown below.
The following grammar+code worked in antlr3. Here, if the unary operator is not present, then a value of '0' is returned, and the java code checks for this value and takes appropriate action.
exprUnary returns [Expr e]
: (unaryOp)? e1=exprAtom
{if($unaryOp.i==0) $e = $e1.e;
else $e = new ExprUnary($unaryOp.i, $e1.e);
}
;
unaryOp returns [int i]
: '-' {$i = 1;}
| '~' {$i = 2;}
;
In antlr4, this code results in a null pointer exception during a run, because 'unaryOp' is 'null' if it is not present. But if I change the code like below, then antlr generation itself reports an error:
if($unaryOp==null) ...
java org.antlr.v4.Tool try.g4
error(67): missing attribute access on rule reference 'unaryOp' in '$unaryOp'
How should the action be coded for antlr4?
Another example of this situation is in if-then-[else] - here $s2 is null in antlr4:
ifStmt returns [Stmt s]
: 'if' '(' e=cond ')' s1=stmt ('else' s2=stmt)?
{$s = new StmtIf($e.e, $s1.s, $s2.s);}
;
NOTE: question 16392152 provides a solution to this question with listeners, but I am not using listeners, my requirement is for this to be handled in the action code.

There are at least two potential ways to correct this:
The "ANTLR 4" way to do it is to create a listener or visitor instead of placing the Java code inside of actions embedded in the grammar itself. This is the only way I would even consider solving the problem in my own grammars.
If you still use an embedded action, the most efficient way to check if the item exists or not is to access the ctx property, e.g. $unaryOp.ctx. This property resolves to the UnaryOpContext you were assuming would be accessible by $unaryOp by itself.

ANTLR expects you to access an attribute. Try its text attribute instead: $unaryOp.text==null
