How to go from token value to lexer rule?

So far I believed that a token value (as generated by a lexer rule) is the same as the index of that rule. Obviously that is not the case, as you can see when you look through the ruleNames and literal/display name fields in your generated code. Rule names are partially in a different order compared to display names (which are only the string representations of token values) and also include entries like fragment rules. On the other hand, there are no entries for virtual tokens (as defined in the tokens section).
Now when you want to get the rule index from a token value, how would you do that? The only way I can imagine is to get the symbolic name (which is the rule name) from the vocabulary and then look this up in the rule names array. But that seems a bit odd. There should be a more direct way. Any idea?
Additional info: the lookup is needed when you want to walk the ATN starting in a parser rule. Lexer token values are stored as transition labels in the parser ATN, and that's where they come from. In order to continue walking in the lexer ATN you need the correct rule index.

In general this is not possible. A lexer rule can return a token whose type does not correspond to the rule name at all (for example via a type() action). Hence there is no reliable relationship between token values and the rules that produced them (often they match, but not always).
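If an approximation is good enough, the indirect lookup described in the question (symbolic name from the vocabulary, then a search through the rule names) is the usual workaround. Below is a minimal sketch in Java; the class and method names are mine, but the ANTLR runtime calls (getVocabulary, getSymbolicName, getRuleNames) are real:

import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.Vocabulary;

final class TokenTypeToRule {
    // Best-effort lookup: token type -> lexer rule index. Returns -1 when
    // no rule shares the token's symbolic name (e.g. virtual tokens).
    static int ruleIndexForTokenType(Lexer lexer, int tokenType) {
        Vocabulary vocabulary = lexer.getVocabulary();
        String symbolicName = vocabulary.getSymbolicName(tokenType);
        if (symbolicName == null) {
            return -1; // literal-only or unknown token type
        }
        String[] ruleNames = lexer.getRuleNames();
        for (int i = 0; i < ruleNames.length; i++) {
            if (ruleNames[i].equals(symbolicName)) {
                return i;
            }
        }
        return -1; // e.g. a virtual token from the tokens section
    }
}

Note that this inherits exactly the unreliability described above: for an aliased token type the returned index is the rule whose name happens to match, not necessarily the rule that produced the token.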

Related

Antlr4 Lexer with two continuous parentheses

My text is: Function(argument, function(argument)). When I use the g4 debugger to generate the tree, it works when there is a blank between the last two parentheses: Function(argument, function(argument) ). But the debugger reports an unexpected '))' when there is no blank. So, how should I revise my grammar to make this work?
It confuses me a lot.
(It will be much easier to confirm the answer to your question if you post the grammar, or at least a stripped down version that demonstrates the issue. Note: Often, stripping down to a minimal example to demonstrate the issue will result in you finding your own answer.)
That said, based upon your error message, I would guess that you have a Lexer rule that matches )). If you do, then ANTLR will match that rule and create that token rather than creating two ) tokens.
Most of the time this mistake originates from not understanding that the Lexer is completely independent of the parser. The Lexer recognizes a stream of characters and produces a stream of tokens, and the parser then matches its rules against that stream of tokens. In this case, I would guess that the parser rule is looking to match a ) token but finds the )) token instead. Those are two different tokens, so the parser rule fails.
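As an illustration (the original grammar isn't posted, so this is only a guess at its shape), a rule pair like the following reproduces the symptom, because the lexer always prefers the longest match:

DOUBLE_RPAREN: '))';   // hypothetical rule; consumes both parentheses as one token
RPAREN: ')';

Removing the '))' rule lets the lexer emit two separate RPAREN tokens, which is what a parser rule expecting ')' ')' needs.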

What are token types and the vocabulary in ANTLR4?

I couldn't find any good resource online that describes this well. Does "token type" mean the types we encounter in a programming language like int, string, char etc.? I see that it is some integer, but what does this integer mean? And what is a vocabulary? Looking for some explanation with a simple bare minimum grammar.
The idea of token types and also the vocabulary is so simple that probably nobody has thought about formally describing them. But here it is:
During the lexing process the Lexer assigns numbers to parts of the input text. That means a mapping is created between specific patterns in the input and an arbitrary number. This number is called the token type.
The lexer rules in a grammar describe the patterns which must be matched, and the lexer rule names are the textual expression of the tokens created from the matched input. Usually lexer rules get their token types assigned in the order they appear in the grammar: the first lexer rule gets token type 1, the next token type 2, and so on (type 0 is reserved as the invalid token type). In some situations (imported grammars, token vocabularies or virtual tokens) this order can differ, however.
A vocabulary is a generated structure that maps a token type to its symbolic name (the lexer rule name), its literal text (if any) and a display name. This is used in cases where you need the name for error messages, code completion or debugging.
Note: there's no such structure to map rule names back to token values (or, in the case of the parser, from rule names to rule indices). The reason is that a rule can return a different token type than the one defined by its name. For example, consider this lexer rule from the MySQL grammar:
CHARACTER_SYMBOL: C H A R A C T E R -> type(CHAR_SYMBOL);
CHARACTER_SYMBOL is a rule with a token value of its own, but it returns the token value for CHAR_SYMBOL instead (type aliasing). Hence you can easily map from a token value to either of these rule names, but not the other way around: starting from the rule name CHARACTER_SYMBOL you would arrive at a token value that the rule never actually emits.
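To make the vocabulary concrete, here is a hedged Java sketch (MyLexer stands in for any generated lexer class) that prints each token's type together with the names the vocabulary knows for it:

import org.antlr.v4.runtime.*;

public final class DumpTokens {
    public static void main(String[] args) {
        // MyLexer is a placeholder for your generated lexer.
        MyLexer lexer = new MyLexer(CharStreams.fromString("foo = 42"));
        Vocabulary vocabulary = lexer.getVocabulary();
        for (Token token : lexer.getAllTokens()) {
            System.out.printf("%-8s type=%d symbolic=%s literal=%s%n",
                token.getText(), token.getType(),
                vocabulary.getSymbolicName(token.getType()),
                vocabulary.getLiteralName(token.getType()));
        }
    }
}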

Semantic lexer predicate performance

I have a lexer that creates MACRO tokens for a dynamic list of macro strings passed to the lexer. I used a semantic predicate in the topmost lexer rule to implement this feature:
MACRO: { macros != null && tryMacro() }? .;
Where tryMacro() just checks if any macro string matches the input sequence.
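For reference, a minimal sketch of what such a helper might look like in the lexer's @members section (hypothetical; the original implementation is not shown). It peeks ahead in the character stream without consuming anything:

// 'macros' is assumed to be a java.util.List<String> field on the lexer;
// _input is the lexer's CharStream, where LA(1) is the next character.
boolean tryMacro() {
    for (String macro : macros) {
        boolean matches = true;
        for (int i = 0; i < macro.length(); i++) {
            if (_input.LA(i + 1) != macro.charAt(i)) {
                matches = false;
                break;
            }
        }
        if (matches) {
            return true;
        }
    }
    return false;
}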
The performance of this approach was very bad and after some research I tried changing the lexer rule to the following:
MACRO: . { macros != null && tryMacro() }?;
This improved the performance considerably, but I don't really understand why. :) Since the '.' matches any character, the semantic predicate should be invoked exactly as many times as before, shouldn't it? Can someone provide an explanation for this behavior?
The reason is pretty simple: if you put the predicate at the start, the lexer will evaluate it to decide if the MACRO rule should apply. If you put it at the end, it will only perform the check when it has a potential match for the MACRO rule.
Since MACRO is very generic, I suppose you put it at the end of your rules; due to the priority rules it will surely be tried last. It can match only single-character tokens, so more specific rules will take priority.
If the MACRO rule is superseded by a higher-priority rule, it won't be considered and your predicate won't be invoked.
I debugged this a bit further and it turned out that reordering the rule changed the behavior of the lexer, causing macros to no longer be accepted during parsing. The perceived performance increase came about because the semantic predicate was only evaluated a couple of times before the lexer dropped the rule while doing its predictions. So the changed rule was actually invalid, not a performance improvement.
I finally solved the performance issue by moving the macro handling to the parser.

Important algorithm involving random access to a string?

I am implementing a different string representation where accessing a string in a non-sequential manner is very costly. To mitigate this, I am trying to implement position caches or character blocks so one can jump to certain locations and scan from there.
In order to do so, I need a list of algorithms that require scanning a string from right to left or random access to its characters, so that I have a set of test cases for actual benchmarking and can create a model to find a local/global optimum for my efforts.
Basically I know of:
String.charAt
String.lastIndexOf
String.endsWith
One scenario where one needs right to left access of strings is extracting the file extension and the file name (item) of paths.
For random access I find no algorithm at all, unless one has prefix tables and accesses the string at all those positions to check for matches longer than the prefix strings.
Does anyone know of other algorithms where either right-to-left or random access to string characters is required?
[Update]
The hash code of a String is calculated using every character, accessed from left to right, while the running value is stored in a local primitive variable. So this is not a candidate for random access.
The MD5 and CRC algorithms likewise process the complete string. So I find no random-access examples here either.
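For the record, this is the classic recurrence Java uses for String.hashCode; every character is consumed strictly left to right, with only a running primitive accumulator:

static int stringHash(CharSequence s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i); // sequential, one character at a time
    }
    return h;
}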
One interesting algorithm is Boyer-Moore searching, which involves both skipping forward by a variable number of characters and comparing backwards. If those two operations are not O(1), then KMP searching becomes more attractive, but BM searching is much faster for long search patterns (except in rare cases where the search pattern contains lots of repetitions of its own prefix). For example, BM shines for patterns which must be matched at word-boundaries.
BM can be implemented for certain variable-length encodings. In particular, it works fine with UTF-8 because misaligned false positives are impossible. With a larger class of variable-length encodings, you might still be able to implement a variant of BM which allows forward skips.
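To illustrate the access pattern, here is a compact Boyer-Moore-Horspool search (a simplified BM variant) in Java. Note the two operations discussed above: the pattern is compared right to left, and on a mismatch the text position skips forward by a variable, precomputed distance:

static int horspoolSearch(String text, String pattern) {
    int n = text.length(), m = pattern.length();
    if (m == 0) return 0;
    // Skip table: how far to shift when the character aligned with the
    // last pattern position causes a mismatch.
    int[] skip = new int[Character.MAX_VALUE + 1];
    java.util.Arrays.fill(skip, m);
    for (int i = 0; i < m - 1; i++) {
        skip[pattern.charAt(i)] = m - 1 - i;
    }
    int pos = 0;
    while (pos <= n - m) {
        int j = m - 1;
        while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
            j--; // compare backwards, right to left
        }
        if (j < 0) {
            return pos; // full match found
        }
        pos += skip[text.charAt(pos + m - 1)]; // variable forward skip
    }
    return -1;
}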
There are a number of algorithms which require the ability to reset the string pointer to a previously encountered point; one example is word-wrapping an input to a specific line length. Those won't be impeded by your encoding provided your API allows for saving a copy of an iterator.

ANTLR4 Parser Rule granularity

Say I have a grammar that has a parser rule to match one of several specific strings.
Is the proper thing to do in the grammar to make an alternate parser rule for each specific string, or to keep the parser rule general and decode the string in a visitor subclass?
If the specific strings are meaningful (e.g. a keyword in a DSL) it sounds like you want Tokens. Whatever rules you have in the grammar can reference the Tokens you created.
Generally, it's better to have your grammar do as much of the parser work as possible, rather than overly generalizing and having to write a bunch of extra code.
See the following: http://www.antlr.org/wiki/display/ANTLR4/Grammar+Structure
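As a hedged sketch of the token-based approach (a made-up DSL; all names are illustrative), each meaningful string becomes its own token and the parser rule simply lists the alternatives:

BOLD   : 'bold';
ITALIC : 'italic';

style : BOLD | ITALIC ;

With this shape a visitor can branch on which alternative matched, instead of matching a generic string token and decoding it by hand.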
