How to perform lexer actions that throw an exception? - antlr4

I'm new to ANTLR4 and I can't seem to figure out how to get lexer actions to behave properly.
I have a lexer rule that looks for input text:
SIZE10 : [a-zA-Z]* {getText().length() <= 10}? ;
I would expect it not to match any combination of letters longer than 10 characters; however, what it actually does is treat a 10+ letter string as two different tokens instead of nullifying the whole set of 10+ letters. How can I get this action to nullify the whole set of letters?
In addition, where can I go to see all the different token functions I can use (other than getText())? The documentation about lexer actions is really poor. In general, I'm having a hard time figuring out what resources can give me a definitive list of everything in the language. Even an entry point into the source code for me to read would be good at this point. The documentation is too general/basic for me.
EDIT: I've figured out how to throw a RuntimeException, but I don't know where to get the elements needed to construct a proper RecognitionException.

The predicate in a rule directs the parsing process in a way that allows you to match only partial input (as in your case) or to essentially switch off part of the grammar depending on certain conditions. In your case the SIZE10 rule is matched until the predicate returns false; everything up to that point is then returned as a match for SIZE10. After that, lexing continues where the previous token ended, and if the next character is again a letter it will again match SIZE10 for as long as the predicate allows. For example, an 11-letter input like abcdefghijk is tokenized as two SIZE10 tokens, abcdefghij and k. That's a bit different from what you expected (using the predicate as an all-or-nothing switch).
However, if you instead want to match the full set of letters first and then check whether the length is <= 10, you can do this in a listener: hook into the exitSIZE10() event and reject the match by throwing a recognition exception.
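A minimal sketch of that listener approach, assuming the grammar routes SIZE10 through a parser rule so that a listener callback exists; the grammar name Letters, the rule name size10, and the generated class names are illustrative, not from your grammar:

import org.antlr.v4.runtime.misc.ParseCancellationException;

public class Size10Checker extends LettersBaseListener {
    @Override
    public void exitSize10(LettersParser.Size10Context ctx) {
        String text = ctx.getText();
        if (text.length() > 10) {
            // Reject over-long matches; ParseCancellationException is the
            // unchecked exception ANTLR's own BailErrorStrategy uses.
            throw new ParseCancellationException(
                "'" + text + "' is longer than 10 characters");
        }
    }
}

After parsing you would walk the tree with ParseTreeWalker.DEFAULT.walk(new Size10Checker(), tree).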
For the functions usable in your actions, see the ANTLR API documentation; the Token interface, for instance, shows you the other possibilities besides getText(). In your action, consider the context you are in: in a lexer rule you deal with a Token, hence getText() etc. work on the token, while in a parser rule you have a ParserRuleContext instead, which also has a getText() function, but one that works differently (it concatenates the text of all child contexts).
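For example, inside a lexer action the position accessors inherited by the generated lexer are available directly; this hypothetical ID rule prints them alongside the matched text:

ID : [a-zA-Z]+ {
    // getText(), getLine() and getCharPositionInLine() are methods of the
    // Lexer class here; line/column reflect the lexer's current position.
    System.out.printf("matched '%s' at line %d, column %d%n",
        getText(), getLine(), getCharPositionInLine());
};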

Related

Using flex to identify variable name without repeating characters

I'm not fully sure how to word my question, so sorry for the rough title.
I am trying to create a pattern that can identify variable names with the following constraints:
Must begin with a letter
First letter may be followed by any combination of letters, numbers, and hyphens
First letter may be followed by nothing at all
The variable name must not be entirely X's ([xX]+ is a separate identifier in this grammar)
So for example, these would all be valid:
Avariable123
Bee-keeper
Y
E-3
But the following would not be valid:
XXXX
X
3variable
5
I am able to meet the first three requirements with my current identifier, but I am really struggling to change it so that it doesn't pick up variables that are entirely the letter X.
Here is what I have so far: [a-z][a-z0-9\-]* {return (NAME);}
Can anyone suggest a way of editing this to avoid variables that are made up of just the letter X?
The easiest way to handle that sort of requirement is to have one pattern which matches the exceptional string, and another pattern afterwards in the file which matches all such strings:
[xX]+ { /* matches all-x tokens */ }
[[:alpha:]][[:alnum:]-]* { /* handle identifiers */ }
This works because lex (and almost all lex derivatives) select the first match if two patterns match the same longest token.
Of course, you need to know what you want to do with the exceptional symbol. If you just want to accept it as some token type, there's no problem; you just do that. If, on the other hand, the intention was to break it into subtokens, perhaps individual letters, then you'll have to use yyless(), and you might want to switch to a new lexing state in order to avoid repeatedly matching the same long sequence of Xs. But maybe that doesn't matter in your case.
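A rough sketch of that yyless()/start-condition approach; the XRUN state and the token names are made up for this example:

%x XRUN
%%
[xX]+                     { yyless(1); BEGIN(XRUN); return X_LETTER; /* emit first X, rescan the rest */ }
<XRUN>[xX]                { return X_LETTER; /* one X at a time, no repeated long matches */ }
<XRUN>[^xX]               { yyless(0); BEGIN(INITIAL); /* push the char back, resume normal rules */ }
[[:alpha:]][[:alnum:]-]*  { return NAME; }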
See the flex manual for more details and examples.

Arbitrary lookaheads in PLY

I am trying to parse a config, which would translate to a structured form. This new form requires that comments within the original config be preserved. The parsing tool is PLY. I am running into an issue with my current approach, which I will describe in detail below, with links to code as well. The config file is going to contain multiple config blocks, each of which is going to be of the following format:
<optional comments>
start_of_line request_stmts(one or more)
indent reply_stmts (zero or more)
include_stmts (type 3)(zero or more)
An example config file looks like this.
While I am able to partially parse the config file with the grammar below, I fail to accommodate comments which would exist within the block.
For example, a block like this raises syntax errors, and any comments in a block of config fail to parse.
<optional comments>
start_of_line request_stmts(type 1)(one or more)
indent reply_stmts (type 2)(one or more)
<comments>
include_stmts (type 3)(one or more)(optional)
The parser.out mentions one shift/reduce conflict, which I think arises because once the reply_stmts are parsed, a comments section which follows could mark either the start of a new block or comments within the subblock. Current grammar parsing result for the example file:
[['# test comment ', '# more of this', '# does this make sense'], 'DEFAULT', [['x', '=',
'y']], [['y', '=', '1']], ['# Transmode', '# maybe something else', '# comment'],
'/random/location/test.user']
As you might notice, the second config block completely misses the username, request_stmt, and reply_stmt sections.
What I have tried
I have tried moving the comments section around in the grammar, by specifying it before specific blocks or in the statement grammar. In the code link pasted above, the comments section has been specified in the overall statement grammar. Both of these approaches fail to parse comments within a config block.
username : comments username
| username
include_stmt : comments includes
| includes
I have two main questions:
Is there a mistake I am making in my implementation/understanding of LR parsing, solving which I could achieve what I want?
Is there a better way to achieve the same goal than my current approach? (PLY-fu, a different parser, a different grammar)
P.S. I wasn't able to include the actual code in the question; it is mentioned in the comments.
You are correct that the problem is that when the parser sees a comment, it cannot know whether the comment belongs to the same section or whether the previous section is finished. In the former case, the parser needs to shift the comment, while in the latter case it needs to reduce the configuration section.
Since there could be any number of comments, the necessary lookahead could be arbitrarily large, in which case LR parsing wouldn't be possible. But a simple trick can reduce the lookahead to two tokens: just combine consecutive comments into a single token.
Any LR(k) grammar has an equivalent LR(1) grammar. In effect, the LR(1) grammar works by delaying all decisions for k-1 tokens, accumulating those tokens into the parser state. That's a massive increase in grammar size, but it's usually possible to achieve the same effect in other ways, and that's certainly the case here.
The basic idea is that any comment is (temporarily) accumulated into a list of comments. When a non-comment token is encountered, this temporary list is attached to that token.
This can be done either in the lexical scanner or in the parser actions, depending on your inclinations.
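A minimal sketch of the scanner-side variant in PLY; the token set, the pending_comments attribute, and the attached comments field are illustrative, not from the posted code:

import ply.lex as lex

tokens = ('WORD',)

def t_COMMENT(t):
    r'\#[^\n]*'
    # accumulate the comment; returning nothing discards the token itself
    t.lexer.pending_comments.append(t.value)

def t_WORD(t):
    r'[^\s#]+'
    # attach everything collected since the previous real token
    t.comments = t.lexer.pending_comments
    t.lexer.pending_comments = []
    return t

t_ignore = ' \t'

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.pending_comments = []

In a yacc action the attached list is then reachable through p.slice, e.g. p.slice[1].comments for the first terminal of the production.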
Before attempting all that, you should make sure that retaining comments is really useful to your application. Comments are normally not relevant to the semantics of a program (or configuration file), and it would certainly be much simpler for the lexer to just drop comments into the bit-bucket. If your application will end up reformatting the input, then it will have to retain comments. But if it only needs to extract information from the configuration, putting a lot of effort into handling comments is hard to justify.

PEG.js and empty sequences

I'm trying to implement the TPTP grammar in PEG. It contains a rule for an empty sequence, which is used in many other rules, and PEG is rejecting this. A Google search finds https://github.com/pegjs/pegjs/commit/df154daafb9c6c952351493af02d3a55e0b05c59#commitcomment-10667420, which seems to say that PEG by design does not allow empty-sequence rules; that would make it unsuitable for implementing grammars such as TPTP which contain them. Do I understand this correctly, or am I missing something?
I believe it is still possible to do so, as explained in the posted link; instead of matching nothing, you can match "" and then return whatever you want to return:
Empty
= "" {return null;}

ANTLR get first production

I'm using ANTLR4 and, in particular, the C grammar available in their repo (grammar). It seems that the grammar doesn't have an initial rule, so I was wondering how it's possible to get one. In fact, once the parser is initialized, I attach my listener, but I get syntax errors because I'm trying to parse two files with different code instructions:
int a;
int foo() { return 0; }
In my example I call the parser with "parser.primaryExpression();", which is the first production of the "g4" file. Is it possible to avoid calling the first production explicitly and have ANTLR pick it automatically instead?
In addition to @GRosenberg's answer:
Also, the rule enum in the generated parser contains an entry for each rule, in the order they appear in the grammar, and the first rule has the value 0. However, just because it's the first rule in the grammar doesn't mean that it is the main entry point. Only the grammar author knows what the real entry point is, and sometimes you might even want to parse with only a subrule, which makes this decision even harder.
ANTLR provides no API to obtain the first rule. However, in the parser as generated, the field
public static final String[] ruleNames = ....;
lists the rule names in their order of occurrence in the grammar. With reflection, you can then invoke the parser method that corresponds to the first name.
Beware: nothing in the ANTLR 'spec' defines this ordering; it simply has been true to date.
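For instance, a sketch that reads the first entry and invokes the matching generated method reflectively; it assumes the CParser/CLexer generated from that C grammar are on the classpath:

import org.antlr.v4.runtime.*;

public class FirstRuleDemo {
    public static void main(String[] args) throws Exception {
        CLexer lexer = new CLexer(CharStreams.fromString("int a;"));
        CParser parser = new CParser(new CommonTokenStream(lexer));
        String firstRule = CParser.ruleNames[0];
        // generated rule methods are public and take no arguments
        java.lang.reflect.Method entry = CParser.class.getMethod(firstRule);
        ParserRuleContext tree = (ParserRuleContext) entry.invoke(parser);
        System.out.println(tree.toStringTree(parser));
    }
}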

Can I put one check on a lexical element instead of on a number of parser rules?

I'm trying to use antlr4 with the IDL.g4 grammar to implement some checks that our IDL files shall follow. One rule is about names. The rules are:
ID contains only letters, digits and single underscores,
ID begins with a letter,
ID ends with a letter or digit,
ID is not a reserved word in ADA, C, C++, Java, IDL.
One way to do this check is to write a function that checks a string for these properties and call it in the exit listeners of every rule that has an ID, e.g. (referring to IDL.g4) in exitConst_decl(), exitInit_decl(), exitSimple_declarator() and a lot of other places. Maybe that is the correct way to do it. But I was thinking about putting the check directly on the lexical element ID, and I don't know how to do that, or whether it is possible at all.
Validating this type of constraint in the lexer would make it significantly more difficult to provide usable error messages for invalid identifiers. However, you can create a new parser rule identifier and replace all references to ID in the various parser rules with identifier instead.
identifier
: ID
;
You can then place your identifier validation logic inside the single method enterIdentifier instead of in all of the various rules that currently reference ID.
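A sketch of that single hook, assuming ANTLR generates IDLParser and IDLBaseListener from IDL.g4 with the identifier rule above; the reserved-word set here is abbreviated:

import java.util.Set;
import java.util.regex.Pattern;

public class IdentifierChecker extends IDLBaseListener {
    // only letters, digits and single underscores; starts with a letter,
    // ends with a letter or digit
    private static final Pattern VALID =
        Pattern.compile("[A-Za-z](?:[A-Za-z0-9]|_[A-Za-z0-9])*");
    // abbreviated stand-in for the full ADA/C/C++/Java/IDL reserved-word list
    private static final Set<String> RESERVED =
        Set.of("begin", "class", "int", "interface", "return");

    @Override
    public void enterIdentifier(IDLParser.IdentifierContext ctx) {
        String name = ctx.ID().getText();
        if (!VALID.matcher(name).matches() || RESERVED.contains(name.toLowerCase())) {
            System.err.printf("line %d: invalid identifier '%s'%n",
                ctx.getStart().getLine(), name);
        }
    }
}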
