I'm using ANTLR4 and, in particular, the C grammar available in their grammars repo. The grammar doesn't seem to designate an initial rule, so I was wondering how it's possible to determine it. Once I've initialized the parser I attach my listener, but I get syntax errors because I'm trying to parse two files containing different code constructs:
int a;
int foo() { return 0; }
In my example I call the parser with parser.primaryExpression(), which is the first production in the .g4 file. Is it possible to avoid calling the first production explicitly and have ANTLR pick the entry rule automatically instead?
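For reference, my setup looks roughly like this (CLexer/CParser being the classes ANTLR generates from C.g4):

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class Main {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName(args[0]);
        CParser parser = new CParser(new CommonTokenStream(new CLexer(input)));
        ParseTree tree = parser.primaryExpression(); // first production in C.g4
        ParseTreeWalker.DEFAULT.walk(new CBaseListener(), tree); // my listener goes here
    }
}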
In addition to @GRosenberg's answer:
Also, the rule enum in the generated parser contains an entry for each rule, in the order the rules appear in the grammar, and the first rule has the value 0. However, being first in the grammar doesn't make a rule the main entry point. Only the grammar author knows what the real entry point is, and sometimes you might even want to parse only a subrule, which makes this decision even harder.
ANTLR provides no API to obtain the first rule. However, in the parser as generated, the field
public static final String[] ruleNames = ....;
lists the rule names in their order of occurrence in the grammar. With reflection, you can look up the first name and invoke the corresponding parser method.
Beware: nothing in the ANTLR 'spec' defines this ordering; it has simply been true to date.
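With that caveat, a rough sketch of the reflection approach (using the generated CParser as an example):

import java.lang.reflect.Method;
import org.antlr.v4.runtime.ParserRuleContext;

public class EntryRuleInvoker {
    /** Invokes the first rule listed in ruleNames via reflection. */
    public static ParserRuleContext parseWithFirstRule(CParser parser) throws Exception {
        String firstRule = CParser.ruleNames[0];                // first rule in grammar order
        Method entry = parser.getClass().getMethod(firstRule);  // generated no-arg rule method
        return (ParserRuleContext) entry.invoke(parser);
    }
}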
Related
I've got a file format that looks a little like this:
blockA {
    uniqueName42 -> uniqueName aWord1 anotherWord "Some text"
    anotherUniqueName -> uniqueName23 aWord2
    blockB {
        thing -> anotherThing
    }
}
Lots more blocks with arbitrary nesting levels.
The lines with the arrow in them define relationships between two things. Each relationship has some optional metadata (multi-word quoted or single word unquoted).
The challenge I'm having is that, because there can be an arbitrary number of metadata items in a relationship, my parser treats anotherUniqueName as a metadata item of the first relationship rather than as the start of the second relationship.
In the resulting parse tree, the parser recognises only one relationshipDeclaration where a second one should begin at the StringLiteral anotherUniqueName.
The parser looks a bit like this:
block
: BLOCK LBRACE relationshipDeclaration* RBRACE
;
relationshipDeclaration
: StringLiteral? ARROW StringLiteral StringLiteral*
;
I'm hoping to avoid lexical modes because the fact that these relationships can appear almost anywhere in the file will leave me up to my eyes in NL+ :-(
I'd appreciate any ideas on what options I have. Is there a way to look ahead and spot the '->', for example?
Thanks a million.
Your example certainly looks like the NL is what signals the end of a relationshipDeclaration.
If that's the case, then you'll need NLs to be tokens available to your parser rules so the parser can recognize where each relationshipDeclaration ends.
As you've alluded to, you could potentially use -> to trigger a different lexer mode and generate different tokens for the content between the -> and the NL, and then use those tokens in your parser rule for relationshipDeclaration.
If it's as simple as your snippet indicates, then just capturing RD_StringLiteral tokens in that lexical mode would probably be easier to deal with than handling all the places where you might need to allow for NL. This would be pretty simple as lexer modes go.
(BTW you can use x+ to get the same effect as x x*)
relationshipDeclaration
: StringLiteral? ARROW RD_StringLiteral+
;
I don't think there's a third option for dealing with this.
I'm trying to implement the TPTP grammar in PEG. It contains a rule for an empty sequence, which is used in many other rules, and PEG is rejecting it. A Google search finds https://github.com/pegjs/pegjs/commit/df154daafb9c6c952351493af02d3a55e0b05c59#commitcomment-10667420, which seems to say that PEG by design does not allow empty-sequence rules. That would make it unsuitable for implementing grammars such as TPTP that contain them. Do I understand this correctly, or am I missing something?
I believe it is still possible to do so, as explained in the posted link; instead of matching nothing, you can match "" and then return whatever you want to return:
Empty
= "" {return null;}
I'm new to ANTLR4 and I can't seem to figure out how to get lexer actions to perform properly.
I have a lexer rule with a predicate that checks the matched text:
SIZE10 : [a-zA-Z]* {getText().length() <= 10}? ;
I would expect it not to match any run of letters longer than 10 characters; however, what it actually does is split a longer string into two separate tokens instead of rejecting the whole run. How can I get this predicate to reject the whole run of letters?
In addition, where can I go to see all the different token functions I can use (other than getText())? The documentation about lexer actions is really poor. In general, I'm having a hard time figuring out what resources can give me a definitive list of everything in the language. Even an entry point into the source code for me to read would be good at this point. The documentation is too general/basic for me.
EDIT: I've figured out how to throw a RuntimeException, but I don't know where to get the elements needed for a proper RecognitionException.
A predicate in a rule directs the parsing process: it lets a rule match only partial input (as in your case), or it can essentially switch off part of the grammar depending on certain conditions. In your case the SIZE10 rule is matched until the predicate returns false. Everything up to that event is then returned as a match for SIZE10. Lexing then continues at the point where the previous token ended, and if the next input is again a letter, it will again match SIZE10 for as long as the predicate holds. That's a bit different from what you expected (using the predicate as an all-or-nothing switch).
However, if you instead want to match the full run of letters first and then check whether its length is <= 10, you can do this in a listener: hook into the exit event of the parser rule that contains the SIZE10 token and reject the match by throwing an exception there.
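A sketch of that approach, assuming the predicate is dropped (so SIZE10 matches any full run of letters) and a hypothetical parser rule word : SIZE10 ; wraps the token. I'm throwing ParseCancellationException here because, unlike RecognitionException, it doesn't need a recognizer:

import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.misc.ParseCancellationException;

public class LengthListener extends MyGrammarBaseListener { // generated base listener (name assumed)
    @Override
    public void exitWord(MyGrammarParser.WordContext ctx) {
        Token tok = ctx.SIZE10().getSymbol();
        if (tok.getText().length() > 10) {
            // reject the entire run instead of letting the lexer split it
            throw new ParseCancellationException(
                    "line " + tok.getLine() + ": '" + tok.getText() + "' is longer than 10 characters");
        }
    }
}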
For the functions usable in your actions, see the ANTLR API documentation; the Token interface, for instance, shows you other possibilities besides getText(). In your action, consider the context you have: in a lexer rule you deal with a Token, hence getText() etc. work on the token. In a parser rule you have a ParserRuleContext instead, which also has a getText() function, but that one works differently (concatenating the text of all child contexts).
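For illustration, a small driver that dumps token information (the lexer class name is just a placeholder):

import org.antlr.v4.runtime.*;

public class TokenDump {
    public static void main(String[] args) {
        // MyGrammarLexer is the generated lexer (name assumed).
        MyGrammarLexer lexer = new MyGrammarLexer(CharStreams.fromString("abc defghijklmnop"));
        for (Token tok : lexer.getAllTokens()) {
            System.out.printf("'%s' line %d, col %d, type %d%n",
                    tok.getText(), tok.getLine(),
                    tok.getCharPositionInLine(), tok.getType());
        }
    }
}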
How does one convert an ASTNode (or at least a CompilationUnit) into a valid piece of source code?
The documentation says that one shouldn't use toString, but doesn't mention any alternatives:
Returns a string representation of this node suitable for debugging purposes only.
CompilationUnits have rewrite, but that one does not work for ASTs created by hand.
Formatting options would be nice to have, but I'd basically be satisfied with anything that turns arbitrary ASTNodes into semantically equivalent source code.
In JDT the normal way for AST manipulation is to start with a basic CompilationUnit and then use a rewriter to add content. Then ASTRewriteAnalyzer / ASTRewriteFormatter should take care of creating formatted source code. Creating a CU just containing a stub type declaration shouldn't be hard, so that's one option.
If that doesn't suit your needs, you may want to experiment with directly calling the internal org.eclipse.jdt.internal.core.dom.rewrite.ASTRewriteFlattener.asString(ASTNode, RewriteEventStore). If you are not editing existing files, you can probably ignore the events collected in the RewriteEventStore and just use the returned String.
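A rough sketch of that second approach on a hand-built node (internal API, so behaviour may change between JDT releases; the AST level is just an example):

import org.eclipse.jdt.core.dom.AST;
import org.eclipse.jdt.core.dom.TypeDeclaration;
import org.eclipse.jdt.internal.core.dom.rewrite.ASTRewriteFlattener;
import org.eclipse.jdt.internal.core.dom.rewrite.RewriteEventStore;

public class FlattenExample {
    public static void main(String[] args) {
        AST ast = AST.newAST(AST.JLS8); // pick the AST level matching your setup
        TypeDeclaration type = ast.newTypeDeclaration();
        type.setName(ast.newSimpleName("Example"));
        // No existing file is being edited, so an empty event store should do.
        String source = ASTRewriteFlattener.asString(type, new RewriteEventStore());
        System.out.println(source);
    }
}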
I'm trying to use ANTLR4 with the IDL.g4 grammar to implement some checks that our IDL files shall follow. One rule is about names. The rules are like:
ID contains only letters, digits, and single underscores,
ID begins with a letter,
ID ends with a letter or digit,
ID is not a reserved word in Ada, C, C++, Java, or IDL.
One way to do this check is to write a function that checks a string for these properties and call it in the exit listener of every rule that has an ID, e.g. (referring to IDL.g4) in exitConst_decl(), exitInit_decl(), exitSimple_declarator(), and many more places. Maybe that is the correct way to do it. But I was thinking about putting the check directly on the lexical element ID, and I don't know how to do that, or whether it is possible at all.
Validating this type of constraint in the lexer would make it significantly more difficult to provide usable error messages for invalid identifiers. However, you can create a new parser rule, identifier, and replace all references to ID in the various parser rules with identifier instead.
identifier
: ID
;
You can then place your identifier validation logic inside the single listener method enterIdentifier() instead of in every rule that currently references ID.
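A minimal sketch of such a listener, assuming the generated classes are IDLParser and IDLBaseListener, with the reserved-word set abbreviated and case handling simplified:

import java.util.Set;

public class IdentifierChecker extends IDLBaseListener {
    // Abbreviated; fill in the full reserved-word lists for Ada, C, C++, Java and IDL.
    private static final Set<String> RESERVED = Set.of("class", "int", "interface", "begin");

    @Override
    public void enterIdentifier(IDLParser.IdentifierContext ctx) {
        String name = ctx.ID().getText();
        // letter first, letter or digit last, single underscores only
        if (!name.matches("[A-Za-z](_?[A-Za-z0-9])*")
                || RESERVED.contains(name.toLowerCase())) {
            System.err.printf("line %d: invalid identifier '%s'%n",
                    ctx.getStart().getLine(), name);
        }
    }
}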