Finding the start of an expression when the end of the previous one is difficult to express

Finding the start of an expression when the end of the previous one is difficult to express - antlr4

I've got a file format that looks a little like this:
blockA {
uniqueName42 -> uniqueName aWord1 anotherWord "Some text"
anotherUniqueName -> uniqueName23 aWord2
blockB {
thing -> anotherThing
}
}
Lots more blocks with arbitrary nesting levels.
The lines with the arrow in them define relationships between two things. Each relationship has some optional metadata (multi-word quoted or single word unquoted).
The challenge I'm having is that because the there can be an arbitrary number of metadata items in a relationship my parser is treating anotherUniqueName as a metadata item from the first relationship rather than the start of the second relationship.
You can see this in the image below. The parser is only recognising one relationshipDeclaration when a second should start with StringLiteral: anotherUniqueName
The parser looks a bit like this:
block
: BLOCK LBRACE relationshipDeclaration* RBRACE
;
relationshipDeclaration
: StringLiteral? ARROW StringLiteral StringLiteral*
;
I'm hoping to avoid lexical modes because the fact that these relationships can appear almost anywhere in the file will leave me up to my eyes in NL+ :-(
Would appreciate any ideas on what options I have. Is there a way to look ahead, spot the '->', for example?
Thanks a million.

Your example certainly looks like the NL is what signals the end of a relationshipDeclaration.
If that's the case, then you'll need NLs to be tokens available to your parse rules so the parser can know recognize the end.
As you've alluded to, you could potentially use -> to trigger a different Lexer Mode and generate different tokens for content between the -> and the NL and then use those tokens in your parse rule for relationshipDeclaration.
If it's as simple as your snippet indicates, then just capturing RD_StringLiteral tokens in that lexical mode, would probably be easier to deal with than handling all the places you might need to allow for NL. This would be pretty simple as Lexer modes go.
(BTW you can use x+ to get the same effect as x x*)
relationshipDeclaration
: StringLiteral? ARROW RD_StringLiteral+
;
I don't think there's a third option for dealing with this.

Related

Arbitrary lookaheads in PLY

I am trying to parse a config, which would translate to a structured form. This new form requires that comments within the original config be preserved. The parsing tool is PLY. I am running into an issue with my current approach which I will describe in detail below, with links to code as well. The config file is going to look contain multiple config blocks, each of which is going to be of the following format
<optional comments>
start_of_line request_stmts(one or more)
indent reply_stmts (zero or more)
include_stmts (type 3)(zero or more)
An example config file looks like this.
While I am able to partially parse the config file with the grammar below, I fail to accomodate comments which would exist within the block.
For example, a block like this raises syntax errors, and any comments in a block of config fail to parse.
<optional comments>
start_of_line request_stmts(type 1)(one or more)
indent reply_stmts (type 2)(one or more)
<comments>
include_stmts (type 3)(one or more)(optional)
The parser.out mentions one shift/reduce conflict which I think arises because once the reply_stmts are parsed, a comments section which follows could mark start of a new block or comments within the subblock. Current grammar parsing result for the example file
[['# test comment ', '# more of this', '# does this make sense'], 'DEFAULT', [['x', '=',
'y']], [['y', '=', '1']], ['# Transmode', '# maybe something else', '# comment'],
'/random/location/test.user']
As you might notice, the second config block complete misses the username, request_stmt, reply_stmt sections.
What I have tried
I have tried moving the comments section around in the grammar, by specifying it before specific blocks or in the statement grammar. In the code link pasted above, the comments section has been specified in the overall statement grammar. Both of these approaches fail to parse comments within a config block.
username : comments username
| username
include_stmt : comments includes
| includes
I have two main questions:
Is there a mistake I am making in the implementation/understanding of LR parsing, solving which I could achieve what I want to ?
Is there a better way to achieve the same goal than my current approach ? (PLY-fu, different parser, different grammar)
P.S Wasn't able to include the actual code in the question, mentioned in the comments

You are correct that the problem is that when the parser sees a comment, it cannot know whether the comment belongs to the same section or whether the previous section is finished. In the former case, the parser needs to shift the comment, while in the latter case it needs to reduce the configuration section.
Since there could be any number of comments, the necessary lookahead could be arbitrarily large, in which case LR parsing wouldn't be possible. But a simple trick can reduce the lookahead to two tokens: just combine consecutive comments into a single token.
Any LR(k) grammar has an equivalent LR(1) grammar. In effect, the LR(1) grammars works by delaying all decisions for k-1 tokens, accumulating these tokens into the parser state. That's a massive increase in grammar size, but it's usually possible to achieve the same effect in other ways, and that's certainly the case here.
The basic idea is that any comment is (temporarily) accumulated into a list of comments. When a non-comment token is encountered, this temporary list is attached to that token.
This can be done either in the lexical scanner or in the parser actions, depending on your inclinations.
Before attempting all that, you should make sure that retaining comments is really useful to your application. Comments are normally not relevant to the semantics of a program (or configuration file), and it would certainly be much simpler for the lexer to just drop comments into the bit-bucket. If your application will end up reformatting the input, then it will have to retain comments. But if it only needs to extract information from the configuration, putting a lot of effort into handling comments is hard to justify.

ANTLR get first production

I'm using ANTLR4 and, in particular, the C grammar available in their repo (grammar). It seems that the grammar hasn't an initial rule, so I was wondering how it's possible to get it. In fact, once initialized the parser, I attach my listener, but I obtain syntax errors since I'm trying to parse two files with different code instructions:
int a;
int foo() { return 0; }
In my example I call the parser with "parser.primaryExpression();" which is the first production of the "g4" file. Is it possible to avoid to call the first production and get it automatically by ANTLR instead?

In addition to #GRosenberg's answer:
Also the rule enum (in the generated parser) contains entries for each rule in the order they appear in the grammar and the first rule has the value 0. However, just because it's the first rule in the grammar doesn't mean that it is the main entry point. Only the grammar author knows what the real entry is and sometimes you might even want to parse only with a subrule, which makes this decision even harder.

ANTLR provides no API to obtain the first rule. However, in the parser as generated, the field
public static final String[] ruleNames = ....;
lists the rulenames in the order of occurrence in the grammar. With reflection, you can access the method.
Beware. Nothing in the Antlr 'spec' defines this ordering. Simply has been true to date.

Ternary operator should not be used on a single line in Node.js. Why?

Consider the following sample codes:
1.Sample
var IsAdminUser = (User.Privileges == AdminPrivileges)
? 'yes'
: 'no';
console.log(IsAdminUser);
2.Sample
var IsAdminUser = (User.Privileges == AdminPrivileges)?'yes': 'no';
console.log(IsAdminUser);
The 2nd sample I am very comfortable with & I code in that style, but it was told that its wrong way of doing without any supportive reasons.
Why is it recommended not to use a single line ternary operator in Node.js?
Can anyone put some light on the reason why it is so?
Advance Thanks for great help.

With all coding standards, they are generally for readability and maintainability. My guess is the author finds it more readable on separate lines. The compiler / interpreter for your language will handle it all the same. As long as you / your project have a set standard and stick to it, you'll be fine. I recommend that the standards be worked on or at least reviewed by everyone on the project before casting them in stone. I think that if you're breaking it up on separate lines like that, you may as well define an if/else conditional block and use that.
Be wary of coding standards rules that do not have a justification.
Personally, I do not like the ternary operator as it feels unnatural to me and I always have to read the line a few times to understand what it's doing. I find separate if/else blocks easier for me to read. Personal preference of course.

It is in fact wrong to put the ? on a new line; even though it doesn’t hurt in practice.
The reason is a JS feature called “Automatic Semicolon Insertion”. When a var statement ends with a newline (without a trailing comma, which would indicate that more declarations are to follow), your JS interpreter should automatically insert a semicolon.
This semicolon would have the effect that IsAdminUser is assigned a boolean value (namely the result of User.Privileges == AdminPrivileges). After that, a new (invalid) expression would start with the question mark of what you think is a ternary operator.
As mentioned, most JS interpreters are smart enough to recognize that you have a newline where you shouldn’t have one, and implicitely fix your ternary operator. And, when minifying your script, the newline is removed anyway.
So, no problem in practice, but you’re relying on an implicit fix of common JS engines. It’s better to write the ternary operator like this:
var foo = bar ? "yes" : "no";
Or, for larger expressions:
var foo = bar ?
"The operation was successful" : "The operation has failed.";
Or even:
var foo = bar ?
"Congratulations, the operation was a total success!" :
"Oh, no! The operation has horribly failed!";

I completely disagree with the person who made this recommendation. The ternary operator is a standard feature of all 'C' style languages (C,C++,Java,C#,Javascript etc.), and most developers who code in these languages are completely comfortable with the single line version.
The first version just looks weird to me. If I was maintaining code and saw this, I would correct it back to a single line.
If you want verbose, use if-else. If you want neat and compact use a ternary.
My guess is the person who made this recommendation simply wasn't very familiar with the operator, so found it confusing.

Because it's easier on the eye and easier to read. It's much easier to see what your first snippet is doing at a glance - I don't even have to read to the end of a line. I can simply look at one spot and immediately know what values IsAdminUser will have for what conditions. Much the same reason as why you wouldn't write an entire if/else block on one line.
Remember that these are style conventions and are not necessarily backed up by objective (or technical) reasoning.

The reason for having ? and : on separate lines is so that it's easier to figure out what changed if your source control has a line-by-line comparison.
If you've just changed the stuff between the ? and : and everything is on a single line, the entire line can be marked as changed (based on your comparison tool).

Parser skips lines

I want to write a simple parser for a subset of Jade, generating some XmlHtml for further processing.
The parser is quite simple, but as often with Parsec, a bit long. Since I don't know if I am allowed to make such long code posts, I have the full working example here.
I've dabbled with Parsec before, but rarely successfully. Right now, I don't quite understand why it seems to swallow following lines. For example, the jade input of
.foo.bar
| Foo
| Bar
| Baz
tested with parseTest tag txt, returns this:
Element {elementTag = "div", elementAttrs = [("class","foo bar")], elementChildren = [TextNode "Foo"]}
My parser seems to be able to deal with any kind of nesting, but never more than one line. What did I miss?

If Parsec cannot match the remaining input, it will stop parsing at that point and simply ignore that input. Here, the problem is that after having parsed a tag, you don't consume the whitespace in the beginning of the line before the next tag, so Parsec cannot parse the remaining input and bails. (There might also be other issues, I can't test the code right now)
There are many ways of adding something that consumes the spaces, but I am not familiar with Jade so I cannot tell you which way is the "correct" way (I don't know how the indentation syntax works) but just adding whiteSpace somewhere at the end of tag should do it.
By the way, you should consider splitting up your parser into a Lexer and Parser. The Lexer produces a token stream like [Ident "bind", OpenParen, Ident "tag", Equals, StringLiteral "longname", ..., Indentation 1, ...] and the parser parses that token stream (Yes, Parsec can parse lists of anything). I think that it would make your job easier/less confusing.

lexer/parser ambiguity

How does a lexer solve this ambiguity?
/*/*/
How is it that it doesn't just say, oh yeah, that's the begining of a multi-line comment, followed by another multi-line comment.
Wouldn't a greedy lexer just return the following tokens?
/*
/*
/
I'm in the midst of writing a shift-reduce parser for CSS and yet this simple comment thing is in my way. You can read this question if you wan't some more background information.
UPDATE
Sorry for leaving this out in the first place. I'm planning to add extensions to the CSS language in this form /* # func ( args, ... ) */ but I don't want to confuse an editor which understands CSS but not this extension comment of mine. That's why the lexer just can't ignore comments.

One way to do it is for the lexer to enter a different internal state on encountering the first /*. For example, flex calls these "start conditions" (matching C-style comments is one of the examples on that page).

The simplest way would probably be to lex the comment as one single token - that is, don't emit a "START COMMENT" token, but instead continue reading in input until you can emit a "COMMENT BLOCK" token that includes the entire /*(anything)*/ bit.
Since comments are not relevant to the actual parsing of executable code, it's fine for them to basically be stripped out by the lexer (or at least, clumped into a single token). You don't care about token matches within a comment.

In most languages, this is not ambiguous: the first slash and asterix are consumed to produce the "start of multi-line comment" token. It is followed by a slash which is plain "content" within the comment and finally the last two characters are the "end of multi-line comment" token.
Since the first 2 characters are consumed, the first asterix cannot also be used to produce an end of comment token. I just noted that it could produce a second "start of comment" token... oops, that could be a problem, depending on the amount of context is available for the parser.
I speak here of tokens, assuming a parser-level handling of the comments. But the same applies to a lexer, whereby the underlying rule is to start with '/*' and then not stop till '*/' is found. Effectively, a lexer-level handling of the whole comment wouldn't be confused by the second "start of comment".

Since CSS does not support nested comments, your example would typically parse into a single token, COMMENT.
That is, the lexer would see /* as a start-comment marker and then consume everything up to and including a */ sequence.

Use the regexp's algorithm, search from the beginning of the string working way back to the current location.
if (chars[currentLocation] == '/' and chars[currentLocation - 1] == '*') {
for (int i = currentLocation - 2; i >= 0; i --) {
if (chars[i] == '/' && chars[i + 1] == '*') {
// .......
}
}
}
It's like applying the regexp /\*([^\*]|\*[^\/])\*/ greedy and bottom-up.

One way to solve this would be to have your lexer return:
/
*
/
*
/
And have your parser deal with it from there. That's what I'd probably do for most programming languages, as the /'s and *'s can also be used for multiplication and other such things, which are all too complicated for the lexer to worry about. The lexer should really just be returning elementary symbols.
If what the token is starts to depend too much on context, what you're looking for may very well be a simpler token.
That being said, CSS is not a programming language so /'s and *'s can't be overloaded. Really afaik they can't be used for anything else other than comments. So I'd be very tempted to just pass the whole thing as a comment token unless you have a good reason not to: /\*.*\*/

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string