Forbid a token from parsing extra whitespace - Pharo

The idea is like this: in Java, for octalIntegerLiteral, I have the rule
octalNumeral, (integerTypeSuffix optional)
But I want to get the numbers as tokens, so I used:
octalNumeral javaToken, (integerTypeSuffix optional)
The problem is that it then starts to parse strings like 0777 L (with whitespace between the number and the suffix). Can this be easily solved, or should I just deal with it in a subclass?

It seems that #javaToken does two things at once: trimming whitespace and creating tokens. Split this behavior into two separate parsers that you can use individually. To keep old code working, you can then redefine #javaToken from the new basic parsers as:
PPParser>>#javaToken
^ self javaPlainToken javaWhitespace
Then your number parser would look like this:
(octalNumeral javaPlainToken , integerTypeSuffix optional) javaWhitespace

Related

Finding the start of an expression when the end of the previous one is difficult to express

I've got a file format that looks a little like this:
blockA {
    uniqueName42 -> uniqueName aWord1 anotherWord "Some text"
    anotherUniqueName -> uniqueName23 aWord2
    blockB {
        thing -> anotherThing
    }
}
Lots more blocks with arbitrary nesting levels.
The lines with the arrow in them define relationships between two things. Each relationship has some optional metadata (multi-word quoted or single word unquoted).
The challenge I'm having is that, because there can be an arbitrary number of metadata items in a relationship, my parser treats anotherUniqueName as a metadata item of the first relationship rather than as the start of the second relationship.
You can see this in the parse tree: the parser only recognises one relationshipDeclaration when a second one should start with the StringLiteral anotherUniqueName.
The parser looks a bit like this:
block
    : BLOCK LBRACE relationshipDeclaration* RBRACE
    ;
relationshipDeclaration
    : StringLiteral? ARROW StringLiteral StringLiteral*
    ;
I'm hoping to avoid lexical modes because the fact that these relationships can appear almost anywhere in the file will leave me up to my eyes in NL+ :-(
Would appreciate any ideas on what options I have. Is there a way to look ahead, spot the '->', for example?
Thanks a million.
Your example certainly looks like the NL is what signals the end of a relationshipDeclaration.
If that's the case, then you'll need NLs to be tokens available to your parser rules so the parser can recognize the end.
As you've alluded to, you could potentially use -> to trigger a different Lexer Mode and generate different tokens for content between the -> and the NL and then use those tokens in your parse rule for relationshipDeclaration.
If it's as simple as your snippet indicates, then just capturing RD_StringLiteral tokens in that lexical mode would probably be easier to deal with than handling all the places where you might need to allow for NL. This would be pretty simple, as Lexer modes go.
(BTW you can use x+ to get the same effect as x x*)
relationshipDeclaration
    : StringLiteral? ARROW RD_StringLiteral+
    ;
I don't think there's a third option for dealing with this.

Arbitrary lookaheads in PLY

I am trying to parse a config, which would translate to a structured form. This new form requires that comments within the original config be preserved. The parsing tool is PLY. I am running into an issue with my current approach, which I will describe in detail below, with links to code as well. The config file is going to contain multiple config blocks, each of which is of the following format:
<optional comments>
start_of_line request_stmts(one or more)
indent reply_stmts (zero or more)
include_stmts (type 3)(zero or more)
An example config file looks like this.
While I am able to partially parse the config file with the grammar below, I fail to accommodate comments that exist within a block.
For example, a block like this raises syntax errors, and any comments in a block of config fail to parse.
<optional comments>
start_of_line request_stmts(type 1)(one or more)
indent reply_stmts (type 2)(one or more)
<comments>
include_stmts (type 3)(one or more)(optional)
The parser.out mentions one shift/reduce conflict, which I think arises because, once the reply_stmts are parsed, a comments section which follows could mark the start of a new block or could be comments within the sub-block. The current grammar's parsing result for the example file is:
[['# test comment ', '# more of this', '# does this make sense'],
 'DEFAULT', [['x', '=', 'y']], [['y', '=', '1']],
 ['# Transmode', '# maybe something else', '# comment'],
 '/random/location/test.user']
As you might notice, the second config block completely misses the username, request_stmt, and reply_stmt sections.
What I have tried
I have tried moving the comments section around in the grammar, by specifying it before specific blocks or in the statement grammar. In the code link pasted above, the comments section has been specified in the overall statement grammar. Both of these approaches fail to parse comments within a config block.
username : comments username
| username
include_stmt : comments includes
| includes
I have two main questions:
Is there a mistake I am making in my implementation or understanding of LR parsing which, once fixed, would let me achieve what I want?
Is there a better way to achieve the same goal than my current approach? (PLY-fu, a different parser, a different grammar)
P.S. I wasn't able to include the actual code in the question; it is mentioned in the comments.
You are correct that the problem is that when the parser sees a comment, it cannot know whether the comment belongs to the same section or whether the previous section is finished. In the former case, the parser needs to shift the comment, while in the latter case it needs to reduce the configuration section.
Since there could be any number of comments, the necessary lookahead could be arbitrarily large, in which case LR parsing wouldn't be possible. But a simple trick can reduce the lookahead to two tokens: just combine consecutive comments into a single token.
Any LR(k) grammar has an equivalent LR(1) grammar. In effect, the LR(1) grammar works by delaying all decisions for k-1 tokens, accumulating these tokens into the parser state. That's a massive increase in grammar size, but it's usually possible to achieve the same effect in other ways, and that's certainly the case here.
The basic idea is that any comment is (temporarily) accumulated into a list of comments. When a non-comment token is encountered, this temporary list is attached to that token.
This can be done either in the lexical scanner or in the parser actions, depending on your inclinations.
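For the lexer-side option, a rough sketch with PLY might look something like the following. The token names and regular expressions here are invented for illustration and are not taken from the original grammar; the point is only the buffering of comments.

import ply.lex as lex

# Illustrative token set; the real grammar's tokens would differ.
tokens = ('COMMENT', 'WORD')

t_ignore = ' \t\n'

def t_COMMENT(t):
    r'\#[^\n]*'
    return t

def t_WORD(t):
    r'[^\s\#]+'
    return t

def t_error(t):
    t.lexer.skip(1)

class CommentAttachingLexer:
    """Buffers COMMENT tokens and attaches them to the next
    non-comment token, so the parser never sees comments."""
    def __init__(self, lexer):
        self.lexer = lexer
        self.pending = []

    def input(self, data):
        self.lexer.input(data)

    def token(self):
        while True:
            tok = self.lexer.token()
            if tok is None:
                return None                      # end of input
            if tok.type == 'COMMENT':
                self.pending.append(tok.value)   # accumulate comment text
                continue
            tok.comments = self.pending          # attach buffered comments
            self.pending = []
            return tok

lexer = CommentAttachingLexer(lex.lex())

An object like this can be handed to yacc via parser.parse(data, lexer=lexer). Note that yacc actions only see token values (p[1] and so on), so in practice you would fold the pending list into tok.value, or record it somewhere the actions can reach; the sketch only shows the accumulation itself.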
Before attempting all that, you should make sure that retaining comments is really useful to your application. Comments are normally not relevant to the semantics of a program (or configuration file), and it would certainly be much simpler for the lexer to just drop comments into the bit-bucket. If your application will end up reformatting the input, then it will have to retain comments. But if it only needs to extract information from the configuration, putting a lot of effort into handling comments is hard to justify.

Python3 strip() gives an unexpected result

It's a weird problem
to_be_stripped="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120"
And two strings below:
s1="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120\\[Content_Types].xml"
s2="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120\\_rels\.rels"
When I use the command below:
s1.strip(to_be_stripped)
s2.strip(to_be_stripped)
I get these outputs:
'[Content_Types].x'
'_rels\\.'
If I use lstrip(), they will be:
'[Content_Types].xml'
'_rels\\.rels'
Which are the right outputs.
However, if we replace all occurrences of ProjectKnown with zeus_pipeline:
to_be_stripped="D:\\Users\\UserKnown\\PycharmProjects\\zeus_pipeline\\PT\\collections\\120"
And:
s2="D:\\Users\\UserKnown\\PycharmProjects\\zeus_pipeline\\PT\\collections\\120\\_rels\.rels"
s2.lstrip(to_be_stripped) will be '.rels'
If I use / instead of \\, nothing goes wrong. I am wondering why this problem happens.
strip isn't meant to remove full strings exactly. Rather, you give it a string, and every character in that string is removed from the start and end of the string being stripped.
In your case, the variable to_be_stripped contains the characters m and l, so those are stripped from the end of s1. However, it doesn't contain the character x, so the stripping stops there and no characters beyond that are removed.
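A quick way to see the character-set behaviour, assuming to_be_stripped and s1 are defined as in the question:

'm' in to_be_stripped, 'l' in to_be_stripped, 'x' in to_be_stripped
# (True, True, False): 'm' and 'l' appear somewhere in the path, 'x' does not
s1.strip(to_be_stripped)
# '[Content_Types].x' - stripping from the right stops at 'x',
# the first character that is not in to_be_stripped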
Check out this question. The accepted answer is probably more extensive than you need - I like another user's suggestion of using replace instead of strip. This would look like:
s1.replace(to_be_stripped, "")
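With the strings from the question, this leaves the leading path separator in place, so you may want to trim it as well; a small follow-up check (assuming the same variables, not part of the original answer):

s1.replace(to_be_stripped, "")               # '\\[Content_Types].xml'
s1.replace(to_be_stripped, "").lstrip("\\")  # '[Content_Types].xml'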

erb - How to substitute a string (gsub) which contains legit backslashes?

I had the following problem with erb in combination with Puppet, Hiera and templates:
Via Hiera I got the following strings as variables:
First, the variable example in an array (data['example']):
something with _VARIABLE_ in it
and the variable example_information with:
some kind of \1 and maybe also a \2
Now I wanted to substitute _VARIABLE_ in a Puppet template with the second string, which contains a legit backslash (\) in it. So I did it like this:
result=data['example'].gsub('_VARIABLE_', #example_information)
So I took example out of an array and filled the placeholder with #example_information.
The result was as follows:
something with some kind of and maybe also a in it
There were no backslashes left, as gsub interpreted them as backreferences. So how can I solve my issue and preserve my backslashes without double-escaping them in the Hiera file? I need the Hiera variable further along in the code without double-escaped backslashes.
I have now solved that specific problem as follows:
The variable example again:
something with _VARIABLE_ in it
and the variable example_information with:
some kind of \1 and maybe also a \2
Code part in the template:
# we need to parse out any backslashes
info_temp=example_information.gsub('\\', '__BACKSLASH__')
# now we substitute the variables with real data (but w/o backslashes)
result_temp=data['example'].gsub(/_VARIABLE_/, info_temp)
# now we put together the real string with backslashes again as before
result=result_temp.gsub('__BACKSLASH__', '\\')
Now the result looks as follows:
something with some kind of \1 and maybe also a \2 in it
Note
Maybe there is a better way to do this, but in my research I didn't stumble upon a better solution, so please add a comment if you know one.

Prolog : Remove extra spaces in a stream of characters

Total newb to Prolog. This one is frustrating me a bit. My 'solution' below is me trying to make Prolog procedural...
This will remove spaces or insert a space after a comma if needed, that is, until a period is encountered:
squish:-get0(C),put(C),rest(C).
rest(46):-!.
rest(32):-get(C),put(C),rest(C).
rest(44):-put(32), get(C), put(C), rest(C).
rest(Letter):-squish.
GOAL: I'm wondering how to remove any whitespace BEFORE the comma as well.
The following works, but it is so wrong on so many levels, especially the 'exit'!
squish:-
    get0(C),
    get0(D),
    iteratesquish(C,D).
iteratesquish(C,D):-
    squishing(C,D),
    get0(E),
    iteratesquish(D,E).
squishing(46,X):-put(46),write('end.'),!,exit.
squishing(32,32):-!.
squishing(32,44):-!.
squishing(32,X):-put(32),!.
squishing(44,32):-put(44),!.
squishing(44,44):-put(44), put(32),!.
squishing(44,46):-put(44), put(32),!.
squishing(44,X):-put(44), put(32),!.
squishing(X,32):-put(X),!.
squishing(X,44):-put(X),!.
squishing(X,46):-put(X),!.
squishing(X,Y):-put(X),!.
Since you are describing lists (in this case: of character codes), consider using DCG notation. For example, to let any comma be followed by a single whitespace, consider using code similar to:
squish([]) --> [].
squish([(0',),(0' )|Rest]) --> [0',], spaces, !, squish(Rest).
squish([L|Ls]) --> [L], squish(Ls).
spaces --> [0' ], spaces.
spaces --> [].
Example query:
?- phrase(squish(Ls), "a, b,c"), format("~s", [Ls]).
a, b, c
So, first focus on a clear declarative description of the relation between character sequences and the desired "clean" string. You can then use SWI-Prolog's library(pio) to read from files via these grammar rules. To remove all spaces preceding commas, you only have to add a single rule to the DCG above (to squish//1), which I leave as exercise to you. A corner case of course is if a comma is followed by another comma, in which case the requirements are contradictory :-)
