Algorithm for string replacement based on conditional char replacement

Use case: I'm writing a domain-specific language (DSL) for a regex-like but far more powerful Lispy string-processing system focused on conditional replacements (for example, simulating language evolution for conlangers and linguists) rather than on matching, as regexes are. As usual, I wrote the spec before writing the code.
However, due to a somewhat stupid but hard-to-fix mistake, I ended up with a system that can only process one char at a time. Thus, a rewrite rule might be (in pseudocode) change 'a' to 'e' when last char is 's' and next char is 'd'. Chars can also be deleted: delete 'a' when ....
Since the interpreter for the DSL is a bit spaghetti-ish (not in the sense of being unstructured, but in the sense that 1. I haven't figured out OO for my implementation language, Chicken Scheme, and 2. I have no IDE, so I must remember 20+ variable names and use Emacs), I don't want to touch it, but would rather "unsugar" string replacements into conditional char replacements.
The trivial example: change "ab" to "cd" unconditionally rewrites to change 'a' to 'c' when followed by 'b'; change 'b' to 'd' when preceded by 'a'. However, when there are conditions, things become very ugly very quickly. Is there some easy recursive way to do the rewriting, or is this nearly impossible in the rewriting phase, so that I should probably fix my DSL interpreter instead? (Note: my DSL has ways to get the n-th letter before and after the current char.)

The problem is that since we are going through the data character-at-a-time, when a condition is applied to a multi-character string, that condition has to be expressed in different ways for every position. For instance "abc" followed by "x" combines in a straightforward way into the condition for a, b and c, but has to change shape. The x is actually three positions away from a, but only two from b. This is bad because it causes a proliferation of conditions, which all get wastefully evaluated.
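To make the proliferation concrete, here is a minimal Python sketch of the naive unsugaring. The function name unsugar and the rule representation (source char, target char, and a dict from relative offset to required char) are hypothetical and are not part of the asker's DSL. It assumes the replacement has the same length as the pattern and that every condition is tested against the original, unmodified input.

```python
def unsugar(pattern, replacement, before="", after=""):
    """Expand one string rule into per-character rules.

    Each rule is (source_char, target_char, conditions), where conditions
    maps an offset relative to the current character to the char required
    at that offset. Hypothetical representation, for illustration only.
    """
    if len(pattern) != len(replacement):
        raise ValueError("this sketch only handles equal-length replacements")
    n = len(pattern)
    rules = []
    for i, (src, dst) in enumerate(zip(pattern, replacement)):
        conds = {}
        # Left context: its distance grows as we move right in the pattern.
        for j, ch in enumerate(before):
            conds[j - len(before) - i] = ch
        # The other characters of the pattern itself.
        for j, ch in enumerate(pattern):
            if j != i:
                conds[j - i] = ch
        # Right context: its distance shrinks as we move right.
        for j, ch in enumerate(after):
            conds[n + j - i] = ch
        rules.append((src, dst, conds))
    return rules


for rule in unsugar("abc", "ijk", before="x", after="yz"):
    print(rule)
```

Every one of the three generated rules carries all five conditions, each at a different offset, which is exactly the duplication described above.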
I'd solve this by adding the concept of frames into the interpreter. A frame is established at the current character position, and then holds that position somehow, allowing frame-relative addressing of the characters.
I can think of a few ways of introducing this position fixing. One would be to introduce variable binding into the interpreter. It could support a pair of instructions bind symbol and unbind n, where we would be using a gensym for the symbol.
When generating the code for an operation on a string like "abc", we would generate an instruction like bind #:g0025, which would fix the position of the a, and then the compiler will analyze the conditions applied to the string, and re-phrase them in terms that are relative to #:g0025. After the processing of "abc", we would emit unbind 1 to drop the most recently bound variable.
We could also bind variables to the Boolean values of conditions.
As an example with the named frames, suppose we have
Replace "abc" with "ijk" when preceded by "x" and followed by "yz".
This goes to something like:
bind #:frame
bind #:cond0 to when #:frame[-1] is "x" and #:frame[3] is "y" and #:frame[4] is "z"
replace "a" with "i" when #:cond0 ; these insns move the char position
replace "b" with "j" when #:cond0
replace "c" with "k" when #:cond0
unbind 2
So the difficulty has been translated to one of compiling the condition into frame-relative addressing. The #:frame[3] is derived from the length of the "abc" pattern, which is available to the translator all at once. That is information not available in the target language, which doesn't have "abc" all at once.
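A rough Python sketch of that translation step follows. It only emits the pseudo-instructions as text, in the same notation as the example above; the function name compile_with_frame is made up, and the instruction set is the hypothetical one proposed in this answer, not something the asker's interpreter has today. It also assumes an equal-length replacement.

```python
def compile_with_frame(pattern, replacement, before="", after=""):
    """Emit frame-relative pseudo-instructions for one string rule."""
    if len(pattern) != len(replacement):
        raise ValueError("this sketch only handles equal-length replacements")
    tests = []
    # Left context sits at negative offsets from the frame.
    for j, ch in enumerate(before):
        tests.append((j - len(before), ch))
    # Right context starts right after the pattern, hence len(pattern).
    for j, ch in enumerate(after):
        tests.append((len(pattern) + j, ch))
    cond = " and ".join(f'#:frame[{off}] is "{ch}"' for off, ch in tests)
    insns = ["bind #:frame", f"bind #:cond0 to when {cond or 'true'}"]
    for src, dst in zip(pattern, replacement):
        insns.append(f'replace "{src}" with "{dst}" when #:cond0')
    insns.append("unbind 2")
    return insns


print("\n".join(compile_with_frame("abc", "ijk", before="x", after="yz")))
```

The condition is computed once per rule rather than once per character, and the only place the pattern length is needed is in the offsets of the right context.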
The system almost certainly needs some way to try different matches at the same location. If there is no "abc" at the current position, another rule which replaces "foo" with something has to be tried at the same position. Perhaps when the conditions fail, the instruction doesn't advance the character position. So in our example above, that would work: all instructions share the same condition, so in the case of a match the position moves by three positions, otherwise it doesn't. Still, in spite of that, there may be a requirement to have multiple edits with different conditions at the same spot. The scope of my answer isn't to design the whole thing, though.

Related

Longest common substring via suffix array: do we really need unique sentinels?

I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "Longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any of the strings themselves.
Unless I am mistaken, the reason for this is so when we construct the LCP array (by comparing how many characters adjacent suffixes have in common) we don't count the sentinel value in the case where two sentinels happen to be at the same index in both the suffixes we are comparing.
This means we can write code like this:
for each character c in the shortest suffix
    if suffix_1[c] == suffix_2[c]
        increment count of common characters
However, in order to facilitate this, we need to jump through some hoops to ensure we use unique sentinels, which I asked about here.
However, would a simpler (to implement) solution not be to simply count the number of characters in common, stopping when we reach the (single, unique) sentinel character, like this:
set sentinel = '#'
for each character c in the shortest suffix
    if suffix_1[c] == suffix_2[c]
        if suffix_1[c] != sentinel
            increment count of common characters
        else
            return
Or, am I missing something fundamental here?
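For reference, the single-sentinel idea can be written as a small runnable Python function. This is only a sketch of the proposal in this question: the name common_prefix_len is made up, '#' is assumed not to occur in the input strings, and unlike the pseudocode it also stops at the first mismatch, which a common-prefix count has to do.

```python
def common_prefix_len(suffix_1, suffix_2, sentinel="#"):
    """Count leading characters shared by two suffixes, never counting
    the sentinel or anything after it."""
    count = 0
    for a, b in zip(suffix_1, suffix_2):
        if a != b or a == sentinel:
            break
        count += 1
    return count


print(common_prefix_len("ab#cd", "ab#xy"))  # 2: the shared '#' is not counted
```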
Actually I just devised an algorithm that doesn't use sentinels at all: https://github.com/BurntSushi/suffix/issues/14
When concatenating the strings, also record the boundary indexes (e.g. for 3 strings of lengths 4, 2 and 5, the boundaries 4, 6 and 11 will be recorded, so we know that concatenated_string[5] belongs to the second original string because 4 <= 5 < 6).
Then, to identify which original string every suffix belongs to, just do a binary search.
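A minimal Python sketch of that lookup, using the standard bisect module. The helper names boundaries and owner are made up for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def boundaries(strings):
    """Cumulative end offsets of each string in the concatenation."""
    return list(accumulate(len(s) for s in strings))


def owner(bounds, pos):
    """0-based index of the original string that position pos falls in."""
    return bisect_right(bounds, pos)


bounds = boundaries(["abcd", "ef", "ghijk"])
print(bounds)            # [4, 6, 11]
print(owner(bounds, 5))  # 1, i.e. the second string, because 4 <= 5 < 6
```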
The short version is "this is mostly an artifact of how suffix array construction algorithms work and has nothing to do with LCP calculations, so provided your suffix array building algorithm doesn't need those sentinels, you can safely skip them."
The longer answer:
At a high level, the basic algorithm described in the video goes like this:
1. Construct a generalized suffix array for the strings T1 and T2.
2. Construct an LCP array for that resulting suffix array.
3. Iterate across the LCP array, looking for adjacent pairs of suffixes that come from different strings.
4. Find the largest LCP between any two such strings; call it k.
5. Extract the first k characters from either of the two suffixes.
So, where do sentinels appear in here? They mostly come up in steps (1) and (2). The video alludes to using a linear-time suffix array construction algorithm (SACA). Most fast SACAs for generating suffix arrays for two or more strings assume, as part of their operation, that there are distinct endmarkers at the ends of those strings, and often the internal correctness of the algorithm relies on this. So in that sense, the endmarkers might need to get added in purely to use a fast SACA, completely independent of any later use you might have.
(Why do SACAs need this? Some of the fastest SACAs, such as the SA-IS algorithm, assume the last character of the string is unique, lexicographically precedes everything, and doesn't appear anywhere else. In order to use that algorithm with multiple strings, you need some sort of internal delimiter to mark where one string ends and another starts. That character needs to act as a strong "and we're now done with the first string" character, which is why it needs to lexicographically precede all the other characters.)
Assuming you're using a SACA as a black box this way, from this point forward, those sentinels are completely unnecessary. They aren't used to tell which suffix comes from which string (this should be provided by the SACA), and they can't be a part of the overlap between adjacent strings.
So in that sense, you can think of these sentinels as an implementation detail needed to use a fast SACA, which you'd need to do in order to get the fast runtime.
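To illustrate that the later steps work without sentinels, here is a deliberately naive Python sketch. It replaces the fast SACA with a plain sort of the suffixes of each string, tagged with the string they came from, so it is far from linear time; the function name is made up. The point is only that, once each suffix knows its owner, nothing else in the pipeline needs an endmarker.

```python
def longest_common_substring(t1, t2):
    """Naive version of the pipeline: sorted suffixes + adjacent LCPs."""
    # Stand-in for steps 1-2: every suffix paired with its owner (0 or 1).
    suffixes = sorted(
        [(t1[i:], 0) for i in range(len(t1))]
        + [(t2[i:], 1) for i in range(len(t2))]
    )
    best = ""
    # Steps 3-5: scan adjacent suffixes that come from different strings.
    for (a, owner_a), (b, owner_b) in zip(suffixes, suffixes[1:]):
        if owner_a == owner_b:
            continue
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best


print(longest_common_substring("xabcdy", "zzbcdq"))  # bcd
```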

Read substrings from a string containing multiplication [duplicate]

I am a scientist programming in Fortran, and I came across a strange behaviour. In one of my programs I have a string containing several "words", and I want to read all words as substrings. The first word starts with an integer and a wildcard, like "2*something".
When I perform an internal read on that string, I expect to read all words, but instead, the READ statement repeatedly reads the first substring. I do not understand why, nor how to avoid this behaviour.
Below is a minimalist sample program that reproduces this behaviour. I would expect it to read the three substrings and to print "3*a b c" on the screen. Instead, I get "a a a".
What am I doing wrong? Can you please help me and explain what is going on?
I am compiling my programs under GNU/Linux x64 with Gfortran 7.3 (7.3.0-27ubuntu1~18.04).
PROGRAM testread
  IMPLICIT NONE
  CHARACTER(LEN=1024) :: string
  CHARACTER(LEN=16) :: v1, v2, v3
  string = "3*a b c"
  READ(string,*) v1, v2, v3
  PRINT*, v1, v2, v3
END PROGRAM testread
You are using list-directed input (the * format specifier). In list-directed input, a number (n) followed by an asterisk means "repeat this item n times", so it is processed as if the input was a a a b c. You would need to have as input '3*a' b c to get what you want.
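As a rough model of what the runtime does with the repeat count, here is a short Python sketch (Python rather than Fortran, purely to show the expansion). It is a simplification and not the actual rule set: it ignores quoted strings, commas, slashes and null values, and the function name is made up.

```python
import re


def expand_repeat_counts(record):
    """Simplified model of list-directed input: n*value becomes n copies."""
    values = []
    for token in record.split():
        match = re.fullmatch(r"(\d+)\*(.*)", token)
        if match:
            values.extend([match.group(2)] * int(match.group(1)))
        else:
            values.append(token)
    return values


print(expand_repeat_counts("3*a b c"))  # ['a', 'a', 'a', 'b', 'c']
```

The READ in the question asks for only three items, so it takes the first three values of the expanded list, which is why it prints a a a.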
I will use this as another opportunity to point out that list-directed I/O is sometimes the wrong choice as its inherent flexibility may not be what you want. That it has rules for things like repeat counts, null values, and undelimited strings is often a surprise to programmers. I also often see programmers complaining that list-directed input did not give an error when expected, because the compiler had an extension or the programmer didn't understand just how liberal the feature can be.
I suggest you pick up a Fortran language reference and carefully read the section on list-directed I/O. You may find you need to use an explicit format or change your program's expectations.
Following the answer of @SteveLionel, here is the relevant part of the reference on list-directed sequential READ statements (in this case, for Intel Fortran, but you could find it for your specific compiler and it won't be much different).
A character string does not need delimiting apostrophes or quotation marks if the corresponding I/O list item is of type default character, and the following is true:
The character string does not contain a blank, comma (,), or slash ( / ).
The character string is not continued across a record boundary.
The first nonblank character in the string is not an apostrophe or a quotation mark.
The leading character is not a string of digits followed by an asterisk.
A nondelimited character string is terminated by the first blank, comma, slash, or end-of-record encountered. Apostrophes and quotation marks within nondelimited character strings are transferred as is.
In total, there are 4 forms of sequential read statements in Fortran, and you may choose the option that best fits your need:
Formatted Sequential Read:
To use this, you change the * to an actual format specifier. If you know the lengths of the strings in advance, this would be as easy as '(a3,a2,a2)'. Alternatively, you could come up with a format specifier that matches your data, but this generally requires knowing the lengths or layout of the fields beforehand.
Formatted Sequential List-Directed:
You are currently using this option (the * format descriptor). As already shown, this kind of I/O comes with a lot of magic and surprising behavior. What is hitting you is the n*cte form, which is interpreted as n repetitions of the literal cte.
As said by Steve Lionel, you could put quotation marks around the problematic word, so it will be parsed as one piece. Or, as proposed by @evets, you could split your string using the intrinsics index or scan. Another option could be changing your wildcard from an asterisk to anything else.
Formatted Namelist:
Well, that could be an option if your data was (or could be) presented in the namelist format, but I really think it's not your case.
Unformatted:
This may not apply to your case because you are reading from a character variable, and an internal READ statement can only be formatted.
Otherwise, you could split your string by means of a function instead of an I/O operation. There is no intrinsic for this, but you could come up with one without much trouble (see this thread for reference). As you may have noticed already, manipulating strings in Fortran is... awkward, at least. There are some libraries out there (like this) that may be useful if you are doing lots of string stuff in Fortran.

Elegant way to parse "line splices" (backslashes followed by a newline) in megaparsec

For a small compiler project, we are currently implementing a compiler for a subset of C, for which we decided to use Haskell and megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot handle correctly yet. One of them is the treatment of backslashes followed by a newline. To quote from the specification:
Each instance of a backslash character (\) immediately followed by a
new-line character is deleted, splicing physical source lines to form
logical source lines. Only the last backslash on any physical source
line shall be eligible for being part of such a splice.
(§5.1.1., ISO/IEC 9899:201x)
So far we came up with two possible approaches to this problem:
1.) Implement a pre-lexing phase in which the initial input is reproduced and every occurrence of \\\n is removed. The big disadvantage we see in this approach is that we lose the accurate error locations which we need.
2.) Implement a special char' combinator that behaves like char but looks an extra character ahead and will silently consume any \\\n. This would give us correct positions. The disadvantage here is that we need to replace every occurrence of char with char' in any parser, even in the megaparsec-provided ones like string, integer, whitespace etc...
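To make the trade-off in approach 1 concrete, here is a small language-neutral sketch in Python (not Haskell, and not using any megaparsec API). It removes the splices but also records, for every surviving character, where it came from in the original source, so a later stage could in principle translate an error offset back. The function name splice_lines and the idea of keeping a side table are assumptions made for illustration; how such a table would be wired into megaparsec's position tracking is left open.

```python
def splice_lines(source):
    """Delete every backslash-newline pair and remember original positions.

    Returns (spliced_text, origin) where origin[k] is the 1-based
    (line, column) in the original source of spliced_text[k].
    """
    out = []
    origin = []
    line, col = 1, 1
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source) and source[i + 1] == "\n":
            # Splice: drop both characters, but the physical line advances.
            i += 2
            line, col = line + 1, 1
            continue
        out.append(ch)
        origin.append((line, col))
        if ch == "\n":
            line, col = line + 1, 1
        else:
            col += 1
        i += 1
    return "".join(out), origin


text, origin = splice_lines("in\\\nt x;")
print(text)       # int x;
print(origin[2])  # (2, 1): the 't' sits at the start of physical line 2
```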
Most likely we are not the first people trying to parse a language with such a "quirk" with parsec/megaparsec, so I could imagine that there is some nicer way to do it. Does anyone have an idea?

Shortcut key assignment algorithm

I have multiple choice menus in my program like so:
Which option do you want? (choose one)
f: First option
s: Second option
t: Third option
The user then presses f, s or t to make their choice. For this example, I picked the letters manually, and it should be obvious how.
But in some cases there are conflicts: Suppose I had a Fourth option - I can't use f. Sensible choices include F, h and others, depending on UX philosophy.
Is there an algorithm that will, given a list of strings, generate a unique mnemonic letter to identify each string? By "mnemonic", I mean that the option should suggest the letter (as in my example), so that it is easy to remember which is which (as opposed to just mapping everything to a, b, c or x, y, z).
As I noted above, there are multiple ways of doing this, depending on what you prefer: Capitalized letters, letters within the first word, letters of secondary unique words, etc. For this question, I don't really care about these, so feel free to use your own rules - so long as the algorithm produces reasonably user-friendly results.
The baseline algorithm I've seen used is the following:
1. Pick the first letter (as in your example). Keep track of which letters you pick.
2. When the chosen letter is taken, pick the next letter of the option that is still available.
3. If none of the option's letters is free, then pick one lexicographically (the first free letter from the alphabet).
4. If no letter is free at all, don't pick anything; the option won't be addressable. This makes sense with menus that are also clickable.
Of course, there are tweaks you can apply:
Does your terminal/OS convention distinguish between upper case and lower case? You can use that after step 2 and before step 3 (if there are no more letters left to use, use an upper case letter).
Can you use and detect alt, ctrl, win?
Are there pre-assigned shortcuts you need to maintain (e.g. s to save)? Assign them before step 1.
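A minimal Python sketch of the baseline steps, with none of the tweaks. The function name assign_shortcuts and the optional reserved parameter are made up for illustration; keys are lower-cased, and options are served in the order given, so earlier options get first pick.

```python
import string


def assign_shortcuts(options, reserved=()):
    """Map each option to a unique lowercase letter, or None if none is left."""
    taken = set(reserved)  # pre-assigned keys, handled before step 1
    result = {}
    for option in options:
        key = None
        # Steps 1-2: the first letter of the option that is still free.
        for ch in option.lower():
            if ch.isalpha() and ch not in taken:
                key = ch
                break
        # Step 3: fall back to the first free letter of the alphabet.
        if key is None:
            for ch in string.ascii_lowercase:
                if ch not in taken:
                    key = ch
                    break
        # Step 4: key stays None when every letter is used up.
        if key is not None:
            taken.add(key)
        result[option] = key
    return result


menu = ["First option", "Second option", "Third option", "Fourth option"]
for option, key in assign_shortcuts(menu).items():
    print(f"{key}: {option}")
```

With the menu from the question, the fourth entry gets o, because f is already taken by the first entry.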

Extracting information in a string

I would like to parse strings with an arbitrary number of parameters, such as P1+05 or P2-01 all put together like P1+05P2-02. I can get that data from strings with a rather large (too much to post around...) IF tree and a variable keeping track of the position within the string. When reaching a key letter (like P) it knows how many characters to read and proceeds accordingly, nothing special. In this example, say I have two players in a game and I want to give +05 and -01 health to players 1 and 2, respectively (hence the +-; I want them to be somewhat readable).
It works, but I feel this could be done better. I am using Lua to parse the strings, so maybe there is some built-in function within Lua to ease that process? Or maybe some general hints or references for better approaches?
Here is some code:
for w in string.gmatch("P1+05P2-02","%u[^%u]+") do
  print(w)
end
It assumes that each "word" begins with an uppercase letter and its parameters contain no uppercase letters.
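The same idea, taken one step further so that each word is split into its fields, can be sketched in Python with a regular expression (shown in Python for illustration; Lua's string.gmatch with captures can be used in a similar way). The field layout is an assumption based on the examples in the question: the letter P, a single-digit player number, then a sign and two digits. The function name is made up.

```python
import re


def parse_commands(text):
    """Return (player, health_delta) pairs from a string like 'P1+05P2-02'."""
    pairs = re.findall(r"P(\d)([+-]\d{2})", text)
    return [(int(player), int(delta)) for player, delta in pairs]


print(parse_commands("P1+05P2-02"))  # [(1, 5), (2, -2)]
```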
