Looking for efficient string replacement algorithm - string

I'm trying to create a string replacer that accepts multiple replacements.
The idea is that it would scan the string to find substrings and replace those substrings with another substring.
For example, I should be able to ask it to replace every "foo" with "bar". Doing that is trivial.
The issue starts when I try to add multiple replacements to this function. If I ask it to replace "foo" with "bar" and "bar" with "biz", running those replacements in sequence would result in "foo" turning into "biz", and this behavior is unintended.
I tried splitting the string into words and running each replacement function on each word. However, that's not bulletproof either: it still results in unintended behavior, since you can ask it to replace substrings that are not whole words. I also find it very inefficient.
I'm thinking of running each replacer once over the whole string, storing those changes and then merging them. However, I think I'm overengineering.
Searching the web only gives me trivial results on how to use string.replace with regular expressions, which doesn't solve my problem.
Is this problem already solved? Is there an algorithm that can do this string manipulation efficiently?

If you modify your string while searching for all occurrences of the substrings to be replaced, you'll end up searching in an already modified (and therefore incorrect) state of the string. An easy way out is to get a list of all indexes to update first, then iterate over the indexes and make the replacements. That way, the indexes for "bar" have already been computed, and won't be affected even if you replace some other substring with "bar" later.
Adding a rough Python implementation to give you an idea:
import re

string = "foo bar biz"
replacements = [("foo", "bar"), ("bar", "biz")]

# Find every match in the ORIGINAL string before changing anything.
# re.escape makes sure the search text is treated literally.
matches = []
for old, new in replacements:
    for m in re.finditer(re.escape(old), string):
        matches.append((m.start(), old, new))

# Apply the replacements from left to right so the running offset stays valid
# even when old and new differ in length (assumes the matches do not overlap).
temp = list(string)
offset = 0
for index, old, new in sorted(matches):
    temp[offset + index:offset + index + len(old)] = list(new)
    offset += len(new) - len(old)

print(''.join(temp))  # "bar biz biz"

Here's the approach I would take.
I start with my text and the set of replacements:
string text = "alpha foo beta bar delta";
Dictionary<string, string> replacements = new()
{
    { "foo", "bar" },
    { "bar", "biz" },
};
Now I create an array of parts that are either "open" or not. Open parts can have their text replaced.
var parts = new List<(string text, bool open)>
{
    (text: text, open: true)
};
Now I run through each replacement and build a new parts list. If the part is open I can do the replacements, if it's closed just add it in untouched. It's this last bit that prevents double mapping of replacements.
Here's the main logic:
foreach (var replacement in replacements)
{
    var parts2 = new List<(string text, bool open)>();
    foreach (var part in parts)
    {
        if (part.open)
        {
            bool skip = true;
            foreach (var split in part.text.Split(new[] { replacement.Key }, StringSplitOptions.None))
            {
                if (skip)
                {
                    skip = false;
                }
                else
                {
                    parts2.Add((text: replacement.Value, open: false));
                }
                parts2.Add((text: split, open: true));
            }
        }
        else
        {
            parts2.Add(part);
        }
    }
    parts = parts2;
}
That produces the following parts:
("alpha ", open), ("bar", closed), (" beta ", open), ("biz", closed), (" delta", open)
Now it just needs to be joined back up again:
string result = String.Concat(parts.Select(p => p.text));
That gives:
alpha bar beta biz delta
As requested.
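For readers who don't use C#, the same open/closed parts idea can be sketched in Python. This is only an illustration of the approach described above, not a tuned implementation; replace_all is a hypothetical helper name, and the sketch assumes every search string is non-empty (str.split rejects an empty separator).

```python
def replace_all(text, replacements):
    # Each part is (text, is_open); only open parts may still be rewritten.
    parts = [(text, True)]
    for old, new in replacements.items():
        next_parts = []
        for chunk, is_open in parts:
            if not is_open:
                # Already the result of a replacement: copy it untouched.
                next_parts.append((chunk, False))
                continue
            for i, piece in enumerate(chunk.split(old)):
                if i > 0:
                    # Every gap between two pieces is a place where `old` was found.
                    next_parts.append((new, False))
                next_parts.append((piece, True))
        parts = next_parts
    return "".join(chunk for chunk, _ in parts)


print(replace_all("alpha foo beta bar delta", {"foo": "bar", "bar": "biz"}))
# alpha bar beta biz delta
```

Because inserted text is marked closed, a later rule such as "bar" to "biz" never sees the "bar" that an earlier rule produced.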

Let's suppose your given string were
str = "Mary had fourteen little lambs"
and the desired replacements were given by the following hash (aka hashmap):
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
indicating that we want to replace "Mary" (wherever it appears in the string, if at all) with "Butch", and so on. We therefore want to return the following string:
"Butch had fourteen wee hippos"
Notice that we do not want 'fourteen' to be replaced with 'threeteen' and we want the extra spaces between 'fourteen' and 'wee' to be preserved.
First collect the keys of the hash h into an array (or list):
keys = h.keys
#=> ["Mary", "four", "little", "lambs"]
Most languages have a method or function sub or gsub that works something like the following:
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch had fourteen wee hippos"
The regular expression /\w+/ (r'\w+' in Python, for example) matches one or more word characters, as many as possible (i.e., a greedy match). Word characters are letters, digits and the underscore ('_'). It therefore will sequentially match 'Mary', 'had', 'fourteen', 'little' and 'lambs'.
Each matched word is passed to the "block" do |word| ...end and is held by the variable word. The block calculation then computes and returns the string that is to replace the value of word in a duplicate of the original string. Different languages use different structures and formats to do this, of course.
The first word passed to the block by gsub is 'Mary'. The following calculation is then performed:
if keys.include?("Mary") # true
  # so replace "Mary" with:
  h[word] #=> "Butch"
else # not executed
  # not executed
end
Next, gsub passes the word 'had' to the block and assigns that string to the variable word. The following calculation is then performed:
if keys.include?("had") # false
  # not executed
else
  # so replace "had" with:
  "had"
  # that is, leave "had" unchanged
end
Similar calculations are made for each word matched by the regular expression.
We see that punctuation and other non-word characters are not a problem:
str = "Mary, had fourteen little lambs!"
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch, had fourteen wee hippos!"
We can see that gsub does not perform replacements sequentially:
h = { "foo"=>"bar", "bar"=>"baz" }
keys = h.keys
#=> ["foo", "bar"]
"foo bar".gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "bar baz"
Note that a linear search of keys is required to evaluate
keys.include?("Mary")
This could be relatively time-consuming if keys has many elements.
In most languages this can be sped up by making keys a set (an unordered collection of unique elements). Determining whether a set contains a given element is quite fast, comparable to determining if a hash has a given key.
An alternative formulation is to write
str.gsub(/\b(?:Mary|four|little|lambs)\b/) { |word| h[word] }
#=> "Butch had fourteen wee hippos"
where the regular expression is constructed programmatically from h.keys. This regular expression reads, "match one of the four words indicated, preceded and followed by a word boundary (\b)". The trailing word boundary prevents 'four' from matching 'fourteen'. Since gsub now only considers those four words for replacement, the block can be simplified to { |word| h[word] }.
Again, this preserves punctuation and extra spaces.
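As a rough illustration of building that alternation programmatically, here is a minimal Python sketch. replace_words is a hypothetical helper name; the sketch assumes the mapping is non-empty and that every key starts and ends with a word character, since \b only behaves as a word boundary in that case.

```python
import re

def replace_words(text, mapping):
    # Longest keys first, so a longer key wins over a shorter key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape keeps a key such as "a.b" from being read as regex syntax.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    # Every match is found in the original text, so a replacement is never re-replaced.
    return pattern.sub(lambda m: mapping[m.group()], text)

h = {"Mary": "Butch", "four": "three", "little": "wee", "lambs": "hippos"}
print(replace_words("Mary, had fourteen little lambs!", h))
# Butch, had fourteen wee hippos!
```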
If for some reason we wanted to be able to replace parts of words (e.g., to replace 'fourteen' with 'threeteen'), simply remove the word boundaries from the regular expression:
str.gsub(/Mary|four|little|lambs/) { |word| h[word] }
#=> "Butch had threeteen wee hippos"
Naturally, different languages provide variations of this approach. In Ruby, for example, one could write:
g = Hash.new { |h,k| k }.merge(h)
This creates a hash g that has the same key-value pairs as h but has the additional property that if g does not have a key k, g[k] (the value of key k) returns k. That allows us to write simply:
str.gsub(/\w+/, g)
#=> "Butch had fourteen wee hippos"
See the second version of String#gsub.
A different approach (which I will show is problematic) is to construct an array (or list) of words from the string, replace those words as appropriate and then rejoin the resulting words to form a string. For example,
words = str.split
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
arr.join(' ')
#=> "Butch had fourteen wee hippos"
This produces similar results except the extra spaces have been removed.
Now suppose the string contained punctuation:
str = "Mary, had fourteen little lambs!"
words = str.split
#=> ["Mary,", "had", "fourteen", "little", "lambs!"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Mary,", "had", "fourteen", "wee", "lambs!"]
arr.join(' ')
#=> "Mary, had fourteen wee lambs!"
We could deal with punctuation by writing
words = str.scan(/\w+/)
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
Here str.scan returns an array of all matches of the regular expression /\w+/ (one or more word characters). The obvious problem is that all punctuation is lost when arr.join(' ') is executed.

You can achieve this in a simple way by using regular expressions:
import re
replaces = {'foo' : 'bar', 'alfa' : 'beta', 'bar': 'biz'}
original_string = 'foo bar, alfa foo. bar other.'
expected_string = 'bar biz, beta bar. biz other.'
replaced = re.compile(r'\w+').sub(lambda m: replaces[m.group()] if m.group() in replaces else m.group(), original_string)
assert replaced == expected_string
I haven't checked the performance, but I believe it is probably faster than using "nested for loops".

Related

How can I replace each letter in the sentence to sentence without breaking it?

Here's my problem.
sentence = "This car is awsome."
and what I want to do is
sentence.replace("a","<emoji:a>")
sentence.replace("b","<emoji:b>")
sentence.replace("c","<emoji:c>")
and so on...
But of course, if I do it that way, the letters in "<emoji:>" will also be replaced as I go along. So how can I do it another way?
As Carlos Gonzalez suggested:
create a mapping dict and apply it to each character in sequence:
sentence = "This car is awsome."
# mapping
up = {"a":"<emoji:a>",
      "b":"<emoji:b>",
      "c":"<emoji:c>",}
# apply mapping to create a new text (use up[k] if present else default to k)
text = ''.join( (up.get(k,k) for k in sentence) )
print(text)
Output:
This <emoji:c><emoji:a>r is <emoji:a>wsome.
The advantage of the generator expression inside the ''.join( ... generator ...) is that it takes each single character of sentence and either keeps it or replaces it. It only ever touches each char once, so there is no danger of multiple substitutions and it takes only one pass of sentence to convert the whole thing.
Docs: dict.get(key,default) and Why dict.get(key) instead of dict[key]?
If you used
sentence = sentence.replace("a","o")
sentence = sentence.replace("o","k")
you would first make o from a and then make k from any o (or a before) - and you would have to touch each character twice to make it happen.
Using
up = { "a":"o", "o":"k" }
text = ''.join( (up.get(k,k) for k in sentence) )
avoids this.
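For single-character keys, Python's built-in str.translate performs the same one-pass mapping. A minimal sketch, reusing the sentence and the up mapping from the first example:

```python
sentence = "This car is awsome."
up = {"a": "<emoji:a>", "b": "<emoji:b>", "c": "<emoji:c>"}

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length.
table = str.maketrans(up)
text = sentence.translate(table)
print(text)
# This <emoji:c><emoji:a>r is <emoji:a>wsome.
```

Each character of the input is looked up exactly once, so replacement text is never scanned again.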
If you want to replace more than 1 character at a time, it would be easier to do this with regex. Inspired by Passing a function to re.sub in Python
import re
sentence = "This car is awsome."
up = {"is":"Yippi",
      "ws":"WhatNot",}
# modified it to create the groups using the dicts key
text2 = re.sub( "("+'|'.join(up)+")", lambda x: up[x.group()], sentence)
print(text2)
Output:
ThYippi car Yippi aWhatNotome.
Docs: re.sub(pattern, repl, string, count=0, flags=0)
You would have to take extra care with your keys if they contain characters that have a special meaning in a regex pattern, e.g. .+*?()[]^$
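One way to take that care, sketched here under the assumption that the keys are meant as literal text and the mapping is non-empty: pass each key through re.escape, and put longer keys first so that a short key cannot shadow a longer one in the alternation. multi_replace is a hypothetical helper name.

```python
import re

def multi_replace(text, mapping):
    # Try longer keys first; otherwise "a" would match before "a+" gets a chance.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape turns characters such as + and . into literals.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group()], text)

up = {"a+": "PLUS", "a": "A", ".": "!"}
print(multi_replace("a+ a.", up))
# PLUS A!
```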

How to remove the common words from two strings, while keeping the white-space intact using Groovy?

I have two strings,
def str1 = "This is test"
def str2 = "That is test"
I want to find the difference between these two strings using Groovy.
I have tried the - operator but it doesn't seem to work properly.
println 'This is test' - 'That is test'
I want the output to be This That
But, the above code evaluates to the first string This is test. Where am I going wrong? Is there any other way to get the difference between two strings using Groovy?
The minus operator works differently for String: it removes part of the String. In your case you get This is test as a result because that String does not contain the substring That is test.
If you want a concatenation of the words that differ between the two strings, you can tokenize both strings, transpose them into pairs of words, and remove the pairs that contain the same word twice. The remaining words can be joined with a space character, something like:
def str1 = "This is test"
def str2 = "That is test"
def diff = [str1.tokenize(), str2.tokenize()].transpose() // creates a list of pairs like [["This", "That"], ["is", "is"], ["test", "test"]]
    .findAll { it[0] != it[1] } // filters out pairs containing the same word
    .flatten() // flattens [["This", "That"]] to ["This", "That"]
    .join(' ') // creates the final String "This That"
assert diff == 'This That'

Efficient algorithm for phrase anagrams

What is an efficient way to produce phrase anagrams given a string?
The problem I am trying to solve
Assume you have a word list with n words. Given an input string, say, "peanutbutter", produce all phrase anagrams. Some contenders are: pea nut butter, A But Ten Erupt, etc.
My solution
I have a trie that contains all words in the given word list. Given an input string, I calculate all permutations of it. For each permutation, I have a recursive solution (something like this) to determine whether that specific permuted string can be broken into words. For example, if one of the permutations of peanutbutter was "abuttenerupt", I used this method to break it into "a but ten erupt". I use the trie to determine whether a string is a valid word.
What sucks
My problem is that because I calculate all permutations, my solution runs very slowly for phrases longer than 10 characters, which is a big letdown. I want to know if there is a different way to do this.
Websites like https://wordsmith.org/anagram/ can do the job in less than a second and I am curious to know how they do it.
Your problem can be decomposed into 2 sub-problems:
1. Find combinations of words that use up all characters of the input string
2. Find all permutations of the words found in the first sub-problem
Subproblem #2 is a basic algorithm, and you can find an existing standard implementation in most programming languages. Let's focus on subproblem #1.
First convert the input string to a "character pool". We can implement the character pool as an array oc, where oc[c] = number of occurrences of character c.
Then we use a backtracking algorithm to find words that fit in the charpool, as in this pseudo-code:
result = empty;
function findAnagram(pool) {
    if (pool empty) then print result;
    for (word in dictionary) {
        if (word fits in pool) {
            result = result + word;
            update pool to exclude characters in word;
            findAnagram(pool);
            // as with any backtracking algorithm, we have to restore global states
            restore pool;
            restore result;
        }
    }
}
Note: If we pass the charpool by value then we don't have to restore it. But as it is quite big, I prefer passing it by reference.
Now we remove redundant results and apply some optimizations:
Assume A comes before B in the dictionary. If we choose B as the first word, then we don't have to consider A in the following steps, because those results (the ones containing A) were already produced in the case where A was chosen as the first word.
If the character set is small enough (< 64 characters is best), we can use a bitmask to quickly filter out words that cannot fit in the pool. A bitmask records which characters occur in a word, no matter how many times they occur.
Update the pseudo-code to reflect those optimizations:
function findAnagram(charpool, minDictionaryIndex) {
    pool_bitmask <- bitmask(charpool);
    if (charpool empty) then print result;
    for (word in dictionary AND word's index >= minDictionaryIndex) {
        // the bitmask of every word in the dictionary should be pre-calculated
        word_bitmask <- bitmask(word)
        if (word_bitmask contains bit(s) that are not in pool_bitmask)
            then skip this iteration
        if (word fits in charpool) {
            result = result + word;
            update charpool to exclude characters in word;
            findAnagram(charpool, word's index);
            // as with any backtracking algorithm, we have to restore global states
            restore charpool;
            restore result;
        }
    }
}
My C++ implementation of subproblem #1 where the character set contains only lowercase 'a'..'z': http://ideone.com/vf7Rpl .
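For comparison, the pseudo-code for subproblem #1 can be sketched in Python. This is not a port of the linked C++ code; it is a small illustration under a few assumptions: find_anagrams is a hypothetical name, collections.Counter plays the role of the character pool, and a plain set of letters stands in for the bitmask.

```python
from collections import Counter

def find_anagrams(phrase, dictionary):
    """Yield lists of dictionary words that together use every letter of phrase."""
    words = sorted({w for w in dictionary if w})    # drop duplicates and empty words
    counts = [Counter(w) for w in words]            # pre-computed letter counts
    letters = [set(w) for w in words]               # stands in for the bitmask
    pool = Counter(phrase.replace(" ", ""))
    result = []

    def search(min_index):
        if not pool:                                # every letter has been used
            yield list(result)
            return
        available = set(pool)
        for i in range(min_index, len(words)):
            if not letters[i] <= available:         # cheap filter: a letter is missing
                continue
            if any(pool[c] < n for c, n in counts[i].items()):
                continue                            # right letters, not enough copies
            pool.subtract(counts[i])
            for c in counts[i]:
                if pool[c] == 0:
                    del pool[c]                     # keep the pool free of zero entries
            result.append(words[i])
            yield from search(i)                    # i, not i + 1: a word may repeat
            result.pop()                            # backtrack: restore result ...
            pool.update(counts[i])                  # ... and the pool

    yield from search(0)

word_list = ["pea", "nut", "butter", "a", "but", "ten", "erupt", "tea"]
for combo in find_anagrams("peanutbutter", word_list):
    print(" ".join(combo))
```

Each combination is produced once, in dictionary order; permuting the words of a combination (subproblem #2) is left to the caller.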
Instead of a two stage solution where you generate permutations and then try and break them into words, you could speed it up by checking for valid words as you recursively generate the permutations. If at any point your current partially-complete permutation does not correspond to any valid words, stop there and do not recurse any further. This means you don't waste time generating useless permutations. For example, if you generate "tt", there is no need to permute "peanubuter" and append all the permutations to "tt" because there are no English words beginning with tt.
Suppose you are doing basic recursive permutation generation, keeping track of the current partial word you have generated. If at any point it is a valid word, you can output a space and start a new word, and recursively permute the remaining characters. You can also try adding each of the remaining characters to the current partial word, and only recurse if doing so results in a valid partial word (i.e. a word exists starting with those characters).
Something like this (pseudo-code):
void generateAnagrams(String partialAnagram, String currentWord, String remainingChars)
{
    // at each point, you can either output a space, or each of the remaining chars:
    // if the current word is a complete valid word, you can output a space
    if(isValidWord(currentWord))
    {
        // if there are no more remaining chars, output the anagram:
        if(remainingChars.length == 0)
        {
            outputAnagram(partialAnagram);
        }
        else
        {
            // output a space and start a new word
            generateAnagrams(partialAnagram + " ", "", remainingChars);
        }
    }
    // for each of the chars in remainingChars, check if it can be
    // added to currentWord, to produce a valid partial word (i.e.
    // there is at least 1 word starting with these characters)
    for(i = 0 to remainingChars.length - 1)
    {
        char c = remainingChars[i];
        if(isValidPartialWord(currentWord + c))
        {
            generateAnagrams(partialAnagram + c, currentWord + c,
                             remainingChars.remove(i));
        }
    }
}
You could call it like this
generateAnagrams("", "", "peanutbutter");
You could optimize this algorithm further by passing the node in the trie corresponding to the current partially completed word, as well as passing currentWord as a string. This would make your isValidPartialWord check even faster.
You can enforce uniqueness by changing your isValidWord check to only return true if the word is in ascending (greater or equal) alphabetic order compared to the previous word output. You might also need another check for dupes at the end, to catch cases where two of the same word can be output.
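A compact Python sketch of this pruned generation follows. It makes some simplifying assumptions: generate_anagrams is a hypothetical name, a set of all word prefixes stands in for the trie, and results are collected in a set to drop exact duplicates. Different orderings of the same words are still reported separately, since the ordering check described above is not implemented.

```python
def generate_anagrams(letters, dictionary):
    """Return the set of phrases that use all letters, pruning dead prefixes."""
    words = set(dictionary)
    # Every prefix of every word; a set lookup stands in for walking a trie.
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}
    found = set()

    def extend(phrase, current, remaining):
        if current in words:
            if not remaining:
                found.add(" ".join(phrase + [current]))
            else:
                extend(phrase + [current], "", remaining)   # start a new word
        tried = set()
        for i, c in enumerate(remaining):
            if c in tried:          # the same letter at another position adds nothing new
                continue
            tried.add(c)
            if current + c in prefixes:                     # prune: no word starts this way
                extend(phrase, current + c, remaining[:i] + remaining[i + 1:])

    extend([], "", sorted(letters))
    return found

print(sorted(generate_anagrams("nutpea", ["pea", "nut", "a"])))
# ['nut pea', 'pea nut']
```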

How to split a string into a list of words in TCL, ignoring multiple spaces?

Basically, I have a string that consists of multiple, space-separated words. The thing is, however, that there can be multiple spaces instead of just one separating the words. This is why [split] does not do what I want:
split "a    b"
gives me this:
{a {} {} {} b}
instead of this:
{a b}
Searching Google, I found a page on the Tcler's wiki, where a user asked more or less the same question.
One proposed solution would look like this:
split [regsub -all {\s+} "a    b" " "]
which seems to work for simple strings. But a test string such as [string repeat " " 4] (used string repeat because StackOverflow strips multiple spaces) will result in regsub returning " ", which split would again split up into {{} {}} instead of an empty list.
Another proposed solution was this one, to force a reinterpretation of the given string as a list:
lreplace "a list with many spaces" 0 -1
But if there's one thing I've learned about TCL, it is that you should never use list functions (starting with l) on strings. And indeed, this one will choke on strings containing special characters (namely { and }):
lreplace "test \{a b\}"
returns test {a b} instead of test \{a b\} (which would be what I want, every space-separated word split up into a single element of the resulting list).
Yet another solution was to use a 'filter':
proc filter {cond list} {
    set res {}
    foreach element $list {if [$cond $element] {lappend res $element}}
    set res
}
You'd then use it like this:
filter llength [split "a list with many spaces"]
Again, same problem. This would call llength on a string, which might contain special characters (again, { and }) - passing it "\{a b\}" would result in TCL complaining about an "unmatched open brace in list".
I managed to get it to work by modifying the given filter function, adding a {*} in front of $cond in the if, so I could use it with string length instead of llength, which seemed to work for every possible input I've tried to use it on so far.
Is this solution safe to use as it is now? Would it choke on some special input I didn't test so far? Or, is it possible to do this right in a simpler way?
The easiest way is to use regexp -all -inline to select and return all words. For example:
# The RE matches any non-empty sequence of non-whitespace characters
set theWords [regexp -all -inline {\S+} $theString]
If you instead define words to be sequences of alphanumerics, use this for the regular expression term: {\w+}
You can use regexp instead:
From tcl wiki split:
Splitting by whitespace: the pitfalls
split { abc def  ghi}
{} abc def {} ghi
Usually, if you are splitting by whitespace and do not want those blank fields, you are better off doing:
regexp -all -inline {\S+} { abc def  ghi}
abc def ghi

Lua frontier pattern match (whole word search)

Can someone help me with this, please?
s_test = "this is a test string this is a test string "
function String.Wholefind(Search_string, Word)
    _, F_result = string.gsub(Search_string, '%f[%a]'..Word..'%f[%A]',"")
    return F_result
end
A_test = String.Wholefind(s_test,"string")
output: A_test = 2
So the frontier pattern finds the whole word no problem and gsub counts the whole words no problem but what if the search string has numbers?
s_test = " 123test 123test 123"
B_test = String.Wholefind(s_test,"123test")
output: B_test = 0
It seems to work only if the numbers aren't at the start or end of the search string.
Your pattern doesn't match because you are trying to do the impossible.
After including your variable value, the pattern looks like this: %f[%a]123test%f[%A]. Which means:
%f[%a] - find a transition from a non letter to a letter
123 - find 123 at the position after transition from a non letter to a letter. This itself is a logical impossibility as you can't match a transition to a letter when a non-letter follows it.
Your pattern (as written) will not work for any word that starts or ends with a non-letter.
If you need to search for fragments that include letters and numbers, then your pattern needs to be changed to something like '%f[%S]'..Word..'%f[%s]'.

Resources