I'm trying to create a string replacer that accepts multiple replacements.
The idea is that it would scan the string to find substrings and replace those substrings with another substring.
For example, I should be able to ask it to replace every "foo" with "bar". Doing that is trivial.
The issue starts when I try to add multiple replacements to this function. If I ask it to replace "foo" with "bar" and "bar" with "biz", running those replacements in sequence would result in "foo" turning into "biz", and this behavior is unintended.
I tried splitting the string into words and running each replacement function on each word. However, that's not bulletproof either: it still results in unintended behavior, since you can ask it to replace substrings that are not whole words. Also, I find that very inefficient.
I'm thinking of some way of running each replacer once over the whole string, storing those changes and then merging them. However, I think I'm overengineering.
Searching the web only gives me trivial results on how to use string.replace with regular expressions, which doesn't solve my problem.
Is this an already solved problem? Is there an algorithm that can do this string manipulation efficiently?
If you modify your string while searching for all occurrences of the substrings to be replaced, you'll end up searching in already-modified states of the string. An easy way out is to collect the indexes of every match first, then iterate over those indexes and make the replacements. That way, the indexes for "bar" have already been computed and won't be affected even if you later replace some other substring with "bar".
Adding a rough Python implementation to give you an idea:
import re

string = "foo bar biz"
replacements = [("foo", "bar"), ("bar", "biz")]

# Record every match position against the ORIGINAL string first.
matches = []
for old, new in replacements:
    for m in re.finditer(re.escape(old), string):
        matches.append((m.start(), old, new))

# Apply the edits left to right, tracking how much the length has shifted so far.
temp = list(string)
offset = 0
for index, old, new in sorted(matches):
    temp[offset + index:offset + index + len(old)] = list(new)
    offset += len(new) - len(old)

print(''.join(temp))  # "bar biz biz"
Here's the approach I would take.
I start with my text and the set of replacements:
string text = "alpha foo beta bar delta";
Dictionary<string, string> replacements = new()
{
    { "foo", "bar" },
    { "bar", "biz" },
};
Now I create an array of parts that are either "open" or not. Open parts can have their text replaced.
var parts = new List<(string text, bool open)>
{
    (text: text, open: true)
};
Now I run through each replacement and build a new parts list. If the part is open I can do the replacements, if it's closed just add it in untouched. It's this last bit that prevents double mapping of replacements.
Here's the main logic:
foreach (var replacement in replacements)
{
    var parts2 = new List<(string text, bool open)>();
    foreach (var part in parts)
    {
        if (part.open)
        {
            bool skip = true;
            foreach (var split in part.text.Split(new[] { replacement.Key }, StringSplitOptions.None))
            {
                if (skip)
                {
                    skip = false;
                }
                else
                {
                    parts2.Add((text: replacement.Value, open: false));
                }
                parts2.Add((text: split, open: true));
            }
        }
        else
        {
            parts2.Add(part);
        }
    }
    parts = parts2;
}
That produces the following parts: ("alpha ", open), ("bar", closed), (" beta ", open), ("biz", closed), (" delta", open).
Now it just needs to be joined back up again:
string result = String.Concat(parts.Select(p => p.text));
That gives:
alpha bar beta biz delta
As requested.
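For readers who do not use C#, the same open/closed parts idea can be sketched in Python. This is only a rough translation under stated assumptions: the function name replace_all is made up for illustration, the replacement keys are assumed to be non-empty strings, and the replacements are applied in the dictionary's insertion order.

```python
def replace_all(text, replacements):
    """Apply every (old -> new) pair without re-replacing inserted text."""
    # Each part is (text, is_open); only open parts may still be edited.
    parts = [(text, True)]
    for old, new in replacements.items():
        next_parts = []
        for chunk, is_open in parts:
            if not is_open:
                # Already the result of a replacement: keep untouched.
                next_parts.append((chunk, False))
                continue
            for i, piece in enumerate(chunk.split(old)):
                if i > 0:
                    # Every gap between pieces is where `old` used to be.
                    next_parts.append((new, False))
                next_parts.append((piece, True))
        parts = next_parts
    return "".join(chunk for chunk, _ in parts)


result = replace_all("alpha foo beta bar delta", {"foo": "bar", "bar": "biz"})
print(result)  # alpha bar beta biz delta
```

Because inserted text is marked closed, even a swap such as {"foo": "bar", "bar": "foo"} does not undo itself.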
Let's suppose your given string were
str = "Mary had fourteen little lambs"
and the desired replacements were given by the following hash (aka hashmap):
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
indicating that we want to replace "Mary" (wherever it appears in the string, if at all) with "Butch", and so on. We therefore want to return the following string:
"Butch had fourteen wee hippos"
Notice that we do not want 'fourteen' to be replaced with 'threeteen' and we want the extra spaces between 'fourteen' and 'wee' to be preserved.
First collect the keys of the hash h into an array (or list):
keys = h.keys
#=> ["Mary", "four", "little", "lambs"]
Most languages have a method or function sub or gsub that works something like the following:
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch had fourteen wee hippos"
The regular expression /\w+/ (r'\w+' in Python, for example) matches one or more word characters, as many as possible (i.e., a greedy match). Word characters are letters, digits and the underscore ('_'). It therefore will sequentially match 'Mary', 'had', 'fourteen', 'little' and 'lambs'.
Each matched word is passed to the "block" do |word| ... end and is held by the variable word. The block then computes and returns the string that is to replace the value of word in a duplicate of the original string. Different languages use different structures and formats to do this, of course.
The first word passed to the block by gsub is 'Mary'. The following calculation is then performed:
if keys.include?("Mary") # true
  # so replace "Mary" with:
  h[word] #=> "Butch"
else # not executed
  # not executed
end
Next, gsub passes the word 'had' to the block and assigns that string to the variable word. The following calculation is then performed:
if keys.include?("had") # false
  # not executed
else
  # so replace "had" with:
  "had"
  # that is, leave "had" unchanged
end
Similar calculations are made for each word matched by the regular expression.
We see that punctuation and other non-word characters are not a problem:
str = "Mary, had fourteen little lambs!"
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch, had fourteen wee hippos!"
We can see that gsub does not perform replacements sequentially:
h = { "foo"=>"bar", "bar"=>"baz" }
keys = h.keys
#=> ["foo", "bar"]
"foo bar".gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "bar baz"
Note that a linear search of keys is required to evaluate
keys.include?("Mary")
This could be relatively time-consuming if keys has many elements.
In most languages this can be sped up by making keys a set (an unordered collection of unique elements). Determining whether a set contains a given element is quite fast, comparable to determining if a hash has a given key.
An alternative formulation is to write
str.gsub(/\b(?:Mary|four|little|lambs)\b/) { |word| h[word] }
#=> "Butch had fourteen wee hippos"
where the regular expression is constructed programmatically from h.keys. This regular expression reads, "match one of the four words indicated, preceded and followed by a word boundary (\b)". The trailing word boundary prevents 'four' from matching 'fourteen'. Since gsub now only considers the replacement of those four words, the block can be simplified to { |word| h[word] }.
Again, this preserves punctuation and extra spaces.
If for some reason we wanted to be able to replace parts of words (e.g., to replace 'fourteen' with 'threeteen'), simply remove the word boundaries from the regular expression:
str.gsub(/Mary|four|little|lambs/) { |word| h[word] }
#=> "Butch had threeteen wee hippos"
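The text above says the alternation is constructed programmatically from the keys without showing how. A minimal Python sketch of that construction follows. The helper name multi_replace and its whole_words flag are made up for illustration. It assumes a non-empty mapping and, for the whole-word variant, keys that begin and end with word characters. Sorting the keys longest-first is an extra precaution not present in the Ruby version, so that a longer key wins over a shorter key that is its prefix.

```python
import re


def multi_replace(text, mapping, whole_words=True):
    """Replace every key of `mapping` in a single pass over `text`."""
    # Longest keys first, so a longer key is tried before its own prefix.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape keeps keys containing regex metacharacters literal.
    alternation = "|".join(re.escape(k) for k in keys)
    pattern = rf"\b(?:{alternation})\b" if whole_words else alternation
    # The callback looks up the replacement for whichever key matched.
    return re.sub(pattern, lambda m: mapping[m.group()], text)


h = {"Mary": "Butch", "four": "three", "little": "wee", "lambs": "hippos"}
s = "Mary had fourteen   little lambs"
whole = multi_replace(s, h)                        # 'four' inside 'fourteen' is left alone
partial = multi_replace(s, h, whole_words=False)   # 'fourteen' becomes 'threeteen'
```

Since the string is scanned once and replaced text is never rescanned, {"foo": "bar", "bar": "baz"} turns "foo bar" into "bar baz" rather than "baz baz".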
Naturally, different languages provide variations of this approach. In Ruby, for example, one could write:
g = Hash.new { |h,k| k }.merge(h)
This creates a hash g that has the same key-value pairs as h but has the additional property that if g does not have a key k, g[k] (the value of key k) returns k. That allows us to write simply:
str.gsub(/\w+/, g)
#=> "Butch had fourteen wee hippos"
See the second version of String#gsub.
A different approach (which I will show is problematic) is to construct an array (or list) of words from the string, replace those words as appropriate and then rejoin the resulting words to form a string. For example,
words = str.split
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
arr.join(' ')
#=> "Butch had fourteen wee hippos"
This produces similar results except the extra spaces have been removed.
Now suppose the string contained punctuation:
str = "Mary, had fourteen little lambs!"
words = str.split
#=> ["Mary,", "had", "fourteen", "little", "lambs!"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Mary,", "had", "fourteen", "wee", "lambs!"]
arr.join(' ')
#=> "Mary, had fourteen wee lambs!"
We could deal with punctuation by writing
words = str.scan(/\w+/)
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
Here str.scan returns an array of all matches of the regular expression /\w+/ (one or more word characters). The obvious problem is that all punctuation has been lost by the time arr.join(' ') is executed.
You can achieve this in a simple way by using regular expressions:
import re
replaces = {'foo' : 'bar', 'alfa' : 'beta', 'bar': 'biz'}
original_string = 'foo bar, alfa foo. bar other.'
expected_string = 'bar biz, beta bar. biz other.'
replaced = re.compile(r'\w+').sub(lambda m: replaces[m.group()] if m.group() in replaces else m.group(), original_string)
assert replaced == expected_string
I haven't checked the performance, but I believe it is probably faster than using "nested for loops".
I have a large set of JavaScript snippets each containing a line like:
function('some string without numbers', '123,71')
and I'm hoping to put together a regex to pull the numbers from the second argument. The second argument can contain an arbitrary number of comma-separated numbers (including zero numbers), so the following are all valid:
''
'2'
'17,888'
'55,1,6000'
...
The regex '(?:\d+|,)*' successfully matches the quoted numbers, but I have no idea how to match each of the numbers. Placing a capture group around the \d+ seems to capture the last number (if there is one present -- it doesn't work if the second argument is just ''), but none of the others.
In your case, you may match and capture the digits inside the single quotes and then split them with a comma:
var s = "function('some string without numbers', '123,71')";
var res = s.match(/'([\d,]+)'/) || ["", ""];
console.log(res[1].split(','));
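The /'([\d,]+)'/ regex will match a ', then one or more digits or commas (placing that value into Group 1) and then a closing '. When nothing matches, for example when the second argument is '', the || ["", ""] fallback keeps res[1] defined.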
If you want to run the regex globally, use
var s = "function('some string without numbers', '123,71')\nfunction('some string without numbers', '13,4,0')";
var rx = /'([\d,]+)'/g;
var res = [], m;
while ((m = rx.exec(s)) !== null) {
  res.push(m[1].split(','));
}
console.log(res);
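The match-then-split idea carries over to other languages. Below is a rough Python sketch, not a drop-in replacement for the JavaScript above: the names SECOND_ARG and extract_numbers are invented for illustration, and the pattern assumes the numbers are always the last, single-quoted argument before the closing parenthesis. Using * instead of + lets the empty argument '' match too, which the question lists as valid input.

```python
import re

# Assumed shape: ", '<digits and commas>')" at the end of each call.
SECOND_ARG = re.compile(r",\s*'([\d,]*)'\s*\)")


def extract_numbers(source):
    """Return one list of ints per matched call, including empty lists."""
    results = []
    for m in SECOND_ARG.finditer(source):
        # "".split(",") gives [""], so drop empty pieces before converting.
        results.append([int(n) for n in m.group(1).split(",") if n])
    return results


snippet = (
    "function('some string without numbers', '123,71')\n"
    "function('another string', '')"
)
numbers = extract_numbers(snippet)
print(numbers)  # [[123, 71], []]
```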
If you have the numbers in a variable x like this:
var x = '55,1,6000';
then use this to have the list of numbers:
var array = x.split(',');
If you can have some whitespace before/after the comma then use:
var array = x.split(/\s*,\s*/); // a regex literal; the quoted string '\s*,\s*' would be treated as plain text
or something like that.
Sometimes it is easier to match the thing that you don't want and split on that.
I've got this string :
var str:String = mySharedObject.data.theDate;
where mySharedObject.data.theDate can contain the word January, February, March..etc (depends on which button the user clicked).
Is it possible to tell my code to replace "January" with "1" (if mySharedObject contains the word January), "February" with "2", etc.?
The most basic way to do what you'd like is to use the replace method of a string.
str = str.replace("January","1");
Now, you could repeat or chain together for all 12 months (eg str = str.replace("January","1").replace("February","2").replace("March","3")..etc) or you could do it in a loop:
//have an array of all 12 months in order
var months:Array = ["January","February","March","April","May"]; //etc
//create a function to replace all months with the corresponding number
function swapMonthForNumber(str:String):String {
    //do the same line of code for every item in the array
    for(var i:int=0;i<months.length;i++){
        //i is the index, which is 0 based, so we have to add 1 to make the right month number
        str = str.replace(months[i],String(i+1));
    }
    //return the updated string
    return str;
}
var str:String = swapMonthForNumber(mySharedObject.data.theDate);
Now, there are a few other ways to replace strings in ActionScript that are all a little different in terms of complexity and performance, but if you're just getting started I would stick with the replace method.
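The only possible caveat with replace is that, when given a string pattern, it only replaces the first instance of the word, so if your string was "January January January", it would come out as "1 January January". To replace every instance, pass a regular expression with the global flag instead, for example str.replace(/January/g, "1").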
I need to split a uint into a list of bits (a list of chars, where every char is "0" or "1", is also OK). The way I try to do it is to convert the uint to a string first, using the binary representation for numeric types, bin(), and then to split it using str_split_all():
var num : uint(bits:4) = 0xF; // Can be any number
print str_split_all(bin(num), "/w");
("/w" is string match pattern that means any char).
The output I expect:
"0"
"b"
"1"
"1"
"1"
"1"
But the actual output is:
0. "0b1111"
Why doesn't it work? Thank you for your help.
If you want to split an integer into a list of bits, you can use the %{...} operator:
var num_bits : list of bit = %{num};
You can find a working example on EDAPlayground.
As an extra clarification to your question, "/w" doesn't mean match any character. The string "/\w/" means match any single character in AWK Syntax. If you put that into your match expression, you'll get (almost) the output you want, but with some extra blanks interleaved (the separators).
Regardless, if you want to split a string into its constituting characters, str_split_all(...) isn't the way to go. It's easier to convert the string into ASCII characters and then convert those back to string again:
extend sys {
    run() is also {
        var num : uint(bits:4) = 0xF; // Can be any number
        var num_bin : string = bin(num);
        var num_bin_chars := num_bin.as_a(list of byte);
        for each (char) in num_bin_chars {
            var char_as_string : string;
            unpack(packing.low, %{8'b0, char}, char_as_string);
            print char_as_string;
        };
    };
};
The unpack(...) syntax is taken directly from the e Reference Manual, Section 2.8.3, "Type Conversion Between Strings and Scalars or Lists of Scalars".
I need to sort my linked list. The problem is that each element of the list is a String containing a sentence. So the question is: how do I detect the number in each element and get its value?
I tried to split each element of my linked list so I can pass through each word.
private LinkedList<String> list = new LinkedList<String>();
list.add("Number One: 1");
list.add("Number Three: 3");
list.add("Number two:2");
for (Iterator<String> iterator = list.iterator(); iterator.hasNext(); )
{
    String string = iterator.next();
    for (String word : string.split(" ")) {
    }
}
I also tried "if((word.contains("1") || (word.contains("2")...." inside the for loop, and then passing the value of "word" to Double, but I don't think that is very smart.
So my goal is this output (Number One: 1, Number Two: 2, Number Three: 3); therefore I need the value of each number first.
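Why not try to parse each word as an integer? Java has no TryParse (that is a C# method), so use Integer.parseInt and catch the exception:
for (String word : string.split(" ")) {
    try {
        int outValue = Integer.parseInt(word);
        // Do something with outValue
    } catch (NumberFormatException e) {
        // Not a number; skip this word
    }
}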
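To show the whole task end to end (find the number in each sentence, then sort by it), here is a short sketch in Python rather than Java. The helper name trailing_number is made up for illustration, and the sketch assumes that the last run of digits in each sentence is the value to sort by. Searching for digits directly, instead of splitting on spaces, also copes with an element like "Number two:2", where the number is not a separate word.

```python
import re


def trailing_number(sentence):
    """Return the last run of digits in `sentence` as an int."""
    found = re.findall(r"\d+", sentence)
    if not found:
        raise ValueError(f"no number in {sentence!r}")
    return int(found[-1])


items = ["Number One: 1", "Number Three: 3", "Number two:2"]
# Sort the sentences by the number each one contains.
ordered = sorted(items, key=trailing_number)
print(ordered)  # ['Number One: 1', 'Number two:2', 'Number Three: 3']
```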