Substitute substring within a string bidrectionally [duplicate] - string

This question already has answers here:
Replace multiple strings with multiple other strings
(27 answers)
Closed 7 years ago.
Given a string M that contains term A and B, I would like to substitute every A for B and every B for A to for M'. Naively one would try replacing A by B and then subsequently B by A but in that case the M' contains only of A. I can think of replacing the terms and record their position so that the terms do not get replaced again. This works when we only have A and B to replace. But if we need to substitute more than 2 terms and they are of different length then it gets tricky.
So I thought about doing this:
We are given M as input string and R = [(x1, y1), (x2, y2), ... (xn, yn)] as terms to replace, where we replace xi for yi for all i.
With M, Initiate L = [(M, false)] to be a list of (string * boolean) tuple where false means that this string has not been replaced.
Search for occurence of xi in each member L(i) of L with second term false. Partition L(i) into [(pre, false), (xi, false), (post, false)], and map to [(pre, false), (yi, true), (post, false)] where pre and post are string before and after xi. Flatten L.
Repeat the above until R is exhausted.
Concatenate the first element of each tuple of L to from M'.
Is there a more efficient way of doing this?

Here's a regex solution:
var M = 'foobazbar123foo match';
var pairs = {
'foo': 'bar',
'bar': 'foo',
'baz': 'quz',
'no': 'match'
};
var re = new RegExp(Object.keys(pairs).join('|'),'g');
alert(M.replace(re, function(m) { return pairs[m]; }));
Note: This is a demonstration / POC. A real implementation would have to handle proper escaping of the lookup strings.

Another approach is to replace strings by intermediate temporary strings (symbols) and then replace symbols by their original counterparts. So the transformation 'foo' => 'bar' can be transformed in two steps as, say, 'foo' => '___1' => 'bar'. The other transformation 'bar' ==> 'foo' will then become 'bar' ==> '___2' ==> 'foo'. This will prevent the mixup you describe.
Sample python code for the same example as the other answer follows:
import re
def substr(string):
repdict = {"foo":"bar", "bar":"foo", "baz":"quz", "no":"match"}
tmpdict = dict()
count = 0
for left, right in repdict.items():
tmpleft = "___" + count.__str__()
tmpdict[tmpleft] = right
count = count + 1
tmpright = "___" + count.__str__()
tmpdict[tmpright] = left
count = count + 1
string = re.sub(left, tmpleft, string)
string = re.sub(right, tmpright, string)
for tmpleft, tmpright in tmpdict.items():
string = re.sub(tmpleft, tmpright, string)
print string
>>> substr("foobazbar123foo match")
barquzfoo123bar no

Related

how to add characters from array into one string python

I'm trying to change characters from x into upper or lower character depending whether they are in r or c. And the problem is that i can't get all the changed characters into one string.
import unittest
def fun_exercise_6(x):
y = []
r = 'abcdefghijkl'
c = 'mnopqrstuvwxz'
for i in range(len(x)):
if(x[i] in r):
y += x[i].lower()
elif(x[i] in c):
y += x[i].upper()
return y
class TestAssignment1(unittest.TestCase):
def test1_exercise_6(self):
self.assertTrue(fun_exercise_6("osso") == "OSSO")
def test2_exercise_6(self):
self.assertTrue(fun_exercise_6("goat") == "gOaT")
def test3_exercise_6(self):
self.assertTrue(fun_exercise_6("bag") == "bag")
def test4_exercise_6(self):
self.assertTrue(fun_exercise_6("boat") == "bOaT" )
if __name__ == '__main__':
unittest.main()
Using a list as you are using is probably the best approach while you are figuring out whether or not each character should be uppered or lowered. You can join your list using str's join method. In your case, you could have your return statement look like this:
return ''.join(y)
What this would do is join a collection of strings (your individual characters into one new string using the string you join on ('').
For example, ''.join(['a', 'b', 'c']) will turn into 'abc'
This is a much better solution than making y a string as strings are immutable data types. If you make y a string when you are constructing it, you would have to redefine and reallocate the ENTIRE string each time you appended a character. Using a list, as you are doing, and joining it at the end would allow you to accumulate the characters and then join them all at once, which is comparatively very efficient.
If you define y as an empty string y = "" instead of an empty list you will get y as one string. Since when you declare y = [] and add an item to the list, you add a string to a list of string not a character to a string.
You can't compare a list and a string.
"abc" == ["a", "b", "c'] # False
The initial value of y in the fun_exercise_6 function must be ""

count number of chars in String

In SML, how can i count the number of appearences of chars in a String using recursion?
Output should be in the form of (char,#AppearenceOfChar).
What i managed to do is
fun frequency(x) = if x = [] then [] else [(hd x,1)]#frequency(tl x)
which will return tupels of the form (char,1). I can too eliminate duplicates in this list, so what i fail to do now is to write a function like
fun count(s:string,l: (char,int) list)
which 'iterates' trough the string incrementing the particular tupel component. How can i do this recursively? Sorry for noob question but i am new to functional programming but i hope the question is at least understandable :)
I'd break the problem into two: Increasing the frequency of a single character, and iterating over the characters in a string and inserting each of them. Increasing the frequency depends on whether you have already seen the character before.
fun increaseFrequency (c, []) = [(c, 1)]
| increaseFrequency (c, ((c1, count)::freqs)) =
if c = c1
then (c1, count+1)
else (c1,count)::increaseFrequency (c, freqs)
This provides a function with the following type declaration:
val increaseFrequency = fn : ''a * (''a * int) list -> (''a * int) list
So given a character and a list of frequencies, it returns an updated list of frequencies where either the character has been inserted with frequency 1, or its existing frequency has been increased by 1, by performing a linear search through each tuple until either the right one is found or the end of the list is met. All other character frequencies are preserved.
The simplest way to iterate over the characters in a string is to explode it into a list of characters and insert each character into an accumulating list of frequencies that starts with the empty list:
fun frequencies s =
let fun freq [] freqs = freqs
| freq (c::cs) freqs = freq cs (increaseFrequency (c, freqs))
in freq (explode s) [] end
But this isn't a very efficient way to iterate a string one character at a time. Alternatively, you can visit each character by indexing without converting to a list:
fun foldrs f e s =
let val len = size s
fun loop i e' = if i = len
then e'
else loop (i+1) (f (String.sub (s, i), e'))
in loop 0 e end
fun frequencies s = foldrs increaseFrequency [] s
You might also consider using a more efficient representation of sets than lists to reduce the linear-time insertions.

Scala String Similarity

I have a Scala code that computes similarity between a set of strings and give all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String],z.reverse)) {
case ((acc, zt), zz) =>
if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc
else zz :: acc, zt.tail
}._1
I'll try to explain what is going on here :
This uses a fold over the reversed input data, starting from the empty String (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is like in first iteration, if 1st, 4th and 8th strings are similar, I am getting only the 1st string. Instead of it, I should get a set of (1st,4th,8th), then if 2nd,5th,14th and 21st strings are similar, I should get a set of (2nd,5th,14th,21st).
If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-Tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceeding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]],z.reverse,List.empty[String])) {
case ((acc, zt, scanned), zz) =>
val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
(if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
case other if similarity(s, other) < threshold => other
}.toSet ).values.toList
All of this assumes that the function:
f(a: String, b: String): Boolean = similarity(a, b) < threshold
Is commutative and transitive, i.e.:
f(a, b) && f(a. c) means that f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))

How can i determine if a string is a concatenation of a string list

Suppose we are given a string S, and a list of some other strings L.
How can we know if S is a one of all the possible concatenations of L?
For example:
S = "abcdabce"
L = ["abcd", "a", "bc", "e"]
S is "abcd" + "a" + "bc" + "e", then S is a concatenation of L, whereas "ababcecd" is not.
In order to solve this question, I tried to use DFS/backtracking. The pseudo code is as follows:
boolean isConcatenation(S, L) {
if (L.length == 1 && S == L[0]) return true;
for (String s: L) {
if (S.startwith(s)) {
markAsVisited(s);
if (isConcatnation(S.exclude(s), L.exclude(s)))
return true;
markAsUnvisited(s);
}
}
return false;
}
However, DFS/backtracking is not a efficient solution. I am curious what is the fastest algorithm to solve this question or if there is any other algorithm to solve it in a faster way. I hope there are algorithms like KMP, which can solve it in O(n) time.
In python:
>>> yes = 'abcdabce'
>>> no = 'ababcecd'
>>> L = ['abcd','a','bc','e']
>>> yes in [''.join(p) for p in itertools.permutations(L)]
True
>>> no in [''.join(p) for p in itertools.permutations(L)]
False
edit: as pointed out, this is n! complex, so is inappropriate for large L. But hey, development time under 10 seconds.
You can instead build your own permutation generator, starting with the basic permutator:
def all_perms(elements):
if len(elements) <=1:
yield elements
else:
for perm in all_perms(elements[1:]):
for i in range(len(elements)):
yield perm[:i] + elements[0:1] + perm[i:]
And then discard branches that you don't care about by tracking what the concatenation of the elements would be and only iterating if it adds up to your target string.
def all_perms(elements, conc=''):
...
for perm in all_perms(elements[1:], conc + elements[0]):
...
if target.startswith(''.join(conc)):
...
A dynamic programming approach would be to work left to right, building up an array A[x] where A[x] is true if the first x characters of the string form one of the possible concatenations of L. You can work out A[n] given earlier A[n] by checking each possible string in the list - if the characters of S up to the nth character match a candidate string of length k and if A[n-k] is true, then you can set A[n] true.
I note that you can use https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to find the matches you need as input to the dynamic program - the matching costs will be linear in the size of the input string, the total size of all candidate strings, and the number of matches between the input string and candidate strings.
i would try the following:
find all positions of L_i patterns in S
let n=length(S)+1
create a graph with n nodes
for all L_i positions i: directed edges: node_i --> L_i matches node --> node_{i+length(L_i)}
to enable the permutation constrains you have to add some more node/edges to exclude multiple usage of the same pattern
now i can ask a new question: is there exists a directed path from 0 to n ?
notes:
if there exists a node(0 < i < n) with degree <2 then no match is possible
all nodes which have d-=1, d+=1 are part of the permutation
bread first or diskstra to look for the solution
You can use the Trie data structure. First, construct a trie from strings in L.
Then, for the input string S, search for the S in the trie.
During searching, for every visited node which is an end of one of the words in L, call a new search on the trie (from the root) with remaining (yet unmatched) suffix of S. So, we are using recursion. If you consume all characters of S in that process then you know, that S is a contatenation of some strings from L.
I would suggest this solution:
Take an array of size 256 which will store the occurence count of each character in all strings of L. Now try to match that with count of each character of S. If both are unequal then we can confidently say that they cannot form the given character.
If counts are same, Do the following, using KMP algorithm try to find simultaneously each string in L in S. If at any time there is a match we remove that string from L and continue search for other strings in L. If at any time we dont find a match we just print that it cannot be represented. If at the end L is empty we conclude that S indeed is a concatenation of L.
Assuming that L is a set of unique strings.
Two Haskell propositions:
There may be some counter examples to this...just for fun...sort L by a custom sort:
import Data.List (sortBy,isInfixOf)
h s l = (concat . sortBy wierd $ l) == s where
wierd a b | isInfixOf (a ++ b) s = LT
| isInfixOf (b ++ a) s = GT
| otherwise = EQ
More boring...attempt to build S from L:
import Data.List (delete,isPrefixOf)
f s l = g s l [] where
g str subs result
| concat result == s = [result]
| otherwise =
if null str || null subs'
then []
else do sub <- subs'
g (drop (length sub) str) (delete sub subs) (result ++ [sub])
where subs' = filter (flip isPrefixOf str) subs
Output:
*Main> f "abcdabce" ["abcd", "a", "bc", "e", "abc"]
[["abcd","a","bc","e"],["abcd","abc","e"]]
*Main> h "abcdabce" ["abcd", "a", "bc", "e", "abc"]
False
*Main> h "abcdabce" ["abcd", "a", "bc", "e"]
True
Your algorithm has complexity N^2 (N is the length of list). Let's see in actual C++
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
using namespace std;
typedef pair<string::const_iterator, string::const_iterator> stringp;
typedef vector<string> strings;
bool isConcatenation(stringp S, const strings L) {
for (strings::const_iterator p = L.begin(); p != L.end(); ++p) {
auto M = mismatch(p->begin(), p->end(), S.first);
if (M.first == p->end()) {
if (L.size() == 1)
return true;
strings T;
T.insert(T.end(), L.begin(), p);
strings::const_iterator v = p;
T.insert(T.end(), ++v, L.end());
if (isConcatenation(make_pair(M.second, S.second), T))
return true;
}
}
return false;
}
Instead of looping on the entire vector, we could sort it, then reduce the search to O(LOG(N)) steps in the optimum case, where all strings start with different chars. The worst case will remain O(N^2).

Scala split string to tuple

I would like to split a string on whitespace that has 4 elements:
1 1 4.57 0.83
and I am trying to convert into List[(String,String,Point)] such that first two splits are first two elements in the list and the last two is Point. I am doing the following but it doesn't seem to work:
Source.fromFile(filename).getLines.map(string => {
val split = string.split(" ")
(split(0), split(1), split(2))
}).map{t => List(t._1, t._2, t._3)}.toIterator
How about this:
scala> case class Point(x: Double, y: Double)
defined class Point
scala> s43.split("\\s+") match { case Array(i, j, x, y) => (i.toInt, j.toInt, Point(x.toDouble, y.toDouble)) }
res00: (Int, Int, Point) = (1,1,Point(4.57,0.83))
You could use pattern matching to extract what you need from the array:
case class Point(pts: Seq[Double])
val lines = List("1 1 4.34 2.34")
val coords = lines.collect(_.split("\\s+") match {
case Array(s1, s2, points # _*) => (s1, s2, Point(points.map(_.toDouble)))
})
You are not converting the third and fourth tokens into a Point, nor are you converting the lines into a List. Also, you are not rendering each element as a Tuple3, but as a List.
The following should be more in line with what you are looking for.
case class Point(x: Double, y: Double) // Simple point class
Source.fromFile(filename).getLines.map(line => {
val tokens = line.split("""\s+""") // Use a regex to avoid empty tokens
(tokens(0), tokens(1), Point(tokens(2).toDouble, tokens(3).toDouble))
}).toList // Convert from an Iterator to List
case class Point(pts: Seq[Double])
val lines = "1 1 4.34 2.34"
val splitLines = lines.split("\\s+") match {
case Array(s1, s2, points # _*) => (s1, s2, Point(points.map(_.toDouble)))
}
And for the curious, the # in pattern matching binds a variable to the pattern, so points # _* is binding the variable points to the pattern *_ And *_ matches the rest of the array, so points ends up being a Seq[String].
There are ways to convert a Tuple to List or Seq, One way is
scala> (1,2,3).productIterator.toList
res12: List[Any] = List(1, 2, 3)
But as you can see that the return type is Any and NOT an INTEGER
For converting into different types you use Hlist of
https://github.com/milessabin/shapeless

Resources