Pythonic way of finding differences between two strings

a = "3030104AF43B000A3F1D200619D09FE00403031324354650004FFFFF"
b = "3030104BE3B000C3DF1D200617183BA00403030335F5B6F0004FFFFF"
Let's say a and b are two hexadecimal strings of equal, even length. They share the same format, so the differences between them always occur at the same positions, but I do not know in advance which positions those are. For example, the first six digits of a and b are the same (303010). The following six digits differ (4AF43B in a versus 4BE3B0 in b), after which the next two digits are again the same in both (00). This pattern continues until the end of both strings.
I have written code that stores each run of differences as a separate element in a list.
seed = "3030104AF43B000A3F1D200619D09FE00403031324354650004FFFFF"
seed2 = "3030104BE3B000C3DF1D200617183BA00403030335F5B6F0004FFFFF"
seed = seed.rstrip("FF")
seed2 = seed2.rstrip("FF")
differences_list1 = []
differences_list2 = []
sequence1 = ""
sequence2 = ""
for pair in range(int(len(seed) / 2)):
    data_pair1 = seed[pair * 2:(pair * 2) + 2]
    data_pair2 = seed2[pair * 2:(pair * 2) + 2]
    if data_pair1 == data_pair2:
        if sequence1 == "" and sequence2 == "":
            continue
        # here, we know it is not an empty sequence
        differences_list1.append(sequence1)
        differences_list2.append(sequence2)
        sequence1 = ""
        sequence2 = ""
        continue
    # when they are not equal to each other
    sequence1 = sequence1 + data_pair1
    sequence2 = sequence2 + data_pair2
print(str(differences_list1))
print(str(differences_list2))
Output (which I want):
['4AF43B', '0A3F', '19D09FE0', '1324354650']
['4BE3B0', 'C3DF', '17183BA0', '0335F5B6F0']
I got the output I wanted, but I would like to know how I can write this code in a more Pythonic way (specifically for Python 3.9).
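One possible tidier version, offered as a sketch rather than a definitive answer: split both strings into two-character pairs, zip them, and let itertools.groupby collect consecutive pairs that differ. The helper names byte_pairs and differing_runs are made up for illustration. Unlike the loop in the question, this sketch does not strip trailing F characters, and it also keeps a differing run that reaches the very end of the strings.

```python
from itertools import groupby


def byte_pairs(s):
    """Split a hex string into consecutive two-character pairs."""
    return [s[i:i + 2] for i in range(0, len(s) - 1, 2)]


def differing_runs(a, b):
    """Return the runs of pairs where a and b differ, one list per string."""
    runs_a, runs_b = [], []
    pairs = zip(byte_pairs(a), byte_pairs(b))
    # groupby yields maximal runs of pairs that share the same "differs" flag
    for differs, group in groupby(pairs, key=lambda p: p[0] != p[1]):
        if differs:
            left, right = zip(*group)
            runs_a.append("".join(left))
            runs_b.append("".join(right))
    return runs_a, runs_b


a = "3030104AF43B000A3F1D200619D09FE00403031324354650004FFFFF"
b = "3030104BE3B000C3DF1D200617183BA00403030335F5B6F0004FFFFF"
list_a, list_b = differing_runs(a, b)
print(list_a)
print(list_b)
```

For the two strings from the question this prints the same two lists as the desired output.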

Related

Count number of continuous matching elements in two different numbers in Python

Suppose we have two numbers a and b, and we need to count the continuous matching digits between them.
Some examples are shown below:
a = 123456 b = 456 ==> I need the count to be 3 (3 digits matching)
a = 556789 b = 55678 ==> I need the count to be 5 (5 digits matching)
I don't want the unique matching digits, only the continuous ones, and I need the count. Displaying the matching digits would also be helpful. Can this also be done for two different lists of numbers?
I am very new to Python and trying out a few things. Thanks

Given two numbers a and b:
a = 123456
b = 456
First you need to convert them to strings:
a_str = str(a)
b_str = str(b)
Then you need to check if there is a continuous match of b_str in a_str:
if b_str in a_str:
...
Finally you can check the length of b_str:
len(b_str)
This is the complete function:
def count_matching_elements(a, b):
    a_str, b_str = str(a), str(b)
    if b_str in a_str:
        return len(b_str)
    else:
        return -1  # no matches

What you want here is known as the longest common substring. You can find it like this (this code is adapted from "Find common substring between two strings"; the only difference is that you actually want len(answer)):
def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            # Extend the match that starts at string1[i] and string2[j]
            k = 0
            while i + k < len1 and j + k < len2 and string1[i + k] == string2[j + k]:
                k += 1
            if k > len(answer):
                answer = string1[i:i + k]
    return len(answer)
Note that a and b would have to be strings
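If the standard library is acceptable, a shorter route is difflib.SequenceMatcher, whose find_longest_match method returns the longest matching block. This is only a sketch: the function name longest_common_run is made up here, and converting the numbers with str() is an assumption about how the digits should be compared.

```python
from difflib import SequenceMatcher


def longest_common_run(a, b):
    """Return the longest run of digits that appears in both numbers."""
    s1, s2 = str(a), str(b)
    # autojunk=False switches off the heuristic that ignores very frequent items
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    return s1[match.a:match.a + match.size]


run = longest_common_run(123456, 456)
print(run, len(run))
```

This returns the matching digits themselves, so the count asked for in the question is just len() of the result.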

Find the maximum value of K such that sub-sequences A and B exist and should satisfy the mentioned conditions

Given a string S of length n, choose an integer K and two non-empty sub-sequences A and B of length K such that the following conditions are satisfied:
A = B, i.e. for each i the ith character in A is the same as the ith character in B.
Let's denote the indices used to construct A as a1, a2, a3, ..., aK and the indices used to construct B as b1, b2, b3, ..., bK, where each ai and bi is an index into S. If we denote the number of common indices in A and B by M, then M + 1 <= K.
Find the maximum value of K such that it is possible to find sub-sequences A and B which satisfy the above conditions.
Constraints:
0 < N <= 10^5
Things which I observed are:
K = 0 if all the characters in the given string are distinct, e.g. S = abcd.
K = length of S - 1 if all the characters in the string are the same, e.g. S = aaaa.
The value of M cannot be equal to K, because then M + 1 <= K would not be true, i.e. you cannot have sub-sequences A and B that satisfy A = B and a1 = b1, a2 = b2, a3 = b3, ..., aK = bK.
If the string S is a palindrome, then K = (total number of times a character is repeated in the string, counting only characters whose repetition count is > 1) - 1. E.g. for S = tenet, t is repeated 2 times and e is repeated 2 times, so the total number of times a character is repeated = 4 and K = 4 - 1 = 3.
I am having trouble designing the algorithm to solve the above problem.
Let me know in the comments if you need more clarification.

(Update: see O(n) answer.)
We can modify the classic longest common subsequence recurrence to take an extra parameter.
JavaScript code (not memoised) that I hope is self explanatory:
function f(s, i, j, haveUncommon){
  if (i < 0 || j < 0)
    return haveUncommon ? 0 : -Infinity
  if (s[i] == s[j]){
    if (haveUncommon){
      return 1 + f(s, i-1, j-1, true)
    } else if (i == j){
      return Math.max(
        1 + f(s, i-1, j-1, false),
        f(s, i-1, j, false),
        f(s, i, j-1, false)
      )
    } else {
      return 1 + f(s, i-1, j-1, true)
    }
  }
  return Math.max(
    f(s, i-1, j, haveUncommon),
    f(s, i, j-1, haveUncommon)
  )
}
var s = "aabcde"
console.log(f(s, s.length-1, s.length-1, false))

I believe we are just looking for the closest equal pair of characters since the only characters excluded from A and B would be one of the characters in the pair and any characters in between.
Here's O(n) in JavaScript:
function f(s){
  let map = {}
  let best = -1
  for (let i=0; i<s.length; i++){
    if (!map.hasOwnProperty(s[i])){
      map[s[i]] = i
      continue
    }
    best = Math.max(best, s.length - i + map[s[i]])
    map[s[i]] = i
  }
  return best
}
var strs = [
  "aabcde", // 5
  "aaababcd", // 7
  "aebgaseb", // 4
  "aefttfea",
  // aeft fea
  "abcddbca",
  // abcd bca,
  "a" // -1
]
for (let s of strs)
  console.log(`${ s }: ${ f(s) }`)
O(n) solution in Python3:
def compute_maximum_k(word):
    last_occurences = {}
    max_k = -1
    for i in range(len(word)):
        if word[i] not in last_occurences:
            last_occurences[word[i]] = i
            continue
        max_k = max(max_k, (len(word) - i) + last_occurences[word[i]])
        last_occurences[word[i]] = i
    return max_k

def main():
    words = ["aabcde", "aaababcd", "aebgaseb", "aefttfea", "abcddbca", "a", "acbdaadbca"]
    for word in words:
        print(compute_maximum_k(word))

if __name__ == "__main__":
    main()
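To gain some confidence in the closest-pair idea, it can be compared against an exhaustive search on short strings. The sketch below assumes the condition M + 1 <= K simply means that A and B must use different sets of indices. The names closest_pair_k and brute_force_k are made up for this check, and the brute force is exponential, so it is only meant for very small inputs.

```python
from itertools import combinations


def closest_pair_k(word):
    """O(n) closest-equal-pair formula, returning -1 when no character repeats."""
    last_seen = {}
    best = -1
    for i, ch in enumerate(word):
        if ch in last_seen:
            best = max(best, len(word) - i + last_seen[ch])
        last_seen[ch] = i
    return best


def brute_force_k(word):
    """Largest K for which two different index sets spell the same string."""
    for k in range(len(word) - 1, 0, -1):
        seen = set()
        for indices in combinations(range(len(word)), k):
            spelled = "".join(word[i] for i in indices)
            if spelled in seen:  # a second, different index set spells it too
                return k
            seen.add(spelled)
    return -1


for word in ["aabcde", "aaababcd", "aebgaseb", "aefttfea", "abcddbca", "a"]:
    print(word, closest_pair_k(word), brute_force_k(word))
```

Both functions should report the same K for every word in the list.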

A solution for the maximum length substring would be the following:
After building a suffix array you can derive the LCP array. The maximum value in the LCP array corresponds to the K you are looking for. The overall complexity of both constructions is O(n).
A suffix array sorts all suffixes of your string S in ascending order. The longest common prefix array then holds the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in the sorted suffix array. Thus the maximum value in this array corresponds to the length of the longest substring that occurs twice in S.
For a nice example using the word "banana", check out the Wikipedia page on the LCP array.
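A rough illustration of the suffix array plus LCP idea, assuming the goal is the longest substring that occurs at least twice. This sketch sorts the suffixes naively, so it is far from the O(n) constructions mentioned above. The function name longest_repeated_substring is hypothetical.

```python
def longest_repeated_substring(s):
    """Return the longest substring that occurs at least twice in s."""
    # Naive suffix array: start positions ordered by the suffix they begin
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    best_len, best_start = 0, 0
    # Compare each suffix with its neighbour in sorted order (the LCP array)
    for prev, cur in zip(suffix_array, suffix_array[1:]):
        lcp = 0
        while (prev + lcp < len(s) and cur + lcp < len(s)
               and s[prev + lcp] == s[cur + lcp]):
            lcp += 1
        if lcp > best_len:
            best_len, best_start = lcp, cur
    return s[best_start:best_start + best_len]


print(longest_repeated_substring("banana"))
```

For "banana" the largest LCP value is 3, which corresponds to the repeated substring "ana".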

I deleted my previous answer as I don't think we need an LCS-like solution (LCS=longest Common Subsequence).
It is sufficient to find the couple of subsequences (A, B) that differ in one character and share all the others.
The code below finds the solution in O(N) time.
def function(word):
    dp = [0]*len(word)
    lastOccurences = {}
    for i in range(len(dp)-1, -1, -1):
        if i < len(dp)-1:
            # Case 1: word[i] is shared by A and B
            if dp[i+1] > 0:
                dp[i] = 1 + dp[i+1]
            # Case 2: word[i] is paired with its next occurrence
            if word[i] in lastOccurences:
                dp[i] = max(dp[i], len(word)-lastOccurences[word[i]])
        lastOccurences[word[i]] = i
    return dp[0]
dp[i] is equal to 0 when all characters from i to the end of the string are different.
I will explain my code with an example.
For "abcack", there are two cases:
Either the first 'a' is shared by the two subsequences A and B, in which case the solution is 1 + function("bcack").
Or the first 'a' is not shared between A and B, in which case the result is 1 + len("ck"). Why 1 + len("ck")? Because M + 1 <= K is already satisfied, so we just add all the remaining characters. In terms of indices, the subsequences are [0, 4, 5] and [3, 4, 5].
We take the maximum of these two cases.
I scan from right to left to avoid an O(N) search for the current character in the rest of the string: the dict lastOccurences keeps the index of the most recently visited occurrence of each character.

Given two strings, how do I find number of reoccurences of one in another?

For example, s1='abc', s2='kokoabckokabckoab'.
Output should be 3. (number of times s1 appears in s2).
Not allowed to use for or strfind. Can only use reshape,repmat,size.
I thought of reshaping s2, so it would contain all of the possible strings of 3s:
s2 =
kok
oko
koa
oab
.... etc
But I'm having troubles from here..

Assuming you have your matrix reshaped into the format you have in your post, you can replicate s1 and stack the copies so that there are as many rows as in the reshaped s2 matrix, then apply an equality operator. A row that consists of all 1s means that we have found a match, so you simply search for the rows whose sum is equal to the length of s1. Referring back to my post on dividing up a string into overlapping substrings, we can decompose your string into what you have posted in your question like so:
%// Define s1 and s2 here
s1 = 'abc';
len = length(s1);
s2 = 'kokoabckokabckoab';
%// Hankel starts here
c = (1 : len).';
r = (len : length(s2)).';
nr = length(r);
nc = length(c);
x = [ c; r((2:nr)') ]; %-- build vector of user data
cidx = (1:nc)';
ridx = 0:(nr-1);
H = cidx(:,ones(nr,1)) + ridx(ones(nc,1),:); % Hankel subscripts
ind = x(H); % actual data
%// End Hankel script
%// Now get our data
subseqs = s2(ind.');
%// Case where string length is 1
if len == 1
    subseqs = subseqs.';
end
subseqs contains the matrix of overlapping characters that you have alluded to in your post. You've noticed a small bug where if the length of the string is 1, then the algorithm won't work. You need to make sure that the reshaped substring matrix consists of a single column vector. If we ran the above code without checking the length of s1, we would get a row vector, and so simply transpose the result if this is the case.
Now, simply replicate s1 for as many times as we have rows in subseqs so that all of these strings get stacked into a 2D matrix. After, do an equality operator.
eqs = subseqs == repmat(s1, size(subseqs,1), 1);
Now, sum along each row and see which sums are equal to the length of your string. This will produce a single column vector where 1 indicates that we have found a match, and zero otherwise:
sum(eqs, 2) == len
ans =
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
Finally, to add up how many times the substring matched, you just have to add up all elements in this vector:
out = sum(sum(eqs, 2) == len)
out =
2
As such, we have two instances where abc is found in your string.
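For readers more at home in Python, the same overlapping-window idea can be sketched as follows. This is only an illustration of the technique and ignores the restriction on allowed MATLAB functions. The name count_occurrences is made up.

```python
def count_occurrences(needle, haystack):
    """Count (possibly overlapping) occurrences of needle in haystack."""
    n = len(needle)
    # One window per start position, like the rows of the subseqs matrix
    windows = [haystack[i:i + n] for i in range(len(haystack) - n + 1)]
    # A window matches when every character equals the corresponding one in needle
    return sum(window == needle for window in windows)


print(count_occurrences("abc", "kokoabckokabckoab"))
```

For the string from the question this gives 2, in agreement with the MATLAB result above.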

Here is another one,
s1='abc';
s2='bkcokbacaabcsoabckokabckoabc';
[a,b] = ismember(s2,s1);
b = [0 0 b 0 0];
a1=circshift(b,[0 -1]);
a2=circshift(b,[0 -2]);
sum((b==1)&(a1==2)&(a2==3))
It gives 3 for your input and 4 for my example, and it seems to work well if using ismember is okay.

Just for the fun of it: this can be done with nlfilter from the Image Processing Toolbox (I just discovered this function today and am eager to apply it!):
ds1 = double(s1);
ds2 = double(s2);
result = sum(nlfilter(ds2, [1 numel(ds1)], @(x) all(x==ds1)));

Matlab, order cells of strings according to the first one

I have 2 cell of strings and I would like to order them according to the first one.
A = {'a';'b';'c'}
B = {'b';'a';'c'}
idx = [2,1,3] % TO FIND
B=B(idx);
I would like to find a way to find idx...

Use the second output of ismember. ismember tells you whether or not values in the first set are anywhere in the second set. The second output tells you where these values are located if we find anything. As such:
A = {'a';'b';'c'}
B = {'b';'a';'c'}
[~,idx] = ismember(A, B);
Note that there is a minor typo when you declared your cell arrays. You have a colon in between b and c for A and a and c for B. I placed a semi-colon there for both for correctness.
Therefore, we get:
idx =
2
1
3
Benchmarking
We have three very good algorithms here. As such, let's see how this performs by doing a benchmarking test. What I'm going to do is generate a 10000 x 1 random character array of lower case letters. This will then be encapsulated into a 10000 x 1 cell array, where each cell is a single character array. I construct A this way, and B is a random permutation of the elements in A. This is the code that I wrote to do this for us:
letters = char(97 + (0:25));
rng(123); %// Set seed for reproducibility
ind = randi(26, [10000, 1]);
lettersMat = letters(ind);
A = mat2cell(lettersMat, ones(10000,1), 1);
B = A(randperm(10000));
Now... here comes the testing code:
clear all;
close all;
letters = char(97 + (0:25));
rng(123); %// Set seed for reproducibility
ind = randi(26, [10000, 1]);
lettersMat = letters(ind);
A = mat2cell(lettersMat, 1, ones(10000,1));
B = A(randperm(10000));
tic;
[~,idx] = ismember(A,B);
t = toc;
fprintf('ismember: %f\n', t);
clear idx; %// Make sure test is unbiased
tic;
[~,idx] = max(bsxfun(@eq,char(A),char(B)'));
t = toc;
fprintf('bsxfun: %f\n', t);
clear idx; %// Make sure test is unbiased
tic;
[~, indA] = sort(A);
[~, indB] = sort(B);
idx(indA) = indB;
t = toc;
fprintf('sort: %f\n', t);
This is what I get for timing:
ismember: 0.058947
bsxfun: 0.110809
sort: 0.006054
Luis Mendo's approach is the fastest, followed by ismember, and then finally bsxfun. For code compactness, ismember is preferred but for performance, sort is better. Personally, I think bsxfun should win because it's such a nice function to use ;).

This seems to be significantly faster than using ismember (although admittedly less clear than @rayryeng's answer). With thanks to @Divakar for his correction on this answer.
[~, indA] = sort(A);
[~, indB] = sort(B);
idx(indA) = indB; % the k-th smallest element of A sits at indA(k), its match in B at indB(k)
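The two strategies (a lookup in the spirit of ismember, and matching up two sort orders) can be sketched in Python like this. Indices are 0-based here, both function names are hypothetical, and the sketch assumes that B is a permutation of A.

```python
def order_by_lookup(A, B):
    """For each item of A, find the position of its first occurrence in B."""
    position = {}
    for i, item in enumerate(B):
        position.setdefault(item, i)
    return [position[item] for item in A]


def order_by_sorting(A, B):
    """Pair the k-th smallest item of A with the k-th smallest item of B."""
    ind_a = sorted(range(len(A)), key=A.__getitem__)
    ind_b = sorted(range(len(B)), key=B.__getitem__)
    idx = [0] * len(A)
    for a_pos, b_pos in zip(ind_a, ind_b):
        idx[a_pos] = b_pos
    return idx


A = ["b", "c", "a"]
B = ["a", "b", "c"]
idx = order_by_sorting(A, B)
print(idx, [B[i] for i in idx] == A)
```

Reordering B with either index list reproduces A. With repeated elements the two functions may return different, but equally valid, index lists.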

I had to jump in as it seems runtime performance could be a criterion here :)
Assuming that you are dealing with scalar strings (one character in each cell), here's my take. It works even when A and B have elements that are not common to both, and it uses the very powerful bsxfun, so I am really hoping it is runtime-efficient -
[v,idx] = max(bsxfun(@eq,char(A),char(B)'));
idx = v.*idx
Example -
A =
'a' 'b' 'c' 'd'
B =
'b' 'a' 'c' 'e'
idx =
2 1 3 0
For the specific case where every element is common to both A and B, it becomes a one-liner -
[~,idx] = max(bsxfun(@eq,char(A),char(B)'))
Example -
A =
'a' 'b' 'c'
B =
'b' 'a' 'c'
idx =
2 1 3

String lexicographical permutation and inversion

Consider the following function on a string:
int F(string S)
{
    int N = S.size();
    int T = 0;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if (S[i] > S[j])
                T++;
    return T;
}
A string S0 of length N with all pairwise distinct characters has a total of N! unique permutations.
For example "bac" has the following 6 permutations:
bac
abc
cba
bca
acb
cab
Consider these N! strings in lexicographical order:
abc
acb
bac
bca
cab
cba
Now consider the application of F to each of these strings:
F("abc") = 0
F("acb") = 1
F("bac") = 1
F("bca") = 2
F("cab") = 2
F("cba") = 3
Given some string S1 of this set of permutations, we want to find the next string S2 in the set, that has the following relationship to S1:
F(S2) == F(S1) + 1
For example, if S1 == "acb" (F = 1) then S2 == "bca" (F = 1 + 1 = 2)
One way to do this would be to start at one past S1 and iterate through the list of permutations looking for F(S) = F(S1)+1. This is unfortunately O(N!).
By what O(N) function on S1 can we calculate S2 directly?
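For reference, the O(N!) baseline described above can be written as a short Python sketch. The function names are made up, and the search relies on itertools.permutations producing tuples in lexicographic order when its input is sorted.

```python
from itertools import permutations


def inversions(s):
    """F from the question: the number of pairs i < j with s[i] > s[j]."""
    return sum(1 for i in range(len(s))
               for j in range(i + 1, len(s)) if s[i] > s[j])


def next_with_one_more_inversion(s1):
    """Scan the permutations after s1 for the first one with F == F(s1) + 1."""
    target = inversions(s1) + 1
    for perm in permutations(sorted(s1)):
        candidate = "".join(perm)
        if candidate > s1 and inversions(candidate) == target:
            return candidate
    return None  # s1 already has the maximum number of inversions


print(next_with_one_more_inversion("acb"))
```

For "acb" this finds "bca", as in the example. It is useful mainly as a reference to test faster solutions against.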

Suppose the length of S1 is n. The biggest possible value of F(S1) is n(n-1)/2. If F(S1) = n(n-1)/2, then S1 is the last string and there is no next one. But if F(S1) < n(n-1)/2, there is at least one character x that is bigger than a character y and comes immediately after y. Find such an x with the lowest index and swap x and y. Let's see it by example:
S1 == "acb" (F = 1), 1 < 3, so there is a character x that is bigger than another character y and whose index is bigger than y's. Here the x with the smallest index is c, and on the first try you swap it with a (which is smaller than x, so the algorithm finishes here) ==> S2 = "cab", F(S2) = 2.
Now let's test it with S2, cab: x = b, y = a ==> S3 = "cba".
Finding x is not hard: iterate over the input while keeping a variable named min. While the currently visited character is smaller than min, set min to the newly visited character and move on to the next character. The first time you visit a character that is bigger than min, stop iterating; this is x.
This is a sketch in C#:
string NextString(string input)
{
    var min = input[0];
    int i = 1;
    // Skip the decreasing prefix; min always holds the previous character
    while (i < input.Length && input[i] < min)
    {
        min = input[i];
        i++;
    }
    if (i == input.Length) return "There isn't next item";
    // x is bigger than its left neighbour y, so swapping them adds one inversion
    char x = input[i], y = input[i - 1];
    return input.Substring(0, i - 1) + x + y + input.Substring(i + 1);
}

Here's the outline of an algorithm for a solution to your problem.
I'll assume that you have a function to directly return the n-th permutation (given n) and its inverse, ie a function to return n given a permutation. Let these be perm(n) and perm'(n) respectively.
If I've figured it correctly, when you have a 4-letter string to permute the function F goes like this:
F("abcd") = 0
F("abdc") = 1
F(perm(3)) = 1
F(...) = 2
F(...) = 2
F(...) = 3
F(perm(7)) = 1
F(...) = 2
F(...) = 2
F(...) = 3
F(...) = 3
F(...) = 4
F(perm(13)) = 2
F(...) = 3
F(...) = 3
F(...) = 4
F(...) = 4
F(...) = 5
F(perm(19)) = 3
F(...) = 4
F(...) = 4
F(...) = 5
F(...) = 5
F(perm(24)) = 6
In words, when you go from 3 letters to 4 you get 4 copies of the table of values of F, adding (0,1,2,3) to the (1st,2nd,3rd,4th) copy respectively. In the 2nd case, for example, you already have one inversion from putting the 2nd letter in the 1st place; this is simply added to the other inversions, which follow the same pattern as for the original 3-letter strings.
From this outline it shouldn't be too difficult (but I haven't got time right now) to write the function F. Strictly speaking the inverse of F isn't a function as it would be multi-valued, but given n and F(n) there are only a few cases for finding m such that F(m) == F(n)+1. These cases are:
n == N!, where N is the number of letters in the string: there is no next permutation;
F(n+1) < F(n): the sought-for solution is perm(n+(N-1)!);
F(n+1) == F(n): the solution is perm(n+2);
F(n+1) > F(n): the solution is perm(n+1).
I suspect that some of this might only work for 4-letter strings, and that some of these terms will have to be adjusted for K-letter permutations.
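The perm(n) function assumed above can be sketched with the factorial number system. The digits it produces form the Lehmer code of the permutation, and their sum is the inversion count F, which is where the "copies of the table plus (0,1,2,3)" pattern comes from. The 1-based numbering and the return format are choices made for this illustration only.

```python
from math import factorial


def perm(n, letters):
    """Return the n-th (1-based) lexicographic permutation and its Lehmer digits."""
    pool = sorted(letters)
    remaining = n - 1
    chosen, digits = [], []
    for size in range(len(pool), 0, -1):
        # The leading digit says how many smaller unused letters are skipped
        digit, remaining = divmod(remaining, factorial(size - 1))
        digits.append(digit)
        chosen.append(pool.pop(digit))
    return "".join(chosen), digits


for n in (1, 7, 13, 19, 24):
    string, digits = perm(n, "abcd")
    print(n, string, sum(digits))
```

The printed sums (0, 1, 2, 3 and 6) match the values of F listed in the table for those positions.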

This is not O(n), but at least it is O(n²) (where n is the number of elements in the permutation, in your example 3).
First, notice that whenever you place a character in your string, you already know how much of an increase in F that's going to mean -- it's however many characters smaller than that one that haven't been added to the string yet.
This gives us another algorithm to calculate F(n):
used = set()
def get_inversions(S1):
    inv = 0
    for index, ch in enumerate(S1):
        character = ord(ch)-ord('a')
        cnt = sum(1 for x in range(character) if x not in used)
        inv += cnt
        used.add(character)
    return inv
This is not much better than the original version, but it is useful when inverting F. You want the lexicographically smallest string that comes after S1 and has one more inversion, so it makes sense to copy your original string and change it only where that is mandatory. When such changes are required, we should also change the string by the least amount possible.
To do so, let's use the information that the biggest value of F for a string with n letters is n(n-1)/2. Whenever the number of required inversions would be bigger than this amount if we didn't change the original string, this means we must swap a letter at that point. Code in Python:
used = set()
def get_inversions(S1):
    inv = 0
    for index, ch in enumerate(S1):
        character = ord(ch)-ord('a')
        cnt = sum(1 for x in range(character) if x not in used)
        inv += cnt
        used.add(character)
    return inv

def f_recursive(n, S1, inv, ign):
    if n == 0: return ""
    # inversions that cannot be absorbed by the remaining n-1 letters
    delta = inv - (n-1)*(n-2)//2
    if ign:
        cnt = 0
        ch = 0
    else:
        ch = ord(S1[len(S1)-n])-ord('a')
        cnt = sum(1 for x in range(ch) if x not in used)
    for letter in range(ch, len(S1)):
        if letter not in used:
            if cnt < delta:
                cnt += 1
                continue
            used.add(letter)
            if letter != ch: ign = True
            return chr(letter+ord('a'))+f_recursive(n-1, S1, inv-cnt, ign)

def F_inv(S1):
    used.clear()
    inv = get_inversions(S1)
    used.clear()
    return f_recursive(len(S1), S1, inv+1, False)

print(F_inv("acb"))
It can also be made to run in O(n log n) by replacing the innermost loop with a data structure such as a binary indexed tree.
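As a rough idea of what the binary indexed tree variant could look like, here is a sketch that counts inversions (the function F) in O(n log n). It only covers the counting step and not the full inversion of F, and the name count_inversions is made up.

```python
def count_inversions(s):
    """Count pairs i < j with s[i] > s[j] using a binary indexed (Fenwick) tree."""
    # Map each distinct character to a rank 1..m
    ranks = {ch: r for r, ch in enumerate(sorted(set(s)), start=1)}
    tree = [0] * (len(ranks) + 1)
    inv = 0
    # Walk from right to left so the tree holds the characters to the right
    for ch in reversed(s):
        # Prefix query: how many characters already seen are strictly smaller
        i = ranks[ch] - 1
        while i > 0:
            inv += tree[i]
            i -= i & -i
        # Point update: record this character
        i = ranks[ch]
        while i < len(tree):
            tree[i] += 1
            i += i & -i
    return inv


print(count_inversions("bca"))
```

For "bca" this gives 2, the same value as F("bca") in the question.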

Did you try to swap two neighboring characters in the string? It seems that it can help to solve the problem. If you swap adjacent characters S[i] and S[i+1], where S[i] < S[i+1], then F(S) increases by exactly one, because no other pair of indices is affected by this swap.
If I'm not mistaken, F calculates the number of inversions of the permutation.
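The effect of a swap on F is easy to check with a small Python sketch (the helper names are made up). Swapping two neighbouring characters that are in ascending order adds exactly one inversion, while swapping characters that are further apart can add more than one.

```python
def inversions(s):
    """Number of pairs i < j with s[i] > s[j]."""
    return sum(s[i] > s[j] for i in range(len(s)) for j in range(i + 1, len(s)))


def swap(s, i, j):
    """Return a copy of s with the characters at positions i and j exchanged."""
    chars = list(s)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


print(inversions("acb"), inversions(swap("acb", 0, 1)))  # adjacent swap
print(inversions("abc"), inversions(swap("abc", 0, 2)))  # non-adjacent swap
```

The first line shows F going from 1 to 2, the second from 0 to 3.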
