Matlab. Find the indices of a cell array of strings with characters all contained in a given string (without repetition) - string

I have one string and a cell array of strings.
str = 'actaz';
dic = {'aaccttzz', 'ac', 'zt', 'ctu', 'bdu', 'zac', 'zaz', 'aac'};
I want to obtain:
idx = [2, 3, 6, 8];
I have written a very long code that:
finds the elements with length not greater than length(str);
removes the elements with characters not included in str;
finally, for each remaining element, checks the characters one by one
Essentially, it's an almost brute force code and runs very slowly. I wonder if there is a simple way to do it fast.
NB: I have just edited the question to make clear that characters can be repeated n times if they appear n times in str. Thanks Shai for pointing it out.

You can sort the strings and then match them using regular expression. For your example the pattern will be ^a{0,2}c{0,1}t{0,1}z{0,1}$:
u = unique(str);
t = ['^' sprintf('%c{0,%d}', [u; histc(str,u)]) '$'];
s = cellfun(@sort, dic, 'uni', 0);
idx = find(~cellfun('isempty', regexp(s, t)));

I came up with this :
>> g=@(x,y) sum(x==y) <= sum(str==y);
>> h=@(t)sum(arrayfun(@(x)g(t,x),t))==length(t);
>> f=cellfun(@(x)h(x),dic);
>> find(f)
ans =
2 3 6
g & h: check that each letter occurs in the search string no more often than it occurs in str.
f : finally use g and h for each element in dic
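For readers outside MATLAB, the same counting idea can be sketched in Python. This is only an illustration under the assumption that "contained without repetition" means multiset containment; the function name contained_indices is made up for this example, and the indices are returned 1-based to match the MATLAB output.

```python
from collections import Counter

def contained_indices(source, candidates):
    """Return 1-based indices of candidates whose letters all fit in source.

    A letter may be used n times only if it appears n times in source.
    (Hypothetical helper, not part of either answer above.)
    """
    available = Counter(source)
    hits = []
    for position, word in enumerate(candidates, start=1):
        needed = Counter(word)
        # Counter returns 0 for missing letters, so unknown letters fail here.
        if all(available[ch] >= n for ch, n in needed.items()):
            hits.append(position)
    return hits

dic = ['aaccttzz', 'ac', 'zt', 'ctu', 'bdu', 'zac', 'zaz', 'aac']
idx = contained_indices('actaz', dic)
print(idx)  # [2, 3, 6, 8]
```

Counting each word once keeps the work proportional to the total number of characters, which avoids the character-by-character checks described in the question.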

Related

Issue with ASCii in Python3

I am trying to convert a string (a varchar) to ASCII codes. Then I'm trying to make it so that any code that isn't 3 digits gets a 0 in front of it. Then I'm trying to add a 1 to the very beginning of the string, and finally to turn it into one large number that I can apply math to.
I've tried a lot of different coding techniques. The closest I've gotten is below:
s = 'Ak'
for c in s:
    mgk = (''.join(str(ord(c)) for c in s))
num = [mgk]
var = 1
num.insert(0, var)
mgc = lambda num: int(''.join(str(i) for i in num))
num = mgc(num)
print(num)
With this code I get the output: 165107
It's almost doing exactly what I need, but it drops the 0 in front of ord('A'), which is 65. I want that part to come out as 1065. Everything else seems to be working great. I'm using '%03d'% to insert the 0.
How I want it to work is:
Get the ord() value from a string of numbers and letters.
if the ord() value is less than 100 (ex: A = 65, add a 0 to make it a 3 digit number)
take the ord() values and combine them into 1 number. The 0 needs to stay in front of 65. Then add a one at the start of the list. So basically the output will look like:
1065107
I want to make sure I can take that number and apply math to it.
I have this code too:
s = 'Ak'
for c in s:
    s = ord(c)
    s = '%03d'%s
    mgk = (''.join(str(s)))
    s = [mgk]
    var = 1
    s.insert(0, var)
    mgc = lambda s: int(''.join(str(i) for i in s))
    s = mgc(s)
    print(s)
but then it counts each letter as its own element and it will not combine them and I only want the one in front of the very first number.
When the number is converted to an integer, it
Is this what you want? I am kinda confused:
a = 'Ak'
result = '1' + ''.join(str(f'{ord(char):03d}') for char in a)
print(result) # 1065107
# to make it a number just do:
my_int = int(result)
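As a quick check that the leading 1 really does protect the zero padding, here is a small round-trip sketch. The names encode and decode are hypothetical, and the sketch assumes every character has a code point below 1000 so that three digits per character are enough.

```python
def encode(text):
    # The leading '1' keeps the zero padding alive once the digits become an int.
    return int('1' + ''.join(f'{ord(ch):03d}' for ch in text))

def decode(number):
    digits = str(number)[1:]  # drop the sentinel '1'
    # Read the remaining digits back in groups of three.
    return ''.join(chr(int(digits[i:i + 3])) for i in range(0, len(digits), 3))

value = encode('Ak')
print(value)          # 1065107
print(value + 1)      # ordinary integer math works: 1065108
print(decode(value))  # Ak
```

Without the sentinel digit, int('065107') would collapse to 65107 and the first group could no longer be recovered reliably.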

find minimum steps required to change one binary string to another

Given two string str1 and str2 which contain only 0 or 1, there
are some steps to change str1 to str2,
step1: find a substring of str1 of length 2 and reverse the substring, and str1 becomes str1' (str1' != str1)
step2: find a substring of str1' of length 3, and reverse the substring, and str1' becomes str1'' (str1'' != str1')
the following steps are similar.
the string length is in the range [2, 30]
Requirement: each step must be performed once and we can not skip
previous steps and perform the next step.
If it is possible to change str1 to str2, output the minimum steps required, otherwise, output -1
Example 1
str1 = "1010", str2 = "0011", the minimum step required is 2
first, choose substring in range [2, 3], "1010" --> "1001",
then choose substring in the range [0, 2], "1001" --> "0011"
Example 2
str1 = "1001", str2 = "0110", it is impossible to change str1 to str2,
because in step1, str1 can be changed to "0101" or "1010", but in step2, it is impossible to reverse a length-3 substring of either string so that the string changes. So the output is -1.
Example 3
str1 = "10101010", str2 = "00101011", output is 7
I cannot figure out example 3, because there are too many possibilities. Can anyone give some hints on how to solve this problem? What type of
problem is this? Is it dynamic programming?
This is in fact a dynamic programming problem. To solve it, we try all possible reversals, but memoize the results along the way. It may seem that there are far too many options, since there are 2^30 different binary strings of length 30, but keep in mind that reversing a substring doesn't change the number of zeroes and ones, so the upper bound is in fact 30 choose 15 = 155117520, reached when the string has 15 zeroes and 15 ones. Around 150 million possible results is not too bad.
So, starting with our start string, we derive all possible strings from each string we have derived so far, until we generate the end string. We also track predecessors so we can reconstruct the sequence. Here's my code:
start = '10101010'
end = '00101011'
dp = [{} for _ in range(31)]
dp[1][start] = '' # Originally only start string is reachable
for i in range(2, len(start) + 1):
    for s in dp[i - 1].keys():
        # Try all possible reversals for each string in dp[i - 1]
        for j in range(len(start) - i + 1):
            newstr = s
            newstr = newstr[:j] + newstr[j:j+i][::-1] + newstr[j+i:]
            dp[i][newstr] = s
    if end in dp[i]:
        ans = []
        cur = end
        for j in range(i, 0, -1):
            ans.append(cur)
            cur = dp[j][cur]
        print(ans[::-1])
        exit(0)
print('Impossible!')
And for your third example, this gives us sequence ['10101010', '10101001', '10101100', '10100011', '00101011'] - from your str1 to str2. If you check differences between the strings, you'll see which transitions were made. So this transformation can be done in 4 steps rather than 7 like you suggested.
Lastly, this will be a bit slow for 30 in python, but if you rewrite it into C++, it's going to be a couple of seconds tops.
This question can be solved using backtracking. Here is my C++ code, which runs smoothly on my test cases. This question came up in an OA for Persistent Systems and I was a bit confused about the steps, but it is simple backtracking. I'd welcome suggestions on whether DP can optimize my solution.
//prabaljainn
#include <bits/stdc++.h>
using namespace std;
string s1,s2;
int ans=1e9; int n;
void rec(string s1,int level){
    if(s1==s2){
        ans = min(ans,level-2);
        return;
    }
    for(int i=0; i<= n-level; i++){
        reverse(s1.begin()+i, s1.begin()+i+level);
        rec(s1,level+1);
        reverse(s1.begin()+i, s1.begin()+i+level);
    }
}
int main(){
    cin>>s1>>s2;
    n = s1.size();
    rec(s1,2);
    if(ans==1e9)
        cout<<"-1"<<endl;
    else
        cout<<ans<<endl;
}
Happy coding
This problem can be solved using breadth-first search. The following solution uses a queue which stores a pair having the current string as the first member and current operation length(initially 2) as the second member. A set is used to store already visited strings to prevent entering redundant states. For current string, we reverse every substring of length k where k is current operation length and add it to the queue if it hasn't been seen before. If the current string equals the desired string then answer is 'current operation length-2'. If queue becomes empty, then the answer isn't possible.
string str1,str2;
cin>>str1>>str2;
queue<pair<string, int>> q;
set<string> s;
q.push({str1,2});
s.insert(str1);
while(!q.empty())
{
    auto p=q.front();
    q.pop();
    if(p.first==str2)
    {
        cout<<p.second-2;
        return 0;
    }
    if(p.second<=p.first.size())
    {
        for(int i=0;i<=p.first.size()-p.second;i++)
        {
            string x=p.first;
            reverse(x.begin()+i,x.begin()+i+p.second);
            if(s.find(x)==s.end())
            {
                q.push({x,p.second+1});
                s.insert(x);
            }
        }
    }
}
cout<<-1;
Save str1 as the start of the BFS and, at each step, reverse all substrings of length 2 and 3 and check whether the resulting strings have been seen before. If they have not, push them onto the queue, and also maintain a count of steps. If the string at the front of the queue is ever str2, that step count is the answer.
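The level-by-level search described in the answers above can also be sketched in Python with the "the string must actually change" rule from the question made explicit. This is a minimal sketch, not any answerer's code: the name min_reversal_steps is made up, and it assumes step k always reverses a substring of length k + 1 and that equal input strings need 0 steps.

```python
def min_reversal_steps(start, target):
    """Fewest steps to turn start into target, or -1 if impossible.

    Step k reverses some substring of length k + 1, and the reversal
    must change the string (hypothetical reading of the problem).
    """
    if len(start) != len(target):
        return -1
    if start == target:
        return 0
    frontier = {start}  # every string reachable after the previous step
    for length in range(2, len(start) + 1):
        reachable = set()
        for s in frontier:
            for j in range(len(s) - length + 1):
                t = s[:j] + s[j:j + length][::-1] + s[j + length:]
                if t != s:  # a palindromic window would leave s unchanged
                    reachable.add(t)
        if target in reachable:
            return length - 1
        if not reachable:  # no legal move exists, so later steps are blocked
            return -1
        frontier = reachable
    return -1

print(min_reversal_steps("1010", "0011"))  # 2
print(min_reversal_steps("1001", "0110"))  # -1
```

Keeping one set per step length matters because the same string can be reachable at different steps, and which moves are available depends on the step.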

Return number of alphabetical substrings within input string

I'm trying to generate code to return the number of substrings within an input that are in sequential alphabetical order.
i.e. Input: 'abccbaabccba'
Output: 2
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def cake(x):
    for i in range(len(x)):
        for j in range (len(x)+1):
            s = x[i:j+1]
            l = 0
            if s in alphabet:
                l += 1
    return l
print (cake('abccbaabccba'))
So far my code will only return 1. Based on tests I've done on it, it seems it just returns a 1 if there are letters in the input. Does anyone see where I'm going wrong?
You are getting the output 1 every time because your code resets the count to l = 0 on every pass through the loop.
If you fix this, you will get the answer 96, because you are including a lot of redundant checks on empty strings ('' in alphabet returns True).
If you fix that, you will get 17, because your test string contains substrings of length 1 and 2, as well as 3+, that are also substrings of the alphabet. So, your code needs to take into account the minimum substring length you would like to consider—which I assume is 3:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def cake(x, minLength=3):
    l = 0
    for i in range(len(x)):
        # carefully specify both the start and end values of the loop that determines where your substring will end
        # (the end value is len(x) + 1 so that substrings ending at the last character are checked too)
        for j in range(i+minLength, len(x)+1):
            s = x[i:j]
            if s in alphabet:
                print(repr(s))
                l += 1
    return l
print (cake('abccbaabccba'))
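If the quadratic scan becomes a bottleneck, the same count can be obtained in one pass by measuring maximal alphabetical runs: a run of length L contains (L - m + 1)(L - m + 2) / 2 substrings of length at least m. The sketch below is an alternative technique rather than the answer's method; the function name is made up, and it assumes the input consists only of lowercase letters a to z.

```python
def count_alphabetical_substrings(text, min_length=3):
    """Count substrings of length >= min_length that are alphabet runs.

    Assumes text contains only lowercase a-z (hypothetical helper).
    """
    total = 0
    run = 1 if text else 0  # length of the current consecutive run
    for i in range(1, len(text) + 1):
        if i < len(text) and ord(text[i]) == ord(text[i - 1]) + 1:
            run += 1
        else:
            # The run ended (or the text ended): count its qualifying substrings.
            if run >= min_length:
                k = run - min_length + 1
                total += k * (k + 1) // 2
            run = 1
    return total

print(count_alphabetical_substrings('abccbaabccba'))  # 2
```

For 'abcd' this counts 'abc', 'bcd' and 'abcd', which matches what the nested-loop version counts once it checks substrings that end at the last character.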

Change Letters in A String One at a Time (Pandas,Python3)

I have a list of words in Pandas (DF)
Words
Shirt
Blouse
Sweater
What I'm trying to do is swap out certain letters in those words with letters from my dictionary one letter at a time.
so for example:
mydict = {"e":"q,w",
"a":"z"}
would create a new list that first replaces all the "e" in a list one at a time, and then iterates through again replacing all the "a" one at a time:
Words
Shirt
Blouse
Sweater
Blousq
Blousw
Swqater
Swwater
Sweatqr
Sweatwr
Swezter
I've been looking around at solutions here: Mass string replace in python?
and have tried the following code but it changes all instances "e" instead of doing so one at a time -- any help?:
mydict = {"e":"q,w"}
s = DF
for k, v in mydict.items():
    for j in v:
        s['Words'] = s["Words"].str.replace(k, j)
DF["Words"] = s
this doesn't seem to work either:
s = DF.replace({"Words": {"e": "q","w"}})
This answer is very similar to Brian's answer, but a little bit sanitized and the output has no duplicates:
words = ["Words", "Shirt", "Blouse", "Sweater"]
md = {"e": "q,w", "a": "z"}
md = {k: v.split(',') for k, v in md.items()}
newwords = []
for word in words:
    newwords.append(word)
    for c in md:
        occ = word.count(c)
        pos = 0
        for _ in range(occ):
            pos = word.find(c, pos)
            for r in md[c]:
                tmp = word[:pos] + r + word[pos+1:]
                newwords.append(tmp)
            pos += 1
Content of newwords:
['Words', 'Shirt', 'Blouse', 'Blousq', 'Blousw', 'Sweater', 'Swqater', 'Swwater', 'Sweatqr', 'Sweatwr', 'Swezter']
Prettyprint:
Words
Shirt
Blouse
Blousq
Blousw
Sweater
Swqater
Swwater
Sweatqr
Sweatwr
Swezter
Any errors are a result of the current time. ;)
Update (explanation)
tl;dr
The main idea is to find the occurrences of the character in the word one after another. For each occurrence we then replace it with each replacement character (again one after another). The replaced word gets added to the output list.
I will try to explain everything step by step:
words = ["Words", "Shirt", "Blouse", "Sweater"]
md = {"e": "q,w", "a": "z"}
Well. Your basic input. :)
md = {k: v.split(',') for k, v in md.items()}
A simpler way to deal with replacing-dictionary. md now looks like {"e": ["q", "w"], "a": ["z"]}. Now we don't have to handle "q,w" and "z" differently but the step for replacing is just the same and ignores the fact, that "a" only got one replace-char.
newwords = []
The new list to store the output in.
for word in words:
    newwords.append(word)
We have to do those actions for each word (I assume the reason is clear). We also append the word directly to our just-created output list (newwords).
for c in md:
c as short for character. So for each character we want to replace (all keys of md), we do the following stuff.
occ = word.count(c)
occ for occurrences (yeah, count would fit as well :P). word.count(c) returns the number of occurrences of the character/string c in word. So "Sweater".count("o") => 0 and "Sweater".count("e") => 2.
We use this here to know how often we have to look at word to get all the occurrences of c.
pos = 0
Our startposition to look for c in word. Comes into use in the next loop.
for _ in range(occ):
For each occurrence. As a running number has no value for us here, we "discard" it by naming it _. At this point we don't know where c is in word. Yet.
pos = word.find(c, pos)
Oh. Look. We found c. :) word.find(c, pos) returns the index of the first occurrence of c in word, starting at pos. At the beginning, this means from the start of the string => the first occurrence of c. But with this call we already update pos. This plus the last line (pos += 1) moves our search window for the next round to start just behind the previous occurrence of c.
for r in md[c]:
Now you see why we updated md previously: we can easily iterate over it now (an md[c].split(',') on the old md would do the job as well). So we are now doing the replacement for each of the replacement characters.
tmp = word[:pos] + r + word[pos+1:]
The actual replacement. We store it in tmp (for debugging reasons). word[:pos] gives us word up to the (current) occurrence of c (excluding c). r is the replacement. word[pos+1:] adds the remaining word (again without c).
newwords.append(tmp)
Our so created new word tmp now goes into our output-list (newwords).
pos += 1
The already mentioned adjustment of pos to "jump over c".
Additional question from OP: Is there an easy way to dictate how many letters in the string I want to replace [(meaning e.g. multiple at a time)]?
Surely. But I have currently only a vague idea on how to achieve this. I am going to look at it, when I got my sleep. ;)
words = ["Words", "Shirt", "Blouse", "Sweater", "multipleeee"]
md = {"e": "q,w", "a": "z"}
md = {k: v.split(',') for k, v in md.items()}
num = 2 # this is the number of replaces at a time.
newwords = []
for word in words:
    newwords.append(word)
    for char in md:
        for r in md[char]:
            pos = multiples = 0
            current_word = word
            while current_word.find(char, pos) != -1:
                pos = current_word.find(char, pos)
                current_word = current_word[:pos] + r + current_word[pos+1:]
                pos += 1
                multiples += 1
                if multiples == num:
                    newwords.append(current_word)
                    multiples = 0
                    current_word = word
Content of newwords:
['Words', 'Shirt', 'Blouse', 'Sweater', 'Swqatqr', 'Swwatwr', 'multipleeee', 'multiplqqee', 'multipleeqq', 'multiplwwee', 'multipleeww']
Prettyprint:
Words
Shirt
Blouse
Sweater
Swqatqr
Swwatwr
multipleeee
multiplqqee
multipleeqq
multiplwwee
multipleeww
I added multipleeee to demonstrate how the replacement works: for num = 2 it means the first two occurrences are replaced, and after them the next two. So there is no overlap between the replaced parts. If you want something like ['multiplqqee', 'multipleqqe', 'multipleeqq'], you have to store the position of the "first" occurrence of char. You can then restore pos to that position in the if multiples == num: block.
If you got further questions, feel free to ask. :)
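The overlapping variant mentioned above (['multiplqqee', 'multipleqqe', 'multipleeqq']) can be sketched by collecting the positions of the character first and then sliding a window of num consecutive occurrences over them. This is one possible approach, not the answerer's code; the function name windowed_replacements is made up for the illustration.

```python
def windowed_replacements(word, char, replacement, num):
    """Replace every window of num consecutive occurrences of char.

    Windows overlap: occurrences (1,2), then (2,3), and so on.
    (Hypothetical helper for illustration.)
    """
    positions = [i for i, ch in enumerate(word) if ch == char]
    results = []
    for start in range(len(positions) - num + 1):
        letters = list(word)
        for pos in positions[start:start + num]:
            letters[pos] = replacement
        results.append(''.join(letters))
    return results

print(windowed_replacements('multipleeee', 'e', 'q', 2))
# ['multiplqqee', 'multipleqqe', 'multipleeqq']
print(windowed_replacements('Sweater', 'e', 'w', 1))
# ['Swwater', 'Sweatwr']
```

With num = 1 this reduces to the one-at-a-time replacement from the original question, and words with fewer than num occurrences simply yield an empty list.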
Because you need to replace letters one at a time, this doesn't sound like a good problem to solve with pandas, since pandas is about doing everything at once (vectorized operations). I would dump out your DataFrame into a plain old list and use list operations:
words = DF.to_dict()["Words"].values()
for find, replace in reversed(sorted(mydict.items())):
    for word in words:
        occurences = word.count(find)
        if not occurences:
            print(word)
            continue
        start_index = 0
        for i in range(occurences):
            for replace_char in replace.split(","):
                modified_word = list(word)
                index = modified_word.index(find, start_index)
                modified_word[index] = replace_char
                modified_word = "".join(modified_word)
                print(modified_word)
            start_index = index + 1
Which gives:
Words
Shirt
Blousq
Blousw
Swqater
Swwater
Sweatqr
Sweatwr
Words
Shirt
Blouse
Swezter
Instead of printing the words, you can append them to a list and re-create a DataFrame if that's what you want to end up with.
If you are looping, you need to update s at each cycle of the loop. You also need to loop over v.
mydict = {"e":"q,w"}
s=deduped
for k, v in mydict.items():
    for j in v:
        s = s.replace(k, j)
Then reassign it to your dataframe:
df["Words"] = s
If you can write this as a function that takes a 1-D array (list, NumPy array, etc.), you can apply it to any column with df.apply().

How to calculate word co-occurence

I have a string of characters of length 50 say representing a sequence abbcda.... for alphabets taken from the set A={a,b,c,d}.
I want to calculate how many times b is followed by another b (n-grams) where n=2.
Similarly, how many times a particular character is repeated thrice n=3 consecutively, say in the input string abbbcbbb etc so here the number of times b occurs in a sequence of 3 letters is 2.
To find the number of non-overlapping 2-grams you can use
numel(regexp(str, 'b{2}'))
and for 3-grams
numel(regexp(str, 'b{3}'))
to count overlapping 2-grams use positive lookahead
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
EDIT
In order to find number of occurrences of an arbitrary sequence use the first element in first parenthesis and the rest after equality sign, to find ba use
numel(regexp(str, '(b)(?=a)'))
to find bda use
numel(regexp(str, '(b)(?=da)'))
Building on the proposal by Magla:
str = 'abcdabbcdaabbbabbbb'; % for example
index_single = ismember(str, 'b');
index_digram = index_single(1:end-1)&index_single(2:end);
index_trigram = index_single(1:end-2)&index_single(2:end-1)&index_single(3:end);
You may try this piece of code that uses ismember (doc).
%generate string (50 char, 'a' to 'd')
str = char(floor(97 + (101-97).*rand(1,50)))
%digram case
index_digram = ismember(str, 'aa');
%trigram case
index_trigram = ismember(str, 'aaa');
EDIT
Probabilities can be computed with
proba = sum(index_digram)/length(index_digram);
this will find all n-grams and count them:
numberOfGrams = 5;
s = char(floor(rand(1,1000)*4)+double('a'));
ngrams = cell(1);
for n = 2:numberOfGrams
    strLength = size(s,2)-n+1;
    indices = repmat((1:strLength)',1,n)+repmat(1:n,strLength,1)-1;
    grams = s(indices);
    % encode each gram as a base-4 number (4 = alphabet size) so that distinct grams never collide
    gramNumbers = (double(grams)-double('a'))*(4.^(0:n-1))';
    [uniqueGrams, gramInd] = unique(gramNumbers);
    count=hist(gramNumbers,uniqueGrams);
    ngrams(n) = {struct('gram',grams(gramInd,:),'count',count)};
end
edit:
the result will be:
ngrams{n}.gram %a list of all n letter sequences in the string
ngrams{n}.count(x) %the number of times the sequence ngrams{n}.gram(x) appears
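For comparison, overlapping n-gram counts can be sketched in a few lines of Python with a sliding window and a Counter. The function name ngram_counts is made up for this illustration; the test string is the abbbcbbb example from the question.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every overlapping n-character window in text.

    (Hypothetical helper for illustration.)
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

trigrams = ngram_counts('abbbcbbb', 3)
print(trigrams['bbb'])                   # 2
print(ngram_counts('abbbcbbb', 2)['bb']) # 4
```

Dividing a count by the total number of windows, len(text) - n + 1, gives the relative frequency of that n-gram.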

Resources