Does Python have a method to count how many times a substring occurs in a string, allowing for overlap? - string

I want to count how many times a substring occurs in a long string. How can I do this in Python?
"12212"
contains "12" twice.
How do I get that count?
It must allow for overlapping substrings; for instance, "1111" contains 3 "11" substrings.
"12121" contains 2 "121" substrings.
"1111".count("11")
will return 2. It does not count any overlaps.

Strings have a count method.
You can do
s = '12212'
s.count('12') # this equals 2
Edited because the question changed; the answer below was posted as a comment by tobias_k.
To count with overlap:
count_all = lambda string, sub: sum(string[i:i+len(sub)] == sub for i in range(len(string) - len(sub) + 1))
This can be called with:
count_all('1111', '11') # this returns 3
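As an alternative to slicing at every index, the overlapping count can also be sketched with str.find, restarting the search one character after each match. The function name count_overlapping is only illustrative, and rejecting an empty substring is an assumption made here, not something the question specifies.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    if not sub:
        # Assumption: an empty substring is treated as an error.
        raise ValueError("sub must be non-empty")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Resume one character after the match start so overlaps are seen.
        start = text.find(sub, start + 1)
    return count


print(count_overlapping('1111', '11'))    # 3
print(count_overlapping('12121', '121'))  # 2
```

Advancing by one character rather than by len(sub) is what lets overlapping matches be counted.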

Related

Efficient way to check if a specific character in a string appears consecutively

Say the character we want to check for consecutive appearances in a string s is the dot '.'.
For example, 'test.2.1' does not have consecutive dots, whereas 'test..2.2a...' does. In fact, for the second example, we should not even bother checking the rest of the string after the first occurrence of consecutive dots.
I have come up with the following simple method:
def consecutive_dots(s):
    count = 0
    # Iterate over the parameter s (the original looped over an undefined name, data).
    for c in s:
        if c == '.':
            count += 1
            if count == 2:
                return True
        else:
            count = 0
    return False
I was wondering if the above is the most 'pythonic' and efficient way to achieve this goal.
You can just use in to check if two consecutive dots (that is, the string "..") appear in the string s
def consecutive_dots(s):
    return '..' in s
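The same idea might be generalized to any single character by building the doubled string first. This is a small sketch; the name has_consecutive and its char parameter are hypothetical, not part of the original answer.

```python
def has_consecutive(s, char='.'):
    """Return True if char appears at least twice in a row in s."""
    # char * 2 builds the doubled string, e.g. '..' for '.'.
    return char * 2 in s


print(has_consecutive('test.2.1'))       # False
print(has_consecutive('test..2.2a...'))  # True
```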

calling functions in python 3 from within a function

Given a string, return the count of the number of times that a substring length 2 appears in the string and also as the last 2 chars of the string, so "hixxxhi" yields 1 (we won't count the end substring).
last2('hixxhi') → 1
last2('xaxxaxaxx') → 1
last2('axxxaaxx') → 2
I found this question in one of the websites (https://codingbat.com/prob/p145834).
The answer to the above question as given on the website is as follows :
def last2(str):
    # Screen out too-short string case.
    if len(str) < 2:
        return 0
    # last 2 chars, can be written as str[-2:]
    last2 = str[len(str)-2:]
    count = 0
    # Check each substring length 2 starting at i
    for i in range(len(str)-2):
        sub = str[i:i+2]
        if sub == last2:
            count = count + 1
    return count
I have a question about the following line of code:
last2 = str[len(str)-2:]
Now, I know that this line extracts the last 2 letters of the string 'str'. What confuses me is the variable name: it is the same as the name of the function. So is this line calling the function again and updating the value of the variable 'str'?
def last2(str):
. . .
This creates a parameter called str that shadows the built-in str class*. Within this function, str refers to the str parameter, not the str built-in class.
This is poor practice, though. Don't give your variables the same names as existing builtins. Doing so causes confusing situations like this one and can lead to bugs.
A better name would be something that describes what purpose the string has, instead of just a generic, non-meaningful str.
* The built-in str is actually a class, not a plain function. str(x) is a call to the constructor of the str class.
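To illustrate the scoping involved, here is a minimal sketch that uses a hypothetical parameter name, text, in place of str. It suggests that assigning to last2 inside the body only binds a local variable and never calls the function.

```python
def last2(text):
    # Assigning to the name last2 inside the body creates a local variable.
    # It does not call the function, and the module-level function is untouched.
    last2 = text[-2:]
    return last2


result = last2('hixxhi')
print(result)           # hi
print(callable(last2))  # True: outside the body, last2 is still the function
```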
def last2(str):
    if len(str) == 0:
        return 0
    last_two = str[-2::]
    count = 0
    for i in range(len(str)):
        if last_two == str[i:i + 2]:
            count += 1
    return count - 1
This is the first answer that worked for me. The official answer is better, but this one might be less confusing for you.
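For comparison, one possible variant avoids reusing either the function name or the builtin name str. The names count_last2, text, and tail are hypothetical choices for this sketch.

```python
def count_last2(text):
    """Count 2-character substrings equal to the last 2 characters,
    excluding the final occurrence itself."""
    if len(text) < 2:
        return 0
    tail = text[-2:]
    # range(len(text) - 2) stops before the final 2-character window.
    return sum(text[i:i + 2] == tail for i in range(len(text) - 2))


print(count_last2('hixxhi'))     # 1
print(count_last2('xaxxaxaxx'))  # 1
print(count_last2('axxxaaxx'))   # 2
```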

Find the location of multiple strings in a cell array of strings

I have two questions regarding searching for strings in MATLAB.
If I have to find a string in a cell array of strings, I can do the following to get the location of 'PO' in the cell array:
find(strcmpi({'PO','FOO','PO1','FOO1','PO1','PO'},'PO'))
% 1 6
But, I really want to search for multiple strings ({'PO1', 'PO'}) at the same time (not using a for loop). What is the best way to do this?
Is there any function like histc() that can tell me how many times the string has occurred? Again, for one string, I could do:
length(strfind({'PO','FOO','PO1','FOO1','PO1','PO'},'PO'))
But this obviously doesn't work for multiple strings at a time.
If you want to find multiple strings, then just use the second output of ismember instead to tell you which string it is. If you really need case-insensitive matching, I've added the upper call to force all inputs to be upper-case. You can omit this if you think it's already uppercase.
data = {'PO','FOO','PO1','FOO1','PO1','PO', 'PO'};
[tf, inds] = ismember(upper(data), {'PO1', 'PO'});
% 2 0 1 0 1 2 2
You can then use the second output to determine which string was found where:
% PO1 Occurrences
find(inds == 1)
% 3 5
% PO Occurrences
find(inds == 2)
% 1 6 7
If you want the equivalent of histc, you can use accumarray to do that. We can pass it all of the values of inds that are non-zero (i.e. the ones that you were actually searching for).
accumarray(inds(tf).', ones(sum(tf), 1))
% 2 3
If instead you want to get the histogram of all strings (not just the ones you're searching for) you could do the following:
[strings, ~, inds] = unique(data, 'stable');
occurrences = accumarray(inds, ones(size(inds)));
% 'PO' [3]
% 'FOO' [1]
% 'PO1' [2]
% 'FOO1' [1]

Finding indexes of strings in a string array in Matlab

I have two string arrays and I want to find where each string from the first array is in the second array, so I tried this:
for i = 1:length(array1);
    cmp(i) = strfind(array2,array1(i,:));
end
This doesn't seem to work and I get an error: "must be one row".
Just for the sake of completeness, an array of strings is nothing but a char matrix. This can be quite restrictive because all of your strings must have the same number of elements. And that's what @neerad29's solution is all about.
However, instead of an array of strings, you might want to consider a cell array of strings, in which every string can be arbitrarily long. I will reproduce the very same @neerad29 solution, but with cell arrays. The code will also look a little bit smarter:
a = {'abcd'; 'efgh'; 'ijkl'};
b = {'efgh'; 'abcd'; 'ijkl'};
pos=[];
for i=1:size(a,1)
    AreStringFound=cellfun(@(x) strcmp(x,a(i,:)),b);
    pos=[pos find(AreStringFound)];
end
Some additional notes might be needed:
pos will contain the indices, 2 1 3 in our case, just like @neerad29's solution.
cellfun() is a function which applies a given function, strcmp() in our case, to every cell of a given cell array. x will be the generic cell from array b which will be compared with a(i,:).
cellfun() returns a boolean array (AreStringFound) with true in position j if a(i,:) is found in the j-th cell of b, and find() will then return the value of j, our proper index. This code is more robust and also works if a given string is found in more than one position in b.
strfind won't work, because it is used to find a string within another string, not within an array of strings. So, how about this:
a = ['abcd'; 'efgh'; 'ijkl'];
b = ['efgh'; 'abcd'; 'ijkl'];
cmp = zeros(1, size(a, 1));
for i = 1:size(a, 1)
    for j = 1:size(b, 1)
        if strcmp(a(i, :), b(j, :))
            cmp(i) = j;
            break;
        end
    end
end
cmp =
2 1 3

Matlab. Find the indices of a cell array of strings with characters all contained in a given string (without repetition)

I have one string and a cell array of strings.
str = 'actaz';
dic = {'aaccttzz', 'ac', 'zt', 'ctu', 'bdu', 'zac', 'zaz', 'aac'};
I want to obtain:
idx = [2, 3, 6, 8];
I have written a very long code that:
finds the elements with length not greater than length(str);
removes the elements with characters not included in str;
finally, for each remaining element, checks the characters one by one
Essentially, it's an almost brute force code and runs very slowly. I wonder if there is a simple way to do it fast.
NB: I have just edited the question to make clear that characters can be repeated n times if they appear n times in str. Thanks Shai for pointing it out.
You can sort the strings and then match them using a regular expression. For your example the pattern will be ^a{0,2}c{0,1}t{0,1}z{0,1}$:
u = unique(str);
t = ['^' sprintf('%c{0,%d}', [u; histc(str,u)]) '$'];
s = cellfun(@sort, dic, 'uni', 0);
idx = find(~cellfun('isempty', regexp(s, t)));
I came up with this:
>> g=@(x,y) sum(x==y) <= sum(str==y);
>> h=@(t)sum(arrayfun(@(x)g(t,x),t))==length(t);
>> f=cellfun(@(x)h(x),dic);
>> find(f)
ans =
2 3 6
g and h: check whether the count of each letter in the search string is <= its count in str.
f: finally, apply g and h to each element in dic.

Resources