Find longest substring without certain character in python 3 - python-3.x

My question is: how can I find the length of the longest substring that does not contain a certain character?
For example, I want to find the length of the longest substring without the letter 'a' in the string 'cbdbabdbcbdacbadbcbbcda'. The answer should be 7, because the longest substring without 'a' is 'dbcbbcd'.
What are the possible ways to do it?

Split the string by the certain character, in this case "a", count the lengths of the substrings, and get the maximum.
string = 'cbdbabdbcbdacbadbcbbcda'
avoid = 'a'
longest_substring = max((len(substring) for substring in string.split(avoid)))
print(longest_substring)
This outputs 7.
You can obviously split the comprehension up into multiple lines etc. if that makes it easier to understand.

A regex-based approach might use re.findall with the pattern [b-z]+ (for input that is not limited to lowercase letters, the negated class [^a]+ matches any character except 'a'):
import re
inp = "cbdbabdbcbdacbadbcbbcda"
longest = max(re.findall(r'[b-z]+', inp), key=len)
print(longest) # dbcbbcd
print(len(longest)) # 7

You can also use a simple for loop to achieve this.
string = 'cbdbabdbcbdacbadbcbbcda'
avoid = 'a'
max_count, curr_count = 0, 0
for char in string:
    if char != avoid:
        curr_count += 1
    else:
        if curr_count > max_count:
            max_count = curr_count
        curr_count = 0 # Reset curr_count
max_count = max(max_count, curr_count) # Handle a run that reaches the end of the string
print(max_count)
Output:
7
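If the substring itself is wanted and not only its length, the same idea can be sketched with itertools.groupby. This is a minimal sketch, assuming avoid is a single character; the function name longest_run_without is made up for illustration.

```python
from itertools import groupby


def longest_run_without(text, avoid):
    """Return the longest substring of `text` that does not contain `avoid`.

    Assumes `avoid` is a single character. Returns '' if every
    character is avoided or `text` is empty.
    """
    # groupby splits the text into alternating runs of avoided and
    # non-avoided characters; keep only the non-avoided runs.
    runs = (
        "".join(group)
        for is_avoided, group in groupby(text, key=lambda ch: ch == avoid)
        if not is_avoided
    )
    return max(runs, key=len, default="")


result = longest_run_without('cbdbabdbcbdacbadbcbbcda', 'a')
print(result, len(result))  # dbcbbcd 7
```

Because the runs are collected before taking the maximum, a run that reaches the end of the string is counted as well.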

Related

How to count the number of substrings in a string?

I want to find the number of occurrences of a particular sub-string in a string.
string="abcbcbcb"
sub_str="cbc"
c=string.count(sub_str)
print(c)
This gives the output as
1
which is the number of non-overlapping occurrences of substring in the string.
But I want to calculate the overlapping strings as well. Thus, the desired output is:
2
You can use a regular expression with a lookahead, via the re module:
import re
print(len(re.findall('(?=cbc)', 'abcbcbcb')))
There is no standard function for counting overlapping occurrences, but you could write a custom function:
def count_occ(string, substr):
    cnt = 0
    pos = 0
    while True:
        pos = string.find(substr, pos)
        if pos > -1:
            cnt += 1
            pos += 1
        else:
            break
    return cnt
string="abcbcbcb"
sub_str="cbc"
print(count_occ(string,sub_str))
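The lookahead idea can also be wrapped in a function that accepts an arbitrary substring. This is a sketch only; the name count_overlapping is hypothetical. re.escape is used on the assumption that the substring may contain regex metacharacters such as '.' or '+'.

```python
import re


def count_overlapping(text, sub):
    """Count occurrences of `sub` in `text`, including overlapping ones."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    # A zero-width lookahead matches at every position where `sub` starts,
    # without consuming characters, so overlapping matches are all found.
    pattern = "(?=" + re.escape(sub) + ")"
    return sum(1 for _ in re.finditer(pattern, text))


print(count_overlapping("abcbcbcb", "cbc"))  # 2
print("abcbcbcb".count("cbc"))               # 1 (non-overlapping)
```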

Lexicographically smallest palindrome in python

I found this question interesting and would like to share it here to find reasonably good code, specific to Python:
Given a string S having characters from the English alphabet ['a' - 'z'] and '.' as the special character (without quotes).
Write a program to construct the lexicographically smallest palindrome by filling each of the faded characters ('.') with a lowercase letter.
Definition:
The smallest lexicographical order is an order relation where string s is smaller than t if the first character of s (s1) is smaller than the first character of t (t1) or, in case they are equivalent, the second character, etc.
For example: "aaabbb" is smaller than "aaac" because although the first three characters are equal, the fourth character b is smaller than the fourth character c.
Input Format:
String S
Output Format:
Print the lexicographically smallest palindrome after filling each '.' character, if it is possible to construct one. Print -1 otherwise.
Example-1
Input:
a.ba
Output:
abba
Example-2:
Input:
a.b
Output:
-1
Explanation:
In example 1, you can create a palindrome by filling the '.' character by 'b'.
In example 2, it is not possible to make the string s a palindrome.
You can't just copy-paste questions from NPTEL assignments and ask them here without even trying!
Anyway, since the "code" is your only concern, try copy-pasting the lines below:
word = input()
length = len(word)
def SmallestPalindrome(word, length):
    i = 0
    j = length - 1
    word = list(word) #creating a list from the input word
    while (i <= j):
        if (word[i] == word[j] == '.'):
            word[i] = word[j] = 'a'
        elif word[i] != word[j]:
            if (word[i] == '.'):
                word[i] = word[j]
            elif (word[j] == '.'):
                word[j] = word[i]
            else: # worst case situation when palindrome condition is not met
                return -1
        i = i + 1
        j = j - 1
    return "".join(word) # to turn the list back to a string
print(SmallestPalindrome(word, length)) #Print the output of your function
# Alternative solution using a for loop
s=input()
s=list(s)
n=len(s)
j=n
c=0
for i in range(n):
    j=j-1
    if(s[i]=='.' and s[j]=='.'):
        # both positions are faded (this includes the middle character): use the smallest letter
        s[i]='a'
        s[j]='a'
    elif(s[i]==s[j]):
        continue
    elif(s[i]=='.' or s[j]=='.'):
        # exactly one side is faded: copy the known character across
        if(s[i]!='.'):
            s[j]=s[i]
        else:
            s[i]=s[j]
    else:
        # two different letters: no palindrome is possible
        c=c+1
        break
if(c<1):
    print("".join(s))
else:
    print("-1")
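For testing without console input, the same two-pointer idea can be sketched as a function that takes the string as a parameter. The name smallest_palindrome and the choice to return None (instead of -1) for the impossible case are assumptions made for this illustration.

```python
def smallest_palindrome(s):
    """Fill each '.' so that the result is the lexicographically smallest
    palindrome. Returns None if no palindrome can be constructed."""
    chars = list(s)
    i, j = 0, len(chars) - 1
    while i <= j:
        left, right = chars[i], chars[j]
        if left == '.' and right == '.':
            # Both sides are free: 'a' is the smallest possible letter.
            chars[i] = chars[j] = 'a'
        elif left == '.':
            chars[i] = right
        elif right == '.':
            chars[j] = left
        elif left != right:
            # Two fixed letters that disagree: impossible.
            return None
        i += 1
        j -= 1
    return "".join(chars)


for example in ("a.ba", "a.b"):
    result = smallest_palindrome(example)
    print(result if result is not None else -1)
```

For the two examples from the question this prints abba and -1.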

How to find set of shortest subsequences with minimal collisions from set of strings

I've got a list of strings like
Foobar
Foobaron
Foot
barstool
barfoo
footloose
I want to find the set of shortest possible sub-sequences that are unique to each string in the set; the characters in each sub-sequence do not need to be adjacent, just in order as they appear in the original string. For the example above, that would be (among other possibilities)
Fb (as unique to Foobar as it gets; collision with Foobaron unavoidable)
Fn (unique to Foobaron, no other ...F...n...)
Ft (Foot)
bs (barstool)
bf (barfoo)
e (footloose)
Is there an efficient way to mine such sequences and minimize the number of colliding strings (when collisions can't be avoided, e.g. when strings are substrings of other strings) from a given array of strings? More precisely, choosing the length N, what is the set of sub-sequences of up to N characters each that identify the original strings with the fewest collisions?
I wouldn't really call this 'efficient', but you can do better than a totally naive approach, like this:
words = ['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose']
N = 2
n = len(words)
L = max([len(word) for word in words])
def generate_substrings(word, max_length=None):
    if max_length is None:
        max_length = len(word)
    set_substrings = set()
    set_substrings.add('')
    for charac in word:
        new_substr_list = []
        for substr in set_substrings:
            new_substr = substr + charac
            if len(new_substr) <= max_length:
                new_substr_list.append(new_substr)
        set_substrings.update(new_substr_list)
    return set_substrings
def get_best_substring_for_each(string_list=words, max_length=N):
    all_substrings = {}
    best = {}
    for word in string_list:
        for substring in generate_substrings(word, max_length=max_length):
            if substring not in all_substrings:
                all_substrings[substring] = 0
            all_substrings[substring] = all_substrings[substring] + 1
    for word in string_list:
        best_score = len(string_list) + 1
        best[word] = ''
        for substring in generate_substrings(word=word, max_length=max_length):
            if all_substrings[substring] < best_score:
                best[word] = substring
                best_score = all_substrings[substring]
    return best
print(get_best_substring_for_each(words, N))
This program prints the solution:
{'barfoo': 'af', 'Foobar': 'Fr', 'Foobaron': 'n', 'footloose': 'os', 'barstool': 'al', 'Foot': 'Ft'}
This can still be improved easily by a constant factor, for instance by storing the results of generate_substrings instead of computing them twice.
The complexity is O(n*C(L+N, N)), where n is the number of words, L is the maximum length of a word, and C(n, k) is the number of combinations of k elements out of n.
I don't think (not sure though) that you can do much better in the worst case, because it seems hard to avoid enumerating all possible substrings in the worst case (the last one to be evaluated could be the only one with no redundancy...). Maybe on average you can do better...
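The caching idea mentioned above (compute each word's candidates once and reuse them) might look like the following sketch. It is a variation on the answer's code, not the same code: it uses itertools.combinations and collections.Counter, skips the empty string, assumes the words are distinct and non-empty, and breaks ties by length and then alphabetically so the result is deterministic. The function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def subsequences(word, max_length):
    """All distinct subsequences of `word` with 1..max_length characters."""
    result = set()
    for k in range(1, min(max_length, len(word)) + 1):
        # combinations() keeps the original character order,
        # so every tuple is a subsequence of `word`.
        for combo in combinations(word, k):
            result.add("".join(combo))
    return result


def best_subsequences(words, max_length):
    """Map each word to a subsequence shared with as few words as possible."""
    # Computed once per word and reused for both counting and selection.
    cache = {word: subsequences(word, max_length) for word in words}
    counts = Counter(sub for subs in cache.values() for sub in subs)
    return {
        word: min(cache[word], key=lambda s: (counts[s], len(s), s))
        for word in words
    }


words = ['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose']
print(best_subsequences(words, 2))
```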
You could use a modification of the longest common subsequence algorithm. In this case you are seeking the shortest unique subsequence. Shown below is part of a dynamic programming solution, which is more efficient than a recursive solution. As in the longest common subsequence algorithm, the table SUS needs one extra row and column (index 0) for the empty prefixes, so the loops start at 1. The modifications to the longest common subsequence algorithm are described in the comments below:
for (int i = 1; i <= string1.Length; i++)
    for (int j = 1; j <= string2.Length; j++)
        if (string1[i-1] != string2[j-1]) // find characters in the strings that are distinct
            SUS[i][j] = SUS[i-1][j-1] + 1; // SUS: Shortest Unique Subsequence
        else
            SUS[i][j] = min(SUS[i-1][j], SUS[i][j-1]); // find minimum size of distinct strings
You can then put this code in a function and call it for each string in your set to find the length of the shortest unique subsequence in the set.
Once you have the length of the shortest unique subsequence, you can backtrack to print the subsequence.
You should use a modified trie structure. Insert the strings into a trie like this:
Foo-bar-on
   -t
bar-stool
   -foo
The rest is straightforward: just choose the first character (node[0]) of each compressed node.
A radix tree should help with that.
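One possible reading of the trie suggestion is sketched below. It builds a plain (uncompressed) trie from nested dicts and treats every point where the trie branches, or where another word ends, as the start of a new compressed node, then collects the first character of each such node. The function name trie_labels and the END marker are hypothetical. The labels are distinct from each other, but they are not guaranteed to be collision-free as subsequences (for example, 'f' is also a subsequence of 'barfoo').

```python
def trie_labels(words):
    """Label each word with the first character of every compressed
    (radix tree) node on its path through the trie."""
    END = object()  # hypothetical end-of-word marker used as a dict key
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    labels = {}
    for word in words:
        node = root
        picked = []
        at_boundary = True  # the root always starts a new compressed node
        for ch in word:
            if at_boundary:
                picked.append(ch)
            node = node[ch]
            # More than one key means the trie branches here, or a word
            # ends here while another continues: the compressed node ends.
            at_boundary = len(node) > 1
        labels[word] = "".join(picked)
    return labels


words = ['Foobar', 'Foobaron', 'Foot', 'barstool', 'barfoo', 'footloose']
print(trie_labels(words))
```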

python find position of characters in a string ignoring special characters in this string

I have a string
s1='abcdebcfg'
And for some reason the same string with added characters ('-','.')
s2='..abcde--bc-fg'
I want to map the index of a character from s1 to s2
Example:s1:0 -->s2:2 , s1:5 -->s2:9 ...
I solved it by counting the number of occurrences of the character s1[i] in s1 up to position i, and then finding the occurrence with the same number in s2:
def find_nth(needle,haystack, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start
for i in range(len(s1)) :
    occurrence= s1[:i+1].count(s1[i])
    j=find_nth(s1[i], s2, occurrence)
Note I found the find_nth here
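You can try something like this:
for i in range(len(s1)):
    j = s2.find(s1[i])  # find() returns the first occurrence only, so this is only correct when no character repeats in s1
    print("s1:", i, "-->s2:", j)
You could also use two stacks, one for each string (s1, s2), with the index as key and the character as value, then pop the values from each, compare them, and generate the required output.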
Ok, so something like this should work:
start = 0
for i in range(len(s1)):
    for j in range(start, len(s2)):
        if s2[j] == s1[i]:
            print("s1:", i, "-->s2:", j)
            start = j + 1  # resume after this match so repeated characters map correctly
            break
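Since s2 is s1 with extra characters inserted, another way to look at it is that the positions of the non-special characters in s2 are exactly the mapping. A minimal sketch under that assumption follows; the function name map_indices and the extra parameter are hypothetical, and the special characters are assumed never to occur in s1.

```python
def map_indices(s1, s2, extra="-."):
    """Return a list where element i is the index in s2 of character s1[i].

    Assumes s2 equals s1 with characters from `extra` inserted,
    and that s1 itself contains none of those characters.
    """
    positions = [j for j, ch in enumerate(s2) if ch not in extra]
    # Sanity check: removing the extra characters must give back s1.
    if "".join(s2[j] for j in positions) != s1:
        raise ValueError("s2 is not s1 with extra characters inserted")
    return positions


mapping = map_indices('abcdebcfg', '..abcde--bc-fg')
print(mapping[0], mapping[5])  # 2 9
```

This runs in a single pass over s2 and handles repeated characters without counting occurrences.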

How to calculate word co-occurence

I have a string of characters of length 50, say, representing a sequence abbcda... over the alphabet A={a,b,c,d}.
I want to calculate how many times b is followed by another b (n-grams with n=2).
Similarly, I want to count how many times a particular character is repeated three times consecutively (n=3). For example, in the input string abbbcbbb, b occurs in a run of 3 letters 2 times.
To find the number of non-overlapping 2-grams you can use
numel(regexp(str, 'b{2}'))
and for 3-grams
numel(regexp(str, 'b{3}'))
to count overlapping 2-grams use positive lookahead
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
EDIT
In order to find number of occurrences of an arbitrary sequence use the first element in first parenthesis and the rest after equality sign, to find ba use
numel(regexp(str, '(b)(?=a)'))
to find bda use
numel(regexp(str, '(b)(?=da)'))
Building on the proposal by Magla:
str = 'abcdabbcdaabbbabbbb'; % for example
index_single = ismember(str, 'b');
index_digram = index_single(1:end-1)&index_single(2:end);
index_trigram = index_single(1:end-2)&index_single(2:end-1)&index_single(3:end);
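You may try this piece of code that uses ismember. Note that ismember tests each character separately, so the n-gram indices are built by combining shifted copies of the single-character index.
%generate string (50 char, 'a' to 'd')
str = char(floor(97 + (101-97).*rand(1,50)))
%single character case
index_single = ismember(str, 'a');
%digram case
index_digram = index_single(1:end-1) & index_single(2:end);
%trigram case
index_trigram = index_single(1:end-2) & index_single(2:end-1) & index_single(3:end);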
EDIT
Probabilities can be computed with
proba = sum(index_digram)/length(index_digram);
This will find all n-grams and count them:
numberOfGrams = 5;
alphabetSize = 4; % letters 'a' to 'd'
s = char(floor(rand(1,1000)*alphabetSize)+double('a'));
ngrams = cell(1);
for n = 2:numberOfGrams
    strLength = size(s,2)-n+1;
    indices = repmat((1:strLength)',1,n)+repmat(1:n,strLength,1)-1;
    grams = s(indices);
    % encode each n-gram as a number in base alphabetSize so that distinct n-grams get distinct codes
    gramNumbers = (double(grams)-double('a'))*(alphabetSize.^(0:n-1))';
    [uniqueGrams, gramInd] = unique(gramNumbers);
    count=hist(gramNumbers,uniqueGrams);
    ngrams(n) = {struct('gram',grams(gramInd,:),'count',count)};
end
Edit: the result will be:
ngrams{n}.gram %a list of all n letter sequences in the string
ngrams{n}.count(x) %the number of times the sequence ngrams{n}.gram(x) appears
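For comparison with the other threads on this page, the same overlapping n-gram count can be sketched in Python with collections.Counter. This is not a translation of the MATLAB code above, only an illustration of the counting idea; the function name ngram_counts is made up.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every overlapping n-gram (substring of length n) in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


counts = ngram_counts('abbbcbbb', 3)
print(counts['bbb'])                        # 2
print(ngram_counts('abbbcbbb', 2)['bb'])    # 4
```

Dividing a count by the total number of n-grams, len(text) - n + 1, gives the corresponding relative frequency.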
