How to find length of shortest unique substring and number of occurrences of all unique substrings of same length in a given string - string

The problem is to find the length of the shortest unique substring and the number of unique substrings of that length in the string. For example, "aatcc" has "t" as its shortest unique substring, with length 1, so the output is 1,1. Another example is "aacc": here the output is 2,3, because the unique substrings are aa, ac and cc.
I tried to solve it but could only come up with a brute-force solution, which loops over all possible substrings. It exceeded the time limit.
I googled it and found some references to suffix arrays, but I am not quite clear about them.
So what is the optimal solution for this problem?
EDIT: I forgot to mention the key requirement for this problem: the solution must NOT use any library functions other than the input and output functions needed to read from standard input and write to standard output.
EDIT: I have found another solution using trie data structure.
Pseudocode:
for i from 1 to length(string) do
    for j from 0 to length(string)-1 do
        1. create a substring of length i from jth character
        2. if checkIfSeen(substring) then count-- else count++
    close inner for loop
    if count >= 1 then break
close outer for loop
print i (the length of the unique substring), count (no. of such substrings)
checkIfSeen(substring) will use a trie data structure, which will run in O(log l), where l is the average length of the prefixes.
The time complexity of this algorithm would be O(n^2 log l); if the average length of the prefixes is n/2, the time complexity would be O(n^2 log n). Please point out any mistakes, and also ways to improve this running time if possible.

Sorry, but keep in mind that my answer is based on a program I wrote in Python; the idea can be applied in any programming language :)
I believe a brute-force approach is indeed what you need for this problem. What we can do to shorten the running time is:
1: Start the brute force from the smallest substring length, which is 1.
2: After looping through the string with substring length 1 (the data will look something like {"a":2, "t":1, "c":2} for "aatcc"), check whether any substring appeared only once. If one did, count the unique substrings by looping through the dictionary (in the example you gave, only "t" appeared once, so the count is 1).
3: After the count is computed, break out of the loop so that no time is wasted on counting longer substrings.
4: If no unique substring was found in step 2, reset the dictionary and try the next substring length (the data can be something like {"aa":1, "ac":1, "cc":1} for "aacc"). Eventually a unique substring WILL be found no matter what (for example, in the string "aaaaa", the unique substring is "aaaaa" with the data {"aaaaa":1}).
Here is the implementation in Python:
def countString(string):
    for i in range(1, len(string)+1):  # start the brute force from substring length 1
        dictionary = {}
        for j in range(len(string)-i+1):  # check every substring of length i
            # count the substring occurrences
            try:
                dictionary[string[j:j+i]] += 1
            except KeyError:  # first time this substring is seen
                dictionary[string[j:j+i]] = 1
        isUnique = False  # loop stops if isUnique is True
        occurrence = 0
        for key in dictionary:  # iterate through the dictionary
            if dictionary[key] == 1:  # check if any substring is unique
                # if found, get ready to escape from the loop and increase the occurrence
                isUnique = True
                occurrence += 1
        if isUnique:
            return (i, occurrence)
print(countString("aacc"))  # prints (2, 3)
print(countString("aatcc"))  # prints (1, 1)
I am pretty sure that this design is fairly fast, but there always should be a better way. But anyway, I hope this helped :)
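Since the question also mentions a trie, here is a minimal sketch of how that idea might look, using only built-in lists and dicts. The function name shortest_unique_substrings and the [count, children] node layout are illustrative choices, not something from the question, and whether dict methods count as "library functions" under the stated constraint is an assumption. One point worth noting: looking up a string of length l in a trie costs O(l), not O(log l). The sketch inserts every suffix, so it uses O(n^2) time and memory in the worst case.

```python
def shortest_unique_substrings(s):
    """Return (length, count) for the shortest substrings occurring exactly once.

    Hypothetical helper: each trie node is a two-item list [count, children],
    where count is how many suffixes pass through the node.
    """
    root = [0, {}]
    # Insert every suffix; a node at depth d represents a substring of length d,
    # and its count is the number of times that substring occurs in s.
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            child = node[1].get(ch)
            if child is None:
                child = [0, {}]
                node[1][ch] = child
            child[0] += 1
            node = child
    # Walk the trie level by level and stop at the first depth
    # that contains a node with count == 1.
    level = [root]
    depth = 0
    while level:
        depth += 1
        next_level = []
        for node in level:
            for child in node[1].values():
                next_level.append(child)
        unique = 0
        for node in next_level:
            if node[0] == 1:
                unique += 1
        if unique:
            return depth, unique
        level = next_level
    return 0, 0  # only reached for the empty string


print(shortest_unique_substrings("aatcc"))
print(shortest_unique_substrings("aacc"))
```

For the two examples in the question this gives the same pairs as the dictionary version above.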

Related

Palindrome problem - Trying to check 2 lists for equality python3.9

I'm writing a program to check whether a given user input is a palindrome. If it is, the program should print "Yes"; if not, "No". I realize that this program is far too complex, since I only needed to check the whole word using the reversed() function, but I ended up splitting the word into two lists and then checking the lists against each other.
Despite that, I'm not clear why the last conditional isn't printing the expected "Yes" when I pass it "racecar" as input. When I print the lists in lines 23 and 24, I get two lists that are identical, but when I compare them in the conditional, I always get "No", meaning they are not equal to each other. Can anyone explain why this is? I've tried converting the lists to strings, but no luck.
def odd_or_even(a): # function for determining if odd or even
    if len(a) % 2 == 0:
        return True
    else:
        return False
the_string = input("How about a word?\n")
x = int(len(the_string))
odd_or_even(the_string) # find out if the word has an odd or an even number of characters
if odd_or_even(the_string) == True: # if even
    for i in range(x):
        first_half = the_string[0:int((x/2))] #create a list with part 1
        second_half = the_string[(x-(int((x/2)))):x] #create a list with part 2
else: #if odd
    for i in range(x):
        first_half = the_string[:(int((x-1)/2))] #create a list with part 1 without the middle index
        second_half = the_string[int(int(x-1)/2)+1:] #create a list with part 2 without the middle index
print(list(reversed(second_half)))
print(list(first_half))
if first_half == reversed(second_half): ##### NOT WORKING BUT DONT KNOW WHY #####
    print("Yes")
else:
    print("No")
Despite your comments first_half and second_half are substrings of your input, not lists. When you print them out, you're converting them to lists, but in the comparison, you do not convert first_half or reversed(second_half). Thus you are comparing a string to an iterator (returned by reversed), which will always be false.
So a basic fix is to do the conversion for the if, just like you did when printing the lists out:
if list(first_half) == list(reversed(second_half)):
A better fix might be to compare as strings, by making one of the slices use a step of -1, so you don't need to use reversed. Try second_half = the_string[-1:x//2:-1] (or similar, you probably need to tweak either the even or odd case by one). Or you could use the "alien smiley" slice to reverse the string after you slice it out of the input: second_half = second_half[::-1].
There are a few other oddities in your code, like your for i in range(x) loop that overwrites all of its results except the last one. Just use x - 1 in the slicing code and you don't need that loop at all. You're also calling int a lot more often than you need to (if you used // instead of /, you could get rid of literally all of the int calls).
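Putting these suggestions together, a minimal sketch could look like the following. The function name is_palindrome is made up for illustration; it keeps the two-halves idea from the question but compares strings directly, using // and a [::-1] slice so that neither int nor reversed is needed.

```python
def is_palindrome(word):
    """Compare the first half of word with the reversed second half."""
    half = len(word) // 2
    first_half = word[:half]
    # Take the last `half` characters (this skips the middle character of an
    # odd-length word) and reverse them with a step of -1.
    second_half_reversed = word[len(word) - half:][::-1]
    return first_half == second_half_reversed


# A string never compares equal to the iterator returned by reversed(),
# which is why the original condition always printed "No".
print("rac" == reversed("car"))
print("Yes" if is_palindrome("racecar") else "No")
```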

Sorting algorithm

I want to make my algorithm more efficient by deleting the items it has already sorted, but I don't know how to do that efficiently. The only way I found was to rewrite the whole list.
l = [] #Here you put your list
sl = [] # this is to store the list when it is sorted
a = 0 # variable to store which numbers he already looked for
while True: # loop
    if len(sl) == len(l): #if their size is matching it will stop
        print(sl) # print the sorted list
        break
    a = a + 1
    if a in l: # check if it is in list
        sl.append(a) # add to sorted list
        #here i want it to be deleted from the list.
The variable a is a little awkward. It starts at 0 and is incremented by 1 until it matches an element of the list l.
Imagine if l = [1000000, 1200000, -34]. Then your algorithm will first run for 1000000 iterations without doing anything, just incrementing a from 0 to 1000000. Then it will append 1000000 to sl. Then it will run again 200000 iterations without doing anything, just incrementing a from 1000000 to 1200000.
And then it will keep incrementing a looking for the number -34, which is below zero...
I understand the idea behind your variable a is to select the elements from l in order, starting from the smallest element. There is a function that does that: it's called min(). Try using that function to select the smallest element from l, and append that element to sl. Then delete this element from l; otherwise, the next call to min() will select the same element again instead of selecting the next smallest element.
Note that min() has a disadvantage: it returns the value of the smallest element, but not its position in the list. So it's not completely obvious how to delete the element from l after you've found it with min(). An alternative is to write your own function that returns both the element, and its position. You can do that with one loop: in the following piece of code, i refers to a position in the list (0 is the position of the first element, 1 the position of the second, etc) and a refers to the value of that element. I left blanks and you have to figure out how to select the position and value of the smallest element in the list.
....
for i, a in enumerate(l):
    if ...:
        ...
...
If you managed to do all this, congratulations! You have implemented "selection sort". It's a well-known sorting algorithm. It is one of the simplest. There exist many other sorting algorithms.
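For readers who want to compare their attempt afterwards, here is one possible way the blanks might be filled in. This is only a sketch: the names selection_sort, best_pos and best_val are invented for illustration, and the function works on a copy so the caller's list is left untouched.

```python
def selection_sort(items):
    """Repeatedly move the smallest remaining element into the result."""
    remaining = list(items)  # copy, so the input list is not modified
    result = []
    while remaining:
        # Find both the position and the value of the smallest element.
        best_pos, best_val = 0, remaining[0]
        for i, a in enumerate(remaining):
            if a < best_val:
                best_pos, best_val = i, a
        result.append(best_val)
        # Delete by position, so duplicates and negative numbers are handled.
        del remaining[best_pos]
    return result


print(selection_sort([1000000, 1200000, -34]))
```

Unlike the counter-based loop in the question, the amount of work here depends only on the number of elements, not on how large their values are.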

Grabbing the most duplicated letter from a string

What I want to accomplish is an algorithm that finds the most duplicated letter across the entire list of strings. I'm new to Python, so it's taken me roughly two hours to get to this stage. The problem with my current code is that it returns every duplicated letter, when I'm only looking for the most duplicated letter. Additionally, I would like to know of a faster way that doesn't use two for loops.
Code:
rock_collections = ['aasdadwadasdadawwwwwwwwww', 'wasdawdasdasdAAdad', 'WaSdaWdasSwd', 'daWdAWdawd', 'QaWAWd', 'fAWAs', 'fAWDA']
seen = []
dupes = []
for words in rock_collections:
    for letter in words:
        if letter not in seen:
            seen.append(letter)
        else:
            dupes.append(letter)
print(dupes)
If you are looking for the letter which appears the greatest number of times, I would recommend the following code:
def get_popular(strings):
    full = ''.join(strings)
    unique = list(set(full))
    return max(
        list(zip(unique, map(full.count, unique))), key=lambda x: x[1]
    )
rock_collections = [
    'aasdadwadasdadawwwwwwwwww',
    'wasdawdasdasdAAdad',
    'WaSdaWdasSwd',
    'daWdAWdawd',
    'QaWAWd',
    'fAWAs',
    'fAWDA'
]
print(get_popular(rock_collections)) # ('d', 19)
Let me break down the code for you:
full contains all of the strings joined together, with nothing between them. set(full) produces a set, meaning that it contains every unique letter only once. list(set(full)) turns this back into a list, so that zip and map below both walk the letters in the same fixed order.
map(full.count, unique) iterates over each of the unique letters and counts how many times it occurs in the string; note that each call to full.count scans the whole string once. zip(unique, ...) pairs those numbers with their respective letters. key=lambda x: x[1] is a way of saying: don't take the maximum of the tuples themselves, take the maximum by the second element of each tuple (which is the number of times the letter appears). max finds the most common letter, using the aforementioned key.
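As an alternative that makes a single pass over the text, the standard library's collections.Counter can do the counting. The sketch below is not the code from the answer above; the function name most_common_letter is made up for illustration, and it assumes the input contains at least one character.

```python
from collections import Counter


def most_common_letter(strings):
    """Return a (letter, count) tuple for the most frequent character."""
    counts = Counter("".join(strings))  # one pass over all characters
    # most_common(1) returns a list holding the single most frequent item;
    # indexing [0] raises IndexError if the input was empty.
    return counts.most_common(1)[0]


print(most_common_letter(["aab", "b", "bc"]))
```

As in the original code, upper- and lower-case letters are counted separately; calling .lower() on the joined string first would merge them.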

list and dictionary: which one is faster

I have the following pieces of code doing the sorting of a list by swapping pairs of elements:
# Complete the minimumSwaps function below.
def minimumSwaps(arr):
    counter = 0
    val_2_indx = {val: arr.index(val) for val in arr}
    for indx, x in enumerate(arr):
        if x != indx+1:
            arr[indx] = indx+1
            s_indx = val_2_indx[indx+1]
            arr[s_indx] = x
            val_2_indx[indx+1] = indx
            val_2_indx[x] = s_indx
            counter += 1
    return counter
def minimumSwaps(arr):
    temp = [0] * (len(arr) + 1)
    for pos, val in enumerate(arr):
        temp[val] = pos
    swaps = 0
    for i in range(len(arr)):
        if arr[i] != i+1:
            swaps += 1
            t = arr[i]
            arr[i] = i+1
            arr[temp[i+1]] = t
            temp[t] = temp[i+1]
            temp[i+1] = i
    return swaps
The second function works much faster than the first one. However, I was told that dictionary is faster than list.
What's the reason here?
A list is a data structure, and a dictionary is a data structure. It doesn't make sense to say one is "faster" than the other, any more than you can say that an apple is faster than an orange. One might grow faster, you might be able to eat the other one faster, and they might both fall to the ground at the same speed when you drop them. It's not the fruit that's faster, it's what you do with it.
If your problem is that you have a sequence of strings and you want to know the position of a given string in the sequence, then consider these options:
You can store the sequence as a list. Finding the position of a given string using the .index method requires a linear search, iterating through the list in O(n) time.
You can store a dictionary mapping strings to their positions. Finding the position of a given string requires looking it up in the dictionary, in O(1) time.
So it is faster to solve that problem using a dictionary.
But note also that in your first function, you are building the dictionary using the list's .index method - which means doing n linear searches each in O(n) time, building the dictionary in O(n^2) time because you are using a list for something lists are slow at. If you build the dictionary without doing linear searches, then it will take O(n) time instead:
val_2_indx = { val: i for i, val in enumerate(arr) }
But now consider a different problem. You have a sequence of numbers, and they happen to be the numbers from 1 to n in some order. You want to be able to look up the position of a number in the sequence:
You can store the sequence as a list. Finding the position of a given number requires linear search again, in O(n) time.
You can store them in a dictionary like before, and do lookups in O(1) time.
You can store the inverse sequence in a list, so that lst[i] holds the position of the value i in the original sequence. This works because every permutation is invertible. Now getting the position of i is a simple list access, in O(1) time.
This is a different problem, so it can take a different amount of time to solve. In this case, both the list and the dictionary allow a solution in O(1) time, but it turns out it's more efficient to use a list. Getting by key in a dictionary has a higher constant time than getting by index in a list, because getting by key in a dictionary requires computing a hash, and then probing an array to find the right index. (Getting from a list just requires accessing an array at an already-known index.)
This second problem is the one in your second function. See this part:
temp = [0] * (len(arr) + 1)
for pos, val in enumerate(arr):
    temp[val] = pos
This creates a list temp, where temp[val] = pos whenever arr[pos] == val. This means the list temp is the inverse permutation of arr. Later in the code, temp is used only to get these positions by index, which is an O(1) operation and happens to be faster than looking up a key in a dictionary.
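To make the comparison concrete, here is a small sketch that builds both lookup structures for the same permutation and checks that they agree. The helper names positions_as_dict and positions_as_list and the sample array are illustrative only, and the sketch assumes arr is a permutation of 1..n, as in the question.

```python
def positions_as_dict(arr):
    """Map each value to its position, built in O(n) with enumerate."""
    return {val: i for i, val in enumerate(arr)}


def positions_as_list(arr):
    """Inverse permutation: result[val] is the position of val in arr.

    Index 0 is unused because the values run from 1 to n.
    """
    inverse = [0] * (len(arr) + 1)
    for pos, val in enumerate(arr):
        inverse[val] = pos
    return inverse


arr = [4, 3, 1, 2]
by_dict = positions_as_dict(arr)
by_list = positions_as_list(arr)
for val in arr:
    # Both structures answer "where is val?" in O(1), unlike arr.index(val).
    print(val, by_dict[val], by_list[val])
```

Timing the two lookups with the timeit module on your own data is the most reliable way to see how large the constant-factor difference is on a given machine.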

Python3.x - counting occurrences of all substrings using dictionaries

Given some string S, this code will count the number of occurrences of all possible substrings of the string S.
#count[i]=no of different substrings in the string that occurs exactly i times
count=[0]*(100001)
a=input()
dic={}
n=len(a)
for i in range(n):
    temp=""
    for j in range(i,n,1):
        temp+=a[j]
        if temp in dic:
            dic[temp]+=1
        else:
            dic[temp]=1
for k,v in dic.items():
    count[v]+=1
For example, for the string "ababa", the array will be:
cnt[1]=4 {"ababa", "abab", "baba", "bab"} occur exactly once
cnt[2]=4 {"aba", "ab", "ba", "b"} occur exactly twice
cnt[3]=1 {"a"} occur exactly thrice
cnt[4]=0
cnt[5]=0
I am interested in knowing the runtime of my code.
There are essentially two parts of your code to consider separately:
1. The nested loop where you build `dic`.
2. The loop where you build `count`.
For 1., there are two loops to consider. The i loop runs n times, and the j loop runs n-i times for each value of i.
This means that the j loop runs n times the first time, n-1 times the second time, and so on, until it runs once when i = n-1. Thus the total number of iterations of this block is n(n+1)/2, which is O(n^2).
(Note: I am assuming that each dictionary access takes constant time, which is the case most of the time. Strictly speaking, hashing a substring of length k costs O(k), so counted in character operations the worst case is O(n^3).)
For 2., there is only one loop to consider, and it runs once per unique substring. For a string of length n, the maximum number of unique substrings is again n(n+1)/2, which is also O(n^2).
So the running time is O(n^2). For n = 10^5, the number of operations is ~10^10, which will take around 10 seconds, using the standard assumption that 10^9 operations take 1 second.
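The iteration count can be checked empirically with a short sketch. The helper count_substring_work is hypothetical: it mirrors the loops from the question but also returns how many times the inner loop body ran and how many distinct substrings were stored.

```python
def count_substring_work(s):
    """Return (inner_loop_iterations, distinct_substrings) for string s."""
    n = len(s)
    seen = {}
    inner_iterations = 0
    for i in range(n):
        temp = ""
        for j in range(i, n):
            temp += s[j]
            seen[temp] = seen.get(temp, 0) + 1
            inner_iterations += 1
    return inner_iterations, len(seen)


for text in ("ababa", "abcde"):
    n = len(text)
    iterations, distinct = count_substring_work(text)
    # iterations always equals n(n+1)/2; distinct is at most that.
    print(text, iterations, n * (n + 1) // 2, distinct)
```

For "ababa" the inner loop runs 15 times and stores 9 distinct substrings, matching the 4 + 4 + 1 from the example; for "abcde" all 15 substrings are distinct, which is the worst case for the second loop.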

Resources