Python3.x - counting occurrences of all substrings using dictionaries

Given some string S, this code will count the number of occurrences of all possible substrings of the string S.
# count[i] = number of different substrings in the string that occur exactly i times
count = [0] * 100001
a = input()
dic = {}
n = len(a)
for i in range(n):
    temp = ""
    for j in range(i, n, 1):
        temp += a[j]
        if temp in dic:
            dic[temp] += 1
        else:
            dic[temp] = 1
for k, v in dic.items():
    count[v] += 1
For example, for the string "ababa", the array will be:
count[1]=4 {"ababa", "abab", "baba", "bab"} occur exactly once
count[2]=4 {"aba", "ab", "ba", "b"} occur exactly twice
count[3]=1 {"a"} occurs exactly three times
count[4]=0
count[5]=0
I am interested in knowing the running time of my code.

There are essentially two parts of your code to consider separately:
1. The nested loop where you build `dic`.
2. The loop where you build `count`.
For 1, there are two loops to consider. The i loop runs n times, and for each value of i the j loop runs n-i times.
This means the j loop runs n times the first time, n-1 times the second time, and so on until it runs once when i = n-1. The total number of iterations in this block is therefore n + (n-1) + ... + 1 = n(n+1)/2, which is O(n^2).
(Note: I am assuming that each dictionary access takes constant time, which is the usual simplification. Hashing a string key of length m actually costs O(m), as does building temp, so if you count character-level work the worst case grows to O(n^3).)
For 2, there is only one loop to consider, which runs once per unique substring. For a string of length n, the maximum number of unique substrings is again n(n+1)/2, which is also O(n^2).
So, under that assumption, the running time is O(n^2). For n = 10^5, the number of operations is ~10^10, which will take around 10 seconds, using the common rule of thumb that 10^9 simple operations take 1 second.
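To check the iteration count empirically, here is a minimal sketch. The function name substring_histogram and the steps counter are illustrative additions, not part of the original code; it uses collections.Counter instead of the hand-written dictionary update.

```python
from collections import Counter

def substring_histogram(text):
    """Return (histogram, steps): histogram maps k to the number of
    distinct substrings occurring exactly k times; steps counts how
    often the inner loop body ran."""
    occurrences = Counter()
    steps = 0
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[text[i:j]] += 1  # one dictionary update per (i, j) pair
            steps += 1
    histogram = dict(Counter(occurrences.values()))
    return histogram, steps

histogram, steps = substring_histogram("ababa")
print(sorted(histogram.items()))  # [(1, 4), (2, 4), (3, 1)]
print(steps)                      # 15, i.e. 5 * 6 / 2
```

For "ababa" (n = 5) the inner body runs 15 times, matching n(n+1)/2.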

Related

How to extract numbers with repeating digits within a range

I need to identify the count of numbers with non-repeating digits in the range of two numbers.
Suppose n1=11 and n2=15.
There is the number 11, which has repeated digits, but 12, 13, 14 and 15 have no repeated digits. So, the output is 4.
Wrote this code:
n1=int(input())
n2=int(input())
count=0
for i in range(n1,n2+1):
    lst=[]
    x=i
    while (n1>0):
        a=x%10
        lst.append(a)
        x=x//10
    for j in range(0,len(lst)-1):
        for k in range(j+1,len(lst)):
            if (lst[j]==lst[k]):
                break
            else:
                count=count+1
print (count)
While running the code and after inputting the two numbers, it does not run the code but still accepts input. What did I miss?

Your code appears not to run because it gets stuck in the while loop. The loop body changes x, not n1, so the condition n1 > 0 can never become False unless the input itself is <= 0.
Anyway, your approach is overcomplicated, not very readable and not exactly pythonic. Here's a simpler and more readable approach:
from collections import Counter
n1 = int(input())
n2 = int(input())
count = 0
for num in range(n1, n2+1):
    digit_count = Counter(str(num))
    # a number has repeating digits if any digit occurs more than once
    has_repeating_digits = any(occurrences > 1 for occurrences in digit_count.values())
    if not has_repeating_digits:
        count += 1
print(count)
When writing code, you should generally try to avoid nesting too much (your original example has 4 nested loops, which is a readability and debugging nightmare), and try to use self-describing variable names (so a, x, j, k, b... are kind of a no-go).
If you run import this in an IPython session (or any Python interpreter), you can also read the "Zen of Python", which sums up the idea of writing proper pythonic code.
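As a further simplification, a number has no repeated digits exactly when the set of its digits is as large as its digit string. The sketch below assumes non-negative integers; the function names are made up for illustration.

```python
def has_unique_digits(number):
    """True if no digit appears more than once in number."""
    digits = str(number)
    return len(set(digits)) == len(digits)

def count_unique_digit_numbers(low, high):
    """Count the numbers in [low, high] whose digits are all distinct."""
    return sum(1 for number in range(low, high + 1) if has_unique_digits(number))

print(count_unique_digit_numbers(11, 15))  # 4
```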

How to find length of shortest unique substring and number of occurrences of all unique substrings of same length in a given string

The problem is to find the length of the shortest unique substring and the number of unique substrings of that length occurring in the string. For example, "aatcc" has "t" as its shortest unique substring, whose length is 1, so the output is 1,1. Another example is "aacc": here the output is 2,3, as the unique substrings are aa, ac and cc.
I tried to solve it but could only come up with a brute-force solution, which loops over all possible substrings. It exceeded the time limit.
I googled it and found some references to suffix arrays, but I am not quite clear about them.
So what is the optimal solution for this problem?
EDIT: I forgot to mention a key requirement for this problem: the solution must NOT use any library functions other than input and output functions for reading from standard input and writing to standard output.
EDIT: I have found another solution using a trie data structure.
Pseudocode:
for i from 1 to length(string) do
    for j from 0 to length(string)-1 do
        1. create a substring of length i from jth character
        2. if checkIfSeen(substring) then count-- else count++
    close inner for loop
    if count >= 1 then break
close outer for loop
print i (the length of the unique substring), count (no. of such substrings)
checkIfSeen(Substring) will use a trie data structure which will run O(log l) where l is the average length of the prefixes.
The time complexity of this algorithm would be O(n^2 log l), where if the average length of the prefixes is n/2 then the time complexity would be O(n^2 log n). Please point out any mistakes and also ways to improve this running time if possible.

Keep in mind that my answer is based on a program I wrote in Python, but the idea can be applied to any programming language :)
I believe a brute-force approach is indeed what you need for this problem. But here is what we can do to shorten the time:
1: Start the brute force from the smallest substring length, which is 1.
2: After looping through the string with substring length 1 (the data will look something like {"a":2, "t":1, "c":2} for "aatcc"), check if any substring appeared only once. If it did, count the occurrence by looping through the dictionary (in the example you gave, only "t" appeared once, so the occurrence is 1).
3: After the occurrence is counted, break the loop so that no time is wasted on counting the remaining, longer substrings.
4: If no unique substring was found in step 2, reset the dictionary and try a longer substring (the data can be something like {"aa":1, "ac":1, "cc":1} for "aacc"). Eventually a unique substring WILL be found no matter what (for example, in the string "aaaaa", the unique substring is "aaaaa" with the data {"aaaaa":1}).
Here is the implementation in Python:
def countString(string):
    for i in range(1, len(string)+1):  # start the brute force from substring length 1
        dictionary = {}
        for j in range(len(string)-i+1):  # check every substring of length i
            # count the substring occurrences
            try:
                dictionary[string[j:j+i]] += 1
            except KeyError:
                dictionary[string[j:j+i]] = 1
        isUnique = False  # loop stops if isUnique is True
        occurrence = 0
        for key in dictionary:  # iterate through the dictionary
            if dictionary[key] == 1:  # check if any substring is unique
                # if found, get ready to escape from the loop and increase the occurrence
                isUnique = True
                occurrence += 1
        if isUnique:
            return (i, occurrence)
print(countString("aacc"))  # prints (2, 3)
print(countString("aatcc"))  # prints (1, 1)
I am pretty sure that this design is fairly fast, but there always should be a better way. But anyway, I hope this helped :)
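For comparison, the trie idea mentioned in the question could be sketched as below. This is only an illustration under stated assumptions: the function name is made up, each trie node is a plain dict mapping a character to a [count, children] pair, and built-in dict methods are assumed to be allowed. Note that walking a trie for a string of length l costs O(l), not O(log l), and that this version uses O(n^2) time and memory in the worst case.

```python
def shortest_unique_substrings(text):
    """Return (length, how_many) for the shortest substrings that occur exactly once."""
    root = {}  # node layout: {char: [count, children]}
    # Insert every suffix; each node's count then equals the number of
    # occurrences of the substring spelled out by the path to that node.
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            entry = node.setdefault(ch, [0, {}])
            entry[0] += 1
            node = entry[1]
    # Walk the trie level by level; depth equals substring length.
    level = [root]
    depth = 0
    while level:
        depth += 1
        entries = [entry for node in level for entry in node.values()]
        unique = sum(1 for count, _ in entries if count == 1)
        if unique:
            return depth, unique
        level = [children for _, children in entries]
    return 0, 0  # only reached for an empty string

print(shortest_unique_substrings("aacc"))   # (2, 3)
print(shortest_unique_substrings("aatcc"))  # (1, 1)
```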

Grabbing the most duplicated letter from a string

What I want to accomplish is an algorithm that finds the most duplicated letter in the entire list of strings. I'm new to Python, so it's taken me roughly two hours to get to this stage. The problem with my current code is that it returns every duplicated letter, when I'm only looking for the most duplicated letter. Additionally, I would like to know of a faster way that doesn't use two for loops.
Code:
rock_collections = ['aasdadwadasdadawwwwwwwwww', 'wasdawdasdasdAAdad', 'WaSdaWdasSwd', 'daWdAWdawd', 'QaWAWd', 'fAWAs', 'fAWDA']
seen = []
dupes = []
for words in rock_collections:
    for letter in words:
        if letter not in seen:
            seen.append(letter)
        else:
            dupes.append(letter)
print(dupes)

If you are looking for the letter which appears the greatest number of times, I would recommend the following code:
def get_popular(strings):
    full = ''.join(strings)
    unique = list(set(full))
    return max(
        list(zip(unique, map(full.count, unique))), key=lambda x: x[1]
    )
rock_collections = [
    'aasdadwadasdadawwwwwwwwww',
    'wasdawdasdasdAAdad',
    'WaSdaWdasSwd',
    'daWdAWdawd',
    'QaWAWd',
    'fAWAs',
    'fAWDA'
]
print(get_popular(rock_collections)) # ('d', 19)
Let me break down the code for you:
full contains all of the strings joined together with nothing between them. set(full) produces a set, meaning that it contains every unique letter only once. list(set(full)) turns this back into a list; the order is arbitrary, but it is fixed, so map and zip see the letters in the same order.
map(full.count, unique) iterates over each of the unique letters and counts how many times it appears in the full string. zip(unique, ...) pairs those counts with their respective letters. key=lambda x: x[1] tells max not to compare the tuples as a whole but to compare their second elements (the number of times each letter appears). max then returns the most common letter together with its count.
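An alternative that scans the joined text only once is collections.Counter from the standard library. This is a sketch rather than a drop-in replacement; the name most_common_letter is made up for illustration, and like the code above it treats upper- and lower-case letters as different.

```python
from collections import Counter

def most_common_letter(strings):
    """Return (letter, count) for the most frequent character across all strings."""
    counts = Counter(''.join(strings))
    return counts.most_common(1)[0]

rock_collections = [
    'aasdadwadasdadawwwwwwwwww',
    'wasdawdasdasdAAdad',
    'WaSdaWdasSwd',
    'daWdAWdawd',
    'QaWAWd',
    'fAWAs',
    'fAWDA',
]
print(most_common_letter(rock_collections))  # ('d', 19)
```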

list and dictionary: which one is faster

I have the following pieces of code doing the sorting of a list by swapping pairs of elements:
# Complete the minimumSwaps function below.
def minimumSwaps(arr):
    counter = 0
    val_2_indx = {val: arr.index(val) for val in arr}
    for indx, x in enumerate(arr):
        if x != indx+1:
            arr[indx] = indx+1
            s_indx = val_2_indx[indx+1]
            arr[s_indx] = x
            val_2_indx[indx+1] = indx
            val_2_indx[x] = s_indx
            counter += 1
    return counter
def minimumSwaps(arr):
    temp = [0] * (len(arr) + 1)
    for pos, val in enumerate(arr):
        temp[val] = pos
    swaps = 0
    for i in range(len(arr)):
        if arr[i] != i+1:
            swaps += 1
            t = arr[i]
            arr[i] = i+1
            arr[temp[i+1]] = t
            temp[t] = temp[i+1]
            temp[i+1] = i
    return swaps
The second function works much faster than the first one. However, I was told that dictionary is faster than list.
What's the reason here?

A list is a data structure, and a dictionary is a data structure. It doesn't make sense to say one is "faster" than the other, any more than you can say that an apple is faster than an orange. One might grow faster, you might be able to eat the other one faster, and they might both fall to the ground at the same speed when you drop them. It's not the fruit that's faster, it's what you do with it.
If your problem is that you have a sequence of strings and you want to know the position of a given string in the sequence, then consider these options:
1. You can store the sequence as a list. Finding the position of a given string using the .index method requires a linear search, iterating through the list in O(n) time.
2. You can store a dictionary mapping strings to their positions. Finding the position of a given string requires looking it up in the dictionary, in O(1) time.
So it is faster to solve that problem using a dictionary.
But note also that in your first function, you are building the dictionary using the list's .index method - which means doing n linear searches each in O(n) time, building the dictionary in O(n^2) time because you are using a list for something lists are slow at. If you build the dictionary without doing linear searches, then it will take O(n) time instead:
val_2_indx = { val: i for i, val in enumerate(arr) }
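To see the difference between the two ways of building the dictionary, a rough timing sketch follows. The helper names and the list size of 2000 are arbitrary choices for illustration, and the absolute timings depend on the machine, so no expected numbers are shown.

```python
import random
import timeit

def build_with_index(values):
    # one linear search per element: O(n^2) overall
    return {val: values.index(val) for val in values}

def build_with_enumerate(values):
    # a single pass over the list: O(n) overall
    return {val: i for i, val in enumerate(values)}

arr = list(range(1, 2001))
random.shuffle(arr)

# Both constructions give the same mapping when the values are distinct.
assert build_with_index(arr) == build_with_enumerate(arr)

print("index():    ", timeit.timeit(lambda: build_with_index(arr), number=5))
print("enumerate():", timeit.timeit(lambda: build_with_enumerate(arr), number=5))
```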
But now consider a different problem. You have a sequence of numbers, and they happen to be the numbers from 1 to n in some order. You want to be able to look up the position of a number in the sequence:
1. You can store the sequence as a list. Finding the position of a given number requires linear search again, in O(n) time.
2. You can store them in a dictionary like before, and do lookups in O(1) time.
3. You can store the inverse sequence in a list, so that lst[i] holds the position of the value i in the original sequence. This works because every permutation is invertible. Now getting the position of i is a simple list access, in O(1) time.
This is a different problem, so it can take a different amount of time to solve. In this case, both the list and the dictionary allow a solution in O(1) time, but it turns out it's more efficient to use a list. Getting by key in a dictionary has a higher constant time than getting by index in a list, because getting by key in a dictionary requires computing a hash, and then probing an array to find the right index. (Getting from a list just requires accessing an array at an already-known index.)
This second problem is the one in your second function. See this part:
temp = [0] * (len(arr) + 1)
for pos, val in enumerate(arr):
temp[val] = pos
This creates a list temp, where temp[val] = pos whenever arr[pos] == val. This means the list temp is the inverse permutation of arr. Later in the code, temp is used only to get these positions by index, which is an O(1) operation and happens to be faster than looking up a key in a dictionary.
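As a small illustration of the inverse permutation, the sketch below builds it for a sample list and checks that it agrees with the dictionary version. The function name inverse_permutation and the sample values are assumptions made for the example.

```python
def inverse_permutation(arr):
    """Return inv where inv[value] is the index of value in arr (values are 1..n)."""
    inv = [0] * (len(arr) + 1)  # slot 0 is unused because values start at 1
    for pos, val in enumerate(arr):
        inv[val] = pos
    return inv

arr = [4, 3, 1, 2]
inv = inverse_permutation(arr)
positions = {val: pos for pos, val in enumerate(arr)}

# The list and the dictionary answer the same "where is value v?" question.
assert all(inv[val] == positions[val] for val in arr)
print(inv)  # [0, 2, 3, 1, 0]
```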

Finding palindrome of sum

Here's the problem statement:
For a given number num, perform:
1. Add num and reverse(num).
2. Check whether the sum is a palindrome. If it is not, repeat with the sum as the new num.
Here is my solution. The program seems to work for the 3 test cases given, but when I execute it against 1 private test case I get a server timeout error. Is my program not efficient?
flag=0
iteration=0
num = 195 # sample no. output produced is 9339 which is a palindrome.
while(flag!=1):
    #print("iteration ",iteration)
    num_rev= int(str(num)[::-1]) #finding rev of number
    #print(num_rev)
    total= num+num_rev #adding no and no_rev
    #print(total)
    total_rev= int((str(total))[::-1]) # finding total rev
    iteration=iteration+1
    if total==total_rev: #if equal, printing palindrome
        print("palindrome")
        flag=1
    else:
        num=total #else the new no becomes sum of old num and old_rev

Begin by profiling your code for each of your test cases to find out what is taking so long. I suggest using Jupyter Notebook's code profiling magic. Paste your code into a Jupyter Notebook cell and begin the cell with %%prun. Along with other information, it will return a list of function calls, the number of times each function is called, and the amount of time each function takes to run. Compare the run time of your test cases to your server's timeout limit.
If the problem is not with your code and its evolving complexity as the number increases in size, it may be a problem with your server.
...
Note that a palindrome must contain at least 2 or 3 elements depending on your accepted definition.
Consider that one of your test cases may be venturing into arbitrarily large integers that take an unpredictable amount of time to compute. Consider that this compute time may exceed your server's timeout limit.
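One way to guard against such inputs is to cap the number of reverse-and-add steps. The sketch below is a hedged illustration: the function name and the default cap of 100 are arbitrary choices, not part of the original problem statement.

```python
def reverse_and_add(num, max_iterations=100):
    """Return (palindrome, iterations), or None if the cap is reached first."""
    for iteration in range(1, max_iterations + 1):
        num += int(str(num)[::-1])       # add the number to its reverse
        if str(num) == str(num)[::-1]:   # palindrome check on the sum
            return num, iteration
    return None

print(reverse_and_add(195))  # (9339, 4)
print(reverse_and_add(196))  # None: no palindrome within 100 iterations
```

With a cap in place, the loop terminates even for seeds such as 196 that do not reach a palindrome within the limit.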
Furthermore, consider this modified version of your code, which caps the number of iterations per seed number (maxiter) and tries every seed number below max_num. It prints an ordered list of the seeds that reach a palindrome, together with their palindrome sums and the number of iterations required:
maxiter = 1000   # maximum reverse-and-add steps per seed
max_num = 10000  # seeds 0 .. max_num-1 are tried
min_digits = 2   # shortest sum accepted as a palindrome; use 3 for the stricter definition
palindromes = []
for seed in range(max_num):
    num = seed
    for iteration in range(1, maxiter + 1):
        total = num + int(str(num)[::-1])
        text = str(total)
        if len(text) >= min_digits and text == text[::-1]:
            palindromes.append([seed, total, iteration])
            break
        num = total  # not a palindrome yet: continue with the sum
print(palindromes)
The results are fascinating, by the way. Hopefully this helps.
