list and dictionary: which one is faster - python-3.x

I have the following pieces of code doing the sorting of a list by swapping pairs of elements:
# Complete the minimumSwaps function below.
def minimumSwaps(arr):
    counter = 0
    val_2_indx = {val: arr.index(val) for val in arr}
    for indx, x in enumerate(arr):
        if x != indx+1:
            arr[indx] = indx+1
            s_indx = val_2_indx[indx+1]
            arr[s_indx] = x
            val_2_indx[indx+1] = indx
            val_2_indx[x] = s_indx
            counter += 1
    return counter

def minimumSwaps(arr):
    temp = [0] * (len(arr) + 1)
    for pos, val in enumerate(arr):
        temp[val] = pos
    swaps = 0
    for i in range(len(arr)):
        if arr[i] != i+1:
            swaps += 1
            t = arr[i]
            arr[i] = i+1
            arr[temp[i+1]] = t
            temp[t] = temp[i+1]
            temp[i+1] = i
    return swaps
The second function runs much faster than the first one. However, I was told that a dictionary is faster than a list.
What's the reason here?

A list is a data structure, and a dictionary is a data structure. It doesn't make sense to say one is "faster" than the other, any more than you can say that an apple is faster than an orange. One might grow faster, you might be able to eat the other one faster, and they might both fall to the ground at the same speed when you drop them. It's not the fruit that's faster, it's what you do with it.
If your problem is that you have a sequence of strings and you want to know the position of a given string in the sequence, then consider these options:
You can store the sequence as a list. Finding the position of a given string using the .index method requires a linear search, iterating through the list in O(n) time.
You can store a dictionary mapping strings to their positions. Finding the position of a given string requires looking it up in the dictionary, in O(1) time.
So it is faster to solve that problem using a dictionary.
But note also that in your first function, you are building the dictionary using the list's .index method - which means doing n linear searches each in O(n) time, building the dictionary in O(n^2) time because you are using a list for something lists are slow at. If you build the dictionary without doing linear searches, then it will take O(n) time instead:
val_2_indx = { val: i for i, val in enumerate(arr) }
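As a small illustration of the two ways of building the dictionary (a sketch only; the sample list and the names slow and fast are made up for this example), both give the same mapping when the values are distinct, but the first does a linear search per element:

```python
arr = [4, 3, 1, 2]  # hypothetical sample input: a permutation of 1..4

# O(n^2): every arr.index(val) call scans the list from the start.
slow = {val: arr.index(val) for val in arr}

# O(n): enumerate hands over each position while iterating once.
fast = {val: i for i, val in enumerate(arr)}

print(slow == fast)
```

Note that the two only agree when all values are distinct: with duplicates, .index returns the first position while the enumerate version keeps the last one.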
But now consider a different problem. You have a sequence of numbers, and they happen to be the numbers from 1 to n in some order. You want to be able to look up the position of a number in the sequence:
You can store the sequence as a list. Finding the position of a given number requires linear search again, in O(n) time.
You can store them in a dictionary like before, and do lookups in O(1) time.
You can store the inverse sequence in a list, so that lst[i] holds the position of the value i in the original sequence. This works because every permutation is invertible. Now getting the position of i is a simple list access, in O(1) time.
This is a different problem, so it can take a different amount of time to solve. In this case, both the list and the dictionary allow a solution in O(1) time, but it turns out it's more efficient to use a list. Getting by key in a dictionary has a higher constant time than getting by index in a list, because getting by key in a dictionary requires computing a hash, and then probing an array to find the right index. (Getting from a list just requires accessing an array at an already-known index.)
This second problem is the one in your second function. See this part:
temp = [0] * (len(arr) + 1)
for pos, val in enumerate(arr):
    temp[val] = pos
This creates a list temp, where temp[val] = pos whenever arr[pos] == val. This means the list temp is the inverse permutation of arr. Later in the code, temp is used only to get these positions by index, which is an O(1) operation and happens to be faster than looking up a key in a dictionary.
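If you want to check the constant-factor difference yourself, here is a rough sketch using timeit. The sample list and the names positions_dict and positions_list are assumptions made for this example, and the timings will vary by machine and Python version, so treat the numbers as indicative only:

```python
import timeit

arr = [4, 3, 1, 2]  # hypothetical sample input: a permutation of 1..4

# Dictionary mapping each value to its position.
positions_dict = {val: pos for pos, val in enumerate(arr)}

# Inverse permutation stored in a list (index 0 is unused).
positions_list = [0] * (len(arr) + 1)
for pos, val in enumerate(arr):
    positions_list[val] = pos

# Both structures answer "where is value v?" identically.
assert all(positions_dict[v] == positions_list[v] for v in arr)

dict_time = timeit.timeit(lambda: positions_dict[3], number=100_000)
list_time = timeit.timeit(lambda: positions_list[3], number=100_000)
print(f"dict lookup: {dict_time:.4f}s, list lookup: {list_time:.4f}s")
```

The list version only works because the keys happen to be small consecutive integers; for arbitrary keys such as strings, the dictionary is still the right tool.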

Related

Sorting algorithm

I want to make my algorithm more efficient by deleting the items it has already sorted, but I don't know how to do it efficiently. The only way I found was to rewrite the whole list.
l = []  # Here you put your list
sl = []  # this is to store the list when it is sorted
a = 0  # variable to store which numbers it has already looked for
while True:  # loop
    if len(sl) == len(l):  # if their sizes match, it will stop
        print(sl)  # print the sorted list
        break
    a = a + 1
    if a in l:  # check if it is in the list
        sl.append(a)  # add to sorted list
        # here I want it to be deleted from the list.
The variable a is a little awkward. It starts at 0 and is incremented one step at a time until it matches an element of the list l.
Imagine if l = [1000000, 1200000, -34]. Then your algorithm will first run for 1000000 iterations without doing anything, just incrementing a from 0 to 1000000. Then it will append 1000000 to sl. Then it will run again 200000 iterations without doing anything, just incrementing a from 1000000 to 1200000.
And then it will keep incrementing a looking for the number -34, which is below zero...
I understand the idea behind your variable a is to select the elements from l in order, starting from the smallest element. There is a function that does that: it's called min(). Try using that function to select the smallest element from l, and append that element to sl. Then delete this element from l; otherwise, the next call to min() will select the same element again instead of selecting the next smallest element.
Note that min() has a disadvantage: it returns the value of the smallest element, but not its position in the list. So it's not completely obvious how to delete the element from l after you've found it with min(). An alternative is to write your own function that returns both the element, and its position. You can do that with one loop: in the following piece of code, i refers to a position in the list (0 is the position of the first element, 1 the position of the second, etc) and a refers to the value of that element. I left blanks and you have to figure out how to select the position and value of the smallest element in the list.
....
for i, a in enumerate(l):
    if ...:
        ...
...
If you managed to do all this, congratulations! You have implemented "selection sort". It's a well-known sorting algorithm. It is one of the simplest. There exist many other sorting algorithms.
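If you would like something to compare your own attempt against afterwards, here is one possible way to fill in the blanks. It is only a sketch: the function name selection_sort and the variables remaining, best_pos and best_val are my own choices, and it works on a copy so the input list is left untouched.

```python
def selection_sort(items):
    remaining = list(items)  # copy, so the caller's list is not modified
    result = []
    while remaining:
        # Find the position and value of the smallest remaining element.
        best_pos, best_val = 0, remaining[0]
        for i, a in enumerate(remaining):
            if a < best_val:
                best_pos, best_val = i, a
        result.append(best_val)
        del remaining[best_pos]  # remove it so it is not selected again
    return result

print(selection_sort([1000000, 1200000, -34]))
```

Unlike the counting loop in the question, the number of steps here depends only on the length of the list, not on how large the values are, and negative numbers are handled.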

How to find length of shortest unique substring and number of occurrences of all unique substrings of same length in a given string

The problem is to find the length of the shortest unique substring and the number of unique substrings of that length occurring in the string. For example, "aatcc" has "t" as its shortest unique substring, whose length is 1, so the output is 1,1. Another example is "aacc": here the output is 2,3, as the unique substrings are aa, ac and cc.
I tried to solve it but could come up only with a brute Force solution which is to loop over all possible substrings. It exceeded the time limit.
I googled it and found some references to suffix array but not quite clear about it.
So what is the optimal solution for this problem?
EDIT: I forgot to mention a key requirement for this problem: the solution must NOT use any library functions other than input and output functions for reading from standard input and writing to standard output.
EDIT: I have found another solution using trie data structure.
Pseudocode:
for i from 1 to length(string) do
    for j from 0 to length(string)-1 do
        1. create a substring of length i from jth character
        2. if checkIfSeen(substring) then count-- else count++
    close inner for loop
    if count >= 1 then break
close outer for loop
print i(the length of the unique substring), count (no. of such substrings)
checkIfSeen(Substring) will use a trie data structure which will run O(log l) where l is the average length of the prefixes.
The time complexity of this algorithm would be O(n^2 log l) where if the average length of the prefixes is n/2 then the time complexity would be O(n^2 log n). Please point out the mistakes if there are and also ways to improve this running time if possible.
Keep in mind that my answer is based on a program I wrote in Python, but it can be applied to any programming language :)
I believe the brute force approach is indeed what you need for this problem. But here is what we can do to shorten the running time:
1: Start the brute force from the smallest substring length, which is 1.
2: After looping through the string with substring length 1 (the data will look something like {"a":2, "t":1, "c":2} for "aatcc"), check if any substring appeared only once. If it did, count the occurrences by looping through the dictionary (in the example you gave, "t" only appeared once, so the occurrence count is 1).
3: After the occurrences are counted, break the loop so that it does not waste time counting the bigger substrings.
4: In step 2, if no unique substring was found, reset the dictionary and try a bigger substring (the data can be something like {"aa": 1, "ac": 1, "cc": 1} for "aacc"). Eventually a unique substring WILL be found no matter what (for example, in the string "aaaaa", the unique substring is "aaaaa" with the data {"aaaaa":1}).
Here is the implementation in Python:
def countString(string):
    for i in range(1, len(string)+1):  # start the brute force from string length 1
        dictionary = {}
        for j in range(len(string)-i+1):  # check every substring of length i
            # count the substring occurrences
            try:
                dictionary[string[j:j+i]] += 1
            except KeyError:  # first time this substring is seen
                dictionary[string[j:j+i]] = 1
        isUnique = False  # the search stops if isUnique is True
        occurrence = 0
        for key in dictionary:  # iterate through the dictionary
            if dictionary[key] == 1:  # check if any substring is unique
                # if found, get ready to return and increase the occurrence
                isUnique = True
                occurrence += 1
        if isUnique:
            return (i, occurrence)

print(countString("aacc"))  # prints (2, 3)
print(countString("aatcc"))  # prints (1, 1)
I am pretty sure that this design is fairly fast, but there always should be a better way. But anyway, I hope this helped :)
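For comparison, the trie idea mentioned in the question could look roughly like the sketch below. This is not the asker's pseudocode: the function name shortest_unique_substrings and the [count, children] node layout are assumptions for illustration. It inserts every suffix into a trie while counting how often each prefix (that is, each substring) occurs, then scans the trie level by level for the first depth that contains a node with count 1. It uses only built-in dicts and lists, but it still needs O(n^2) time and memory in the worst case.

```python
def shortest_unique_substrings(text):
    # Each trie node is a two-element list: [occurrence count, children dict].
    root = [0, {}]
    for start in range(len(text)):
        node = root
        for ch in text[start:]:  # insert the suffix starting at `start`
            child = node[1].get(ch)
            if child is None:
                child = [0, {}]
                node[1][ch] = child
            child[0] += 1  # one more occurrence of this substring
            node = child

    # Walk the trie one level at a time; depth equals substring length.
    level = [root]
    depth = 0
    while level:
        depth += 1
        next_level = [child for node in level for child in node[1].values()]
        unique = sum(1 for node in next_level if node[0] == 1)
        if unique:
            return depth, unique
        level = next_level
    return 0, 0  # empty input: no substrings at all

print(shortest_unique_substrings("aacc"))
print(shortest_unique_substrings("aatcc"))
```

Whether this beats the dictionary version in practice depends on the input; for short strings the dictionary approach is likely simpler and just as fast.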

Choosing minimum numbers from a given list to give a sum N( repetition allowed)

How can I find the minimum number of elements, taken from a list with repetition allowed, that sum to a given number N?
For example, if list = [1,3,7,4] and N = 14, the function should return 2, as 7+7 = 14.
Again, if N = 11, the function should return 2, as 7+4 = 11. I think I have figured out the algorithm, but I am unable to implement it in code.
Pls use Python, as that is the only language I understand(at present)
Sorry!!!
Since you mention dynamic programming in your question, and you say that you have figured out the algorithm, I will just include an implementation of the basic tabular method written in Python without too much theory.
The idea is to build a table that holds every intermediate value we need, so that the same computation is never done more than once.
For every target value from 1 up to the one requested, the basic formula tries each number in the list and keeps the smallest count of addends that reaches that target.
It should work, but you can of course make some optimizations, like sorting the list and/or finding divisors, in order to construct a smaller table and terminate faster.
Here is the code:
import sys

# num_list: list of numbers
# value: value for which we want to get the minimum number of addends
def min_sum(num_list, value):
    list_len = len(num_list)
    # We will use the typical dynamic programming table construct:
    # the index into the list is the sum we want to reach,
    # and the stored value is the minimum number of items to sum.
    # Base case: a sum of 0 needs zero items, so the first element is 0.
    value_table = [0]
    # Initialize all other table values to MAX.
    # The range uses value+1 because Python's range excludes the end number.
    for i in range(1, value+1):
        value_table.append(sys.maxsize)
    # Try every sum that is smaller than or equal to <value>.
    for i in range(1, value+1):
        for j in range(0, list_len):
            if num_list[j] <= i:
                tmp = value_table[i-num_list[j]]
                if tmp != sys.maxsize and tmp + 1 < value_table[i]:
                    value_table[i] = tmp + 1
    return value_table[value]

## TEST ##
num_list = [1,3,16,5,3]
value = 22
print("Min Sum: ", min_sum(num_list, value))  # Outputs 3
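If you also want to know which numbers were chosen, the same table can remember the last number used for each sum. The following is only a sketch under my own assumptions: the name min_addends, the best and last tables, and returning None for an unreachable target are not part of the code above.

```python
def min_addends(num_list, target):
    INF = float("inf")
    best = [0] + [INF] * target   # best[s]: fewest addends summing to s
    last = [None] * (target + 1)  # last[s]: last number used to reach s
    for total in range(1, target + 1):
        for num in num_list:
            # Only positive numbers that fit are useful here.
            if 0 < num <= total and best[total - num] + 1 < best[total]:
                best[total] = best[total - num] + 1
                last[total] = num
    if best[target] == INF:
        return None  # the target cannot be reached with these numbers
    # Walk backwards through `last` to recover the chosen numbers.
    chosen = []
    while target > 0:
        chosen.append(last[target])
        target -= last[target]
    return chosen

print(min_addends([1, 3, 7, 4], 14))
print(min_addends([1, 3, 7, 4], 11))
```

The length of the returned list is the count asked for in the question; for the two examples given there it should have two elements each.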
It would be helpful if you included your algorithm in pseudocode - it will look very much like Python :-)
Another aspect: your first operation is a multiplication of one item from the list (7) with a number outside the list (2), whereas your second operation is 7+4, where both values are in the list.
Is there a limitation on which operations or which items may be used (from inside or outside the list)?

Efficiently Perform Nested Dictionary Lookups and List Appending Using Numpy Nonzero Indices

I have working code to perform a nested dictionary lookup and append results of another lookup to each key's list using the results of numpy's nonzero lookup function. Basically, I need a list of strings appended to a dictionary. These strings and the dictionary's keys are hashed at one point to integers and kept track of using separate dictionaries with the integer hash as the key and the string as the value. I need to look up these hashed values and store the string results in the dictionary. It's confusing so hopefully looking at the code helps. Here's a simplified version of code:
for key in ResultDictionary:
    ResultDictionary[key] = []
true_indices = np.nonzero(numpy_array_of_booleans)
for idx in range(0, len(true_indices[0])):
    ResultDictionary.get(HashDictA.get(true_indices[0][idx])).append(HashDictB.get(true_indices[1][idx]))
This code works for me, but I am hoping there's a way to improve the efficiency. I am not sure if I'm limited due to the nested lookup. The speed is also dependent on the number of true results returned by the nonzero function. Any thoughts on this? Appreciate any suggestions.
Here are two suggestions:
1) since your hash dicts are keyed with ints it might help to transform them into arrays or even lists for faster lookup if that is an option.
k, v = map(list, (HashDictB.keys(), HashDictB.values()))
mxk, mxv = max(k), max(map(len, v))  # largest key and longest string length
lookupB = np.empty((mxk+1,), dtype=f'U{mxv}')
lookupB[k] = v
2) you probably can save a number of lookups in ResultDictionary and HashDictA by processing your numpy_array_of_booleans row-wise:
i, j = np.where(numpy_array_of_booleans)
bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
ResultDict = {HashDictA[i[l]]: [HashDictB[jj] for jj in j[l:r]] for l, r in zip(bnds[:-1], bnds[1:])}
2b) if for some reason you need to incrementally add associations you could do something like (I'll shorten variable names for that)
from operator import itemgetter
res = {}
def add_batch(data, res, hA, hB):
    i, j = np.where(data)
    bnds, = np.where(np.r_[True, i[:-1] != i[1:], True])
    for l, r in zip(bnds[:-1], bnds[1:]):
        if l+1 == r:
            res.setdefault(hA[i[l]], set()).add(hB[j[l]])
        else:
            res.setdefault(hA[i[l]], set()).update(itemgetter(*j[l:r])(hB))
You can't do much about the dictionary lookups - you have to do those one at a time.
You can clean up the array indexing a bit:
idxes = np.argwhere(numpy_array_of_booleans)
for i, j in idxes:
    ResultDictionary.get(HashDictA.get(i)).append(HashDictB.get(j))
argwhere is transpose(nonzero(...)), turning the tuple of arrays into a (n,2) array of index pairs. I don't think this makes a difference in speed, but the code is cleaner.

Generating a mutation frequency on a DNA Strand using Python

I would like to input a DNA sequence and make some sort of generator that yields sequences that have a certain frequency of mutations. For instance, say I have the DNA strand "ATGTCGTCACACACCGCAGATCCGTGTTTGAC", and I want to create mutations with a T->A frequency of 5%. How would I go about creating this? I know that creating random mutations can be done with code like this:
import random
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)
But what I am truly not sure how to do is make a fixed mutation frequency. Anybody know how to do that? Thanks.
EDIT:
So should the formatting look like this if I'm using a byte array, because I'm getting an error:
import random
dna = "ATGTCGTACGTTTGACGTAGAG"
def mutate(dna, mutation, threshold):
    dna = bytearray(dna) #if you don't want to modify the original
    for index in range(len(dna)):
        if dna[index] in mutation and random.random() < threshold:
            dna[index] = mutation[char]
    return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
error: "TypeError: string argument without an encoding"
EDIT 2:
import random
myDNA = bytearray("ATGTCGTCACACACCGCAGATCCGTGTTTGAC")
def mutate(dna, mutation, threshold):
    dna = myDNA # if you don't want to modify the original
    for index in range(len(dna)):
        if dna[index] in mutation and random.random() < threshold:
            dna[index] = mutation[char]
    return dna
mutate(dna, {"A": "T"}, 0.05)
print("my dna now:", dna)
yields an error
You asked me about a function that prints all possible mutations, so here it is. The number of outputs grows exponentially with the length of your input, so the function only prints the possibilities and does not store them (that could consume a great deal of memory). I wrote a recursive function, which should not be used with very large input. I will also add a non-recursive function that does not run into Python's recursion limit.
def print_all_possibilities(dna, mutations, index = 0, print = print):
    if index < 0: return #invalid value for index
    while index < len(dna):
        if chr(dna[index]) in mutations:
            print_all_possibilities(dna, mutations, index + 1)
            dnaCopy = bytearray(dna)
            dnaCopy[index] = ord(mutations[chr(dna[index])])
            print_all_possibilities(dnaCopy, mutations, index + 1)
            return
        index += 1
    print(dna.decode("ascii"))

# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This works for me on python 3, I also can explain the code if you want.
Note: This function requires a bytearray as given in the function test.
Explanation:
This function searches for a place in dna where a mutation can happen. It starts at index, so it normally begins at 0 and goes to the end. That is what the while-loop, which increases index every time it is executed, is for (it's basically a normal iteration like a for loop). If the function finds a place where a mutation can happen (if chr(dna[index]) in mutations:), then it copies the dna and lets the copy mutate (dnaCopy[index] = ord(mutations[chr(dna[index])])). Note that a bytearray is an array of numeric values, so I use chr and ord all the time to convert between string and int. After that the function is called again to look for more possible mutations in both versions of the dna, the original and the mutated copy. Both calls skip the part that has already been scanned, so they begin at index + 1. From that point on, the job of printing has been handed over to those two recursive calls to print_all_possibilities, so the current call has nothing left to do and ends with return. If we don't find any more possible mutations, we print the dna ourselves, because no further call is made and so no one else would do it.
It may sound complicated, but it is a more or less elegant solution. Also, to understand a recursion you have to understand a recursion, so don't bother yourself if you don't understand it for now. It could help if you try this out on a sheet of paper: Take an easy dna string "TTATTATTA" with the possible mutation "A" -> "T" (so we have 8 possible mutations) and do this: Go through the string from left to right and if you find a position, where the sequence can mutate (here it is just the "A"'s), write this string down again, this time let the string mutate at the given position, so that your second string is slightly different from the original. In the original and the copy, mark how far you came (maybe put a "|" after the letter you let mutate) and repeat this procedure with the copy as new original. If you don't find any possible mutation, then underline the string (This is the equivalent to printing it). At the end you should have 8 different strings all underlined. I hope that can help to understand it.
EDIT: Here is the non-recursive function:
def print_all_possibilities(dna, mutations, printings = -1, print = print):
    mut_possible = []
    for index in range(len(dna)):
        if chr(dna[index]) in mutations: mut_possible.append(index)
    if printings < 0: printings = 1 << len(mut_possible)
    for number in range(min(printings, 1 << len(mut_possible))):
        dnaCopy = bytearray(dna) # don't change the original
        counter = 0
        while number:
            if number & (1 << counter):
                index = mut_possible[counter]
                dnaCopy[index] = ord(mutations[chr(dna[index])])
                number &= ~(1 << counter)
            counter += 1
        print(dnaCopy.decode("ascii"))

# for testing
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"})
This function comes with an additional parameter, which can control the number of maximum outputs, e.g.
print_all_possibilities(bytearray(b"AAAATTTT"), {"A": "T"}, 5)
will only print 5 results.
Explanation:
If your dna has x possible positions where it can mutate, you have 2 ^ x possible mutations, because at every such place the dna can either mutate or not. This function finds all positions where your dna can mutate and stores them in mut_possible (that's the code of the first for-loop). Now mut_possible contains all positions where the dna can mutate, and so we have 2 ^ len(mut_possible) possible mutations (len(mut_possible) is the number of elements in mut_possible). I wrote 1 << len(mut_possible), which gives the same value using a bit shift.
If printings is a negative number, the function will print all possibilities and sets printings to the number of possibilities. If printings is positive but lower than the number of possibilities, then the function will print only printings mutations, because min(printings, 1 << len(mut_possible)) returns the smaller number, which is printings. Otherwise, the function will print all possibilities. The variable number then runs through range(...), so this loop, which prints one mutation each time, executes the desired number of times, and number increases by one every time (e.g., range(4) yields 0, 1, 2, 3).
Next we use number to create a mutation. To understand this step you have to understand binary numbers. If our number is 10, it is 1010 in binary. These bits tell us at which places we have to modify our copy of the dna (dnaCopy). Reading from the rightmost (lowest) bit: the first bit is a 0, so we don't modify the first position where a mutation can happen; the next bit is a 1, so we modify the second such position; after that there is a 0, and so on. To "read" the bits we use the variable counter. number & (1 << counter) returns a non-zero value if bit number counter is set, and in that case we modify our dna at the corresponding position where a mutation can happen. These positions are stored in mut_possible, so the position we want is mut_possible[counter]. After we have mutated our dna at that position, we set the bit to 0 to show that we have already handled it. That is done with number &= ~(1 << counter). After that we increase counter to look at the other bits. The while-loop only continues while number is not 0, that is, while number has at least one bit set (meaning at least one more position of the dna has to be modified). Once dnaCopy has been modified, the while-loop is finished and we print our result.
I hope these explanations could help. I see that you are new to python, so take yourself time to let that sink in and contact me if you have any further questions.
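As a side note, if you are fine with working on plain strings instead of a bytearray, the same "mutate or not at every position" idea can be sketched with itertools.product. This is an alternative I am suggesting, not the function above; the name all_mutations is made up, and it yields the variants lazily instead of printing them, so nothing is stored unless you collect the results yourself.

```python
from itertools import product

def all_mutations(dna, mutations):
    # For each character, list the choices: (original, mutated) if it can
    # mutate, otherwise just (original,).
    options = [(ch, mutations[ch]) if ch in mutations else (ch,) for ch in dna]
    # product picks one choice per position, covering every combination.
    for combo in product(*options):
        yield "".join(combo)

for variant in all_mutations("TTATTATTA", {"A": "T"}):
    print(variant)
```

With three A's in the sample string there are 2 ^ 3 = 8 variants, matching the count from the paper exercise above.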
After what I read this question seems easy to answer. The chance is high that I misunderstood something, so please correct me if I am wrong.
If you want a 5% chance of replacing a T with an A, then you should write
mutate(yourString, {"T": "A"}, 0.05)
The dictionary maps each base to its replacement, so {"A": "T"} would turn A into T instead. I also suggest using a bytearray instead of a string. A bytearray is similar to a string, except that it can only contain bytes (values from 0 to 255) while a string can contain many more characters; the advantage is that a bytearray is mutable. By using a bytearray you don't need to create your temporary list or join it at the end. If you do that, your code looks like this:
import random
def mutate(dna, mutation, threshold):
    if isinstance(dna, str):
        dna = bytearray(dna, "utf-8")
    else:
        dna = bytearray(dna)
    for index in range(len(dna)):
        if chr(dna[index]) in mutation and random.random() < threshold:
            dna[index] = ord(mutation[chr(dna[index])])
    return dna

dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA.decode("ascii")) #use decode to make newDNA a string
After all the stupid problems I had with the bytearray version, here is the version that operates on strings:
import random
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]
    return ''.join(dna)

dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.05)
print("DNA now:", newDNA)
If you use the string version with larger input, both the computation time and the memory used will grow. The bytearray version will be the better choice when you want to do this with much larger input.
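Both versions above mutate each eligible base independently, so the actual number of mutations varies from run to run. If "fixed mutation frequency" is meant as "exactly that fraction of the eligible bases gets mutated", one way to sketch it is to pick the positions with random.sample. The function name mutate_fixed, the rate and rng parameters, and the rounding rule are my assumptions about what is wanted:

```python
import random

def mutate_fixed(dna, mutation, rate, rng=random):
    chars = list(dna)
    # Positions whose base has an entry in the mutation mapping.
    eligible = [i for i, ch in enumerate(chars) if ch in mutation]
    # Assumption: mutate exactly round(rate * number of eligible positions).
    n_mut = round(rate * len(eligible))
    for i in rng.sample(eligible, n_mut):  # distinct random positions
        chars[i] = mutation[chars[i]]
    return "".join(chars)

dna = "ATGTCGTCACACACCGCAGATCCGTGTTTGAC"
print("DNA first:", dna)
print("DNA now:  ", mutate_fixed(dna, {"T": "A"}, 0.5))
```

Be aware that with a short strand and a small rate the rounded count can be 0: this example strand has 8 T's, so a 5% rate would round to no mutations at all. Passing a seeded random.Random instance as rng makes the result reproducible.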
