Efficient search for collisions in multiple lists - python-3.x

I have multiple lists with data of the form (this is a simple example; in reality the row-vectors have a much larger dimension):
list 1: [num1] [[1,0,0,1,0], [0,0,1,0,1], [0,1,0,1,0], ...]
list 2: [num2] [[0,0,0,1,0], [1,0,0,1,0], [0,0,1,0,0], ...]
...
list n: [numn] [[1,1,0,1,0], [1,0,0,1,1], [0,0,1,0,1], ...]
Every list is marked with its own number [num] (the numbers are not repeated).
The main question is: how do I efficiently find all row-vectors that occur in more than one list, together with the num's of the lists containing them?
In detail:
For example, the row-vector [1,0,0,1,0] occurs in list 1 and list 2, so I should return [1,0,0,1,0] : [num1], [num2]
Hash tables are the first thing that comes to mind. I think they are the best fit given the large amount of data, but I know hash tables only superficially and I can't structure a clear algorithm in my head for this case. Can anyone advise what I should pay attention to and which modules I should consider? Perhaps there are other efficient approaches?

It is beyond the scope of a regular question to dive into hash tables and such. But suffice it to say that sets in Python are backed by hash tables, and checking for set membership is almost instantaneous and much more efficient than searching through lists.
If order doesn't matter within your list of vectors, you should just think of them as unordered collections (sets). Sets need to contain immutable things, so you cannot put a list into a set, but you can put in tuples. So, if you re-structure your data to be sets of tuples, you are in good shape.
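For example, a one-line restructuring sketch (lists_by_num here is a hypothetical dict mapping each num to its list of row-vectors):
data = {num: {tuple(vec) for vec in vecs} for num, vecs in lists_by_num.items()}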
There are many things you might then do with this; below are a few examples.
data = {1: {(1, 0, 0), (1, 1, 0)},
        2: {(0, 0, 0), (1, 0, 0)},
        3: {(1, 0, 0), (1, 0, 1), (1, 1, 0)}}

# find common vectors in 2 sets
def common_vecs(a, b):
    return a.intersection(b)

# find all the common vectors in a group of sets
def all_common_vecs(grps):
    return set.intersection(*grps)

# find which sets contain a specific vector
def find(vec, data):
    result = set()
    for idx, grp in data.items():
        if vec in grp:
            result.add(idx)
    return result

print(common_vecs(data[1], data[3]))
print(all_common_vecs(data.values()))
print(find((1, 0, 1), data))
Output:
{(1, 0, 0), (1, 1, 0)}
{(1, 0, 0)}
{3}
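To answer the original question directly (which vectors occur in more than one list, and in which lists), a common approach is an inverted index: a dict that maps each tuple to the set of list numbers containing it. A minimal sketch, reusing the data dict from above:

collisions = {}
for num, grp in data.items():
    for vec in grp:
        collisions.setdefault(vec, set()).add(num)

# keep only the vectors that occur in at least two lists
result = {vec: nums for vec, nums in collisions.items() if len(nums) > 1}
print(result)  # e.g. {(1, 0, 0): {1, 2, 3}, (1, 1, 0): {1, 3}}

The whole pass is linear in the total number of vectors, since each dict and set operation is expected O(1).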

Related

np.where issue above a certain value - numpy

I'm facing two issues in the following snippet using np.where (looking for indexes where A[:,0] is identical to B):
1. a Numpy error when n is above a certain value: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
2. it is quite slow
So I'm wondering what I'm missing and/or misunderstanding, how to fix it, and how to speed up the code. This is a basic example I've made to mimic my code, but in fact I'm dealing with arrays having (dozens of) millions of rows.
Thanks for your support
Paul
import numpy as np
import time

n = 100_000  # with n=10_000 it is ok, but quite slow
m = 2_000_000

# matrix A
# A = np.random.random((n, 4))
A = np.arange(1, 4*n + 1, dtype=np.uint64).reshape((n, 4), order='F')

# matrix B
B = np.random.randint(1, m + 1, size=m, dtype=np.uint64)
B = np.unique(B)  # duplicate values are generally generated, so the real size remains lower than m

# use of np.where
t0 = time.time()
ind = np.where(A[:, 0].reshape(-1, 1) == B)
# ind2 = np.where(B == A[:, 0].reshape(-1, 1))
t1 = time.time()
print(f"duration={t1-t0}")
In your current implementation, A[:, 0] is just
np.arange(1, n + 1, dtype=np.uint64)
And if you are interested only in the row indexes where A[:, 0] is in B, then you can get them like this:
row_indices = np.where(np.isin(A[:, 0], B))[0]
If you then want to select the rows of A with these indices, you don't even have to convert the boolean mask to index locations. You can just select the rows with the boolean mask: A[np.isin(A[:, 0], B)]
There are better ways to select random elements from an array. For example, you could use numpy.random.Generator.choice with replace=False. See also the question "Numpy: Get random set of rows from 2D array".
I feel there is almost certainly a better way to do the whole thing that you are trying to do with these index locations.
I recommend you study the Numpy User Guide and the Pandas User Guide to see what cool things are available there.
Honestly, with your current implementation you don't even need the first column of A at all: A[:, 0] holds the values 1..n, so the row index of a match is simply the matching value minus one. Here:
row_indices = B[B <= n] - 1
row_indices.sort()
print(row_indices)
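For scale, a minimal self-contained sketch (same shapes as the question) of the np.isin route, which avoids materializing the n-by-len(B) comparison matrix that the broadcast == builds:

import numpy as np
import time

n, m = 100_000, 2_000_000
A = np.arange(1, 4*n + 1, dtype=np.uint64).reshape((n, 4), order='F')
B = np.unique(np.random.randint(1, m + 1, size=m, dtype=np.uint64))

t0 = time.time()
mask = np.isin(A[:, 0], B)           # sort-based membership test, no huge temporary
row_indices = np.nonzero(mask)[0]    # same indices np.where would give
matching_rows = A[mask]              # or select the rows directly with the mask
t1 = time.time()
print(f"np.isin duration={t1 - t0}")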

Choose smallest value in dictionary

I have a dictionary given in the form {(i1,r1,m1):w, (i2,r2,m1):w, (i1,r1,m2):w, ...} where i is the activity, r the type of resource, m the mode, and w is the amount of resources of type r needed by activity i in mode m.
Now I would like to choose, for every activity, the mode that requires the least resources (w). If possible, the result should be a list of the form [(i,m),...] with one entry for every i.
My tutor suggested working with np.argmin(), but for that I have to convert the dictionary into an array. So I tried:
w_list = list(w.items())
w_array = np.array(w_list)
print(w_array)
array([[(0, 1, 1), 0],
       [(0, 2, 1), 0],
       [(1, 1, 1), 9],
       [(1, 2, 1), 0], ...
However, this array arrangement cannot be used for np.argmin.
Does anyone have any other idea how I can get the desired list mentioned above?
Here's one trivial non-numpy solution: simply create a new dictionary, and fill it with the mode and lowest cost per activity by iterating over the original dict:
w = {(i1, r1, m1): w1, (i2, r2, m1): w2, (i1, r1, m2): w3}  # your original dict

result = {}
for (activity, _, mode), requiredResources in w.items():
    if activity not in result or result[activity][1] > requiredResources:
        result[activity] = mode, requiredResources
Now result holds a mapping from i to a tuple of m and w for the lowest w. In case of ambiguous entries for some i, the first entry in the iteration order wins (since Python 3.7, dicts preserve insertion order, so this is the entry that was inserted first).
If you want to turn this into a list of i and m tuples, simply use a list comprehension:
resultList = [(k, v[0]) for k, v in result.items()]
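For instance, with made-up numeric keys (hypothetical values, purely to show the round trip):

w = {(1, 1, 1): 4, (1, 1, 2): 2, (2, 1, 1): 7, (2, 2, 2): 5}

result = {}
for (activity, _, mode), requiredResources in w.items():
    if activity not in result or result[activity][1] > requiredResources:
        result[activity] = mode, requiredResources

resultList = [(k, v[0]) for k, v in result.items()]
print(resultList)  # [(1, 2), (2, 2)]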
An observation on the side: when confronted with any Python problem, some people instantly recommend using numpy or similar libraries. IMO this is simply an expression of their own inexperience or ignorance - in many cases numpy is not just unnecessary, but actively detrimental if you don't know what you're doing.
If you intend to work seriously with Python, you would do well to first master the basics of the language (functions, classes, lists, dictionaries, loops, comprehensions, basic variable scoping rules), and get a rough overview of the vast Python standard library - at least enough to know how to look up whether something you need is readily available in some built-in module. Then the next time you need some functionality, you will be better equipped to decide whether it is something you can easily implement yourself (potentially with help from the standard lib), or whether it makes sense to use functionality from external libraries such as numpy.

Delete a certain percentage of elements from a list while preserving the structural integrity of the list

I have a list of tuples, where each tuple contains a value and an object, as shown below:
[(1, object0), (1, object1), (0, object2), (5, object3)]
Now if I were to delete the 25% of tuples with the lowest value in the first entry, I would get:
[(1, object0), (1, object1), (5, object3)]
Here the structural integrity of the list is still intact while the elements have been deleted. I have explored options with a priority queue, but I want to know if there is a more efficient method. Please describe the time complexity of the solution.
I should elaborate that I am looking for the bottom 25% of the list by value. If the list from the previous example were sorted ascending, I would get:
[(0, object2), (1, object0), (1, object1), (5, object3)]
The lowest 25% would now be the first element, but the output should retain the structural integrity of the original list. The order of the result does not matter as long as it does not contain the bottom 25%, by value, of the original list.
Possible Solution
A possible solution that I believe would work uses a min-priority queue that holds the tuples as they are read from the list, while the same tuples are also added to a doubly linked list. Once all elements have been added to both, the first given percentage of elements is popped from the queue and deleted from the linked list. The linked list after all the deletes is the final answer.
import math as m

ls = [(1, 'object0'), (1, 'object1'), (0, 'object2'), (5, 'object3')]
p25 = m.ceil(len(ls) / 4)
ls25 = []  # current bottom-p25 candidates, kept as (value, obj) pairs
res = []   # tuples that survive the deletion
for i, j in ls:
    if len(ls25) < p25:
        ls25.append((i, j))
    elif i < max(ls25)[0]:
        res.append(ls25.pop(ls25.index(max(ls25))))  # evicted candidate survives
        ls25.append((i, j))
    else:
        res.append((i, j))
print(res)
output
[(1, 'object1'), (1, 'object0'), (5, 'object3')]
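For comparison, a shorter route with the same effect (a sketch using only the standard library): heapq.nlargest keeps the len(ls) - p25 largest entries by value in O(n log k) time, where k is the number of survivors:

import heapq
import math

ls = [(1, 'object0'), (1, 'object1'), (0, 'object2'), (5, 'object3')]
keep = len(ls) - math.ceil(len(ls) / 4)

# O(n log keep); stable for equal values, so earlier tuples win ties
res = heapq.nlargest(keep, ls, key=lambda t: t[0])
print(res)  # [(5, 'object3'), (1, 'object0'), (1, 'object1')]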

Python: difference between copying a variable and making two of them point to the same object

I have used Python for a long time but I don't know how objects and the memory really work.
Until a few days ago, I thought that alpha = gamma made a variable named alpha and saved a copy of gamma in it, without linking the two variables to each other. However, I have recently noticed that this is not what happens: both variables actually point to the same object. Nevertheless, the variables can become independent when you change the data in one of them (depending on the type involved).
There are many other cases in which variables don't behave like you would expect. This is an example I came upon:
>>> grid1=[[0]*4]*4
>>> grid2=[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]]
>>> grid1 == grid2
True
>>> grid1[2][3]+=1
>>> grid2[2][3]+=1
>>> grid1
[[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]]
>>> grid2
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
I have tried to find more information about how = and other commands treat variables and found some threads, but I still have many questions whose answers I don't know:
Why did the behavior shown above with the lists take place?
What should be done in order to avoid it (make grid1 behave like grid2)?
Does this have anything to do with the modifiability of the variables? What does it really mean for a variable to be modifiable?
When are variables the same object or different ones? How do you know if a command creates a separate variable or not (x+=y vs x = x + y, append vs +)?
Is there an == that would have returned False in the example above, because in grid1 all the inner lists were the same object in memory while in grid2 they were independent? (is wouldn't work: the two grids were created in different, independent steps, so they won't be the same object.)
I haven't been able to find the answers to those questions anywhere, could anyone give a brief answer to the questions or provide a reference which explained these concepts? Thanks.
Why did the behavior shown above with the lists take place?
Because assignment never copies an object, and [[0]*4]*4 creates one inner list and stores four references to it, so all four rows of grid1 are the same mutable list.
What should be done in order to avoid it (make grid1 behave like grid2)?
grid1 = [[0]*4 for _ in range(4)] would make it work as you want. This is because the comprehension actually creates a new inner list on each iteration, instead of repeating four references to the same one (like [[0]*4]*4 does).
Does this have anything to do with the modifiability of the variables? What does it really mean for a variable to be modifiable?
Strings, for example, are immutable, so when you do a = "hi"; b = a; b += "!", b is first bound to the same string as a, and then rebound to a brand-new string "hi!"; a is untouched.
Lists instead operate on the same object, so when you do a = []; b = a; b.append(1), b is bound to the same list as a, and appending through either name changes the one shared object.
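A quick interactive illustration of both cases:

>>> a = "hi"; b = a; b += "!"
>>> a, b
('hi', 'hi!')
>>> a = []; b = a; b.append(1)
>>> a, b
([1], [1])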
When are variables the same object or different ones? How do you know if a command creates a separate variable or not (x+=y vs x = x + y, append vs +)?
It depends more on the data structure than on the operator or method. For lists, x += y mutates x in place (like extend), while x = x + y builds a new list and rebinds x; likewise append mutates the existing list, while + builds a new one.
Mutable types: list, set, dict.
Immutable types: tuple, frozenset, string.
Is there an == which would have returned false in the example above (is wouldn't work, those two variables were created in different steps and independently so they won't be in the same place in the memory) because in grid1 all lists were in the same place in the memory while in grid2 they were independent?
== evaluates equality of values (i.e. whether they have the same contents), while is evaluates whether both names refer to the same object. (Try testing == and is on two equal lists: in the first case a = [1]; b = [1], and in the second case a = [1]; b = a. == is True in both cases, is only in the second.)
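To make the row-sharing in grid1 visible, a short interactive check with is:

>>> grid1 = [[0]*4]*4
>>> grid1[0] is grid1[1]        # all four rows are the same object
True
>>> grid2 = [[0]*4 for _ in range(4)]
>>> grid2[0] is grid2[1]        # independent rows
False
>>> grid2[2][3] += 1            # only one row changes
>>> grid2
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]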

Python: Increasing one number in a list by one

I am trying to write a program that returns the frequency of a certain pattern. My frequency list is initially a list of zeros, and I want to increase a certain zero by one depending on the pattern. I have tried the code below, but it does not work.
FrequencyArray[j] = FrequencyArray[j]+1
Is there another way to increase one element of the list by 1 without affecting the other elements?
While your approach should work, this would be the alternative:
FrequencyArray[j] += 1
Example:
>>> zeros = [0, 0, 0]
>>> zeros[1] += 1
>>> zeros
[0, 1, 0]
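On a side note, since the underlying goal is counting pattern frequencies, the standard library's collections.Counter is a common alternative worth knowing (a sketch with made-up patterns):

from collections import Counter

patterns = ['AB', 'BA', 'AB', 'AA']   # hypothetical observed patterns
freq = Counter(patterns)              # tallies each distinct pattern
print(freq['AB'])  # 2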
