Python: Faster way to filter a list using list comprehension

Consider the following problem: I want to keep the elements of list1 that belong to list2. So I can do something like this:
filtered_list = [w for w in list1 if w in list2]
I need to repeat this same procedure for about 20000 different examples of list1, always against the same "constant" (frozen) list2.
How can I speed up the process?
I also know the following properties:
1) list1 has repeated elements, is not sorted, and has about 10000 (ten thousand) items.
2) list2 is a giant sorted list (about 200000 - two hundred thousand - entries) and each element is unique.
The first thing that comes to mind is that maybe I can use some kind of binary search. However, is there a way to do this in Python?
Furthermore, I do not mind whether filtered_list keeps the same order of items as list1. So maybe I can check only a deduplicated version of list1 and, after removing the elements that do not belong to list2, restore the repeated items.
Is there a fast way to do this in Python 3?

Convert list2 to a set:
# do once
set2 = set(list2)
# then every time
filtered_list = [w for w in list1 if w in set2]
x in list2 is a sequential (linear) scan; x in set2 uses hashing, the same mechanism as dictionaries, resulting in a very quick (average O(1)) lookup.
If list1 didn't have duplicates, converting both to sets and taking set intersection would be the way to go:
filtered_set = set1 & set2
but with duplicates you're stuck with iterating over list1 as above.
(As you said, you could instead compute the elements to delete with set1 - set2, but then you'd still be stuck with a loop to do the deleting. There shouldn't be any performance difference between filtering keepers and filtering trash: either way you have to iterate over list1, so it's no win over the method above.)
EDIT in response to comment: converting list1 to a Counter might (EDIT: or not; testing needed!) speed it up, if you can use it in that form throughout (i.e. you never have a list, you always just deal with a Counter). But if you have to preprocess list1 into counter1 each time you do the above operation, it's again no win - creating a Counter involves yet another loop.
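For reference, a minimal sketch of that Counter idea (filter_with_counter is just an illustrative name; whether it actually beats the plain comprehension depends on the duplicate ratio, so it needs benchmarking, as the EDIT says):
from collections import Counter

set2 = set(list2)  # build once, as above

def filter_with_counter(list1):
    # test each distinct element once, then re-expand the survivors;
    # note this does not preserve the original order of list1
    counts = Counter(list1)
    return [w for w, n in counts.items() if w in set2 for _ in range(n)]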

Related

Appending Elements From Left/Start of List/Array

I came across the problem of appending elements to an array from the left, and it has two solutions.
Solution 1:
List = [2,3,4,5,6]
List.insert(0, 1)  # using insert
# List is now [1,2,3,4,5,6]
Solution 2:
List = [2,3,4,5,6]
List = [1] + List  # concatenation of lists
# List is now [1,2,3,4,5,6]
I'm new to Python, so please can anyone explain the time complexity of both solutions? According to me, both solutions take O(n) - am I right or wrong?
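The thread doesn't answer this here, but the O(n) intuition is easy to check empirically with timeit. A rough sketch (numbers will vary by machine):
import timeit

setup = "lst = list(range(100000))"
t_insert = timeit.timeit("lst.insert(0, -1)", setup=setup, number=1000)
t_concat = timeit.timeit("new = [-1] + lst", setup=setup, number=1000)
print(f"insert(0, x): {t_insert:.4f}s")
print(f"[x] + lst:    {t_concat:.4f}s")
Both shift or copy all n existing elements, so both are indeed O(n); as a side note, if you need many left-appends, collections.deque offers an O(1) appendleft.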

Time complexity of a function in Python

I have 2 functions which perform the same task: identifying whether the 2 lists have any common element between them. I want to analyze their time complexity.
What I know is: a for loop iterated n times gives O(n) complexity. But I am confused about the situation where we use the 'in' operator, e.g. if element in mylist.
Please look at the functions to have better understanding of the scenario:
list1 = ['a','b','c','d','e']
list2 = ['m','n','o','d']

def func1(list1, list2):
    for i in list1:  # O(n), assuming number of items in list1 is n
        if i in list2:  # What will be the BigO of this statement??
            return True
    return False

z = func1(list1, list2)
print(z)
I have another function func2; please help determine its BigO as well:
def func2(list1, list2):
    dict = {}
    for i in list1:
        if i not in dict.keys():
            dict[i] = True
    for j in list2:
        if j in dict.keys():
            return True
    return False

z = func2(list1, list2)
print(z)
What is the time complexity of func1 and func2? Is there any difference in performance between the 2 functions?
Regarding func1:
searching in a list is a linear operation with respect to the number of elements;
assuming the items are randomly ordered and the order of checking is unrelated to it, statistically you come across an existing element in about n/2 steps, and take all n steps when it is not found - both of which simplify to O(n);
since if x in list_ is that linear search, func1 as a whole has complexity O(n^2) (strictly O(n*m), with n and m the lengths of the two lists).
Regarding func2:
instead of a dictionary you may want to consider using a set. It has O(1) average complexity for checking the existence of an element, which improves the overall complexity over func1. Also, you can build the set directly with set(list1) rather than inserting elements one by one in a Python loop - the explicit loop is slower than initializing the set straight from the list, though that is only a constant factor and does not change the big-O.
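A minimal sketch of that set-based version (func2_set is just an illustrative name):
def func2_set(list1, list2):
    seen = set(list1)                     # O(n) to build
    return any(j in seen for j in list2)  # O(m) lookups, O(1) each on average
That makes the whole check O(n + m), versus O(n*m) for func1.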

How to find match between two 2D lists in Python?

Let's say I have two 2D lists like this:
list1 = [ ['A', 5], ['X', 7], ['P', 3]]
list2 = [ ['B', 9], ['C', 5], ['A', 3]]
I want to compare these two lists and find where the 2nd item matches between the two lists, e.g. here we can see that the numbers 5 and 3 appear in both lists. The first item is actually not relevant to the comparison.
How do I compare the lists and copy those values that appear in the 2nd column of both lists? Using 'x in list' does not work since these are 2D lists. Do I create another copy of the lists with just the 2nd column copied across?
It is possible that this can be done using a list comprehension, but I am not sure how so far.
There might be a duplicate for this but I have not found it yet.
The pursuit of one-liners is a futile exercise. They aren't always more efficient than the regular loopy way, and are almost always less readable when you're writing anything more complicated than one or two nested loops. So let's get a multi-line solution first. Once we have a working solution, we can try to convert it to a one-liner.
Now, the solution you shared in the comments works, but it doesn't handle duplicate elements and is also O(n^2) because it contains a nested loop (see https://wiki.python.org/moin/TimeComplexity):
list_common = [x[1] for x in list1 for y in list2 if x[1] == y[1]]
A few key things to remember:
A single loop O(n) is better than a nested loop O(n^2).
Membership lookup in a set O(1) is much quicker than lookup in a list O(n).
Sets also get rid of duplicates for you.
Python includes set operations like union, intersection, etc.
Let's code something using these points:
# Create a set containing all numbers from list1
set1 = set(x[1] for x in list1)
# Create a set containing all numbers from list2
set2 = set(x[1] for x in list2)
# Intersection contains numbers in both sets
intersection = set1.intersection(set2)
# If you want, convert this to a list
list_common = list(intersection)
Now, to convert this to a one-liner:
list_common = list(set(x[1] for x in list1).intersection(x[1] for x in list2))
We don't need to explicitly convert x[1] for x in list2 to a set because set.intersection() accepts any iterable (including a generator expression) and handles the conversion internally.
This gives you the result in O(n) time, and also gets rid of duplicates in the process.
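Running the one-liner on the example lists from the question:
list1 = [['A', 5], ['X', 7], ['P', 3]]
list2 = [['B', 9], ['C', 5], ['A', 3]]
list_common = list(set(x[1] for x in list1).intersection(x[1] for x in list2))
print(list_common)  # e.g. [3, 5] - a set imposes no particular order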

A more efficient way for nested loops

Currently, I have nested loops in Python that iterate over lists, but the iterable child list depends on the selected value of the parent loop. Consider this code snippet for the nested loop:
my_combinations = []
list1 = ['foo', 'bar', 'baz']
for l1 in list1:
    list2 = my_func1(l1)  # some user-defined function which queries a dataset
    for l2 in list2:
        list3 = my_func2(l1, l2)  # some other user-defined function which queries a dataset
        for l3 in list3:
            my_combinations.append((l1, l2, l3))
Is there an efficient way to get all the permissible combinations (as defined by the my_func1 and my_func2 functions) into the my_combinations list? The number of elements in list1, list2 and list3 runs into 4-5 digits, so this is clearly inefficient right now.
As a thought process, if I had list1, list2 and list3 pre-defined before entering the outermost loop, itertools.product might have given me the required combinations efficiently. However, I don't think I can use it in this situation.
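One idea that isn't in the thread, so treat it as a sketch under assumptions: if my_func1 and my_func2 are pure functions of their (hashable) arguments and the loops hit the same arguments repeatedly, memoizing them with functools.lru_cache avoids redundant dataset queries; the triple loop itself is unavoidable, since each child list depends on the parent value:
from functools import lru_cache

@lru_cache(maxsize=None)
def func1_cached(l1):
    return tuple(my_func1(l1))   # tuples so the cached results are immutable

@lru_cache(maxsize=None)
def func2_cached(l1, l2):
    return tuple(my_func2(l1, l2))

my_combinations = [(l1, l2, l3)
                   for l1 in list1
                   for l2 in func1_cached(l1)
                   for l3 in func2_cached(l1, l2)]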

Grouping elements in list by equivalence class

I have an issue when trying to make new lists from one list by treating its elements as sets.
Suppose I have the following list:
L=[[(a),(b),(c)],[(b),(c),(a)],[(a),(c),(b)],[(a),(d),(b)]]
And I wish to create just ONE list from the lists in L which have the same elements. We can clearly see that:
[(a),(b),(c)], [(b),(c),(a)] and [(a),(c),(b)]
when seen as sets, they are the same, because they all share the elements (a), (b) and (c).
So if I wish to create new lists from L applying this rule:
I would then need two new lists, which are:
[(a),(b),(c)] and [(a),(d),(b)]
since
[(a),(d),(b)]
seen as a set differs from the rest of the lists.
What would be an optimal way to do this? I know how to convert an element inside L to a set, but if I wish to apply this rule in order to create only two independent lists, what should I do?
A set of frozensets would get you roughly what you want (though it won't preserve order):
unique_sets = {frozenset(lst) for lst in L}
Though order is lost in the set conversion, converting back to a list of lists is fairly easy:
unique_lists = [list(s) for s in unique_sets]
You can make a set of frozensets to get only the unique collections ignoring order and counts of items:
set(map(frozenset, L))
# {frozenset({'a', 'd', 'b'}), frozenset({'a', 'c', 'b'})}
It's then pretty trivial to convert those back to lists:
list(map(list, set(map(frozenset, L))))
# [['a', 'd', 'b'], ['a', 'c', 'b']]
If you'd be willing to write a hash-like key function for sets (one that gives equal sets the same key, e.g. a sorted tuple of their elements - this assumes the elements are orderable), then you could do:
import itertools
canon = lambda s: tuple(sorted(s))  # equal sets get equal keys
[k for k, g in itertools.groupby(sorted([set(y) for y in L], key=canon))]
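For instance, with a concrete L of strings (my own example, chosen to match the outputs shown above):
L = [['a', 'b', 'c'], ['b', 'c', 'a'], ['a', 'c', 'b'], ['a', 'd', 'b']]
# the groupby one-liner then yields:
# [{'a', 'b', 'c'}, {'a', 'b', 'd'}]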
