Python list append based on substring search - slow performance - python-3.x

In a list of lists, I need to add a list element to each inner list, whenever one or more elements of another list are contained in a fixed position element of the inner list itself.
Here's an example of the lists
list1 = ['AS23X2', '33YK87', 'YY744Q']
list2 = [[0, 1773332, 'some text that may contain 0, 1 or more occurrences of list1 items'], [1, 77666543, 'some other text 33YK87 is here']]
Note that len(list1) is about 95,000 and len(list2) is over 120,000. The requirement is that if more than one item of list1 is found within list2[n][2], they are all appended as a list.
The below code does exactly what is required, but is very slow (takes several minutes). I can't figure out how to improve performance - can anyone suggest a possible solution?
for i in list2:
    i.append([x for x in list1 if x in i[2]])
Please do consider that list2 is derived from a Pandas dataframe:
list2 = df2.values.tolist()
I'm quite confident there's something more efficient that could be achieved using Pandas, but I'm new to it and hope someone already solved a similar question in a better way.
Thanks

I'm just spitballing ideas:
Use a database
Use multithreading library
Try to do something with Set if the dataset includes many duplicates
Or try using Counter from the collections library to remove duplicates, but keep occurrences. I'm not sure if this will be faster given your dataset
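To make the Set idea a bit more concrete, here is a minimal sketch. It rests on an assumption that is not stated in the question: that the list1 items appear in the text as whole whitespace-separated tokens rather than buried inside longer words. If that holds, each token can be looked up in a set instead of scanning all of list1 for every row. The sample rows and the punctuation stripping are made up for illustration.

```python
list1 = ['AS23X2', '33YK87', 'YY744Q']
list2 = [
    [0, 1773332, 'some text without any code'],
    [1, 77666543, 'some other text 33YK87 is here'],
    [2, 555, 'YY744Q and AS23X2 both appear, YY744Q twice'],  # made-up row
]

codes = set(list1)  # set membership is a hash lookup, not a scan of list1

for row in list2:
    seen = set()
    matches = []
    for token in row[2].split():
        token = token.strip('.,;:!?')  # assumption: only trailing punctuation gets in the way
        if token in codes and token not in seen:
            seen.add(token)
            matches.append(token)
    row.append(matches)

print(list2[1][3])
print(list2[2][3])
```

Note that the matches come out in the order they appear in the text, not in list1 order, and that a code embedded inside a longer word would be missed. If real substring matching is needed, this shortcut does not apply.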

Related

Looping over multiple lists of lists with different list (within the master list) lengths

I have two lists: l1 holds a list in each position and l2 holds a dataframe in each position.
l1 = [['US','phone','active'],...,['CA','email','inactive']]
l2 = [df_1, .., df_n]
df_n are all dataframes with actual contents in them.
I want to access the contents in l1 and l2 for the same corresponding position to use them within the for loop for forecasting purposes.
However I write the for loops (with zip, izip_longest or enumerate), I can't get them to unpack the contents properly, since l1 has 3 attributes per position and l2 just a single one.
There is probably a simple fix to this. I looked at other similar questions, but none of them dealt with lists within a list where the items being looped over have different lengths.
Depending on the approach I use, the errors vary, but I frequently get 'too many values to unpack'.
Appreciate any thoughts!
If you loop over the indices, with something like for i in range(len(l1)), you'll be able to access elements from both lists with l1[i] and l2[i].
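A small sketch of both options, using placeholder strings where the real dataframes would go (the names country, channel and status are just guesses at what the three attributes mean). The 'too many values to unpack' error usually goes away once the three-item inner list is unpacked in its own parentheses:

```python
l1 = [['US', 'phone', 'active'], ['CA', 'email', 'inactive']]
l2 = ['df_1 placeholder', 'df_2 placeholder']  # stand-ins for the real dataframes

# Option 1: loop over indices, as suggested above
by_index = []
for i in range(len(l1)):
    by_index.append((l1[i], l2[i]))

# Option 2: zip, unpacking the inner list with nested parentheses
by_zip = []
for (country, channel, status), df in zip(l1, l2):
    by_zip.append((country, channel, status, df))

print(by_zip[0])
```

zip stops at the shorter of the two lists, so this assumes l1 and l2 have the same number of positions.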

Python: using list comprehension to count first element in list of numbers

I'm trying to teach myself list comprehension in Python, but I find it quite tricky compared to regular loops and it is hard to find good beginner examples of list comprehension.
Using this basic example below, it supplies a list of numbers and asks for sentences generated such as "2 numbers start with 1."
my_list = [232, 379, 985, 384, 129, 197]
2 numbers start with 1
1 number starts with 2
2 numbers start with 3
1 number starts with 9
If I was going to do this in a loop, I might bring back the first digit in each like this and then count them and put them in print statements (this just shows how I might start out in a loop):
for x in range(len(my_list)):
    strList = (str(my_list[x]))
    if strList[0]:
        print(strList[0])
I'm so confused about how to bring back element [0] in list comprehension.
I know there is a sum available in list comprehension, so I'm trying to start like this below to create a count (this isn't right though) and I don't know how to retrieve the first elements back out of this so I can piece together sentences like "2 numbers start with 1":
count = [sum(x) for x in my_list if my_list[0]]
print(count,' numbers start with', start_digit)
Thanks for any help with understanding list comprehension. It looks much better than loops in terms of being more concise so I want to learn it.
Perhaps the reason why you're getting confused here is that this particular problem doesn't seem like something that list comprehension would solve.
If you only need to get the first digits of the items, then list comprehension can do the trick:
start_digits = [str(x)[0] for x in my_list]
Getting the occurrences of each item is a completely different story. You can implement it in a variety of ways, and if you're not against importing modules, you can use collections.Counter to get the occurrence counts.
from collections import Counter
Counter(start_digits)
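Putting the two pieces together, one way to get from the counts to the sentences might look like this. The describe helper is a made-up name, only there to handle the singular/plural wording:

```python
from collections import Counter

my_list = [232, 379, 985, 384, 129, 197]

# First digit of every number, then how often each digit occurs
start_digits = [str(x)[0] for x in my_list]
counts = Counter(start_digits)


def describe(n, digit):
    # hypothetical helper: picks "number starts" or "numbers start"
    if n == 1:
        return f"{n} number starts with {digit}"
    return f"{n} numbers start with {digit}"


sentences = [describe(n, digit) for digit, n in sorted(counts.items())]
for sentence in sentences:
    print(sentence)
```

The list comprehension only builds the sentences; the counting itself is done by Counter.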

select sublists with items that have multiple occurrences throughout list

I have a nested list of integers ranging from 1 to 5 (not really). I want to ensure that each integer occurs at least once in the list and, if one is missing, to replace a sublist with one that contains the missing integer. (I have a full set of possible sublists to choose from.) I'm having trouble working out the syntax for ensuring that the removed sublist contains only integers that have multiple occurrences, so that I don't recreate the missing-integer problem I'm attempting to solve. Here's an example:
a = [[2], [4], [1], [1, 2], [1,2,5]]
Notice 3 is missing. If I randomly choose the second or fifth sublist for replacement, then either the 4 or the 5 will be missing. I need to choose the first, third or fourth sublist, where each of the sublist elements i has a list.count(i) > 1.
Therefore I want to create a new list of viable selection candidates. I believe the solution should look something like this
b = [item for item in a if sum(a.count(i)) > 1 for i in item]
but Python3 is complaining that
UnboundLocalError: local variable 'i' referenced before assignment.
Any suggestions? Note: the algorithm will need to be able to scale to thousands of sublists, but this would rarely happen because the probability of a missing integer in those cases becomes nearly 0.
Thanks for looking!
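One possible way to build the candidate list, offered as a sketch rather than the definitive answer: a.count(i) counts sublists equal to i, not occurrences of the integer i inside the sublists, so this version first counts in how many sublists each integer appears and then keeps the sublists whose integers all appear in more than one.

```python
from collections import Counter
from itertools import chain

a = [[2], [4], [1], [1, 2], [1, 2, 5]]

# Number of sublists each integer appears in (set() ignores repeats within a sublist)
counts = Counter(chain.from_iterable(set(sub) for sub in a))

# A sublist is safe to replace if every integer in it also lives in another sublist
b = [sub for sub in a if all(counts[i] > 1 for i in sub)]

print(b)
```

Counting is a single pass over all elements, so this should still be fine with thousands of sublists.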

Removing list element while iterating in python3

I am trying to remove list elements (numeric values) while iterating through the list. I have two examples. Example 1 works but example 2 doesn't, even though both examples use the same logic.
Example 1 : Working
list1=["5","a","6","c","f","9","r"]
print(list1)
for i in list1:
    if str.isnumeric(i):
        list1.remove(i)
print(list1)
Example 2 : Not Working
list2=["12abc1","45asd"]
for items in list2:
    item_list=list(items)
    print(item_list)
    for i in item_list:
        if str.isnumeric(i):
            item_list.remove(i)
    print(item_list)
I solved example 2 by using (for i in item_list[:]:). But I can't understand why the second example didn't work in the first place.
I can't claim to be an expert in Python, as I'm only poorly familiar with it, however I'll give you an explanation of what I think is likely happening.
The first example doesn't actually work any better than the second example, however the data you've used to test it is different so it doesn't show. The problem seems to be due to the fact that you're iterating through and modifying the list at the same time, so the following happens in the second example:
The program will iterate through its given list:
["1", "2", "a", "b","c", "1"]
The program starts with list item 1. It is numerical, so it is removed. The list is now different:
["2", "a", "b", "c", "1"]
As you are iterating through, it moves on to list item 2. This is problematic, as list item 2 is "a" rather than the "2", so it skips the "2".
As numbers in the first example are separated by at least 1 list item, this isn't an issue as all of the numbers are iterated over.
As for the fix you mentioned of changing list2 to list2[:], I have no idea what happened there, as when I ran the program through PythonTutor's visualizer it didn't seem to work.
In order to fix this, the most obvious solution to me would be to try going through the array backwards - starting with the final list item and moving towards the start of the list, as that means any item you remove won't affect the numbering of the previous items.
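Here is a rough sketch of the skipping problem and a few ways around it. The function names are made up for illustration. Iterating over a slice copy works because the loop walks the copy while the removals happen on the original.

```python
def remove_digits_buggy(chars):
    # Removes from the same list it is iterating over, so items get skipped
    for ch in chars:
        if ch.isnumeric():
            chars.remove(ch)
    return chars


def remove_digits_copy(chars):
    # Iterates over a copy (chars[:]) while removing from the original
    for ch in chars[:]:
        if ch.isnumeric():
            chars.remove(ch)
    return chars


def remove_digits_backwards(chars):
    # Walks from the last index to the first, so deletions never shift unvisited items
    for idx in range(len(chars) - 1, -1, -1):
        if chars[idx].isnumeric():
            del chars[idx]
    return chars


def remove_digits_filtered(chars):
    # Builds a new list instead of mutating the old one
    return [ch for ch in chars if not ch.isnumeric()]


print(remove_digits_buggy(list("12abc1")))      # the "2" survives
print(remove_digits_copy(list("12abc1")))
print(remove_digits_backwards(list("12abc1")))
print(remove_digits_filtered(list("12abc1")))
```

Building a new list with a comprehension is probably the simplest of these if the original list does not need to be modified in place.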
Hope I helped!

Sort a dictionary according to the lists size and then alphabetically by their key in Python?

So I have a tiny problem. I got help here a while ago about sorting a dictionary whose keys each map to a list, ordered by the number of values in the list: the keys whose lists have the fewest values on the left, and the keys whose lists have the most values on the right. That worked great. Now, I know how to sort dictionary keys alphabetically, but I can't get it to work combined with the above.
I'm trying to sort the dictionary below according to first how many values the key list contains... and then alphabetically if the key list contains the same amount of values as a previous key list.
so before I would have this:
Dict = {"anna":[1,2,3],"billy":[1,2],"cilla":[1,2,3,4],"cecillia":[1,2,3,4],"dan":[1]}
And after if everything goes well I would like to have...
Dict = {"dan":[1],"billy":[1,2],"anna":[1,2,3],"cecillia":[1,2,3,4],"cilla":[1,2,3,4]}
As you see in the above, cecillia comes before cilla since they both have 4 values in their lists... and dan comes in first since he has the least amount of values in his list. I hope this makes sense. What I have right now to get the below result is:
ascending = sorted(Dict, key =lambda x: len(Dict[x]))
this gives me for example:
{"dan":[1],"billy":[1,2],"anna":[1,2,3],"cilla":[1,2,3,4],"cecillia":[1,2,3,4]}
So it works but only for the values in the list.. now when I go
ascending.sort()
it sorts the dictionary alphabetically but then the order of values from least to greatest is gone. Anyone know how to combine the two things? I would greatly appreciate it.
A dictionary does not keep itself sorted, so convert it to a list of tuples and sort that:
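D = [(x, Dict[x]) for x in Dict]
D.sort()  # secondary criterion first: alphabetical by key
ascending = sorted(D, key=lambda x: len(x[1]))  # stable sort by list length keeps alphabetical order among ties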
See http://wiki.python.org/moin/HowTo/Sorting . BTW, the behaviour you are relying on comes from the sorting algorithm being stable (which apparently was not the case when I was programming in Python).
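As an alternative sketch, both criteria can go into a single sort by using a tuple as the key: tuples compare element by element, so the list length decides first and the name breaks ties. Rebuilding a dict from the sorted keys assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
Dict = {"anna": [1, 2, 3], "billy": [1, 2], "cilla": [1, 2, 3, 4],
        "cecillia": [1, 2, 3, 4], "dan": [1]}

# Sort keys by (number of values, key): length first, alphabetical on ties
ascending = sorted(Dict, key=lambda name: (len(Dict[name]), name))

# Assumes Python 3.7+: a dict remembers the order in which keys were inserted
sorted_dict = {name: Dict[name] for name in ascending}

print(ascending)
print(sorted_dict)
```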
