How to speed up for loop execution using multiprocessing in python - python-3.x

I have two lists. List A contains 500 words and List B contains 10,000 words. I am trying to find similar words for List A with respect to List B, using spaCy's similarity function.
The problem I am facing is that it takes ages to compute. I am new to multiprocessing, so I would appreciate some help.
How do I speed up the execution of the for loop part through multiprocessing in Python?
The following is my code.
from operator import itemgetter
import spacy

nlp = spacy.load("en_core_web_lg")  # or whichever model is in use

ListA = ['Dell', 'GPU', ......]    # 500-word list
ListB = ['Docker', 'Ec2', .......] # 10000-word list

s_words = []
for token1 in ListB:
    list_to_sort = []
    for token2 in ListA:
        list_to_sort.append((token1, token2, nlp(str(token1)).similarity(nlp(str(token2)))))
    sorted_list = sorted(list_to_sort, key=itemgetter(2), reverse=True)[0][:2]
    s_words.append(sorted_list)

You can use the multiprocessing package; that should reduce your run time significantly. See here for sample code.
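The linked sample is not shown here, but a minimal sketch of the idea might look like the following (the worker functions, pool size, and model name are illustrative assumptions, not part of the original answer): each worker process loads the model once and scores one word from ListB against all of ListA.

from multiprocessing import Pool
from operator import itemgetter
import spacy

ListA = ['Dell', 'GPU']    # placeholder for the 500-word list
ListB = ['Docker', 'Ec2']  # placeholder for the 10000-word list

def init_worker():
    # Load the model and pre-process ListA once per worker process,
    # not once per word pair.
    global nlp, docs_a
    nlp = spacy.load("en_core_web_lg")
    docs_a = [nlp(w) for w in ListA]

def best_match(word_b):
    # Score one ListB word against every ListA word and keep the best pair.
    doc_b = nlp(word_b)
    scored = [(word_b, d.text, doc_b.similarity(d)) for d in docs_a]
    return sorted(scored, key=itemgetter(2), reverse=True)[0][:2]

if __name__ == '__main__':
    with Pool(processes=4, initializer=init_worker) as pool:
        s_words = pool.map(best_match, ListB)
    print(s_words)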

Have you tried nlp.pipe()?
You could do something like this:
from operator import itemgetter
import spacy

nlp = spacy.load("en_core_web_lg")

ListA = ['Apples', 'Monkey']                        # 500-word list
ListB = ['Grapefruit', 'Ape', 'Oranges', 'Banana']  # 10000-word list

s_words = []
docs_a = nlp.pipe(ListA)
docs_b = list(nlp.pipe(ListB))
for token1 in docs_a:
    list_to_sort = []
    for token2 in docs_b:
        list_to_sort.append((token1.text, token2.text, token1.similarity(token2)))
    sorted_list = sorted(list_to_sort, key=itemgetter(2), reverse=True)[0][:2]
    s_words.append(sorted_list)
print(s_words)
That should already speed things up for you. The function nlp.pipe() also has an n_process parameter, which might be what you're looking for.
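For example, assuming a spaCy version (2.2.2 or later) where nlp.pipe() accepts n_process, the batch over ListB could be processed in parallel like this (the process count and batch size are illustrative values):

# spaCy forks its own worker processes, so no manual multiprocessing code is needed
docs_b = list(nlp.pipe(ListB, n_process=4, batch_size=256))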

Related

Parallelize a list item append to dict using multiprocessing

I have a large list containing strings. I wish to create a dict from this list such that:
list = [str1, str2, str3, ....]
dict = {str1:len(str1), str2:len(str2), str3:len(str3),.....}
My go-to solution was a for loop, but it is taking too much time (my list contains almost 1M elements):
for i in list:
    d[i] = len(i)
I wish to use the multiprocessing module in Python in order to leverage all cores and reduce the time taken. I have come across some crude examples involving the manager module to share a dict between different processes, but I am unable to implement them. Any help would be appreciated!
I don't know if using multiple processes will be faster, but it's an interesting experiment.
General flow:
Create list of random words
Split list into segments, one segment per process
Run processes, pass segment as parameter
Merge result dictionaries to single dictionary
Try this code:
import concurrent.futures
import random
from multiprocessing import freeze_support

def todict(lst):
    print(f'Processing {len(lst)} words')
    return {e: len(e) for e in lst}  # convert list to dictionary

if __name__ == '__main__':
    freeze_support()  # needed for Windows

    # create random word list - max 15 chars
    letters = [chr(x) for x in range(65, 65 + 26)]  # A-Z
    words = [''.join(random.sample(letters, random.randint(1, 15))) for w in range(10000)]  # 10000 words
    words = list(set(words))  # remove dups, count will drop
    print(len(words))

    ########################

    cpucnt = 4  # process count to use

    # split word list for each process
    wl = len(words) // cpucnt + 1  # word count per process
    lstsplit = []
    for c in range(cpucnt):
        lstsplit.append(words[c*wl:(c+1)*wl])  # create word list for each process

    # start processes
    with concurrent.futures.ProcessPoolExecutor(max_workers=cpucnt) as executor:
        procs = [executor.submit(todict, lst) for lst in lstsplit]
        results = [p.result() for p in procs]  # block until results are gathered

    # merge results to single dictionary
    dd = {}
    for r in results:
        dd.update(r)

    print(len(dd))  # confirm word count matches
    with open('dd.txt', 'w') as f:
        f.write(str(dd))  # write dictionary to text file
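The question also asks about the Manager-based approach. A minimal sketch of it (my own illustration, not part of the answer above) is shown below; note that every write to a managed dict is an IPC round-trip to the manager process, so it is usually slower than building per-process dicts and merging them as done above.

from multiprocessing import Manager, Process

def fill(shared_d, lst):
    for w in lst:
        shared_d[w] = len(w)  # each assignment goes through the manager proxy

if __name__ == '__main__':
    words = ['alpha', 'bb', 'cccc']  # placeholder for the real 1M-element list
    with Manager() as manager:
        shared_d = manager.dict()
        chunks = [words[i::4] for i in range(4)]  # four interleaved chunks
        procs = [Process(target=fill, args=(shared_d, c)) for c in chunks]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        d = dict(shared_d)  # copy out before the manager shuts down
    print(len(d))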

Finding the mean per class in PyTorch

I am naively iterating over each and every sample of the dataset. Is there any way to calculate the mean efficiently?
my_root = '/mini_imagenet_full_size/train/'
dir_list = os.listdir(my_root)
print(len(dir_list))

miniImagenet_dataset = datasets.ImageFolder(root=my_root, transform=data_transform)

Clmean = torch.zeros([64, 3, 224, 224])
for t, c in miniImagenet_dataset:
    print(c)
    Clmean[c, :, :, :] += t
print(Clmean)
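As a hedged sketch of one way to finish the computation (assuming the goal is a per-class mean image): accumulate per-class sums and counts in batches through a DataLoader, then divide once at the end. The batch size, worker count, and tensor shapes below are assumptions carried over from the snippet above.

import torch
from torch.utils.data import DataLoader

loader = DataLoader(miniImagenet_dataset, batch_size=256, num_workers=4)
class_sum = torch.zeros(64, 3, 224, 224)
class_count = torch.zeros(64)
for imgs, labels in loader:
    class_sum.index_add_(0, labels, imgs)                       # add each image to its class slot
    class_count += torch.bincount(labels, minlength=64).float() # track samples per class
class_mean = class_sum / class_count.clamp(min=1).view(-1, 1, 1, 1)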

How to use Python3 multiprocessing to append list?

I have an empty list empty_list = []
and two other lists: list1 = [[1,2,3],[4,5,6],[7,8,9]] and list2 = [[10,11,12],[13,14,15],[16,17,18]].
I would like to do two things:
pick up [1,2,3] from list1 and [10,11,12] from list2 to make [1,2,3,10,11,12]; [4,5,6] and [13,14,15] to form [4,5,6,13,14,15]; and finally [7,8,9] and [16,17,18] to form [7,8,9,16,17,18];
append listA = [1,2,3,10,11,12], listB = [4,5,6,13,14,15] and listC = [7,8,9,16,17,18] to the empty list with axis=0.
I have done this without multiprocessing, but it is slow, so I would like to know how to do it with multiprocessing.
I have two naive approaches but do not know how to implement them:
use a pool: make a func0 that picks up the sub-lists and merges them via pool.map(func0, [lst for lst in [list1, list2]]), then a func1 that appends listA, listB, listC to the empty list via pool.map(func1, [lst for lst in [listA, listB, listC]]);
use multiprocessing.Array, but I have not figured out how to do it.
This sample may not need multiprocessing, but my real lists contain thousands of lines.
I am not sure if this helps, but you can avoid some list comprehensions:
empty_list = []
for l1, l2 in zip(list1, list2):
    empty_list.append(l1 + l2)
Let's check time performance with some random lists:
import timeit
code_to_test = """
import numpy as np
list1 = [np.random.randint(0, 10, 100).tolist() for i in range(10_000)]
list2 = [np.random.randint(0, 10, 100).tolist() for i in range(10_000)]
empty_list = []
for l1, l2 in zip(list1, list2):
    empty_list.append(l1 + l2)
"""
elapsed_time = timeit.timeit(code_to_test, number=100)/100
print(elapsed_time, ' seconds')
0.12564824399999452 seconds
You can use dask to parallelize numpy operations:
import dask.array as da
list1 = da.from_array(list1)
list2 = da.from_array(list2)
result = da.hstack([list1,list2])
result.compute()
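If you do want the pool-based version the question sketches (a func0 applied with pool.map), a minimal sketch could look like the following; the function name merge_pair and the process count are my own illustration, and whether it beats the plain zip loop depends on how expensive the per-pair work is, since the sub-lists have to be pickled to and from the worker processes.

from multiprocessing import Pool

def merge_pair(pair):
    # Concatenate one sub-list from list1 with the matching sub-list from list2.
    l1, l2 = pair
    return l1 + l2

if __name__ == '__main__':
    list1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    list2 = [[10, 11, 12], [13, 14, 15], [16, 17, 18]]
    with Pool(processes=4) as pool:
        empty_list = pool.map(merge_pair, zip(list1, list2))
    print(empty_list)  # [[1, 2, 3, 10, 11, 12], [4, 5, 6, 13, 14, 15], [7, 8, 9, 16, 17, 18]]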

Split list into randomised ordered sub lists

I would like to improve the below code to split a list of values into two sub lists, which have been randomised and sorted. The below code works, but I'm sure there is a better/cleaner way to do it.
import random
data = list(range(1, 61))
random.shuffle(data)
Intervention = data[:30]
Control = data[30:]
Intervention.sort()
Control.sort()
f = open('Randomised_Groups.txt', 'w')
f.write('Intervention Group = ' + str(Intervention) + '\n' + 'Control Group = ' + str(Control))
f.close()
The expected output is:
Intervention = [1,3,7,9]
Control = [2,4,5,6,8,10]
I think your code is short and clean already. Some changes you can make:
Call sorted() when you slice it.
Intervention = sorted(data[:30])
You can also define both Intervention and Control on one line:
Intervention, Control = data[:30], data[30:]
I would replace the 30 with a variable:
half = len(data)//2
It is safer to open a file with with; the file is closed automatically when the indented block ends.
with open('Randomised_Groups.txt', 'w') as f:
    ...
With the use of f-strings you can make the write statement shorter:
f.write(f'Intervention Group = {Intervention} \nControl Group = {Control}')
All combined:
import random
data = list(range(1, 61))
random.shuffle(data)
half = len(data)//2
Intervention, Control = sorted(data[:half]), sorted(data[half:])
with open('Randomised_Groups.txt', 'w') as f:
    f.write(f'Intervention Group = {Intervention}\nControl Group = {Control}')
Something like this might be what you want:
import random
my_rng = [random.randint(0,1) for i in range(60)]
Control = [i for i in range(60) if my_rng[i] == 0]
Intervention = [i for i in range(60) if my_rng[i] == 1]
print(Control)
The idea is to create 60 random 1s or 0s to use as indicators for which list to put each number in. This will only work if you do not need the two lists to be the same length. To get the same length would require changing how my_rng is created in this example.
I have tinkered a bit further and got the lists of the same length:
import random
my_rng = [0 for i in range(30)]
my_rng.extend([1 for i in range(30)])
random.shuffle(my_rng)
Control = [i for i in range(60) if my_rng[i] == 0]
Intervention = [i for i in range(60) if my_rng[i] == 1]
Here, instead of randomly appending 1 or 0 to my_rng, I build a list of 30 zeros and 30 ones, shuffle it, and then continue as before.
Here is another, more dynamic solution that uses the built-in random functionality, only creates the lists that are needed (no extra memory), and works with lists containing any type of object, provided that object can be sorted:
import random

def convert_to_random_list(data, num_list):
    """
    Takes in the data as one large list and converts it into
    [num_list] random sorted lists.
    """
    result_lists = [list() for _ in range(num_list)]  # one empty list per output list
    for x in data:
        # Using randint we pick which list to insert into
        result_lists[random.randint(0, num_list - 1)].append(x)
    # You could use a list comprehension here with sorted(...) but it would take a little extra memory.
    for _list in result_lists:
        _list.sort()
    return result_lists
Can be tested with:
data = list(range(1, 61))
random.shuffle(data)
temp = convert_to_random_list(data, 3)
print(temp)

Making a dictionary of from a list and a dictionary

I am trying to create a dictionary of codes that I can use for queries and selections. Let's say I have a dictionary of state names and corresponding FIPS codes:
statedict ={'Alabama': '01', 'Alaska':'02', 'Arizona': '04',... 'Wyoming': '56'}
And then I have a list of FIPS codes that I have pulled in from a Map Server request:
fipslist = ['02121', '01034', '56139', '04187', '02003', '04023', '02118']
I want to combine the keys from the dictionary with the list items, matching on the first two characters of the FIPS code (e.g. all codes beginning with '01' belong to 'Alabama', and so on). My end goal is something like this:
fipsdict ={'Alabama': ['01034'], 'Alaska':['02121', '02003','02118'], 'Arizona': ['04187', '04023'],... 'Wyoming': ['56139']}
I would try to set it up similar to this, but it's not working quite correctly. Any suggestions?
fipsdict = {}
tempList = []
for items in fipslist:
    for k, v in statedict:
        if item[:2] == v in statedict:
            fipsdict[k] = statedict[v]
            fipsdict[v] = tempList.extend(item)
A one-liner with nested comprehensions:
>>> {k:[n for n in fipslist if n[:2]==v] for k,v in statedict.items()}
{'Alabama': ['01034'],
'Alaska': ['02121', '02003', '02118'],
'Arizona': ['04187', '04023'],
'Wyoming': ['56139']}
You will have to create a new list to hold matching fips codes for each state. Below is the code that should work for your case.
state_to_matching_fips_map = {}
for state, two_digit_fips in statedict.items():
    matching_fips = []
    for fips in fipslist:
        if fips[:2] == two_digit_fips:
            matching_fips.append(fips)
    state_to_matching_fips_map[state] = matching_fips
>>> print(state_to_matching_fips_map)
{'Alabama': ['01034'], 'Arizona': ['04187', '04023'], 'Alaska': ['02121', '02003', '02118'], 'Wyoming': ['56139']}
For both proposed solutions I need a reversed state dictionary (I assume that each state has exactly one 2-digit code):
reverse_state_dict = {v: k for k,v in statedict.items()}
An approach based on defaultdict:
from collections import defaultdict

fipsdict = defaultdict(list)
for f in fipslist:
    fipsdict[reverse_state_dict[f[:2]]].append(f)
An approach based on groupby and dictionary comprehension:
from itertools import groupby
{reverse_state_dict[k]: list(v) for k,v
in (groupby(sorted(fipslist), key=lambda x:x[:2]))}
