How to share a variable among threads in joblib using external module - scikit-learn

I am trying to modify the sklearn source code. In particular, I am modifying the GridSearch source code so that the separate processes/threads that evaluate the different model configurations share a variable among themselves. I need each thread/process to read and update that variable at run time, so that it can adjust its execution according to what the other threads have obtained. More specifically, the parameter that I would like to share is best, in the snippet below:
out = parallel(delayed(_fit_and_score)(clone(base_estimator), X, y, best, self.early,train=train, test=test,parameters=parameters,**fit_and_score_kwargs) for parameters, (train, test) in product(candidate_params, cv.split(X, y, groups)))
Note that the _fit_and_score function is in a separate module.
Sklearn uses joblib for parallelization, but I cannot work out how to do this effectively with an external module. The joblib documentation provides this code:
>>> shared_set = set()
>>> def collect(x):
...     shared_set.add(x)
...
>>> Parallel(n_jobs=2, require='sharedmem')(
...     delayed(collect)(i) for i in range(5))
[None, None, None, None, None]
>>> sorted(shared_set)
[0, 1, 2, 3, 4]
but I am not able to understand how to make it run in my context. You can find the source code here:
gridsearch: https://github.com/scikit-learn/scikit-learn/blob/7389dbac82d362f296dc2746f10e43ffa1615660/sklearn/model_selection/_search.py#L704
fit_and_score: https://github.com/scikit-learn/scikit-learn/blob/7389dbac82d362f296dc2746f10e43ffa1615660/sklearn/model_selection/_validation.py#L406

You can do it with Python's Manager (https://docs.python.org/3/library/multiprocessing.html#multiprocessing.sharedctypes.multiprocessing.Manager). A simple example:
from joblib import Parallel, delayed
from multiprocessing import Manager
manager = Manager()
q = manager.Namespace()
q.flag = False
def test(i, q):
    # update the shared variable from process 0
    if i == 0:
        q.flag = True
    # keep polling the shared flag for a few seconds
    for n in range(100000000):
        if q.flag:
            return f'process {i} was updated'
    return f'process {i} was not updated'
out = Parallel(n_jobs=4)(delayed(test)(i, q) for i in range(4))
out:
['process 0 was updated',
'process 1 was updated',
'process 2 was updated',
'process 3 was updated']
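Applied to the question's best value, the read-compare-update step should be guarded by a lock, otherwise two workers can overwrite each other's result. The sketch below is a minimal illustration under stated assumptions, not sklearn's or joblib's own mechanism: evaluate is a hypothetical stand-in for _fit_and_score, update_best is a made-up helper, and ProcessPoolExecutor from the standard library stands in for joblib.Parallel so that the example has no third-party dependency.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def update_best(shared, lock, score):
    """Store `score` in shared.best if it beats the current value.

    Returns True when this call improved the shared best.
    """
    with lock:  # read-compare-write must not interleave between workers
        if score > shared.best:
            shared.best = score
            return True
    return False


def evaluate(candidate, shared, lock):
    # Hypothetical stand-in for _fit_and_score: the "score" is the candidate itself.
    score = float(candidate)
    improved = update_best(shared, lock, score)
    return candidate, score, improved


if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.Namespace()
        shared.best = float("-inf")
        lock = manager.Lock()
        with ProcessPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(evaluate, c, shared, lock) for c in [3, 1, 4, 1, 5]]
            results = [f.result() for f in futures]
        print(results)
        print("best:", shared.best)
```

Since the Namespace in the answer above is passed through delayed(...), the shared and lock proxies here should be passable the same way; that part is an assumption and is not exercised by this sketch.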

Related

Multiprocessing pool map for a BIG array computation runs much slower than expected

I've run into difficulties using multiprocessing Pool in Python 3. I want to do a BIG array calculation using pool.map. Basically, I have a 3D array on which I need to run a computation 10 times, generating 10 output files sequentially. This task is done 3 times, i.e. the output is 3*10=30 output files (*.txt). To do this, I've prepared the following script for a small array calculation (a sample problem). However, when I use this script for a BIG array calculation, or for arrays read from a series of files, this piece of code (maybe the pool) fills up the memory and does not save any .txt file in the destination directory. There is no error message when I run the file with the command mpirun python3 sample_prob_func.py
Can anybody suggest what the problem is in the sample script and how to write the code so that it no longer gets stuck? I have not received any error message, and I don't know where the problem occurs. Any help is appreciated. Thanks!
import numpy as np
import multiprocessing as mp
from scipy import signal
import matplotlib.pyplot as plt
import contextlib
import os, glob, re
import random
import cmath, math
import time
import pdb
#File Storing path
save_results_to = 'File saving path'
arr_x = [0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0, 8.49, 12.0]
arr_y = [0, 8.49, 12.0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0]
N=len(arr_x)
np.random.seed(12345)
total_rows = 5000
arr = np.reshape(np.random.rand(total_rows*N),(total_rows, N))
arr1 = np.reshape(np.random.rand(total_rows*N),(total_rows, N))
arr2 = np.reshape(np.random.rand(total_rows*N),(total_rows, N))
# Finding cross spectral density (CSD)
def my_func1(data):
    # Do something here
    return array1
t0 = time.time()
my_data1 = my_func1(arr)
my_data2 = my_func1(arr1)
my_data3 = my_func1(arr2)
print('Time required {} seconds to execute CSD--For loop'.format(time.time()-t0))
mydata_list = [my_data1, my_data2, my_data3]
def my_func2(data2):
    # Do something here
    return from_data2
start_freq = 100
stop_freq = 110
freq_range= np.around(np.linspace(start_freq,stop_freq,11)/10, decimals=2)
no_of_freq = len(freq_range)
list_arr =[]
def my_func3(csd):
    list_csd = []
    for fr_count in range(start_freq, stop_freq):
        csd_single = csd[:, :, fr_count]
        list_csd.append(csd_single)
    print('Shape of list is :', np.array(list_csd).shape)
    return list_csd
def parallel_function(BIG_list_data):
    with contextlib.closing(mp.Pool(processes=10)) as pool:
        dft = pool.map(my_func2, BIG_list_data)
    pool.close()
    pool.join()
    data_arr = np.array(dft)
    print('shape of data :', data_arr.shape)
    return data_arr
count_day = 1
count_hour =0
for count in range(3):
    count_hour += 1
    list_arr = my_func3(mydata_list[count])  # Load Numpy files
    print('Array shape is :', np.array(arr).shape)
    t0 = time.time()
    data_dft = parallel_function(list_arr)
    print('The hour number={} data is processing... '.format(count_hour))
    print('Time in parallel:', time.time() - t0)
    for i in range(no_of_freq-1):  # (11-1=10)
        jj = freq_range[i]
        #print('The hour_number {} and frequency number {} data is processing... '.format(count_hour, jj))
        dft_1hr_complx = data_dft[i,:,:]
        np.savetxt(save_results_to + f'csd_Day_{count_day}_Hour_{count_hour}_f_{jj}_hz.txt', dft_1hr_complx.view(float))
As @JérômeRichard suggested, to make your job scheduler aware of the job you need to define the number of processors that will be engaged for this task. The following line could help: ncpus = int(os.getenv('SLURM_CPUS_PER_TASK', 1))
You need to use this line inside your Python script. Also, inside parallel_function, use with contextlib.closing(mp.Pool(processes=ncpus)) as pool: instead of with contextlib.closing(mp.Pool(processes=10)) as pool:. Thanks
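A minimal sketch of that suggestion, assuming the job runs under SLURM and that the scheduler sets SLURM_CPUS_PER_TASK (the code falls back to 1 worker otherwise). worker_count and square are illustrative names, not part of the original script.

```python
import multiprocessing as mp
import os


def worker_count(default=1):
    """CPUs granted to this task by SLURM, or `default` outside a SLURM job."""
    return int(os.getenv("SLURM_CPUS_PER_TASK", default))


def square(x):
    # Placeholder workload standing in for my_func2.
    return x * x


if __name__ == "__main__":
    ncpus = worker_count()
    # Size the pool from the allocation instead of hard-coding 10 processes.
    with mp.Pool(processes=ncpus) as pool:
        print(pool.map(square, range(8)))
```

Sizing the pool from the allocation avoids starting more worker processes than the scheduler has granted CPUs for.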

generating a list of arrays using multiprocessing in python

I am having difficulty implementing parallelisation for generating a list of arrays. In this case, each array is generated independently and then appended to a list. Somehow multiprocessing's apply_async() outputs an empty array when I feed it complicated arguments.
More specifically, to give some context, I am attempting to implement a machine learning algorithm using parallelisation. The idea is the following: I have a 'system', and an 'agent' which performs actions on the system. To teach the agent (in this case a neural net) how to behave optimally (with respect to a certain reward scheme that I have omitted here), the agent needs to generate trajectories of the system by applying actions to it. From the reward obtained upon performing the actions, the agent then learns what to do and what not to do. Importantly, the possible actions in the code are referred to as integers:
possible_actions = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
So here I am attempting to generate many such trajectories using multiprocessing (sorry the code is not runnable here as it requires many other files, but I'm hoping somebody can spot the issue):
from quantum_simulator_EC import system
from reinforce_keras_EC import Agent
import multiprocessing as mp
s = system(1200, N=3)
s.set_initial_state([0,0,1])
agent = Agent(alpha=0.0003, gamma=0.95, n_actions=len( s.actions ))
def get_result(result):
    global action_batch
    action_batch.append(result)
def generate_trajectory(s, agent):
    sequence_of_actions = []
    for k in range( 5 ):
        net_input = s.generate_net_input_FULL(6)
        action = agent.choose_action( net_input )
        sequence_of_actions.append(action)
    return sequence_of_actions
action_batch = []
pool = mp.Pool(2)
for i in range(0, batch_size):
    pool.apply_async(generate_trajectory, args=(s,agent), callback=get_result)
pool.close()
pool.join()
print(action_batch)
The problem is that the code returns an empty list []. Can somebody explain to me what the issue is? Are there restrictions on the kind of arguments that I can pass to apply_async? In this example I am passing my system 's' and my 'agent', both complicated objects. I mention this because when I test my code with simple arguments like integers or matrices, instead of agent and system, it works fine. If there is no obvious reason why it's not working, tips on how to debug the code would also be helpful.
Note that there is no problem if I do not use multiprocessing by replacing the last part by:
action_batch = []
for i in range(0, batch_size):
    get_result( generate_trajectory(s,agent) )
print(action_batch)
And in this case, the output here is as expected, a list of sequences of 5 actions:
[[4, 2, 1, 1, 7], [8, 2, 2, 12, 1], [8, 1, 9, 11, 9], [7, 10, 6, 1, 0]]
The final results can directly be appended to a list in the main process, no need to create a callback function. Then you can close and join the pool, and finally retrieve all the results using get.
See the following two examples, using apply_async and starmap_async, (see this post for the difference).
Solution apply
import multiprocessing as mp
import time
def func(s, agent):
    print(f"Working on task {agent}")
    time.sleep(0.1)  # some task
    return (s, s, s)
if __name__ == '__main__':
    agent = "My awesome agent"
    with mp.Pool(2) as pool:
        results = []
        for s in range(5):
            results.append(pool.apply_async(func, args=(s, agent)))
        pool.close()
        pool.join()
    print([result.get() for result in results])
Solution starmap
import multiprocessing as mp
import time
def func(s, agent):
    print(f"Working on task {agent}")
    time.sleep(0.1)  # some task
    return (s, s, s)
if __name__ == '__main__':
    agent = "My awesome agent"
    with mp.Pool(2) as pool:
        result = pool.starmap_async(func, [(s, agent) for s in range(5)])
        pool.close()
        pool.join()
    print(result.get())
Output
Working on task My awesome agent
Working on task My awesome agent
Working on task My awesome agent
Working on task My awesome agent
Working on task My awesome agent
[(0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4)]
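A possible explanation for the empty list in the question, offered as an assumption since the original code cannot be run here: when a task fails (for example because an argument cannot be pickled, or the worker raises), apply_async with only callback= reports nothing, and the callback is simply never called. Calling get() on the result, as in the solutions above, re-raises the error; passing error_callback= is another way to surface it. In the sketch below, risky and run are illustrative names.

```python
import multiprocessing as mp


def risky(x):
    # Fails for one input to imitate a task that raises in the worker.
    if x == 2:
        raise ValueError(f"cannot handle {x}")
    return x * 10


def run(pool, inputs):
    """Submit tasks and return (sorted results, error messages) gathered via callbacks."""
    results, errors = [], []
    handles = [
        pool.apply_async(risky, (x,),
                         callback=results.append,
                         error_callback=errors.append)
        for x in inputs
    ]
    for handle in handles:
        handle.wait()  # returns once the task finished, successfully or not
    return sorted(results), [str(e) for e in errors]


if __name__ == "__main__":
    with mp.Pool(2) as pool:
        print(run(pool, range(4)))
```

Without error_callback the failing task would be dropped silently and only three results would appear, which is the same symptom as a list that stays empty when every task fails.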

Multiprocessing to return large data sets in Python

I have 2 functions in a Python 3.7 script that search 2 separate network nodes and returns very large data sets of strings in a list. The smaller data set length is ~300K entries, while the larger one is ~1.5M. This script takes almost an hour to execute because of how it has to compile the data sets as well as having the second data set be significantly larger. I don't have a way to shorten the run time by changing how the compilation happens, there's no easier way for me to get the data from the network nodes. But I can cut almost 10 minutes if I can run them simultaneously, so I'm trying to shorten the run time by using multiprocessing so I can run both of them at once.
I do not need them to necessarily start within the same second or finish at the same second, just want them to run at the same time.
Here's a breakdown of my first attempt at coding for multiprocessing:
def p_func(arg1, arg2, pval):
    ## Do Stuff
    return pval
def s_func(arg1, sval):
    ## Do Stuff
    return sval
# Creating variables to get return values that multiprocessing can handle
pval = multiprocessing.Value(list)
sval = multiprocessing.Value(list)
# setting up multiprocessing Processes for each function and passing arguments
p1 = multiprocessing.Process(target=p_func, args=(arg1, arg2, pval))
s1 = multiprocessing.Process(target=s_func, args=(arg3, sval))
p1.start()
s1.start()
p1.join()
s1.join()
print("Number of values in pval: ", len(pval))
print("Number of values in sval: ", len(sval))
I believe I have solved my list concerns, so....
Based on comments I've updated my code as follows:
#! python3
import multiprocessing as mp
def p_func(arg1, arg2, pval):
    # takes arg1 and arg2 and queries network node to return list of ~300K
    # values and assigns that list to pval for return to main()
    return pval
def s_func(arg1, sval):
    # takes arg1 and queries network node to return list of ~1.5M
    # values and assigns that list to sval for return to main()
    return sval
# Creating variables to get return values that multiprocessing can handle in
# main()
with mp.Manager() as mgr:
    pval = mgr.list()
    sval = mgr.list()
    # setting up multiprocessing Processes for each function and passing
    # arguments
    p1 = mp.Process(target=p_func, args=(arg1, arg2, pval))
    s1 = mp.Process(target=s_func, args=(arg3, sval))
    p1.start()
    s1.start()
    p1.join()
    s1.join()
# out of with block
print("Number of values in pval: ", len(pval))
print("Number of values in sval: ", len(sval))
Now getting a TypeError: can't pickle _thread.lock objects on the p1.start() invocation. I'm guessing that one of the variables I have passed in the p1 declaration is causing a problem with multiprocessing, but I'm not sure how to read the error or resolve the problem.
Use a Manager.list() instead:
import multiprocessing as mp
def p_func(pval):
    pval.extend(list(range(300000)))
def s_func(sval):
    sval.extend(list(range(1500000)))
if __name__ == '__main__':
    # Creating variables to get return values that mp can handle
    with mp.Manager() as mgr:
        pval = mgr.list()
        sval = mgr.list()
        # setting up mp Processes for each function and passing arguments
        p1 = mp.Process(target=p_func, args=(pval,))
        s2 = mp.Process(target=s_func, args=(sval,))
        p1.start()
        s2.start()
        p1.join()
        s2.join()
        print("Number of values in pval: ", len(pval))
        print("Number of values in sval: ", len(sval))
Output:
Number of values in pval: 300000
Number of values in sval: 1500000
Manager objects are slower than shared memory but more flexible. Shared memory is faster, so if you know an upper limit for your arrays, you could use a fixed-sized shared memory Array and a shared value indicating the used size instead, such as:
#!python3
import multiprocessing as mp
def p_func(parr,psize):
    for i in range(10):
        parr[i] = i
    psize.value = 10
def s_func(sarr,ssize):
    for i in range(5):
        sarr[i] = i
    ssize.value = 5
if __name__ == '__main__':
    # Creating variables to get return values that mp can handle
    parr = mp.Array('i',2<<20)  # 2M
    sarr = mp.Array('i',2<<20)
    psize = mp.Value('i',0)
    ssize = mp.Value('i',0)
    # setting up mp Processes for each function and passing arguments
    p1 = mp.Process(target=p_func, args=(parr,psize))
    s2 = mp.Process(target=s_func, args=(sarr,ssize))
    p1.start()
    s2.start()
    p1.join()
    s2.join()
    print("parr: ", parr[:psize.value])
    print("sarr: ", sarr[:ssize.value])
Output:
parr: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
sarr: [0, 1, 2, 3, 4]
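If the two functions spend most of their time waiting on the network nodes (an assumption based on the question, not something the answer states), a simpler alternative is to run them in two threads and collect the return values from futures, which avoids shared objects altogether. In this sketch p_func and s_func are placeholder workloads and run_both is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def p_func(n):
    # Placeholder for the first network query (hypothetical workload).
    return [f"p{i}" for i in range(n)]


def s_func(n):
    # Placeholder for the second, larger network query.
    return [f"s{i}" for i in range(n)]


def run_both(p_n, s_n):
    """Run both functions concurrently and return their lists."""
    with ThreadPoolExecutor(max_workers=2) as executor:
        p_future = executor.submit(p_func, p_n)
        s_future = executor.submit(s_func, s_n)
        # result() blocks until the corresponding call has finished.
        return p_future.result(), s_future.result()


if __name__ == "__main__":
    pval, sval = run_both(300_000, 1_500_000)
    print("Number of values in pval: ", len(pval))
    print("Number of values in sval: ", len(sval))
```

For CPU-bound work, ProcessPoolExecutor offers the same submit/result interface, at the cost of pickling the returned lists back to the parent process.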

Python Multiprocessing Scheduling

In Python 3.6, I am running multiple processes in parallel, where each process pings a URL and returns a Pandas dataframe. I want to keep running the (2+) processes continually. I have created a minimal representative example, shown below.
My questions are:
1) My understanding is that since I have different functions, I cannot use Pool.map_async() and its variants. Is that right? The only examples of these I have seen were repeating the same function, like on this answer.
2) What is the best practice for making this setup run perpetually? In my code below, I use a while loop, which I suspect is not suited for this purpose.
3) Is the way I am using Process and Manager optimal? I use multiprocessing.Manager.dict() as the shared dictionary to return the results from the processes. I saw in a comment on this answer that using a Queue here would make sense; however, the Queue object has no '.dict()' method. So, I am not sure how that would work.
I would be grateful for any improvements and suggestions with example code.
import numpy as np
import pandas as pd
import multiprocessing
import time
def worker1(name, t , seed, return_dict):
    '''worker function'''
    print(str(name) + 'is here.')
    time.sleep(t)
    np.random.seed(seed)
    df= pd.DataFrame(np.random.randint(0,1000,8).reshape(2,4), columns=list('ABCD'))
    return_dict[name] = [df.columns.tolist()] + df.values.tolist()
def worker2(name, t, seed, return_dict):
    '''worker function'''
    print(str(name) + 'is here.')
    np.random.seed(seed)
    time.sleep(t)
    df = pd.DataFrame(np.random.randint(0, 1000, 12).reshape(3, 4), columns=list('ABCD'))
    return_dict[name] = [df.columns.tolist()] + df.values.tolist()
if __name__ == '__main__':
    t=1
    while True:
        start_time = time.time()
        manager = multiprocessing.Manager()
        parallel_dict = manager.dict()
        seed=np.random.randint(0,1000,1)  # send seed to worker to return a diff df
        jobs = []
        p1 = multiprocessing.Process(target=worker1, args=('name1', t, seed, parallel_dict))
        p2 = multiprocessing.Process(target=worker2, args=('name2', t, seed+1, parallel_dict))
        jobs.append(p1)
        jobs.append(p2)
        p1.start()
        p2.start()
        for proc in jobs:
            proc.join()
        parallel_end_time = time.time() - start_time
        #print(parallel_dict)
        df1= pd.DataFrame(parallel_dict['name1'][1:],columns=parallel_dict['name1'][0])
        df2 = pd.DataFrame(parallel_dict['name2'][1:], columns=parallel_dict['name2'][0])
        merged_df = pd.concat([df1,df2], axis=0)
        print(merged_df)
Answer 1 (map on multiple functions)
You're technically right.
With map, map_async and other variations, you should use a single function.
But this constraint can be bypassed by implementing an executor, and passing the function to execute as part of the parameters:
def dispatcher(args):
    return args[0](*args[1:])
So a minimum working example:
import multiprocessing as mp
def function_1(v):
    print("hi %s"%v)
    return 1
def function_2(v):
    print("by %s"%v)
    return 2
def dispatcher(args):
    return args[0](*args[1:])
with mp.Pool(2) as p:
    tasks = [
        (function_1, "A"),
        (function_2, "B")
    ]
    r = p.map_async(dispatcher, tasks)
    r.wait()
    results = r.get()
Answer 2 (Scheduling)
I would remove the while loop from the script and schedule a cron job on GNU/Linux (or the equivalent scheduled task on Windows), so that the OS is responsible for its execution.
On Linux you can run crontab -e and add the following line to make the script run every 5 minutes.
*/5 * * * * python /path/to/script.py
Answer 3 (Shared Dictionary)
Yes and no.
To my knowledge, using the Manager is the best way for data such as collections.
For arrays or primitive types (int, float, etc.) there are Value and Array, which are faster.
As the documentation says:
A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.
A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Barrier, Queue, Value and Array.
Server process managers are more flexible than using shared memory objects because they can be made to support arbitrary object types. Also, a single manager can be shared by processes on different computers over a network. They are, however, slower than using shared memory.
But you only have to return a DataFrame, so the shared dictionary is not needed.
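On the Queue idea raised in the question: a queue has no dict interface, but each worker can put a (name, payload) pair on it and the parent can rebuild the dictionary while reading. A minimal sketch, in which worker and drain are illustrative names and a plain list replaces the DataFrame to keep the example dependency-free:

```python
import multiprocessing as mp


def worker(name, n, out_queue):
    # Put a (name, payload) pair; the name plays the role of the dict key.
    out_queue.put((name, list(range(n))))


def drain(out_queue, expected):
    """Read `expected` items from the queue into a dict keyed by name."""
    return dict(out_queue.get() for _ in range(expected))


if __name__ == "__main__":
    queue = mp.Queue()
    procs = [
        mp.Process(target=worker, args=(f"name{i}", i + 2, queue))
        for i in (1, 2)
    ]
    for proc in procs:
        proc.start()
    # Read before joining: a child blocks on exit until its queued data is consumed.
    results = drain(queue, len(procs))
    for proc in procs:
        proc.join()
    print(results)
```

Reading from the queue before calling join() matters for large payloads, because a process that has put items on a queue does not terminate until that data has been flushed to the underlying pipe.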
Cleaned Code
Using all the previous ideas the code can be rewritten as:
map version
import numpy as np
import pandas as pd
from time import sleep
import multiprocessing as mp
def worker1(t , seed):
    print('worker1 is here.')
    sleep(t)
    np.random.seed(seed)
    return pd.DataFrame(np.random.randint(0,1000,8).reshape(2,4), columns=list('ABCD'))
def worker2(t , seed):
    print('worker2 is here.')
    sleep(t)
    np.random.seed(seed)
    return pd.DataFrame(np.random.randint(0, 1000, 12).reshape(3, 4), columns=list('ABCD'))
def dispatcher(args):
    return args[0](*args[1:])
def task_generator(sleep_time=1):
    seed = np.random.randint(0,1000,1)
    yield worker1, sleep_time, seed
    yield worker2, sleep_time, seed + 1
with mp.Pool(2) as p:
    results = p.map(dispatcher, task_generator())
    merged = pd.concat(results, axis=0)
    print(merged)
If concatenating the DataFrames is the bottleneck, an approach with imap might be a better fit.
imap version
with mp.Pool(2) as p:
    merged = pd.DataFrame()
    for result in p.imap_unordered(dispatcher, task_generator()):
        merged = pd.concat([merged,result], axis=0)
    print(merged)
The main difference is that in the map case, the program first waits for all the tasks to end, and then concatenates all the DataFrames.
In the imap_unordered case, as soon as a task has ended, its DataFrame is concatenated to the current results.

How can I make my program to use multiple cores of my system in python?

I want to run my program on all the cores that I have. Below is the code I used (it is part of my full program; I managed to reproduce the working flow).
def ssmake(data):
    sslist=[]
    for cols in data.columns:
        sslist.append(cols)
    return sslist
def scorecal(slisted):
    subspaceScoresList=[]
    if __name__ == '__main__':
        pool = mp.Pool(4)
        feature,FinalsubSpaceScore = pool.map(performDBScan, ssList)
        subspaceScoresList.append([feature, FinalsubSpaceScore])
    #for feature in ssList:
        #FinalsubSpaceScore = performDBScan(feature)
        #subspaceScoresList.append([feature,FinalsubSpaceScore])
    return subspaceScoresList
def performDBScan(subspace):
    minpoi=2
    Epsj=2
    final_data = df[subspace]
    db = DBSCAN(eps=Epsj, min_samples=minpoi, metric='euclidean').fit(final_data)
    labels = db.labels_
    FScore = calculateSScore(labels)
    return subspace, FScore
def calculateSScore(cluresult):
    score = random.randint(1,21)*5
    return score
def StartingFunction(prvscore,curscore,fe_select,df):
    while prvscore<=curscore:
        featurelist=ssmake(df)
        scorelist=scorecal(featurelist)
a = {'a' : [1,2,3,1,2,3], 'b' : [5,6,7,4,6,5], 'c' : ['dog', 'cat', 'tree','slow','fast','hurry']}
df2 = pd.DataFrame(a)
previous=0
current=0
dim=[]
StartingFunction(previous,current,dim,df2)
I had a for loop in the scorecal(slisted) method, now commented out, which takes each column, performs DBSCAN on it, and calculates the score for that column based on the result (in this example I use a random score instead). This loop makes my code run for a long time. So I tried to parallelize the per-column DBSCAN across the cores of my system and wrote the code as shown above, but it does not give the result I need. I am new to the multiprocessing library and not sure about the placement of '__main__' in my program. I would also like to know if there is any other way to run things in parallel in Python. Any help is appreciated.
Your code has everything needed to run on more than one core of a multi-core processor. But it is a mess. I don't know what problem you are trying to solve with the code, and I cannot run it since I don't know what DBSCAN is. To fix your code you should take several steps.
Function scorecal():
def scorecal(feature_list):
    pool = mp.Pool(4)
    result = pool.map(performDBScan, feature_list)
    return result
result is a list containing all the results returned by performDBScan(). You don't have to populate the list manually.
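One detail worth noting: StartingFunction in the question calls scorecal inside a while loop, so a pool that is created on every call and never closed can leave worker processes behind. A sketch that scopes the pool with a with block follows; score_column is a hypothetical stand-in for performDBScan, and a plain dict of columns replaces the DataFrame so the example has no dependencies.

```python
import multiprocessing as mp


def score_column(item):
    # Hypothetical stand-in for performDBScan: returns (column name, score).
    name, values = item
    return name, sum(values)


def scorecal(columns, processes=4):
    """Score every column in parallel; workers are cleaned up when the block exits."""
    with mp.Pool(processes) as pool:
        return pool.map(score_column, list(columns.items()))


if __name__ == "__main__":
    data = {"a": [1, 2, 3], "b": [5, 6, 7], "d": [1, 1, 1]}
    print(scorecal(data))
```

Because map() blocks until all results are available, leaving the with block right after it is safe even though exiting the block terminates the workers.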
Main body of the program:
# imports
# functions
if __name__ == '__main__':
    # your code after functions' definition where you call StartingFunction()
I created a very simplified version of your code (a pool with 4 processes handling 8 columns of my data) with dummy for loops (to get a CPU-bound operation) and tried it. I got 100% CPU load (I have a 4-core i5 processor), which naturally resulted in approximately 4x faster computation (20 seconds vs 74 seconds) compared with a single-process implementation using a for loop.
EDIT.
The complete code I used to try multiprocessing (I use Anaconda (Spyder) / Python 3.6.5 / Win10):
import multiprocessing as mp
import pandas as pd
import time
def ssmake():
    pass
def score_cal(data):
    if True:
        pool = mp.Pool(4)
        result = pool.map(
            perform_dbscan,
            (data.loc[:, col] for col in data.columns))
    else:
        result = list()
        for col in data.columns:
            result.append(perform_dbscan(data.loc[:, col]))
    return result
def perform_dbscan(data):
    assert isinstance(data, pd.Series)
    for dummy in range(5 * 10 ** 8):
        dummy += 0
    return data.name, 101
def calculate_score():
    pass
def starting_function(data):
    print(score_cal(data))
if __name__ == '__main__':
    data = {
        'a': [1, 2, 3, 1, 2, 3],
        'b': [5, 6, 7, 4, 6, 5],
        'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry'],
        'd': [1, 1, 1, 1, 1, 1]}
    data = pd.DataFrame(data)
    start = time.time()
    starting_function(data)
    print(
        'running time = {:.2f} s'
        .format(time.time() - start))
