Why do processes take different amounts of time to execute in multi-processing? - python-3.x

I am running a program that processes a 3-lakh-row (300,000-row) dataframe using multi-processing. I do this on a 64-core VM using 62 processes created with multiprocessing.Process in Python. Each process is fed 4900 rows.
Oddly, the processes take different amounts of time to finish: the first process finished its task in 15 minutes while the last one took more than 70 minutes. Below is the code block for multi-processing that I've used.
import multiprocessing

# Split the dataframe into fixed-size batches and run one worker process per
# batch.  NOTE(review): `data` (the dataframe) and the worker function
# `process` must be defined before this point.
data_thread = data
uid = "final"  ### make sure to change uid
batch_size = 4900
counter = 0
datalen = len(data_thread)
Flag = True
processes = []
# Bug fix: `indices` was used below but never initialised -> NameError on the
# very first iteration.
indices = []
while(Flag):
    start = counter*batch_size
    end = min(datalen, start+batch_size)
    if end>=datalen:
        Flag = False           # last (possibly short) batch
    indices.append((start, end))
    data_split = data_thread.iloc[start:end]
    threadName = "process_"+str(counter)
    processes.append(multiprocessing.Process(target=process, args = (data_split, uid, threadName, start, end, )))
    counter = counter+1

# Start every worker.  Bug fix: %d (not %lf, a float specifier) is the right
# conversion for the integer index.
startCount = 0
while(startCount<len(processes)):
    t = processes[startCount]
    try:
        t.start()
    except:
        print("Error encountered while starting the process_%d: %s"%(startCount, str(indices[startCount])))
    print("Started: process_" + str(startCount))
    startCount = startCount + 1

# Block until all workers have finished.
endCount = 0
while(endCount<len(processes)):
    t = processes[endCount]
    t.join()
    print("Joined: process_" + str(endCount))
    endCount = endCount + 1

Related

Issue in modifying a for loop using joblib

I have a sequential set of code which generates a tuple of values for different stocks, which is passed to a multiprocessing pool to apply technical indicators. Below is the sequential piece of code, which is working as expected.
# Sequential construction of the starmap argument list, one tuple per stock.
# (Reconstructed: the pasted version had statements broken across lines.)
child_fn_arg_tuple_list = []
for stock in m1_ts_consistent_stock_list:  # prev_day_stock_list:
    # Row for this stock at the previous timestamp.
    f_prev_ts_stock_merged_mdf_row = m1_df_in_mdf[
        (m1_df_in_mdf['stock_id'] == stock) &
        (m1_df_in_mdf['datetimestamp'] == prev_ts)]  # previous timestamp
    if f_prev_ts_stock_merged_mdf_row.empty:
        # No history for this stock -> drop it from the working list.
        f_filtered_stock_list.remove(stock)
    else:
        f_stock_prev_ts_merged_ohlcv_df_list_of_dict = \
            f_prev_ts_stock_merged_mdf_row['merged_ohlcv_df'].iloc[0]
        f_current_ts_stock_ohlcv_row_df = \
            period_ts_ohlcv_df[(period_ts_ohlcv_df['stock_id'] == stock)].copy()
        if f_current_ts_stock_ohlcv_row_df.shape[0] == 1:
            pass
        else:
            # NOTE(review): error_string is built but never logged or raised.
            error_string = (f_current_fn
                            + 'Expected f_current_ts_stock_ohlcv_row_df shape for stock ' + stock
                            + 'at ts ' + str(m1_time) + ' is not 1 - '
                            + str(f_current_ts_stock_ohlcv_row_df.shape[0]))
            # Fall back to the row one minute earlier.
            f_current_ts_stock_ohlcv_row_df = \
                period_ts_ohlcv_df[(period_ts_ohlcv_df['stock_id'] == stock) &
                                   (period_ts_ohlcv_df['datetimestamp'] ==
                                    (m1_time - timedelta(minutes=1)))].copy()
        fn_arg_tuple = (f_from_date_list, f_run_folder_name, stock,
                        f_period, m1_time, f_stock_prev_ts_merged_ohlcv_df_list_of_dict,
                        f_current_ts_stock_ohlcv_row_df, f_grouped_column_list_dict)
        child_fn_arg_tuple_list.append(fn_arg_tuple)

# Fan the tuples out to a worker pool.
result_list = []
pool = multiprocessing.Pool(7)
for result in pool.starmap(single_stock_apply_indicator_df_in_df_v3, child_fn_arg_tuple_list):
    result_list.append(result)
pool.close()
Since the for loop runs for around 400 stocks every minute, I am trying to speed up the for loop over stocks, before passing them for applying multiprocessing using python inner function and joblib - parallel , delayed.
def create_child_fn_arg_tuple_list(cp_stock):  # cp = child parameter
    """Build the starmap argument tuple(s) for one stock.

    NOTE(review): this function is defined *inside* another function, so the
    joblib 'multiprocessing' backend cannot pickle it — that is exactly the
    "Can't pickle local object" error reported.  Move it to module level (or
    use backend='loky' or a thread backend) to make it picklable.

    NOTE(review): mutating f_filtered_stock_list and appending to
    child_fn_arg_tuple_list happens in the *child* process and does not
    propagate back to the parent; the result should be communicated via the
    return value only.
    """
    f_prev_ts_stock_merged_mdf_row = m1_df_in_mdf[
        (m1_df_in_mdf['stock_id'] == cp_stock) &
        (m1_df_in_mdf['datetimestamp'] == prev_ts)].copy()
    if f_prev_ts_stock_merged_mdf_row.empty:
        f_filtered_stock_list.remove(cp_stock)
    else:
        f_stock_prev_ts_merged_ohlcv_df_list_of_dict = \
            f_prev_ts_stock_merged_mdf_row['merged_ohlcv_df'].iloc[0]
        f_current_ts_stock_ohlcv_row_df = period_ts_ohlcv_df[
            (period_ts_ohlcv_df['stock_id'] == cp_stock)].copy()
        if f_current_ts_stock_ohlcv_row_df.shape[0] == 1:
            pass
        else:
            # NOTE(review): error_string is built but never logged or raised.
            error_string = (f_current_fn
                            + 'Expected f_current_ts_stock_ohlcv_row_df shape for stock ' +
                            cp_stock + 'at ts ' + str(m1_time) + ' is not 1 - ' +
                            str(f_current_ts_stock_ohlcv_row_df.shape[0]))
            f_current_ts_stock_ohlcv_row_df = \
                period_ts_ohlcv_df[(period_ts_ohlcv_df['stock_id'] == cp_stock)
                                   & (period_ts_ohlcv_df['datetimestamp'] ==
                                      (m1_time - timedelta(minutes=1)))].copy()
        fn_arg_tuple = (f_from_date_list, f_run_folder_name, cp_stock, f_period,
                        m1_time, f_stock_prev_ts_merged_ohlcv_df_list_of_dict,
                        f_current_ts_stock_ohlcv_row_df, f_grouped_column_list_dict)
        child_fn_arg_tuple_list.append(fn_arg_tuple)
    return child_fn_arg_tuple_list

# NOTE(review): each parallel call returns the whole list, so this produces a
# list of lists — starmap below then receives lists, not tuples.
child_fn_arg_tuple_list = Parallel(n_jobs=7, backend='multiprocessing')\
    (delayed(create_child_fn_arg_tuple_list)(in_stock) for in_stock in
     m1_ts_consistent_stock_list)
result_list = []
pool = multiprocessing.Pool(7)
for result in pool.starmap(single_stock_apply_indicator_df_in_df_v3, child_fn_arg_tuple_list):
    result_list.append(result)
pool.close()
I am getting an error -
AttributeError: Can't pickle local object 'multiple_stock_apply_indicator_df_in_df_v6..create_child_fn_arg_tuple_list' and it occurs on the line where I am trying to apply joblib's parallel and delayed.
Please note that there are some common variables between the main function and inner function - m1_df_in_mdf, f_filtered_stock_list
1] m1_df_in_mdf is not affected as it is used only in read only mode
2] f_filtered_stock_list is affected as some stocks are removed
My objective is to get the for loop of stocks run faster, any other approaches are also welcome.

Object tracking and counting with YOLO V5 and Deep_Sort problem

I'm working on an object tracking and counting project, and I want the counter to restart from 0 every 10 seconds. However, when I tried to create a counting thread it just doesn't work, and the count shows 0 for the whole video.
Here is the code:
def countdown():
    """Count down 10 seconds, then reset the shared object counter.

    Bug fix: the original declared only `timer` as global, so `count = 0`
    created a *local* variable and the shared counter was never reset —
    which is why the display stayed at 0.  Both names must be global.

    NOTE(review): this runs only once; to reset every 10 seconds the thread
    body needs an outer loop (or the thread must be restarted).
    """
    global timer, count
    timer = 10
    for z in range(10):
        timer = timer - 1
        sleep(1)
    if timer == 0:
        count = 0

countdown_thread = threading.Thread(target = countdown)
countdown_thread.start()
def count_obj(box, w, h, id):
    """Count a detection once when its centre crosses the line at h-200.

    Reads globals `count` (running total), `data` (ids already counted) and
    `timer` (set by countdown()).  NOTE(review): `while timer > 0:` busy-loops
    until the countdown expires instead of checking once per frame — a plain
    `if` is almost certainly what was intended.
    """
    global count, data
    #ADDEDD
    while timer > 0:
        # Centre of the bounding box (box = [x1, y1, x2, y2]).
        center_coordinates = (int(box[0]+(box[2]-box[0])/2) , int(box[1]+(box[3]-box[1])/2))
        if int(box[1]+(box[3]-box[1])/2) > (h -200):
            if id not in data:
                count += 1
                data.append(id)

# Initial value of the shared counter.  NOTE(review): indentation was lost in
# the paste — this is assumed to be module level; confirm against the original.
count = 0

Target Labeling Using Sliding Window On Stock Data In Python

I'm trying to label BUY, SELL, and HOLD values to the closing stock prices based on the algorithm I found in a paper. I'm not quite able to figure out the error I'm getting. I'd very much appreciate your help. Thank you.
Algorithm:
[EDITED]
My implementation:
# BUY/SELL/HOLD labelling over a sliding window (question's first attempt;
# indentation reconstructed from the paste).
window_size = 11
counter = 0
result = []
window_begin_idx=0; window_end_idx=0; window_middle_idx=0; min_idx=0; max_idx=0;
while counter < len(closing_price):
    if counter > window_size:
        window_begin_idx = counter - window_size
        window_end_idx = window_begin_idx + window_size - 1
        window_middle_idx = (window_begin_idx + window_end_idx)//2
        for i in range(window_begin_idx, window_end_idx+1):
            rng = closing_price[window_begin_idx:window_end_idx+1]
            number = closing_price[i]
            # NOTE(review): mins/maxs are reset to the window extremes every
            # iteration, so `number < mins` / `number > maxs` can never be
            # true — min_idx/max_idx are never updated, hence only HOLD.
            mins = rng.min()
            maxs = rng.max()
            if number < mins:
                mins=number
                min_idx = np.argmin(rng)   # index *within* the window
            if number > maxs:
                maxs=number
                max_idx = np.argmax(rng)
        # NOTE(review): min_idx/max_idx are window-relative while
        # window_middle_idx is an absolute index — they are not comparable.
        if max_idx == window_middle_idx:
            result.append("SELL")
        elif min_idx == window_middle_idx:
            result.append("BUY")
        else:
            result.append("HOLD")
        mins = 0.0
        maxs = 10000.0
    counter+=1
After the edit based on the author's JAVA code, I'm only getting the HOLD label. The author's implementation is here.
You need to initialize mins, maxs, min_idx and max_idx with appropriate values before the main loop.
In your case if max_idx == occurs earlier than any max_idx assignment
Edit after question change:
Seems in Python you can make similar behavior replacing the whole for-loop with:
# Vectorised replacement for the whole inner for-loop: slice the window once
# and read its extremes directly.
rng = closing_price[window_begin_idx:window_end_idx+1]
mins = rng.min()
maxs = rng.max()
# NOTE(review): .index() exists on Python lists; if closing_price is a numpy
# array, np.argmin/np.argmax would be needed instead — confirm the type.
min_idx = rng.index(mins)
max_idx = rng.index(maxs)
After reading through the author's implementation and following the suggestions provided by MBo, I have managed to solve this issue. So, now anyone who wants this algorithm in python, below is the code:
# Final working BUY/SELL/HOLD labelling (indentation reconstructed).
window_size = 11
counter = 0
result = []
window_begin_idx=0; window_end_idx=0; window_middle_idx=0; min_idx=0; max_idx=0;
# Sentinels: any real price is below 10000 and above 0 on the first compare.
number=0.0; mins=10000.0; maxs=0.0
while counter < len(closing_price):
    if counter > window_size:
        window_begin_idx = counter - window_size
        window_end_idx = window_begin_idx + window_size - 1
        window_middle_idx = (window_begin_idx + window_end_idx)//2
        for i in range(window_begin_idx, window_end_idx+1):
            number = closing_price[i]
            if number < mins:
                mins=number
                # Absolute index of the window minimum (first occurrence).
                min_idx = np.where(closing_price==mins)[0][0]
            if number > maxs:
                maxs=number
                max_idx = np.where(closing_price==maxs)[0][0]
        # Label by whether the middle element is the window extreme.
        if max_idx == window_middle_idx:
            result.append("SELL")
        elif min_idx == window_middle_idx:
            result.append("BUY")
        else:
            result.append("HOLD")
        # Reset sentinels for the next window.
        mins = 10000.0
        maxs = 0.0
    counter+=1

How to solve issue of maximum iteration getting exceeded in python gekko in this case (explained in the body)?

I am trying to track multiple set-points in the case of interacting quadruple tank system process. Here, the upper limits of tanks are 25 and lower limits are 0. I want to track the set-point values of 5,12,7 and 5. Although, I am able to track the initial 3 set-points (i.e. 5,12 and 7), I am not able to track the last set-point due to solver exceeding max. iterations. I have attached the code below->
#MHE+MPC model
# Builds four GEKKO models for a quadruple-tank process:
#   p  - process simulator driven by the MPC moves
#   m  - moving-horizon estimator (MHE) for k1/k2
#   c  - model predictive controller (MPC)
#   p1 - "true plant" simulator with fixed k1/k2
#to measure computational time of the code
start=time.time()
#Process Model
p = GEKKO(remote=False)
# First cycle index at which estimates are stored (used as `if i>=process:`
# in the main loop) — presumably a warm-up offset; confirm intent.
process=0
p.time = [0,0.5]
noise = 0.25
#Constants
g = 981
g1 = .9
g2 = .9
A1=32
A3=32
A2=32
A4=32
a1=0.057
a3=0.057
a2=0.057
a4=0.057
init_h=5
#Controlled process variables
p.h1=p.SV(lb=0,ub=25)
p.h2=p.SV(lb=0,ub=25)
p.h3=p.SV(lb=0,ub=25)
p.h4=p.SV(lb=0,ub=25)
#Manipulated process variables
p.v1=p.MV(value=3.15,lb=0.1,ub=8)
p.v2=p.MV(value=3.15,lb=0.1,ub=8)
#Parameters of process
p.k1=p.Param(value=3.14,lb=0,ub=10)
p.k2=p.Param(value=3.14,lb=0,ub=10)
#Equations process
# Tank mass balances; (2*g*h)**0.5 is undefined for h<0 — a small positive
# lower bound on the levels avoids solver failures (see answer below).
p.Equation(A1*p.h1.dt()==a3*((2*g*p.h3)**0.5)-(a1*((2*g*p.h1)**0.5))+(g1*p.k1*p.v1))
p.Equation(A2*p.h2.dt()==a4*((2*g*p.h4)**0.5)-(a2*((2*g*p.h2)**0.5))+(g2*p.k2*p.v2))
p.Equation(A3*p.h3.dt()==-a3*((2*g*p.h3)**0.5)+((1-g2)*p.k2*p.v2))
p.Equation(A4*p.h4.dt()==-a4*((2*g*p.h4)**0.5)+((1-g1)*p.k1*p.v1))
#options
p.options.IMODE = 4
#p.h1.TAU=-10^10
#p.h2.TAU=-10^10
#%% MHE Model
m = GEKKO(remote=False)
#prediction horizon
# NOTE(review): linspace(0,40,41) is 0-40 in steps of 1, not "0-20 by 0.5"
# as the trailing comment claims.
m.time = np.linspace(0,40,41) #0-20 by 0.5 -- discretization must match simulation
#MHE control, manipulated variables and parameters
m.h1=m.CV(lb=0,ub=25)
m.h2=m.CV(lb=0,ub=25)
m.h3=m.SV(lb=0,ub=25)
m.h4=m.SV(lb=0,ub=25)
m.v1=m.MV(value=3.15,lb=0.10,ub=8)
m.v2=m.MV(value=3.15,lb=0.10,ub=8)
m.k1=m.FV(value=3.14,lb=0,ub=10)
m.k2=m.FV(value=3.14,lb=0,ub=10)
#m.h1.TAU=0
#m.h2.TAU=0
#Equations
m.Equation(A1*m.h1.dt()==a3*((2*g*m.h3)**0.5)-(a1*((2*g*m.h1)**0.5))+(g1*m.k1*m.v1))
m.Equation(A2*m.h2.dt()==a4*((2*g*m.h4)**0.5)-(a2*((2*g*m.h2)**0.5))+(g2*m.k2*m.v2))
m.Equation(A3*m.h3.dt()==-a3*((2*g*m.h3)**0.5)+((1-g2)*m.k2*m.v2))
m.Equation(A4*m.h4.dt()==-a4*((2*g*m.h4)**0.5)+((1-g1)*m.k1*m.v1))
#Options
m.options.IMODE = 5 #MHE
m.options.EV_TYPE = 2
# STATUS = 0, optimizer doesn't adjust value
# STATUS = 1, optimizer can adjust
m.v1.STATUS = 0
m.v2.STATUS = 0
m.k1.STATUS=1
m.k2.STATUS=1
m.h1.STATUS = 1
m.h2.STATUS = 1
#m.h3.STATUS = 0
#m.h4.STATUS = 0
# FSTATUS = 0, no measurement
# FSTATUS = 1, measurement used to update model
m.v1.FSTATUS = 1
m.v2.FSTATUS = 1
m.k1.FSTATUS=0
m.k2.FSTATUS=0
m.h1.FSTATUS = 1
m.h2.FSTATUS = 1
m.h3.FSTATUS = 1
m.h4.FSTATUS = 1
#m.options.MAX_ITER=1000
m.options.SOLVER=3
m.options.NODES=3
#%% MPC Model
c = GEKKO(remote=False)
# NOTE(review): linspace(0,10,11) is 0-10 in steps of 1, not "0-5 by 0.5".
c.time = np.linspace(0,10,11) #0-5 by 0.5 -- discretization must match simulation
c.v1=c.MV(value=3.15,lb=0.10,ub=8)
c.v2=c.MV(value=3.15,lb=0.10,ub=8)
c.k1=c.FV(value=3.14,lb=0,ub=10)
c.k2=c.FV(value=3.14,lb=0,ub=10)
#Variables
c.h1=c.CV(lb=0,ub=25)
c.h2=c.CV(lb=0,ub=25)
c.h3=c.SV(lb=0,ub=25)
c.h4=c.SV(lb=0,ub=25)
#Equations
c.Equation(A1*c.h1.dt()==a3*((2*g*c.h3)**0.5)-(a1*((2*g*c.h1)**0.5))+(g1*c.k1*c.v1))
c.Equation(A2*c.h2.dt()==a4*((2*g*c.h4)**0.5)-(a2*((2*g*c.h2)**0.5))+(g2*c.k2*c.v2))
c.Equation(A3*c.h3.dt()==-a3*((2*g*c.h3)**0.5)+((1-g2)*c.k2*c.v2))
c.Equation(A4*c.h4.dt()==-a4*((2*g*c.h4)**0.5)+((1-g1)*c.k1*c.v1))
#Options
c.options.IMODE = 6 #MPC
c.options.CV_TYPE = 2
# STATUS = 0, optimizer doesn't adjust value
# STATUS = 1, optimizer can adjust
c.v1.STATUS = 1
c.v2.STATUS = 1
c.k1.STATUS=0
c.k2.STATUS=0
c.h1.STATUS = 1
c.h2.STATUS = 1
#c.h3.STATUS = 0
#c.h4.STATUS = 0
# FSTATUS = 0, no measurement
# FSTATUS = 1, measurement used to update model
c.v1.FSTATUS = 0
c.v2.FSTATUS = 0
c.k1.FSTATUS=1
c.k2.FSTATUS=1
c.h1.FSTATUS = 1
c.h2.FSTATUS = 1
c.h3.FSTATUS = 1
c.h4.FSTATUS = 1
sp=5
c.h1.SP=sp
c.h2.SP=sp
# "True plant" simulator with fixed parameters.
p1 = GEKKO(remote=False)
p1.time = [0,0.5]
#Parameters
p1.h1=p1.CV(lb=0,ub=25)
p1.h2=p1.CV(lb=0,ub=25)
p1.h3=p1.CV(lb=0,ub=25)
p1.h4=p1.CV(lb=0,ub=25)
p1.v1=p1.MV(value=3.15,lb=0.1,ub=8)
p1.v2=p1.MV(value=3.15,lb=0.1,ub=8)
p1.k1=p1.Param(lb=0,ub=10,value=3.14)
p1.k2=p1.Param(lb=0,ub=10,value=3.14)
#Equations
p1.Equation(A1*p1.h1.dt()==a3*((2*g*p1.h3)**0.5)-a1*((2*g*p1.h1)**0.5)+g1*p1.k1*p1.v1)
p1.Equation(A2*p1.h2.dt()==a4*((2*g*p1.h4)**0.5)-a2*((2*g*p1.h2)**0.5)+g2*p1.k2*p1.v2)
p1.Equation(A3*p1.h3.dt()==-a3*((2*g*p1.h3)**0.5)+(1-g2)*p1.k2*p1.v2)
p1.Equation(A4*p1.h4.dt()==-a4*((2*g*p1.h4)**0.5)+(1-g1)*p1.k1*p1.v1)
#options
p1.options.IMODE = 4
#%% problem configuration
# number of cycles
cycles = 480
# noise level
#%% run process, estimator and control for cycles
# History arrays, one slot per cycle.
h1_meas = np.empty(cycles)
h2_meas =np.empty(cycles)
h3_meas =np.empty(cycles)
h4_meas=np.empty(cycles)
h1_est = np.empty(cycles)
h2_est = np.empty(cycles)
h3_est = np.empty(cycles)
h4_est = np.empty(cycles)
h1_plant=np.empty(cycles)
h2_plant=np.empty(cycles)
h3_plant=np.empty(cycles)
h4_plant=np.empty(cycles)
h1_measured=np.empty(cycles)
h2_measured=np.empty(cycles)
h3_measured=np.empty(cycles)
h4_measured=np.empty(cycles)
v1_est = np.empty(cycles)
v2_est = np.empty(cycles)
k1_est = np.empty(cycles)
k2_est = np.empty(cycles)
u_cont_k1 = np.empty(cycles)
u_cont_k2 = np.empty(cycles)
sp_store = np.empty(cycles)
sum_est=np.empty(cycles)
sum_model=np.empty(cycles)
# Create plot
plt.figure(figsize=(10,7))
plt.ion()
plt.show()
# NOTE(review): these set plain attributes on the GEKKO objects; the solver
# iteration limit is the option `x.options.MAX_ITER` — confirm these lines
# actually take effect.
p.MAX_ITER=20
c.MAX_ITER=20
m.MAX_ITER=20
p1.MAX_ITER=20
# Closed-loop simulation: the MPC (c) computes moves, the simulators (p and
# the fixed-parameter plant p1) apply them, and the MHE (m) re-estimates
# k1/k2 from noisy measurements.  (Loop-body indentation reconstructed.)
for i in range(cycles):
    print(i)
    # set point changes
    if i==cycles/4:
        sp = 12
    elif i==2*cycles/4:
        sp = 7
    elif i==3*cycles/4:
        sp = 5
    sp_store[i] = sp
    c.h1.SP=sp
    c.h2.SP=sp
    # Feed the MHE's latest parameter estimates to the controller.
    c.k1.MEAS = m.k1.NEWVAL
    c.k2.MEAS = m.k2.NEWVAL
    if p.options.SOLVESTATUS == 1:
        # print("going:",i)
        c.h1.MEAS = p.h1.MODEL
        c.h2.MEAS = p.h2.MODEL
        c.h3.MEAS = p.h3.MODEL
        c.h4.MEAS = p.h4.MODEL
        print(i,'Plant Model:',p.h1.MODEL,p.h2.MODEL,p.h3.MODEL,p.h4.MODEL)
    c.solve(disp=False,debug=0)
    #print("NEWVAL:",i,c.u,c.u.NEWVAL)
    u_cont_k1[i] = c.v1.NEWVAL
    u_cont_k2[i] = c.v2.NEWVAL
    #print("Horizon:",i,c.h1[0:],c.h2[0:])
    #print("Move:",i,c.v1.NEWVAL,c.v2.NEWVAL)
    ## process simulator
    #load control move
    p.v1.MEAS = u_cont_k1[i]
    p.v2.MEAS = u_cont_k2[i]
    #simulate
    p.solve(disp=False,debug=0)
    #plant model
    # NOTE(review): this rebinds the p1.k1/p1.k2 attributes to plain floats,
    # replacing the Param objects created during setup — confirm intent
    # (p1.k1.MEAS / p1.k1.value would update the parameter instead).
    p1.k1=3.14
    p1.k2=3.14
    p1.v1.MEAS = u_cont_k1[i]
    p1.v2.MEAS = u_cont_k2[i]
    p1.solve(disp=False,debug=0)
    h1_plant[i]=p1.h1.MODEL
    h2_plant[i]=p1.h2.MODEL
    h3_plant[i]=p1.h3.MODEL
    h4_plant[i]=p1.h4.MODEL
    h1_measured[i]=p1.h1.MODEL+(random()*2)*noise
    h2_measured[i]=p1.h2.MODEL+(random()*2)*noise
    h3_measured[i]=p1.h3.MODEL+(random()*2)*noise
    h4_measured[i]=p1.h4.MODEL+(random()*2)*noise
    #print("Model process output:",i,p.h1.MODEL,p.h2.MODEL,p.h3.MODEL,p.h4.MODEL)
    #load output with white noise
    h1_meas[i] = p.h1.MODEL+(random()-0.5)*noise
    h2_meas[i] = p.h2.MODEL+(random()-0.5)*noise
    h3_meas[i] = p.h3.MODEL+(random()-0.5)*noise
    h4_meas[i] = p.h4.MODEL+(random()-0.5)*noise
    #Only MPC
    ## estimator
    #load input and measured output
    m.v1.MEAS = u_cont_k1[i]
    m.v2.MEAS = u_cont_k2[i]
    #m.h1.MEAS = h1_meas[i]+(random()*2)*noise
    #m.h2.MEAS = h2_meas[i]+(random()*2)*noise
    #m.h3.MEAS = h3_meas[i]+(random()*2)*noise
    #m.h4.MEAS = h4_meas[i]+(random()*2)*noise
    m.h1.MEAS = h1_meas[i]
    m.h2.MEAS = h2_meas[i]
    m.h3.MEAS = h3_meas[i]
    m.h4.MEAS = h4_meas[i]
    #m.COLDSTART=2
    #optimize parameters
    m.solve(disp=False,debug=0)
    #store results
    if i>=process:
        h1_est[i] = m.h1.MODEL
        h2_est[i] = m.h2.MODEL
        h3_est[i] = m.h3.MODEL
        h4_est[i] = m.h4.MODEL
        v1_est[i] = m.v1.NEWVAL
        v2_est[i] = m.v2.NEWVAL
        k1_est[i]= m.k1.NEWVAL
        k2_est[i] = m.k2.NEWVAL
        print("Estimated h:",i,h1_est[i],h2_est[i],h3_est[i],h4_est[i])
        print("Estimated k:",i,k1_est[i],k2_est[i],p.k1[0],p.k2[0])
        print("Estimated v:",i,v1_est[i],v2_est[i])
        # NOTE(review): dh1/dt uses h3_est twice and divides by A3 — the
        # model equation uses h1 and A1; confirm this diagnostic print.
        print("dh1/dt:",(a3*((2*g*h3_est[i])**0.5)-(a1*((2*g*h3_est[i])**0.5))+(g1*k1_est[i]*v1_est[i]))/A3)
        print("dh2/dt:",(a4*((2*g*h4_est[i])**0.5)-(a2*((2*g*h2_est[i])**0.5))+(g2*k2_est[i]*v2_est[i]))/A2)
        print("dh3/dt:",(-a3*((2*g*h3_est[i])**0.5)+((1-g2)*k2_est[i]*v2_est[i]))/A3)
        print("dh4/dt:",(-a4*((2*g*h4_est[i])**0.5)+((1-g1)*k1_est[i]*v1_est[i]))/A4)
    # Live plot refresh every cycle (i%1==0 is always true).
    if i%1==0:
        plt.clf()
        plt.subplot(4,1,1)
        #plt.plot(h1_meas[0:i])
        #plt.plot(h2_meas[0:i])
        #plt.plot(h3_meas[0:i])
        #plt.plot(h4_meas[0:i])
        plt.plot(h1_est[0:i])
        plt.plot(h2_est[0:i])
        plt.plot(sp_store[0:i])
        plt.subplot(4,1,2)
        plt.plot(h3_est[0:i])
        plt.plot(h4_est[0:i])
        #plt.legend(('h1_pred','h2_pred','h3_pred','h4_pred'))
        plt.subplot(4,1,3)
        plt.plot(k1_est[0:i])
        plt.plot(k2_est[0:i])
        plt.subplot(4,1,4)
        plt.plot(v1_est[0:i])
        plt.plot(v2_est[0:i])
        plt.draw()
        plt.pause(0.05)
end=time.time()
print("total time:",end-start)
I feel there is some issue with my MHE+MPC code; however, I am not able to identify the mistake.
Nice application. I needed a few imports to make the script work. These may be loaded automatically for you.
from gekko import GEKKO
import time
import numpy as np
import matplotlib.pyplot as plt
from random import random
The script solves successfully if a lower bound is included on all the level variables (1e-6). There is a problem when the level goes below zero or is at zero when using m.sqrt(). This small adjustment helps it solve successfully so it doesn't get into a region where it is undefined. Gekko solvers can't deal with imaginary numbers.
Although the solution is successful, it appears that the control performance oscillates. There may need to be some tuning of the application.

I have some code that takes under 2 seconds to run 100 iterations, under 8 seconds to run 1,000, and over 11 minutes to run 10,000

I'm a hobbyist programmer, and this is just a little project I set for myself. I know I very likely have something in this code that is inefficient enough to not matter for small loops but is compounding when I scale it up. Any suggestions would be appreciated.
def RndSelection(ProjMatrix):
    """Draw one random projection per row of ProjMatrix.

    For each row, a uniform random percentile x in [1, 99] is drawn and the
    value is linearly interpolated between the two bracketing percentile
    columns (itertuples positions 3..15 — presumably the 10th..99th
    percentile columns; confirm against SetMatrix()).  Returns a one-column
    DataFrame of the selected values.  (Indentation reconstructed.)
    """
    percentiles = [0,10,20,25,30,40,50,60,70,75,80,90,99]
    results = []
    for row in ProjMatrix.itertuples():
        x = npr.randint(1,100)
        for p in range(3,16):
            if p < 15:
                a = percentiles[p-3]
                b = percentiles[p-2]
                if x in range (a,b):
                    # Interpolate between the two bracketing columns.
                    factor = (b-x)/(b-a)
                    r = round((row[p]*factor)+((row[p+1])*(1-factor)),2)
                    break
            else:
                # x == 99 (or no bracket matched): take the last column.
                r = row[p]
        results.append(r)
    thisrun = pd.DataFrame(results)
    return(thisrun)
def main():
    """Run RndSelection 10,000 times in a process pool and collect columns.

    NOTE(review): pd.concat inside the as_completed loop copies the growing
    Outcome frame on every iteration — quadratic total cost, which is why
    10,000 runs take minutes (fixed in the revised version below by
    collecting results in a list and concatenating once).
    """
    ts = datetime.datetime.now()
    print ('Run Started: ', ts)
    Matrix = SetMatrix()
    Outcome = Matrix['player_id']
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = [executor.submit(RndSelection,Matrix) for _ in range(10000)]
        for f in concurrent.futures.as_completed(results):
            thisrun = f.result()
            Outcome = pd.concat([Outcome,thisrun],axis=1)
    print(Outcome)
    ts = datetime.datetime.now()
    print('Run Completed: ', ts)

if __name__ == '__main__':
    main()
So the answer, as Jérôme pointed out, was the iteration of the concat.
Moving the output to a list of lists and then concatenating just once improved the runtime of 10,000 iterations to 8 seconds and 100,000 iterations to 2 mins, 34 seconds.
def RndSelection(ProjMatrix):
    """Draw one random projection per row of ProjMatrix.

    Same interpolation as the original version, but returns a plain list so
    the caller can concatenate all runs in one step.  (Indentation
    reconstructed.)
    """
    percentiles = [0,10,20,25,30,40,50,60,70,75,80,90,99]
    results = []
    r = ""
    for row in ProjMatrix.itertuples():
        x = npr.randint(1,100)
        for p in range(3,16):
            if p < 15:
                a = percentiles[p-3]
                b = percentiles[p-2]
                if x in range (a,b):
                    # Interpolate between the two bracketing columns.
                    factor = (b-x)/(b-a)
                    r = round((row[p]*factor)+((row[p+1])*(1-factor)),2)
                    break
            else:
                # x == 99 (or no bracket matched): take the last column.
                r = row[p]
        results.append(r)
    return results
def main():
    """Run RndSelection `runs` times in a process pool.

    Fix for the quadratic version above: each worker's result list is stored
    in a pre-sized list of lists, and the frame is concatenated exactly once
    at the end.
    """
    ts = datetime.datetime.now()
    print ('Run Started: ', ts)
    Matrix = SetMatrix()
    runs = 100000
    s = 0
    Outcome = pd.DataFrame(Matrix['player_id'])
    # One empty slot per run, filled as futures complete.
    thisrun = np.empty((runs,0)).tolist()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = [executor.submit(RndSelection,Matrix) for _ in range(runs)]
        for f in concurrent.futures.as_completed(results):
            thisrun[s]=f.result()
            s += 1
    # Single concat: one column per run.
    allruns = pd.DataFrame(thisrun).transpose()
    Outcome = pd.concat([Outcome,allruns],axis=1)
    ts = datetime.datetime.now()
    print('Run Completed: ', ts)

if __name__ == '__main__':
    main()

Resources