multiprocessing: maxtasksperchild and chunksize conflict? - python-3.x

I am using the multiprocessing module in Python 3.7. My code is not working as expected (see this question here). Someone suggested setting maxtasksperchild, which I set to 1. Then, while reading the documentation, I figured it was best to set chunksize to 1 as well. This is the relevant part of the code:
# Parallel Entropy Calculation
# ============================
node_combinations = [(i, j) for i in g.nodes for j in g.nodes]
pool = Pool(maxtasksperchild=1)
start = datetime.datetime.now()
logging.info("Start time: %s", start)
print("Start time: ", start)
results = pool.starmap(g._log_probability_path_ij, node_combinations, chunksize=1)
end = datetime.datetime.now()
print("End time: ", end)
print("Run time: ", end - start)
logging.info("End time: %s", end)
logging.info("Total run time: %s", start)
pool.close()
pool.join()
This backfired enormously. Setting only maxtasksperchild or only chunksize got the job done in the expected time (for a smaller dataset that I am using to test the code). Setting both just wouldn't finish, and nothing was really running after a few seconds (I checked with htop to see if the cores were working).
Questions
Do maxtasksperchild and chunksize conflict when setting them together?
Do they do the same thing? maxtasksperchild at the Pool() level and chunksize at the Pool methods level?
======================================================
EDIT
I understand that debugging may be impossible from the extract of code presented above, so please find the full code below. The modules graph and graphfile are just small libraries written by me, available on GitHub. If you wish to run the code, you can use any of the files in the data/ directory of the mentioned GitHub repository. Short tests are better run using F2, but F1 and F3 are the ones causing trouble on the HPC.
import graphfile
import graph
from multiprocessing.pool import Pool
import datetime
import logging
def remove_i_and_f(edges):
    new_edges = dict()
    for k, v in edges.items():
        if 'i' in k:
            continue
        elif 'f' in k:
            key = (k[0], k[0])
            new_edges[key] = v
        else:
            new_edges[k] = v
    return new_edges

if __name__ == "__main__":
    import sys

    # Read data
    # =========
    graph_to_study = sys.argv[1]
    full_path = "/ComplexNetworkEntropy/"
    file = graphfile.GraphFile(full_path + "data/" + graph_to_study + ".txt")
    edges = file.read_edges_from_file()

    # logging
    # =======
    d = datetime.date.today().strftime("%Y_%m_%d")
    log_filename = full_path + "results/" + d + "_probabilities_log_" + graph_to_study + ".log"
    logging.basicConfig(filename=log_filename, level=logging.INFO, format='%(asctime)s === %(message)s')
    logging.info("Graph to study: %s", graph_to_study)
    logging.info("Date: %s", d)

    # Process data
    # ==============
    edges = remove_i_and_f(edges)
    g = graph.Graph(edges)

    # Parallel Entropy Calculation
    # ============================
    node_combinations = [(i, j) for i in g.nodes for j in g.nodes]
    pool = Pool(maxtasksperchild=1)
    start = datetime.datetime.now()
    logging.info("Start time: %s", start)
    print("Start time: ", start)
    results = pool.starmap(g._log_probability_path_ij, node_combinations, chunksize=1)
    end = datetime.datetime.now()
    print("End time: ", end)
    print("Run time: ", end - start)
    logging.info("End time: %s", end)
    logging.info("Total run time: %s", end - start)
    pool.close()
    pool.join()

maxtasksperchild ensures a worker is restarted after a certain number of tasks. In other words, it kills the process after it has run maxtasksperchild iterations of your given function. It is provided to contain resource leaks caused by poor implementations in long-running services.
chunksize groups a given collection/iterator into multiple tasks. It then ships each group as a whole over the internal pipe to reduce inter-process communication (IPC) overhead. The collection elements are still processed one by one. chunksize is useful if you have a large collection of small elements and the IPC overhead is significant relative to the processing of the elements themselves. One side effect is that the same process will process a whole chunk.
Setting both parameters to 1 dramatically increases process rotation and IPC, which are both quite resource-heavy, especially on machines with a high number of cores.
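If maxtasksperchild is only there to contain memory growth, a middle ground usually works better than 1/1: recycle workers after a few hundred tasks and let the chunks stay reasonably large. A minimal sketch, reusing the g and node_combinations objects from the question; the 500 and the chunk formula are arbitrary starting points, not tuned values:
import multiprocessing as mp

if __name__ == "__main__":
    # Recycle each worker only after many tasks, and ship the work in
    # reasonably large chunks to keep IPC overhead low.
    n_workers = mp.cpu_count()
    chunk = max(1, len(node_combinations) // (n_workers * 4))

    with mp.Pool(processes=n_workers, maxtasksperchild=500) as pool:
        results = pool.starmap(g._log_probability_path_ij,
                               node_combinations,
                               chunksize=chunk)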

Related

Random key lookup on LMDB/python vs BerkeleyDB/python (How to make LMDB lookups faster)

I have this program written in Python that uses BerkeleyDB to store data (event logs), which I migrated to LMDB. My problem is: before an event gets written, the program does a lookup to check whether the event already exists. I noticed that the BerkeleyDB version is much faster at doing the single-value lookups over 13k+ records (it is as if the LMDB version is 1 second slower for every lookup), even with transactions enabled in BerkeleyDB. Any idea how to speed up the LMDB version? Note that I already have 70 GB+ (about 30 million records) of data stored in my BerkeleyDB, and doing additional processing on those events takes me more than an hour, so I thought switching to LMDB would decrease the processing time.
My LMDB environment was opened this way (I even set readahead to False, but the database size is only about 35 MB so I don't think it matters):
env = lmdb.open(db_folder, map_size=100000000000, max_dbs=4, readahead=False)
database = env.open_db('events'.encode())
My BerkeleyDB environment was opened this way:
env = db.DBEnv()
env.open(db_folder, db.DB_INIT_MPOOL | db.DB_CREATE | db.DB_INIT_LOG | db.DB_INIT_TXN | db.DB_RECOVER, 0)
database = db.DB(env)
BerkeleyDB version of check:
if event['eId'].encode('utf-8') in database:
    duplicate_count += 1
else:
    try:
        txn = env.txn_begin(None)
        database[event['eId'].encode('utf-8')] = json.dumps(event).encode('utf-8')
    except:
        if txn is not None:
            txn.abort()
            txn = None
        raise
    else:
        txn.commit()
        txn = None
        event_count += 1
lmdb version:
with env.begin(buffers=True, db=database) as txn:
    if txn.get(event['eId'].encode()) is not None:
        dup_event_count += 1
    else:
        txn.put(event['eId'].encode(), json.dumps(event).encode('utf-8'))
        event_count += 1
Solution:
Place with env.begin outside the loop:
@case('rand lookup')
def test():
    with env.begin() as txn:
        for word in words:
            txn.get(word)
    return len(words)

@case('per txn rand lookup')
def test():
    for word in words:
        with env.begin() as txn:
            txn.get(word)
    return len(words)
Figured this out myself. What I'm doing is a per-transaction random lookup. I just had to place with env.begin outside of the for loop (not visible in my example), as suggested in this example: https://raw.githubusercontent.com/jnwatson/py-lmdb/master/examples/dirtybench.py
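Applied to the deduplication loop from the question, that would look roughly like the sketch below. It assumes the events are iterated from some collection (here called events, which is not in the original snippet) and that writes are wanted inside the same transaction, hence write=True:
# One write transaction for the whole batch instead of one per event.
with env.begin(write=True, buffers=True, db=database) as txn:
    for event in events:
        key = event['eId'].encode('utf-8')
        if txn.get(key) is not None:
            dup_event_count += 1
        else:
            txn.put(key, json.dumps(event).encode('utf-8'))
            event_count += 1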

MLflow is taking longer than expected time to finish logging metrics and parameters

I'm running code where I have to perform multiple iterations for a set of products to select the best-performing model. While running multiple iterations for a single product, I need to log the details of every single run using MLflow (using MLflow with a pandas UDF). Logging the individual iterations takes around 2 seconds each, but the parent run under which I'm tracking every iteration's details is taking 1.5 hours to finish. Here is the code -
@F.pandas_udf(model_results_schema, F.PandasUDFType.GROUPED_MAP)
def get_gam_pe_results(model_input):
    ...
    ...
    for j, gam_terms in enumerate(term_list[-1]):
        results_iteration_output_1, results_iteration_output, results_iteration_all = run_gam_model(gam_terms)
        results_iteration_version = results_iteration_version.append(results_iteration_output)
        unique_id = uuid.uuid1()
        metric_list = ["AIC", "AICc", "GCV", "adjusted_R2", "deviance", "edof", "elasticity_in_k",
                       "loglikelihood", "scale"]
        param_list = ["features"]
        start_time = str(datetime.now())
        with mlflow.start_run(run_id=parent_run_id, experiment_id=experiment_id):
            with mlflow.start_run(run_name=str(model_input['prod_id'].iloc[1]) + "-" + unique_id.hex,
                                  experiment_id=experiment_id, nested=True):
                for item in results_iteration_output.columns.values.tolist():
                    if item in metric_list:
                        mlflow.log_metric(item, results_iteration_output[item].iloc[0])
                    if item in param_list:
                        mlflow.log_param(item, results_iteration_output[item].iloc[0])
                end_time = str(datetime.now())
                mlflow.log_param("start_time", start_time)
                mlflow.log_param("end_time", end_time)
Outside pandas-udf -
current_time = str(datetime.today().replace(microsecond=0))
run_id = None
with mlflow.start_run(run_name="MLflow_pandas_udf_testing-"+current_time, experiment_id=experiment_id) as run:
run_id = run.info.run_uuid
gam_model_output = (Product_data
.withColumn("run_id", F.lit(run_id))
.groupby(['prod_id'])
.apply(get_gam_pe_results)
)
Note - I am running this entire code in Databricks (the cluster has 8 cores and 28 GB RAM).
Any idea why this parent run is taking so long to finish while each iteration only takes 2 seconds?
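Not a confirmed fix, but since the title is about the logging itself being slow: each mlflow.log_metric/log_param call typically becomes its own tracking-server request, so batching them with the plural log_metrics/log_params APIs cuts the number of round trips per nested run. A sketch of just the inner block, keeping the variable names from the question:
with mlflow.start_run(run_id=parent_run_id, experiment_id=experiment_id):
    with mlflow.start_run(run_name=str(model_input['prod_id'].iloc[1]) + "-" + unique_id.hex,
                          experiment_id=experiment_id, nested=True):
        row = results_iteration_output.iloc[0]
        cols = results_iteration_output.columns.values.tolist()
        # One batched request for all metrics and one for all params,
        # instead of one request per item.
        mlflow.log_metrics({c: row[c] for c in cols if c in metric_list})
        mlflow.log_params({c: row[c] for c in cols if c in param_list})
        end_time = str(datetime.now())
        mlflow.log_params({"start_time": start_time, "end_time": end_time})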

Multiprocess : Persistent Pool?

I have code like the one below :
def expensive(self, c, v):
    .....

def inner_loop(self, c, collector):
    self.db.query('SELECT ...', (c,))
    for v in self.db.cursor.fetchall():
        collector.append(self.expensive(c, v))

def method(self):
    # create a Pool
    # join the Pool ??
    self.db.query('SELECT ...')
    for c in self.db.cursor.fetchall():
        collector = []
        # RUN the whole cycle in parallel in separate processes
        self.inner_loop(c, collector)
        # do stuff with the collector
    # ! close the pool ?
Both the outer and the inner loop run for thousands of steps...
I think I understand how to run a Pool of a couple of processes.
All the examples I found show more or less that.
But in my case I need to launch a persistent Pool and then feed it the data (the c values). Once an inner-loop process has finished, I have to supply the next available c value.
And keep the processes running and collect the results.
How do I do that?
A clunky idea I have is :
def method(self):
    ws = 4
    with Pool(processes=ws) as pool:
        cs = []
        for i, c in enumerate(..):
            cs.append(c)
            if i % ws == 0:
                res = [pool.apply(self.inner_loop, (c)) for i in range(ws)]
                cs = []
                collector.append(res)
Will this keep the same pool running, i.e. not launch a new process every time?
Do I need the 'if i % ws == 0' part, or can I use imap() or map_async() so that the Pool object blocks the loop when the available workers are exhausted and continues when some are freed?
Yes, the way that multiprocessing.Pool works is:
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue.
So simply submitting all your work to the pool via imap should be sufficient:
with Pool(processes=4) as pool:
    initial_results = db.fetchall("SELECT c FROM outer")
    results = [pool.imap(self.inner_loop, (c,)) for c in initial_results]
That said, if you really are doing this to fetch things from the DB, it may make more sense to move more processing down into that layer (bring the computation to the data rather than bringing the data to the computation).
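If you want the "feed the next available c" behaviour explicitly, the same persistent pool can be driven with imap_unordered, which hands a new c to whichever worker frees up first and yields results as they complete. A rough sketch under the same assumption the snippet above makes (that self, and in particular self.db, can be pickled to the workers); run_one is a hypothetical helper, not part of the original code:
from multiprocessing import Pool

def run_one(self, c):
    # Hypothetical helper: runs inner_loop for one c and returns (c, collector).
    collector = []
    self.inner_loop(c, collector)
    return c, collector

def method(self):
    self.db.query('SELECT ...')
    cs = [c for c in self.db.cursor.fetchall()]

    # One persistent pool of four workers; imap_unordered keeps every worker
    # fed with the next available c for the whole run.
    with Pool(processes=4) as pool:
        for c, collector in pool.imap_unordered(self.run_one, cs):
            # do stuff with the collector for this c
            ...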

How to disable retry in celery?

I am running a celerybeat scheduler every 15 minutes, where I need to fetch data from an API (rate limit = 300 requests/min max) and store the results in the database. I would like to fetch the URLs in parallel, subject to the rate limit. If any worker fails here, I don't want to retry, since I will ping again in 15 minutes. Any suggestions on how this can be accomplished in Celery?
@celery.task(bind=True)
def fetch_store(self):
    start = time()
    return c.chain(c.group(emap.s() for _ in range(2000)), ereduce.s(start)).apply_async()

@celery.task(rate_limit='300/m')
def fetch():
    # ... requests data from external API
    return data

@celery.task
def store(numbers, start):
    end = time()
    logger.info("Received %s %s seconds", numbers, (end - start) / 1000)

Can't receive the data from device RFB2000 in python

I am using the module "ctypes" to load RFBClient.dll,I use windll and the convention is stdcall. I want to remote control the device RFB2000 with these commands below:
[Image: first step for connection]
[Image: all the commands]
In my program, the connection is successful, but the problem is that I can't receive the data. When I want to get the temperature value, I call the function but it always returns 0. The restype is c_double and the argtypes is an empty list; I can't see any problem. English is not my native language; please excuse typing errors.
import ctypes
import time
libc = ctypes.WinDLL("X:\\RFBClient.dll")
#connect to RFB software
libc.OpenRFBConnection(ctypes.c_char_p('127.0.0.1'.encode('UTF-8')))
#check if connection successful
libc.Connected()
#Set parameters
# num_automeas = 1  # Number of auto-measurement runs.
completion_count = 2        # Number of On-Off pairs within each auto-measurement run.
OnHalfCycleTimeCount = 40   # set 2 s on
OffHalfCycleTimeCount = 40  # set 2 s off
Data=[]
libc.SetCompletionCount(completion_count)
libc.SetMeasureUntilCount(completion_count)
libc.SetOnHalfCycleCount(OnHalfCycleTimeCount)
libc.SetOffHalfCycleCount(OffHalfCycleTimeCount)
libc.NewAutoMeasurement()
#zeroing
time.sleep(1)
print("zeroing.....")
libc.Zero()
while libc.Zeroing() == -1:
    time.sleep(1)
#libc.CheckingSensor()
print("measurement start")
libc.StartMeas()
time.sleep(0.5)
while libc.Measuring() == -1:
    time.sleep(1)
    print(libc.Measuring())
getTemperature = libc.GetTemperature
getTemperature.restype = ctypes.c_double
getTemperature.argtypes = []
print(getTemperature())
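One general ctypes precaution that may be worth trying (not a confirmed fix, since I don't know the exact exports of RFBClient.dll) is to declare restype and argtypes for every function before the first call, so nothing silently falls back to the default int prototype. The function names below are simply the ones used in the question, and the signatures are assumptions:
import ctypes

libc = ctypes.WinDLL("X:\\RFBClient.dll")

# Declare prototypes up front; signatures are guesses based on the calls
# made in the question, not on the vendor's documentation.
libc.OpenRFBConnection.argtypes = [ctypes.c_char_p]
libc.OpenRFBConnection.restype = ctypes.c_int
libc.Connected.restype = ctypes.c_int
libc.Zeroing.restype = ctypes.c_int
libc.Measuring.restype = ctypes.c_int
libc.GetTemperature.argtypes = []
libc.GetTemperature.restype = ctypes.c_double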
