Adding dependency between threads in python - python-3.x

I have a question about multithreading in Python (Python 3.7).
I have 2 sets of processes, with 12 in each set. Let's name the processes in set 1 {P1, P2, P3, ..., P12}. Similarly, set 2 has processes {X1, X2, X3, ..., X12}.
Now, I want at most 4 of the set 1 jobs running at a time; at any given point, no more than 4 of them should be running.
In set 2, I want each job to depend on the corresponding set 1 job. For example, X1 should run only after P1 has completed, but I don't want to explicitly cap the number of processes in set 2 (unlike the set 1 restriction).
Say the P jobs are data formatting jobs and the X jobs send the data to some system: I don't want to run more than 4 data formatting jobs at a time, but there is no such restriction on the jobs sending the data.
I tried ThreadPoolExecutor in the following way:
import concurrent.futures as ft

with ft.ThreadPoolExecutor(max_workers=4) as executor:
    future1 = executor.submit(P1)    # P1
    future2 = executor.submit(P2)    # P2
    future3 = executor.submit(P3)    # P3
    future4 = executor.submit(P4)    # P4
    future5 = executor.submit(P5)    # P5
    future6 = executor.submit(P6)    # P6
    future7 = executor.submit(P7)    # P7
    future8 = executor.submit(P8)    # P8
    future9 = executor.submit(P9)    # P9
    future10 = executor.submit(P10)  # P10
    future11 = executor.submit(P11)  # P11
    future12 = executor.submit(P12)  # P12

    future1.add_done_callback(X1)
    future2.add_done_callback(X2)
    future3.add_done_callback(X3)
    future4.add_done_callback(X4)
    future5.add_done_callback(X5)
    future6.add_done_callback(X6)
    future7.add_done_callback(X7)
    future8.add_done_callback(X8)
    future9.add_done_callback(X9)
    future10.add_done_callback(X10)
    future11.add_done_callback(X11)
    future12.add_done_callback(X12)
But I guess what the above code does is trigger P1 to P4 in the first shot, and then at some point we have {P2, P3, P4, X1} running, which I don't want. I want any pending P jobs to run 4 at a time, and on top of that any number of X jobs can run.
Any help please?
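One possible structure (not from the original thread, just a minimal sketch with placeholder jobs standing in for P1..P12 and X1..X12) is to use two executors: a 4-worker pool for the P jobs and a separate pool for the X jobs, where each P future's done-callback only submits the matching X job to the second pool. The callback itself does almost no work, so it doesn't tie up a P slot:

import concurrent.futures as ft

# Placeholder jobs; substitute the real P1..P12 and X1..X12 callables.
def make_p(i):
    def p_job():
        print(f"formatting job P{i}")
    return p_job

def make_x(i):
    def x_job():
        print(f"send job X{i}")
    return x_job

p_jobs = [make_p(i) for i in range(1, 13)]
x_jobs = [make_x(i) for i in range(1, 13)]

# Outer pool for X jobs (sized so it never throttles them), inner pool for
# P jobs (capped at 4). The inner pool shuts down first, so every callback
# has already submitted its X job before the outer pool starts waiting.
with ft.ThreadPoolExecutor(max_workers=12) as x_pool:
    with ft.ThreadPoolExecutor(max_workers=4) as p_pool:

        def chain(x_job):
            def _cb(p_future):
                # Runs in the P worker thread, but only long enough to
                # hand the X job over to the other pool.
                if p_future.exception() is None:
                    x_pool.submit(x_job)
            return _cb

        for p_job, x_job in zip(p_jobs, x_jobs):
            p_pool.submit(p_job).add_done_callback(chain(x_job))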

Related

MLflow is taking longer than expected to finish logging metrics and parameters

I'm running code where I have to perform multiple iterations for a set of products to select the best-performing model. While running multiple iterations for a single product, I need to log the details of every single run using MLflow (mlflow with a pandas UDF). Logging an individual iteration takes around 2 seconds, but the parent run under which I'm tracking every iteration's details takes 1.5 hours to finish. Here is the code -
@F.pandas_udf(model_results_schema, F.PandasUDFType.GROUPED_MAP)
def get_gam_pe_results(model_input):
    ...
    ...
    for j, gam_terms in enumerate(term_list[-1]):
        results_iteration_output_1, results_iteration_output, results_iteration_all = run_gam_model(gam_terms)
        results_iteration_version = results_iteration_version.append(results_iteration_output)
        unique_id = uuid.uuid1()
        metric_list = ["AIC", "AICc", "GCV", "adjusted_R2", "deviance", "edof", "elasticity_in_k", "loglikelihood",
                       "scale"]
        param_list = ["features"]
        start_time = str(datetime.now())
        with mlflow.start_run(run_id=parent_run_id, experiment_id=experiment_id):
            with mlflow.start_run(run_name=str(model_input['prod_id'].iloc[1]) + "-" + unique_id.hex,
                                  experiment_id=experiment_id, nested=True):
                for item in results_iteration_output.columns.values.tolist():
                    if item in metric_list:
                        mlflow.log_metric(item, results_iteration_output[item].iloc[0])
                    if item in param_list:
                        mlflow.log_param(item, results_iteration_output[item].iloc[0])
                end_time = str(datetime.now())
                mlflow.log_param("start_time", start_time)
                mlflow.log_param("end_time", end_time)
Outside the pandas UDF -
current_time = str(datetime.today().replace(microsecond=0))
run_id = None
with mlflow.start_run(run_name="MLflow_pandas_udf_testing-" + current_time, experiment_id=experiment_id) as run:
    run_id = run.info.run_uuid

gam_model_output = (Product_data
                    .withColumn("run_id", F.lit(run_id))
                    .groupby(['prod_id'])
                    .apply(get_gam_pe_results)
                    )
Note - I'm running this entire code in Databricks (the cluster has 8 cores and 28 GB RAM).
Any idea why the parent run takes so long to finish when each iteration takes only 2 seconds?
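One thing that might be worth trying (an assumption on my part, not something from this thread): batch the per-iteration logging into single mlflow.log_metrics / mlflow.log_params calls instead of one call per column, which cuts the number of tracking-server round trips per nested run. A hypothetical helper along those lines, reusing the names from the question (results_row would be e.g. results_iteration_output.iloc[0]):

import mlflow

def log_iteration(results_row, metric_list, param_list,
                  experiment_id, parent_run_id, run_name):
    """Log one iteration's metrics/params in two batched calls."""
    metrics = {k: float(results_row[k]) for k in metric_list if k in results_row}
    params = {k: results_row[k] for k in param_list if k in results_row}
    with mlflow.start_run(run_id=parent_run_id, experiment_id=experiment_id):
        with mlflow.start_run(run_name=run_name, experiment_id=experiment_id,
                              nested=True):
            mlflow.log_metrics(metrics)  # one call for all metrics
            mlflow.log_params(params)    # one call for all params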

Multiprocess : Persistent Pool?

I have code like the one below :
def expensive(self, c, v):
    ...

def inner_loop(self, c, collector):
    self.db.query('SELECT ...', (c,))
    for v in self.db.cursor.fetchall():
        collector.append(self.expensive(c, v))

def method(self):
    # create a Pool
    # join the Pool ??
    self.db.query('SELECT ...')
    for c in self.db.cursor.fetchall():
        collector = []
        # RUN the whole cycle in parallel in separate processes
        self.inner_loop(c, collector)
        # do stuff with the collector
    # ! close the pool ?
Both the outer and the inner loop run for thousands of steps...
I think I understand how to run a Pool with a couple of processes; all the examples I found show more or less that.
But in my case I need to launch a persistent Pool and then feed it the data (the c values). Once an inner-loop process has finished, I have to supply the next available c value,
keeping the processes running and collecting the results.
How do I do that?
A clunky idea I have is:
def method(self):
    ws = 4
    with Pool(processes=ws) as pool:
        cs = []
        for i, c in enumerate(..):
            cs.append(c)
            if i % ws == 0:
                res = [pool.apply(self.inner_loop, (c,)) for i in range(ws)]
                cs = []
                collector.append(res)
Will this keep the same pool running, i.e. not launch a new process every time?
Do I need the 'if i % ws == 0' part, or can I use imap() or map_async() and the Pool object will block the loop when the available workers are exhausted and continue when some are freed?
Yes, the way that multiprocessing.Pool works is:
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue.
So simply submitting all your work to the pool via imap should be sufficient:
with Pool(processes=4) as pool:
    initial_results = db.fetchall("SELECT c FROM outer")
    results = list(pool.imap(self.inner_loop, initial_results))
That said, if you really are doing this to fetch things from the DB, it may make more sense to move more processing down into that layer (bring the computation to the data rather than bringing the data to the computation).

Python asyncio wait() with cumulative timeout

I am writing a job scheduler where I schedule M jobs across N co-routines (N < M). As soon as one job finishes, I add a new job so that it can start immediately and run in parallel with the other jobs. Additionally, I would like to ensure that no single job takes more than a certain fixed amount of time. Any jobs that take too long should be cancelled. I have something pretty close, like this:
def update_run_set(waiting, running, max_concurrency):
    number_to_add = min(len(waiting), max_concurrency - len(running))
    for i in range(0, number_to_add):
        next_one = waiting.pop()
        running.add(next_one)

async def _run_test_invocations_asynchronously(jobs: List[MyJob], max_concurrency: int, timeout_seconds: int):
    running = set()  # These tasks are actively being run
    waiting = set()  # These tasks have not yet started
    waiting = {_run_job_coroutine(job) for job in jobs}
    update_run_set(waiting, running, max_concurrency)
    while len(running) > 0:
        done, running = await asyncio.wait(running, timeout=timeout_seconds,
                                           return_when=asyncio.FIRST_COMPLETED)
        if not done:
            timeout_count = len(running)
            [r.cancel() for r in running]  # Start cancelling the timed out jobs
            done, running = await asyncio.wait(running)  # Wait for cancellation to finish
            assert(len(done) == timeout_count)
            assert(len(running) == 0)
        else:
            for d in done:
                job_return_code = await d
            if len(waiting) > 0:
                update_run_set(waiting, running, max_concurrency)
                assert(len(running) > 0)
The problem: say my timeout is 5 seconds and I'm scheduling 3 jobs across 4 cores. Job A takes 2 seconds, job B takes 6 seconds and job C takes 7 seconds.
We have something like this:
t=0     t=1     t=2     t=3     t=4     t=5     t=6     t=7
-------|-------|-------|-------|-------|-------|-------|-------|
AAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
However, at t=2 the asyncio.wait() call returns because A completed. It then loops back up to the top and runs again. At this point B has already been running for 2 seconds, but since the countdown starts over and B only needs 4 more seconds to complete, B will appear to be successful. So after those 4 seconds we return again, B is successful, then we start the loop over and now C completes.
How do I make it so that B and C both fail? I somehow need the time to be preserved across calls to asyncio.wait().
One idea I had is to do my own bookkeeping of how much time each job is allowed to continue running, and pass the minimum of these into asyncio.wait(). Then when something times out, I can cancel only those jobs whose remaining time was equal to the value I passed in for timeout_seconds.
This requires a lot of manual bookkeeping on my part, though, and I can't help but wonder about floating-point problems causing me to decide that it's not time to cancel a job even though it really is. So I can't help but think that there's something easier. Would appreciate any ideas.
You can wrap each job into a coroutine that checks its timeout, e.g. using asyncio.wait_for. Limiting the number of parallel invocations could be done in the same coroutine using an asyncio.Semaphore. With those two combined, you only need one call to wait() or even just gather(). For example (untested):
# Run the job, limiting concurrency and time. This code could likely
# be part of _run_job_coroutine, omitted from the question.
async def _run_job_with_limits(job, sem, timeout):
    async with sem:
        try:
            await asyncio.wait_for(_run_job_coroutine(job), timeout)
        except asyncio.TimeoutError:
            # timed out and canceled, decide what you want to return
            pass

async def _run_test_invocations_async(jobs, max_concurrency, timeout):
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(
        *(_run_job_with_limits(job, sem, timeout) for job in jobs)
    )
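For reference, a hedged sketch of how this could be driven end to end; _run_job_coroutine below is only a stand-in for the question's real coroutine (which was omitted), and the job list and limits are made-up values:

import asyncio
import random

# Stand-in for the question's _run_job_coroutine.
async def _run_job_coroutine(job):
    await asyncio.sleep(random.uniform(1, 7))
    return f"{job} finished"

async def main():
    jobs = [f"job-{i}" for i in range(10)]
    # Note: _run_job_with_limits above doesn't return the job's result, so
    # this is a list of None values; add a `return` there if you need the
    # per-job results.
    results = await _run_test_invocations_async(jobs, max_concurrency=4, timeout=5)
    print(results)

asyncio.run(main())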

Pathos queuing tasks then getting results

I have a computing task which effectively runs the same piece of code against a large number of datasets. I want to utilise large multicore systems (for instance, 72 cores).
I'm testing it on a 2-core system.
output_array = []
pool = ProcessPool(nodes=2)
i = 0
num_steps = len(glob.glob1(config_dir, "*.ini"))
worker_array = []

for root, dirs, files in os.walk(config_dir):
    for file_name in files:
        config_path = "%s/%s" % (config_dir, file_name)
        row_result = pool.apipe(individual_config, config_path, dates, test_section, test_type, prediction_service, result_service)
        worker_array.append(row_result)

with progressbar.ProgressBar(max_value=num_steps) as bar:
    bar.update(0)
    while len(worker_array) > 0:
        for worker in worker_array:
            if worker.ready():
                result = worker.get()
                output_array.append(result)
                worker_array.remove(worker)
                i = i + 1
                bar.update(i)
"individual_config" is my worker function.
My rationale for this code is to load the data to create all the tasks (2820 of them), put the queue objects in a list, then poll this list to pick up completed tasks and put the results into an array. A progress bar monitors how the work is progressing.
Each individual_config task, run on its own, takes 0.3-0.5 seconds, but there are 2820 of them, so that'd be about 20-odd minutes on my system.
When running in this new pathos ProcessPool configuration, it completes one or two tasks every 10-15 seconds, and the whole job is projected to take 20 hours. I'd expect some overhead from multiprocessing, so I'm not expecting double speed with two cores, but something seems wrong here. Suggestions?
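One thing that may be worth checking (a guess, not from the thread): the while/ready() loop above polls continuously without sleeping, so the parent process itself can eat one of the two cores. A hedged alternative sketch is to let pathos hand results back as they complete via uimap, binding the per-run constants with functools.partial; individual_config and its arguments are the question's own names and are assumed to be importable or defined:

import glob
import os
from functools import partial

import progressbar
from pathos.pools import ProcessPool

def run_all(config_dir, dates, test_section, test_type,
            prediction_service, result_service, individual_config):
    config_paths = [os.path.join(config_dir, name)
                    for name in glob.glob1(config_dir, "*.ini")]
    # Bind the arguments that are identical for every task.
    worker = partial(individual_config, dates=dates, test_section=test_section,
                     test_type=test_type, prediction_service=prediction_service,
                     result_service=result_service)
    output_array = []
    pool = ProcessPool(nodes=2)
    with progressbar.ProgressBar(max_value=len(config_paths)) as bar:
        # uimap yields each result as soon as a worker finishes it,
        # so there is no busy-wait over ready()/get().
        for i, result in enumerate(pool.uimap(worker, config_paths), start=1):
            output_array.append(result)
            bar.update(i)
    pool.close()
    pool.join()
    return output_array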

Scheduling a task at multiple timings (with different parameters) using celery beat, but the task runs only once (with random parameters)

What I am trying to achieve
Write a scheduler that uses a database to schedule similar tasks at different timings.
For this I am using celery beat; the code snippet below should give an idea:
try:
    reader = MongoReader()
except:
    raise

try:
    tasks = reader.get_scheduled_tasks()
except:
    raise

celerybeat_schedule = dict()
for task in tasks:
    celerybeat_schedule[task["task_id"]] = dict()
    celerybeat_schedule[task["task_id"]]["task"] = task["task_name"]
    celerybeat_schedule[task["task_id"]]["args"] = (task,)
    celerybeat_schedule[task["task_id"]]["schedule"] = get_task_schedule(task)

app.conf.update(BROKER_URL=rabbit_mq_endpoint, CELERY_TASK_SERIALIZER='json', CELERY_ACCEPT_CONTENT=['json'], CELERYBEAT_SCHEDULE=celerybeat_schedule)
So these are three steps:
- reading all tasks from the datastore
- creating a dictionary, the celery beat schedule, populated with each task's properties: task_name (the method to run), args (the data to pass to the method), and schedule (when to run)
- updating the celery configuration with this schedule
Expected scenario
All entries run the same celery task (which just prints), have the same schedule (run every 5 minutes), and have different parameters specifying what to print. Let's say the DB has:
task name       parameter   schedule
regular_print   Hi          {"minutes": 5}
regular_print   Hello       {"minutes": 5}
regular_print   Bye         {"minutes": 5}
I expect all three to print every 5 minutes.
What happens
Only one of Hi, Hello, Bye prints (possibly at random, certainly not in sequence).
Please help,
Thanks a lot in advance :)
I was able to resolve this using Celery version 4. Below is a sample similar to what worked for me; it can also be found in the Celery documentation for version 4.
import os
from datetime import timedelta
from celery import Celery

# taking address and user-pass from environment (you can use direct values)
ex_host_queue = os.environ["EX_HOST_QUEUE"]
ex_port_queue = os.environ["EX_PORT_QUEUE"]
ex_user_queue = os.environ["EX_USERID_QUEUE"]
ex_pass_queue = os.environ["EX_PASSWORD_QUEUE"]
broker = "amqp://" + ex_user_queue + ":" + ex_pass_queue + "@" + ex_host_queue + ":" + ex_port_queue + "//"

# celery initialization
app = Celery(__name__, backend=broker, broker=broker)
app.conf.task_default_queue = 'scheduler_queue'
app.conf.update(
    task_serializer='json',
    accept_content=['json'],  # Ignore other content
    result_serializer='json'
)

task = {"task_id": 1, "a": 10, "b": 20}

# method to update the scheduler
def add_scheduled_task(task):
    print("scheduling task")
    del task["_id"]
    print("adding task_id")
    name = task["task_name"]
    app.add_periodic_task(timedelta(minutes=1), scheduler_task.s(task), name=task["task_id"])

@app.task(name='scheduler_task')
def scheduler_task(data):
    print(str(data["a"] + data["b"]))
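A hedged follow-up (my own addition, not part of the original answer): when the periodic tasks come from a datastore, Celery 4's documented pattern is to register them from the on_after_configure signal. Giving each entry a unique name also matters, because entries that share a name overwrite each other in the beat schedule. A sketch reusing the question's MongoReader and the scheduler_task defined above:

from datetime import timedelta

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # MongoReader / get_scheduled_tasks are the question's own helpers.
    reader = MongoReader()
    for task in reader.get_scheduled_tasks():
        sender.add_periodic_task(
            timedelta(minutes=5),
            scheduler_task.s(task),
            # A unique name per entry keeps one schedule from replacing another.
            name=str(task["task_id"]),
        )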
