I have a script (Python 3.6.6), parts of which should be able to run in parallel.
The goal is to reduce execution time as much as possible.
One of the parts connects to Redis, fetches the data for two keys, runs pickle.loads on each, and returns the processed objects.
What's the best solution for such a task?
I've already tried Queue(), but Queue.get_nowait() locks up the script, and {process}.join() also stops execution even though the task is done. Using pool.map raises TypeError: can't pickle _thread.lock objects.
All I could achieve was running all the parts in parallel, but I still cannot collect the results.
cPickle.load() will release the GIL so you can use it in multiple threads easily. But cPickle.loads() will not, so don't use that.
Basically, put your data from Redis into a StringIO, then cPickle.load() from there. Do this in multiple threads using concurrent.futures.ThreadPoolExecutor.
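A minimal sketch of that idea for Python 3.6 (io.BytesIO stands in for StringIO, and the standard pickle module, which is backed by the C implementation, stands in for cPickle); the connection settings and key names are placeholders:

    import io
    import pickle
    from concurrent.futures import ThreadPoolExecutor

    import redis  # assumes the redis-py client is installed

    r = redis.Redis()  # placeholder connection settings

    def load_key(key):
        # GET returns raw bytes; wrap them in a file-like object and
        # unpickle inside the worker thread.
        return pickle.load(io.BytesIO(r.get(key)))

    with ThreadPoolExecutor(max_workers=2) as executor:
        obj_a, obj_b = executor.map(load_key, ["key_a", "key_b"])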
I am currently using Python's Threading to parallelize the execution of multiple Databricks notebooks. These are long-running notebooks, and I need to add some logic for killing the threads in the case where I want to restart the execution with new changes. When re-executing the master notebook without killing the threads, the cluster is quickly filled with computational heavy, long-lived threads, leaving little space for the actually required computations.
I have tried these suggestions without luck. Furthermore, I have tried getting the runId from dbutils.notebook.run() and killing the thread with dbutils.notebook.exit(runId), but since the call to dbutils.notebook.run() is synchronous, I am unable to obtain the runId before the notebook has executed.
I would appreciate any suggestion on how to solve this issue!
Hello @sondrelv, and thank you for your question. I would like to clarify that dbutils.notebook.exit(value) is not used for killing other threads by runId; dbutils.notebook.exit(value) causes the current thread to exit and return a value.
I see the difficulty of managing this without an interrupt available inside the notebook code. Given this limitation, I have tried to look for other ways to cancel the threads.
One way is to use other utilities to kill the thread/run.
Part of the difficulty in solving this is that threads/runs created through dbutils.notebook.run() are ephemeral runs. The Databricks CLI command databricks runs get --run-id <ephemeral_run_id> can fetch the details of an ephemeral run. If the details can be fetched, then cancellation should also work (databricks runs cancel ...).
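For example, a small Python helper could shell out to the CLI; this is only a sketch, assuming the CLI is installed and authenticated, and the function name is illustrative:

    import subprocess

    def cancel_databricks_run(run_id):
        # Assumes the Databricks CLI is installed and configured for the workspace.
        subprocess.run(
            ["databricks", "runs", "cancel", "--run-id", str(run_id)],
            check=True,
        )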
The remaining difficulty is getting the run IDs. Ephemeral runs are excluded from the CLI runs list operation (databricks runs list).
As you noted, dbutils.notebook.run() is synchronous and does not return a value to the calling code until it finishes.
However, in the notebook UI, the run ID and a link are printed when the run starts, so there must be a way to capture these. I have not yet found how.
Another possible solution would be to create some endpoint or resource that the child notebooks check to decide whether they should continue execution or exit early via dbutils.notebook.exit(). Doing this check every few cells would be similar to the approaches in the article you linked, just applied to a notebook instead of a thread.
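As an illustration of that idea, each child notebook could poll a shared flag between cells and exit early when it appears; the flag path and return value here are hypothetical:

    # Hypothetical stop flag that the parent writes when a restart is requested.
    STOP_FLAG = "dbfs:/tmp/stop_execution.flag"

    def should_stop():
        try:
            dbutils.fs.ls(STOP_FLAG)   # succeeds only if the flag file exists
            return True
        except Exception:              # ls raises when the path does not exist
            return False

    # Repeat this check between every few cells.
    if should_stop():
        dbutils.notebook.exit("cancelled early")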
I am trying to work out how to process bulk records into Elasticsearch using the bulk function, and I need to use threads to get some performance out of it. But I am stuck trying to work out how to limit the threads to 5 concurrent ones so it is not too heavy on Elasticsearch.
I was thinking of just looping over the DB and filling a list, then when it hits, say, 50 records, pushing it to a thread for processing and continuing. But this method will spawn too many threads, and I cannot see an obvious way to limit the threads without waiting for all of them to finish before adding another one.
I have done this in Go before, where you can just add workers and, once it hits the limit, it simply waits before adding more to the queue, but this seems a little more elusive in Python so far.
I am open to alternatives; this seems like the cleanest way to go so far, but there might be better methods, like DB -> bounded queue, then threads consuming from the queue?
I look forward to some responses.
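A minimal sketch of the bounded-worker idea described above, assuming the elasticsearch-py client; the index name, chunk size, and DB iterator are placeholders:

    from concurrent.futures import ThreadPoolExecutor
    from elasticsearch import Elasticsearch, helpers  # assumes elasticsearch-py

    es = Elasticsearch()   # placeholder connection settings
    CHUNK_SIZE = 50        # records per bulk request
    MAX_WORKERS = 5        # concurrent bulk requests

    def index_chunk(chunk):
        # One bulk request for the whole chunk.
        actions = ({"_index": "my-index", "_source": doc} for doc in chunk)
        helpers.bulk(es, actions)

    def chunks(rows, size):
        chunk = []
        for row in rows:
            chunk.append(row)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    rows = fetch_rows_from_db()   # hypothetical DB iterator
    # At most MAX_WORKERS bulk requests run at the same time. Note that
    # map() submits every chunk up front, so for very large tables a
    # bounded queue would keep memory lower.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        list(executor.map(index_chunk, chunks(rows, CHUNK_SIZE)))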
I am currently working with three MATLAB functions, trying to make them run near-simultaneously in a single MATLAB session (as far as I know, MATLAB is single-threaded). Each of these functions is allocated an individual task. It might be difficult for me to explain all the details of each function here, but I will try to include as much information as possible.
They are the CONTROL/CAMERA/DATA_DISPLAY tasks. The approach I am using is to create timer objects so that each function's callback runs continuously, each with a different callback period.
CONTROL sends and receives data over Wi-Fi through a UDP port; it checks whether a packet is available and executes its callback constantly.
CAMERA receives camera frames continuously over TCP and displays them; one timer object, T1, refreshes the captured frame for this function.
DATA_DISPLAY shows all the received data and refreshes continuously, so another timer, T2, refreshes the display for this function.
However, I noticed that timer T2 blocks timer T1 when it executes, slowing down the whole process. I am working on a system with a multi-core CPU, and I would expect MATLAB to be able to execute both timer objects in parallel, taking advantage of the cores.
From what I have seen of the Parallel Computing Toolbox in MATLAB, it does not seem able to deal with infinite loops or continuous callbacks, since the code never finishes and nothing is displayed during execution; then again, I am not sure how to use this toolbox properly.
Alternatively, can anyone suggest a good way to restructure the code into a more efficient form?
Many thanks
I see a problem with using the Parallel Computing Toolbox here. The design implies that the jobs are controlled via your primary MATLAB instance. Besides this, the primary instance is the only one with a GUI, which would require letting your DATA_DISPLAY task control everything. I don't know if this is possible, but it would result in a very strange architecture. Also, inter-process communication is not the best idea when processing large amounts of data.
To solve the issue, I would use Java to display your data and implement the DATA_DISPLAY part. The connection to Java is very fast and simple to use. You would have to write a small Java GUI with an appendFrame function that allows your CAMERA job to push new data. Obviously, updating the GUI should be done in parallel, without blocking.
I have written a nice parallel job processor that accepts jobs (functions, their arguments, timeout information, etc.) and submits them to a Python multiprocessing pool. I can provide the full (long) code if requested, but the key step (as I see it) is the asynchronous application to the pool:
    job.resultGetter = self.pool.apply_async(
        func=job.workFunction,
        kwds=job.workFunctionKeywordArguments,
    )
I am trying to use this parallel job processor with a large body of legacy code and, perhaps naturally, have run into pickling problems:
    PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
This type of problem is observable when I try to submit a problematic object as an argument for a work function. The real problem is that this is legacy code and I am advised that I can make only very minor changes to it. So... is there some clever trick or simple modification I can make somewhere that could allow my parallel job processor code to cope with these traditionally unpicklable objects? I have total control over the parallel job processor code, so I am open to, say, wrapping every submitted function in another function. For the legacy code, I should be able to add the occasional small method to objects, but that's about it. Is there some clever approach to this type of problem?
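One wrapper along the lines of the "multiprocessing and dill" question linked in the answer below could look roughly like this; it assumes dill is installed, and the helper names are illustrative:

    import dill  # serializes bound methods, closures, and lambdas

    def run_dill_encoded(payload):
        # Module-level helper, so the standard pickler can still ship it to
        # the worker; the real work function travels inside the payload.
        func, args, kwargs = dill.loads(payload)
        return func(*args, **kwargs)

    def apply_async_dill(pool, func, args=(), kwds=None):
        payload = dill.dumps((func, args, kwds or {}))
        return pool.apply_async(run_dill_encoded, (payload,))

The job processor would then call apply_async_dill(self.pool, job.workFunction, kwds=job.workFunctionKeywordArguments) instead of calling pool.apply_async directly, leaving the legacy code untouched.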
Use dill and pathos.multiprocessing instead of pickle and multiprocessing.
See here:
What can multiprocessing and dill do together?
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
How to pickle functions/classes defined in __main__ (python)
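A minimal sketch of that suggestion, assuming pathos is installed; the class and method names are illustrative:

    from pathos.multiprocessing import ProcessingPool

    class LegacyWorker:
        def __init__(self, offset):
            self.offset = offset

        def work(self, x):
            # A bound method, the kind of object the traceback above complains about.
            return x + self.offset

    if __name__ == "__main__":
        worker = LegacyWorker(10)
        pool = ProcessingPool(nodes=4)
        # pathos serializes with dill, so the bound method goes through fine.
        print(pool.map(worker.work, range(5)))  # [10, 11, 12, 13, 14]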
I am building a simple application to download a set of XML files and parse them into a database using the async module (https://npmjs.org/package/node-async) for flow control. The overall flow is as follows:
Download list of datasets from API (single Request call)
Download metadata for each dataset to get link to XML file (async.each)
Download XML for each dataset (async.parallel)
Parse XML for each dataset into JSON objects (async.parallel)
Save each JSON object to a database (async.each)
In effect, for each dataset there is a parent process (2) which sets off a series of asynchronous child processes (3, 4, 5). The challenge I am facing is that, because so many parent processes fire before all of the children of a particular process are complete, child processes seem to get queued up in the event loop, and it takes a long time for all of the child processes of a particular parent process to resolve and allow garbage collection to clean everything up. The result is that, even though the program doesn't appear to have any memory leaks, memory usage is still too high, ultimately crashing the program.
One solution that worked was to make some of the child processes synchronous so that they could be grouped together in the event loop. However, I have also seen an alternative solution discussed here: https://groups.google.com/forum/#!topic/nodejs/Xp4htMTfvYY, which pushes parent processes into a queue and only allows a certain number to be running at once. My question, then, is: does anyone know of a more robust module for handling this type of queueing, or any other viable alternative for handling this kind of flow control? I have been searching, but so far no luck.
Thanks.
I decided to post this as an answer:
Don't launch all of the processes at once. Let the callback of one request launch the next one. The overall work is still asynchronous, but each request gets run in series. You can then pool up a certain number of the connections to be running simultaneously to maximize I/O throughput. Look at async.eachLimit and replace each of your async.each examples with it.
Your async.parallel calls may be causing issues as well.