Python - Running multiple Python scripts that use multiprocessing affects performance and sometimes errors out - python-3.x

I have a Python script that uses multiprocessing to extract data from a DB2/Oracle database to CSV and ingest it into Snowflake. When I run this script on its own, performance is good (it extracts a large source table in 75 seconds). So I made copies of this Python script and changed the input parameters (basically different source tables). When I run all these Python scripts together, performance takes a hit (the same table now extracts in 100 seconds) and sometimes I see the error 'Cannot allocate memory'.
I am using Jupyter Notebook, and all these different Python scripts extract different source tables to CSV files and save them to the same server location.
I am also investigating this on my own, but any help will be appreciated.
Thanks
Bala

If you are running multiple scripts that use multiprocessing and write to the same disk at the same time, you will eventually hit a bottleneck somewhere.
It could be concurrent access to the database, the write speed of the disk, the amount of memory used, or CPU cycles. What specifically the problem is here is impossible to say without doing measurements.
But, for example, writing to an HDD is very slow compared to current CPU speeds.
Also, when you are running multiple scripts that use multiprocessing, you can end up with more worker processes than the CPU has cores, in which case some worker processes will always be waiting for CPU time.
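A minimal sketch of that last idea, assuming each script builds its own multiprocessing.Pool: cap the per-script worker count so that several scripts running side by side do not oversubscribe the cores. MAX_WORKERS_PER_SCRIPT and extract_table are illustrative placeholders, not the asker's actual code.

import os
from multiprocessing import Pool

# Illustrative cap: e.g. four scripts sharing the same machine, so each one
# takes roughly a quarter of the cores (but never less than one worker).
MAX_WORKERS_PER_SCRIPT = max(1, (os.cpu_count() or 1) // 4)

def extract_table(table_name):
    # Placeholder for the real work: extract one source table and write its CSV.
    print(f"extracting {table_name} ...")

if __name__ == "__main__":
    tables = ["TABLE_A", "TABLE_B", "TABLE_C"]
    with Pool(processes=MAX_WORKERS_PER_SCRIPT) as pool:
        pool.map(extract_table, tables)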

Related

How to know how many resources a process uses over its whole execution time? (Linux)

I would like to know if there is a program that analyzes how many resources it takes to execute a command,
for example as follows:
# magic_program python3 app.py
The program would then tell you how many resources the execution of a program uses: CPU, memory, disk, network, etc.
Something that, in a certain way, watches over the program during its execution and then gives you a report. If it doesn't exist, I would love to carry out a project like this.
Questions
Is there such a magic program? If not, how viable would it be to create one?
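As a starting point, GNU time (/usr/bin/time -v on Linux) already reports peak memory and CPU time for a single command. For a continuous report of CPU and memory, a rough sketch of such a monitor could be built on the third-party psutil package; everything below (the sampling interval, the metrics chosen) is illustrative, not an existing tool.

import sys
import psutil  # third-party package: pip install psutil

# Launch the command given on the command line, e.g.:
#   python3 monitor.py python3 app.py
proc = psutil.Popen(sys.argv[1:])

peak_rss = 0
cpu_samples = []
while proc.poll() is None:
    try:
        peak_rss = max(peak_rss, proc.memory_info().rss)
        cpu_samples.append(proc.cpu_percent(interval=0.5))
    except psutil.NoSuchProcess:
        break  # the process exited between poll() and the sample

print(f"peak RSS : {peak_rss / 2**20:.1f} MiB")
print(f"mean CPU : {sum(cpu_samples) / max(len(cpu_samples), 1):.1f} %")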

Run a parallel process saving results from a main process in Python

I have a function that creates some results for a list of tasks. I would like to save the results on the fly, to 1) release memory compared to appending them to a results_list, and 2) have the results of the first part in case of errors.
Here is a very short sample code:
for task in task_list:
    result = do_awesome_stuff_to_task(task)
    save_nice_results_to_db(result)  # Send this job to another process and let the main process continue
Is there a way for the main process to create results for each task in task_list and, each time a result is created, send it to another process/thread to save it, so the main loop can continue without waiting for the slow saving process?
I have looked at multiprocessing, but that seems mostly aimed at speeding up the loop over task_list rather than allowing a secondary subprocess to do other parts of the work. I have also looked into asyncio, but that seems mostly used for I/O.
All in all, I am looking for a way to have a main process looping over the task_list. For each task finished, I would like to send the results to another subprocess to save them. Note that do_awesome_stuff_to_task is much faster than the saving process; hence, the main loop will have worked through multiple tasks before the first task is saved. I have thought of two ways of tackling this:
Use multiple subprocesses to save
Save every xx iterations - the saving scales okay, so perhaps the save process can handle xx iterations at a time while the main loop continues?
Is this possible to do with Python? Where to look and what key considerations to take?
All help is appreciated.
It's hard to know what will be faster in your case without testing, but here are some thoughts on how to choose what to do.
If save_nice_results_to_db is slow because it's writing data to disk or network, make sure you aren't already at the maximum write speed of your hardware. Depending on the server at the other end, network traffic can sometimes benefit greatly from opening multiple connections at once to read/write, so long as you stay within your total network transfer speed (of the network interface as well as your ISP). SSDs can see some limited benefit from initiating multiple reads/writes at once, but too many will hurt performance. HDDs are almost universally slower when trying to do more than one thing at once. Everything is more efficient when reading/writing larger chunks at a time.
multiprocessing must typically transfer data between the parent and child processes using pickle because they don't share memory. This has a high overhead, so if result is a large object, you may waste more time on the added overhead of sending the data to a child process than you could save by any sort of concurrency (emphasis on the may; always test for yourself). As of Python 3.8 the shared_memory module was added, which may be somewhat more efficient, but it is much less flexible and easy to use.
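For illustration, a minimal sketch of multiprocessing.shared_memory (Python 3.8+): the parent creates a named block and the worker attaches to it by name, so the payload itself is never pickled. The names (worker, payload) are placeholders, not part of the question's code.

from multiprocessing import Process, shared_memory

def worker(shm_name, length):
    # Attach to the existing block by name; the bytes are not copied or pickled.
    shm = shared_memory.SharedMemory(name=shm_name)
    print(bytes(shm.buf[:length]))
    shm.close()  # detach from the block, but do not destroy it

if __name__ == "__main__":
    payload = b"large result produced by the parent"
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    p = Process(target=worker, args=(shm.name, len(payload)))
    p.start()
    p.join()
    shm.close()
    shm.unlink()  # free the block once every process is done with it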
threading benefits from all threads sharing memory, so there is zero transfer overhead to "send" data between threads. Python threads, however, cannot execute bytecode concurrently due to the GIL (global interpreter lock), so multiple CPU cores cannot be leveraged to increase computation speed. This is because Python itself has many parts which are not thread-safe. Specific functions written in C may release this lock to get around the issue and leverage multiple CPU cores using threads, but once execution returns to the Python interpreter, the lock is held again. Typically, functions involving network access or file I/O can release the GIL, as the interpreter is waiting on an operating system call, which is usually thread-safe. Other popular libraries like NumPy also make an effort to release the GIL while doing complex math operations on large arrays. You can only release the GIL from C/C++ code, however, not from Python itself.
asyncio deserves a special mention here, as it's designed specifically with concurrent network/file operations in mind. It uses coroutines instead of threads (even lower overhead than threads, which are themselves much lower overhead than processes) to queue up a bunch of operations, then uses an operating system call to wait for any of them to finish (the event loop). Using this would also require your do_awesome_stuff_to_task to run in a coroutine for it to happen at the same time as save_nice_results_to_db.
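A rough asyncio sketch using the question's function names; it assumes both steps can be written as coroutines (for example with an async database driver), and asyncio.sleep stands in for the real work:

import asyncio

async def do_awesome_stuff_to_task(task):
    await asyncio.sleep(0.01)   # placeholder for the fast computation
    return f"result of {task}"

async def save_nice_results_to_db(result):
    await asyncio.sleep(0.1)    # placeholder for the slow save

async def main(task_list):
    pending = []
    for task in task_list:
        result = await do_awesome_stuff_to_task(task)
        # Fire the save off as a task and keep looping instead of awaiting it here.
        pending.append(asyncio.create_task(save_nice_results_to_db(result)))
    await asyncio.gather(*pending)  # make sure every save finishes before exiting

asyncio.run(main(range(10)))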
A trivial example of firing each result off to a thread to be processed:
import threading

for task in task_list:
    result = do_awesome_stuff_to_task(task)
    # Hand the save off to another thread and let the main loop continue
    threading.Thread(target=save_nice_results_to_db, args=(result,)).start()
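If creating one thread per result turns out to be too heavy, an alternative sketch (again reusing the question's names, with an assumed queue size) is a single long-lived saver thread fed through a bounded queue, which also limits memory use if saving falls behind:

import queue
import threading

def saver(q):
    while True:
        result = q.get()
        if result is None:     # sentinel: no more results are coming
            break
        save_nice_results_to_db(result)

q = queue.Queue(maxsize=100)   # bounded, so the loop blocks if saving falls behind
saver_thread = threading.Thread(target=saver, args=(q,))
saver_thread.start()

for task in task_list:
    q.put(do_awesome_stuff_to_task(task))

q.put(None)                    # tell the saver to finish up
saver_thread.join()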

Using Python multiprocessing on an HPC cluster

I am running a Python script on a Windows HPC cluster. A function in the script uses starmap from the multiprocessing package to parallelize a certain computationally intensive process.
When I run the script on a single non-cluster machine, I obtain the expected speed boost. When I log into a node and run the script locally, I obtain the expected speed boost. However, when the job manager runs the script, the speed boost from multiprocessing is either completely lost or, sometimes, the script even runs 2x slower. We have noticed that memory paging occurs when the starmap function is called. We believe this has something to do with the nature of Python's multiprocessing, i.e. the fact that a separate Python interpreter is started for each core.
Since we had success running from the console from a single node, we tried to run the script with HPC_CREATECONSOLE=True, to no avail.
Is there some kind of setting within the job manager that we should use when running Python scripts that use multiprocessing? Is multiprocessing just not appropriate for an HPC cluster?
Unfortunately I wasn't able to find an answer in the community. However, through experimentation, I was able to better isolate the problem and find a workable solution.
The problem arises from the nature of Python's multiprocessing implementation. When a Pool object is created (i.e. the manager class that controls the worker processes for the parallel work), a new Python runtime is started for each core. There are multiple places in my code where the multiprocessing package is used and a Pool object instantiated... every function that requires it creates a Pool object as needed and then joins and terminates it before exiting. Therefore, if I call such a function 3 times in the code, 8 instances of Python are spun up and then closed 3 times. On a single machine, this overhead was not significant at all compared to the computational load of the functions... on the HPC, however, it was absurdly high.
I re-architected the code so that a Pool object is created at the very beginning of the process and then passed to each function as needed. It is closed, joined, and terminated at the end of the overall process.
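A minimal sketch of that re-architecture (the function names are illustrative, not the author's code): the Pool is created once at the top level and passed into each function, instead of each function spinning up and tearing down its own pool.

from multiprocessing import Pool

def heavy_compute_a(x):
    return x * x               # stand-in for the real computation

def heavy_compute_b(x, y):
    return x + y               # stand-in for the real computation

def step_one(pool, data):
    return pool.map(heavy_compute_a, data)

def step_two(pool, data):
    return pool.starmap(heavy_compute_b, data)

if __name__ == "__main__":
    # One set of worker interpreters for the whole run, shared by every step.
    with Pool() as pool:
        a = step_one(pool, range(100))
        b = step_two(pool, [(x, x) for x in range(100)])
    # Exiting the with-block shuts the pool down once, at the end of the process.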
We then found that the bulk of the time was spent creating the Pool object on each node. This was still an improvement, though, because it was now only being created once! We then realized that the underlying problem was that multiple nodes were trying to access Python at the same time in the same place over the network (it was only installed on the head node). We installed Python and the application on all nodes, and the problem was completely fixed.
This solution was the result of trial and error... unfortunately our knowledge of cluster computing is pretty low at this point. I share this answer in the hopes that it will be critiqued so that we can obtain even more insight. Thank you for your time.

Timeout and kill parallel matlab execution

I have a matlab processing script located in the middle of a long processing pipeline running on linux.
The MATLAB script applies the same operation to a number N of datasets D_i (i=1,2,...,N) in parallel on 8 cores via parfor.
Usually, processing the whole dataset takes about 2 hours (on 8 cores).
Unfortunately, from time to time, one of the MATLAB subprocesses appears to crash randomly. This makes the job impossible to complete (and the pipeline can't finish).
I am sure this does not depend on the data, because if I reprocess specifically the D_i on which the process crashed, it is executed without problems. Moreover, up to now I've already processed thousands of these datasets.
How I deal with the problem now (...manually...):
After I start the MATLAB job, I periodically check the process list on the machine (via a simple top); whenever one MATLAB process is still alive after two hours of work, I know for sure that it has crashed. Then I simply kill it and process the part of the dataset which has not been analyzed.
Question:
I am looking for suggestions on how to time out ALL the running MATLAB processes and kill them whenever they have been alive for more than e.g. 2 hours of CPU time.
You should be able to do this by restructuring your code to use PARFEVAL instead of PARFOR. There's a simple example in this entry on Loren's blog: http://blogs.mathworks.com/loren/2013/12/09/getting-data-from-a-web-api-in-parallel/ which shows how you can stop waiting for work after a given amount of time.

C# - multithreading a process is faster with far fewer CPU cores than with many more

Currently our application processes a large number of files, over 1000 XML files, in the same directory. The files are all read, parsed, and updated/saved to the database.
When we tested our application on a 12-core machine, the total process was much slower than processing on a 4-core machine.
What we observed is that the thread count produced by our application goes up to a range of 30 to 90 threads, and the number of context switches increases massively. This is probably caused by the amount of parallel execution being spawned, but all of it is important.
Is the context switching the culprit? Or the parallel reading/writing of files? Or should we reduce the number of parallel tasks?
The bottleneck here is disk access. No matter how many threads you start, the file system can only read one file at a time. Starting more threads will only make them fight over this single resource, increasing both the context switching and the disk seek times.
At the other end of the process there is also a limitation, as only one thread at a time can update a table in the database, but the database is designed to handle multiple processes.
Make a single thread responsible for the disk reads; once a file has been read, it can start a thread that processes it. That way you read from the disk in the most efficient way, and you have the multithreaded part of the operation behind the bottleneck.
