I am running a Python script on a Windows HPC cluster. A function in the script uses Pool.starmap from the multiprocessing package to parallelize a certain computationally intensive process.
When I run the script on a single non-cluster machine, I obtain the expected speed boost. When I log into a node and run the script locally, I obtain the expected speed boost. However, when the job manager runs the script, the speed boost from multiprocessing is either completely negated or, sometimes, the script even runs 2x slower. We have noticed that memory paging occurs when the starmap function is called. We believe that this has something to do with the nature of Python's multiprocessing, i.e. the fact that a separate Python interpreter is kicked off for each core.
Since we had success running from the console from a single node, we tried to run the script with HPC_CREATECONSOLE=True, to no avail.
Is there some kind of setting within the job manager that we should use when running Python scripts that use multiprocessing? Is multiprocessing just not appropriate for an HPC cluster?
Unfortunately I wasn't able to find an answer in the community. However, through experimentation, I was able to better isolate the problem and find a workable solution.
The problem arises from the nature of Python's multiprocessing implementation. When a Pool object is created (i.e. the manager class that controls the processing cores for the parallel work), a new Python runtime is started for each core. There are multiple places in my code where the multiprocessing package is used and a Pool object instantiated: every function that requires it creates a Pool object as needed and then joins and terminates it before exiting. Therefore, if I call the function 3 times in the code, 8 instances of Python are spun up and then closed 3 times. On a single machine, this overhead was insignificant compared to the computational load of the functions; on the HPC cluster, however, it was absurdly high.
I re-architected the code so that a Pool object is created at the very beginning of the process and then passed to each function as needed. It is closed, joined, and terminated at the end of the overall process.
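A minimal sketch of that refactor (the function names and workload below are hypothetical stand-ins, not my actual code): the Pool is created once in the entry point and handed to each worker function.

```python
from multiprocessing import Pool

def compute(x, y):
    # placeholder for the real computationally intensive work
    return x * y

def heavy_step_a(pool, data):
    # reuse the shared pool instead of spinning up a new one here
    return pool.starmap(compute, data)

def heavy_step_b(pool, data):
    return pool.starmap(compute, data)

if __name__ == "__main__":
    data = [(i, i + 1) for i in range(1000)]
    pool = Pool()          # workers are started once, at the start of the run
    try:
        a = heavy_step_a(pool, data)
        b = heavy_step_b(pool, data)
    finally:
        pool.close()       # no more tasks will be submitted
        pool.join()        # wait for the workers, then shut down once
```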
We found that the bulk of the time was spent in the creation of the Pool object on each node. This was already an improvement, though, because the Pool was only being created once! We then realized that the underlying problem was that multiple nodes were trying to access Python at the same time, in the same place, over the network (it was only installed on the head node). We installed Python and the application on all nodes, and the problem was completely fixed.
This solution was the result of trial and error... unfortunately our knowledge of cluster computing is pretty low at this point. I share this answer in the hopes that it will be critiqued so that we can obtain even more insight. Thank you for your time.
Related
I have a Python script that uses multiprocessing to extract data from a DB2/Oracle database to CSV and ingest it into Snowflake. When I run this script on its own, performance is good (it extracts a large source table in 75 seconds). So I made a copy of this Python script and changed the input parameters (basically different source tables). When I run all these Python scripts together, performance takes a hit (the same table now extracts in 100 seconds) and sometimes I see a 'Cannot allocate memory' error.
I am using Jupyter Notebook, and all these different Python scripts extract different source tables to CSV files and save them in the same server location.
I am also investigating on my own, but any help will be appreciated.
Thanks
Bala
If you are running multiple scripts that use multiprocessing and write to the same disk at the same time, you will eventually hit a bottleneck somewhere.
It could be concurrent access to the database, the writing speed of the disk, the amount of memory used, or CPU cycles. Which of these is the actual problem is impossible to say without taking measurements.
But, for example, writing to an HDD is very slow compared to current CPU speeds.
Also, when you run multiple scripts that each use multiprocessing, you can end up with more worker processes than the CPU has cores, in which case some worker processes will always be waiting for CPU time.
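As a rough illustration, one way to keep the worker count of a single script bounded is to size its pool from the core count (the table names and extraction function below are hypothetical placeholders, not your actual code):

```python
import os
from multiprocessing import Pool

def extract_table(table_name):
    # placeholder for the real DB2/Oracle -> CSV extraction logic
    print(f"extracting {table_name}")

if __name__ == "__main__":
    tables = ["ORDERS", "CUSTOMERS", "INVENTORY"]  # hypothetical table names
    # cap the worker count at the core count so this script alone
    # cannot oversubscribe the CPU
    workers = min(len(tables), os.cpu_count() or 1)
    with Pool(processes=workers) as pool:
        pool.map(extract_table, tables)
```

Note that this only bounds one script; several such scripts running at once can still oversubscribe the machine between them.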
I need to use Selenium for a scraping job with a heap of JavaScript-generated webpages. I can open several instances of the webdriver at a time and pass the websites to the instances using a queue.
It can be done in multiple ways, though. I've experimented with both the threading module and the Pool and Process approaches from the multiprocessing module.
All work and will do the job quite fast.
This leaves me wondering: which module is generally preferred in a situation like this?
The main factor in CPython for choosing between threads and processes is your type of workload.
If you have an I/O-bound workload, where most of your application's time is spent waiting for data to come in or go out, then your best choice is threads.
If, instead, your application spends most of its time using the CPU, then processes are your tool of choice.
This is due to the fact that, in CPython (the most commonly used interpreter), only one thread at a time can execute Python code. For more information regarding this limitation, read about the Global Interpreter Lock (GIL).
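A rough sketch of the two choices using Python's concurrent.futures (the URLs and workloads below are placeholders, not your scraping code):

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request

def fetch(url):
    # I/O-bound: the thread spends most of its time waiting on the network,
    # during which the GIL is released
    with urllib.request.urlopen(url) as resp:
        return resp.status

def crunch(n):
    # CPU-bound: pure computation, serialised by the GIL across threads
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    urls = ["https://example.com"] * 4             # placeholder URLs
    with ThreadPoolExecutor(max_workers=4) as tp:  # threads for I/O-bound work
        print(list(tp.map(fetch, urls)))
    with ProcessPoolExecutor(max_workers=4) as pp:  # processes for CPU-bound work
        print(list(pp.map(crunch, [10**6] * 4)))
```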
There is another advantage of using processes which is usually overlooked: processes give you a greater degree of isolation. This means that if you have some unstable code (in your case, possibly the scraping logic) which might hang or crash badly, encapsulating it in a separate process allows your service to detect the anomaly and recover (kill the process and restart it).
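A minimal sketch of that recovery pattern with Python's multiprocessing (the scraping function and the timeout value are hypothetical):

```python
from multiprocessing import Process

def scrape(url):
    # hypothetical scraping logic that might hang or crash
    pass

def run_isolated(url, timeout=30):
    # run the unstable code in its own process; if it hangs, kill it and report
    p = Process(target=scrape, args=(url,))
    p.start()
    p.join(timeout)
    if p.is_alive():          # still running past the deadline: assume it hung
        p.terminate()
        p.join()
        return False          # caller can requeue the URL or restart the worker
    return p.exitcode == 0    # non-zero exit code means the process crashed
```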
I am using Node.js for a CPU-intensive task, which basically generates a large amount of data and stores it in a file. I am streaming the data to output files as it is generated for a single type of data.
Aim: I want to make the task of generating this data for multiple types of data run in parallel (utilizing my multi-core CPU to its best), without each process having its own heap memory, thus providing larger shared memory and increased speed of execution.
I was planning to use node-fibers, which is also used by Meteor.js for its own callback handling. But I am not sure if this will achieve what I want, as in one of the videos on Meteor fibers, Chris Mather mentions at the end that eventually everything is single-threaded and node-fibers somehow manages the same single-threaded event loop to provide its functionality.
So,
1. Does this mean that if I use node-fibers I won't be running my task in parallel, thus not utilizing my CPU cores?
2. Will node webworker-threads help me achieve the functionality I desire? Its homepage says that webworker threads run on separate/parallel CPU processes, thus providing multi-threading in a real sense.
3. As an ending question: does this mean that Node.js is not advisable for such CPU-intensive tasks?
Note: I don't want to use asynchronous code-structuring libs which are presented as threads but in fact just add syntactical sugar over the same async code, as the tasks are largely CPU-intensive. I have already used async capabilities to the max.
// Update 1 (based on the answer about clusters)
Sorry, I forgot to mention this, but the problem I faced with clusters is:
It is complex to load-balance the amount of work I have in a way that makes sure a particular set of parallel tasks executes before certain other tasks.
I'm not sure if clusters really do what I want, referring to these lines on the webworker-threads npm homepage:
The "can't block the event loop" problem is inherent to Node's evented model. No matter how many Node processes you have running as a Node-cluster, it won't solve its issues with CPU-bound tasks.
Any light on how to achieve this would be helpful.
Rather than trying to implement multiple threads, you should find it much easier to use multiple processes with Node.js
See, for example, the cluster module. This allows you to easily run the same js code in multiple processes, e.g. one per core, and collect their results / be notified once they're completed.
If cluster does more than you need, then you can also just call fork directly.
If you must have thread-parallelism rather than process-parallelism, then you may want to look at writing an async native module. Then you have access to the libuv thread pool (though starving it may reduce I/O performance) or can fork your own threads as you wish (but then you're on your own for synchronising with the rest of Node).
After update 1
For load balancing, if what cluster does isn't working for you, then you can just do it yourself with fork, as I mentioned. The source for cluster is available.
For the other point, it means if the task is truly CPU-bound then there's no advantage Node will give you over other technologies, other than being simpler if everything else is using Node. The only option you have is to make sure you're using all the available CPU resources, which a worker pool will give you. If you're already using Node then the easiest options are using the ones it's already got (cluster or libuv). If they're not sufficient then yeah, you'll have to find something else.
Regardless of technology, it remains true that multi-process parallelism is a lot easier than multi-thread parallelism.
Note: despite what you say, you definitely do want to use async code precisely because your task is CPU-intensive; otherwise your tasks will block all I/O. You do not want this to happen.
I have a Django application which relies heavily on threading, and I'm noticing no performance increase no matter how many processes or threads I add to the WSGIDaemonProcess.
I can't find a yes/no answer out there, and I'm wondering: could it be that mod_wsgi is using the same interpreter for each request, so I'm running into a bottleneck due to a GIL limitation?
If so, would you recommend something else that would help me workaround this limitation?
For a typical configuration, yes, all requests would be handled in the same sub-interpreter.
Even if requests run in different sub-interpreters of the same process, you are still affected by the GIL.
Post your actual mod_wsgi configuration to confirm you have set things up right.
Consider trying New Relic to find out where real bottlenecks are.
Watch my PyCon US 2012 talk on finding bottlenecks
Short answer:
No.
Long answer:
This ability to make good use of more than one processor, even when using multithreading, is further enhanced by the fact that Apache uses multiple processes for handling requests and not just a single process. Thus, even when there is some contention for the GIL within a specific process, it doesn't stop other processes from being able to run, as the GIL is only local to a process and does not extend across processes.
Citation: https://code.google.com/p/modwsgi/wiki/ProcessesAndThreading
You haven't given enough information for anybody to recommend how to improve performance, but if you've actually written a thread-heavy program in Python, that's your first mistake.
Instead of running your program on CPython, maybe you should try Jython or IronPython instead. But then it wouldn't work with mod_wsgi, so we really need more details to understand what you're trying to do...
I'm confused whether using multiple processes for a web application will improve the performance. Apache's mod_wsgi provides an option to set the number of processes to be started for the daemon process group. I used fastcgi with lighttpd before and it also had an option to configure the max number of processes for each fastcgi application.
While I don't know how multi-processing is better, I do know some drawbacks of it compared to the single-process multi-threading model. For example, logging is harder to implement in a multi-processing scenario (link), especially when you also want log rotation. And since memory can't be shared, if you cache something in memory (the most straightforward way), you end up with multiple duplicate copies.
Do multiple processes better utilize multi-core computing power, or do they yield higher throughput? Or are they just there for some single threaded applications?
In the case of Python, or more specifically CPython as used by mod_wsgi, the issue is the Python GIL. Although you may have multiple threads in Python, the global interpreter lock effectively means that only one thread can be running Python code at a time. This means it cannot make use of multiple processors/cores properly on a system. Using multiple processes however does allow you to use all those processors/cores.
That said, for mod_wsgi it isn't all Python code but has a lot of C code and with Apache also being C code. During execution of the C code, the GIL is unlocked by that thread meaning that a thread running in C code can run in parallel to a thread running in Python code. Still not the best one can achieve, but can still make partial use of all the processors/core on your system.
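A quick way to see the pure-Python side of this outside of mod_wsgi is to time the same CPU-bound function under a thread pool and a process pool; the sketch below is illustrative, and the exact timings will vary by machine:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # purely CPU-bound loop; under the GIL, threads cannot run this in parallel
    while n:
        n -= 1

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(burn, [5_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # roughly serial
    print("processes:", timed(ProcessPoolExecutor))  # spreads across cores
```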
For more details on this in relation to mod_wsgi, read:
http://blog.dscpl.com.au/2007/09/parallel-python-discussion-and-modwsgi.html
http://blog.dscpl.com.au/2007/07/web-hosting-landscape-and-modwsgi.html
Multi-processing is less efficient than multi-threading, but it's more resilient to failure. Each process gets its own independent memory space and may be terminated and restarted (recycled) independent of other processes.