I am trying to ensure that a PyTorch program built in C++ uses only a single thread. The program runs on the CPU.
It has a fairly small model, and multi-threading doesn't help; it actually causes problems because my program is already multithreaded. I have called:
at::set_num_interop_threads(1); // ATen inter-op thread pool
at::set_num_threads(1);         // ATen intra-op thread pool
torch::set_num_threads(1);      // same intra-op setting via the torch:: API
omp_set_num_threads(1);         // OpenMP threads for parallel regions
omp_set_dynamic(0);             // no dynamic adjustment of OpenMP thread counts
omp_set_nested(0);              // no nested OpenMP parallelism
In addition, I have set the environment variable
OPENBLAS_NUM_THREADS to 1.
Still, when I run it in a single thread, a total of 16 threads show up in htop, and 16 of the machine's processors go to 100%.
Am I missing something? What?
From the PyTorch docs, one can do:
torch.set_num_threads(1)
To be on the safe side, do this before you instantiate any models etc (so immediately after the import). This worked for me.
More info: https://jdhao.github.io/2020/07/06/pytorch_set_num_threads/
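For instance, a minimal sketch of that ordering (the Linear model and tensor sizes are placeholders; capping the inter-op pool with torch.set_num_interop_threads is an extra assumption on my part, not something the answer requires):
import torch

torch.set_num_threads(1)          # intra-op parallelism; set before any model or tensor work
torch.set_num_interop_threads(1)  # inter-op pool; must be set before it is first used

model = torch.nn.Linear(8, 1)     # instantiate models only after the thread settings
out = model(torch.randn(4, 8))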
Related
Does a Simulink model compiled with the Simulink Coder toolbox run with multiple threads or just with one thread/process? As far as I know, the simulation is a single process if you do not have the Parallel Computing Toolbox, but what about multithreading?
I am curious how Simulink handles different step sizes for simulation time in one model. For example, if there are 2 parallel paths in a model with different step sizes (1 x complex work with a 0.1 s step time and 100 x light work with a 0.001 s step time), do these paths run one after the other, or somehow in a parallel fashion with threads to save execution time?
The Simulink Coder generates pretty plain vanilla C code, and by default compiles it as such. There's no inherent multithreading or parallelism going on in the code itself.
Different sample rates in the model are given task IDs, and each step through the code executes the code associated with the currently executing ID. Tasks can also be split into different files, allowing easier multitasking execution when deployed on an RTOS.
How the multiple tasks execute is largely dependent on the target OS and the compilation process. If you're compiling to a shared library or an exe deployed on a non-real-time OS (e.g. Windows), then you're not getting any multitasking. If you have an RTOS, have generated the code in an appropriate way, and compile appropriately, then you will have multitasking.
There is a discussion of how this works in the doc: Model Single-Core, Multitasking Platform Execution
You have access to the code, and access to the build file (and can modify both should you wish.) The easiest way to see what is going on is to look at that code.
I am running a Python script on a Windows HPC cluster. A function in the script uses starmap from the multiprocessing package to parallelize a certain computationally intensive process.
When I run the script on a single non-cluster machine, I obtain the expected speed boost. When I log into a node and run the script locally, I obtain the expected speed boost. However, when the job manager runs the script, the speed boost from multiprocessing is either completely negated or the script sometimes even runs 2x slower. We have noticed that memory paging occurs when the starmap function is called. We believe this has something to do with the nature of Python's multiprocessing, i.e. the fact that a separate Python interpreter is kicked off for each core.
Since we had success running from the console from a single node, we tried to run the script with HPC_CREATECONSOLE=True, to no avail.
Is there some kind of setting within the job manager that we should use when running Python scripts that use multiprocessing? Is multiprocessing just not appropriate for an HPC cluster?
Unfortunately I wasn't able to find an answer in the community. However, through experimentation, I was able to better isolate the problem and find a workable solution.
The problem arises from the nature of Python's multiprocessing implementation. When a Pool object is created (i.e. the manager class that controls the processing cores for the parallel work), a new Python run-time is started for each core. There are multiple places in my code where the multiprocessing package is used and a Pool object instantiated... every function that requires it creates a Pool object as needed and then joins and terminates before exiting. Therefore, if I call the function 3 times in the code, 8 instances of Python are spun up and then closed 3 times. On a single machine, the overhead of this was not significant at all compared to the computational load of the functions... however on the HPC it was absurdly high.
I re-architected the code so that a Pool object is created at the very beginning of the calling process and then passed to each function as needed. It is closed, joined, and terminated at the end of the overall process.
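A rough sketch of that structure, assuming a plain starmap workload (the function names, worker count, and inputs below are placeholders, not the actual code):
import multiprocessing as mp

def compute_one(a, b):
    return a * b  # stand-in for the computationally intensive work

def heavy_step(pool, items):
    # the function receives the already-running pool instead of creating its own
    return pool.starmap(compute_one, items)

def main():
    pool = mp.Pool(processes=8)          # created once, at the start of the overall process
    try:
        first = heavy_step(pool, [(1, 2), (3, 4)])
        second = heavy_step(pool, [(5, 6), (7, 8)])
    finally:
        pool.close()                     # no more work will be submitted
        pool.join()                      # wait for the workers to finish and exit
    return first, second

if __name__ == "__main__":
    main()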
We found that the bulk of the time was spent in the creation of the Pool object on each node. This was an improvement though because it was only being created once! We then realized that the underlying problem was that multiple nodes were trying to access Python at the same time in the same place from over the network (it was only installed on the head node). We installed Python and the application on all nodes, and the problem was completely fixed.
This solution was the result of trial and error... unfortunately our knowledge of cluster computing is pretty low at this point. I share this answer in the hopes that it will be critiqued so that we can obtain even more insight. Thank you for your time.
I'm trying Intel TBB for the first time and I'm stuck right at the beginning.
I've attached a simple image to show how I want to build my concurrency program.
I've taken a look here: Simplest TBB example, and here: using TBB for non-parallel tasks.
TBB is nice, but I don't know how to handle the following problem: how can I define thread pools/task pools that are started or stopped depending on memory consumption? To be precise, if the memory consumption of some data class gets too high, the thread pool that fills it should be stalled (e.g. no new tasks are spawned until the other threads have consumed the data inside the corresponding data class).
The result should be a CPU running on all cores without memory overflow.
Are there any examples?
I am trying to get some code running which is "embarrassingly parallel", so I just started to look into parallel processing. I am trying to use parLapply on a Linux machine (because it works perfectly fine on my Windows machine, whereas mclapply would limit the code to Linux) but I encounter some problems.
This is what my code looks like:
cl <- makeCluster(detectCores(), type="FORK") # fork -> psock when I use Win
clusterExport(cl, some.list.of.things)
out <- parLapply(cl, some.list, some.fun) # parLapply needs the data to iterate over as its second argument
stopCluster(cl)
At first, I noted that the parallel implementation is actually much slower than the sequential one, the reason being that, on my Linux machine, each child process inherits the CPU affinity of the parent. At least I think I can draw this conclusion from the observation that, in the system monitor, all my R session processes had only about 8% or so CPU time, and only one core was used. See this really helpful thread here.
I ended up using the code of that last thread, namely:
system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
I need to mention here that I am not in any way familiar with Linux basics. It is my university's server, run by other people, and I have no idea what the above code actually means and does, apart from changing "1" to "ff" (whatever "ff" stands for). Anyway, after executing the above code, I can see that 3 out of 8 of my child processes receive almost full CPU time, which is a big improvement.
Having said that, there are 8 cores (determined by detectCores()) and 8 child processes (as seen in the system monitor), but "only" 3 child processes are working.
Given that I am completely new to parallel processing, I was wondering if you could give me some guidance as to how to make all 8 cores used. I feel like a blind person that doesn't know what he should be looking for to fix that situation. Any pointers to what I should change or what might be the problem would be highly appreciated!
I am trying to run a multithreaded program on a 4-core processor, and I want each thread to run on a different core.
How can I do that? Because right now I see that they are all running on the same core.
(I'm using Linux, and my code was written in C.)
Process schedulers make processes have an affinity towards a specific CPU: you've already loaded a bunch of stuff into the cache, so you may as well keep using this 'hot' cache.
You may be getting all the threads on this same core, since you already have the program loaded here.
I did find this: pthread_setaffinity_np. It seems clumsy, but I hope it's of some use.