Set number of threads to be used manually in Julia Pro - multithreading

I am using JuliaPro to work with Julia. My test PC has:
Processor Intel(R) Core(TM) i3-1005G1 CPU @ 1.20GHz, 1190 MHz, 2 Core(s), 4 Logical Processor(s)
When I run Threads.nthreads() it only shows the value 2. Is this the number of cores or the number of threads used?
I even tried going into the settings and changing the value of "Number of Threads", but it doesn't affect the number of threads the software uses.

You can set the number of threads in two ways (assuming Julia 1.5):
Set the environment variable JULIA_NUM_THREADS before (!) starting Julia. Changing it inside Julia has no effect.
Start Julia with the -t option, e.g. julia -t 4.
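For example, in a bash shell (a hypothetical session; the -t option requires Julia 1.5 or newer), the first way looks like this:
$ export JULIA_NUM_THREADS=4
$ julia
julia> Threads.nthreads()
4
On Windows you can set the variable with set JULIA_NUM_THREADS=4 in the same cmd session before starting Julia.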
Note that 2 threads for 2 physical cores is probably already the optimal setting; increasing it to 4 will likely reduce performance.

Related

Increasing number of threads decreases cpu performance

I have written a program that does some calculations. It can be run on any number of cores.
Below is a breakdown of how many calculations are performed when it runs on 1, 2, 3, or 4 cores (on a laptop with 4 logical processors). The numbers in parentheses show calculations per thread/core. My question is: why does per-core performance drop so rapidly as the number of threads/cores increases? I don't expect the performance to double, but the scaling is significantly worse. I also observe the same issue when running 4 instances of the same program, each set up to run on one thread, so I know it's not an issue with the program itself.
The greatest improvement is going from 1 thread to 2. Why is that?
# Threads | calculations/sec
1 | 87000
2 | 129000 (65000,64000)
3 | 135000 (46000,45000,44000)
4 | 140000 (34000,34000,34000,32000)
One interesting thing is that I can see exactly the same issue on Google Cloud Platform in Compute Engine: when I run the program with 16 threads on a 16-virtual-core machine, the performance of each core drops to about 8K states per second, yet if I run 16 instances of the same program, each instance does around 100K states/s. When I do the same test on the 4 cores of my home laptop, I see the same drop in performance whether I run 4 separate .exe instances or one instance with 4 threads. I was expecting the same behavior on GCP, but that's not the case. Is it because of virtual cores? Do they behave differently?
Example code that reproduces the issue: just paste it into your console application, but you need to refresh the stats about 20 times for the performance to stabilize (I am not sure why it fluctuates so much): code example. You can see that if you run the app with 4 threads you get a significant performance hit compared to 4 instances with 1 thread each. I was hoping that enabling gcServer would solve the problem, but I did not see any improvement.
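The linked code is not reproduced here, but the shape of such a benchmark is roughly the following (a hypothetical C sketch using pthreads, not the original C# console app): each thread runs an independent calculation loop and reports its own rate.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SECONDS 5
#define MAX_THREADS 64

static void *worker(void *arg) {
    long *count = (long *)arg;
    double x = 1.0;
    time_t end = time(NULL) + SECONDS;
    while (time(NULL) < end) {
        for (int i = 0; i < 1000; i++)       /* batch work between clock checks */
            x = x * 1.0000001 + 0.0000001;   /* dummy floating-point calculation */
        *count += 1000;
    }
    if (x == 0.0) puts("");                  /* keep x from being optimized away */
    return NULL;
}

int main(int argc, char **argv) {
    int n = argc > 1 ? atoi(argv[1]) : 1;    /* number of threads */
    if (n < 1 || n > MAX_THREADS) return 1;
    pthread_t tid[MAX_THREADS];
    long counts[MAX_THREADS] = {0};
    for (int i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, &counts[i]);
    long total = 0;
    for (int i = 0; i < n; i++) {
        pthread_join(tid[i], NULL);
        printf("thread %d: %ld calcs/sec\n", i, counts[i] / SECONDS);
        total += counts[i];
    }
    printf("total: %ld calcs/sec\n", total / SECONDS);
    return 0;
}

Compile with gcc -O2 -pthread bench.c -o bench, then compare ./bench 4 against four concurrent ./bench 1 runs.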

Curious about how to specify the number of core for MPI in order to get the fastest scientific computation

I have been running several scientific program packages in conjunction with MPI using the following command:
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the values of N and M depend on the physical CPU cores of my machine. For example, my machine has the following specification:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively, and the latter clearly gave me the fastest computation.
I was wondering why setting N and M to 12 and 2 gives much higher performance than the first case.
There is no absolute rule on how to run an MPI+OpenMP application.
The only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today most systems are NUMA).
Then just experiment.
Some apps run best in flat MPI (e.g. one thread per task), while others work best with one MPI task per socket and all available cores for OpenMP.
Last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
For example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up time sharing and performance will be horrible.
And if you run with 12 OpenMP threads and the MPI task is bound to 12 cores, make sure those 12 cores are on the same socket (and not 6 on each socket).
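On the 2-socket, 6-cores-per-socket machine above, one MPI task per socket with 6 OpenMP threads each could be launched and checked like this (a hypothetical command line, assuming Open MPI 1.8 or newer for the --map-by syntax):
$ mpirun -np 2 --map-by ppr:1:socket:pe=6 --bind-to core --report-bindings -x OMP_NUM_THREADS=6 program.exe < input > output
The --report-bindings option prints the core mask of each rank, so you can verify that the 6 cores of each task really sit on the same socket.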
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
Indeed, there is no absolute rule on how to run an MPI+OpenMP application, and I agree with everything Gilles said.
So I want to talk about the CPU in your case.
The specification you give shows that the system has hyper-threading enabled, but this does not always help; your computer in fact has 12 physical cores.
So I advise you to try some combinations with M * N between 12 and 24,
like 12*1, 6*2, 6*3.
Which one is best depends on your application.
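A hypothetical bash loop for running such an experiment (program and file names are placeholders):
for cfg in "24 1" "12 2" "6 4"; do
  set -- $cfg   # $1 = MPI tasks (N), $2 = OpenMP threads per task (M)
  mpirun -np $1 -x OMP_NUM_THREADS=$2 program.exe < input > output.${1}x${2}
done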

Local and Global size influence on program execution - OpenCl

After reading a lot of definitions regarding global work size and local work size, I still don't really understand what they are and how they work.
I think that the global work size determines how many times the kernel function will be called, but what about the local work size?
I thought that the local work size determines how many threads run in parallel at the same time, but am I really correct?
Is the local size the number of threads executing one kernel program per one global size value? I mean, when we have global size = 1 and local size = 1, the kernel function will be called once and only one thread will work on it.
But when we have global size = 4096 and the local size (if allowed that high) is 1024, do we then have 4096 calls of the kernel function, each call with 1024 threads working on it at the same time? Am I correct?
Here is some example code I found:
And my other question is: how does changing the local size influence that code?
As I see it, it clearly works on global_id's, not local ones, so will changing the local size to something bigger than, let's say, 1 influence the time spent executing that algorithm?
And if we had a for loop in that algorithm, would that change anything regarding the influence of the local size? Do we need to use local_id's to see any difference when changing the local size?
I tested this on a few of my programs, and even when I used only global_id's, changing the local work size gave me significantly shorter execution times.
So how does it work? I don't get it.
Thank you in advance!
I thought that the local work size determines how many threads run in
parallel at the same time, but am I really correct?
Correct, but it is per compute unit, not the whole device. If there are more compute units than local thread groups, the device is not fully used. When there are more thread groups than compute units, but not an exact multiple, some compute units wait for the others at the end. When the two are equal (or an exact multiple), then "how many times" is what matters to fully occupy all ALUs.
For example, an 8-core CPU could expose 8 compute units (maybe 8 more with hardware multithreading). But a GPU at a similar price can have 20 to 64 compute units. Also, even within a single compute unit, many groups of threads can be "in flight"; this is not explicitly tuned but changes with resource usage per thread and per compute unit, and maybe per GPU.
how does changing the local size influence that code? As I see it, it
clearly works on global_id's, not local ones, so will changing the local
size to something bigger than, let's say, 1 influence the time spent
executing that algorithm?
Vectorizable/parallelizable kernel code can take advantage of distributing threads to the ALUs and SIMD units of a CPU core, or the wider SIMD units of a GPU compute unit. On a CPU, 8 scalar instructions could be issued at the same time; on a GPU it could be as many as thousands. So when you decrease the local size to 1, you limit the width of parallel thread issue to 1 ALU, which cripples performance on many architectures. When you make the local size too big, resources per thread fall and performance takes a hit. If you have no idea, the OpenCL API can tune the local size for you if you pass NULL for that parameter.
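In host code that looks like this (a C fragment; context, queue, kernel, and buffer setup are assumed to exist, and error checking is omitted):
#include <CL/cl.h>

size_t global = 4096;  /* total number of work items */
size_t local  = 64;    /* work-group size; global must be a multiple of it */

/* explicit local size: */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

/* or pass NULL and let the implementation pick the local size: */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);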
And if we had a for loop in that algorithm, would that change anything
regarding the influence of the local size? Do we need to use local_id's
to see any difference when changing the local size?
For old, statically scheduled architectures, loop unrolling is advised, with an unroll step size equal to the basic SIMD width. And no, the local id is just a query of a thread's id within its own work group, so there is no need to query it if you don't use it.
I tested this on a few of my programs, and even when I used only
global_id's, changing the local work size gave me significantly shorter
execution times. So how does it work?
If the kernel needs an insane amount of resources, you could think of 1 thread per local group. If the kernel needs no resources except immediate values, you should make the local size the maximum allowed. Resource allocation per thread (driven by the kernel code) is important. New architectures have load balancing, so in the future it may not matter if you let the API choose the optimum value.
To keep all ALUs busy, the scheduler issues many threads per core; when one thread is waiting on a memory operation, another thread can do an ALU operation at the same time. This works well when resource usage per thread is small. When you use 50% of all resources of a compute unit, it can have only 2 threads in flight. Threads share sharable resources such as the L1 cache, local memory, and the register file.
Code such as c[i]=a[i]+b[i] on scalar floats is vectorizable. You can get better performance using float8, float16, and similar types if the compiler is not already doing it in the background. This way fewer threads are needed to accomplish all the work, and memory accesses are faster. You can also add a loop inside the kernel to decrease the number of threads even further, which is good for a CPU since less thread dispatching is needed between two data blocks. For a GPU, it may not matter.
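For instance, a hypothetical float8 version of that kernel; each work item now processes 8 floats, so 8 times fewer work items cover the same arrays:
__kernel void add8(__global const float8* a,
                   __global const float8* b,
                   __global float8* c)
{
    size_t i = get_global_id(0);  /* one work item handles 8 floats */
    c[i] = a[i] + b[i];
}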
Trivial example for a CPU:
4 cores, local size = 10, global size = 100
Cores 1 and 2 get 3 thread groups each; cores 3 and 4 get only 2 thread groups each.
1: 30 threads --> fully performant
2: 30 threads
3: 20 threads --> less performant, better preemption for other jobs
4: 20 threads
While instruction pipelining has few bubbles for cores 1 and 2, bubbles start appearing after some time for cores 3 and 4, so they can be used for other jobs such as a second kernel running in parallel, the operating system, or some array copying. When you load all cores equally, such as with 120 threads, they finish more work per second, but the CPU cannot do array copies if the kernels are already using the memory (unless the OS preempts them for other threads).

Matlabpool number of threads vs core

I have a laptop running Ubuntu on an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz. According to the Intel website for this processor (located here), it has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores it has 4 logical cores).
When I start matlabpool it starts with the local configuration and says it has connected to 2 labs. I suppose this means it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer thanks to a quick lesson from @RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, therefore with one Matlab thread (ie one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a core-i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by other external factors such as what other programs you might be running simultaneously to Matlab too.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.

assign two MPI processes per core

How do I assign 2 MPI processes per core?
For example, if I do mpirun -np 4 ./application then it should use 2 physical cores to run 4 MPI processes (2 processes per core). I am using Open MPI 1.6. I did mpirun -np 4 -nc 2 ./application but wasn't able to run it.
It complains mpirun was unable to launch the specified application as it could not find an executable:
orterun (the Open MPI SPMD/MPMD launcher; mpirun/mpiexec are just symlinks to it) has some support for process binding but it is not flexible enough to allow you to bind two processes per core. You can try with -bycore -bind-to-core but it will err when all cores already have one process assigned to them.
But there is a workaround - you can use a rankfile where you explicitly specify which slot to bind each rank to. Here is an example: in order to run 4 processes on a dual-core CPU with 2 processes per core, you would do the following:
mpiexec -np 4 -H localhost -rf rankfile ./application
where rankfile is a text file with the following content:
rank 0=localhost slot=0:0
rank 1=localhost slot=0:0
rank 2=localhost slot=0:1
rank 3=localhost slot=0:1
This will place ranks 0 and 1 on core 0 of processor 0 and ranks 2 and 3 on core 1 of processor 0. Ugly but works:
$ mpiexec -np 4 -H localhost -rf rankfile -tag-output cat /proc/self/status | grep Cpus_allowed_list
[1,0]<stdout>:Cpus_allowed_list: 0
[1,1]<stdout>:Cpus_allowed_list: 0
[1,2]<stdout>:Cpus_allowed_list: 1
[1,3]<stdout>:Cpus_allowed_list: 1
Edit: From your other question it becomes clear that you are actually running on a hyperthreaded CPU. Then you have to figure out the physical numbering of your logical processors (it's a bit confusing, but physical numbering corresponds to the value of processor: as reported in /proc/cpuinfo). The easiest way to obtain it is to install the hwloc library. It provides the hwloc-ls tool that you can use like this:
$ hwloc-ls --of console
...
NUMANode L#0 (P#0 48GB) + Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
PU L#0 (P#0) <-- Physical ID 0
PU L#1 (P#12) <-- Physical ID 12
...
Physical IDs are listed after P# in the brackets. In your 8-core case the second hyperthread of the first core (core 0) would most likely have ID 8 and hence your rankfile would look something like:
rank 0=localhost slot=p0
rank 1=localhost slot=p8
rank 2=localhost slot=p1
rank 3=localhost slot=p9
(note the p prefix - don't omit it)
If you don't have hwloc or cannot install it, you will have to parse /proc/cpuinfo on your own. Hyperthreads have the same values of physical id and core id but different processor and apicid. The physical ID is equal to the value of processor.
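For example, the relevant fields can be dumped like this:
$ grep -E 'processor|physical id|core id' /proc/cpuinfo
Logical processors that report the same physical id and core id are hyperthreads sharing one physical core.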
I'm not sure if you have multiple machines or not, and the exact details of how you want the processes distributed, but I'd consider reading up:
mpirun man page
The manual indicates that it has ways of binding processes to different things, including nodes, sockets, and cpu cores.
It's important to note that you will achieve this if you simply run twice as many processes as you have CPU cores, since they will tend to evenly distribute over cores to share load.
I'd try something like the following, though the manual is somewhat ambiguous and I'm not 100% sure it will behave as intended, as long as you have a dual core:
mpirun -np 4 -npersocket 4 ./application
If you use PBS or something like that, I would suggest this kind of submission:
qsub -l select=128:ncpus=40:mpiprocs=16 -v NPROC=2048 ./pbs_script.csh
In this submission I select 128 compute nodes that have 40 CPUs each and use 16 of them. In my case, I have 20 physical cores per node.
This submission blocks all 40 CPUs of each node, so nobody else can use those resources; it prevents other people from using the same node and competing with your job.
Using Open MPI 4.0, the two commands:
mpirun --oversubscribe -c 8 ./a.out
and
mpirun -map-by hwthread:OVERSUBSCRIBE -c 8 ./a.out
worked for me (I have a Ryzen 5 processor with 4 cores and 8 logical cores).
I tested with a do loop that includes operations on real numbers. All logical threads are used, though there seems to be no speedup benefit, since the computation takes double the amount of time compared to using the -c 4 option (with no oversubscribing).
You can run
mpirun --use-hwthread-cpus ./application
In this case, Open MPI will consider that a processor is a hardware thread provided by hyper-threading; this contrasts with the default behavior, where it considers a processor to be a CPU core.
Open MPI denotes the threads provided by hyper-threading as "hardware threads" when you use this option and allocates one Open MPI processor per "hardware thread".
