I installed CVXPY and SCS came with it. I ran several large models with SCS but it uses only 1 core. My models include exponential cones. I am running CVXPY on a linux server with 256 GB memory and 64 cores.
From the options that cvxpy supports for solvers there appears to be no multi-threading option for SCS though one is listed for CBC.
But when I run SCS with acceleration_lookback = 10 (default) I do have 8 of my 16 cores at 100%. Running with acceleration_lookback = 1 I only see 1 core active.
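One possible explanation, offered as an assumption and not confirmed anywhere in this thread, is that the extra cores are used by a multi-threaded BLAS/LAPACK library called during the acceleration step, not by SCS itself. If that is the case, the usual thread-count environment variables should control it. A minimal sketch; the helper name set_blas_threads is made up for illustration, and the variables only take effect if they are set before the numerical libraries are imported:

```python
import os

def set_blas_threads(n):
    """Set the common BLAS/OpenMP thread-count variables to n (hypothetical helper)."""
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n)

# Must run before numpy/scipy/cvxpy are imported, otherwise it is ignored.
set_blas_threads(16)

# Afterwards the model would be solved as usual, for example:
#   import cvxpy as cp
#   prob.solve(solver=cp.SCS, acceleration_lookback=10)
print({k: v for k, v in os.environ.items() if k.endswith("_NUM_THREADS")})
```

Whether this changes the number of busy cores depends on which BLAS your SCS build links against, so it is worth checking with top or htop.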
My desktop CPU is an Intel i7-4770, which has 4 physical cores and 8 logical cores. To get the most performance, how should I start Julia with additional arguments? Is "julia -p 4 -t 8" right?
These are two different options.
-t is for the number of threads within a single Julia process
-p is for the number of processes in a (local) Julia cluster, each of those processes can have one or many threads.
The difference between multithreading and multiprocessing lies in the parallel computing pattern, mainly in how tasks access memory: threads share the memory of a single process, while each process has its own. For multithreading you will use the Threads module and for multiprocessing the Distributed package.
The examples below should clear things up.
Running 4 threads in a single process:
$ julia -t 4
julia> Threads.nthreads()
4
julia> using Distributed
julia> Distributed.nworkers()
1
Running 4 single-threaded workers (a total of 5 julia processes) and checking the number of threads on the second worker:
$ julia -p 4
julia> using Distributed
julia> Distributed.nworkers()
4
julia> fetch(@spawnat 2 Threads.nthreads())
1
Running 4 multi-threaded workers (a total of 5 julia processes with each process having 4 threads) and checking the number of threads on the master and the second worker:
$ julia -p 4 -t 4
julia> using Distributed
julia> Distributed.nworkers()
4
julia> Threads.nthreads()
4
julia> fetch(@spawnat 2 Threads.nthreads())
4
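The arithmetic behind these sessions can be written down as a small sketch. It is given in Python rather than Julia, and the function name julia_thread_total is made up for illustration. It assumes, as the last session shows, that every worker gets the same -t value as the master process:

```python
def julia_thread_total(p, t):
    """Return (processes, total threads) for `julia -p p -t t`.

    Assumes each of the p workers and the master process run t threads,
    as in the session above.
    """
    processes = p + 1  # p workers plus the master process
    return processes, processes * t

# The session above: julia -p 4 -t 4
print(julia_thread_total(4, 4))
# The command from the question: julia -p 4 -t 8
print(julia_thread_total(4, 8))
```

Under that assumption, julia -p 4 -t 8 would start 5 processes with 40 threads in total, far more than the 8 logical cores of the i7-4770.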
Now regarding the performance the short answer is "it depends".
Some libraries will use the multi-threading functionality while others mostly will not.
For example, LinearAlgebra by default uses BLAS, which has its own multi-threading setting:
$ julia -t 3
julia> using LinearAlgebra
julia> BLAS.vendor()
:openblas64
julia> BLAS.get_num_threads()
8
Other packages such as DataFrames are currently being heavily developed for multi-threading and should make good use of the -t parameter.
Basically, using -t auto, which defaults to the number of logical cores, could be a good setting.
When running your own algorithms you will decide whether to go for multi-threading or multi-processing. A general rule of thumb is that for numerical computation multi-threading is often easier to use but multi-processing scales better (and using the --machine-file option you can have a huge distributed Julia cluster).
That will work, as will julia -p 8. However, which one gives optimal performance will likely depend on exactly which algorithms and parallel processing methods your code uses. For reference, julia -p auto defaults to the number of logical cores, so in this case it would be equivalent to julia -p 8. From the list of command-line arguments in the docs [1]:
-p, --procs {N|auto} Integer value N launches N additional local worker processes; auto launches as many workers as the number of local CPU threads (logical cores)
[1] https://docs.julialang.org/en/v1.6/manual/command-line-options/
Why does psutil.cpu_count() show 16 on my 8-core Mac? I am using Python 3.7.6 and psutil 5.7.2.
That’s because it’s showing the logical cores (number of physical cores multiplied by the number of threads that can run on each core).
If you want the number of physical cores only, use: psutil.cpu_count(logical=False)
Read the docs here: https://psutil.readthedocs.io/en/latest/#psutil.cpu_count for details.
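A minimal sketch of the two counts side by side. It uses os.cpu_count() from the standard library for the logical count and falls back gracefully if psutil is not installed; both calls can return None when the value cannot be determined:

```python
import os

# Logical CPUs (hardware threads) as seen by the OS; may be None.
logical = os.cpu_count()

try:
    import psutil
    # Physical cores only; psutil may return None on some platforms.
    physical = psutil.cpu_count(logical=False)
except ImportError:
    physical = None  # psutil not available in this environment

print("logical CPUs:  ", logical)
print("physical cores:", physical)
if logical and physical:
    print("hardware threads per core:", logical // physical)
```

On the 8-core Mac from the question, the last line would be expected to report 2, which is where the 16 comes from.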
I am not really familiar with shared clusters, but I am assuming that the time to complete a single task should not differ much from a laptop processor. I have C++ code which I ran on my laptop with an Intel(R) Core(TM) i7-4558U 2.80 GHz CPU and 16.0 GB RAM, running 64-bit Windows 10. On the other hand, I have results for the same code from a publication, obtained on a shared cluster with an Intel Xeon 2.3 GHz CPU and a 4 GB memory limit, running Linux. The program uses CPLEX as the solver: my laptop has IBM CPLEX 12.7 whereas the previous runs used IBM CPLEX 12.4 (Cplex, 2012). My runs seem to take 300 times longer than the reported results of the previous runs. Does this much difference make sense? If so, what could be the driver behind it?
This could be attributed to performance variability (see, for example, section 5 of the MIPLIB 2010 paper here). In a nutshell, minor differences in problem formulation (e.g., order of constraints, input format, etc.), or running on different platforms, can have a great effect on the time to solve. With CPLEX 12.7, you can use the interactive optimizer to help you evaluate variability.
I have been running several scientific program packages in conjunction with MPI using the following command
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the values of N and M depend on the physical CPU cores of my machine. For example, my machine has the following specification:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively, and found that this setting clearly gave the faster computation.
Why does setting N and M to 12 and 2 give better performance than the first case?
There is no absolute rule on how to run an MPI+OpenMP application.
The only advice is not to run an OpenMP process across more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today most systems are NUMA).
Then just experiment.
Some apps run best in flat MPI (e.g. one thread per task), while others work best with one MPI task per socket and all available cores for OpenMP.
Last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
For example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up time sharing and performance will be horrible.
Or, if you run with 12 OpenMP threads and the MPI task is bound to 12 cores, make sure the 12 cores are on the same socket (and not 6 on each socket).
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
Indeed, there is no absolute rule on how to run an MPI+OpenMP application.
I agree with all Gilles said,
so I want to talk about the CPU in your case.
The specification you gave shows that hyper-threading is enabled on the system,
but this does not always help. Your computer in fact has 12 physical cores.
So I advise you to try some combinations with M * N between 12 and 24,
like 12*1, 6*2, 6*3.
Which one is best depends on your application.
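The advice from the answers above can be sketched as a small enumeration, written here in Python. The function name hybrid_layouts is made up for illustration, and the topology values are taken from the lscpu output in the question (2 sockets, 6 cores per socket, 2 threads per core). It lists (N, M) pairs whose product lies between the number of physical cores and the number of logical CPUs, and keeps the M OpenMP threads of one MPI task within a single socket:

```python
def hybrid_layouts(sockets, cores_per_socket, threads_per_core):
    """List (N, M) pairs: N MPI tasks, M OpenMP threads per task.

    Keeps physical_cores <= N * M <= logical_cpus, and limits M to the
    hardware threads of one socket so a task never spans two sockets.
    """
    physical = sockets * cores_per_socket
    logical = physical * threads_per_core
    per_socket = cores_per_socket * threads_per_core
    layouts = []
    for m in range(1, per_socket + 1):
        for n in range(1, logical // m + 1):
            if physical <= n * m <= logical:
                layouts.append((n, m))
    return layouts

# Topology from the lscpu output in the question.
layouts = hybrid_layouts(sockets=2, cores_per_socket=6, threads_per_core=2)

# Print only the candidates that exactly fill the physical or logical CPUs.
for n, m in layouts:
    if n * m in (12, 24):
        print(f"mpirun -np {n} -x OMP_NUM_THREADS={m} program.exe")
```

This only generates candidates to time; it does not check how the MPI library actually binds tasks to cores, which still has to be verified separately.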
I have a laptop running Ubuntu on an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz. According to the Intel website for the above processor (located here), this processor has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overridden with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer thanks to a quick lesson from @RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores. Therefore, with one Matlab thread (i.e. one lab) running through it, a core may have very little "down-time" and hence there will be very little advantage to hyper-threading. My desktop is a Core i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by external factors, such as what other programs you might be running alongside Matlab.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.