Set number of cores in OpenMP - multithreading

I'm running my program on an Intel® Xeon® Processor E5-1650 v3
http://ark.intel.com/products/82765/Intel-Xeon-Processor-E5-1650-v3-15M-Cache-3_50-GHz
The processor has 6 CPUs (6 cores), and I'm trying to set the number of CPUs my program uses. My application uses OpenMP.
I'm not trying to set the number of threads, but the number of CPUs. How can I do that?

Have you tried to use environment variables controlling thread affinity?
If you're compiling your code with gcc, you may want to use GOMP_CPU_AFFINITY or OMP_PLACES.
For Intel compilers, there are KMP_AFFINITY and KMP_PLACE_THREADS, see the Intel documentation.
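As a rough illustration of the environment-variable approach, the sketch below builds an environment that asks an OpenMP runtime for one thread per chosen CPU and binds each thread to its place. The helper names `omp_affinity_env` and `run_pinned` and the binary `./my_openmp_app` are hypothetical; `OMP_NUM_THREADS`, `OMP_PLACES` and `OMP_PROC_BIND` are the standard OpenMP variables, and how strictly they are honored depends on the runtime.

```python
import os
import subprocess


def omp_affinity_env(cpus, base_env=None):
    """Return an environment requesting one OpenMP thread per listed CPU."""
    cpus = sorted(set(cpus))
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(len(cpus))
    # One place per CPU, written as an explicit list such as "{0},{1},{2}".
    env["OMP_PLACES"] = ",".join("{%d}" % c for c in cpus)
    # Ask the runtime to keep each thread bound to its place.
    env["OMP_PROC_BIND"] = "close"
    return env


def run_pinned(cmd, cpus):
    """Launch cmd (a list of arguments) with the affinity environment applied."""
    return subprocess.run(cmd, env=omp_affinity_env(cpus), check=True)


if __name__ == "__main__":
    # Print the place list that would be used for CPUs 0, 1 and 2.
    print(omp_affinity_env([0, 1, 2])["OMP_PLACES"])
    # Hypothetical usage: run_pinned(["./my_openmp_app"], [0, 1, 2])
```

Setting the variables in the launcher rather than inside the program keeps the binary unchanged, so the same build can be tried with different CPU sets.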

Related

Emulating a heterogeneous system, like an ARM Processor with P and E Cores

I'm trying to emulate a processor that consists of cores with different maximum frequencies per core, like ARM processors or newer Intel processors, which have a couple of Performance Cores and Efficiency Cores.
I tried it with QEMU, but I didn't get far. The only thing I found was qemu-system-aarch64, where you can configure cores per die and die count using nema, but I didn't find a possibility to change the frequency or core architecture for a specific die. Is it even possible with QEMU, or is there an alternative? Preferably the emulation should be able to run Linux.
For clarification, I'm trying to show that on a heterogeneous system, i.e. a processor with different core speeds, a certain framework works better than another one.
Thanks to Nate I found Intel Simics which is able to simulate heterogeneous systems.

Run a process on a specific CPU

Problem
I have an SoC containing, let's say, an Arm M7-core and an Arm A53-core. I want to program only the M7-core (Linux) and run a specific process on the A53-core.
Questions
Is that possible or should I program both of them?
I read about thread affinity in this article, but I am not sure whether affinity controls which CPU in the SoC runs the thread or which core within a CPU (an ARM CPU has several cores). Please help.
ARM big.LITTLE has three implementations: https://en.wikipedia.org/wiki/ARM_big.LITTLE . In the first two implementations you can only select a CPU pair (with an Arm M7-core and an Arm A53-core) to run your thread. Depending on the workload, your threads will be executed on an M7 or an A53.
Only in the Heterogeneous Multi-Processing (HMP) implementation does the OS scheduler see all M7 and A53 cores, so you can select a specific CPU type.
If the hardware has HMP, you can restrict your thread to an arbitrary set of cores using pthread_setaffinity_np ( https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html ). The CPU set macros (which manipulate core sets) identify cores by number, so you will have to discover which numbers are M7 or A53. The numbering is probably the same as in /proc/cpuinfo or /sys/devices/system/cpu/.
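The discovery step can be sketched roughly as follows, assuming an ARM Linux kernel whose /proc/cpuinfo prints a `processor` line and a `CPU part` line for each core (not every kernel does). The function names `cores_by_part` and `pin_to` are hypothetical, and Python's `os.sched_setaffinity` stands in here for the `pthread_setaffinity_np` call named above.

```python
import os
from collections import defaultdict


def cores_by_part(cpuinfo_text):
    """Map each 'CPU part' value to the logical CPU numbers that report it."""
    groups = defaultdict(list)
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "CPU part" and current is not None:
            groups[value.lower()].append(current)
    return dict(groups)


def pin_to(cpus):
    """Restrict the caller (pid 0) to the given CPUs and return the new mask."""
    os.sched_setaffinity(0, set(cpus))
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(cores_by_part(f.read()))
```

Once the groups are known, passing one group's CPU numbers to `pin_to` keeps the work on that core type. Which part ID corresponds to which core has to be looked up for the specific SoC.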

OpenMP and the number of cores vs CPUs

I'm wondering how OpenMP figures out how many threads it can run via the omp_get_max_threads() library call. I'm running on a CentOS Linux machine using gcc -fopenmp. My machine has 16 AMD Opteron(tm) Processor 6136 CPUs, each with 8 cores, all according to /proc/cpuinfo. If I run omp_get_num_procs() it returns 16. But omp_get_max_threads() also returns 16. Why isn't the max threads number 16*8?
When I run a program that uses 16 threads, I see the program in top running at ~1600% of CPU, and if I toggle 'Last used cpu (SMP)' that number moves around a bit. So the 1600% makes sense, but is there any way to know which cores of which CPUs the threads are running on?
I'm pretty new to OpenMP, so sorry if these questions seem naive.
You can use the hwloc tool set to know the binding of the threads of any application to the hardware threads/cores. You need only the name or the PID of the target running process. Here is an example:
$ hwloc-ps --pid 2038168 --threads --get-last-cpu-location
2038168 Machine:0 ./a.out
2038168 Core:5 a.out
2038169 Core:3 a.out
2038170 Core:1 a.out
2038171 Core:4 a.out
2038172 Core:0 a.out
2038173 Core:2 a.out
Here we can see that the process a.out (with the PID 2038168) uses 6 threads, each mapped to a different core.
However, the mapping of threads on cores over time can change if you do not configure OpenMP properly (a starting point is to set the environment variables OMP_PROC_BIND and OMP_PLACES).
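If hwloc is not installed, similar information can be read from /proc on Linux. The sketch below is a minimal stand-in, not how hwloc-ps itself works: it reads the `processor` field (field 39 in the proc man page) of each thread's stat file. The function names are hypothetical, and the value is only the CPU the thread last ran on, so it can change between reads unless the threads are bound.

```python
import os


def last_cpu_from_stat(stat_line):
    """Extract the 'processor' field (field 39) from a /proc stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so split after the last ')'; the remainder starts at field 3.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[39 - 3])


def thread_cpus(pid):
    """Return {thread id: last CPU} for every thread of a running process."""
    result = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                result[int(tid)] = last_cpu_from_stat(f.read())
        except OSError:
            pass  # the thread exited while we were reading
    return result


if __name__ == "__main__":
    # Inspect this Python process; pass another PID to inspect an OpenMP job.
    print(thread_cpus(os.getpid()))
```

Calling `thread_cpus` repeatedly on an unbound program should show the numbers moving around, much like the 'Last used cpu' column in top.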
Additionally, you can use hwloc-ls (lstopo) from the same tool set to understand the topology of your machine (how many cores there are, how many threads, how they are connected, etc.).
I am very surprised that you can have 16 "AMD Opteron(tm) Processor 6136 CPUs". This processor uses the G34 socket, which is available in up to 4-socket arrangements (and 8 dies). So please check this with hwloc-ls!
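A quick cross-check is also possible from /proc/cpuinfo itself. On x86 Linux each `processor` entry is one logical CPU, while `physical id` names the socket and `core id` the core within it. One plausible reading of the question, offered as an assumption, is that the 16 entries are 16 logical CPUs spread over fewer sockets, which would match omp_get_num_procs() returning 16. The helper name `count_topology` is hypothetical.

```python
def count_topology(cpuinfo_text):
    """Return (logical CPUs, physical cores, sockets) from x86-style cpuinfo."""
    logical = 0
    sockets = set()
    cores = set()
    phys = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between CPU entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
            phys = None  # a new entry starts; its socket is not known yet
        elif key == "physical id":
            phys = value
            sockets.add(phys)
        elif key == "core id" and phys is not None:
            # A core is identified by its socket and its id within the socket.
            cores.add((phys, value))
    return logical, len(cores), len(sockets)


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print("logical=%d cores=%d sockets=%d" % count_topology(f.read()))
```

If the socket count comes out far below 16, the "16 CPUs" in the question are logical processors rather than 16 separate packages.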
An alternative way is to use a profiling tool (such as Intel VTune).

CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN not supported in the Intel/AMD OpenCL CPU runtime

Conditions:
I installed AMD OpenCL version AMD-APP-SDK-v2.8-lnx64 and Intel OpenCL version *intel_sdk_for_ocl_applications_xe_2013_r2_sdk_3.1.1.11385_x64* (version identification couldn't be more complex) according to the description on an HPC server with a dual socket Xeon E5-2650, Xeon Phi coprocessor, 64GB host memory and Red Hat Enterprise Server 6.4.
Problem description:
I would like to do device fission with OpenCL to get around the NUMA issue. Unfortunately the device (Intel CPU) or maybe the Linux kernel doesn't seem to support CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN. I tried both Intel OpenCL and AMD OpenCL. Although the AMD OpenCL device query says that it supports the affinity domain option, it actually doesn't: when I try to run code with CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN, the clCreateSubDevices() function returns error code -30 (CL_INVALID_VALUE). I guess this is a bug in the current Intel OpenCL driver, according to a forum post.
Potential solution:
I thought that if I could select the first 16 parallel compute cores (8 cores + 8 hyper threads) out of the total 32 parallel compute cores, those would map to the first socket. Unfortunately Intel OpenCL randomly distributes the 16 parallel compute cores across the 32 cores. AMD OpenCL, on the other hand, selects the first 16 parallel compute cores, but the OpenCL compiler does a poor job on the kernel I'm running. So the no free lunch theorem applies here as well.
Questions:
Is there any way to specify which parallel compute cores the OpenCL should use for the computations?
Is there any way to overcome this NUMA issue with OpenCL?
Any comments regarding experiences with NUMA affinity are welcome.
Thank you!
UPDATE
A partial workaround, only applicable to single-socket testing:
(On Linux) Take all cores of one NUMA node offline, so the OpenCL ICD can only choose from the hardware threads of the other NUMA node. E.g. on a 2-socket system with 32 hardware threads:
sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu31/online"
....
sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu16/online"
I'm not sure whether this hack has side effects, but so far it seems to work (for testing at least).
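Rather than hard-coding cpu16 to cpu31, the CPU numbers of a node can be read from sysfs. The sketch below assumes the kernel exposes /sys/devices/system/node/nodeN/cpulist in the usual range format (for example "16-31"). The function names are hypothetical, and it only lists the `online` files that would be written, without changing anything.

```python
def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-7,16-23" into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        lo, sep, hi = part.partition("-")
        # A single number like "8" is a range of length one.
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return sorted(cpus)


def online_paths(node):
    """Return the sysfs 'online' file for every CPU of the given NUMA node."""
    with open("/sys/devices/system/node/node%d/cpulist" % node) as f:
        cpus = parse_cpulist(f.read())
    # Note: some CPUs (often cpu0) have no 'online' file and cannot be disabled.
    return ["/sys/devices/system/cpu/cpu%d/online" % c for c in cpus]


if __name__ == "__main__":
    # Show the CPU numbers for the range used in the example above.
    print(parse_cpulist("16-31"))
```

On systems where hyper-thread siblings are numbered non-contiguously, the cpulist for a node may contain two ranges, which is why reading it tends to be safer than assuming the upper half of the CPU numbers.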

What is the difference between x64 and IA-64?

I was on Microsoft's website and noticed two different installers, one for x64 and one for IA-64. Reference: Installing the .NET Framework 4.5, 4.5.1
My understanding is that IA-64 is a subclass of x64, so I'm curious why it would have a separate installer.
x64 is used as a short term for the 64-bit extensions of the "classical" x86 architecture; almost any "normal" PC produced in recent years has a processor based on this architecture.
AMD invented the AMD64 extensions; Intel was more or less forced to implement them, and called them first IA-32e, then EM64T and finally Intel 64 (actually, the AMD and Intel extensions aren't exactly the same, but they are almost identical).
Many people also call this stuff x86-64, to have a vendor-independent name and to stress the fact that it's the 64 bit evolution of the x86 architecture. All the "regular" PCs that are sold with "64 bit processors" run on x86-64 architecture.
IA-64 (Intel Architecture 64) is an almost completely unrelated 64 bit architecture (also known as Itanium), developed by Intel initially for high-end servers. It was said that Itanium could have been a replacement for the x86 architecture, but this architecture didn't have much success (for various reasons), so it's unlikely that you'll ever need the IA-64 installers.
For more information, you may have a look at the wikipedia articles on x86-64 and Itanium.
IA-64 is the Intel Itanium architecture. It is a Very Long Instruction Word (VLIW) instruction set.
x86_64 is the normal 64-bit architecture used by the processors in today's laptops and desktops. Such a processor is a dynamic (dynamically scheduled) processor.
The main difference between these two is this:
In VLIW, the compiler resolves the dependencies between instructions and schedules them appropriately. The processor merely executes them.
With a dynamic processor, the compiler just schedules the instructions without worrying about dependencies. The processor takes care of dependencies, reorders the instructions and executes them appropriately.
VLIW code depends on each chip's internal architecture, so the compiler needs to know that information. The advantage is that it can extract much more parallelism than dynamic processors can.
For dynamic processors, the code is independent of each chip's internal architecture. It just needs to follow the instruction set, so code compiled on one machine can run on other machines very easily. The disadvantage is that only limited parallelism can be exploited, and the internal logic and design are much more complex and intricate than in a VLIW processor.
Nevertheless, dynamic processors are mostly used by consumers (individuals) today, since they can run code compiled or generated on any machine. VLIW processors are used by servers and enterprises because of the parallelism they can deliver.
They are different.
IA-64 is Itanium, an architecture for servers.
x64 is what 64-bit Intel Core and AMD CPUs implement.
x64 is short for x86-64, which is an extension of the x86 instruction set.
IA-64 is the Itanium 64-bit architecture (by Intel).
IA-64 is for computers running Intel Itanium 64-bit processors. They do not support running 32-bit applications like x64 processors do. A special version of Windows is needed to run on these processors, thus the two different installers.
They have different instruction sets; this is the key point.
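To check which of the two installers applies to a given machine, the reported machine string can be classified. This is only a sketch: the helper name `arch_family` is hypothetical, and the strings it matches ("x86_64", "AMD64", "ia64") are the ones operating systems commonly report, which should be treated as an assumption rather than an exhaustive list.

```python
import platform


def arch_family(machine):
    """Classify a machine string as 'x64', 'IA-64' or 'other'."""
    m = machine.lower()
    if m in ("x86_64", "amd64", "x64"):
        return "x64"    # AMD64 / Intel 64, the 64-bit extension of x86
    if m == "ia64":
        return "IA-64"  # Itanium, a separate instruction set
    return "other"


if __name__ == "__main__":
    machine = platform.machine()
    print(machine, "->", arch_family(machine))
```

On an ordinary 64-bit PC this would be expected to report the x64 family, which matches the answers above: the IA-64 installer is only relevant on Itanium hardware.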
