Using Xeon Phi with only threads - multithreading

Is it possible to use Xeon Phi by just launching many threads, or is a special type of programming required to use Xeon Phi?

Intel has some fairly good math libraries, IPP and MKL. Reading between the lines of what Xeon Phi seems to be, I imagine Intel has versions of those libraries that exploit the very wide SIMD unit that is part of the architecture.
Intel's compiler can also insert multiple threads to execute for loops in parallel instead of in sequence. That is one way of exploiting the large number of cores that the Phi has.
So, with the right compiler and libraries, programming for the Phi could be fairly normal, until you start needing routines that the libraries don't provide.

You can read these documents for more information on how to tap the many available threads on the Xeon Phi:
http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture
http://software.intel.com/en-us/articles/choosing-the-right-threading-framework
and more on http://software.intel.com/en-us/mic-developer
To summarize, either manage threads manually (via TBB / pthreads / etc.), or use one of the supported parallel programming models:
OpenMP
MPI
Cilk Plus
OpenCL
OpenACC
Or use libraries that can automatically offload to the device, such as MKL or ArrayFire.
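For illustration, here is a minimal OpenMP sketch of the kind of loop this covers; the function and data names are made up, and on a Phi it would be built either natively or via Intel's offload support:

    // Illustrative OpenMP sketch: the runtime spreads iterations across the Phi's
    // many cores and hardware threads, and the compiler can vectorize the loop
    // body for the 512-bit SIMD units.
    #include <vector>

    void scale(std::vector<float>& data, float factor) {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(data.size()); ++i) {
            data[i] *= factor;
        }
    }

Compile with OpenMP enabled (for example -fopenmp or -qopenmp, depending on the compiler).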

Related

Does hyper threading change the binary machine code of a compiled program?

Does hyper threading change the binary code sequence of a compiled program?
If we had a compiled binary code of say:
10100011100011100101010010011111100011111110010111
If hyper threading were enabled, what would a thread represent?
Would it be just some section of this binary code?
How does the operating system allocate time intervals for these threads?
For parallelism:
Would the compiled binary code be any different?
How do the cores handle this binary sequence?
Just execute some section of the code in different cores?
How does the operating system allocate parallel tasks? Is there any specific structure?
Most programs are compiled to be run by a particular operating system, on any member of a broad family of processors (e.g., on Windows, for the x86-64 family). Within a given CPU family, there may be CPUs with different numbers of cores, and there may be cores with or without hyperthreading.
None of that changes the binary code. The same program typically will run on any member of the CPU family, without any changes.
The operating system in which the program runs may or may not be differently configured for different members of the processor family.
Sometimes, a program can be compiled to exploit the features of a specific CPU, but programs compiled in that way are unsuitable for distribution to different sites and/or PCs.
If we had a compiled binary code of say: 101000111000... If hyper threading were enabled, what would a thread represent? Would it be just some section of this binary code?
That's an unanswerable question. You can learn more about what "binary code" means by reading an introductory book on computer architecture. E.g., https://www.amazon.com/Elements-Computing-Systems-Building-Principles/dp/0262640686/

AVX float4/double4 struct

I am looking for AVX-256/512 code for a float4 / double4 struct that overloads the basic operations *, +, /, -, scaling by a scalar, etc., to get a quick performance boost from vector operations in code written using float4/double4. OpenCL has these data types built in, but C++ code running on the Xeon Phi needs new implementations that take advantage of the 512-bit SIMD units.
What you are seeking is Agner Fog's Vector Class Library (VCL). I have used it mostly to replace the vector types in OpenCL.
With the VCL, float4 is Vec4f and double4 is Vec4d. As with OpenCL, you don't need to worry about AVX vs. AVX-512: if you use Vec8d and compile for AVX, it will emulate AVX-512 using two AVX registers.
The VCL has all the operations you want, such as *, +, /, -, +=, -=, /=, *=, multiplication and division by a scalar, and many more features.
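To make this concrete, here is a small hedged sketch of what double4-style code looks like with the VCL (it assumes the library's vectorclass.h is on the include path; the other names are illustrative):

    #include "vectorclass.h"

    // double4-style math using Vec4d; the overloaded operators compile to AVX instructions.
    Vec4d axpy(Vec4d a, Vec4d x, Vec4d y) {
        return a * x + y;              // element-wise multiply and add
    }

    int main() {
        Vec4d x(1.0, 2.0, 3.0, 4.0);   // four doubles
        Vec4d y(0.5);                  // broadcast a scalar to all four lanes
        Vec4d r = axpy(Vec4d(2.0), x, y);
        double out[4];
        r.store(out);                  // write the result back to memory
    }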
The main difference between OpenCL and the VCL is that OpenCL basically creates a CPU dispatcher for you, whereas with the VCL you have to write a CPU dispatcher yourself (it includes documented example code for doing this). The VCL has optimized functions for SSE2 through AVX-512, so you can target several different instruction sets. There is even a special version of the VCL for the Knights Corner Xeon Phi.
Another feature from OpenCL that I miss is the syntax for permuting. In OpenCL, to reverse the order of the components of a float4 you can write v.wzyx, whereas with the VCL you would write permute4f<3,2,1,0>(v). It might be possible to create this syntax in C++, but I am not sure.
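As a hedged illustration of that syntactic difference (again assuming the VCL header is available):

    #include "vectorclass.h"

    // Reversing the components of a 4-lane float vector with the VCL's compile-time permute.
    Vec4f reverse_lanes(Vec4f v) {
        return permute4f<3, 2, 1, 0>(v);
    }

    // The OpenCL equivalent inside a kernel would simply be:  float4 r = v.wzyx;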
Using the VCL, OpenMP, and a custom CPU dispatcher I have largely replaced OpenCL on the CPU.

Direct CPU Threads or OpenCL

I have searched the various questions (and the web) but did not find a satisfactory answer.
I am curious about whether to use threads to directly load the cores of the CPU, or to use an OpenCL implementation. Is OpenCL just there to make multiple processors/cores more portable, meaning the code can be ported to either the GPU or the CPU, or is OpenCL faster and more efficient? I am aware that GPUs have more processing units, but that is not the question. So: direct multithreading in code, or OpenCL?
Sorry I have another question...
If the IGP shares PCI lanes with the discrete graphics card and its drivers cannot be loaded under Windows 7, I have to assume that it will not be available, even if you want to use only the processing cores of the integrated GPU. Is this correct, or is there a way to access the IGP without drivers?
EDIT: As @Yann Vernier points out in the comment section, I haven't been strict enough with the terms I used. So in this post I use the term thread as a synonym for work-item; I'm not referring to CPU threads.
I can't really compare OCL with other technologies that allow using the different cores of a CPU, as I have only used OCL so far.
However I might bring some input about OCL especially that I don’t really agree with ScottD.
First of all, even though an OCL kernel developed to run on a GPU will also run on a CPU, that doesn't mean it will be efficient. The reason is simply that OCL doesn't work the same way on a CPU as on a GPU. To get a good understanding of how it differs, see chapter 6 of "Heterogeneous Computing with OpenCL". To summarize, while the GPU will launch a bunch of threads within a given workgroup at the same time, the CPU will execute them on a core one thread after another within the same workgroup. See also section 3.4 of the standard about the two different types of programming models supported by OCL. This can explain why an OCL kernel could be less efficient on a CPU than "classic" code: because it was designed for a GPU. Whether a developer targets the CPU or the GPU is not a question of "serious work" but simply depends on which programming model best suits the need. Also, the fact that OCL supports the CPU as well is nice, since the code can degrade gracefully on a computer not equipped with a proper GPU (though it must be hard to find such a computer).
Regarding the AMD platform, I've noticed some problems with the CPU as well, on a laptop with an ATI GPU. I observed low performance on some of my code, and crashes as well, but the reason was that the processor was an Intel one. The AMD platform will report a CPU device as available even if it is an Intel CPU; however, it won't be able to use it as efficiently as it should. When I ran the exact same code targeting the CPU after installing (and using) the Intel platform, all the issues were gone. That's another possible reason for poor performance.
Regarding the iGPU, it does not share PCIe lanes; it is on the CPU die (at least for Intel), and yes, you need the driver to use it. I assume that you tried to install the driver and got a message like "your computer does not meet the minimum requirements..." or something similar. I guess it depends on the computer, but in my case I have a desktop equipped with an NVIDIA card and an i7 CPU (which has an HD4000 GPU). In order to use the iGPU I first had to enable it in the BIOS, which allowed me to install the driver. Of course only one of the two GPUs is used by the display at a time (depending on the BIOS setting), but I can access both with OCL.
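As a hedged sketch of how you would pick the CPU device explicitly (so you end up on the Intel CPU runtime rather than whichever platform happens to come first), using the standard OpenCL C API:

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_uint num_platforms = 0;
        clGetPlatformIDs(0, nullptr, &num_platforms);
        cl_platform_id platforms[16];
        if (num_platforms > 16) num_platforms = 16;
        clGetPlatformIDs(num_platforms, platforms, nullptr);

        for (cl_uint p = 0; p < num_platforms; ++p) {
            char name[256] = {0};
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, nullptr);

            cl_device_id cpu;
            cl_uint num_cpus = 0;
            // Not every platform exposes a CPU device; check before using it.
            if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_CPU, 1, &cpu, &num_cpus) == CL_SUCCESS
                && num_cpus > 0) {
                std::printf("CPU device available on platform: %s\n", name);
            }
        }
        return 0;
    }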
In recent experiments using the Intel OpenCL tools, we found that OpenCL performance was very similar to CUDA and intrinsics-based AVX code on gcc and icc -- way better than in earlier experiments (some years ago) where we saw OpenCL perform worse.

OpenCL, TBB, OpenMP

I have implemented a few normal looping applications in OpenMP, TBB and OpenCL. In all these applications, OpenCL gives far better performance than the others, even when I am only running it on the CPU with no specific optimizations done in the kernels. OpenMP and TBB give good performance too, but far less than OpenCL. What could be the reason? Both are CPU-specialized frameworks and should give performance at least equal to OpenCL.
My second concern is that, between OpenMP and TBB, OpenMP always performs better than TBB in my implementations, which I haven't tuned very carefully as I am not an expert. Is there a reason that OpenMP is normally better in performance than TBB? I think both of them, and even OpenCL, use the same kind of thread pooling at a low level... Any expert opinions? Thanks.
One advantage that OpenCL has over TBB and OpenMP is that it can take better advantage of SIMD parallelism in your hardware. Some OpenCL implementations will run your code such that each work item runs in a SIMD vector lane of the machine, as well as running on separate cores. Depending on the algorithm, this could provide lots of performance benefits.
C compilers can also take some advantage of SIMD parallelism using auto-vectorization, but the memory aliasing rules in C make it hard for this to work in some cases. Since OpenCL requires programmers to call out the work items and fence memory accesses explicitly, an OpenCL compiler can be more aggressive.
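A hedged illustration of the aliasing point (the GCC/Clang spelling __restrict__ is shown; the function names are made up):

    // Here the compiler must assume dst and src might overlap, which can inhibit
    // or complicate auto-vectorization.
    void add_may_alias(float* dst, const float* src, int n) {
        for (int i = 0; i < n; ++i) dst[i] += src[i];
    }

    // The restrict-style qualifiers promise the arrays don't overlap, so the loop
    // is straightforward to vectorize.
    void add_no_alias(float* __restrict__ dst, const float* __restrict__ src, int n) {
        for (int i = 0; i < n; ++i) dst[i] += src[i];
    }

    // An OpenCL kernel expresses the same computation as one work item per element,
    // so there is no loop for the compiler to reason about:
    //   __kernel void add(__global float* dst, __global const float* src) {
    //       size_t i = get_global_id(0);
    //       dst[i] += src[i];
    //   }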
In the end, it depends on your code. One could find an algorithm for which any of OpenCL, OpenMP, or TBB are best.
The OpenCL runtime for the CPU and MIC provided by Intel uses TBB under the hood. It's far from just "thread pooling at a low level", since it takes advantage of the sophisticated scheduling and partitioning algorithms provided by TBB for better load balance and thus better utilization of the CPUs.
As for TBB vs. OpenMP: usually, it comes down to incorrect measurements. For example, TBB has no implicit barrier like OpenMP's, so a warm-up loop is not enough; you have to make sure all the threads are created and that this overhead is not included in your measurements (see the sketch below). Another example: sometimes, compilers are not able to vectorize the same code with TBB that they vectorize with OpenMP.
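A hedged sketch of the warm-up point, assuming the TBB headers are available (sizes and names are illustrative):

    #include <tbb/parallel_for.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<float> data(1 << 20, 1.0f);

        // Warm-up: forces TBB to create its worker threads; the result is discarded,
        // so thread-creation overhead stays out of the timed region below.
        tbb::parallel_for(std::size_t(0), data.size(),
                          [&](std::size_t i) { data[i] *= 1.0f; });

        auto t0 = std::chrono::steady_clock::now();
        tbb::parallel_for(std::size_t(0), data.size(),
                          [&](std::size_t i) { data[i] = data[i] * 2.0f + 1.0f; });
        auto t1 = std::chrono::steady_clock::now();

        std::printf("timed region: %lld us\n", (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
        return 0;
    }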
OpenCL kernels are compiled for the given hardware. The potential for vendor/hardware specific optimisations is huge.

Why doesn't Linux use the hardware context switch via the TSS?

I read the following statement:
The x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system.
I am wondering:
Why doesn't Linux use the hardware support for context switch?
Isn't the hardware approach much faster than the software approach?
Is there any OS which does take advantage of the hardware context switch? Does windows use it?
Lastly, and as always, thanks for your patience and replies.
-----------Added--------------
http://wiki.osdev.org/Context_Switching got some explanation.
People as confused as me could take a look at it. 8^)
The x86 TSS is very slow for hardware multitasking and offers almost no benefits when compared to software task switching. (In fact, I think doing it manually beats the TSS a lot of times)
The TSS is also known for being annoying and tedious to work with, and it is not portable, even to x86-64. Linux aims to work on multiple architectures, so they probably opted for software task switching because it can be written in a machine-independent way. Also, software task switching provides a lot more power over what can be done and is generally easier to set up than the TSS.
I believe Windows 3.1 used the TSS, but at least the NT >5 kernel does not. I do not know of any Unix-like OS that uses the TSS.
Do note that the TSS is mandatory. What OSs do, though, is create a single TSS entry (per processor), and every time they need to switch tasks, they just modify this single TSS. Also, the only fields used in the TSS by software task switching are ESP0 and SS0, which are used to get from ring 3 code to ring 0 for interrupts. Without a TSS, there would be no known ring 0 stack, which would of course lead to a GPF and eventually a triple fault.
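A hedged sketch of that arrangement on 32-bit x86 (field and function names are illustrative; only the first few TSS fields are shown):

    #include <cstdint>

    struct Tss32 {
        uint32_t prev_task_link;
        uint32_t esp0;   // kernel stack pointer used on a ring 3 -> ring 0 transition
        uint32_t ss0;    // kernel stack segment for that transition
        // ... the remaining fields (esp1/ss1, esp2/ss2, cr3, general registers,
        // I/O map base, etc.) exist in the real structure but are unused by
        // software task switching.
    };

    static Tss32 per_cpu_tss;   // one TSS per processor, referenced by its GDT descriptor

    // On a software context switch, the OS just points esp0 at the next task's
    // kernel stack; the rest of the TSS never changes.
    void switch_kernel_stack(uint32_t next_task_kernel_stack_top) {
        per_cpu_tss.esp0 = next_task_kernel_stack_top;
    }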
Linux used to use HW-based switching, in the pre-1.3 timeframe iirc. I believe sw-based context switching turned out to be faster, and it is more flexible.
Another reason may have been minimizing arch-specific code. The first port of Linux to a non-x86 architecture was Alpha. Alpha didn't have TSS, so more code could be shared if all archs used SW switching. (Just a guess.) Unfortunately the kernel changelogs for the 1.2-1.3 kernel period are not well-preserved, so I can't be more specific.
Linux doesn't use a segmented memory model, so this segmentation specific feature isn't used.
x86 CPUs have many different kinds of hardware support for context switching, so the distinction isn't hardware vs software, but more how does an OS use the various hardware features available. It isn't necessary to use them all.
Linux is so efficiency focussed that you can bet that someone has profiled every option that is possible, and that the options currently used are the best available compromise.

Resources