Writing a CUDA program for more than one GPU - Linux

I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically, utilizing the resources of all available GPUs for the program?
Ideally, a utility would periodically report the available resources, and my program would launch as many threads to the GPUs as it can use.
Secondly, I am using Windows + Visual Studio for my development. I have read that CUDA is supported on Linux. What changes do I need to make in my program?

I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically?
For arbitrary kernels that you write, there is no API that I am aware of (certainly no CUDA API) that "automatically" makes use of multiple GPUs. Today's multi-GPU aware programs often use a strategy like this:
detect how many GPUs are available
partition the data set into chunks based on the number of GPUs available
successively transfer the chunks to each GPU, and launch the computation kernel on each GPU, switching GPUs using cudaSetDevice().
A program that approximately follows the above approach is the CUDA simpleMultiGPU sample code. Once you have worked out the methodology for 2 GPUs, it's not much additional effort to go to 4 or 8 GPUs. This of course assumes your work is already separable and the data/algorithm partitioning work is "done". A minimal host-side sketch of the pattern is shown below.
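As a rough illustration (not the sample code itself), the host side might look like this; the kernel process, the chunk sizes, and the omission of error checking are all placeholders:

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <vector>

    // Hypothetical kernel standing in for whatever computation you partition.
    __global__ void process(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void run_on_all_gpus(float *host_data, int n)
    {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);                 // step 1: detect GPUs
        int chunk = (n + num_gpus - 1) / num_gpus;     // step 2: partition data
        std::vector<float*> dev(num_gpus, nullptr);

        for (int d = 0; d < num_gpus; ++d) {           // step 3: one chunk per GPU
            int offset = d * chunk;
            int count  = std::min(chunk, n - offset);
            if (count <= 0) break;
            cudaSetDevice(d);                          // switch devices
            cudaMalloc(&dev[d], count * sizeof(float));
            cudaMemcpyAsync(dev[d], host_data + offset, count * sizeof(float),
                            cudaMemcpyHostToDevice);
            process<<<(count + 255) / 256, 256>>>(dev[d], count);
        }
        for (int d = 0; d < num_gpus; ++d) {           // gather the results
            int offset = d * chunk;
            int count  = std::min(chunk, n - offset);
            if (count <= 0) break;
            cudaSetDevice(d);
            cudaMemcpyAsync(host_data + offset, dev[d], count * sizeof(float),
                            cudaMemcpyDeviceToHost);
            cudaDeviceSynchronize();
            cudaFree(dev[d]);
        }
    }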
I think this is an area of active research in many places, so if you do a google search you may turn up papers like this one or this one. Whether these are of interest to you will probably depend on your exact needs.
There are some new developments with CUDA libraries available with CUDA 6 that can perform certain specific operations (e.g. BLAS, FFT) "automatically" using multiple GPUs. To investigate this further, review the relevant CUBLAS XT documentation and CUFFT XT multi-GPU documentation and sample code. As far as I know, at the current time, these operations are limited to 2 GPUs for automatic work distribution. And these allow for automatic distribution of specific workloads (BLAS, FFT), not arbitrary kernels.
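For a sense of what the CUBLAS-XT path looks like, here is a hedged sketch (assuming two GPUs are present and the matrices already live in host memory; the names m, n, k, A, B, C and the function name are illustrative):

    #include <cublasXt.h>

    // The library tiles the matrices and spreads the GEMM over the selected GPUs.
    void multi_gpu_sgemm(const float *A, const float *B, float *C,
                         size_t m, size_t n, size_t k)
    {
        cublasXtHandle_t handle;
        cublasXtCreate(&handle);

        int devices[2] = {0, 1};                    // assume two visible GPUs
        cublasXtDeviceSelect(handle, 2, devices);

        const float alpha = 1.0f, beta = 0.0f;
        // column-major C = alpha*A*B + beta*C, with operands in host memory
        cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                      m, n, k, &alpha, A, m, B, k, &beta, C, m);

        cublasXtDestroy(handle);
    }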
Secondly, I am using Windows + Visual Studio for my development. I have read that CUDA is supported on Linux. What changes do I need to make in my program?
With the exception of the OGL/DX interop APIs, CUDA is mostly orthogonal to the choice of Windows or Linux as a platform. The typical IDEs are different (Windows: Nsight Visual Studio Edition, Linux: Nsight Eclipse Edition), but your code changes will mostly consist of ordinary porting differences between Windows and Linux. If you want to get started with Linux, follow the getting started document.

Related

Which Linux OS supports AVX-512 VNNI (Vector Neural Network Instruction)?

I need to deploy an EC2 instance where VNNI (Vector Neural Network Instruction) is supported. There are some EC2 instance types that support it.
From AWS:
Intel Deep Learning Boost (Intel DL Boost): A new set of built-in processor technologies designed to accelerate AI deep learning use cases. The 2nd Gen Intel Xeon Scalable processors extend Intel AVX-512 with a new Vector Neural Network Instruction (VNNI/INT8) that significantly increases deep learning inference performance over previous generation Intel Xeon Scalable processors (with FP32), for image recognition/segmentation, object detection, speech recognition, language translation, recommendation systems, reinforcement learning and others. VNNI may not be compatible with all Linux distributions. Please check documentation before using.
It is mentioned that VNNI may not be compatible with all Linux distributions. So, which Linux distribution supports VNNI? I am also not sure as to which documentation this statement refers to.
No kernel support is needed beyond that for AVX-512 (i.e. context switch handling of the new AVX-512 zmm and k registers). AVX-512VNNI instructions just operate on those registers, so there's no new architectural state to save/restore on context switch. https://en.wikichip.org/wiki/x86/avx512_vnni / https://en.wikipedia.org/wiki/AVX-512#VNNI
(Unlike AMX (Advanced Matrix Extensions), new in Sapphire Rapids; that does introduce large new "2D tile" registers, 8x 1 KiB, which context switches need to handle.¹)
The other relevant thing for distros is compiler versions, like GCC or clang. https://godbolt.org/z/668rvhWPx shows GCC 8.1 and clang 7.0 (both released in 2018) compiling the AVX-512 VNNI intrinsic _mm512_dpbusd_epi32 with -march=icelake-server or -march=icelake-client. Versions before that fail, so those are the minimum versions. (Or clang 6.0 for -mavx512vnni, but that doesn't enable other things an Ice Lake CPU supports, or set tuning options.)
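For reference, a minimal use of that intrinsic looks like the following (illustrative function and parameter names; compile with one of the -march options above):

    #include <immintrin.h>

    // vpdpbusd: multiply unsigned 8-bit elements of a_u8 by signed 8-bit
    // elements of b_s8, sum each group of four products, and add the sums
    // into the 32-bit lanes of the accumulator.
    __m512i dot_accumulate(__m512i acc, __m512i a_u8, __m512i b_s8)
    {
        return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
    }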
So if you want to use the latest hotness, you need a compiler that's at least somewhat up to date. It's generally a good idea to use a compiler newer than the CPU you're using, so compiler devs have had a chance to tweak tuning settings for it. And code-gen from intrinsics, especially newish instruction-sets like AVX-512, has generally improved over compiler versions, so if you care about performance of the generated code, you typically want a newer compiler version. (Regressions happen for some releases for some loops/functions, and thus for some programs, but on average newer compilers make faster code than old ones. That's a big part of what compiler devs spend time improving.)
You can install a new compiler on an old distro via backport packages or manually. Or you can just use a distro release that isn't old and crusty.
Footnote 1: See also a Phoronix article re: non-empty AMX register state keeping the CPU from doing a deep sleep. Normally CPUs fully power down the core in deeper sleep states, stashing registers somewhere that stays powered. I'm guessing that they didn't provide space for AMX tiles to do that, so having state there prevents sleep. So if you're using AMX, you'll want a Linux kernel of at least 5.19.
In AWS, the instance type and OS combination that worked for me:
EC2 instance type: m5n.large (m5n instance family supports AVX-512 VNNI)
OS: Amazon Linux 2 (other Linux distributions should work as well, as explained by @BasileStarynkevitch and @PeterCordes).
For curious minds: What Linux distribution is the Amazon Linux AMI based on?

CPU/thread usage on M1 Pro (Apple Silicon) using OpenMP

Hope someone knows the answer to this...
I have a code that compiles perfectly well with OpenMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all of the 8 or 10 cores I have.
I am setting the threads variable correctly with export OMP_NUM_THREADS=10, and the code correctly identifies that it is supposed to run with 10 threads (see the Activity Monitor screenshot below):
[Activity Monitor screenshot: the code is compiled for Apple Silicon and uses 10 threads, but not much of the available CPU.]
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial on x86 architectures.
Not really an answer, but long for a comment...
If both LLVM and GCC behave the same then it's not an OpenMP runtime issue. (And your monitor output shows that the correct number of threads have been created). I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface to bind threads to logical CPUs.
As well as that, there's the issue of having some fast and some slow cores. That could mean that statically scheduled loops are inefficient.
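If that is what's happening, switching to a dynamic schedule is one thing to try; a minimal illustration (not specific to libsharp) would be:

    #include <omp.h>

    // Dynamic scheduling hands out chunks as threads finish, so the fast
    // performance cores naturally take more iterations than the slow
    // efficiency cores, instead of everyone waiting on the slowest thread.
    void scale(double *x, long n)
    {
        #pragma omp parallel for schedule(dynamic, 1024)
        for (long i = 0; i < n; ++i)
            x[i] *= 2.0;
    }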
I'm also confused by the fact that you seem to show multiple instances of your code running at the same time, so you are explicitly causing over-subscription of the logical CPUs...

Monitoring the instructions of a running program in Ubuntu?

I'm a little stuck here.
The idea is that I'd like to get a file of every instruction run by a program during its execution. I'd like to do it with just the executable in hand (no source) and be able to determine what operation is occurring on what address, and when.
For example, I'd like to be able to run it on Google Chrome, Firefox, etc.
I want to use this for a performance prediction system I'm working on. I figure that if I'm able to obtain each instruction, in the order it is executed on system 1, I can attempt to simulate/model the run time of an identical program being run on system 2, because I'll be able to predict (although I know not with 100% accuracy) L1/L2 cache misses, L1/L2 cache hits, TLB hits/misses, page faults, time taken on floating-point multiplication operations, etc.
I'd like to try to do this on two different systems:
System 1: Ubuntu 10.10 on Intel Core 2 Duo CPU
System 2: Ubuntu 12.04 on system with 2x AMD Sixteen Core Opteron model 6274
(I can definitely change the OSes as necessary, but would prefer to stay with Ubuntu, if possible)
Is this possible / how could I go about doing it? I know with debuggers, you can use them to step through everything, but I don't have the source available.
I think you can use QEMU (or even Bochs) or Valgrind to monitor every executed instruction. They are x86 binary translation tools (except Bochs, which is an interpreter of x86 code). There is a Valgrind tool called cachegrind (plus the kcachegrind GUI), which is ready to emulate a cache by instrumenting every memory access and simulating some L1/L2 cache model (sizes may be configured via command-line options).
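For example, a cachegrind run with explicitly configured cache parameters might look like this (the program name and the size/associativity/line-size triples are illustrative):

    valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=8388608,16,64 ./my_program

The results land in a cachegrind.out.<pid> file, which cg_annotate or kcachegrind can then display.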
To go deeper (into the pipeline), you may want to look at the free PTLsim (http://www.ptlsim.org/).

Matlab 2011a: Use all Cores Available on 64-bit Linux?

Hi, I've looked online but I can't seem to find an answer to whether I need to do anything to make MATLAB use all cores. From what I understand, multi-threading has been supported since 2007. On my machine MATLAB only uses one core at 100% and the rest hang at ~2%. I'm using 64-bit Linux (Mint 12). On my other computer, which has only 2 cores and is 32-bit, MATLAB seems to be utilizing both cores at 100%. Not all of the time, but in a sufficient number of cases. On the 64-bit, 4-core PC this never happens.
Do I have to do anything on 64-bit to get MATLAB to use all the cores whenever possible? I had to do some custom linking after install, as MATLAB wasn't finding the libraries (e.g. libc.so.6) because it wasn't looking in the correct places.
As standard, since the latest release, you can use up to 12 cores using the Parallel Computing Toolbox. Without this toolbox, I guess you're out of luck. Any additional cores could be accessed by the MATLAB Distributed Computing Server, where you actually pay per number of worker threads.
To make MATLAB use your multiple cores you have to run:
matlabpool open
And it of course works better if you actually have multithreaded code (such as using the spmd function or parfor loops).
More info at the Matlab homepage
MATLAB has only a single thread for computation.
That said, multiple threads would be created for certain functions which use the multithreaded features of the BLAS libraries that it uses underneath.
Thus, you would only be able to gain a 'multi-threaded' advantage if you are calling functions which use these multi-threaded BLAS libraries.
This link has information on the list of functions which are multithreaded.
Now, for the use of your cores, that would depend on your OS. I believe the OS would have to load-balance your threads across all cores. One CANNOT set affinities for threads from within MATLAB. One can, however, set worker MATLAB processes to have affinities to cores from within the Parallel Computing Toolbox.
However, you could always try manually setting the affinity for the MATLAB process to all your processors, using the details available at the following link for Linux.
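In practice that usually means something like taskset (the PID and core list below are illustrative):

    taskset -cp 0-3 12345    # allow process 12345 to run on cores 0-3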
Windows users can simply right click on the process in the task manager and set affinity.
My understanding is that this is only a request to the OS and is not a hard binding rule that the OS must adhere to.

Any C++ libraries to run a single program on multiple PCs (i.e. "use grid computing to run my app")

I'm after a method of converting a single program to run on multiple computers on a network (think "grid computing").
I'm using MSVC 2007 and C++ (non-.NET).
The program I've written is ideally suited for parallel programming (it's doing analysis of scientific data), so the more computers the better.
The classic answer for this would be MPI (Message Passing Interface). It requires a bit of work to get your program to work well with message passing, but the end result is that you can easily launch your executable across a cluster of machines that are running an MPI daemon.
There are several implementations. I've worked with MPICH, but I might consider doing this with Boost MPI (which didn't exist last time I was in the neighborhood).
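The shape of an MPI program is roughly the following (a hedged sketch: the workload, the item count, and the reduction are placeholders for whatever your analysis actually does):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many processes total?

        const int total_items = 1000000;        // illustrative workload size
        int begin = rank * total_items / size;  // each rank takes one slice
        int end   = (rank + 1) * total_items / size;
        double local_sum = 0.0;
        for (int i = begin; i < end; ++i)
            local_sum += i * 0.5;               // placeholder analysis step

        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                   MPI_SUM, 0, MPI_COMM_WORLD); // combine results on rank 0
        if (rank == 0)
            std::printf("result = %f\n", global_sum);
        MPI_Finalize();
        return 0;
    }

You would then start it across machines with something like mpiexec -n 16 ./analysis, with the host list supplied by your MPI implementation's launcher.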
Firstly, this topic is covered here:
https://stackoverflow.com/questions/2258332/distributed-computing-in-c
Secondly, a search for "C++ grid computing library", "grid computing for visual studio" and "C++ distributed computing library" returned the following:
OpenMP+OpenMPI. OpenMP handles the running of a single C++ program on multiple CPU cores within the same machine; OpenMPI handles the messaging between multiple machines. OpenMP + OpenMPI = grid computing. (A minimal hybrid sketch appears after this list.)
POP-C++, see http://gridgroup.hefr.ch/popc/.
Xoreax Grid Engine, see http://www.xoreax.com/high_performance_grid_computing.htm. Xoreax focuses on speeding up builds of Visual Studio, but the Xoreax Grid Engine can also be applied to generic applications. Looking at http://www.xoreax.com/xge_xoreax_grid_engine.htm, we see the quote: "Once a task-set (a set of tasks for distribution along with their dependency definitions) is defined through one of the interfaces described below, it can be executed on any machine running an IncrediBuild Agent." See the accompanying CodeProject article at http://www.codeproject.com/KB/showcase/Xoreax-Grid.aspx.
Alchemi, see http://www.codeproject.com/KB/threads/alchemi.aspx.
RightScale, see http://www.rightscale.com/pdf/Grid-Whitepaper-Technical.pdf. A quote from the examples section of this paper: "Pharmaceutical protein analysis: Several million protein compound comparisons were performed in less than a day – a task that would have taken over a week on the customer’s internal resources ..."
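As noted in the first list item above, here is a hedged sketch of the OpenMP+MPI combination (the loop body is a placeholder; compile with your MPI wrapper compiler and -fopenmp or equivalent):

    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // MPI spreads the iteration space over machines (one rank per slice);
        // OpenMP spreads each rank's slice over the cores of that machine.
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < 1000000; i += size)
            local += 1.0 / (i + 1);             // placeholder computation

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }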
