CUDA/PyCUDA: Which GPU is running X11? - linux

In a Linux system with multiple GPUs, how can you determine which GPU is running X11 and which is completely free to run CUDA kernels? In a system that has a low-powered GPU for X11 and a higher-powered GPU for kernels, a few heuristics can pick out the faster card. But on a system with two equal cards that method cannot be used. Is there a CUDA and/or X11 API to determine this?
UPDATE: The command 'nvidia-smi -a' shows whether a "display" is connected or not. I have yet to determine whether this means physically connected, logically connected (running X11), or both. Running strace on this command shows lots of ioctls being invoked and no calls to X11, so I assume the card is reporting that a display is physically connected.

There is a device property, kernelExecTimeoutEnabled, in the cudaDeviceProp structure which indicates whether the device is subject to a display watchdog timer. That is the best indicator of whether a given CUDA device is running X11 (or the Windows/Mac OS equivalent).
In PyCUDA you can query the device status like this:
In [1]: from pycuda import driver as drv
In [2]: drv.init()
In [3]: print(drv.Device(0).get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT))
1
In [4]: print(drv.Device(1).get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT))
0
Here device 0 has a display attached, and device 1 is a dedicated compute device.
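For completeness, the same check can be done from CUDA C through the runtime API. The sketch below is my own illustration rather than code from the answer; it only relies on cudaGetDeviceCount and cudaGetDeviceProperties:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* kernelExecTimeoutEnabled == 1 means a display watchdog is active,
           i.e. this device is most likely driving X11 (or the OS equivalent). */
        printf("device %d (%s): watchdog %s\n", dev, prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    }
    return 0;
}

Build it with nvcc (the file name is up to you) and run it on the machine in question.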

I don't know of any library function that could check that. However, one "hack" comes to mind:
X11, or any other system component that manages a connected monitor, must consume some of the GPU memory.
So check whether both devices report the same amount of total global memory through cudaGetDeviceProperties (the totalGlobalMem field).
If they do, try allocating that (or an only slightly lower) amount of memory on each of the GPUs and see which one fails to do so (cudaMalloc returning an error code).
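A rough sketch of that probe, assuming an arbitrary 90% allocation threshold (the fraction is my choice, not part of the original suggestion):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* Try to grab ~90% of the reported total memory; the device driving
           a display is expected to fail, because X11 already holds part of it. */
        size_t probe = (size_t)(prop.totalGlobalMem * 0.9);
        void *p = NULL;
        cudaSetDevice(dev);
        cudaError_t err = cudaMalloc(&p, probe);
        printf("device %d: totalGlobalMem=%zu, probe of %zu bytes %s\n",
               dev, (size_t)prop.totalGlobalMem, probe,
               err == cudaSuccess ? "succeeded" : "failed");
        if (err == cudaSuccess)
            cudaFree(p);
        cudaDeviceReset();  /* tear down the context before the next device */
    }
    return 0;
}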
Some time ago I read somewhere (I don't remember where) that if you increase your monitor resolution while there is an active CUDA context on the GPU, the context may get invalidated. That hints that the above suggestion might work. Note however that I never actually tried it; it's just my wild guess.
If you manage to confirm that it works, or that it doesn't, let us know!

Related

What causes dma_map_page/dma_unmap_page to take longer time on some hardware?

I've been programming a Linux kernel module for several years for a PCIe device. One of the main features is transferring data from the PCIe card to the host memory using DMA.
I'm using streaming DMA, i.e. it's the user program that allocates the memory, and my kernel module has to do the job of locking the pages and creating the scatter-gather structure. It works correctly.
However, when used on some more recent hardware with Intel processors, the function calls dma_map_page and dma_unmap_page take much longer to execute.
I've tried using dma_map_sg and dma_unmap_sg; they take approximately the same, longer time.
I've tried splitting the dma_unmap_sg into a first call to dma_sync_sg_for_cpu, followed by a call to dma_unmap_sg_attrs with the attribute DMA_ATTR_SKIP_CPU_SYNC. It works correctly, and I can see the additional time is spent on the unmap operation, not on the sync.
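For reference, that split looks roughly like the sketch below; dev and sgt are placeholders for the driver's own struct device and previously mapped sg_table, and the transfer direction is assumed to be device-to-host:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Sketch: sync the buffer for the CPU explicitly, then unmap while telling
 * the DMA core to skip the (already performed) CPU sync. */
static void my_unmap_split(struct device *dev, struct sg_table *sgt)
{
    /* Make the DMA'd data visible to the CPU first ... */
    dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);

    /* ... then unmap without repeating the CPU sync. */
    dma_unmap_sg_attrs(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE,
                       DMA_ATTR_SKIP_CPU_SYNC);
}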
I've tried playing with the Linux command-line parameters related to the IOMMU (on, force, strict=0), and also intel_iommu, with no change in the behavior.
Some other hardware shows a decent transfer rate, i.e. more than 6 GB/s on PCIe3 x8 (max 8 GB/s).
The issue on some recent hardware is limiting the transfer rate to ~3 GB/s. (I've checked that the card is correctly configured for PCIe3 x8, and the programmer of the Windows device driver manages to achieve 6 GB/s on the same system. Things are more hidden behind the curtain in Windows and I cannot get much information from him.)
On some hardware the behavior is either normal or slow, depending on the Linux distribution (and the Linux kernel version, I guess). On some other hardware the roles are reversed, i.e. the slow one becomes the fast one and vice versa.
I cannot figure out the cause of this. Any clue?
The trouble was bounce buffers. I didn't know about those.
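As background (not part of the original answer): bounce buffering via swiotlb typically kicks in when a buffer's physical address falls outside the DMA mask the driver declared. If the hardware really supports 64-bit addressing, one common way to rule this out is to widen the mask at probe time; the helper name below is hypothetical:

#include <linux/dma-mapping.h>

/* Hypothetical probe-time helper: declare full 64-bit DMA capability so the
 * core does not have to bounce high buffers through swiotlb. Falls back to
 * 32 bits if the wider mask is rejected. Only valid if the device can truly
 * address 64 bits. */
static int my_pcie_dma_setup(struct device *dev)
{
    if (!dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64)))
        return 0;
    return dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
}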

OpenGL: render time limit on linux

I'm implementing some computation algorithm via OpenGL and Qt. All computations are executed in fragment shader.
Sometimes when I try to execute some heavy computations (that take more than 5 seconds on the GPU), OpenGL aborts the computation before it ends. I suppose this is something like TDR on Windows.
I think I should split the input data into several parts, but I need to know how long a computation is allowed to take.
How can I obtain the render time limit on Linux (a cross-platform solution would be even better)?
I'm afraid this is not possible. After a lot of scouring through the documentation of both X and Wayland, I could not find anything mentioning GPU watchdog timer settings, so I believe this is driver-specific and likely inaccessible to the user (that or I am terrible at searching).
It is however possible to disable this watchdog under X on NVIDIA hardware by adding a line to your xorg.conf, which is then passed on to the graphics driver.
Option "Interactive" "boolean"
This option controls the behavior of the driver's watchdog, which attempts to detect and terminate GPU programs that get stuck, in order to ensure that the GPU remains available for other processes. GPU compute applications, however, often have long-running GPU programs, and killing them would be undesirable. If you are using GPU compute applications and they are getting prematurely terminated, try turning this option off.
Note that even the NVIDIA docs don't mention a numeric quantity for the timeout.
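In practice that means something like the following in the Device section of xorg.conf; the identifier below is just a placeholder for your own configuration:

Section "Device"
    Identifier "nvidia-gpu0"
    Driver     "nvidia"
    Option     "Interactive" "false"
EndSection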

Why do we need a bootloader in an embedded device?

I'm working with an embedded Linux kernel on an ARM Cortex-A8.
I know how the bootloader works and what job it does. But I've got a question: why do we need a bootloader, and why was the bootloader born?
Why can't we directly load the kernel into RAM from flash memory without a bootloader? If we load it, what will happen? In fact, the processor will not support it, but why do we follow this procedure?
In the context of Linux, the boot loader is responsible for some predefined tasks. As this question is arm tagged, I think that ARM booting might be a useful resource. Specifically, the boot loader was/is responsible for setting up an ATAG list that describes the amount of RAM, a kernel command line, and other parameters. One of the most important parameters is the machine type. With device trees, an entire description of the board is passed. This makes a stock ARM Linux impossible to boot without some code to set up the parameters as described.
The parameters allow one generic Linux kernel to support multiple devices. For instance, an ARM Debian kernel can support hundreds of different board types. U-Boot or another boot loader can dynamically determine this information, or it can be hard-coded for the board.
You might also like to look at bootloader info page here at stack overflow.
A basic system might be able to set up the ATAGs and copy NOR flash to SRAM. However, it is usually a little more complex than this. Linux needs RAM set up, so you may have to initialize an SDRAM controller. If you use NAND flash, you have to handle bad blocks and the copy may be a little more complex than memcpy().
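As an illustration only, the core of the simplest possible first stage is little more than a copy and a jump; every address and size below is a made-up placeholder for a real board's memory map:

/* Minimal first-stage loader sketch: copy the next stage from memory-mapped
 * NOR flash into RAM and jump to it. A real loader would first set up clocks
 * and the SDRAM controller, and would handle NAND bad blocks if needed. */
#define FLASH_IMAGE  ((const unsigned char *)0x08000000)  /* placeholder */
#define RAM_LOAD     ((unsigned char *)0x80008000)        /* placeholder */
#define IMAGE_SIZE   (512 * 1024)                         /* placeholder */

void first_stage_boot(void)
{
    const unsigned char *src = FLASH_IMAGE;
    unsigned char *dst = RAM_LOAD;
    unsigned long i;

    for (i = 0; i < IMAGE_SIZE; i++)
        dst[i] = src[i];

    /* Jump to the entry point of the copied image; it never returns. */
    ((void (*)(void))RAM_LOAD)();
}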
Linux often has some latent driver bugs where a driver will assume that a clock is initialized. For instance, if U-Boot always initializes an Ethernet clock for a particular machine, the Linux Ethernet driver may have neglected to set up this clock. This can be especially true with clock trees.
Some systems require boot image formats that are not supported by Linux; for example, a special header that can initialize hardware immediately, such as configuring which device to read the initial code from. Additionally, there is often hardware that should be configured immediately; a boot loader can do this quickly, whereas the normal structure of Linux may delay this significantly, resulting in I/O conflicts, etc.
From a pragmatic perspective, it is simpler to use a boot loader. However, there is nothing to prevent you from altering Linux's source to boot directly from it, although it may amount to pasting the boot loader code directly onto the start of Linux.
See Also: Coreboot, Uboot, and Wikipedia's comparison. Barebox is a lesser known, but well structured and modern boot loader for the ARM. RedBoot is also used in some ARM systems; RedBoot partitions are supported in the kernel tree.
A boot loader is a computer program that loads the main operating system or runtime environment for the computer after completion of the self-tests.
^ From Wikipedia Article
So basically the bootloader does just what you wanted: copying data from flash into operating memory. It's really that simple.
If you want to know more about bootstrapping the OS, I highly recommend you read the linked article. Apart from tests, the boot phase also consists of checking peripherals and some other things. Skipping them makes sense only on very simple embedded devices, and that's why their bootloaders are even simpler:
Some embedded systems do not require a noticeable boot sequence to begin functioning and when turned on may simply run operational programs that are stored in ROM.
The same source
The primary bootloader is usually built into the silicon and performs the load of the first USER code that will be run in the system.
The bootloader exists because there is no standardized protocol for loading the first code, since it is chip dependent. Sometimes the code can be loaded through a serial port, a flash memory, or even a hard drive. It is the bootloader's function to locate it.
Once the user code is loaded and running, the bootloader is no longer used and the correctness of the system's execution is the user's responsibility.
In the embedded Linux chain, the primary bootloader will set up and run U-Boot. Then U-Boot will find the Linux kernel and load it.
Why can't we directly load the kernel into RAM from flash memory without a bootloader? If we load it, what will happen? In fact, the processor will not support it, but why do we follow this procedure?
Bartek, Artless, and Felipe all give parts of the picture.
Every embedded processor type (e.g. 386EX, Cortex-A53, EM5200) will do something automatically when it is reset or powered on. Sometimes that something is different depending on whether the power is cycled or the device is reset. Some embedded processors allow you to change that something based on voltages applied to different pins when the device is powered or reset.
Regardless, there is a limited amount of something that a processor can do, because of the physical space on-processor required to define that something, whether it is on-chip FLASH, instruction micro-code, or some other mechanism.
This limit means that the something is
fixed purpose, does one thing as quickly as possible.
limited in scope and capability, typically loading a small block of code (often a few kilobytes or less) into a fixed memory location and executing from the start of the loaded code.
unmodifiable.
So what a processor does in response to reset or power-cycle cannot be changed, and cannot do very much, and we don't want it to automatically copy hundreds of megabytes or gigabytes into memory which may not exist or may not be initialized, and which could take a looooong time.
So....
We set up a small program which is smaller than the smallest size permitted across all of the devices we are going to use. That program is stored wherever the something needs it to be.
Sometimes the small program is U-Boot. Sometimes even U-Boot is too big for initial load, so the small program then in turn loads U-Boot.
The point is that whatever gets loaded by the something, is modifiable as needed for a particular system. If it is U-Boot, great, if not, it knows where to load the main operating system or where to load U-Boot (or some other bootloader).
U-Boot (speaking of bootloaders in general) then configures a minimal set of devices, memory, chip settings, etc., to enable the main OS to be loaded and started. The main OS init takes care of any additional configuration or initialization.
So the sequence is:
Processor power-on or reset
Something loads initial boot code (or U-Boot style embedded bootloader)
Initial boot code (may not be needed)
U-Boot (or other general embedded bootloader)
Linux init
The kernel requires the hardware on which it runs to be in a particular state. All the hardware you use needs to be checked for its state and initialized for further operation. This is one of the main reasons to use a boot loader in an embedded (or any other) environment, apart from its use to load a kernel image into RAM.
When you turn on a system, the RAM is also not in a useful state (fully initialized and ready to use) for us to load the kernel into it. Therefore, we cannot load a kernel directly (to answer your question), and thus arises the need for a construct to initialize it.
Apart from what is stated in all the other answers - which is correct - in some cases the system has to go through different execution modes; take TrustZone on secure ARM chips as an example. It is still possible to consider this a sort of HW initialization, but what makes it peculiar is that there are additional limitations (e.g. available memory) that make it impractical, if not impossible, to do everything in a single binary, so multiple bootloader stages are used.
Furthermore, for security reasons, each of them is signed and can perform its job only if it meets the security requirements.

Monitoring the instructions of a running program in ubuntu?

I'm a little stuck here.
The idea is that I'd like to get a file of every instruction run by a program during its execution. I'd like to do it with just the executable in hand (no source) and be able to determine what operation is occurring on what address, and when.
For example, I'd like to be able to run it on Google Chrome, Firefox, etc.
I want to use this for a performance prediction system I'm working on. I figure if I'm able to obtain each instruction that is executed, in the order it is executed, on system 1, I can attempt to simulate/model the run time of an identical program being run on system 2, because I'll be able to predict (although I know not with 100% accuracy) L1/L2 cache misses, L1/L2 cache hits, TLB hits/misses, page faults, time taken on floating point multiplication operations, etc.
I'd like to try to do this on two different systems:
System 1: Ubuntu 10.10 on Intel Core 2 Duo CPU
System 2: Ubuntu 12.04 on system with 2x AMD Sixteen Core Opteron model 6274
(I can definitely change the OSes as necessary, but would prefer to stay with Ubuntu, if possible)
Is this possible / how could I go about doing it? I know with debuggers, you can use them to step through everything, but I don't have the source available.
I think you can use qemu (or even bochs) or valgrind to monitor every executed instruction. They are x86 binary translation tools (excluding bochs, which is an interpreter of x86 code). There is a valgrind tool called cachegrind (plus the kcachegrind GUI), which can emulate the cache by instrumenting every memory access and simulating an L1/L2 cache model (the sizes may be configured via command-line options).
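For instance, the cache geometry can be overridden on the command line; the sizes below (size,associativity,line size) and ./your_program are only an example configuration and a placeholder, not a recommendation:

valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=6291456,12,64 ./your_program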
To go deeper (into the pipeline) you may want to look at the free ptlsim (http://www.ptlsim.org/).

Is there a way to independently task and use heterogeneous multi-GPUs in a Windows 7 system?

Can I have two AMD GPUs of mixed chipset/generation in my desktop, a 6950 and a 4870, and dedicate one GPU (the 4870) to OpenCL/GPGPU purposes only, eliminating the device from video output or display-driving consideration by the OS, allowing the 4870 to essentially remain in a deep sleep or appear ejected/disabled until its stream processors are called upon?
Compared to the 4870, the 6950 is a heavyweight in OpenCL calculations; enough so that it can crunch numbers and still allow an active user session, and even web browsing. HOWEVER, as soon as I navigate to a webpage with embedded Flash video, forget what I have running and open Media Player or Media Center - basically any GPU-accelerated video task that requires the 6950 to initialize UVD - the display system hangs.
I'm looking for a way to plug my 4870 into an open PCIe slot, have it sit in a dormant state with near-zero heat production and power consumption (essentially only maintaining the interface signalling, like an Ethernet card in a powered-off desktop maintaining the link and waiting for a WOL command), and attain a D0 state (I don't even care if the latency of this wake event is on the scale of seconds) to then run OpenCL calculations ON ITS OWN. I do not wish to achieve a non-CF heterogeneous GPU teaming setup! In my example of a UVD-hang situation, my desired outcome would be manually stopping the OpenCL calculations on the 6950 and then starting those calculations on the 4870, freeing up the 6950 for multimedia usage/gaming (granted, with a hit to the calculation rate). Even better if the two GPUs could independently run similar calculations while no one is using the desktop. I don't even mind if I have to initiate the power-state transitions of the 4870 from/into an 'OFF' state (say, by a shortcut on the desktop), as long as it doesn't require a system restart, ending the user session and logging off... and the manual ON/OFF 'switch' for the 4870 is something any proficient Windows end-user could do - like clicking a shortcut to run a script, or even going into Device Manager and toggling enable/disable. As long as the 4870 isn't wastefully idling by for one sole use that may occur sporadically.
I couldn't think of a way to facilitate this besides writing a new INI for the 4870 to override the typical power-management characteristics written for using the device as an ordinary graphics card (say, to drop in and out of a powered state without relinquishing the IRQ or other allocated resources, to 'hold the door open' on interface availability and addressing). But that is an endeavor well above my abilities, and I can easily see licensing issues getting involved as well.
Windows 7 (and maybe Windows 10) doesn't define a "selected device". It is software's own responsibility to pick the right device. For example, Google Chrome's add-on software (for video decode) will pick whatever GPU comes up as the first target defined in it. If it is written to pick the first-indexed device, then it needs a PCIe re-plug of both cards to switch their roles.
This OS is written to fit the majority (99%) of users, not multi-GPU users (1%?). It simply picks one of the GPUs, or the software has explicit control over devices and simply benchmarks all GPUs and picks the fastest. So you should look at the software's abilities instead of the OS's.
The same thing goes for games too! When I play Dota 2 on the Vulkan API, it uses the HD 7870 for compute (of textures, particles, etc.) but uses the R7 240 for graphics! I'd prefer the opposite, because the R7 240 can't draw fast. That's because the game is written for the majority of people, who don't have more than one GPU.
Money controls development, I'm sorry to say. Market penetration is needed for money, and 99% market penetration means writing code for the general public, not for scientists or people with high-end rigs. The public simply has one GPU, and a cheap one at that.
I wish this:
select 1 gpu for: unzipping files, watching videos, compressing internet uploads and caching for the file system (up to 2GB)
select another gpu for: gaming, opencl applications, mining, ..
select all gpus for: games, benchmarks, seen as single device by my applications,..
but this is not guaranteed to come true because money still talks.
If I were a driver developer, I would add a "rename" option (and become poor in return) to present N virtual devices to the OS, so the OS and other software would be able to use only 1/N-th of the whole system's power, or N/N, just by using those renamed or main devices. A rename could be a single compute unit of a GPU. When the OS tells the driver "give me 25% of all cores that share the same memory", it picks a device and gives 25% of the system's total cores. That way even users could create renames for their own work.
I even sent a message to Microsoft about "file system cache on my 2nd graphics card" but they never replied!
