How can I update the NVIDIA driver, especially for OpenCL, on Google Colab? - linux

Google Colaboratory offers GPUs such as the T4 and P100 (as well as the K80).
However, the default OpenCL driver does not support half-precision features, and I want to use them.
Does the latest driver enable them?
And how can I update the NVIDIA driver on Google Colaboratory?
Here is what I tried.
I downloaded the latest driver (XXX.run) from the NVIDIA site and ran:
!apt remove nvidia-*
# reboot by runtime menu
sh XXX.run
But I got this error:
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in
your kernel. This may be because it is in use (for example, by an X
server, a CUDA program, or the NVIDIA Persistence Daemon), but this
may also happen if your kernel was configured without support for
module unloading. Please be sure to exit any programs that may be
using the GPU(s) before attempting to upgrade your driver. If no
GPU-based programs are running, you know that your kernel supports
module unloading, and you still receive this message, then an error
may have occured that has corrupted an NVIDIA kernel module's usage
count, for which the simplest remedy is to reboot your computer.
Running rmmod nvidia-uvm also failed:
rmmod: ERROR: ../libkmod/libkmod.c:514 lookup_builtin_file() could not open builtin file '/lib/modules/4.14.137+/modules.builtin.bin'
rmmod: ERROR: ../libkmod/libkmod-module.c:793 kmod_module_remove_module() could not remove 'nvidia_uvm': Operation not permitted
rmmod: ERROR: could not remove module nvidia-uvm: Operation not permitted
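
Before swapping drivers, it may be worth confirming what the current driver actually reports. Half-precision arithmetic is advertised through the cl_khr_fp16 extension, so a minimal sketch, assuming you have copied the device's extension string (for example from the "Device Extensions" line of clinfo, or from clGetDeviceInfo with CL_DEVICE_EXTENSIONS), could look like this. The function name and the sample strings are hypothetical and only there for illustration.

```python
HALF_PRECISION_EXTENSION = "cl_khr_fp16"  # Khronos name of the half-precision extension


def supports_half_precision(extensions: str) -> bool:
    """Return True if a space-separated extension string lists cl_khr_fp16."""
    return HALF_PRECISION_EXTENSION in extensions.split()


if __name__ == "__main__":
    # Both strings are made-up examples, not output from a real Colab GPU.
    examples = {
        "device without fp16": "cl_khr_fp64 cl_khr_byte_addressable_store",
        "device with fp16": "cl_khr_fp64 cl_khr_fp16",
    }
    for label, extensions in examples.items():
        print(label, supports_half_precision(extensions))
```

If the extension is missing from the string, the driver in use does not expose half precision for that device, whatever the hardware itself can do.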

Related

Failed to load TPM device

I have a board from a no-name manufacturer with a LetsTrust TPM 2.0 device, but I get an error when trying to load the TPM.
The board runs Ubuntu arm64 with kernel version 4.9.170.
I have added the following modules to be loaded on boot:
tpm_tis_spi
tpm_tis_core
tpm
but I still can't see the TPM under /dev. When I check the boot logs with dmesg, I only see "fly error: tpm_tis_remove". I am not sure what it means exactly, but it seems the kernel couldn't recognize the TPM for some reason.
Is there a way to get more information to understand why this is happening? We have no manual from the manufacturer, so I can't ask them for help.
I have tried to compile a new kernel with TPM support so that those drivers are already available as modules, but the OS disk is somehow embedded in the board and I couldn't find any bootloader that would let me boot the new kernel.

Direction Installing OpenCL on Linux Mint Dell 9550

I'm about to enter the next phase of a project where I am moving computation to the GPU. Unfortunately, I have had very poor success setting up OpenCL in my environment. I hoped I could garner some specific direction about what implementation of OpenCL to use and how to avoid certain pitfalls upon installation.
My machine:
Linux Mint 17.3
Dell XPS 15 9550 with an Nvidia GTX 960M graphics chip
Some specifics:
I have been unable to find any graphics drivers that work with this hardware other than the Nvidia-352 version found in this PPA:
https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa
Every other one I try bricks the machine. I've reinstalled Mint more times than I can count finding this one driver. Keep in mind that I must use this configuration for my machine to work.
I attempted to install Nvidia's CUDA toolkit from their site (https://developer.nvidia.com/cuda-downloads) and for some reason the installation overwrote my Nvidia-352 driver and bricked the machine again.
At this point I'm not certain which implementation is correct anyway. I do not want to try another and have the same thing happen.
Some specific questions:
Does every implementation of OpenCL assert itself over the currently installed drivers?
If it does then how can I direct my machine to use the correct one?
Which implementation would be right for my machine?
Can you think of any resources or links that I might be interested in to keep me moving forward? Specifically some installation instructions?
Thanks,
Chronic
Disclaimer: all this is based on my experience with Ubuntu 15.10, but hopefully Mint isn't too different.
Does every OpenCL installation overwrite the others?
If you're installing OpenCL implementations from two different vendors then no, they shouldn't overwrite each other. For example, I have the Nvidia, Intel CPU, POCL and Beignet (Intel GPU) platforms installed and working. The only caveat is that the Intel CPU runtime overwrote the libOpenCL.so* files, which made clinfo crash because it required libOpenCL.so.1, a file the Intel CPU runtime had deleted. Re-installing the ocl-icd-opencl-dev package fixed this; you can also make libOpenCL.so.1 a symlink to the actual .so file left by the Intel CPU runtime.
If you install two versions for the same platform, as you did, then yes, the last one you install will overwrite the previous one. In your case, remember that the CUDA toolkit also includes the GPU drivers. I haven't played with the CUDA toolkit in a while; perhaps there is an option to install the toolkit only and not the drivers. But since each toolkit requires a certain minimum driver version, you'd have to pick a toolkit version that works with the driver version you can get installed.
On Ubuntu, there is an nvidia-cuda-toolkit package you can sudo apt-get install. It doesn't ask to change my drivers, so hopefully it will work for you. I don't know what version of the toolkit this one installs.
Which implementation is right
If you only want to do OpenCL development then install the nvidia-352 package that worked for you, as well as installing ocl-icd-opencl-dev. This package installs the ocl-icd-libopencl and opencl-headers packages, giving the header files and libOpenCL.so (the ICD loader). You also need to sudo apt-get install nvidia-opencl-icd-352 as that provides the OpenCL runtime for Nvidia GPUs. If you also want to do CUDA development then you need the toolkit.
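
To see which vendor runtimes the ICD loader can find after installing these packages, you can look at the .icd files it reads. Here is a rough sketch, assuming the usual Linux location /etc/OpenCL/vendors (your distribution may differ); each .icd file holds the name or path of one vendor's runtime library. The function name is hypothetical.

```python
from pathlib import Path
from typing import Dict

# Assumed default directory that the Linux ICD loader scans for vendor files.
DEFAULT_VENDOR_DIR = Path("/etc/OpenCL/vendors")


def registered_icds(vendor_dir: Path = DEFAULT_VENDOR_DIR) -> Dict[str, str]:
    """Map each .icd file name to the library name or path written inside it."""
    if not vendor_dir.is_dir():
        return {}
    found = {}
    for icd_file in sorted(vendor_dir.glob("*.icd")):
        lines = icd_file.read_text(errors="replace").splitlines()
        # The first line of an .icd file names the vendor's runtime library.
        found[icd_file.name] = lines[0].strip() if lines else ""
    return found


if __name__ == "__main__":
    for name, library in registered_icds().items():
        print(f"{name}: {library}")
```

One file per vendor is why runtimes from different vendors can coexist: the loader (libOpenCL.so) just dispatches to whichever libraries are listed there.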
As a side note, install one of the CPU runtimes, e.g. POCL, in addition to the Nvidia runtime. I found this useful for detecting a bug in my kernel - the kernel worked most of the time on my Nvidia GPU but failed consistently on POCL. It was a race condition.
Useful links
Sorry, no up-to-date installation instructions. However, the instructions provided by each vendor with their OpenCL runtime (except Nvidia) seem to be good enough for me.
Here are some older instructions:
https://wiki.tiker.net/OpenCLHowTo
https://streamcomputing.eu/blog/2011-06-24/install-opencl-on-debianubuntu-orderly/ - The rest of the StreamComputing blog is also interesting.

libGL error with Linux target of OpenFL

I'm planning on doing game development with Haxe, utilizing its C++ target, and for that I chose the HaxeFlixel framework, which uses OpenFL as its backend. The "hello world" test runs just fine with flash, HTML5 seems to work (minus sound), though I'm not planning on using either of those, as the game I wish to create would be a desktop game that runs natively.
However, when I tried to run the HaxeFlixel hello world example with the target set to native linux, the test program crashed on startup and gave me the following errors:
libGL: screen 0 does not appear to be DRI2 capable
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/tls/swrast_dri.so
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/swrast_dri.so
libGL: Can't open configuration file /home/zauber/.drirc: No such file or directory.
libGL error: failed to load driver: swrast
X Error of failed request: GLXUnsupportedPrivateRequest
Major opcode of failed request: 153 (GLX)
Minor opcode of failed request: 16 (X_GLXVendorPrivate)
Serial number of failed request: 211
Current serial number in output stream: 213
I'm at a loss as to how to fix the problem. I've never seen anything like it, and all other 3D software and games I have run just fine. I asked on the HaxeFlixel forums, but was only told that it might be a bug in OpenFL. That seems to be the case since I have the same problem with Awe6, another game framework that uses OpenFL.
I've done a google search for similar issues, but turned up pretty much nothing. I already have all the relevant libraries that I should have (mesa, nVidia drivers, dri2, 32bit libs), and all the solutions I found pretty much pointed to installing a specific library, which I already had installed.
So far, I have asked on both the OpenFL forums and on the IRC channel, and in both cases I was completely ignored. I really need to get this problem fixed because unless I do, I cannot proceed with my gamedev project.
For reference, my system is running 64bit Linux Mint 16, Linux kernel 3.11.0-12, and nVidia drivers 319.32
Then something in your system configuration is completely messed up: for some reason your program loads a libGL.so provided by the Mesa drivers instead of the NVidia driver's libGL.so. The telltale sign is that the loaded libGL complains about DRI2 not being available. NVidia's proprietary drivers don't use or support DRI2. DRI2 is the direct rendering interface that Mesa's drivers use to talk to the X server.
Make sure your system is properly configured. Most importantly make sure that none of the libraries, frameworks, etc. you use did something foolish, like bundling up a libGL.so.
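
One way to check which libGL a binary picks up is to run ldd on it and look at the libGL.so line. Below is a small sketch that parses ldd output you have captured as text; the function name is hypothetical and the paths in the sample are only illustrative, not taken from the asker's machine. Note that this only covers libraries linked at build time; a libGL opened at run time with dlopen would not appear in ldd output.

```python
from typing import Optional


def resolved_libgl(ldd_output: str) -> Optional[str]:
    """Return the path ldd resolved libGL.so.* to, or None if absent or not found."""
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if arrow and name.startswith("libGL.so"):
            # target looks like "/path/to/libGL.so.1 (0x00007f...)" or "not found"
            path = target.split(" (")[0].strip()
            return None if path == "not found" else path
    return None


if __name__ == "__main__":
    # Illustrative ldd output; real paths depend on the distribution and driver.
    sample = (
        "\tlinux-vdso.so.1 (0x00007ffd00000000)\n"
        "\tlibGL.so.1 => /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1 (0x00007f0000000000)\n"
    )
    print(resolved_libgl(sample))
```

A path under a Mesa directory on a machine that should be using the proprietary driver would match the symptom described above.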

Linux stuck in CPU soft lockup?

My system is a CentOS 6.3 (running Kernel version 2.6.32-279.el6.x86_64).
I have a loadable kernel module which is a driver that manages a PCIe card.
If I manually insert the driver using insmod while the OS is up and running, the driver loads successfully and is operational.
However, if I try to install the driver using rpm and then reboot the system, during startup the OS gets stuck spitting out the following "soft lockup" message for ALL the CPU cores, except for one core that is in "soft lockup" in one of the threads created by my driver.
BUG: soft lockup - CPU#X stuck for 67s! [migration/8:36]
.......(same above message for all cores except one)
BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]
(one core is locked up in one of the threads in my driver).
I searched the net quite a bit for info on this kernel message / bug, and there are quite a few posts about it, but none explain what causes it or how to debug it. Any help with the following questions would really be appreciated:
I am not able to log into the system, I think because all the cores are in a "soft lockup" state, and hence I cannot trigger a kernel dump from a shell prompt. I enabled SysRq and tried to trigger a kernel dump with the SysRq key combo, but no luck. It seems the system is not responding to the keyboard (not even to the CapsLock key). Any suggestions on how I can trigger a kernel dump in this circumstance?
I can imagine the possibility of my driver thread causing a "soft lockup". But how can the "migration" thread (a kernel thread) be in a "soft lockup" just because of my driver?
From browsing the net, the "migration" thread is used to move tasks from one CPU to another. Can someone please help me understand what this thread does exactly, and how it can be affected by other threads, if at all?
I had a very similar problem on my desktop. It would soft lockup very frequently - about once a day or so.
It turns out it was because I was running on an Intel Haswell. It seems that the Haswell/Broadwell series of Intel processors have a bug which can cause system instability. This bug was fixed in a microcode update.
Check if CentOS offers an intel-microcode package, and install it. Make sure you configure grub to load it as the initial ramdisk before it loads initramfs.
Personally, I upgraded my microcode by booting into Windows and running a BIOS update. You can check whether the microcode was actually updated by comparing the output of grep 'microcode' /proc/cpuinfo before and after the update.
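
If you prefer to script that before/after comparison, here is a minimal sketch that pulls the microcode revisions out of /proc/cpuinfo text. The function name is hypothetical; it assumes the "microcode : <revision>" line format that x86 Linux kernels print, and the 0x1c value in the comment is just an example.

```python
from typing import Set


def microcode_revisions(cpuinfo_text: str) -> Set[str]:
    """Collect the distinct microcode revisions listed in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Lines of interest look like "microcode\t: 0x1c" (one per logical CPU).
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(sorted(microcode_revisions(handle.read())))
    except OSError:
        print("/proc/cpuinfo is not readable on this system")
```

Save the printed value before the update and compare it with the value afterwards; if it has not changed, the new microcode was not loaded.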

wpa_supplicant tells "No Drivers enabled"

I compiled the wpa_supplicant source, version 0.7.3, downloaded from http://hostap.epitest.fi/wpa_supplicant/. When I try to run the resulting wpa_supplicant binary, I get "No Drivers installed". Am I missing anything in the compilation? Has anyone faced this error? Is there a setting to enable drivers when compiling wpa_supplicant and wpa_cli?
Does your kernel automatically load the correct module for your wireless card? If not, modprobe the correct module, and try again.
Also, the wpa_supplicant(8) page's AVAILABLE DRIVERS section says that only a handful of cards are supported (disappointing, but at least you could look through the list before buying a card), and that support for the drivers may or may not be compiled in. So make sure your card's driver is on the list, and make sure you've compiled wpa_supplicant(8) with the correct driver.
You have to choose a driver by enabling it in wpa_supplicant's .config file before building. The options have the form CONFIG_DRIVER_<name>.
CONFIG_DRIVER_WEXT and CONFIG_DRIVER_NL80211 are generic drivers that suit a wide range of hardware.
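
To double-check which drivers a given .config actually enables before you build, something like the following sketch may help. It assumes the usual KEY=y syntax with # for commented-out lines; the function name and the sample config text are hypothetical.

```python
from typing import List

DRIVER_PREFIX = "CONFIG_DRIVER_"


def enabled_drivers(config_text: str) -> List[str]:
    """List driver names set to 'y' in the text of a wpa_supplicant .config file."""
    drivers = []
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip comments (including commented-out options) and non-assignments.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip().startswith(DRIVER_PREFIX) and value.strip() == "y":
            drivers.append(key.strip()[len(DRIVER_PREFIX):])
    return drivers


if __name__ == "__main__":
    # Made-up config fragment for illustration.
    sample = (
        "#CONFIG_DRIVER_HOSTAP=y\n"
        "CONFIG_DRIVER_WEXT=y\n"
        "CONFIG_DRIVER_NL80211=y\n"
        "CONFIG_IEEE8021X_EAPOL=y\n"
    )
    print(enabled_drivers(sample))
```

An empty list for your own .config would be consistent with the error in the question, since no driver backend would have been compiled in.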

Resources