NVIDIA error on Azure DSVM/DLVM

I have been creating a few Ubuntu DSVMs and DLVMs on Azure with GPUs and I keep getting intermittent errors. They manifest as nvidia-smi being really slow or returning the following error:
2018/01/11 19:42:33 Error: nvml: Driver/library version mismatch
This appears when I try to run nvidia-smi or nvidia-docker. A reboot usually fixes it, but it can reappear.
Does this sound like an intermittent error? Is there something I can do to mitigate it?

NVIDIA just released a new version of the GPU driver for the GPUs used in Azure. The Ubuntu DSVM is configured to install updates automatically, so the new driver packages are installed for you in the background. The catch is that the driver runs inside the kernel (as a kernel module), so you must reboot for the new driver to be loaded. The message Driver/library version mismatch means the version loaded in the kernel can't work with the installed userspace libraries, because those were upgraded underneath it. This is why rebooting usually fixes it.
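If you want to confirm that this is what happened, one hedged check (assuming a standard Ubuntu DSVM layout) is to compare the driver version loaded in the kernel with the versions of the installed packages:
cat /proc/driver/nvidia/version   # driver version currently loaded in the kernel
dpkg -l | grep nvidia             # versions of the installed NVIDIA packages
If the two disagree, a reboot (or unloading and reloading the nvidia modules) brings them back in sync.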
There is a second issue you might be facing: Azure released a new kernel a few days ago that is incompatible with the 387 version of the GPU driver. You won't get that driver by default on the DSVM, but you might have it if you installed other packages. That error looks different: something like "nvidia-smi could not communicate with the nvidia module". The only way to fix it is to (1) get the very latest kernel with apt update and apt upgrade, then reboot, and (2) install a different driver with apt install nvidia-384.
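A minimal sketch of that recovery sequence on the Ubuntu DSVM (package name taken from the answer above; adjust if your repository names the driver differently):
sudo apt update && sudo apt upgrade -y   # pull in the latest kernel
sudo reboot
# after the machine comes back up:
sudo apt install nvidia-384
nvidia-smi                               # should now report the GPU normally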

Related

Bad exit status from /var/tmp/rpm-tmp.9hCo4Y (%build)

Currently working with a RHEL 6 offline system.
I am using a RHEL6 repo server to compile a new IGB driver (Intel® Network Adapter Driver for 82575/6, 82580, I350, and I210/211-based Gigabit Network Connections for Linux*) with a RHEL6 kernel update. It will be included in a RHEL6 patch update for an offline system. The IGB driver version I am using is igb-5.13.7 and the kernel update I am using for this patch is 2.6.32-754.48.1.el6 (x86_64).
When I use the command "rpmbuild -tb igb-5.13.7.tar.gz", the compilation errors out and I receive the following message: "Bad exit status from /var/tmp/rpm-tmp.9hCo4Y (%build)".
The last IGB network driver I compiled successfully was igb-5.10.2. Since then, I installed the most recent RHEL 6 kernel update on the repo server, removed the old repodata, created new repodata, ran yum update, and restarted the repo server. I have done nothing out of the ordinary for my patch update and I have no idea why I am receiving the above error.
I read in another Stack Overflow question that if you look in the network driver's spec file (in this case igb.spec), you can find the error on line 28. For me, line 28 was "LANG=C", so I'm pretty sure that's not the issue.
Any advice on how to fix this issue or about what files I should look into?
I've tried using an older version of the IGB driver (igb-5.11.4) in case the new driver file was bad. I've tried redownloading the IGB driver I was using (igb-5.13.7) in case the file I had was corrupted.
I've tried checking the spec file, as recommended in another Stack Overflow answer, to find the issue on line 28.
I've tried deleting all of the patch files and network driver from the repo server to start the whole process over from scratch. Same issue arises every time.
I've tried a solution that involves rebuilding an RPM from the Source RPM, installing the RPM spec, and rebuilding the package. However, I can't find the source RPM for the latest IGB driver. Willing to continue down this path if someone knows where to find the igb-5.13.7 source rpm or how to obtain it.
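One hedged way to get at the underlying compiler error (the rpm-tmp message itself only says that %build failed) is to capture the full build output and to confirm that a kernel-devel package matching the target kernel is installed:
rpmbuild -tb igb-5.13.7.tar.gz 2>&1 | tee igb-build.log   # the real error is printed just above "Bad exit status"
rpm -q kernel-devel                                        # should list 2.6.32-754.48.1.el6 if you are building against the new kernel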

Any success installing VMware Workstation on virgin Rocky Linux 8.5?

Using a virgin (but updated) version of Rocky Linux 8.5, I am trying to install VMware Workstation 16.2.1 (and others), but I get compile errors during the first attempt to run it, when vmmon and vmnet are being built.
All the proper, current headers from kernel-devel and kernel-headers are installed.
I tried upgrading to the 5.16.4 kernel from kernel.org, with all associated headers, and get basically the same errors:
"Unable to install all modules." (i.e., vmmon and vmnet)
Posts I have found while searching the net seem to indicate that there was a "back-port" of an upstream fix to Rocky that has affected the ability to build the loadable kernel modules necessary to run VMware, but I cannot confirm this is actually the problem I am experiencing.
So I simply ask these questions: Can anyone (today) install VMware Workstation 16.2.1 (or any version) on a fresh install of Rocky Linux 8.5?
If so, would you please point me at your installation instructions? I am unable to build the "vmmon" and "vmnet" modules today (2022-01-04) that would allow me to actually run virtual machines with VMware. (The kernel modules fail to compile and build.)
(And after 15 years of using Stack Overflow I do not have the reputation to create a "rocky-linux" question tag...)
See https://unix.stackexchange.com/questions/689436/the-vmmon-and-vmnet-vmware-workstation-kernel-modules-fail-to-build-on-rocky-lin
mkubecek's instructions work for a variety of releases and should compile cleanly and run without issue if you follow them.
I have successfully used these methods at least a half dozen times on Rocky 8.5 and 8.6 with VMware Workstation 16.1 up to version 16.2.1.
NOTE: This error is NOT Rocky Linux specific; it also happens on some versions of RHEL 8 and CentOS 8.x. I would also expect this fix to work on the other RHEL 8-derived Linux distributions.
I've been having difficulty with the same issue, and a colleague pointed me to check my kernel. This is our "official" resolution. See if the below works for you.
This is due to differences between the kernel and the source code for the VMware modules; see here for more information. You can get the correct kernel modules and build them by executing the following commands:
wget https://github.com/mkubecek/vmware-host-modules/archive/workstation-16.1.0.tar.gz
tar -xf workstation-16.1.0.tar.gz
cd vmware-host-modules-workstation-16.1.0/
make
sudo make install
If you get the error
crosspage.c:53:16: fatal error: linux/frame.h: No such file or directory
the error is described here. The solution is to remove (i.e. comment out) the offending include in crosspage.c. After doing the sudo make install, it is a very good idea to restart your host.
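A sketch of that edit from the unpacked source tree (the path to crosspage.c is assumed; adjust it to wherever the file lives in your checkout):
sed -i 's|#include <linux/frame.h>|/* #include <linux/frame.h> */|' vmmon-only/common/crosspage.c   # comment out the header that no longer exists on newer kernels
Then re-run make and sudo make install.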
You may need to manually insert the modules into the kernel the first time after running make install. The kernel modules (vmmon.ko and vmnet.ko) will be found in /lib/modules/$(uname -r)/misc. The following set of commands will do this:
cd /lib/modules/$(uname -r)/misc
sudo insmod vmmon.ko
sudo insmod vmnet.ko
The modules should be loaded automatically after a restart/reboot.
If you update VMware to a different version (say 16.2.1) you may need to do this again; just change the versions in the above commands. If you hit the update button on the splash screen and failed to notice the version you are updating to, you can run `vmware -v` at a command prompt to see the version you updated to.
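For example, a hedged repeat of the same steps for 16.2.1, assuming the repository carries a matching workstation-16.2.1 archive:
wget https://github.com/mkubecek/vmware-host-modules/archive/workstation-16.2.1.tar.gz
tar -xf workstation-16.2.1.tar.gz
cd vmware-host-modules-workstation-16.2.1/
make
sudo make install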

Theano GPU Installation Issue

I have installed Anaconda, Theano, and GPU Toolkit version 8. I am getting this error:
ERROR: refusing to load cuda driver library because the version is blacklisted. Versions 373.06 and below are known to be ok.
If you want to bypass this check and force the driver load define GPUARRAY_FORCE_CUDA_DRIVER_LOAD in your environement.
ERROR (theano.gpuarray): Could not initialize pygpu, support disabled
See discussion here.
I reinstalled libgpuarray and it worked for me.
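For reference, a minimal sketch of that reinstall, assuming pygpu/libgpuarray came from conda (the Theano docs of that era recommended installing them this way); if you built libgpuarray from source, rebuild and reinstall it instead:
conda remove pygpu
conda install theano pygpu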
Your CUDA driver is faulty. Install another one, preferably 373.06. Using your current driver will result in wrong computations. Do NOT force the driver load.
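To see which driver you are currently on before swapping it out (a hedged check; works with any reasonably recent NVIDIA driver):
nvidia-smi --query-gpu=driver_version --format=csv,noheader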

clinfo error for opencl amd

I installed AMDAPPSDK-3.0 on my laptop, which has a 3rd-generation Intel i5. I have no GPU other than the Intel processor's built-in graphics.
I installed the SDK like this:
./AMD-APP-SDK-v3.0.130.136-GA-linux64.sh
My .bashrc file has:
export LD_LIBRARY_PATH=/home/roadeo/AMDAPPSDK-3.0/lib/x86_64/
export AMDAPPSDKROOT="/home/roadeo/AMDAPPSDK-3.0"
export OPENCL_VENDOR_PATH="/home/roadeo/AMDAPPSDK-3.0/etc/OpenCL/vendors/"
When I run clinfo to check whether OpenCL is installed properly, I get this error:
terminate called after throwing an instance of 'cl::Error'
what(): clGetPlatformIDs
Aborted (core dumped)
After googling, out of frustration I installed fglrx using sudo apt-get. Now when I run clinfo I get a lot of details about OpenCL versions, vendors, etc. I don't know whether fglrx is actually required.
What am I doing wrong? Kindly suggest.
I'm not familiar with AMD drivers on Linux, but it seems to me that installing the SDK only installed a bunch of examples, header files, etc. but did not actually install any OpenCL runtimes. Installing fglrx probably installed the CPU runtime, in which case the only device you'll see listed is your CPU. If you want to write OpenCL code for your GPU, you'll need to look at Beignet: https://freedesktop.org/wiki/Software/Beignet/
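One hedged way to see which OpenCL runtimes are actually registered (the ICD loader reads the .icd files in the vendors directory, the same kind of directory your OPENCL_VENDOR_PATH points at) is:
ls /etc/OpenCL/vendors/        # one .icd file per installed runtime
cat /etc/OpenCL/vendors/*.icd  # each file names the runtime library the loader will open
If nothing is listed there (or in your custom OPENCL_VENDOR_PATH), clinfo has no platforms to report, which matches the clGetPlatformIDs failure.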

installing headers for 3.5 kernel in debian wheezy?

Yesterday, I compiled the 3.5 kernel on Debian wheezy (testing), on a ThinkPad Edge S430 (i5). I did it following this blog, with all the default options. It seemed successful, but then I tried to install the proprietary NVIDIA driver with m-a auto-install nvidia-kernel. The install cannot proceed until the correct headers are installed. However, I have tried manually installing both linux-headers-3.5.0-18 and the linux-headers-amd64 package, but module-assistant is not able to see them, showing the following message:
Bad luck, the kernel headers for the target kernel version could not be found and you did not specify other valid kernel headers to use.
There are other ways to install the driver, but I think that the problem with headers is broader.
Although I have been a Debian user for some years, I am far from being an expert, and I am not clear about the problems that I might face when compiling a 3.5 kernel on Debian testing, so any help and explanation will be much appreciated.
First run
sudo m-a prepare
Getting source for kernel version: 3.8.5-ck1
Kernel headers available in /usr/src/linux-headers-3.8.5-ck1
Creating symlink..
Then do
sudo m-a a-i nvidia
and it should work.
Note that I did this on 3.8.5-ck1, but I built and installed that kernel in a similar fashion to how I wrote up the 3.5 build that you followed.
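If m-a prepare still cannot find headers for the self-built 3.5 kernel, one hedged option, assuming the kernel was built with make-kpkg as many wheezy-era guides did, is to build and install the matching headers package alongside the image (the .deb file name below is illustrative):
fakeroot make-kpkg --initrd kernel_image kernel_headers   # builds both the image and the headers packages
sudo dpkg -i ../linux-headers-3.5.0*.deb
sudo m-a prepare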
