The TensorFlow Docker GPU image doesn't detect my GPU - Linux

Running the latest Docker image with:
docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --notebook-dir=/tf --ip 0.0.0.0 --no-browser --allow-root --NotebookApp.allow_origin='https://colab.research.google.com'
Code:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
gives me:
2020-07-27 19:44:03.826149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-27 19:44:03.826179: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (-1)
2020-07-27 19:44:03.826201: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
I'm on Pop!_OS 20.04 and have tried installing the CUDA drivers from both the Pop repository and from NVIDIA. No dice. Any help appreciated.
Running
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
gives me:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    On   | 00000000:09:00.0  On |                  N/A |
|  0%   52C    P5    15W / 225W |    513MiB /  7959MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

As per the docs here and here, you have to add a "--gpus" argument when creating the Docker container to get GPU support.
So you should start your container like this; the "--gpus all" flag makes all of the host's GPUs visible to the container.
docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --notebook-dir=/tf --ip 0.0.0.0 --no-browser --allow-root --NotebookApp.allow_origin='https://colab.research.google.com'
You can also run nvidia-smi against the TensorFlow image to quickly check whether the GPU is accessible from the container.
docker run -it --rm --gpus all tensorflow/tensorflow:latest-gpu-jupyter nvidia-smi
In my case this returns:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:07:00.0  On |                  N/A |
|  0%   45C    P8     8W / 166W |    387MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
As you can see, I'm running an older NVIDIA driver (440.100), so I cannot confirm that this will solve your problem. I'm also on Pop!_OS 20.04 and didn't install anything other than Docker, its dependencies, and nvidia-container-toolkit.
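For reference, at the time NVIDIA's install guide set the toolkit up on Ubuntu-family distributions roughly as below. Treat the repository string as an assumption (on Pop!_OS you may need to substitute the matching ubuntu20.04 entry) and check the current guide before running it:
# Add NVIDIA's container toolkit repository (assumed Ubuntu-style setup)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker   # restart the daemon so it picks up the NVIDIA runtime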
Also, I would highly suggest avoiding the latest tag when creating containers, as it might cause you to unknowingly upgrade to a newer image. Go with version-numbered images, for example tensorflow/tensorflow:2.3.0-gpu-jupyter.
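With a pinned tag, the run command from above becomes (2.3.0 is just an example; substitute whichever release you have tested against):
docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:2.3.0-gpu-jupyter jupyter notebook --notebook-dir=/tf --ip 0.0.0.0 --no-browser --allow-root --NotebookApp.allow_origin='https://colab.research.google.com'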

Related

PyTorch CUDA for Ubuntu 20.04

I'm trying to get PyTorch with CUDA 10.2 compatibility via:
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
(from https://discuss.pytorch.org/t/pytorch-with-cuda-11-compatibility/89254), but there is a timeout error:
Proceed ([y]/n)? y
Downloading and Extracting Packages
pytorch-mutex-1.0 | 3 KB | | 0%
torchvision-0.12.0 | 8.8 MB | | 0%
ffmpeg-4.3 | 9.9 MB | | 0%
pytorch-1.11.0 | 622.9 MB | | 0%
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/pytorch/noarch/pytorch-mutex-1.0-cuda.tar.bz2>
Elapsed: -
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/pytorch/noarch/pytorch-mutex-1.0-cuda.tar.bz2>
Elapsed: -
It turned out I was running WSL2 and hadn't shut down for many days. A reboot fixed the issue.
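If a full reboot is inconvenient, restarting just the WSL2 VM from an elevated PowerShell prompt may be enough; wsl --shutdown is the standard command for that, though I haven't confirmed it fixes this particular failure:
wsl --shutdown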

nvidia-docker: Got permission denied

Docker newbie question here, so please be nice.
I know this might be asked before but I could not find anything related to nvidia-docker.
I completed the installation instructions on the official guide.
When I wanted to test nvidia-docker:
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
I got this error:
(base) user@adminme:~$ docker run --gpus all --rm nvidia/cuda nvidia-smi
docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/create: dial unix /var/run/docker.sock: connect: permission denied.
See 'docker run --help'.
I found this answer here, but it felt a bit different for my case. I am very new to Docker and still learning. Let me know what you think.
Here is some information about my remote Linux machine:
(base) user@adminme:~$ lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
02:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
nvidia-smi command:
(base) user@adminme:~$ nvidia-smi
Sun May 31 01:12:25 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   33C    P8     9W / 215W |     17MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2545      G   /usr/lib/xorg/Xorg                            15MiB |
+-----------------------------------------------------------------------------+
Docker version:
(base) user@adminme:~$ docker --version
Docker version 19.03.10, build 9424aeaee9
The quick fix would be to run the container using sudo:
sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
If you want to run Docker as a non-root user, then you need to add your user to the docker group.
Create the docker group if it does not exist:
sudo groupadd docker
Add your user to the docker group:
sudo usermod -aG docker $USER
Run the following command, or log out and log back in; if that doesn't work, you may need to reboot your machine first:
newgrp docker
Check that Docker can be run without root:
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Ref: https://docs.docker.com/engine/install/linux-postinstall/
In addition to what nischay goyal answered, sometimes after adding the user to the docker group you have to run
su - ${USER}
in order to log out and log back in.
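Either way, you can confirm the group change took effect in your current session with a quick sanity check:
id -nG   # "docker" should appear in the list of groups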

Cannot dlopen some GPU libraries. Skipping registering GPU devices

TensorFlow is only using the CPU and won't use the GPU. I assume it's because it expects CUDA 10.0 and finds 10.2.
I had installed 10.2 but have purged it and installed 10.0.
I'm running Ubuntu 19.10, an AMD Ryzen 2700 CPU, and an RTX 2080 S.
I have installed the 440 NVIDIA driver. It says CUDA version 10.2 when I check with nvidia-smi and nvcc --version.
From pip3: tensorflow-gpu 1.14.0
tensorflow-datasets 2.0.0
tensorflow-estimator 1.14.0
tensorflow-metadata 0.21.1
From nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   48C    P8    13W / 250W |    369MiB /  7979MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1110      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      1611      G   /usr/lib/xorg/Xorg                            73MiB |
|    0      1816      G   /usr/bin/gnome-shell                         108MiB |
|    0      2655      C   python3                                      115MiB |
+-----------------------------------------------------------------------------+
From nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
But when I check version.txt I get 10.0.130:
cat /usr/local/cuda/version.txt
CUDA Version 10.0.130
I check the devices with:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
result:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 4810338588393992961
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 7271419476897292826
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 4332706623198547606
physical_device_desc: "device: XLA_GPU device"
]
How do I register the 10.0.130 version?
Is that the reason why it won't run on the GPU? It's super slow on the 8-core CPU. Any advice?
Here is the log:
2020-02-13 14:11:31.411277: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-13 14:11:31.440150: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3193485000 Hz
2020-02-13 14:11:31.441076: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5625b689c790 executing computations on platform Host. Devices:
2020-02-13 14:11:31.441123: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2020-02-13 14:11:31.443001: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-13 14:11:31.472935: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-13 14:11:31.473407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.845
pciBusID: 0000:08:00.0
2020-02-13 14:11:31.474361: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-02-13 14:11:31.487124: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-02-13 14:11:31.496148: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-02-13 14:11:31.498873: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-02-13 14:11:31.514842: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-02-13 14:11:31.525992: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-02-13 14:11:31.526168: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/lib64
2020-02-13 14:11:31.526183: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-02-13 14:11:31.618627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-13 14:11:31.618655: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
2020-02-13 14:11:31.618662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
2020-02-13 14:11:31.620367: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-13 14:11:31.621395: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5625b732d5f0 executing computations on platform CUDA. Devices:
2020-02-13 14:11:31.621407: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2080 SUPER, Compute Capability 7.5
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13330791690361361129
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 11872341970779952422
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 15007819717683015571
physical_device_desc: "device: XLA_GPU device"
]
WARNING:tensorflow:From pokeGAN.py:172: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From pokeGAN.py:174: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From pokeGAN.py:77: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
2020-02-13 14:11:33.799163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-13 14:11:33.799597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.845
pciBusID: 0000:08:00.0
2020-02-13 14:11:33.799646: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-02-13 14:11:33.799658: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-02-13 14:11:33.799669: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2020-02-13 14:11:33.799684: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2020-02-13 14:11:33.799695: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2020-02-13 14:11:33.799706: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2020-02-13 14:11:33.799777: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/lib64
2020-02-13 14:11:33.799786: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2020-02-13 14:11:33.800016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-13 14:11:33.800028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]
WARNING:tensorflow:From pokeGAN.py:203: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
2020-02-13 14:11:34.197990: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
WARNING:tensorflow:From /home/node/.local/lib/python3.7/site-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
WARNING:tensorflow:From pokeGAN.py:211: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
total training sample num:91
batch size: 64, batch num per epoch: 1, epoch num: 5000
start training...
Judging from your logs, it looks like TensorFlow finds the correct CUDA version, but the cuDNN library is missing.
2020-02-13 14:11:31.474361: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2020-02-13 14:11:31.526168: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/lib64
Have you installed the correct version of cuDNN? As you can see here,
TensorFlow 1.14 also requires cuDNN 7.4.
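If you want to verify this before reinstalling anything, a couple of quick checks may help (note that your log shows LD_LIBRARY_PATH pointing at /usr/local/cuda-10.2/lib64 rather than the 10.0 install, which is suspicious in itself):
ldconfig -p | grep libcudnn          # is any cuDNN library visible to the dynamic linker?
ls /usr/local/cuda*/lib64/libcudnn*  # is it installed under either CUDA tree?
If cuDNN 7 turns out to be installed under the CUDA 10.0 tree, pointing the linker there is a possible workaround (an assumption based on your paths, not something I've verified on your machine):
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH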
The only thing that worked for me to solve this issue was to completely remove CUDA and reinstall it.

User data is not running at launch - AWS EC2

I am trying to launch an EC2 Linux instance (Amazon Linux 2 AMI) and, in the user data, install Node.js and git, clone my GitHub repo, and start the Node.js server.
I have checked everywhere, including the cloud-init log file, to find some error explaining why my user data is not working.
Here is the script:
#!/bin/bash
sudo yum update -y
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.32.0/install.sh | bash
. ~/.nvm/nvm.sh
nvm install 4.4.5
sudo yum upgrade
sudo yum install git -y
git clone https://github.com/myname/one_user.git
cd one_user
dnsaddress=$(curl -s http://169.254.169.254/latest/meta-data/public-hostname)
export dns_name=${dnsaddress}
npm install -y
node server.js
The output below is from the cloud-init log file.
Cloud-init v. 18.2-72.amzn2.0.6 running 'init-local' at Sun, 10 Feb 2019 15:49:35 +0000. Up 4.93 seconds.
Cloud-init v. 18.2-72.amzn2.0.6 running 'init' at Sun, 10 Feb 2019 15:49:38 +0000. Up 7.42 seconds.
ci-info: ++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: | eth0 | True | 10.0.1.72 | 255.255.255.0 | global | 0e:1f:76:6a:3c:6c |
ci-info: | eth0 | True | fe80::c1f:76ff:fe6a:3c6c/64 | . | link | 0e:1f:76:6a:3c:6c |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +--------+------+-----------------------------+---------------+--------+-------------------+
ci-info: ++++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++++
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.0.1.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.0.1.0 | 0.0.0.0 | 255.255.255.0 | eth0 | U |
ci-info: | 2 | 169.254.169.254 | 0.0.0.0 | 255.255.255.255 | eth0 | UH |
ci-info: +-------+-----------------+----------+-----------------+-----------+-------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | 9 | fe80::/64 | :: | eth0 | U |
ci-info: | 11 | local | :: | eth0 | U |
ci-info: | 12 | ff00::/8 | :: | eth0 | U |
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 18.2-72.amzn2.0.6 running 'modules:config' at Sun, 10 Feb 2019 15:49:39 +0000. Up 8.99 seconds.
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Existing lock /var/run/yum.pid: another copy is running as pid 3265.
Another app is currently holding the yum lock; waiting for it to exit...
The other application is: yum
Memory : 31 M RSS (321 MB VSZ)
Started: Sun Feb 10 15:49:38 2019 - 00:02 ago
State : Sleeping, pid: 3265
Another app is currently holding the yum lock; waiting for it to exit...
The other application is: yum
Memory : 70 M RSS (361 MB VSZ)
Started: Sun Feb 10 15:49:38 2019 - 00:04 ago
State : Running, pid: 3265
--> 1:openssl-libs-1.0.2k-16.amzn2.0.1.x86_64 from installed removed (updateinfo)
--> 1:openssl-1.0.2k-16.amzn2.0.1.x86_64 from installed removed (updateinfo)
--> 1:openssl-libs-1.0.2k-16.amzn2.0.2.x86_64 from amzn2-core removed (updateinfo)
--> 1:openssl-1.0.2k-16.amzn2.0.2.x86_64 from amzn2-core removed (updateinfo)
1 package(s) needed (+0 related) for security, out of 3 available
Resolving Dependencies
--> Running transaction check
---> Package kernel-tools.x86_64 0:4.14.88-88.76.amzn2 will be updated
---> Package kernel-tools.x86_64 0:4.14.94-89.73.amzn2 will be an update
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Updating:
kernel-tools x86_64 4.14.94-89.73.amzn2 amzn2-core 111 k
Transaction Summary
================================================================================
Upgrade 1 Package
Total download size: 111 k
Downloading packages:
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Updating : kernel-tools-4.14.94-89.73.amzn2.x86_64 1/2
Cleanup : kernel-tools-4.14.88-88.76.amzn2.x86_64 2/2
Verifying : kernel-tools-4.14.94-89.73.amzn2.x86_64 1/2
Verifying : kernel-tools-4.14.88-88.76.amzn2.x86_64 2/2
Updated:
kernel-tools.x86_64 0:4.14.94-89.73.amzn2
Complete!
Cloud-init v. 18.2-72.amzn2.0.6 running 'modules:final' at Sun, 10 Feb 2019 15:49:47 +0000. Up 16.22 seconds.
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Existing lock /var/run/yum.pid: another copy is running as pid 3324.
Another app is currently holding the yum lock; waiting for it to exit...
The other application is: yum
Memory : 54 M RSS (270 MB VSZ)
Started: Sun Feb 10 15:49:45 2019 - 00:03 ago
State : Running, pid: 3324
Resolving Dependencies
--> Running transaction check
---> Package kernel.x86_64 0:4.14.94-89.73.amzn2 will be installed
---> Package openssl.x86_64 1:1.0.2k-16.amzn2.0.1 will be updated
---> Package openssl.x86_64 1:1.0.2k-16.amzn2.0.2 will be an update
---> Package openssl-libs.x86_64 1:1.0.2k-16.amzn2.0.1 will be updated
---> Package openssl-libs.x86_64 1:1.0.2k-16.amzn2.0.2 will be an update
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
kernel x86_64 4.14.94-89.73.amzn2 amzn2-core 19 M
Updating:
openssl x86_64 1:1.0.2k-16.amzn2.0.2 amzn2-core 496 k
openssl-libs x86_64 1:1.0.2k-16.amzn2.0.2 amzn2-core 1.2 M
Transaction Summary
================================================================================
Install 1 Package
Upgrade 2 Packages
Total download size: 21 M
Downloading packages:
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
--------------------------------------------------------------------------------
Total 33 MB/s | 21 MB 00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Updating : 1:openssl-libs-1.0.2k-16.amzn2.0.2.x86_64 1/5
Updating : 1:openssl-1.0.2k-16.amzn2.0.2.x86_64 2/5
Installing : kernel-4.14.94-89.73.amzn2.x86_64 3/5
Cleanup : 1:openssl-1.0.2k-16.amzn2.0.1.x86_64 4/5
Cleanup : 1:openssl-libs-1.0.2k-16.amzn2.0.1.x86_64 5/5
Verifying : 1:openssl-libs-1.0.2k-16.amzn2.0.2.x86_64 1/5
Verifying : kernel-4.14.94-89.73.amzn2.x86_64 2/5
Verifying : 1:openssl-1.0.2k-16.amzn2.0.2.x86_64 3/5
Verifying : 1:openssl-libs-1.0.2k-16.amzn2.0.1.x86_64 4/5
Verifying : 1:openssl-1.0.2k-16.amzn2.0.1.x86_64 5/5
Installed:
kernel.x86_64 0:4.14.94-89.73.amzn2
Updated:
openssl.x86_64 1:1.0.2k-16.amzn2.0.2
openssl-libs.x86_64 1:1.0.2k-16.amzn2.0.2
Complete!
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 10007 100 10007 0 0 10007 0 0:00:01 --:--:-- 0:00:01 99079
=> Downloading nvm as script to '/.nvm'
=> Profile not found. Tried (as defined in $PROFILE), ~/.bashrc, ~/.bash_profile, ~/.zshrc, and ~/.profile.
=> Create one of them and run this script again
=> Create it (touch ) and run this script again
OR
=> Append the following lines to the correct file yourself:
export NVM_DIR="/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh" # This loads nvm
=> Close and reopen your terminal to start using nvm or run the following to use it now:
export NVM_DIR="/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh" # This loads nvm
/var/lib/cloud/instance/scripts/part-001: line 4: /root/.nvm/nvm.sh: No such file or directory
/var/lib/cloud/instance/scripts/part-001: line 5: nvm: command not found
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Existing lock /var/run/yum.pid: another copy is running as pid 11772.
Another app is currently holding the yum lock; waiting for it to exit...
The other application is: yum
Memory : 52 M RSS (268 MB VSZ)
Started: Sun Feb 10 15:50:08 2019 - 00:02 ago
State : Running, pid: 11772
No packages marked for update
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
Resolving Dependencies
--> Running transaction check
---> Package git.x86_64 0:2.17.2-2.amzn2 will be installed
--> Processing Dependency: perl-Git = 2.17.2-2.amzn2 for package: git-2.17.2-2.amzn2.x86_64
--> Processing Dependency: git-core-doc = 2.17.2-2.amzn2 for package: git-2.17.2-2.amzn2.x86_64
--> Processing Dependency: git-core = 2.17.2-2.amzn2 for package: git-2.17.2-2.amzn2.x86_64
--> Processing Dependency: emacs-filesystem >= 25.3 for package: git-2.17.2-2.amzn2.x86_64
--> Processing Dependency: perl(Term::ReadKey) for package: git-2.17.2-2.amzn2.x86_64
--> Processing Dependency: perl(Git::I18N) for package: git-2.17.2-2.amzn2.x86_64
--> Processing Dependency: perl(Git) for package: git-2.17.2-2.amzn2.x86_64
--> Processing Dependency: libsecret-1.so.0()(64bit) for package: git-2.17.2-2.amzn2.x86_64
--> Running transaction check
---> Package emacs-filesystem.noarch 1:25.3-3.amzn2.0.1 will be installed
---> Package git-core.x86_64 0:2.17.2-2.amzn2 will be installed
---> Package git-core-doc.noarch 0:2.17.2-2.amzn2 will be installed
---> Package libsecret.x86_64 0:0.18.5-2.amzn2.0.2 will be installed
---> Package perl-Git.noarch 0:2.17.2-2.amzn2 will be installed
--> Processing Dependency: perl(Error) for package: perl-Git-2.17.2-2.amzn2.noarch
---> Package perl-TermReadKey.x86_64 0:2.30-20.amzn2.0.2 will be installed
--> Running transaction check
---> Package perl-Error.noarch 1:0.17020-2.amzn2 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
git x86_64 2.17.2-2.amzn2 amzn2-core 217 k
Installing for dependencies:
emacs-filesystem noarch 1:25.3-3.amzn2.0.1 amzn2-core 64 k
git-core x86_64 2.17.2-2.amzn2 amzn2-core 4.0 M
git-core-doc noarch 2.17.2-2.amzn2 amzn2-core 2.3 M
libsecret x86_64 0.18.5-2.amzn2.0.2 amzn2-core 153 k
perl-Error noarch 1:0.17020-2.amzn2 amzn2-core 32 k
perl-Git noarch 2.17.2-2.amzn2 amzn2-core 70 k
perl-TermReadKey x86_64 2.30-20.amzn2.0.2 amzn2-core 31 k
Transaction Summary
================================================================================
Install 1 Package (+7 Dependent packages)
Total download size: 6.8 M
Installed size: 36 M
Downloading packages:
--------------------------------------------------------------------------------
Total 18 MB/s | 6.8 MB 00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : git-core-2.17.2-2.amzn2.x86_64 1/8
Installing : git-core-doc-2.17.2-2.amzn2.noarch 2/8
Installing : libsecret-0.18.5-2.amzn2.0.2.x86_64 3/8
Installing : 1:perl-Error-0.17020-2.amzn2.noarch 4/8
Installing : perl-TermReadKey-2.30-20.amzn2.0.2.x86_64 5/8
Installing : 1:emacs-filesystem-25.3-3.amzn2.0.1.noarch 6/8
Installing : perl-Git-2.17.2-2.amzn2.noarch 7/8
Installing : git-2.17.2-2.amzn2.x86_64 8/8
Verifying : 1:emacs-filesystem-25.3-3.amzn2.0.1.noarch 1/8
Verifying : perl-TermReadKey-2.30-20.amzn2.0.2.x86_64 2/8
Verifying : 1:perl-Error-0.17020-2.amzn2.noarch 3/8
Verifying : libsecret-0.18.5-2.amzn2.0.2.x86_64 4/8
Verifying : git-core-2.17.2-2.amzn2.x86_64 5/8
Verifying : git-2.17.2-2.amzn2.x86_64 6/8
Verifying : perl-Git-2.17.2-2.amzn2.noarch 7/8
Verifying : git-core-doc-2.17.2-2.amzn2.noarch 8/8
Installed:
git.x86_64 0:2.17.2-2.amzn2
Dependency Installed:
emacs-filesystem.noarch 1:25.3-3.amzn2.0.1
git-core.x86_64 0:2.17.2-2.amzn2
git-core-doc.noarch 0:2.17.2-2.amzn2
libsecret.x86_64 0:0.18.5-2.amzn2.0.2
perl-Error.noarch 1:0.17020-2.amzn2
perl-Git.noarch 0:2.17.2-2.amzn2
perl-TermReadKey.x86_64 0:2.30-20.amzn2.0.2
Complete!
Cloning into 'zero2architect'...
/var/lib/cloud/instance/scripts/part-001: line 16: npm: command not found
/var/lib/cloud/instance/scripts/part-001: line 17: node: command not found
Feb 10 15:50:16 cloud-init[3314]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [127]
Feb 10 15:50:16 cloud-init[3314]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Feb 10 15:50:16 cloud-init[3314]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Cloud-init v. 18.2-72.amzn2.0.6 finished at Sun, 10 Feb 2019 15:50:16 +0000. Datasource DataSourceEc2. Up 45.55 seconds
Have you checked your SG and NACL configurations? You need to allow port 80 (HTTP) and port 443 (HTTPS) in outbound SG rules (destination 0.0.0.0/0). You also need to allow the same in outbound NACL rules, plus ephemeral ports in inbound rules. Make sure that your instance is sitting in a public subnet (with a route to the IGW) and that it has a public/Elastic IPv4 attached to it.
If any one of those conditions is not met, your user data script will fail, since it needs a connection to the Internet (in your case).
If your instance is sitting in a private subnet, make sure that you are running a NAT gateway or NAT instance and that your route tables are properly configured, as well as your SGs and NACLs. Also make sure that the source/destination check is disabled on the NAT instance if you are using one.
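As a quick sanity check of outbound connectivity (generic commands, not specific to this setup), you can SSH into the instance and try reaching the endpoints the script depends on:
curl -sI https://raw.githubusercontent.com | head -n 1            # outbound HTTPS should return a status line
curl -s http://169.254.169.254/latest/meta-data/public-hostname   # metadata service responds even without Internet access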
UPDATE
Your log clearly says that you don't have npm and node installed, so you need to add installation steps to the script as well.
npm install -y
is not the right way to install npm. You can follow these steps to install both node and npm.
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.32.0/install.sh | bash
. ~/.nvm/nvm.sh
nvm install 4.4.5
This will install node version 4.4.5. If you want some other version of node installed, change the number to whatever supported version you need.
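One more caveat, inferred from your log rather than something I can verify: the nvm installer reported "Downloading nvm as script to '/.nvm'", while line 4 of your script looked for /root/.nvm/nvm.sh, which suggests HOME was not set the way an interactive shell would set it. Cloud-init runs user data as root with a minimal environment, so explicitly setting HOME before the nvm lines is a common workaround; a sketch:
#!/bin/bash
# cloud-init runs this as root with a minimal environment; make HOME explicit
export HOME=/root
curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.32.0/install.sh | bash
. "$HOME/.nvm/nvm.sh"   # load nvm into this non-interactive shell
nvm install 4.4.5       # installing node also provides npm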

CNTK on Azure Data Science VM

I have an N-Series Azure VM (the Data Science VM) with a Tesla K80 GPU. According to the NVIDIA scanner, my GPU driver is up to date.
When I run my CNTK BrainScript it says "No GPUs found" and runs in CPU mode. What can I do to troubleshoot?
requestnodes [MPIWrapper]: using 1 out of 1 MPI nodes on a single host (1 requested); we (0) are in (participating)
-------------------------------------------------------------------
Build info:
Built time: Dec 22 2016 01:43:24
Last modified date: Thu Dec 22 01:35:04 2016
Build type: Release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0
CUB_PATH: c:\src\cub-1.4.1
CUDNN_PATH: C:\local\cudnn-8.0-windows10-x64-v5.1
Build Branch: HEAD
Build SHA1: 8e8b5ff92eff4647be5d41a5a515956907567126
Built by svcphil on DPHAIM-24
Build Path: C:\jenkins\workspace\CNTK-Build-Windows\Source\CNTK\
-------------------------------------------------------------------
No GPUs found
Edit: here is the output from nvidia-smi.exe:
C:\Program Files\NVIDIA Corporation\NVSMI>.\nvidia-smi.exe
Fri Jan 13 19:00:43 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 369.30                 Driver Version: 369.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           TCC  | 0BD1:00:00.0     Off |                  Off |
| N/A   43C    P8    27W / 149W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           TCC  | 5871:00:00.0     Off |                  Off |
| N/A   35C    P8    34W / 149W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The Windows Data Science VM by default does not come with GPU drivers, CUDA, etc. We do have an extension called "Deep Learning toolkit for DSVM" that adds on the drivers, CUDA, and GPU editions of deep learning software like CNTK, TensorFlow, and MXNet.
More Info: http://aka.ms/dsvm/deeplearning
We also recently released an Ubuntu version of the DSVM with built-in CUDA, GPU drivers, and several more deep learning tools; it can be deployed on either GPU or CPU-only VMs on Azure.
Would it be possible for you to run the Python notebooks and see if you could run them with the device set to gpu(id)? Or, from an activated CNTK Python environment, you could try setting the device explicitly:
import cntk as C
from cntk.device import set_default_device, gpu
set_default_device(gpu(0))  # pin computation to the first GPU
This might give you some clues as to whether it is a BrainScript-specific issue.
Well, the Python script and BrainScript work now, after installing CUDA (I installed it to run nvidia-smi). I should not have assumed that the Azure Data Science image (which only works with an N-Series VM) has the necessary NVIDIA libraries pre-installed. :-)
