Docker newbie question here, so please be nice.
I know this might have been asked before, but I could not find anything related to nvidia-docker.
I completed the installation instructions in the official guide.
When I wanted to test Nvidia-docker:
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
I got this error:
(base) user@adminme:~$ docker run --gpus all --rm nvidia/cuda nvidia-smi
docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/create: dial unix /var/run/docker.sock: connect: permission denied.
See 'docker run --help'.
I found this answer here, but it felt a bit different from my case. I am very new to Docker and still learning; let me know what you think.
Here is some information about my remote Linux machine:
(base) user@adminme:~$ lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
02:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
nvidia-smi command:
(base) user@adminme:~$ nvidia-smi
Sun May 31 01:12:25 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:02:00.0 Off | N/A |
| 0% 33C P8 9W / 215W | 17MiB / 8116MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2545 G /usr/lib/xorg/Xorg 15MiB |
+-----------------------------------------------------------------------------+
Docker version:
(base) user@adminme:~$ docker --version
Docker version 19.03.10, build 9424aeaee9
The quick fix would be to run the container using sudo:
sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
If you want to run docker as a non-root user, then you need to add your user to the docker group.
Create the docker group if it does not exist
sudo groupadd docker
Add your user to the docker group.
sudo usermod -aG docker $USER
Run the following command, or log out and log back in (if that doesn't work, you may need to reboot your machine first):
newgrp docker
Check if docker can be run without root
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Ref:- https://docs.docker.com/engine/install/linux-postinstall/
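If you want to confirm the group change took effect before retrying, a quick sanity check (assuming a standard install where the daemon socket is owned by the docker group) is:
id -nG "$USER"                # "docker" should appear in the list after re-login
ls -l /var/run/docker.sock    # the socket should be owned by root:docker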
In addition to what nischay goyal answered, sometimes after adding the user to the docker group you have to do
su - ${USER}
in order to log out and log back in.
Running the latest docker with:
docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --notebook-dir=/tf --ip 0.0.0.0 --no-browser --allow-root --NotebookApp.allow_origin='https://colab.research.google.com'
code:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
gives me:
2020-07-27 19:44:03.826149: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-27 19:44:03.826179: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (-1)
2020-07-27 19:44:03.826201: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] no NVIDIA GPU device is present: /dev/nvidia0 does not exist
I'm on Pop_OS 20.04 and have tried installing the CUDA drivers from the Pop repository as well as from NVIDIA. No dice. Any help is appreciated.
Running
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
gives me:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 On | 00000000:09:00.0 On | N/A |
| 0% 52C P5 15W / 225W | 513MiB / 7959MiB | 17% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
As per the docs here and here, you have to add a "gpus" argument when creating the Docker container to have GPU support.
So you should start your container with something like this. The "--gpus all" makes all the GPUs on the host visible to the container.
docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --notebook-dir=/tf --ip 0.0.0.0 --no-browser --allow-root --NotebookApp.allow_origin='https://colab.research.google.com'
You can also try running nvidia-smi on the tensorflow image to quickly check whether the GPU is accessible from the container.
docker run -it --rm --gpus all tensorflow/tensorflow:latest-gpu-jupyter nvidia-smi
This returns the following in my case.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:07:00.0 On | N/A |
| 0% 45C P8 8W / 166W | 387MiB / 8116MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
As you can see, I'm running an older NVIDIA driver (440.100), so I cannot confirm that this would solve your problem. I'm also on Pop_OS 20.04 and didn't install anything other than Docker, its dependencies, and nvidia-container-toolkit.
I would also highly suggest avoiding the latest tag when creating containers, as it might cause you to unknowingly upgrade to a newer image. Go with version-numbered images.
For example tensorflow/tensorflow:2.3.0-gpu-jupyter.
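For instance, the same Jupyter run pinned to that fixed tag would look roughly like this (the tag is just an example; the remaining flags from the earlier command can be appended unchanged):
docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:2.3.0-gpu-jupyter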
I just installed Docker CE following the official instructions, using the repository, on Ubuntu 14.04.
The installation went successfully and the daemon is running:
$ ps aux | grep docker
[...] /usr/bin/dockerd --raw-logs [...]
My user is in the docker group:
$ groups
[...] docker
The CLI can't seem to communicate with the daemon (same with sudo):
$ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
Is the docker daemon running?
The socket seems to have the correct permissions:
$ ls -l /var/run/docker.sock
srw-rw---- 1 root docker 0 Feb 4 16:21 /var/run/docker.sock
The log does report some issues though:
$ sudo tail -f /var/log/upstart/docker.log
Failed to connect to containerd: failed to dial "/var/run/docker/containerd/docker-containerd.sock": dial unix:///var/run/docker/containerd/docker-containerd.sock: timeout
/var/run/docker.sock is up
time="2018-02-04T16:22:21.031459040+01:00" level=info msg="libcontainerd: started new docker-containerd process" pid=17147
INFO[0000] starting containerd module=containerd revision=89623f28b87a6004d4b785663257362d1658a729 version=v1.0.0
INFO[0000] setting subreaper... module=containerd
containerd: invalid argument
time="2018-02-04T16:22:21.056685023+01:00" level=error msg="containerd did not exit successfully" error="exit status 1" module=libcontainerd
Any advice to make this work?
Relogging and restarting Docker have already been done, of course.
As @bobbear suggested, and as is actually mentioned in the official docs, one of the prerequisites is:
Version 3.10 or higher of the Linux kernel. The latest version of the kernel available for your platform is recommended.
After having checked my Kernel version:
$ uname -a
Linux [...] 3.2.[...]-generic [...]-Ubuntu [...] x86_64
I searched for candidates:
$ apt-cache search linux-image
And installed my new_kernel:
$ sudo apt-get install \
linux-image-new_kernel \
linux-headers-new_kernel \
linux-image-extra-new_kernel
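After installing the new kernel, reboot into it and confirm you are now above the minimum (a sketch; the exact package names above depend on the kernel you picked):
sudo reboot
# once back up:
uname -r    # should now report 3.10 or newer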
The same situation happened to me. It is because your Linux kernel version is too low! Check it with the command "uname -r": if the version is below 3.10 (for example, Debian 7 Wheezy defaults to 3.2), then even if you install docker-ce successfully, you will still not be able to start the Docker daemon. That's why! Most answers on the web just tell you to 'restart' and so on, but they did not consider this problem.
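In other words, a quick check before reinstalling anything (just the check described above):
uname -r    # anything below 3.10 (e.g. the 3.2 kernel on Debian 7 Wheezy) cannot run the Docker daemon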
I have an N-Series Azure VM (the Data Science VM) with a Tesla K80 GPU. According to the NVIDIA scanner, my GPU driver is up to date.
When I run my CNTK Brainscript it says "No GPUs Found" and runs in CPU mode. What can I do to troubleshoot?
requestnodes [MPIWrapper]: using 1 out of 1 MPI nodes on a single host (1 requested); we (0) are in (participating)
-------------------------------------------------------------------
Build info:
Built time: Dec 22 2016 01:43:24
Last modified date: Thu Dec 22 01:35:04 2016
Build type: Release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0
CUB_PATH: c:\src\cub-1.4.1
CUDNN_PATH: C:\local\cudnn-8.0-windows10-x64-v5.1
Build Branch: HEAD
Build SHA1: 8e8b5ff92eff4647be5d41a5a515956907567126
Built by svcphil on DPHAIM-24
Build Path: C:\jenkins\workspace\CNTK-Build-Windows\Source\CNTK\
-------------------------------------------------------------------
No GPUs found
Edit: here is the output from nvidia-smi.exe:
C:\Program Files\NVIDIA Corporation\NVSMI>.\nvidia-smi.exe
Fri Jan 13 19:00:43 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 369.30 Driver Version: 369.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 TCC | 0BD1:00:00.0 Off | Off |
| N/A 43C P8 27W / 149W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 TCC | 5871:00:00.0 Off | Off |
| N/A 35C P8 34W / 149W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The Windows Data Science VM by default does not come with the GPU drivers, CUDA, etc. We do have an extension called "Deep Learning toolkit for DSVM" that adds drivers, CUDA, and GPU editions of deep learning software like CNTK, TensorFlow, and MXNet.
More Info: http://aka.ms/dsvm/deeplearning
We also recently released an Ubuntu version of the DSVM with built-in CUDA, GPU drivers, and several more deep learning tools; it can be deployed on either GPU or CPU-only VMs on Azure.
Would it be possible for you to run the Python notebooks and see if you could run them with the device set to gpu(id)? Or, from an activated CNTK Python environment, you could try setting the device explicitly:
import cntk as C
from cntk.device import set_default_device, gpu
C.device.set_default_device(C.device.gpu(0))
This might give you some clues whether it is Brainscript specific issue.
Well, the Python script and BrainScript work now, after installing CUDA (I installed it to run nvidia-smi). I should not have assumed that the Azure Data Science image (which only works with an N-Series VM) has the necessary NVIDIA libraries pre-installed. :-)
I have an application that I eventually want to run on a cloud computing service (e.g., such as AWS or Google Cloud) packaged inside a docker image. The reason the application will need to run in the cloud is because it's designed to process large data files, but before I actually deploy, I'd like to test it first on a local laptop, using a single large data file that I've stored (for test and development purposes) on an external USB drive.
My development machine is an OSX laptop, and I'm using a recent version of docker:
stachyra> uname -a
Darwin Andrews-MacBook-Pro-76.local 14.5.0 Darwin Kernel Version 14.5.0: Tue Sep 1 21:23:09 PDT 2015; root:xnu-2782.50.1~1/RELEASE_X86_64 x86_64
stachyra> docker --version
Docker version 1.10.2, build c3959b1
OSX has mounted my external USB drive, device /dev/disk2s2, as /Volumes/MGR DATA:
stachyra> df
Filesystem 512-blocks Used Available Capacity iused ifree %iused Mounted on
/dev/disk1 974770480 435721376 538537104 45% 54529170 67317138 45% /
devfs 375 375 0 100% 650 0 100% /dev
map -hosts 0 0 0 100% 0 0 100% /net
map auto_home 0 0 0 100% 0 0 100% /home
/dev/disk2s2 3906291632 3869523640 36767992 100% 483690453 4595999 99% /Volumes/MGR DATA
/dev/disk3s1 196608 193160 3448 99% 24143 431 98% /Volumes/VirtualBox
stachyra> diskutil list
/dev/disk0
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *500.3 GB disk0
1: EFI EFI 209.7 MB disk0s1
2: Apple_CoreStorage 499.4 GB disk0s2
3: Apple_Boot Recovery HD 650.0 MB disk0s3
/dev/disk1
#: TYPE NAME SIZE IDENTIFIER
0: Apple_HFS Macintosh HD *499.1 GB disk1
Logical Volume on disk0s2
DB70B91A-3B57-4C82-A758-C4BDEA4160FD
Unlocked Encrypted
/dev/disk2
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *2.0 TB disk2
1: EFI EFI 209.7 MB disk2s1
2: Apple_HFS MGR DATA 2.0 TB disk2s2
/dev/disk3
#: TYPE NAME SIZE IDENTIFIER
0: GUID_partition_scheme *100.7 MB disk3
1: Apple_HFS VirtualBox 100.7 MB disk3s1
It should also be noted that the drive has several directories and files which are visible, at least when viewed directly through OSX:
stachyra> ls -l /Volumes/MGR\ DATA
total 0
drwxr-xr-x 6 stachyra staff 204 Apr 14 2015 1000genomes
drwxr-xr-x 5 stachyra staff 170 Oct 12 17:41 GIAB
drwxr-xr-x 4 stachyra staff 136 Apr 28 2015 genome_browser_tracks
drwxr-xr-x 24 stachyra staff 816 Oct 6 14:00 mitty
I have tried to follow the advice from this question, which describes how to mount a USB drive in docker when docker is running within a linux host. But my local laptop is OSX, not linux, so it doesn't seem to work.
Explicitly, when attempting to follow the advice of the accepted answer, I obtain the following result:
stachyra> docker run -i -t --privileged -v /dev/disk2s2:/dev/foo ubuntu bash
root@8da7b492a707:/# uname -a
Linux 8da7b492a707 4.1.18-boot2docker #1 SMP Sat Feb 20 08:24:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@8da7b492a707:/# ls -l /dev/foo
total 0
root@8da7b492a707:/#
Based upon the response, one can see that Docker does indeed launch a Linux container correctly, and it also creates a volume /dev/foo inside the container as requested, but the actual contents of the USB drive are not accessible via that location: the ls -l command claims there are no files or directories there.
I also tried the second method described in an alternate response to the same question, and that fails even worse:
stachyra> docker run -i -t --device=/dev/disk2s2 ubuntu bash
docker: Error response from daemon: error gathering device information while adding custom device "/dev/disk2s2": not a device node.
stachyra>
I have found another discussion thread on stackoverflow which suggests that raw USB access is handled quite differently in OSX than in linux, which I suspect is probably the reason why both of the above attempts at USB access are failing.
But, what should I actually do about it? That is to say, what is the correct sequence of actions or commands to allow docker to access a USB device mounted on an OSX host, rather than linux?
I was finally able to access my USB drive from /var/media inside my container by using the machine-diskutil.sh script mentioned in warmoverflow's comment, like so:
machine-diskutil.sh mount my-machine-name /Volumes/my-usb-drive
and then starting the container like so:
docker run -v /Volumes/my-usb-drive:/var/media -it my/image:latest bash
Because I had previously tried to add /Volumes/my-usb-drive as a shared folder manually in VirtualBox, I first got this error:
Error: The shared folder /Volumes/Seagate already exists on the
docker machine, please unmount it first.
So I removed it manually and re-ran the machine-diskutil.sh mount command without any problems. Great stuff!
As per @pgayvallet's comment on GitHub:
As the daemon runs inside a VM in Docker Desktop, it is not possible to actually share a mac host device with the container inside the VM, and this will most definitely never be possible.
When I start any container with docker run, a new veth interface is created. After deleting the container, the veth interface that was linked to it should be removed. However, sometimes this fails (often when the container started with errors):
root@hostname /home # ifconfig | grep veth | wc -l
53
root@hostname /home # docker run -d -P axibase/atsd -name axibase-atsd-
28381035d1ae2800dea51474c4dee9525f56c2347b1583f56131d8a23451a84e
Error response from daemon: Cannot start container 28381035d1ae2800dea51474c4dee9525f56c2347b1583f56131d8a23451a84e: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 33359 -j DNAT --to-destination 172.17.2.136:8883 ! -i docker0: iptables: No chain/target/match by that name.
(exit status 1)
root@hostname /home # ifconfig | grep veth | wc -l
55
root@hostname /home # docker rm -f 2838
2838
root@hostname /home # ifconfig | grep veth | wc -l
55
How can I identify which interfaces are linked to existing containers, and how can I remove the extra interfaces that were linked to removed containers?
This approach doesn't work (as root):
ifconfig veth55d245e down
brctl delbr veth55d245e
can't delete bridge veth55d245e: Operation not permitted
For now I identify extra interfaces by transmitted traffic (if there is no activity, it's an extra interface).
UPDATE
root@hostname ~ # uname -a
Linux hostname 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
root@hostname ~ # docker info
Containers: 10
Images: 273
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 502
Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-53-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 8
Total Memory: 47.16 GiB
Name: hostname
ID: 3SQM:44OG:77HJ:GBAU:2OWZ:C5CN:UWDV:JHRZ:LM7L:FJUN:AGUQ:HFAL
WARNING: No swap limit support
root@hostname ~ # docker version
Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64
There are three problems here:
Starting a single container should not increase the count of veth interfaces on your system by 2, because when Docker creates a veth pair, one end of the pair is isolated in the container namespace and is not visible from the host.
It looks like you're not able to start a container:
Error response from daemon: Cannot start container ...
Docker should be cleaning up the veth interfaces automatically.
These facts make me suspect that there is something fundamentally wrong in your environment. Can you update your question with details about what distribution you're using, which kernel version, and which Docker version?
How can I identify which interfaces are linked to existing containers, and how can I remove the extra interfaces that were linked to removed containers?
With respect to manually deleting veth interfaces: A veth interface isn't a bridge, so of course you can't delete one with brctl.
To delete a veth interface:
# ip link delete <ifname>
Detecting "idle" interfaces is a thornier problem, because if you just look at traffic you're liable to accidentally delete something that was still in use but that just wasn't seeing much activity.
I think what you would actually want to look for are veth interfaces whose peer is also visible in the global network namespace. You can find the peer of a veth interface using these instructions, and then it would be a simple matter of seeing if that interface is visible, and then deleting one or the other (deleting a veth interface will also remove its peer).
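A rough sketch of that check (veth55d245e is just the interface name from the question; ethtool's peer_ifindex statistic reports the interface index of the veth peer, which you can then look up on the host):
ethtool -S veth55d245e | grep peer_ifindex
# if the reported index shows up in the host's own interface list,
# both ends of the pair are in the global namespace:
ip -o link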
Fixed by upgrading Docker to the latest version.
New version:
root@hostname ~ # docker version
Client:
Version: 1.8.1
API version: 1.20
Go version: go1.4.2
Git commit: d12ea79
Built: Thu Aug 13 02:35:49 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.8.1
API version: 1.20
Go version: go1.4.2
Git commit: d12ea79
Built: Thu Aug 13 02:35:49 UTC 2015
OS/Arch: linux/amd64
Now the interfaces are removed together with their containers. Old orphaned interfaces were deleted manually with the following command:
# ip link delete <ifname>
Here is how you can delete them all at once by pattern:
for name in $(ifconfig -a | sed 's/[ \t].*//;/^\(lo\|\)$/d' | grep veth)
do
    echo "$name"
    # ip link delete "$name"   # uncomment this to actually delete the interface
done
In my case, all the virtual Ethernet network interfaces had been created by Docker. To solve that, I stopped all running Docker containers:
docker stop $(docker ps -q)
And then deleted all the networks created by Docker:
docker network rm $(docker network ls -q)
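Note that the predefined bridge, host and none networks cannot be removed, so that command will print a couple of errors while still removing the rest. If your Docker version supports the type filter, a slightly cleaner variant is:
docker network rm $(docker network ls -q --filter type=custom)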