Dockerized Tensorflow script can't see GPUs - python-3.x

I use Docker containers to train deep learning models. The containers run on a Linux server with several GPUs, where the models are trained.
The problem is that TensorFlow does not recognize the GPUs inside the container. The Dockerfile looks like this:
FROM nvidia/cuda:10.2-runtime-ubuntu18.04
RUN apt-get update && apt-get install -y apt-utils
RUN apt-get install -y \
git \
pkg-config \
python3-pip \
python3.6 \
nano \
wget \
yasm
FROM python:3.6
COPY requirements.txt ./
# Here tensorflow-gpu == 2.1 is installed
RUN pip install --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
COPY . /
ENTRYPOINT ["python", "./main.py"]
If I comment out the Python-specific lines in the Dockerfile and replace them with CMD ["nvidia-smi"], the GPUs are visible inside the container. So the only remaining question is how TensorFlow can be made to detect the GPUs.
In the Python code the GPUs are set up as follows:
physical_devices = tf.config.experimental.list_physical_devices('GPU')
for physical_device in physical_devices:
    tf.config.experimental.set_memory_growth(physical_device, True)
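For context, the second FROM python:3.6 line starts a new build stage, so the final image is based on the plain Python image and no longer contains the CUDA layers from nvidia/cuda. A single-stage sketch that keeps the CUDA base might look like the following; it is only an illustration, and since tensorflow-gpu 2.1 is built against CUDA 10.1 with cuDNN 7, a cudnn runtime tag is probably the better base:
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
# install Python directly in the CUDA image instead of switching to python:3.6
RUN apt-get update && apt-get install -y \
    git \
    pkg-config \
    python3.6 \
    python3-pip \
    nano \
    wget \
    yasm
COPY requirements.txt ./
# tensorflow-gpu == 2.1 is assumed to come from requirements.txt, as in the question
RUN pip3 install --upgrade pip && \
    pip3 install --no-cache-dir -r requirements.txt
COPY . /
ENTRYPOINT ["python3", "./main.py"]
With a single stage, TensorFlow sees the same CUDA runtime that nvidia-smi sees, provided the container is started with GPU access (for example --gpus all or the nvidia runtime).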

Related

Install specific version of python in docker

I try to run a container from the image nvcr.io/nvidia/tensorflow:22.08-tf2-py3, but I have a problem.
The built Docker image contains Python 3.8, and I don't understand why, since version 3.8 is not explicitly specified in the Dockerfile. The libraries I need require Python >= 3.10 to work correctly. When I try to install another version:
RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa && apt-get install -y python3.11
RUN python3.11 -m pip install --upgrade --no-cache -r requirements.txt
I get the error /usr/bin/python3.11: No module named pip during the image build.
How can I correctly install a specific version of Python in my Docker image using the Dockerfile?
You got Python 3.11 all right, but you are missing the pip module. Add python-pip or python3-pip to the list of packages you install with apt-get.
This is regarding your problem with pip for Python 3.11, not the TensorFlow version support.
Looking at the official python:3.11 Dockerfile here, you should be able to use the code below.
# if this is called "PIP_VERSION", pip explodes with "ValueError: invalid truth value '<VERSION>'"
ENV PYTHON_PIP_VERSION 22.3
# https://github.com/docker-library/python/issues/365
ENV PYTHON_SETUPTOOLS_VERSION 65.5.0
# https://github.com/pypa/get-pip
ENV PYTHON_GET_PIP_URL https://github.com/pypa/get-pip/raw/66030fa03382b4914d4c4d0896961a0bdeeeb274/public/get-pip.py
ENV PYTHON_GET_PIP_SHA256 1e501cf004eac1b7eb1f97266d28f995ae835d30250bec7f8850562703067dc6
RUN set -eux; \
\
wget -O get-pip.py "$PYTHON_GET_PIP_URL"; \
echo "$PYTHON_GET_PIP_SHA256 *get-pip.py" | sha256sum -c -; \
\
export PYTHONDONTWRITEBYTECODE=1; \
\
python get-pip.py \
--disable-pip-version-check \
--no-cache-dir \
--no-compile \
"pip==$PYTHON_PIP_VERSION" \
"setuptools==$PYTHON_SETUPTOOLS_VERSION" \
; \
rm -f get-pip.py; \
\
pip --version
I was able to install pip for my python3.11 on Fedora (outside Docker) using python3 -m ensurepip (taken from here).
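Putting the suggestions together, a minimal Dockerfile sketch for the Ubuntu-based NGC image, assuming the deadsnakes PPA from the question and the get-pip.py bootstrap script, could look like this:
FROM nvcr.io/nvidia/tensorflow:22.08-tf2-py3
# add the deadsnakes PPA and install Python 3.11 plus distutils
RUN apt-get update && \
    apt-get install -y software-properties-common wget && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y python3.11 python3.11-distutils
# the deadsnakes package does not ship pip, so bootstrap it for the new interpreter
RUN wget -O get-pip.py https://bootstrap.pypa.io/get-pip.py && \
    python3.11 get-pip.py && \
    rm -f get-pip.py
COPY requirements.txt ./
RUN python3.11 -m pip install --upgrade --no-cache-dir -r requirements.txt
Note that this only changes which interpreter your own code runs on; the TensorFlow build bundled in the base image is still tied to the image's default Python.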

Docker make Nvidia GPUs visible during docker build process

I want to build a Docker image in which I compile custom kernels with PyTorch. Therefore I need access to the available GPUs in order to compile the custom kernels during the docker build process. On the host machine everything is set up, including nvidia-container-runtime, nvidia-docker, the Nvidia drivers, CUDA, etc. The following command shows the Docker runtime information on the host system:
$ docker info|grep -i runtime
Runtimes: nvidia runc
Default Runtime: runc
As you can see, the default Docker runtime in my case is runc. I think changing the default runtime from runc to nvidia would solve this problem, as noted here.
The proposed solution doesn't work in my case because:
I have no permissions to change the default runtime on system I use
I have no permissions to make changes to the daemon.json file
Is there a way to get access to the GPUs during the build process in the Dockerfile, in order to compile custom PyTorch kernels for CPU and GPU (in my case DCNv2)?
Here is the minimal example of my Dockerfile to reproduce this problem. In this image, DCNv2 is only compiled for CPU and not for GPU.
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata && \
apt-get install -y --no-install-recommends software-properties-common && \
add-apt-repository ppa:deadsnakes/ppa && \
apt update && \
apt install -y --no-install-recommends python3.6 && \
apt-get install -y --no-install-recommends \
build-essential \
python3.6-dev \
python3-pip \
python3.6-tk \
pkg-config \
software-properties-common \
git
RUN ln -s /usr/bin/python3 /usr/bin/python && \
ln -s /usr/bin/pip3 /usr/bin/pip
RUN python -m pip install --no-cache-dir --upgrade pip setuptools && \
python -m pip install --no-cache-dir torch==1.4.0 torchvision==0.5.0
RUN git clone https://github.com/CharlesShang/DCNv2/
#Compile DCNv2
WORKDIR /DCNv2
RUN bash ./make.sh
# clean up
RUN apt-get clean && \
rm -rf /var/lib/apt/lists/*
#Build: docker build -t my_image .
#Run: docker run -it my_image
A non-optimal solution which worked would be the following:
Comment out line RUN bash ./make.sh in Dockerfile
Build image: docker build -t my_image .
Run image in interactive mode: docker run --gpus all -it my_image
Compile DCNv2 manually: root@1cd02fd62461:/DCNv2# ./make.sh
Here DCNv2 is compiled for CPU and GPU, but that does not seem like an ideal solution to me, because I must compile DCNv2 every time I start the container.
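One way to avoid recompiling on every start, sketched here as a workaround rather than a proper fix (container and tag names are just examples), is to compile once in a running container and persist the result with docker commit:
# build the image with the "RUN bash ./make.sh" line commented out
docker build -t my_image .
# run it once with GPU access and compile DCNv2 inside the container
docker run --gpus all --name dcnv2_build my_image bash -c "cd /DCNv2 && bash ./make.sh"
# snapshot the container, including the compiled extension, as a reusable image
docker commit dcnv2_build my_image:compiled
docker rm dcnv2_build
The committed image still has to be started with --gpus all (or the nvidia runtime), but the compiled DCNv2 build is baked in, so the manual compile step is only needed once.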

docker ERROR: Could not find a version that satisfies the requirement apturl==0.5.2

I am using Windows 10. I want to build a Linux-based container so I can replicate code and dependencies developed on Ubuntu. When I try to build it, it outputs the error message above.
From my understanding, Docker Desktop runs a Linux kernel under the hood, which allows Windows users to run Linux-based containers, so I am not sure why it outputs this error.
My Dockerfile looks like this:
FROM ubuntu:18.04
ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"
RUN apt update \
&& apt install -y htop python3-dev wget
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
&& mkdir root/.conda \
&& sh Miniconda3-latest-Linux-x86_64.sh -b \
&& rm -f Miniconda3-latest-Linux-x86_64.sh
RUN conda create -y -n ml python=3.7
COPY . src/
RUN /bin/bash -c "cd src \
&& source activate ml \
&& pip install -r requirements.txt"
requirements.txt contains:
apturl==0.5.2
asn1crypto==0.24.0
bleach==2.1.2
Brlapi==0.6.6
certifi==2020.11.8
chardet==3.0.4
click==7.1.2
command-not-found==0.3
configparser==5.0.1
cryptography==2.1.4
cupshelpers==1.0
dataclasses==0.7
When I run the docker build command it outputs:
1.649 ERROR: Could not find a version that satisfies the requirement apturl==0.5.2
1.649 ERROR: No matching distribution found for apturl==0.5.2
Deleting that entry and running the build again leads to another error. All the errors seem to be associated with Ubuntu packages.
Am I not running an Ubuntu container? Why am I not allowed to install Ubuntu packages?
Thanks!
You are trying to install Ubuntu packages with pip (which is for Python packages).
Try apt install -y apturl.
If you want to install Python packages, use pip install package_name.
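In practice that means removing the Ubuntu-only entries (apturl, Brlapi, command-not-found, cupshelpers, ...) from requirements.txt and installing them with apt in the Dockerfile instead. A rough sketch, with package names taken from the question (the exact apt package names may differ):
# system packages that come from the Ubuntu archive, not from PyPI
RUN apt update \
    && apt install -y apturl command-not-found
# requirements.txt now only lists packages that exist on PyPI
RUN /bin/bash -c "cd src \
    && source activate ml \
    && pip install -r requirements.txt"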

Model training using Azure Container Instance with GPU much slower than local test with same container

I am trying to train a YOLO computer vision model using a container I built which includes an installation of Darknet. The container uses the Nvidia-supplied base image: nvcr.io/nvidia/cuda:9.0-devel-ubuntu16.04
Using nvidia-docker on my local machine with a GTX 1080 Ti, training runs very fast; however, the same container running as an Azure Container Instance with a P100 GPU trains very slowly. It's almost as if it's not utilizing the GPU. I also noticed that the nvidia-smi command does not work in the container running in Azure, but it does work when I ssh into the container running locally on my machine.
Here is the Dockerfile I am using:
FROM nvcr.io/nvidia/cuda:9.0-devel-ubuntu16.04
LABEL maintainer="alex.c.schultz@gmail.com" \
description="Pre-Configured Darknet Machine Learning Environment" \
version=1.0
# Container Dependency Setup
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get install software-properties-common -y
RUN apt-get install vim -y
RUN apt-get install dos2unix -y
RUN apt-get install git -y
RUN apt-get install wget -y
RUN apt-get install python3-pip -y
RUN apt-get install libopencv-dev -y
# setup virtual environment
WORKDIR /
RUN pip3 install virtualenv
RUN virtualenv venv
WORKDIR venv
RUN mkdir notebooks
RUN mkdir data
RUN mkdir output
# Install Darknet
WORKDIR /venv
RUN git clone https://github.com/AlexeyAB/darknet
RUN sed -i 's/GPU=0/GPU=1/g' darknet/Makefile
RUN sed -i 's/OPENCV=0/OPENCV=1/g' darknet/Makefile
WORKDIR /venv/darknet
RUN make
# Install common pip packages
WORKDIR /venv
COPY requirements.txt ./
RUN . /venv/bin/activate && pip install -r requirements.txt
# Setup Environment
EXPOSE 8888
VOLUME ["/venv/notebooks", "/venv/data", "/venv/output"]
CMD . /venv/bin/activate && jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root
The requirements.txt file is as shown below:
jupyter
matplotlib
numpy
opencv-python
scipy
pandas
sklearn
The issue was that my training data was on an Azure File Share volume, and the network latency was making the training slow. I copied the data from the share into the container, pointed the training at the local copy, and everything ran much faster.
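For example, something along these lines inside the running container copies the dataset off the mounted share before training starts (the share mount point and data directory names here are hypothetical):
# copy the training data from the mounted Azure File Share into container-local storage
cp -r /mnt/azurefileshare/yolo-data /venv/data/
# then point the darknet training command at /venv/data/yolo-data instead of the share path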

How can I Dockerize a Python script which contains Spark dependencies?

I have a Python file in which I import Spark libraries.
When I build it with the Dockerfile, it gives me an error that 'JAVA_HOME' is not set.
I tried to install Java through the Dockerfile, but that gives an error as well.
Below is the Dockerfile I tried to execute.
FROM python:3.6.4
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y software-properties-common && \
add-apt-repository ppa:webupd8team/java -y && \
apt-get update && \
echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections && \
apt-get install -y oracle-java8-installer && \
apt-get clean
ENV JAVA_HOME /usr/lib/jvm/java-8-oracle
ADD Samplespark.py /
COPY Samplespark.py /opt/ml/Samplespark.py
RUN pip install pandas
RUN pip install numpy
RUN pip install pyspark
RUN pip install sklearn
RUN pip install sagemaker_pyspark
RUN pip install sagemaker
CMD [ "python", "./Samplespark.py" ]
ENTRYPOINT ["python","/opt/ml/Samplespark.py"]
Please help me to install the Java dependencies for PySpark in Docker.
The python:3.6.4 image is based on Debian, not Ubuntu, and those PPAs are for Ubuntu. According to this article, Oracle Java 8 is not available in Debian due to licensing issues.
You have the following options:
1. Use an Ubuntu Docker image which comes with Oracle Java 8 preinstalled, like this one
2. Follow this tutorial on how to install Oracle Java 8 on Debian Jessie
3. Install OpenJDK: sudo apt-get install openjdk-8-jre
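For option 3, a minimal sketch on the Debian-based python:3.6.4 image; the JAVA_HOME path is the standard Debian location for OpenJDK 8, and on older Debian Jessie the openjdk-8 packages may need to come from jessie-backports:
FROM python:3.6.4
# install OpenJDK 8 from the Debian repositories instead of the Oracle PPA
RUN apt-get update && \
    apt-get install -y openjdk-8-jre-headless && \
    apt-get clean
# PySpark only needs a Java runtime; point JAVA_HOME at the OpenJDK install
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
COPY Samplespark.py /opt/ml/Samplespark.py
RUN pip install --no-cache-dir pandas numpy pyspark sklearn sagemaker_pyspark sagemaker
ENTRYPOINT ["python", "/opt/ml/Samplespark.py"]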
