PyTorch not detecting multiple GPUs

I have 10 GPUs available and 1 GPU (e.g. GPU#9) is in use by another torch process. I would like to run another process on any of the remaining GPUs (e.g. GPU#2, GPU#3, GPU#4) but I always get the error message:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
I tried several options and none of them worked:
OPTION 1: Selecting GPUs in the Python script
import os
# Set these before the first CUDA call, otherwise they are ignored
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4"
import torch
print(f"[INFO] Using GPU: {torch.cuda.current_device()}")
print(f"[INFO] Available GPUs: {torch.cuda.device_count()}")
for d in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(d))
It recognises the 3 GPUs, and the process now numbers them GPU#0, GPU#1 and GPU#2 internally.
[INFO] Using GPU: 2,3,4
[INFO] Available GPUs: 10
GeForce GTX 1080 Ti
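The renumbering described above can be illustrated with a small sketch. It does not touch CUDA at all: visible_device_map is a hypothetical helper that only parses the environment variable in the way the question describes (logical index 0 is the first physical ID listed). It is a simplification and does not reproduce every parsing rule of the CUDA runtime. The runtime reads CUDA_VISIBLE_DEVICES when it initialises, so the variable has to be set before the first CUDA call in the process.

```python
import os


def visible_device_map(env=None):
    """Map in-process CUDA indices to the physical IDs in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration; it only parses the variable.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # variable unset: all GPUs visible, numbering unchanged
    ids = [tok.strip() for tok in raw.split(",") if tok.strip()]
    return {logical: physical for logical, physical in enumerate(ids)}


# Set the variables before anything initialises CUDA in this process.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3,4"

mapping = visible_device_map()
print(mapping)  # {0: '2', 1: '3', 2: '4'}
```

If torch.cuda.device_count() still reports 10 after this, one possible explanation (an assumption, not confirmed by the question) is that CUDA was already initialised before the variables were set.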
OPTION 2: Selecting GPU on command line:
CUDA_VISIBLE_DEVICES=2,3,4 python gdxray_cganTrainer.py
and I check on Linux using
env | grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=2,3,4
Neither of the two options works: I still get the "CUDA-capable devices are busy or unavailable" error. It looks as if the new process knows that the process already running on GPU#9 was assigned GPU#0, and when I select GPU#2,3,4 the first of them is also assigned GPU#0 in the new process. But that shouldn't matter, as they are different processes.
I am using torch 1.8.0, python 3.8.8, cudatoolkit 11.1.1
Any help?

Related

WineBottler returning error message every time I try to open an .exe file on Mac

I'm trying to open a plugin for SNAP (to process Sentinel-3 imagery) on my Mac - the plugin downloads as an .exe file which means I need to open it using WineBottler. Every time I try and open the file however, I get this error message:
###BOTTLING### default.sh
/var/folders/rz/rr6ytzhx5gl60f1v1tbc67xm0000gn/T/AppTranslocation/6CDA1855-FA78-4A2A-A976-2C1A539F36ED/d/WineBottler.app/Contents/Frameworks/WBottler.framework/Resources/bottler.sh: line 39: /Applications/Wine.app/Contents/Resources/bin/wine: Bad CPU type in executable
###BOTTLING### Gathering debug Info...
Versions
OS...........................: darwin21
Wine.........................:
WineBottler..................: 1.8.6
Wineticks....................: 20220411-next - sha256sum: b6370f13c4dc410023f2a4e4e9a4385d2a0420031666c2f30befccc9b39c8f65
Environment
PWD..........................: '/Applications/Wine.app/Contents/Resources/bin'
PATH.........................: /Applications/Wine.app/Contents/Resources/bin:/usr/bin:/bin:/usr/sbin:/sbin
USER.........................: hannah
HOME.........................: /Users/hannah
COMPUTERNAME.................: hannah’s MacBook Air
BUNDLERESOURCEPATH...........: /var/folders/rz/rr6ytzhx5gl60f1v1tbc67xm0000gn/T/AppTranslocation/6CDA1855-FA78-4A2A-A976-2C1A539F36ED/d/WineBottler.app/Contents/Frameworks/WBottler.framework/Resources
WINEPREFIX...................: /Applications/Wine.app/Contents/Resources
WINEPATH.....................: /Applications/Wine.app/Contents/Resources/bin
LD_LIBRARY_PATH..............: /Applications/Wine.app/Contents/Resources/lib:/opt/X11/lib:/usr/X11/lib
DYLD_FALLBACK_LIBRARY_PATH...: /Applications/Wine.app/Contents/Resources/lib:/usr/lib:/opt/X11/lib:/usr/X11/lib
SILENT.......................:
http_proxy...................:
https_proxy..................:
ftp_proxy....................:
socks5_proxy.................:
Bottle
TEMPLATE.....................:
BOTTLE.......................: /Users/hannah/Desktop/Untitled.app
INSTALLER_URL................: /Users/hannah/Desktop/iCOR_Setup_3.0.0.exe
INSTALLER_IS_ZIPPED..........: 0
INSTALLER_NAME...............: iCOR_Setup_3.0.0.exe
INSTALLER_ARGUMENTS..........:
REMOVE_MONO..................:
REMOVE_GECKO.................:
REMOVE_USERS.................:
REMOVE_INSTALLERS............:
WINETRICKS_ITEMS.............: winxp
DLL_OVERRIDES................:
EXECUTABLE_PATH..............: winefile
EXECUTABLE_ARGUMENTS.........:
EXECUTABLE_VERSION...........: 1.0.0
BUNDLE_COPYRIGHT.............: © Your Company
BUNDLE_IDENTIFIER............: com.yourcompany.yourapp
BUNDLE_CATEGORYTYPE..........: public.app-category.business
SILENT.......................:
Hardware:
Hardware Overview:
Model Name: MacBook Air
Model Identifier: MacBookAir7,2
Processor Name: Dual-Core Intel Core i5
Processor Speed: 1.6 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Hyper-Threading Technology: Enabled
Memory: 4 GB
System Firmware Version: 476.0.0.0.0
OS Loader Version: 540.120.3~22
SMC Version (system): 2.27f2
Serial Number (system): C02QM1XWG941
Hardware UUID: EE27242F-C2B2-59E6-AAED-D598D1D61044
Provisioning UDID: EE27242F-C2B2-59E6-AAED-D598D1D61044
###BOTTLING### Create .app...
###BOTTLING### Enabling CoreAudio, Colors, Antialiasing and flat menus...
/var/folders/rz/rr6ytzhx5gl60f1v1tbc67xm0000gn/T/AppTranslocation/6CDA1855-FA78-4A2A-A976-2C1A539F36ED/d/WineBottler.app/Contents/Frameworks/WBottler.framework/Resources/bottler.sh: line 134: /Applications/Wine.app/Contents/Resources/bin/wine: Bad CPU type in executable
### LOG ### Command '/Applications/Wine.app/Contents/Resources/bin/wine regedit /tmp/reg.reg' returned status 126.
###ERROR### Command '/Applications/Wine.app/Contents/Resources/bin/wine regedit /tmp/reg.reg' returned status 126.
Task returned with status 1.
I've tried downloading the 'stable' version of WineBottler, and downloading and re-downloading it, to no avail: it always returns this message. I can't find any way around this, nor any recently posted question about it (most are from 2010-15 and their solutions are outdated).
Does anyone know what I can do to get around this and open it? It's driving me insane!!!
Thanks!

Tensorflow error when used as Docker baseimage

Hi, I am using the image below as the Docker base image for my FastAPI application:
FROM tensorflow/tensorflow:latest
When I run the container it starts, but I get this error:
2021-06-23 23:31:50.516749: F tensorflow/core/lib/monitoring/sampler.cc:42] Check failed: bucket_limits_[i] > bucket_limits_[i - 1] (0 vs. 10)
qemu: uncaught target signal 6 (Aborted) - core dumped
[2021-06-23 23:31:50 +0530] [1] [WARNING] Worker with pid 2697 was terminated due to signal 6
And when I call the API I get no response. Does the API call just take time, or can you tell me what is wrong?
I am guessing you are using a Mac with an M1 chip. This is a qemu bug; qemu is the upstream component we use for running Intel containers on M1 chips, and the issue hasn't been solved yet. I suggest you try building TensorFlow for aarch64 Linux from source.

GPU Memory management issues when using TensorFlow

| Processes:                                          GPU Memory |
|  GPU    PID   Type   Process name                   Usage      |
|    0   6944      C   python3                        11585MiB   |
|    1   6944      C   python3                        11587MiB   |
|    2   6944      C   python3                        10621MiB   |
The memory shown by nvidia-smi is not freed after TensorFlow is stopped mid-run.
I tried using this:
config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.per_process_gpu_memory_fraction = 0.90
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
Also
with tf.device('/gpu:0'):
with tf.Graph().as_default():
Tried resetting the GPU
sudo nvidia-smi --gpu-reset -i 0
The memory can not be freed at all.
The solution came from the question "Tensorflow set CUDA_VISIBLE_DEVICES within jupyter"; thanks, Yaroslav.
Most of the information came from the TensorFlow Stack Overflow documentation. I am not allowed to post it; not sure why.
Insert this at the beginning of your code:
import os
from tensorflow.python.client import device_lib
# Set the environment variables
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Double check that you have the correct devices visible to TF
print("{0}\nThe available CPU/GPU devices on your system\n{0}".format('=' * 100))
print(device_lib.list_local_devices())
There are different options for running on the GPU or the CPU. I am using the CPU; switch by changing which of the lines below is uncommented:
with tf.device('/cpu:0'):
# with tf.device('/gpu:0'):
# with tf.Graph().as_default():
Use the following lines in the session:
config = tf.ConfigProto(device_count={'GPU': 1}, log_device_placement=False,
                        allow_soft_placement=True)
# Allocate only as much GPU memory as runtime allocations require
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
# Session needs to be closed
sess.close()
The line below fixes the issue of resources being locked by Python:
with tf.Session(config=config) as sess:
Another helpful article to understand the importance of 'with'
Please do check the official tf.Session() documentation from TensorFlow.
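Why the with form matters can be shown without TensorFlow. The sketch below uses FakeSession, a hypothetical stand-in that only records whether close() was called; it is not tf.Session and says nothing about how TensorFlow frees GPU memory internally. It contrasts a manual close() that is skipped when training raises with a with block that always closes.

```python
class FakeSession:
    """Hypothetical stand-in for tf.Session; tracks only whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow the exception


def run_training(sess):
    # Simulates a run that is interrupted part-way through.
    raise RuntimeError("training interrupted")


# Manual style: close() is never reached because run_training raises first.
manual = FakeSession()
try:
    run_training(manual)
    manual.close()
except RuntimeError:
    pass

# with style: __exit__ calls close() even though run_training raises.
try:
    with FakeSession() as managed:
        run_training(managed)
except RuntimeError:
    pass

print(manual.closed, managed.closed)  # False True
```

This only covers interruptions that surface as Python exceptions. If the process is killed outright, no cleanup code runs in either style.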
Parameter explanation
To find out which devices your operations and tensors are assigned to, create the session with the log_device_placement configuration option set to True.
To have TensorFlow automatically choose an existing and supported device to run the operations when the specified one doesn't exist, set allow_soft_placement=True in the configuration option when creating the session.

Analyse TF training speed - how to debug?

I've trained the same model on 2 installations:
1. iMac (late '13), 3.5 GHz i7 and 32 GB RAM
2. One node in a Debian Slurm CPU cluster with 24 cores at 2.2 GHz and 64 GB RAM
Both run TF version 1.0.1 (CPU), and both were installed with pip. Training times for the model:
1. iMac: Total Time: 86.92s - 207,42s user 38,82s system 267% CPU 1:32,07
2. Linux: Total Time: 330.66s - real 5m38.478s user 22m14.264s sys 3m5.964s
Obviously not what I was expecting. How can I search for bottlenecks, or profile, and find out why the Linux setup is so slow?
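As a first step before profiling, the timings above already say something about how many cores each run kept busy. The sketch below is plain arithmetic on the numbers quoted in the question: (user + sys) / real gives the average number of busy cores. The helper names are made up for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XmY.s' style reading from the time command to seconds."""
    return minutes * 60 + seconds


def busy_cores(real_s, user_s, sys_s):
    """Average number of cores kept busy: CPU time divided by wall-clock time."""
    return (user_s + sys_s) / real_s


# iMac: 207,42s user, 38,82s system, 1:32,07 wall clock
imac = busy_cores(to_seconds(1, 32.07), 207.42, 38.82)

# Linux node: real 5m38.478s, user 22m14.264s, sys 3m5.964s
linux = busy_cores(to_seconds(5, 38.478), to_seconds(22, 14.264), to_seconds(3, 5.964))

print(f"iMac:  {imac:.2f} cores busy on average")   # 2.67, matching the reported 267% CPU
print(f"Linux: {linux:.2f} cores busy on average")  # 4.49
```

By this measure the Linux node averaged about 4.5 busy cores out of 24, yet consumed roughly 1520 CPU-seconds against about 246 on the iMac for the same model. That suggests the extra time went into overhead rather than useful work, though the numbers alone cannot say where; a profiler or a run with fewer threads would be needed to confirm.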

Using Linux virtual mouse driver

I am trying to implement a virtual mouse driver following the book Essential Linux Device Drivers. There is a user-space application, which generates coordinates, as well as a kernel module.
See: Virtual mouse driver and userspace application code and also a step by step on how to use this driver.
1.) I compile the code of the user-space application and the driver.
2.) Next I checked the dmesg output and got:
input: Unspecified device as /class/input/input32
Virtual Mouse Driver Initialized
3.) The sysfs node was created properly during initialization (found in /sys/devices/platform/vms/coordinates)
4.) I know that the virtual mouse driver (input32) is linked to event5 by checking the following:
$ cat /proc/bus/input/devices
I: Bus=0000 Vendor=0000 Product=0000 Version=0000
N: Name=""
P: Phys=
S: Sysfs=/devices/virtual/input/input32
U: Uniq=
H: Handlers=event5
B: EV=5
B: REL=3
5.) Next I attach a GPM server to the event interface: gpm -m /dev/input/event5 -t evdev
6.) Run the user-space application to generate random coordinates for the virtual mouse and observe the generated coordinates using od -x /dev/input/event5.
And nothing happens. Why?
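The B: lines in the /proc/bus/input/devices output from step 4 are hexadecimal capability bitmasks. A minimal sketch for decoding them follows; the bit numbers are the standard values from the kernel's input event codes, the name tables are deliberately incomplete, and decode_bitmask is a hypothetical helper that handles only a single hex word (larger masks are printed as several space-separated words).

```python
# Subset of the kernel's event-type and relative-axis codes (bit positions).
EV_NAMES = {0: "EV_SYN", 1: "EV_KEY", 2: "EV_REL", 3: "EV_ABS"}
REL_NAMES = {0: "REL_X", 1: "REL_Y", 2: "REL_Z", 8: "REL_WHEEL"}


def decode_bitmask(hex_value, names):
    """Return the names of all bits set in a single-word hex bitmask."""
    value = int(hex_value, 16)
    return [
        names.get(bit, f"bit{bit}")
        for bit in range(value.bit_length())
        if (value >> bit) & 1
    ]


ev = decode_bitmask("5", EV_NAMES)    # from "B: EV=5"
rel = decode_bitmask("3", REL_NAMES)  # from "B: REL=3"
print(ev, rel)  # ['EV_SYN', 'EV_REL'] ['REL_X', 'REL_Y']
```

So the device advertises relative X/Y motion only: the EV_KEY bit is not set, meaning it reports no buttons, and the empty N: Name="" line shows it has no name either.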
Also, the author mentions here that gdm should be stopped using /etc/init.d/gdm stop, but I get "no such service" when stopping gdm.
Here is my complete script for building and running the virtual mouse:
make -C /usr/src/kernel/2.6.35.6-45.fc14.i686/ SUBDIRS=$PWD modules
gcc -o app_userspace app_userspace.c
insmod app.ko
gpm -m /dev/input/event5 -t evdev
./app_userspace
Makefile:
obj-m+=app.o
Kernel version: 2.6.35.6
As I said before, I can receive the result through od, but I received it through your program
echo 9 19 > /sys/devices/platform/virmouse/vmevent
gives:
time 1368284298.207654 type 2 code 0 value 9
time 1368284298.207657 type 2 code 1 value 19
time 1368284298.207662 type 0 code 0 value 0
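The three lines above are one relative-motion report: type 2 is EV_REL with code 0 (REL_X) and code 1 (REL_Y), and the final type 0 / code 0 record is the EV_SYN / SYN_REPORT marker that closes the packet. A hedged sketch of how such records could be decoded from the raw bytes that od -x shows: it assumes the classic struct input_event layout (a timeval followed by type, code and value) in the native word size of the machine running the script, and decode_events is a hypothetical helper.

```python
import struct

# Assumption: native layout of struct input_event
# (timeval as two longs, then u16 type, u16 code, s32 value).
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)


def decode_events(buf):
    """Split a byte buffer into (sec, usec, type, code, value) tuples."""
    events = []
    for offset in range(0, len(buf) - EVENT_SIZE + 1, EVENT_SIZE):
        events.append(struct.unpack_from(EVENT_FORMAT, buf, offset))
    return events


# Synthetic buffer rebuilt from the three records quoted above.
sample = b"".join(
    struct.pack(EVENT_FORMAT, sec, usec, etype, code, value)
    for sec, usec, etype, code, value in [
        (1368284298, 207654, 2, 0, 9),   # EV_REL, REL_X, +9
        (1368284298, 207657, 2, 1, 19),  # EV_REL, REL_Y, +19
        (1368284298, 207662, 0, 0, 0),   # EV_SYN, SYN_REPORT
    ]
)

decoded = decode_events(sample)
for sec, usec, etype, code, value in decoded:
    print(f"time {sec}.{usec:06d} type {etype} code {code} value {value}")
```

To decode live events, the same function could be fed bytes read from /dev/input/event5 opened in binary mode, which normally requires root privileges.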
So now the question is: what is wrong with X11? I would like to stress that I tried this code under two different distributions, Ubuntu 11.04 and Fedora 14.
Maybe this will help: in Xorg.0.log I see the following:
[ 21.022] (II) No input driver/identifier specified (ignoring)
[ 272.987] (II) config/udev: Adding input device (/dev/input/event5)
[ 272.987] (II) No input driver/identifier specified (ignoring)
[ 666.521] (II) config/udev: Adding input device (/dev/input/event5)
[ 666.521] (II) No input driver/identifier specified (ignoring)
I spent a huge amount of time resolving this issue, and I would like to help other people who run into this problem. I think some external X11 features interfered with my module's work. After disabling GDM (runlevel 3) it now works fine. You can find working code here: http://fred-zone.blogspot.ru/2010/01/mouse-linux-kernel-driver.html. Working distro: Ubuntu 11.04 (gdm disabled).
Try replacing the below lines of code in the input device driver
set_bit(EV_REL, vms_input_dev->evbit);
set_bit(REL_X, vms_input_dev->relbit);
set_bit(REL_Y, vms_input_dev->relbit);
with
vms_input_dev->name = "Virtual Mouse";
vms_input_dev->phys = "vmd/input0"; // "vmd" is the driver's name
vms_input_dev->id.bustype = BUS_VIRTUAL;
vms_input_dev->id.vendor = 0x0000;
vms_input_dev->id.product = 0x0000;
vms_input_dev->id.version = 0x0000;
vms_input_dev->evbit[0] = BIT_MASK(EV_KEY) | BIT_MASK(EV_REL);
vms_input_dev->keybit[BIT_WORD(BTN_MOUSE)] = BIT_MASK(BTN_LEFT) | BIT_MASK(BTN_RIGHT) | BIT_MASK(BTN_MIDDLE);
vms_input_dev->relbit[0] = BIT_MASK(REL_X) | BIT_MASK(REL_Y);
vms_input_dev->keybit[BIT_WORD(BTN_MOUSE)] |= BIT_MASK(BTN_SIDE) | BIT_MASK(BTN_EXTRA);
vms_input_dev->relbit[0] |= BIT_MASK(REL_WHEEL);
It worked for me on Ubuntu 12.04.
