GPU Memory management issues when using TensorFlow

GPU Memory management issues when using TensorFlow - keras

| Processes: GPU Memory |
| GPU PID Type Process name Usage
| 0 6944 C python3 11585MiB |
| 1 6944 C python3 11587MiB |
| 2 6944 C python3 10621MiB |
The nvidia-smi memory is not freed after the tensorflow is stopped in the middle.
Tried to using this
config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.per_process_gpu_memory_fraction = 0.90
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
Also
with tf.device('/gpu:0'):
with tf.Graph().as_default():
Tried resetting the GPU
sudo nvidia-smi --gpu-reset -i 0
The memory can not be freed at all.

The solution was obtained from Tensorflow set CUDA_VISIBLE_DEVICES within jupyter thanks Yaroslav.
Most of the information was obtained from Tensorflow Stackoverflow documentation. I am not allowed to post it. Not sure why.
Insert this at the beginning of your code.
from tensorflow.python.client import device_lib
# Set the environment variables
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Double check that you have the correct devices visible to TF
print("{0}\nThe available CPU/GPU devices on your system\n{0}".format('=' * 100))
print(device_lib.list_local_devices())
Different options to start with GPU or CPU. I am using the CPU. Can be changed from the below options
with tf.device('/cpu:0'):
# with tf.device('/gpu:0'):
# with tf.Graph().as_default():
Use the following lines in the session:
config = tf.ConfigProto(device_count={'GPU': 1}, log_device_placement=False,
allow_soft_placement=True)
# allocate only as much GPU memory based on runtime allocations
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
# Session needs to be closed
sess.close()
The below line will fix the issue of resources locked by python
with tf.Session(config=config) as sess:
Another helpful article to understand the importance of 'with'
Please do check the official tf.Session() from tensorflow.
Parameter explanation
To find out which devices your operations and tensors are assigned to, create the session with
log_device_placement configuration option set to True.
TensorFlow to automatically choose an existing and supported device to run the operations in case the specified
one doesn't exist, you can set allow_soft_placement=True in the configuration option when creating the session.

Related

Pytorch not detecting multiple GPUs

I have 10 GPUs available and 1 GPU (e.g. GPU#9) is in use by another torch process. I would like to run another process on any of the remaining GPUs (e.g. GPU#2, GPU#3, GPU#4) but I always get the error message:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
I tried several options and none of them worked:
OPTION 1: Selecting GPUs on python script
import os
os.environ[“CUDA_DEVICE_ORDER”] = ‘PCI_BUS_ID’
os.environ[“CUDA_VISIBLE_DEVICES”] = ‘2,3,4’
print(f’[INFO] Using GPU: {torch.cuda.current_device()}‘)
print(f’[INFO] Available GPUs: {torch.cuda.device_count()}')
for d in range(torch.cuda.device_count()):
print(torch.cuda.get_device_name(d))
It recognises the 3 GPUs and the process will now assign them as GPU#0, GPU#1 and GPU#3 within that process.
[INFO] Using GPU: 2,3,4
[INFO] Available GPUs: 10
GeForce GTX 1080 Ti
OPTION 2: Selecting GPU on command line:
CUDA_VISIBLE_DEVICES=2,3,4 python gdxray_cganTrainer.py
and I check on linux using
env | grep CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=2,3,4
None of the above 2 options work as still get the CUDA-capable devices are busy or unavailable. It looks like it knows that the already running process on GPU#9 was assigned as GPU#0 and when I select GPU#2,3,4 the first GPU will be assigned as GPU#0 in the new process. But it shouldn't be as they are different processes
I am using torch 1.8.0, python 3.8.8, cudatoolkit 11.1.1
Any help?

WSL2 Linux: Numba CUDA Error_no_device however OS can see it

I am having issues with getting numba to work on my WSL installation. I am trying to use the OpenPCDet library from here (https://github.com/open-mmlab/OpenPCDet). However, I am getting this error. It seems like Numba isn't able to see the device. I also tried doing numba -s to get information, this is what I get (a subset is presented).
__CUDA Information__
Error: CUDA device intialisation problem. Message:Error at driver init:
[100] Call to cuInit results in CUDA_ERROR_NO_DEVICE:
Error class:
ROC Information
ROC available : False
Error initialising ROC due to : No ROC toolchains found.
No HSA Agents found, encountered exception when searching:
Error at driver init:
NUMBA_HSA_DRIVER /opt/rocm/lib/libhsa-runtime64.so is not a valid file path. Note it must be a filepath of the .so/.dll/.dylib or the driver:
Conda Information
conda_build_version : not installed
conda_env_version : 4.12.0
platform : linux-64
python_version : 3.9.7.final.0
root_writable : True
Current Conda Env (Subset)
cudatoolkit 10.2.89 hfd86e86_1
cumm-cu102 0.2.8 pypi_0 pypi
numba 0.55.1 pypi_0 pypi
pytorch 1.7.1 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch
Also, just to check whether WSL sees a GPU device, I used the command on this website (https://ubuntu.com/blog/getting-started-with-cuda-on-ubuntu-on-wsl-2). I get:
Run "nbody -benchmark [-numbodies=]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies= (number of bodies (>= 1) to run in simulation)
-device= (where d=0,1,2.... for the CUDA device to use)
-numdevices= (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Windowed mode
Simulation data stored in video memory
Single precision floating point simulation
1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1
Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1080 Ti]
28672 bodies, total time for 10 iterations: 23.114 ms
= 355.669 billion interactions per second
= 7113.380 single-precision GFLOP/s at 20 flops per interaction
To provide further information. The nvidia-smi commands results in the following output:
Sun May 22 16:37:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 512.15 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 23% 26C P8 11W / 250W | 1671MiB / 11264MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Can somebody help with this? Thanks!

Trouble initializing SDK node using USB-TTL M210 v2

I am trying to connect M210 v2 RTK to a desktop computer with Ubuntu 18.04, ROS Melodic and parallel installation of Opencv 3.3.1 and 4.5.3 using a USB-TTL RS232 to make UART connection and an USB-USB connecting drone and desktop to be able to run Advanced Sensing.
When I call ls -l /dev/ttyACM* && ls -l /dev/ttyUSB* it returns that it is indentified the USB and ACM connection.
crw-rw---- 1 root dialout 166, 0 out 4 13:18 /dev/ttyACM0
crw-rw---- 1 root dialout 188, 0 out 4 13:18 /dev/ttyUSB0
I also set the transfer rate of TTL-USB to 921600 using minicom, and gave persmission to device to read and write with sudo usermod -a -G dialout $USER && sudo chmod 666 /dev/ttyUSB0
Unfortunatelly when I launch roslaunch dji_osdk_ros dji_sdk_node.launch it appears some connection problem presented below and I am not being able to fix it. I have been trying to turn on/off drone and RC several times ass described here, but the problem still stand.
started roslaunch server http://V3D06:43613/
SUMMARY
========
PARAMETERS
* /dji_sdk/acm_name: /dev/ttyACM0
* /dji_sdk/align_time: False
* /dji_sdk/app_id: 1076017
* /dji_sdk/app_version: 1
* /dji_sdk/baud_rate: 921600
* /dji_sdk/dxc: False
* /dji_sdk/enc_key: 6bd1d26f8dd897e4b...
* /dji_sdk/serial_name: /dev/ttyUSB0
* /dji_sdk/use_broadcast: False
* /rosdistro: melodic
* /rosversion: 1.14.12
NODES
/
dji_sdk (dji_osdk_ros/dji_sdk_node)
auto-starting new master
process[master]: started with pid [2436]
ROS_MASTER_URI=http://localhost:11311
setting /run_id to bde7b4d2-252e-11ec-8a59-1831bfb3e154
process[rosout-1]: started with pid [2458]
started core service [/rosout]
process[dji_sdk-2]: started with pid [2464]
[ INFO] [1633364323.534426789]: Advanced Sensing is Enabled on M210.
Read App ID
User Configuration read successfully.
[1276751.089]STATUS/1 # getDroneVersion, L1702: ret = 0
[1276751.089]STATUS/1 # parseDroneVersionInfo, L1122: Device Serial No. = 1DADG3E00100U4
[1276751.089]STATUS/1 # parseDroneVersionInfo, L1124: Firmware = 3.4.3.44
[1276751.089]STATUS/1 # functionalSetUp, L279: Shake hand with drone successfully by getting drone version.
[1276751.089]STATUS/1 # legacyX5SEnableTask, L56: Legacy X5S Enable task created.
[1276752.089]STATUS/1 # sendHeartbeatToFCTask, L1576: OSDK send heart beat to fc task created.
[1276752.289]STATUS/1 # Control, L40: The control class is going to be deprecated.It will be better to use the FlightController class instead!
[1276752.290]STATUS/1 # FileMgrImpl, L253: register download file callback handler successfully.
[1276753.557]STATUS/1 # PSDKModule, L98: MOP only support M300, so mop client will not be initialized here.
[1276753.557]STATUS/1 # PSDKModule, L98: MOP only support M300, so mop client will not be initialized here.
[1276753.557]STATUS/1 # PSDKModule, L98: MOP only support M300, so mop client will not be initialized here.
[1276753.557]STATUS/1 # initDJIHms, L900: DJI HMS is not supported on this platform!
[1276753.567]STATUS/1 # getDroneVersion, L1702: ret = 0
[1276753.567]STATUS/1 # parseDroneVersionInfo, L1122: Device Serial No. = 1DADG3E00100U4
[1276753.567]STATUS/1 # parseDroneVersionInfo, L1124: Firmware = 3.4.3.44
[1276753.567]STATUS/1 # AdvancedSensing, L145: Advanced Sensing init for the M210 drone
[1276753.567]STATUS/1 # init, L49: Looking for USB device...
[1276753.572]STATUS/1 # init, L65: Found 8 USB devices, identifying DJI device...
[1276753.572]STATUS/1 # init, L83: Found a DJI device...
[1276753.572]STATUS/1 # init, L96: Attempting to open DJI USB device...
[1276753.572]ERRORLOG/1 # init, L101: Failed to open DJI USB device...
[1276753.572]ERRORLOG/1 # init, L102: Error code: -3
[1276753.572]ERRORLOG/1 # init, L105: Please make sure you provide a udev file for your system and reboot the computer
[1276753.573]STATUS/1 # LiveViewImpl, L89: Finding if liveview stream is available now.
[1276754.076]STATUS/1 # init, L254: Start advanced sensing initalization
[1276754.076]STATUS/1 # activate, L1329: version 0x304032C
[1276754.076]STATUS/1 # adv_pthread, L46: adv pthread created !!!!!!!!!!!!!!!!!!!!!!!
[1276754.076]STATUS/1 # adv_pthread, L48: adv pthread running !!!!!!!!!!!!!!!!!!!!!!!
[dji_sdk-2] process has died [pid 2464, exit code -11, cmd /home/vant3d/catkin_ws/devel/lib/dji_osdk_ros/dji_sdk_node __name:=dji_sdk __log:=/home/vant3d/.ros/log/bde7b4d2-252e-11ec-8a59-1831bfb3e154/dji_sdk-2.log].
log file: /home/vant3d/.ros/log/bde7b4d2-252e-11ec-8a59-1831bfb3e154/dji_sdk-2*.log
It appears it has some problem providing a udev file, but I don't know how to fix it. Does anyone have some idea to help on this problems?
Thank you!

That's my post. Firstly turn off advanced sensing to try whether a basic FTDI works.
The second which DJI OSDK version are you using? does the OSDK version match the version in OSDK-ROS? I saw you have M300 in. that is usually in OSDK 4+. For M210, I only use 3.8 and 3.9
If basic FTDI works, and you can get all the feedback. there is a higher chance that you have the wrong ACM config. DJI RNDIS thing is nasty and may not be config properly. You need to manually set static IP of 192.168.43.1 (or I remember something like this 42 or 43, you need to check on this static IP) and set it manually

slurmd unable to communicate with slurmctld

I followed the steps to troubleshoot here: https://slurm.schedmd.com/troubleshoot.html.
When running scontrol show slurmd, I get:
Active Steps = NONE
Actual CPUs = 1
Actual Boards = 1
Actual sockets = 1
Actual cores = 1
Actual threads per core = 1
Actual real memory = 984 MB
Actual temp disk space = 492 MB
Boot time = 2019-03-27T17:53:56
Hostname = fedora2
Last slurmctld msg time = NONE
Slurmd PID = 1549
Slurmd Debug = 4
Slurmd Logfile = /var/log/slurmd.log
Version = 17.11.13-2
I don't know why slurmd on fedora2 can't communicate with the controller on fedora1. slurmctld daemon is running fine on fedora1.
The slurm.conf is as follows:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=fedora1
#
ControlMachine=fedora1
ControlAddr=192.168.1.4
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=fedora
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=verbose
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=verbose
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=fedora1 NodeAddr=192.168.1.4 CPUs=1 State=UNKNOWN
NodeName=fedora2 NodeAddr=192.168.1.5 CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=fedora[1-2] Default=YES MaxTime=INFINITE State=UP
The output of tail /var/log/slurmd.log on fedora2, on multiple lines:
error: Unable to register: Unable to contact slurm controller (connect failure)

Make sure that:
no firewall prevents the slurmd daemon from talking to the controller
munge is running on each server
the dates are in sync
the Slurm versions are identical
the name fedora1 can be resolved to the correct IP

Using Linux virtual mouse driver

I am trying to implement a virtual mouse driver according to the Essential Linux device Drivers book. There is a user space application, which generates coordinates as well as a kernel module.
See: Virtual mouse driver and userspace application code and also a step by step on how to use this driver.
1.) I compile the code of the user space application and driver.
2.) Next i checked dmesg output and have,
input: Unspecified device as /class/input/input32
Virtual Mouse Driver Initialized
3.) The sysfs node was created properly during initialization (found in /sys/devices/platform/vms/coordinates)
4.) I know that the virtual mouse driver (input32 ) is linked to event5 by checking the following:
$ cat /proc/bus/input/devices
I: Bus=0000 Vendor=0000 Product=0000 Version=0000
N: Name=""
P: Phys=
S: Sysfs=/devices/virtual/input/input32
U: Uniq=
H: Handlers=event5
B: EV=5
B: REL=3
5.) Next i attach a GPM server to the event interface: gpm -m /dev/input/event5 -t evdev
6.) Run the user space application to generate random coordinates for virtual mouse and observe generated coordinates using od -x /dev/input/event5.
And nothing happens. Why?
Also here author mentioned that gdm should be stopped, using /etc/init.d/gdm stop, but i get "no such service" when stopping gdm.
Here is my complete script for building and runing virtual mouse:
make -C /usr/src/kernel/2.6.35.6-45.fc14.i686/ SUBDIRS=$PWD modules
gcc -o app_userspace app_userspace.c
insmod app.ko
gpm -m /dev/input-event5 -t evdev
./app_userspace
Makefile:
obj-m+=app.o
Kernel version: 2.6.35.6
As i said before i can recieve the result through od, but i received it through your program
echo 9 19 > /sys/devices/platform/virmouse/vmevent
gives:
time 1368284298.207654 type 2 code 0 value 9
time 1368284298.207657 type 2 code 1 value 19
time 1368284298.207662 type 0 code 0 value 0
So now the question is: what is wrong with X11? I would like to stress, that i tried this code under two different distributions Ubuntu 11.04 and Fedora 14.
Maybe this will help: in Xorg.0.log i see the following:
[ 21.022] (II) No input driver/identifier specified (ignoring)
[ 272.987] (II) config/udev: Adding input device (/dev/input/event5)
[ 272.987] (II) No input driver/identifier specified (ignoring)
[ 666.521] (II) config/udev: Adding input device (/dev/input/event5)
[ 666.521] (II) No input driver/identifier specified (ignoring)

I spent a huge amount of time, resolving this issue, and i would like to help other people, who run in this problem. I think some outer X11 features interfered my module work. After disabling GDM it now works fine (runlevel 3). Working code you can find here http://fred-zone.blogspot.ru/2010/01/mouse-linux-kernel-driver.html working distro ubuntu 11.04 (gdm disabled)

Try replacing the below lines of code in the input device driver
set_bit(EV_REL, vms_input_dev->evbit);
set_bit(REL_X, vms_input_dev->relbit);
set_bit(REL_Y, vms_input_dev->relbit);
with
vms_input_dev->name = "Virtual Mouse";
vms_input_dev->phys = "vmd/input0"; // "vmd" is the driver's name
vms_input_dev->id.bustype = BUS_VIRTUAL;
vms_input_dev->id.vendor = 0x0000;
vms_input_dev->id.product = 0x0000;
vms_input_dev->id.version = 0x0000;
vms_input_dev->evbit[0] = BIT_MASK(EV_KEY) | BIT_MASK(EV_REL);
vms_input_dev->keybit[BIT_WORD(BTN_MOUSE)] = BIT_MASK(BTN_LEFT) | BIT_MASK(BTN_RIGHT) | BIT_MASK(BTN_MIDDLE);
vms_input_dev->relbit[0] = BIT_MASK(REL_X) | BIT_MASK(REL_Y);
vms_input_dev->keybit[BIT_WORD(BTN_MOUSE)] |= BIT_MASK(BTN_SIDE) | BIT_MASK(BTN_EXTRA);
vms_input_dev->relbit[0] |= BIT_MASK(REL_WHEEL);
It worked for me on ubuntu 12.04

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

GPU Memory management issues when using TensorFlow - keras

Related

Pytorch not detecting multiple GPUs

WSL2 Linux: Numba CUDA Error_no_device however OS can see it

Trouble initializing SDK node using USB-TTL M210 v2

slurmd unable to communicate with slurmctld

Using Linux virtual mouse driver

Categories

Resources