SLURM srun print log instance-wise

While using SLURM on a multi-node cluster, I ran
srun -N 2 -C worker nvidia-smi
The output of this command is interleaved across the nodes instead of grouped per node.
Example output:
Tue Dec 15 22:37:55 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
Tue Dec 15 22:37:55 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:16.0 Off | 0 |
| N/A 46C P0 42W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 0 Tesla V100-SXM2... On | 00000000:00:16.0 Off | 0 |
| N/A 40C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:17.0 Off | 0 |
| N/A 49C P0 46W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:17.0 Off | 0 |
| N/A 39C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Expected output:
Instance 1
Tue Dec 15 22:37:55 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
+-------------------------------+----------------------+----------------------+
| 0 Tesla V100-SXM2... On | 00000000:00:16.0 Off | 0 |
| N/A 40C P0 44W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:17.0 Off | 0 |
| N/A 39C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Instance 2
Tue Dec 15 22:37:55 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:16.0 Off | 0 |
| N/A 46C P0 42W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:17.0 Off | 0 |
| N/A 49C P0 46W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

You can use the --label option to prefix each output line with its task number, and then use sort to group each task's lines together:
srun --label -N 2 -C worker nvidia-smi | sort -n
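If you want each node's output under its own "Instance N" heading, as in the expected output above, a small wrapper can regroup the labelled lines. This is only a minimal sketch, assuming srun is on the PATH, the same srun arguments as above, and that --label prefixes every line with "taskid: ":
import subprocess
from collections import defaultdict

out = subprocess.run(
    ["srun", "--label", "-N", "2", "-C", "worker", "nvidia-smi"],
    capture_output=True, text=True, check=True,
).stdout

# Group lines by the "taskid: " prefix that --label adds
blocks = defaultdict(list)
for line in out.splitlines():
    task, sep, rest = line.partition(": ")
    if sep and task.isdigit():
        blocks[task].append(rest)

for task in sorted(blocks, key=int):
    print(f"Instance {int(task) + 1}")
    print("\n".join(blocks[task]))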

Related

How do I output in a nice table in the terminal the mapping of the gpu id, the pid and the **username**?

I saw: How do I customize nvidia-smi's output to show PID username? but it doesn't do what I want. I want the output to look like:
USER GPU PID hostname %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
brando9 0 1234 ampere3 ... etc... whatever don't really care
but instead I see:
(metalearning_gpu) brando9~ $ nvidia-smi; ps -up `nvidia-smi -q -x | grep pid | sed -e 's/<pid>//g' -e 's/<\/pid>//g' -e 's/^[[:space:]]*//'`; hostname
Mon Feb 6 19:19:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04 Driver Version: 515.43.04 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:07:00.0 Off | 0 |
| N/A 31C P0 67W / 400W | 2MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:0A:00.0 Off | 0 |
| N/A 28C P0 61W / 400W | 2MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:44:00.0 Off | 0 |
| N/A 29C P0 63W / 400W | 2MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4A:00.0 Off | 0 |
| N/A 32C P0 65W / 400W | 2MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:84:00.0 Off | 0 |
| N/A 33C P0 65W / 400W | 2MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:8A:00.0 Off | 0 |
| N/A 30C P0 71W / 400W | 66729MiB / 81920MiB | 14% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:C0:00.0 Off | 0 |
| N/A 30C P0 62W / 400W | 2MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:C3:00.0 Off | 0 |
| N/A 32C P0 64W / 400W | 2MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 5 N/A N/A 49854 C .../envs/a100_env/bin/python 66727MiB |
+-----------------------------------------------------------------------------+
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
kexinh 49854 359 0.3 130510112 6749364 ? Rsl 18:16 226:30 /dfs/user/kexinh/miniconda3/envs/a100_env/bin/python -m ipykernel_launcher -f /afs/cs.stanford.edu/u/kexinh/.local/share/jupyter/runtime/kernel-bbc9f45e-4513-4643-82c3-0f67dde751
ampere3
How do I add a column so that I can easily see the GPU ID, PID and user name in bash/the terminal?
Even a command using Python is fine, e.g.
python -c 'some one liner python script that works'
related:
quora: https://www.quora.com/unanswered/How-do-I-output-in-a-nice-table-in-the-terminal-the-mapping-of-the-GPU-ID-the-pid-and-the-username
related: How do I customize nvidia-smi 's output to show PID username?
cross reddit nvidia: https://www.reddit.com/r/nvidia/comments/10vr808/how_do_i_output_in_a_nice_table_in_the_terminal/
cross reddit hpc: https://www.reddit.com/r/HPC/comments/10x9w6x/how_do_i_output_in_a_nice_table_in_the_terminal/
pytorch: https://discuss.pytorch.org/t/how-do-i-output-in-a-nice-table-in-the-terminal-the-mapping-of-the-gpu-id-the-pid-and-the-username/172043
cross reddit linux: https://www.reddit.com/r/linux/comments/10x9xfw/how_do_i_output_in_a_nice_table_in_the_terminal/
Here is the answer (the question can't be reopened):
Answer:
(echo "GPU_ID PID UID APP" ; for GPU in 0 1 2 3 ; do for PID in $( nvidia-smi -q --id=${GPU} --display=PIDS | awk '/Process ID/{print $NF}') ; do echo -n "${GPU} ${PID} " ; ps -up ${PID} | awk 'NR-1 {print $1,$NF}' ; done ; done) | column -t
credit:
https://www.reddit.com/r/HPC/comments/10x9w6x/comment/j7sg7w2/?utm_source=share&utm_medium=web2x&context=3
this solves my issues: https://stackoverflow.com/a/75403918/1601580
This one is also nice and adds memory utilization:
(echo "GPU_ID PID MEM% UTIL% UID APP" ; for GPU in 0 1 2 3 ; do for PID in $( nvidia-smi -q --id=${GPU} --display=PIDS | awk '/Process ID/{print $NF}') ; do echo -n "${GPU} ${PID} " ; nvidia-smi -q --id=${GPU} --display=UTILIZATION | grep -A4 -E '^[[:space:]]*Utilization' | awk 'NR=0{gut=0 ;mut=0} $1=="Gpu"{gut=$3} $1=="Memory"{mut=$3} END{printf "%s %s ",mut,gut}' ; ps -up ${PID} | gawk 'NR-1 {print $1,$NF}' ; done ; done) | column -t
output:
GPU_ID PID MEM% UTIL% UID APP
0 319310 16 58 minkai exp_cond_lumo_latent1
1 320206 11 38 minkai exp_cond_mu_latent1
3 59140 0 0 kexinh --wandb
3 1202222 0 0 brando9 5CNN_opt_as_model_for_few_shot
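If you prefer Python, here is a rough sketch of the same idea. It is only a sketch, assuming a Linux machine where nvidia-smi supports --query-gpu / --query-compute-apps and where the owner of /proc/<pid> can be read:
import os
import pwd
import subprocess

def smi_csv(query_flag, fields):
    out = subprocess.check_output(
        ["nvidia-smi", f"{query_flag}={','.join(fields)}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [[f.strip() for f in line.split(",")] for line in out.splitlines() if line.strip()]

# Map GPU UUID -> GPU index, since --query-compute-apps reports UUIDs
uuid_to_index = {uuid: idx for idx, uuid in smi_csv("--query-gpu", ["index", "uuid"])}

print(f"{'GPU_ID':<7}{'PID':<9}{'UID':<13}APP")
for gpu_uuid, pid, name in smi_csv("--query-compute-apps", ["gpu_uuid", "pid", "process_name"]):
    try:
        user = pwd.getpwuid(os.stat(f"/proc/{pid}").st_uid).pw_name
    except OSError:
        user = "?"
    print(f"{uuid_to_index.get(gpu_uuid, '?'):<7}{pid:<9}{user:<13}{os.path.basename(name)}")
The CSV query interface avoids the XML/grep parsing from the question, and looking up the owner of /proc/<pid> gives the username without calling ps per process.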

Nvidia A100 Devices Not Found on EC2

I am having problems accessing nvidia A100 GPUs on an AWS EC2 instance (p4d.24xlarge). However, I am able to use V100 GPUs (p3.16xlarge) without any problem.
On the p4d, I rebuilt everything from source, just as I did on the p3 instance, including nvidia-drivers from https://developer.download.nvidia.com/compute/cuda/11.1.1/local_installers/cuda_11.1.1_455.32.00_linux.run.
Any ideas what the problem might be?
When I run nvidia-smi, it shows the eight A100 GPUs, as expected. I wrote some simple code to query the number of GPUs in the system (code below), and it produces the following error:
Obtaining devices...
GPUassert: system not yet initialized DevInfo.cu 20
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define USE_CUDA
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

int main(int argc, char **argv)
{
    int numDevs = 0;
    printf("Obtaining devices...\n");
    gpuErrchk(cudaGetDeviceCount(&numDevs));
    printf("Number of devices: %d\n", numDevs);
    return 0;
}
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:10:1C.0 Off | 0 |
| N/A 32C P0 48W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB Off | 00000000:10:1D.0 Off | 0 |
| N/A 31C P0 47W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 A100-SXM4-40GB Off | 00000000:20:1C.0 Off | 0 |
| N/A 31C P0 48W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 A100-SXM4-40GB Off | 00000000:20:1D.0 Off | 0 |
| N/A 32C P0 49W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 A100-SXM4-40GB Off | 00000000:90:1C.0 Off | 0 |
| N/A 32C P0 48W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 A100-SXM4-40GB Off | 00000000:90:1D.0 Off | 0 |
| N/A 31C P0 48W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 A100-SXM4-40GB Off | 00000000:A0:1C.0 Off | 0 |
| N/A 33C P0 54W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

How to set GPU count to 0 using os.environ['CUDA_VISIBLE_DEVICES'] =""?

So I have the following GPUs configured in my system:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 461.33 Driver Version: 461.33 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... TCC | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 25W / 250W | 1MiB / 32642MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100S-PCI... TCC | 00000000:D8:00.0 Off | 0 |
| N/A 31C P0 25W / 250W | 1MiB / 32642MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Now, via Python, I have to set the environment such that the GPU count is 0.
I have tried the following, after learning from various sources:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import torch
torch.cuda.device_count()
But it still gives the output "2", i.e. the 2 GPUs in the system.
How do I set the environment so that it outputs "0"?
Any other way to set the count to "0" is also appreciated, but it should be ML-library agnostic. (For example, I can't use device = torch.device("cpu") as that works only for PyTorch and not for other libraries.)
To prevent your GPU from being used, set os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
The easiest way to do this is to run python with the correct environment set. For example, on Linux
CUDA_VISIBLE_DEVICES="" python ...
The following should also work:
os.environ["CUDA_VISIBLE_DEVICES"]=""
But this must be done before you first import torch.
What I think is happening in your case is that you are importing torch earlier, perhaps indirectly via some library that uses torch.
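A minimal sketch of the ordering that matters, assuming PyTorch is installed and nothing else has initialized CUDA beforehand:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # or "-1"; must be set before CUDA is initialized

import torch
print(torch.cuda.is_available())   # expected: False
print(torch.cuda.device_count())   # expected: 0
If the variable might be set too late (e.g. torch gets imported indirectly), launching the script with CUDA_VISIBLE_DEVICES="" python ... as shown above is the more robust option.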
os.environ["CUDA_VISIBLE_DEVICES"]="-1"
should set you to not use GPUs. From https://sodocumentation.net/tensorflow/topic/10621/tensorflow-gpu-setup#run-tensorflow-on-cpu-only---using-the--cuda-visible-devices--environment-variable-
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
torch.cuda.device_count() # result is 2
os.environ["CUDA_VISIBLE_DEVICES"]="0"
torch.cuda.device_count() # result is 1, using first GPU
os.environ["CUDA_VISIBLE_DEVICES"]="1"
torch.cuda.device_count() # result is 1, using second GPU

How to monitor GPU memory usage when training a DNN?

I have an example of the kind of result I want: a graph of GPU memory usage over time during training. How can I collect the data for a graph like this?
You can use pytorch commands such as torch.cuda.memory_stats to get information about current GPU memory usage and then create a temporal graph based on these reports.
I think the best option is
torch.cuda.mem_get_info
It returns the global free and total GPU memory for a given device, using cudaMemGetInfo.
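For example, a minimal sketch that samples it during training and plots the series afterwards (assumes a CUDA-capable PyTorch install and matplotlib; the loop is a stand-in for your training loop):
import time
import torch
import matplotlib.pyplot as plt

times, used_mib = [], []
for step in range(100):                       # stand-in for your training loop
    # ... one training step would go here ...
    free, total = torch.cuda.mem_get_info(0)  # both values are in bytes
    times.append(time.time())
    used_mib.append((total - free) / 2**20)

plt.plot(times, used_mib)
plt.xlabel("wall-clock time (s)")
plt.ylabel("GPU memory used (MiB)")
plt.savefig("gpu_memory.png")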
Another method is to use nvidia-smi, whose output looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66 Driver Version: 450.66 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 0% 50C P8 12W / 215W | 1088MiB / 8113MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1091 G /usr/lib/xorg/Xorg 24MiB |
| 0 N/A N/A 1158 G /usr/bin/gnome-shell 48MiB |
+-----------------------------------------------------------------------------+
And use subprocess to grab the output, for example:
import subprocess
import re
import time

while True:
    p = subprocess.check_output('nvidia-smi').decode()
    # The first "xxxxMiB / yyyyMiB" pair in the output is GPU 0's used / total memory
    ram_using, ram_total = re.findall(r'(\d+)MiB\s*/\s*(\d+)MiB', p)[0]
    ram_percent = int(ram_using) / int(ram_total)
    time.sleep(1)  # poll once per second
Or just split the output with p.split('\n') and pull the memory figures out of the relevant line by position.

How to check if keras training is already running in a GPU?

Sometimes I make a mistake and try to run two simultaneous trainings with Keras on the same GPU (two different scripts), making my machine crash or breaking both trainings.
I would like to be able to test in my script whether some training is already running and, if so, either switch to another GPU or stop the new training.
The only hint I found while searching for an answer is to use nvidia-smi to check the processes running on the GPUs.
An example of nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 411.63 Driver Version: 411.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp WDDM | 00000000:03:00.0 Off | N/A |
| 42% 67C P2 81W / 250W | 10114MiB / 12288MiB | 54% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp WDDM | 00000000:04:00.0 Off | N/A |
| 35% 58C P2 144W / 250W | 10315MiB / 12288MiB | 73% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 11660 C ...\conda\envs\tensorflow18-gpu\python.exe N/A |
| 1 1532 C+G Insufficient Permissions N/A |
| 1 5388 C+G C:\Windows\explorer.exe N/A |
| 1 6648 C+G Insufficient Permissions N/A |
| 1 7396 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 1 7688 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
| 1 9808 C ...\conda\envs\tensorflow18-gpu\python.exe N/A |
| 1 10820 C+G Insufficient Permissions N/A |
| 1 11232 C+G ...x64__8wekyb3d8bbwe\Microsoft.Photos.exe N/A |
+-----------------------------------------------------------------------------+
In this case there is a python.exe running on GPU 0 and on GPU 1.
Is there a more direct solution? Thanks
You can try the Python package GPUtil.
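For example, a minimal sketch with GPUtil (assumes pip install gputil; the load/memory thresholds are arbitrary):
import os
import GPUtil

# Ask for one GPU with low utilization and low memory use
available = GPUtil.getAvailable(order="memory", limit=1, maxLoad=0.2, maxMemory=0.2)
if not available:
    raise RuntimeError("No free GPU found; another training job seems to be running.")

# Pin this script to the free GPU (set before importing keras/tensorflow)
os.environ["CUDA_VISIBLE_DEVICES"] = str(available[0])
Note that GPUtil looks at overall GPU load and memory rather than at who owns the processes, so this treats any busy GPU as unavailable.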
