How to measure GPU usage per process in Windows using python? - windows-10

I would like to measure the GPU usage per process as done in Windows taskmgr.exe, but I have encountered several problems when attempting to use the pyNVML library. As a result, I have a few questions.
First, is it currently possible to measure the exact GPU usage per process in Windows using Python? I have already tried the nvidia-smi query, but this doesn't seem to show memory used and utilization percent for each process.
Second, if it is possible to measure GPU usage in this way using Python, I would like to measure and show it in a similar fashion as done in the Windows taskmgr.exe of Windows 10.
Here is my code so far:
from pynvml import *

nvmlInit()
deviceCount = nvmlDeviceGetCount()
#print(deviceCount)
for device_id in range(deviceCount):
    hd = nvmlDeviceGetHandleByIndex(device_id)
    #print(handle)
    cps = nvmlDeviceGetGraphicsRunningProcesses(hd)
    for ps in cps:
        pp = ps.pid
        #print(pp)
        try:
            name = str(nvmlSystemGetProcessName(ps.pid))
            n = name.split("\\")
            #print(n[len(n)-1][:-1])
            process_name = n[len(n)-1][:-1]
            if process_name == 'chrome.exe':
                print(process_name, pp, ps.usedGpuMemory)
        except:
            pass
and my result:
chrome.exe 16688 None
As you can see, this does not reveal the GPU memory usage per process, but I need the information shown in taskmgr's GPU section. (I have no need of visualization.)
My computer specs are Windows 10 pro, GTX 950, i5-6600
If this is impossible in Python at the moment, do you have any other recommendations for automatically collecting GPU usage per process?
Thank you.

See Jonathan DEKHTIAR's answer here, which explains why it doesn't work.
As a workaround, you can try fetching the values from PowerShell with Get-Counter -Counter "\GPU Engine(*)\Utilization Percentage".
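A minimal sketch of how the Get-Counter workaround could be driven from Python. It assumes the counter instance names contain the owning process ID in the form pid_<PID>_..., and that summing all engines of a process is an acceptable aggregate (Task Manager may aggregate engines differently). The helper names parse_gpu_counters and query_gpu_usage are hypothetical, not part of any library.

```python
import re
import subprocess
from collections import defaultdict

# Assumed instance-name layout: "pid_<PID>_luid_..._engtype_<TYPE>".
PID_PATTERN = re.compile(r"pid_(\d+)_")

# PowerShell one-liner that prints "<instance name>|<value>" per counter sample.
PS_COMMAND = (
    "(Get-Counter -Counter '\\GPU Engine(*)\\Utilization Percentage').CounterSamples"
    " | ForEach-Object { $_.InstanceName + '|' + $_.CookedValue }"
)


def parse_gpu_counters(text):
    """Sum the utilization percentage of all GPU engines for each PID."""
    usage = defaultdict(float)
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition("|")
        match = PID_PATTERN.search(name)
        if not sep or not match:
            continue  # skip blank lines and anything without a PID
        try:
            # Some locales print a decimal comma instead of a point.
            usage[int(match.group(1))] += float(value.replace(",", "."))
        except ValueError:
            continue  # skip values that are not numeric
    return dict(usage)


def query_gpu_usage():
    """Run the PowerShell query (Windows only) and return {pid: percent}."""
    completed = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_counters(completed.stdout)
```

On a Windows 10 machine, query_gpu_usage() would return a dictionary keyed by PID, which can then be matched against process names (for example chrome.exe) by whatever means you already use.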

Related

Get CPU load with python3 in Raspberry Pi OS Bullseye

My goal is to get the current CPU usage ##% returned to display on an ePaper HAT display. I'm currently using python3.
I've tried 2 solutions found on stackoverflow and they produced unexpected results.
import os
def getCPUuse():
    return str(os.popen("top -n1 | awk '/Cpu\(s\):/ {print $2}'").readline())
print(getCPUuse())
I'm getting "TERM environment variable not set." outputted in the shell when running this proposed code.
I'm not sure how to make this message go away, as the other proposed solutions to "TERM environment variable not set." are to set the variable XTERM, but the variable seems to be set already. Entering set | grep TERM in the terminal returns "TERM=xterm-256color".
import psutil
def get_CPU():
    return psutil.cpu_percent()
print(get_CPU())
Here is another proposed solution, but running it always returns "0.0". I doubted that the CPU load was really a constant 0.0, so I ran
htop
in the terminal, and the average CPU load looked like ~2.8%, not 0.
Perhaps I should use from gpiozero import LoadAverage instead? I'm new to programming hardware. If someone with more experience can offer pointers on whether https://gpiozero.readthedocs.io/en/stable/api_internal.html#loadaverage is promising, that'd be helpful too.
I'm trying to keep solutions based on Python3.
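For context, psutil documents that the first call to cpu_percent() without an interval returns a meaningless 0.0, and top needs its batch flag (-b) when it is not attached to a terminal. A different technique that avoids both tools is to read /proc/stat twice and compare the counters. The following is a minimal sketch of that idea, assuming a Linux system such as Raspberry Pi OS; the function names are hypothetical.

```python
import time


def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate 'cpu ...' line."""
    values = [int(v) for v in line.split()[1:]]
    idle = values[3] + values[4]  # idle + iowait
    total = sum(values[:8])       # user..steal; guest time is already part of user
    return idle, total


def cpu_percent(prev, cur):
    """Busy percentage between two (idle, total) samples."""
    idle_delta = cur[0] - prev[0]
    total_delta = cur[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return round(100.0 * (1.0 - idle_delta / total_delta), 1)


def sample_cpu_percent(interval=1.0, path="/proc/stat"):
    """Take two samples 'interval' seconds apart and return the busy percentage."""
    with open(path) as f:
        prev = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        cur = parse_cpu_line(f.readline())
    return cpu_percent(prev, cur)
```

Calling sample_cpu_percent() blocks for the chosen interval and returns a single number that can be formatted for the display.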

When would I use model.to("cuda:1") as opposed to model.to("cuda:0")?

I have a user with two GPUs; the first is an AMD card, which can't run CUDA, and the second is a CUDA-capable NVIDIA GPU. I am using the code model.half().to("cuda:0"). I'm not sure whether the invocation successfully used the GPU, nor am I able to test it, because I don't have a spare computer with more than one GPU lying around.
In this case, does "cuda:0" mean the first device which can run CUDA, so it would've worked even if their first device was AMD? Or would I need to say "cuda:1" instead? How would I detect which number is the first CUDA-capable device?
The nvidia_smi module (installed with the nvidia-ml-py3 package) can help track GPU memory while your code runs.
To install it, run pip install nvidia-ml-py3. Take a look at this code snippet:
import nvidia_smi

cuda_idx = 0 # edit device index that you want to track
to_cuda = f'cuda:{cuda_idx}' # 'cuda:0' in this case
nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(cuda_idx)

def B2G(num):
    return round(num/(1024**3),2)

def print_memory(name, handle, pre_used):
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    used = info.used
    print(f'{name}: {B2G(used)}')
    print(f'This step use: {B2G(used-pre_used)}')
    print('------------')
    return used

# start
mem = print_memory('Start', handle, 0)
model = ... # init your model
model.to(to_cuda)
mem = print_memory('Init model', handle, mem)
The example above uses nvidia_smi to track the memory needed by each part of the model and prints it in GB.
Edited: To check the list of GPUs:
import torch

def check_gpu():
    for i in range(torch.cuda.device_count()):
        device_name = f'cuda:{i}'
        print(f'{i} device name:{torch.cuda.get_device_name(torch.device(device_name))}')
I tested it and, as I suspected, model.half().to("cuda:0") puts your model on the first available GPU with CUDA support, i.e. the NVIDIA GPU in your case. The AMD GPU isn't visible as a CUDA device, so it is safe to assume that cuda:0 always refers to a CUDA-enabled GPU and that the AMD GPU won't be seen by your program.
Have a good day.
There are plenty of methods of torch.cuda to query and monitor GPU devices.
For example, you can check the type of each device:
torch.cuda.get_device_name(torch.device('cuda:0'))
# or
torch.cuda.get_device_name(torch.device('cuda:1'))
In my case, the output of get_device_name returns:
'Quadro RTX 6000'
If you want a more programmatic way to explore the properties of your devices, you can use torch.cuda.get_device_properties.
Once you are working with a device (or believe you are), you can use torch.cuda's memory management functions to monitor GPU memory usage.
For instance, you can get a very detailed account of the current state of your device's memory using:
torch.cuda.memory_stats(torch.device('cuda:0'))
# or
torch.cuda.memory_stats(torch.device('cuda:1'))
If you want nvidia-smi-like stats on utilization, you can use torch.cuda.utilization

Tensorflow supports multiple threads/streams on one GPU for training?

UPDATE:
I found the source code of GPUDevice; it hard-codes max streams to 1. May I know the reason?
GPUDevice(const SessionOptions& options, const string& name,
          Bytes memory_limit, const DeviceLocality& locality,
          TfGpuId tf_gpu_id, const string& physical_device_desc,
          Allocator* gpu_allocator, Allocator* cpu_allocator)
    : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                    physical_device_desc, gpu_allocator, cpu_allocator,
                    false /* sync every op */, 1 /* max_streams */) {
  if (options.config.has_gpu_options()) {
    force_gpu_compatible_ =
        options.config.gpu_options().force_gpu_compatible();
  }
======================================
I am wondering whether TensorFlow (1.x) supports multiple threads or multiple streams on a single GPU. If not, I am curious about the underlying reasons: did TF do this on purpose, do libraries such as CUDA prevent TF from providing it, or is there some other reason?
Like some previous posts [1, 2], I tried to run multiple training ops in TF, i.e. sess.run([train_op1, train_op2], feed_dict={...}), and used the TF timeline to profile each iteration. However, the timeline always showed the two train ops running sequentially (the timeline is not accurate [3], but the wall time of each op also suggests sequential execution). I also looked at some of the TF source code: it looks like each op is computed in device->ComputeAsync() or device->Compute(), and the GPU is blocked while computing an op. If I am correct, one GPU can only run a single op at a time, which may lower GPU utilization.
1. Running multiple tensorflow sessions concurrently
2. Run parallel op with different inputs and same placeholder
3. https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-244251867
I have had a similar experience.
I have two GPUs, each running three threads, with each thread running one session. The running time of each session fluctuates a lot.
If I run only one thread on each GPU, the session running time is quite stable.
From this behavior, we can conclude that threads in TensorFlow do not cooperate well;
there seems to be a problem in TensorFlow's mechanism.

Reduce multiprocessing for statsmodels glm

I am currently doing a proof of concept for one of our business processes that requires logistic regression. I have been using statsmodels glm to perform classification against our data set (as per the code below). Our data set consists of ~10M rows and around 80 features (of which 70+ are dummies, e.g. "1" or "0", based on the defined categorical variables). With a smaller data set, glm works fine; however, if I run it against the full data set, Python throws a "cannot allocate memory" error.
glmmodel = smf.glm(formula, data, family=sm.families.Binomial())
glmresult = glmmodel.fit()
resultstring = glmresult.summary().as_csv()
This got me thinking that this might be because statsmodels is designed to use all available CPU cores, and each subprocess underneath creates a copy of the data set in RAM (please correct me if I am mistaken). The question now is whether there is a way to make glm use a minimal number of cores. I am not concerned about performance; I just want to be able to run the glm against the full data set.
For reference, below is the machine configuration and some more information if needed.
CPU: 10 cores
RAM: 40 GB (usable/free ~25GB as there are other processes running on the same machine)
swap: 16 GB
dataset size: 1.4 GB (based on pandas' DataFrame.info(memory_usage='deep'))
GLM uses multiprocessing only through the linear algebra libraries.
The following copies my FAQ issue description from https://github.com/statsmodels/statsmodels/issues/2914
It includes some links to other issues where this shows up.
(quote:)
Statsmodels is using joblib in a few places for parallel processing where it's under our control. Current usage is mainly for bootstrap and it is not used in the models directly.
However, some of the underlying BLAS/LAPACK libraries in numpy/scipy also use multiple cores. This can be efficient for linear algebra with large arrays, but it can also slow down the operations, especially when we want to use parallel processing at a higher level.
How can we restrict the number of cores used by the linear algebra libraries?
This depends on which linear algebra library is used. See the mailing list thread
https://groups.google.com/d/msg/pystatsmodels/Lz9-In0pgPk/BtcYsj_ABQAJ
openblas: try setting the environment variable OMP_NUM_THREADS=1
Accelerate on OSX, set VECLIB_MAXIMUM_THREADS
mkl in anaconda:
import mkl
mkl.set_num_threads(1)
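A minimal sketch of the environment-variable approach from the list above. OMP_NUM_THREADS and VECLIB_MAXIMUM_THREADS come from the quoted text; OPENBLAS_NUM_THREADS and MKL_NUM_THREADS are added here as assumptions about commonly honored variables, and the helper name limit_blas_threads is hypothetical. The variables generally need to be set before numpy, scipy, or statsmodels is imported, since the libraries typically read them at load time.

```python
import os


def limit_blas_threads(n=1):
    """Set the thread-count variables read by common BLAS/LAPACK builds."""
    names = (
        "OMP_NUM_THREADS",         # OpenMP-based builds, including OpenBLAS
        "OPENBLAS_NUM_THREADS",    # assumption: OpenBLAS-specific variable
        "MKL_NUM_THREADS",         # assumption: MKL-specific variable
        "VECLIB_MAXIMUM_THREADS",  # Accelerate on macOS
    )
    for name in names:
        os.environ[name] = str(n)
    return {name: os.environ[name] for name in names}


# Call this first, then import the numerical stack.
limit_blas_threads(1)
# import statsmodels.api as sm   # import only after the variables are set
```

Setting the variables in the shell before starting Python has the same effect and avoids any dependence on import order.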
This is because statsmodels uses IRLS to estimate a GLM, and the IRLS process uses its WLS regression routine, which in turn uses a QR decomposition. The QR decomposition is performed directly on X, and your X has 10 million rows and 80 columns, which puts a lot of stress on memory and CPU.
Here is the source code from statsmodels:
if method == 'pinv':
    pinv_wexog = np.linalg.pinv(self.wexog)
    params = pinv_wexog.dot(self.wendog)
elif method == 'qr':
    Q, R = np.linalg.qr(self.wexog)
    params = np.linalg.solve(R, np.dot(Q.T, self.wendog))
else:
    params, _, _, _ = np.linalg.lstsq(self.wexog, self.wendog,
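To put rough numbers on the memory pressure, here is a back-of-the-envelope sketch. It assumes the design matrix is stored as float64, that the weighted copy (wexog) is a full array of the same shape, and that the 'qr' branch is taken, in which case the reduced QR returns a Q with the same shape as X. The helper name array_gib is hypothetical.

```python
def array_gib(rows, cols, itemsize=8):
    """Size of a dense rows x cols array in GiB (float64 by default)."""
    return rows * cols * itemsize / 1024**3


rows, cols = 10_000_000, 80

x_gib = array_gib(rows, cols)      # original design matrix X
wexog_gib = array_gib(rows, cols)  # weighted copy used by WLS (assumed full copy)
q_gib = array_gib(rows, cols)      # Q from the reduced QR has the same shape as X
r_gib = array_gib(cols, cols)      # R is only cols x cols, so it is negligible

total_gib = x_gib + wexog_gib + q_gib + r_gib
print(f"X: {x_gib:.2f} GiB, total: {total_gib:.2f} GiB")
```

Under these assumptions each full-size array takes close to 6 GiB, so three of them alive at once already approach the ~25 GB reported as free, independent of how many cores are used.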

Python3 multiprocessing: Memory Allocation Error

I know that this question has been asked a lot of times, but the answers are not applicable.
This is the first answer to a Stack Overflow question about parallelizing a loop with multiprocessing:
import multiprocessing as mp

def processInput(i):
    return i * i

if __name__ == '__main__':
    inputs = range(1000000)
    pool = mp.Pool(processes=4)
    results = pool.map(processInput, inputs)
    print(results)
This code works fine. But if I increase the range to 1000000000, my 16GB of RAM fills up completely and I get [Errno 12] Cannot allocate memory. It seems as if the map function starts as many processes as possible. How do I limit the number of parallel processes?
The pool.map function starts 4 processes, as you instructed (with processes=4 you tell the pool how many processes it may use to run your logic).
There is, however, a different issue underlying this implementation.
The pool.map function returns a list of objects, in this case numbers.
Python numbers do not behave like ints in ANSI C: they carry overhead and do not overflow (e.g. wrap around to -2^31 after exceeding 2^31-1 on 32-bit).
Also, Python lists are not arrays and incur their own overhead.
To be more specific, on python 3.6, running the following code will reveal some overhead:
>>> import sys
>>> t = [1,2,3,4]
>>> sys.getsizeof(t)
96
>>> t = [x for x in range(1000)]
>>> sys.getsizeof(t)
9024
So this means 24 bytes per number on small lists and ~9 bytes on large lists.
So for a list the size of 10^9 we get about 8.5GB
EDIT: 1. As tfb mentioned, this is not even the size of the underlying Number objects, just pointers and list overhead, meaning there is much more memory overhead I did not account for in the original answer.
The default Python installation on Windows is 32-bit (you can get a 64-bit installation, but you need to check the list of all available downloads on the Python website), so I assumed you are using the 32-bit installation.
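The estimate can be extended to include the int objects themselves. This is a rough sketch assuming 64-bit CPython with 8-byte list pointers; the helper name estimate_result_list_bytes is hypothetical, and the real figure is higher because larger results need larger int objects.

```python
import sys


def estimate_result_list_bytes(n, pointer_bytes=8):
    """Lower-bound estimate for a list holding n distinct int objects."""
    # Size of one int object that is too large to come from the small-int cache.
    int_bytes = sys.getsizeof(10**6)
    return n * (pointer_bytes + int_bytes)


estimate = estimate_result_list_bytes(10**9)
print(f"{estimate / 1024**3:.1f} GiB for 10^9 results")
```

Even this lower bound is well beyond 16GB, so the result list alone cannot fit in memory regardless of how many worker processes are used.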
In Python 2, range(1000000000) creates a list of 10^9 ints, which is around 8GB (8 bytes per int on a 64-bit system). You are then trying to process this to create another list of 10^9 ints. A really, really smart implementation might be able to do this on a 16GB machine, but it's basically a lost cause.
In Python 2 you could try using xrange, which might or might not help. In Python 3, range is already lazy (it is the equivalent of xrange), so the input itself is small, but pool.map still returns a list of 10^9 results.
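One way to keep memory bounded is to have each task process a whole chunk and return a single value, and to consume results as they arrive instead of collecting a list. The following is a minimal sketch, assuming the per-item results can be reduced on the fly (here, summed); the function names and the chunk size are illustrative choices, not part of the original code.

```python
import multiprocessing as mp


def chunk_bounds(n, chunk_size):
    """Yield (start, stop) pairs that cover range(n) in order."""
    for start in range(0, n, chunk_size):
        yield start, min(start + chunk_size, n)


def sum_of_squares(bounds):
    """Process one chunk and return a single number instead of a list."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def parallel_sum_of_squares(n, processes=4, chunk_size=1_000_000):
    """Reduce the results as they arrive so that no large list is ever built."""
    with mp.Pool(processes=processes) as pool:
        # imap_unordered hands results back one at a time as workers finish.
        return sum(pool.imap_unordered(sum_of_squares, chunk_bounds(n, chunk_size)))
```

As in the original code, the call itself, for example parallel_sum_of_squares(1000000000), should sit under an if __name__ == '__main__': guard. If every individual result is needed, writing each chunk's output to disk inside the worker is an alternative to returning it.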
