Choose 2nd GPU on server - python-3.x

I am running code on a server. There are 2 GPUs there, and the 1st one is busy. Yet I can't find a way to switch between them. I am using PyTorch, if that is important. The following line of code should be modified:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
The modification may be made only here.
Thanks.

cuda by default chooses cuda:0; switching to the other GPU is done through cuda:1.
So, your line becomes:
device = 'cuda:1' if torch.cuda.is_available() else 'cpu'
You can read more about CUDA semantics.
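For completeness, here is a minimal sketch of how the switch might look in practice (a sketch only, assuming the server exposes both GPUs to PyTorch; model and batch are placeholders for your own objects):
import torch

# Fall back to the first GPU (or CPU) if a second GPU is not visible.
device = torch.device('cuda:1' if torch.cuda.device_count() > 1 else
                      'cuda' if torch.cuda.is_available() else 'cpu')
print(torch.cuda.device_count())  # how many GPUs PyTorch can see
# model = model.to(device)        # move the model to the chosen GPU
# batch = batch.to(device)        # every input tensor must follow it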

Here is the way I'm doing it while using FastAI and a pre-trained model for inference.
First, during model definition with fastai (import fastai.vision.all as fai), I obtain the model instance and put it on the specified GPU (say, with gpu_id=3):
model = fai.nn.Sequential(body, head)
model.cuda(device=gpu_id)
Then, when loading the model weights, I also specify which device to use (otherwise it creates a copy of the model on GPU 0):
model.load_state_dict(torch.load(your_model_state_filepath, torch.device(gpu_id)))
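As a side note, the same effect can be had with the map_location keyword of torch.load; a minimal sketch, assuming PyTorch and a placeholder checkpoint path model_state.pth:
import torch

gpu_id = 1  # hypothetical index of the GPU you want to use
device = torch.device(f'cuda:{gpu_id}')

# map_location remaps the stored tensors onto the chosen GPU instead of
# the device they were saved from, so no copy lands on GPU 0.
state_dict = torch.load('model_state.pth', map_location=device)
# model.load_state_dict(state_dict)
# model.to(device)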

Related

When would I use model.to("cuda:1") as opposed to model.to("cuda:0")?

I have a user with two GPUs; the first one is an AMD GPU which can't run CUDA, and the second one is a CUDA-capable NVIDIA GPU. I am using the code model.half().to("cuda:0"). I'm not sure whether the invocation successfully used the GPU, nor am I able to test it because I don't have a spare computer with more than one GPU lying around.
In this case, does "cuda:0" mean the first device which can run CUDA, so it would've worked even if their first device was AMD? Or would I need to say "cuda:1" instead? How would I detect which number is the first CUDA-capable device?
The nvidia-ml-py3 package (imported as nvidia_smi) can help track GPU memory while your code is running.
To install it, run pip install nvidia-ml-py3. Take a look at this code snippet:
import nvidia_smi

cuda_idx = 0  # edit device index that you want to track
to_cuda = f'cuda:{cuda_idx}'  # 'cuda:0' in this case

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(cuda_idx)

def B2G(num):
    return round(num / (1024 ** 3), 2)

def print_memory(name, handle, pre_used):
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    used = info.used
    print(f'{name}: {B2G(used)}')
    print(f'This step use: {B2G(used - pre_used)}')
    print('------------')
    return used

# start
mem = print_memory('Start', handle, 0)
model = ...  # init your model
model.to(to_cuda)
mem = print_memory('Init model', handle, mem)
The example above uses nvidia_smi to track the memory needed by each part of the model and prints it in GB.
Edited: To check the list of GPUs:
def check_gpu():
    for i in range(torch.cuda.device_count()):
        device_name = f'cuda:{i}'
        print(f'{i} device name: {torch.cuda.get_device_name(torch.device(device_name))}')
I tested it, and as I suspected, model.half().to("cuda:0") will put your model on the first available GPU with CUDA support, i.e. the NVIDIA GPU in your case. The AMD GPU isn't visible as a cuda device, so feel safe to assume that cuda:0 is always a CUDA-enabled GPU; the AMD GPU won't be seen by your program.
Have a good day.
There are plenty of methods of torch.cuda to query and monitor GPU devices.
For example, you can check the type of each device:
torch.cuda.get_device_name(torch.device('cuda:0'))
# or
torch.cuda.get_device_name(torch.device('cuda:1'))
In my case, the output of get_device_name returns:
'Quadro RTX 6000'
If you want a more programmatic way to explore the properties of your devices, you can use torch.cuda.get_device_properties.
Once you are working with a device (or believe you are), you can use torch.cuda's memory management functions to monitor GPU memory usage.
For instance, you can get a very detailed account of the current state of your device's memory using:
torch.cuda.memory_stats(torch.device('cuda:0'))
# or
torch.cuda.memory_stats(torch.device('cuda:1'))
If you want nvidia-smi-like stats on utilization, you can use torch.cuda.utilization.
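A minimal sketch putting these calls together (assuming at least one CUDA device is visible; torch.cuda.utilization additionally requires the pynvml package):
import torch

dev = torch.device('cuda:0')
props = torch.cuda.get_device_properties(dev)
print(props.name, props.total_memory // 1024 ** 2, 'MiB')  # device name and total memory
print(torch.cuda.memory_allocated(dev))                    # bytes currently held by tensors
print(torch.cuda.utilization(dev))                         # nvidia-smi-like utilization percentage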

Does TensorFlow support multiple threads/streams on one GPU for training?

UPDATE:
I found the source code of GPUDevice; it hard-codes max streams to 1. May I know the reason?
GPUDevice(const SessionOptions& options, const string& name,
          Bytes memory_limit, const DeviceLocality& locality,
          TfGpuId tf_gpu_id, const string& physical_device_desc,
          Allocator* gpu_allocator, Allocator* cpu_allocator)
    : BaseGPUDevice(options, name, memory_limit, locality, tf_gpu_id,
                    physical_device_desc, gpu_allocator, cpu_allocator,
                    false /* sync every op */, 1 /* max_streams */) {
  if (options.config.has_gpu_options()) {
    force_gpu_compatible_ =
        options.config.gpu_options().force_gpu_compatible();
  }
======================================
I am wondering whether TensorFlow (1.x) supports multi-threading or multi-streaming on a single GPU. If not, I am curious about the underlying reasons: did TF do this on purpose, does some library like CUDA prevent TF from providing it, or is there some other reason?
Like some previous posts [1, 2], I tried to run multiple training ops in TF, i.e. sess.run([train_op1, train_op2], feed_dict={...}), and I used the TF timeline to profile each iteration. However, the TF timeline always showed that the two train ops ran sequentially (although the timeline is not accurate [3], the wall time of each op suggests sequential execution). I also looked at some TF source code; it looks like each op is computed by device->ComputeAsync() or device->Compute(), and the GPU is blocked while computing an op. If I am correct, one GPU can only run a single op at a time, which may lower GPU utilization. (A minimal sketch of the timeline profiling setup is included after the references below.)
1. Running multiple tensorflow sessions concurrently
2. Run parallel op with different inputs and same placeholder
3. https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-244251867
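For reference, a minimal sketch of the timeline profiling mentioned above (TF 1.x; sess, train_op1, train_op2 and feed_dict are assumed to be the session, ops and feed dict from your own training loop):
import tensorflow as tf
from tensorflow.python.client import timeline

# Trace a single sess.run call and dump a Chrome-trace timeline so the
# overlap (or lack of it) between the two train ops can be inspected.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run([train_op1, train_op2], feed_dict=feed_dict,
         options=run_options, run_metadata=run_metadata)

tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())  # open in chrome://tracing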
I have had a similar experience.
I have two GPUs; each GPU runs three threads, each thread runs a session, and each session's running time fluctuates a lot.
If only one thread runs on each GPU, the session running time is quite stable.
From this behaviour, we can conclude that threads in TensorFlow do not cooperate well;
the TensorFlow mechanism has a problem here.

PyTorch 0.4.0: There are three ways to create tensors on a CUDA device. Is there some difference between them?

The third way failed for me; t3 is still on the CPU. No idea why.
a = np.random.randn(1, 1, 2, 3)
t1 = torch.tensor(a)
t1 = t1.to(torch.device('cuda'))
t2 = torch.tensor(a)
t2 = t2.cuda()
t3 = torch.tensor(a, device=torch.device('cuda'))
All three methods worked for me.
In 1 and 2, you create a tensor on the CPU and then move it to the GPU when you use .to(device) or .cuda(). They are the same here.
However, with the .to(device) method you can explicitly tell torch to move to a specific GPU by setting device=torch.device("cuda:<id>"); with .cuda() you have to call .cuda(<id>) to move to a particular GPU.
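As a small illustration (a sketch, assuming two visible GPUs), both forms move the same tensor to the second GPU:
import torch

x = torch.randn(3, 3)
x_a = x.to(torch.device('cuda:1'))  # explicit device object
x_b = x.cuda(1)                     # equivalent index-based form
print(x_a.device, x_b.device)       # both report cuda:1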
Why do these two methods exist then?
.to(device) was introduced in 0.4 because it is easier to declare a device variable at the top of the code as
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
and use .to(device) everywhere. This makes it quite easy to switch from CPU to GPU and vice versa.
Before this, we had to use .cuda(), and your code needed an if check for cuda.is_available() everywhere, which made it cumbersome to switch between GPU and CPU.
The third method doesn't create a tensor on the CPU; it allocates the data directly on the GPU, which is more efficient.
Example to make a 50 by 50 tensor of 0's directly on your nvidia GPU:
zeros_tensor_gpu = torch.zeros((50, 50), device='cuda')
This will immensely speed up creation for big tensors such as 4000 by 4000.
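A minimal timing sketch of that claim (assuming a CUDA GPU is available; the synchronize calls make the timings meaningful):
import time
import torch

torch.cuda.synchronize()
t0 = time.time()
a = torch.zeros((4000, 4000)).to('cuda')      # create on CPU, then copy to GPU
torch.cuda.synchronize()
t1 = time.time()
b = torch.zeros((4000, 4000), device='cuda')  # allocate directly on the GPU
torch.cuda.synchronize()
t2 = time.time()
print(f'CPU then copy: {t1 - t0:.4f}s, direct on GPU: {t2 - t1:.4f}s')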

The meaning of "n_jobs == 1" in GridSearchCV when using multiple GPUs

I have been training an NN model using the Keras framework with 4 NVIDIA GPUs (data row count: ~160,000, column count: 5). Now I want to optimize its parameters using GridSearchCV.
However, I encountered several different errors whenever I tried to change n_jobs to a value other than one, such as:
CUDA OUT OF MEMORY
Can not get device properties error code : 3
Then I read this web page,
"# if you're not using a GPU, you can set n_jobs to something other than 1"
http://queirozf.com/entries/scikit-learn-pipeline-examples
So is it not possible to use multiple GPUs with GridSearchCV?
[Environment]
Ubuntu 16.04
Python 3.6.0
Keras / Scikit-Learn
Thanks!
According to the scikit-learn FAQ, GPU is NOT supported. Link
You can use n_jobs to use your CPU cores. If you want to run at maximum speed, you might want to use almost all of your cores:
import multiprocessing
n_jobs = multiprocessing.cpu_count()-1
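A minimal sketch of that CPU-only parallelism (the RandomForestClassifier and parameter grid here are stand-ins for illustration, not from the original question):
import multiprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

n_jobs = multiprocessing.cpu_count() - 1  # leave one core free
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, n_jobs=n_jobs)
# search.fit(X, y)  # X, y are your training data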

Debugging the optimization run while training variables of a pre-trained tensorflow model

I am loading a pre-trained model and then extracting only the trainable variables which I want to optimize (basically change or fine-tune) according to my custom loss. The problem is that the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used TensorBoard for visualization but don't know how to debug when there is no log info available. I put some basic print statements around it but didn't get any helpful information.
Just to give an idea, this is the piece of code, in order:
# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
    model.build()
theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts') if "SkipThoughts" in v.name]

# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])

# Training
sess.run([opt_F], feed_dict={idx: idxTensor})
And the model is from this repository:
The problem is with training, i.e. the last step. I verified that the theta list is not empty; it has 26 elements in it, like:
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/beta:0
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/gamma:0
SkipThoughts/logits/weights:0
SkipThoughts/logits/biases:0
SkipThoughts/decoder_post/gru_cell/gates/layer_norm/w_h/beta:0
...
Also, even after using tf.debug the issue remains. Maybe it really takes a lot of time, or is it stuck waiting for some other process? So I also tried breaking down the
tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
step into
opt = tf.train.AdamOptimizer(learning_rate)
gvs = opt.compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
...
g = sess.run(gvs, feed_dict={idx: idxTensor})
so that I could check whether the gradients are computed in the first place, but this got stuck at the same point. In addition, I also tried computing the gradients with tf.gradients over just one of the variables, and only for one dimension, but the issue still exists.
I am running this piece of code in an IPython notebook on an Azure cluster with one Tesla K80 GPU. The GPU usage stays the same throughout the execution and there is no out-of-memory error.
The kernel interrupt doesn't work, and the only way to stop it is by restarting the notebook. Moreover, if I run this code as a Python file, I still need to explicitly kill the process. However, in either case I don't get a stack trace showing the exact place where it is stuck. How should one debug such an issue?
Any help and pointers in this regard would be much appreciated.
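One hedged suggestion that is not from the original thread: in TF 1.x you can attach a timeout to sess.run so that a hang raises tf.errors.DeadlineExceededError (with a Python stack trace) instead of blocking the kernel forever; the 60-second budget below is arbitrary:
import tensorflow as tf

run_options = tf.RunOptions(timeout_in_ms=60000)  # arbitrary 60 s budget
sess.run([opt_F], feed_dict={idx: idxTensor}, options=run_options)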
