Currently, I am doing a Udemy Python course for data science. In it, there is the following example to train a model in TensorFlow:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
model = Sequential()
# Choose whatever number of layers/neurons you want.
model.add(Dense(units=78,activation='relu'))
model.add(Dense(units=39,activation='relu'))
model.add(Dense(units=19,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x=X_train,
          y=y_train,
          epochs=3,
          validation_data=(X_test, y_test),
          verbose=1)
My goal now was to get this to run on my GPU. To do so, I altered the last part as follows (the epochs are deliberately low; I just want to see how long one epoch takes before scaling up):
with tf.device("/gpu:0"):
model.fit(x=X_train,
y=y_train,
epochs=3,
validation_data=(X_test, y_test), verbose=1
)
and for comparison, also as follows:
with tf.device("/cpu:0"):
model.fit(x=X_train,
y=y_train,
epochs=3,
validation_data=(X_test, y_test), verbose=1
)
However, the result is very unexpected: either both versions occupy all of the GPU's memory but seemingly don't do any calculations on it and take exactly the same time per epoch, or the GPU version simply crashes with the following error:
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\gpu\lib\site-packages\six.py in raise_from(value, from_value)
InternalError: Blas GEMM launch failed : a.shape=(32, 78), b.shape=(78, 78), m=32, n=78, k=78
[[node sequential/dense/MatMul (defined at <ipython-input-115-79c9a84ee89a>:8) ]] [Op:__inference_distributed_function_874]
Function call stack:
distributed_function
Sometimes it crashes, sometimes it kind of works but takes as long as the CPU. Sometimes even the CPU version takes 20 sec per epoch, other times 40 sec. The code stays the same; all that changes is that I restart the kernel in between. I really don't understand it.
When I test the GPU and the conda environment using the following code, everything seems to work fine and reproducibly, and the GPU is about 20x as fast as the CPU:
# https://gist.github.com/ikarus-999/1a845437b454cdfcc1eb5455d373fe63
import sys
import numpy as np
import tensorflow.compat.v1 as tf # compatibility for TF 1 code
from datetime import datetime
def test_device(device_name: str):
    shape = (int(10000), int(10000))
    startTime = datetime.now()
    with tf.device(device_name):
        random_matrix = tf.random.uniform(shape=shape, minval=0, maxval=1)
        dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
        sum_operation = tf.reduce_sum(dot_operation)
        result = sum_operation
    print("Shape:", shape, "Device:", device_name)
    print("—" * 50)
    print(result)
    print("Time taken:", datetime.now() - startTime)
    print("\n" * 2)

test_device("/cpu:0")  # 6 sec
test_device("/gpu:0")  # 0.3 sec
So, I am sure there is something I am doing wrong.
TL;DR:
What would be the correct way to call model.fit on the GPU? How can different runs (without changing the code) result in such drastically different outcomes (crash, vastly different calculation times)?
Any help is greatly appreciated, thanks!
After a lot of trial and error I finally found a working way to force either CPU-only or "mixed" usage. GPU-only doesn't seem to work, though. The with tf.device() method from my original post doesn't seem to do anything in this scenario. I have to hide the GPU if I want to use only the CPU (TensorFlow 2.1.0):
CPU only
# force CPU (make CPU visible)
cpus = tf.config.experimental.list_physical_devices('CPU')
print(cpus)
tf.config.set_visible_devices([], 'GPU')       # hide the GPU
tf.config.set_visible_devices(cpus[0], 'CPU')  # unhide potentially hidden CPU
tf.config.get_visible_devices()

model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test),
          verbose=1)
This results in 3-4 sec per epoch and does not tax the GPU.
Restart the Kernel, then:
GPU only
# force GPU (make GPU visible)
# note: does not work without restarting the kernel, otherwise:
# "Visible devices cannot be modified after being initialized"
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)
tf.config.set_visible_devices([], 'CPU')       # hide the CPU
tf.config.set_visible_devices(gpus[0], 'GPU')  # unhide potentially hidden GPU
tf.config.get_visible_devices()

model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test),
          verbose=1)
That doesn't work because, apparently, the CPU is required by this model:
"NotFoundError: No CPU devices are available in this process"
Default (mixed CPU & GPU):
Restart the Kernel, then:
# test if CPU and GPU are visible
tf.config.get_visible_devices()
# [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
#  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test),
          verbose=1)
This results in 5-6 sec per epoch, consumes all of the GPU's RAM, and uses only a small amount of the GPU's processing power (<10%). Apparently, this is slower than using the CPU only for this model (8 GB video RAM vs. 16 GB system RAM??).
If the default mode (CPU & GPU) throws the following error, it seems the GPU is occupied by another process and restarting Windows helps:
"InternalError: Blas GEMM launch failed"
There are still lots of mysteries left for me:
Why is the "mixed" mode slower than CPU only?
Can you change visible devices without having to restart the kernel, to avoid the following error? "Visible devices cannot be modified after being initialized" (one idea is sketched after this list)
Why does the with tf.device() method not work for this model (no effect), whereas it works for the test_device() code?
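For that point about re-initialization, one workaround I am considering (an assumption on my side, not something I have verified in this exact setup) is to hide devices via the CUDA_VISIBLE_DEVICES environment variable before TensorFlow initializes, so set_visible_devices never has to be called on an already-initialized runtime:

import os
# must be set before TensorFlow initializes its devices (ideally before "import tensorflow")
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # hide all GPUs -> CPU-only run
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or expose only the first GPU instead

import tensorflow as tf
print(tf.config.get_visible_devices())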
If anybody can provide some insight, thank you very much :)
I am using PyTorch to train on some X-ray images, but I ran into the following issue:
In the line loss.backward(), the program just keeps running and never ends, and there is no error or warning.
loss, outputs = self.forward(images, targets)
loss = loss / self.accumulation_steps
print("loss calculated: " + str(loss))
if phase == "train":
    print("running loss backwarding!")
    loss.backward()
    print("loss is backwarded!")
    if (itr + 1) % self.accumulation_steps == 0:
        self.optimizer.step()
        self.optimizer.zero_grad()
The loss calculated before this step is something like tensor(0.8598, grad_fn=<DivBackward0>).
Could anyone help me understand why this keeps running, or suggest any good ways to debug the backward() function?
I am using torch 1.2.0+cu92 with the compatible cuda 10.0.
Thank you so much!!
It's hard to give a definite answer but I have a guess.
Your code looks fine but from the output you've posted (tensor(0.8598, grad_fn=<DivBackward0>)) I conclude that you are operating on your CPU and not on the GPU.
One possible explanation is that the backward pass is not running forever, but just takes a very long time. Training a large network on a CPU is much slower than on a GPU. Check your CPU and memory utilization. It might be that your data and model are too big to fit into your main memory, forcing the operating system to use your hard disk, which would slow down execution by several additional orders of magnitude. If this is the case, I generally recommend:
Use a smaller batch size.
Downscale your images (if possible).
Only open images that are currently needed.
Reduce the size of your model.
Use your GPU (if available) by calling model.cuda(); images = images.cuda() before starting your training (a rough sketch of this is shown right below).
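A rough sketch of that last point, assuming a training loop similar to the one posted (model, dataloader, images and targets are placeholders for your own objects, and I assume the model's forward returns (loss, outputs) as in your snippet):

import torch

# pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = model.to(device)                   # equivalent to model.cuda() when a GPU exists
optimizer = torch.optim.Adam(model.parameters())

for images, targets in dataloader:         # your existing training loop
    images = images.to(device)             # every batch must live on the same device as the model
    targets = targets.to(device)
    loss, outputs = model(images, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()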
If that doesn't solve your problem, you could start narrowing down the issue by doing some of the following:
Create a minimal working example to reproduce the issue (see the sketch after this list).
Check if the problem persists with other, very simple model architectures.
Check if the problem persists with different input data.
Check if the problem persists with a different PyTorch version.
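For the first point, a minimal, self-contained sketch (a toy model and random data, nothing from your actual pipeline) that times a single backward pass; if this finishes almost instantly, the hang is more likely caused by the size of your data/model than by backward() itself:

import time
import torch
import torch.nn as nn

# tiny stand-ins for the real model and data
model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 64 * 64, 2))
images = torch.randn(4, 1, 64, 64)
targets = torch.randint(0, 2, (4,))
criterion = nn.CrossEntropyLoss()

start = time.time()
loss = criterion(model(images), targets)
loss.backward()
print("backward finished in", time.time() - start, "seconds")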
I'm surprised to be facing an out-of-memory error using the tf.keras.applications.ResNet50 implementation on an NVIDIA RTX 2080 Ti (with 11 GB of memory!).
Question:
Is there something wrong with the workflow I use?
Notes:
I'm using tensorflow-gpu==2.0.0b1 with CUDA v10.1
I work on a segmentation task, thus the large output_shape
I build the batches myself, thus the use of train_on_batch()
Even when setting memory_growth to True, the memory gets filled up from 700 MB to 10850 MB in a fraction of a second (see the sketch below for how I set it).
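For reference, this is roughly how I enable memory growth (a sketch of the standard TF 2.x calls, executed before the model is built; not the exact cell from my notebook):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    # allocate GPU memory on demand instead of grabbing it all up front
    tf.config.experimental.set_memory_growth(gpu, True)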
Code:
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np
ke.backend.clear_session()
inputs = ke.layers.Input(shape=(512,1024,3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(512,1024)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())
images = np.zeros((1,512,1024,3), dtype=np.float32)
targets = np.zeros((1,512,1024,2), dtype=np.float32)
model.train_on_batch(images, targets)
ResNet being a complex model, the dimensions of the input might be the reason for the OOM error. Try reducing the dimensions and the corresponding batch size (as much as memory can hold) and try again.
As mentioned in the comments, it worked with batch size 1 and with dimensions 700*512 (a sketch of that change is shown below).
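As a sketch of what that change could look like in the question's code (only the spatial resolution is reduced; the exact values here, 512x704, are my own approximation of the 700*512 mentioned above):

import tensorflow as tf
import tensorflow.keras as ke
import numpy as np

h, w = 512, 704  # reduced resolution (assumed values)

inputs = ke.layers.Input(shape=(h, w, 3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(h, w)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())

images = np.zeros((1, h, w, 3), dtype=np.float32)   # batch size 1, as in the comment
targets = np.zeros((1, h, w, 2), dtype=np.float32)
model.train_on_batch(images, targets)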
I have been training an NN model using the Keras framework with 4 NVIDIA GPUs (data row count: ~160,000, column count: 5). Now I want to optimize its parameters using GridSearchCV.
However, I encountered several different errors whenever I tried to change n_jobs to any value other than one, such as:
CUDA OUT OF MEMORY
Can not get device properties error code : 3
Then I read this web page,
"# if you're not using a GPU, you can set n_jobs to something other than 1"
http://queirozf.com/entries/scikit-learn-pipeline-examples
So is it not possible to use multiple GPUs with GridSearchCV?
[Environment]
Ubuntu 16.04
Python 3.6.0
Keras / Scikit-Learn
Thanks!
According to the scikit-learn FAQ, GPU is NOT supported. Link
You can use n_jobs to use your CPU cores. If you want to run at maximum speed, you might want to use almost all of your cores:
import multiprocessing
n_jobs = multiprocessing.cpu_count()-1
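A minimal sketch of how that could be wired into GridSearchCV (build_model, the parameter grid and the data names are placeholders, not taken from the question; KerasClassifier is the usual wrapper that makes a Keras model scikit-learn compatible). Note that when the model trains on a GPU, keeping n_jobs=1 avoids several worker processes competing for the same GPU memory, which is one common cause of "CUDA OUT OF MEMORY":

import multiprocessing
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense

def build_model(units=32):
    # placeholder architecture; replace with your own model
    model = Sequential()
    model.add(Dense(units, activation='relu', input_shape=(5,)))  # 5 input columns
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model

estimator = KerasClassifier(build_fn=build_model, epochs=10, batch_size=256, verbose=0)
param_grid = {'units': [16, 32, 64]}

# CPU-only training: a parallel search over the grid is fine
n_jobs = multiprocessing.cpu_count() - 1
# GPU training: prefer n_jobs=1 so the workers don't fight over GPU memory
# n_jobs = 1

search = GridSearchCV(estimator, param_grid, cv=3, n_jobs=n_jobs)
# search.fit(X, y)   # X, y: your ~160,000 x 5 data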
I am loading a pre-trained model and then extracting only the trainable variables which I want to optimize (basically change or fine-tune) according to my custom loss. The problem is that the moment I pass a mini-batch of data to it, it just hangs and there is no progress. I used TensorBoard for visualization, but I don't know how to debug when no log info is available. I put some basic print statements around it but didn't get any helpful information.
Just to give an idea, this is the piece of code, in order:
# Load and build the model
model = skip_thoughts_model.SkipThoughtsModel(model_config, mode="train")
with tf.variable_scope("SkipThoughts"):
    model.build()
theta = [v for v in tf.get_collection(tf.GraphKeys.MODEL_VARIABLES, scope='SkipThoughts')
         if "SkipThoughts" in v.name]

# F Representation using Skip-Thoughts model
opt_F = tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])

# Training
sess.run([opt_F], feed_dict={idx: idxTensor})
And the model is from this repository:
The problem is with training, i.e. the last step. I verified that the theta list is not empty; it has 26 elements in it, like ...
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/beta:0
SkipThoughts/decoder_pre/gru_cell/candidate/layer_norm/w/gamma:0
SkipThoughts/logits/weights:0
SkipThoughts/logits/biases:0
SkipThoughts/decoder_post/gru_cell/gates/layer_norm/w_h/beta:0
...
Also, even after using tf.debug the issue remains. Maybe it really does take a lot of time, or is it stuck waiting for some other process? So, I also tried breaking down the
tf.train.AdamOptimizer(learning_rate).minimize(model.total_loss, var_list=[theta])
step into
opt = tf.train.AdamOptimizer(learning_rate)
gvs = opt.compute_gradients(model.total_loss, var_list=theta)
opt_F = opt.apply_gradients(gvs)
...
g = sess.run(gvs, feed_dict = {idx: idxTensor})
so that I can check whether the gradients are computed in the first place, but it got stuck at the same point. In addition to that, I also tried computing the gradients with tf.gradients over just one of the variables, and only for one of its dimensions, but the issue still exists.
I am running this piece of code in an IPython notebook on an Azure cluster with 1 Tesla K80 GPU. The GPU usage stays the same throughout the execution and there is no out-of-memory error.
The kernel interrupt doesn't work, and the only way to stop it is by restarting the notebook. Moreover, even if I put this code into a plain Python file, I still need to explicitly kill the process. In either case I don't get a stack trace showing the exact place where it is stuck. How should one debug such an issue? One idea I have is sketched below.
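One idea I am considering (an assumption on my part, not something I have confirmed helps here): setting an operation timeout on the session, so that a hanging sess.run aborts with a DeadlineExceededError instead of blocking forever, which should at least point to the op that never completes:

# give every blocking sess.run at most 60 seconds before raising DeadlineExceededError
config = tf.ConfigProto(operation_timeout_in_ms=60000)
sess = tf.Session(config=config)
# ... build the graph as above ...
sess.run([opt_F], feed_dict={idx: idxTensor})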
Any help and pointers in this regard would be much appreciated.