Currently, I am doing a Udemy Python course for data science. In it, there is the following example to train a model in TensorFlow:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
model = Sequential()
# Choose whatever number of layers/neurons you want.
model.add(Dense(units=78,activation='relu'))
model.add(Dense(units=39,activation='relu'))
model.add(Dense(units=19,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x=X_train,
          y=y_train,
          epochs=3,
          validation_data=(X_test, y_test), verbose=1
          )
My goal now was to get this to run on my GPU. For that, I altered the last part as follows (the epoch count is low on purpose; I just want to see how long one epoch takes before scaling up):
with tf.device("/gpu:0"):
model.fit(x=X_train,
y=y_train,
epochs=3,
validation_data=(X_test, y_test), verbose=1
)
and for comparison, also as follows:
with tf.device("/cpu:0"):
model.fit(x=X_train,
y=y_train,
epochs=3,
validation_data=(X_test, y_test), verbose=1
)
However, the results are very unexpected: either both versions occupy all of the GPU's memory but seemingly don't do any calculations on it and take exactly the same time per epoch, or the GPU version simply crashes with the following error:
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\gpu\lib\site-packages\six.py in raise_from(value, from_value)
InternalError: Blas GEMM launch failed : a.shape=(32, 78), b.shape=(78, 78), m=32, n=78, k=78
[[node sequential/dense/MatMul (defined at <ipython-input-115-79c9a84ee89a>:8) ]] [Op:__inference_distributed_function_874]
Function call stack:
distributed_function
Sometimes it crashes, sometimes it kind of works but takes as long as the CPU. Sometimes even the CPU version takes 20 sec per epoch, other times 40 sec. The code stays the same; all that changes is that I restart the kernel in between. I really don't understand it.
When I test the GPU and the conda environment using the following code, everything seems to work fine and reproducibly, and the GPU is about 20x as fast as the CPU:
# https://gist.github.com/ikarus-999/1a845437b454cdfcc1eb5455d373fe63
import sys
import numpy as np
import tensorflow.compat.v1 as tf # compatibility for TF 1 code
from datetime import datetime
def test_device(device_name: str):
    shape = (int(10000), int(10000))
    startTime = datetime.now()
    with tf.device(device_name):
        random_matrix = tf.random.uniform(shape=shape, minval=0, maxval=1)
        dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
        sum_operation = tf.reduce_sum(dot_operation)
        result = sum_operation
    print("Shape:", shape, "Device:", device_name)
    print("—"*50)
    print(result)
    print("Time taken:", datetime.now() - startTime)
    print("\n" * 2)
test_device("/cpu:0") # 6 sec
test_device("/gpu:0") # 0.3 sec
So, I am sure there is something I am doing wrong.
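For reference, this is how I check which devices TensorFlow detects and where individual ops get placed (a small diagnostic sketch using the standard tf.config / tf.debugging APIs of TensorFlow 2.x; it is not part of the course code):
import tensorflow as tf

print(tf.config.experimental.list_physical_devices('GPU'))  # should list the GPU
tf.debugging.set_log_device_placement(True)                 # log the device of every op

# any small op now reports its placement in the console:
a = tf.random.uniform((1000, 1000))
b = tf.matmul(a, a)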
TL;DR:
What would be the correct way to call model.fit on the GPU? And how can different runs (without changing the code) result in such drastically different outcomes (crashes, vastly different calculation times)?
Any help is greatly appreciated, thx!
After a lot of trial and error I finally found a working way to force either CPU or "mixed usage". GPU only doesn't seem to work, though. The with tf.device() method from my original post doesn't seem to do anything in this scenario. I have to hide the GPU if I want to use only the CPU (TensorFlow 2.1.0):
CPU only
# force CPU (make CPU visible)
cpus = tf.config.experimental.list_physical_devices('CPU')
print(cpus)
tf.config.set_visible_devices([], 'GPU') # hide the GPU
tf.config.set_visible_devices(cpus[0], 'CPU') # unhide potentially hidden CPU
tf.config.get_visible_devices()
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), verbose=1
          )
This results in 3-4 sec per epoch and does not tax the GPU.
Restart the Kernel, then:
GPU only
# force GPU (make GPU visible)
# note: does not work without restarting the kernel, otherwise:
# "Visible devices cannot be modified after being initialized"
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)
tf.config.set_visible_devices([], 'CPU') # hide the CPU
tf.config.set_visible_devices(gpus[0], 'GPU') # unhide potentially hidden GPU
tf.config.get_visible_devices()
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), verbose=1
          )
That doesn't work as, apparently, the CPU is required by this model:
"NotFoundError: No CPU devices are available in this process"
Default (mixed CPU & GPU):
Restart the Kernel, then:
# test if CPU and GPU are visible
tf.config.get_visible_devices()
# [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
# PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), verbose=1
          )
This results in 5-6 sec per epoch, consumes all of the GPU's RAM and uses only a small amount of its processing power (<10%). Apparently, this is slower than using the CPU only for this model (8 GB video RAM vs. 16 GB system RAM??).
If the default mode (CPU & GPU) throws the following error, it seems the GPU is occupied by another process and restarting Windows helps:
"InternalError: Blas GEMM launch failed"
There are still lots of mysteries left for me:
Why is the "mixed" mode slower than CPU only?
Can you change visible devices without having to restart the Kernel to avoid the following error? "Visible devices cannot be modified after being initialized"
Why does the with tf.device() method not work for this model (no effect), whereas it works for the test_device() code?
If anybody can provide some insight, thank you very much :)
Related
I have an RTX 3070. Somehow, using autocast slows down my code.
torch.version.cuda prints 11.1, torch.backends.cudnn.version() prints 8005 and my PyTorch version is 1.9.0. I’m using Ubuntu 20.04 with Kernel 5.11.0-25-generic.
That’s the code I’ve been using:
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = net(inputs)
            loss = criterion(outputs, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end))
Without torch.cuda.amp.autocast(), 1 epoch takes 22 seconds, whereas with autocast() 1 epoch takes 30 seconds.
It turns out my model was not big enough to utilize mixed precision. When I increased the in/out channels of the convolutional layers, it finally worked as expected.
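To illustrate what "increasing the in/out channels" means in practice (the channel counts below are made up for illustration, not my actual model):
import torch.nn as nn

# A layer this small leaves the GPU underutilized, so the autocast overhead dominates:
small = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Wider layers give each kernel enough work for mixed precision to pay off:
wide = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=3, padding=1)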
I came across this post because I was trying the same code and seeing slower performance. BTW, to use the GPU you need to move the data onto the GPU device in each step:
inputs, labels = data[0].to(device), data[1].to(device)
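Spelled out, the device handling typically looks like this (a sketch reusing the names from the loop above; net, trainloader, criterion, optimizer and scaler are assumed to be defined as in the original code):
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = net.to(device)                       # move the model to the GPU once

for inputs, labels in trainloader:
    inputs, labels = inputs.to(device), labels.to(device)  # move each batch
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        outputs = net(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()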
Even when I made my network 10 times bigger I did not see the performance improvement.
Something else might be wrong at the setup level.
I am going to try PyTorch Lightning.
I am using PyTorch to train on some X-ray images, but I ran into the following issue:
At the line loss.backward(), the program just keeps running and never ends, and there is no error or warning.
loss, outputs = self.forward(images, targets)
loss = loss / self.accumulation_steps
print("loss calculated: " + str(loss))
if phase == "train":
    print("running loss backwarding!")
    loss.backward()
    print("loss is backwarded!")
    if (itr + 1) % self.accumulation_steps == 0:
        self.optimizer.step()
        self.optimizer.zero_grad()
The loss calculated before this is something like tensor(0.8598, grad_fn=<DivBackward0>).
Could anyone help me with why this keeps running or any good ways to debug the backward() function?
I am using torch 1.2.0+cu92 with the compatible cuda 10.0.
Thank you so much!!
It's hard to give a definite answer but I have a guess.
Your code looks fine but from the output you've posted (tensor(0.8598, grad_fn=<DivBackward0>)) I conclude that you are operating on your CPU and not on the GPU.
One possible explanation is that the backward pass is not running forever, but just takes a very long time. Training a large network on a CPU is much slower than on a GPU. Check your CPU and memory utilization. It might be that your data and model are too big to fit into your main memory, forcing the operating system to use your hard disk, which would slow down execution by several additional orders of magnitude. If this is the case, I generally recommend:
Use a smaller batch size.
Downscale your images (if possible).
Only open images that are currently needed.
Reduce the size of your model.
Use your GPU (if available) by calling model.cuda(); images = images.cuda() before starting your training.
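Before any of that, it is worth confirming where the model and the data actually live; a minimal check (model and images stand in for your own objects):
import torch

print(torch.cuda.is_available())           # True if a usable GPU is present
print(next(model.parameters()).device)     # where the model's weights live
print(images.device)                       # where the current batch lives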
If that doesn't solve your problem you could start narrowing down the issue by doing some of the following:
Create a minimal working example to reproduce the issue.
Check if the problem persists with other, very simple model architectures.
Check if the problem persists with different input data.
Check if the problem persists with a different PyTorch version.
I want to tune the parameters of the SVR() regression function. It starts processing and doesn't stop, and I am unable to figure out the problem. I am predicting a parameter using the SVM regression function SVR(). The results are not good with the default values in Python, so I want to try tuning it with GridSearchCV. The last part, grids.fit(Xtrain, ytrain), starts running without giving any error and doesn't stop.
SVR() tuning using GridSearchCV
Code:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param = {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'C': [1, 5, 10],
         'degree': [3, 8], 'coef0': [0.01, 10, 0.5], 'gamma': ('auto', 'scale')}
modelsvr = SVR()
grids = GridSearchCV(modelsvr, param, cv=5)
grids.fit(Xtrain, ytrain)
It continues to process without stopping.
Yes, you are right. I have come across the same scenario when trying to run GridSearchCV for SVR(). The likely reasons are: 1) your machine has relatively little RAM, and 2) the training sample size is large. With limited memory and a large grid of fits to run, the grid search can take a very long time without reporting any error.
For your info: I ran GridSearchCV with a training sample size of 30K on 16 GB of RAM, and it took 210 minutes to finish the run. So patience is a must here.
Happy Analyzing !!
Maybe you should add two more options to your GridSearchCV (n_jobs and verbose):
grid_search = GridSearchCV(estimator = svr_gs, param_grid = param,
                           cv = 3, n_jobs = -1, verbose = 2)
verbose means that you see some output about the progress of the process.
n_jobs is the number of cores used (-1 means all cores/threads you have available).
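Applied to the SVR example above, that could look like this (a sketch; Xtrain/ytrain are the asker's data, and the grid is trimmed so the number of fits stays manageable):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param = {'kernel': ('linear', 'rbf'), 'C': [1, 5, 10], 'gamma': ('auto', 'scale')}
grids = GridSearchCV(estimator=SVR(), param_grid=param,
                     cv=3, n_jobs=-1, verbose=2)   # all cores, with progress output
grids.fit(Xtrain, ytrain)
print(grids.best_params_)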
I'm surprised to face an out-of-memory error using the tf.keras.applications.ResNet50 implementation on an Nvidia RTX 2080 Ti (with 11 GB of memory!).
Question:
Is there something wrong with the workflow I use?
Notes:
I'm using tensorflow-gpu==2.0.0b1 with CUDA v10.1
I work on a segmentation task, thus the large output_shape
I build the batches myself, thus the use of train_on_batch()
Even when setting memory_growth to True, the memory gets filled up from 700 MB to 10850 MB in a fraction of a second.
Code:
import tensorflow as tf
import tensorflow.keras as ke
import numpy as np
ke.backend.clear_session()
inputs = ke.layers.Input(shape=(512,1024,3), dtype="float32")
outputs = ke.applications.ResNet50(include_top=False, weights="imagenet")(inputs)
outputs = ke.layers.Lambda(lambda x: tf.compat.v1.image.resize_bilinear(x, size=(512,1024)))(outputs)
outputs = ke.layers.Conv2D(2, 1, activation="softmax")(outputs)
model = ke.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=ke.optimizers.RMSprop(lr=0.001), loss=ke.losses.CategoricalCrossentropy())
images = np.zeros((1,512,1024,3), dtype=np.float32)
targets = np.zeros((1,512,1024,2), dtype=np.float32)
model.train_on_batch(images, targets)
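For reference, the memory_growth setting mentioned in the notes is typically applied like this, before any GPU op runs (it only makes TensorFlow allocate memory on demand instead of reserving everything upfront; it does not cap total usage):
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)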
ResNet being a complex model, the dimensions of the input might be the reason for the OOM error. Try reducing the dimensions and the corresponding batch size (as much as the memory can hold).
As mentioned in the comments, it worked with batch size 1 and with dimensions 700x512.
I have two computers with the same GPU (GTX 1080), with the same copy of the OS and software installed. But when I run my TensorFlow program (an RNN model), the speeds are very different. One is about 1.5x faster than the other.
Here are the key specs of the two:
SystemA: Asus Z170-P, i7 6700T, 32GB RAM, GTX 1080.
SystemB: Asus X99 E-WS, i7 5930K, 128GB RAM, GTX 1080. (The problematic one)
Both are installed with (using the same method):
OS: Ubuntu 16.04
GPU driver version: 378.13
Cuda version: 8.0
cuDNN version: 5.1
TensorFlow: installed via pip install tensorflow-gpu==1.0.1
Python: Anaconda 3.6
Sample code:
import tensorflow as tf
import numpy as np
from tqdm import trange
h,w = 3000, 2000
steps = 1000
x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x,t)
x0 = np.random.random(size=[h, w])
sess = tf.Session()
for i in trange(steps):
    x0 = sess.run(m, feed_dict={x: x0})
SystemA performs 75 iter/sec while systemB only reaches 50 iter/sec; yes, the weaker machine is actually the faster one.
Key observations:
SystemB has far more page faults while running the program.
Monitoring the Volatile GPU-Util from nvidia-smi, systemA stably sits at about 40% while systemB is at about 30%.
Things I have tried on systemB:
Upgrade BIOS to the latest version and reset default settings.
Call Asus customer service for help.
Swap GPU card with system A.
Change PCI-e slot to make sure it running at x16 gen3.
Add LD_PRELOAD="/usr/lib/libtcmalloc.so" to the .bashrc file.
The main differences of the output of /usr/bin/time -v are:
# The first value is for systemB and the second is for systemA.
System time (seconds): 7.28 2.95
Percent of CPU this job got: 85% 106%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.41 0:14.89
Minor (reclaiming a frame) page faults: 684695 97853
Involuntary context switches: 164 91063
File system inputs: 0 24
File system outputs: 8 0
Can anybody point me to a direction of how to profile/debug this issue? Many thanks in advance!
There is a chance that you may not be using the GPU. To test this, use
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
to display what devices you are using.
If indeed you are using CPU, then you can add the following before your tensorflow code
with tf.device('/gpu:0'):  # NEW LINE
    x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
    t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
    m = tf.matmul(x,t)
If this isn't the case, add a comment with your results and I'll follow up to see what else I can do.
According to some sources, tf.constant is a GPU memory hog. Try replacing
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
with
t = tf.Variable(np.random.random(size=[w, w]), dtype=tf.float32)
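Note that if you switch to a tf.Variable, it also has to be initialized before the timing loop (otherwise TF 1.x raises a FailedPreconditionError), e.g. right after creating the session:
sess.run(tf.global_variables_initializer())  # initialize t before running the loop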
Trying a model without data traffic between the CPU and the GPU:
import tensorflow as tf
import numpy as np
from tqdm import trange
h,w = 3000, 2000
steps = 1000
x = tf.random_normal( [h, w] , dtype=tf.float32 )
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x,t)
s = tf.reduce_mean( tf.reduce_mean( m ) )
sess = tf.Session()
for i in range(steps):
    sess.run(s)
Results of Experimentation with Xer
After much discussion and troubleshooting, it has become apparent that the two machines are indeed different. The Nvidia cards were swapped, which resulted in no change. They have two different CPUs, one with a built-in graphics processor and one without, and one CPU is faster than the other. I suggested that the machine with onboard graphics on the i7 have the OS's graphical windowing system disabled, to make sure that the test was unused GPU vs. unused GPU. The problem persisted.
The original problem that was posted creates huge amounts of data traffic across the main bus from the CPU to the Nvidia GPU, as can be seen here:
Tx Throughput : 75000 KB/s
Rx Throughput : 151000 KB/s
We experimented with changing the size of the problem (w = 2000, w = 200, and w = 1000) and found that when w was small enough, the two machines performed nearly identically. w, though, controls not only the size of the problem on the GPU but also the amount of traffic between the CPU and the GPU.
Although we did not find a solution or an exact model, I believe that after much exploration with @Xer I can say with confidence that the two systems are not the same, and their physical differences (bus + CPU) account for the performance difference.
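For anyone who wants to reproduce the w-sweep described above, here is a sketch based on the sample code from the question (TF 1.x API; the sizes and step count are illustrative, not the exact values we used):
import time
import numpy as np
import tensorflow as tf

h, steps = 3000, 200
for w in (200, 1000, 2000):
    tf.reset_default_graph()
    x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
    t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
    m = tf.matmul(x, t)
    x0 = np.random.random(size=[h, w])
    with tf.Session() as sess:
        start = time.time()
        for _ in range(steps):
            x0 = sess.run(m, feed_dict={x: x0})  # feed_dict forces a CPU->GPU copy every step
        print("w=%d: %.1f iter/sec" % (w, steps / (time.time() - start)))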