I'm training a network in PyTorch and using ReduceLROnPlateau as the scheduler.
I set verbose=True in the parameters, and my scheduler prints something like:
Epoch 159: reducing learning rate to 6.0000e-04.
Epoch 169: reducing learning rate to 3.0000e-04.
Epoch 178: reducing learning rate to 1.5000e-04.
Epoch 187: reducing learning rate to 7.5000e-05.
I would like to capture these epochs in some way, in order to obtain a list of all the epochs at which the scheduler reduced the learning rate.
Something like: lr_decrease_epochs = ['159', '169', '178', '187']
What is the simplest way to do that?
I don't think the scheduler keeps track of this (at least I didn't see anything like it in the source code), but you can easily record it yourself in your training loop.
Whenever the learning rate changes (you can read the current value from optimizer.param_groups[0]['lr']), simply record the current epoch.
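For example, a minimal sketch of that bookkeeping (assuming optimizer, a ReduceLROnPlateau scheduler, num_epochs, and a per-epoch validation loss val_loss are already defined):

lr_decrease_epochs = []
prev_lr = optimizer.param_groups[0]['lr']

for epoch in range(num_epochs):
    # ... training and validation, producing val_loss ...
    scheduler.step(val_loss)                      # ReduceLROnPlateau needs the monitored metric
    current_lr = optimizer.param_groups[0]['lr']
    if current_lr < prev_lr:                      # the scheduler reduced the learning rate
        lr_decrease_epochs.append(epoch)
    prev_lr = current_lr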
I have RTX 3070. Somehow using autocast slows down my code.
torch.version.cuda prints 11.1, torch.backends.cudnn.version() prints 8005 and my PyTorch version is 1.9.0. I’m using Ubuntu 20.04 with Kernel 5.11.0-25-generic.
That’s the code I’ve been using:
torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = net(inputs)
            loss = criterion(outputs, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end))
Without torch.cuda.amp.autocast(), 1 epoch takes 22 seconds, whereas with autocast() 1 epoch takes 30 seconds.
It turns out my model was not big enough to benefit from mixed precision. When I increased the in/out channels of the convolutional layers, it finally worked as expected.
I came across this post because I was trying the same code and seeing slower performance. By the way, to use the GPU you need to move the data onto it in each step:
inputs, labels = data[0].to(device), data[1].to(device)
Even after making my network 10 times bigger, I did not see a performance improvement.
Something else might be wrong at the setup level.
I am going to try PyTorch Lightning.
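For reference, here is a minimal sketch of the loop from the original post with the batches moved to the GPU (assuming net, criterion, optimizer, and trainloader are defined as before):

device = torch.device("cuda")
net = net.to(device)
scaler = torch.cuda.amp.GradScaler()

for epoch in range(10):
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)  # move the batch to the GPU
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = net(inputs)
            loss = criterion(outputs, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()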
Currently, I am taking a Udemy Python course for data science. In it, there is the following example to train a model in TensorFlow:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
model = Sequential()
# Choose whatever number of layers/neurons you want.
model.add(Dense(units=78,activation='relu'))
model.add(Dense(units=39,activation='relu'))
model.add(Dense(units=19,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x=X_train,
          y=y_train,
          epochs=3,
          validation_data=(X_test, y_test), verbose=1
          )
My goal now was to get this to run on my GPU. For that, I altered the last part as follows (the epochs are low on purpose, I just want to see how long it takes per epoch before scaling up):
with tf.device("/gpu:0"):
model.fit(x=X_train,
y=y_train,
epochs=3,
validation_data=(X_test, y_test), verbose=1
)
and for comparison, also as follows:
with tf.device("/cpu:0"):
model.fit(x=X_train,
y=y_train,
epochs=3,
validation_data=(X_test, y_test), verbose=1
)
However, the result is very unexpected: either both versions occupy all of the GPU's memory but seemingly don't do any calculations on it and take exactly the same time per epoch, or the GPU version simply crashes with the following error:
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\gpu\lib\site-packages\six.py in raise_from(value, from_value)
InternalError: Blas GEMM launch failed : a.shape=(32, 78), b.shape=(78, 78), m=32, n=78, k=78
[[node sequential/dense/MatMul (defined at <ipython-input-115-79c9a84ee89a>:8) ]] [Op:__inference_distributed_function_874]
Function call stack:
distributed_function
Sometimes it crashes, sometimes it kind of works but takes as long as the CPU. Sometimes even the CPU version takes 20 seconds per epoch, other times it takes 40 seconds. The code stays the same; all that changes is that I restart the kernel in between. I really don't understand it.
When I test the GPU and conda environment using the following code, everything seems to work fine, reproducible and the GPU is about 20x as fast as the CPU:
# https://gist.github.com/ikarus-999/1a845437b454cdfcc1eb5455d373fe63
import sys
import numpy as np
import tensorflow.compat.v1 as tf  # compatibility for TF 1 code
from datetime import datetime

def test_device(device_name: str):
    shape = (int(10000), int(10000))
    startTime = datetime.now()
    with tf.device(device_name):
        random_matrix = tf.random.uniform(shape=shape, minval=0, maxval=1)
        dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
        sum_operation = tf.reduce_sum(dot_operation)
        result = sum_operation
    print("Shape:", shape, "Device:", device_name)
    print("—" * 50)
    print(result)
    print("Time taken:", datetime.now() - startTime)
    print("\n" * 2)

test_device("/cpu:0")  # 6 sec
test_device("/gpu:0")  # 0.3 sec
So, I am sure there is something I am doing wrong.
TL;DR:
What would be the correct way to call model.fit on the GPU? How can different runs (without changing the code) result in such drastically different outcomes (crashes, vastly different calculation times)?
Any help is greatly appreciated, thx!
After a lot of trial and error I finally found a working way to force either CPU-only or "mixed" usage. GPU-only doesn't seem to work, though. The with tf.device() method from my original post doesn't seem to do anything in this scenario. I have to hide the GPU if I want to use only the CPU (TensorFlow 2.1.0):
CPU only
# force CPU (make CPU visible)
cpus = tf.config.experimental.list_physical_devices('CPU')
print(cpus)
tf.config.set_visible_devices([], 'GPU') # hide the GPU
tf.config.set_visible_devices(cpus[0], 'CPU') # unhide potentially hidden CPU
tf.config.get_visible_devices()
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), verbose=1
          )
This results in 3-4 sec per epoch and does not tax the GPU.
Restart the Kernel, then:
GPU only
# force GPU (make GPU visible)
# note: does not work without restarting the kernel, otherwise:
# "Visible devices cannot be modified after being initialized"
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)
tf.config.set_visible_devices([], 'CPU') # hide the CPU
tf.config.set_visible_devices(gpus[0], 'GPU') # unhide potentially hidden GPU
tf.config.get_visible_devices()
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), verbose=1
          )
That doesn't work as, apparently, the CPU is required by this model:
"NotFoundError: No CPU devices are available in this process"
Default (mixed CPU & GPU):
Restart the Kernel, then:
# test if CPU and GPU are visible
tf.config.get_visible_devices()
# [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
# PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
model.fit(x=X_train,
          y=y_train,
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), verbose=1
          )
This results in 5-6 sec per epoch, consumes all of the GPU's RAM, and uses only a small amount of the GPU's processing power (<10%). Apparently, this is slower than using the CPU alone for this model (8 GB video RAM vs. 16 GB system RAM??).
If the default mode (CPU & GPU) throws the following error, it seems the GPU is occupied by another process and restarting Windows helps:
"InternalError: Blas GEMM launch failed"
There are still lots of mysteries left for me:
Why is the "mixed" mode slower than CPU only?
Can you change visible devices without having to restart the Kernel to avoid the following error? "Visible devices cannot be modified after being initialized"
Why does the with tf.device() method not work for this model (no effect), whereas it works for the test_device() code?
If anybody can provide some insight, thank you very much :)
model.evaluate(..., verbose=1) displays a line that I can't understand; can anyone please explain to me what it means?
278452/Unknown - 36360s 13ms/step - loss: 0.783 - accuracy: 0.708
Those numbers keep incrementing and the process doesn't stop.
Could it be because I don't use callbacks?
That line shows the details of each epoch:
time of execution for the epoch = 36360 s
rate at which each step/image is processed = 13 ms/step
average loss per step/image, i.e. how far the predictions are from the true values = 0.783
accuracy, i.e. correct predictions / total observations = 0.708
Callbacks are not required for this; they just provide convenient ways to improve accuracy.
Your model's execution time is very high, about 36360 seconds per epoch.
One thing I notice is the "Unknown", which is unusual: normally that position shows (number of samples processed / total number of samples).
In my case
Epoch 1/20
187/187 [==============================] - 34s 181ms/step - loss: 1.6447 - accuracy: 0.6380
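If the "Unknown" bothers you, one option (assuming you evaluate on a generator or tf.data dataset whose length Keras cannot infer; the names and numbers below are hypothetical) is to pass steps explicitly so the progress bar can show a total:

# hypothetical numbers: 278452 samples with a batch size of 32
steps = 278452 // 32
model.evaluate(test_generator, steps=steps, verbose=1)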
I am using PyTorch to train on some X-ray images, but I ran into the following issue:
at the line loss.backward(), the program just keeps running and never ends, and there is no error or warning.
loss, outputs = self.forward(images, targets)
loss = loss / self.accumulation_steps
print("loss calculated: " + str(loss))
if phase == "train":
    print("running loss backwarding!")
    loss.backward()
    print("loss is backwarded!")
    if (itr + 1) % self.accumulation_steps == 0:
        self.optimizer.step()
        self.optimizer.zero_grad()
The loss calculated before this is something like tensor(0.8598, grad_fn=<DivBackward0>).
Could anyone help me with why this keeps running or any good ways to debug the backward() function?
I am using torch 1.2.0+cu92 with the compatible cuda 10.0.
Thank you so much!!
It's hard to give a definite answer but I have a guess.
Your code looks fine but from the output you've posted (tensor(0.8598, grad_fn=<DivBackward0>)) I conclude that you are operating on your CPU and not on the GPU.
One possible explanation is that the backward pass is not running forever, but just takes very, very long. Training a large network on a CPU is much slower than on a GPU. Check your CPU and memory utilization. It might be that your data and model are too big to fit into your main memory, forcing the operating system to use your hard disk, which would slow down execution by several additional orders of magnitude. If this is the case, I generally recommend:
Use a smaller batch size.
Downscale your images (if possible).
Only open images that are currently needed.
Reduce the size of your model.
Use your GPU (if available) by calling model.cuda(); images = images.cuda() before starting your training.
If that doesn't solve your problem, you could start narrowing down the issue by doing some of the following:
Create a minimal working example to reproduce the issue (a sketch follows below).
Check if the problem persists with other, very simple model architectures.
Check if the problem persists with different input data.
Check if the problem persists with a different PyTorch version.
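As a starting point for that minimal working example, a sketch like this (tiny random inputs, not your actual data) shows whether loss.backward() itself hangs on your installation:

import time
import torch
import torch.nn as nn

# tiny stand-in model and one random "image" batch
model = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
x = torch.randn(4, 1, 64, 64)
target = torch.randint(0, 2, (4,))

loss = nn.CrossEntropyLoss()(model(x), target)

start = time.time()
loss.backward()                                   # should return almost instantly
print("backward took", time.time() - start, "seconds")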
I want to tune the parameters of the SVR() regression function. It starts processing and doesn't stop, and I am unable to figure out the problem. I am predicting a parameter using the SVM regression function SVR(). The results are not good with the default values in Python, so I want to try tuning it with GridSearchCV. The last part, grids.fit(Xtrain, ytrain), starts running without giving any error and doesn't stop.
SVR() tuning using GridSearchCV
Code:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param = {'kernel': ('linear', 'poly', 'rbf', 'sigmoid'), 'C': [1, 5, 10],
         'degree': [3, 8], 'coef0': [0.01, 10, 0.5], 'gamma': ('auto', 'scale')}
modelsvr = SVR()
grids = GridSearchCV(modelsvr, param, cv=5)
grids.fit(Xtrain, ytrain)
It Continues to process without stopping.
Yes, you are right. I have come across the same scenario when trying to run GridSearchCV for SVR(). The likely reasons are: 1) your processor memory (RAM) is small, and 2) the training sample size is large; with a low-memory processor the grid search simply takes a very long time to run, so the job keeps running without any error.
For your info: I have run a grid search with a training sample size of 30K using 16 GB of RAM, and it took 210 minutes to finish. So patience is a must here.
Happy Analyzing !!
Maybe you should add two more options to your GridSearchCV (n_jobs and verbose):
grid_search = GridSearchCV(estimator=modelsvr, param_grid=param,
                           cv=3, n_jobs=-1, verbose=2)
verbose means that you see some output about the progress of your process.
n_jobs is the number of cores used (-1 means all cores/threads you have available).
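For example, with the objects from the question (assuming Xtrain and ytrain are your existing training arrays, and param and modelsvr are defined as above), a full run would then look like:

grid_search = GridSearchCV(estimator=modelsvr, param_grid=param,
                           cv=3, n_jobs=-1, verbose=2)   # parallel search with progress output
grid_search.fit(Xtrain, ytrain)
print(grid_search.best_params_)   # the best hyper-parameter combination found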