Why does the evaluation process not stop - TensorFlow 2.x

model.evaluate(..., verbose=1) displays a line that I can't understand; can anyone explain to me what it means?
278452/Unknown - 36360s 13ms/step - loss: 0.783 - accuracy: 0.708
Those numbers keep incrementing and the process doesn't stop.
Could it be because I don't use callbacks?

This line shows the following details about the run so far:
time of execution so far = 36360s
rate at which each step (batch of images) is processed = 13ms/step
average loss per step, i.e. how far the predictions are from the true labels = 0.783
accuracy, i.e. correct predictions / total observations = 0.708
Callbacks are not required here; they just provide a convenient way to improve accuracy, so not using them is not what keeps the process running.
Your model's execution time is very high, about 36360 seconds so far.
The one thing I notice is the "Unknown", which is unusual and seems to be the real problem: normally the progress bar shows (images processed / total number of images), so here Keras does not know the total.
In my case
Epoch 1/20
187/187 [==============================] - 34s 181ms/step - loss: 1.6447 - accuracy: 0.6380
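If the "Unknown" is the issue, one way around it (a minimal sketch, not from the original post; val_dataset, num_samples and batch_size are placeholder names) is to tell evaluate() explicitly how many batches to run, since Keras cannot infer a total from a generator or a tf.data.Dataset of unknown cardinality:

import tensorflow as tf

# Assumed: the validation data comes from a source whose size Keras cannot infer,
# which is why the progress bar shows "Unknown" and never reaches a total.
num_samples = 278452      # placeholder: known size of the validation set
batch_size = 32           # placeholder: batch size used by the input pipeline
steps = num_samples // batch_size

# With `steps` given, the progress bar shows steps/steps and evaluate() stops there.
results = model.evaluate(val_dataset, steps=steps, verbose=1)
print(dict(zip(model.metrics_names, results)))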

Related

Pytorch Scheduler: how to get decreasing LR epochs

I'm training a network in pytorch and using ReduceLROnPlateau as scheduler.
I set verbose=True in the parameters and my scheduler prints something like:
Epoch 159: reducing learning rate to 6.0000e-04.
Epoch 169: reducing learning rate to 3.0000e-04.
Epoch 178: reducing learning rate to 1.5000e-04.
Epoch 187: reducing learning rate to 7.5000e-05.
I would like to get the epochs in some way, in order to obtain a list with all the epochs in which the scheduler reduced the learning rate.
Something like: lr_decrease_epochs = ['159', '169', '178', '187']
What is the simplest way to do that?
I think the scheduler doesn't keep track of this (at least I didn't see anything like it in the source code), but you can simply keep track of it in your training loop.
Whenever the learning rate changes (you can read the current value from optimizer.param_groups[0]['lr']), you record the current epoch.
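A minimal sketch of that idea (optimizer, scheduler, num_epochs, train_one_epoch and validate are placeholder names, not the original poster's code):

lr_decrease_epochs = []                       # epochs where the LR was reduced
prev_lr = optimizer.param_groups[0]["lr"]     # LR before training starts

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)         # placeholder training step
    val_loss = validate(model)                # placeholder validation step
    scheduler.step(val_loss)                  # ReduceLROnPlateau needs the metric
    current_lr = optimizer.param_groups[0]["lr"]
    if current_lr < prev_lr:                  # the scheduler lowered the LR this epoch
        lr_decrease_epochs.append(epoch)
    prev_lr = current_lr

print(lr_decrease_epochs)                     # e.g. [159, 169, 178, 187]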

Mixed Precision(Pytorch Autocast) Slows Down the Code

I have RTX 3070. Somehow using autocast slows down my code.
torch.version.cuda prints 11.1, torch.backends.cudnn.version() prints 8005 and my PyTorch version is 1.9.0. I’m using Ubuntu 20.04 with Kernel 5.11.0-25-generic.
That’s the code I’ve been using:
scaler = torch.cuda.amp.GradScaler()   # presumably created before the timed loop; needed for scaler.scale() below

torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = net(inputs)
            loss = criterion(outputs, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
end.record()
torch.cuda.synchronize()
print(start.elapsed_time(end))
Without torch.cuda.amp.autocast(), 1 epoch takes 22 seconds, whereas with autocast() 1 epoch takes 30 seconds.
It turns out, my model was not big enough to utilize mixed precision. When I increased the in/out channels of convolutional layer, it finally worked as expected.
I came across this post because I was trying the same code and seeing slower performance. BTW, to use the GPU you need to move the data onto it at each step:
inputs, labels = data[0].to(device), data[1].to(device)
Even after I made my network 10 times bigger, I did not see the performance gain.
Something else might be wrong at the setup level.
I am going to try PyTorch Lightning.
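A quick sanity check along those lines (a sketch, not from the original answers; net, trainloader and device are placeholder names): confirm that CUDA is available and that both the model and a batch actually live on the GPU before timing AMP.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.is_available())          # should print True on an RTX 3070

net = net.to(device)                      # placeholder model
print(next(net.parameters()).device)      # expect: cuda:0

inputs, labels = next(iter(trainloader))  # placeholder DataLoader
inputs, labels = inputs.to(device), labels.to(device)
print(inputs.device, inputs.dtype)        # expect: cuda:0 torch.float32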

pytorch loss.backward() keeps running for hours

I am using pytorch to train on some x-ray images, but I ran into the following issue:
in the line loss.backward(), the program just keeps running and never ends, and there is no error or warning.
loss, outputs = self.forward(images, targets)
loss = loss / self.accumulation_steps
print("loss calculated: " + str(loss))
if phase == "train":
    print("running loss backwarding!")
    loss.backward()
    print("loss is backwarded!")
    if (itr + 1) % self.accumulation_steps == 0:
        self.optimizer.step()
        self.optimizer.zero_grad()
The loss calculated before this is something like tensor(0.8598, grad_fn=<DivBackward0>).
Could anyone help me with why this keeps running or any good ways to debug the backward() function?
I am using torch 1.2.0+cu92 with the compatible cuda 10.0.
Thank you so much!!
It's hard to give a definite answer but I have a guess.
Your code looks fine but from the output you've posted (tensor(0.8598, grad_fn=<DivBackward0>)) I conclude that you are operating on your CPU and not on the GPU.
One possible explanation is that the backward pass is not running forever, but just takes very, very long. Training a large network on a CPU is much slower than on a GPU. Check your CPU and memory utilization. It might be that your data and model are too big to fit into your main memory, forcing the operating system to use your hard disk, which would slow down execution by several additional orders of magnitude. If this is the case I generally recommend:
Use a smaller batch size.
Downscale your images (if possible).
Only open images that are currently needed.
Reduce the size of your model.
Use your GPU (if available) by calling model.cuda(); images = images.cuda() before starting your training.
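Following that last suggestion, a minimal sketch of what the GPU-enabled loop could look like (placeholder names model, train_loader, accumulation_steps; this mirrors the structure of the code in the question rather than being the actual code):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)                   # same effect as model.cuda() when CUDA is available

for itr, (images, targets) in enumerate(train_loader):
    images = images.to(device)             # move each batch to the GPU as well
    targets = targets.to(device)
    loss, outputs = model(images, targets)
    loss = loss / accumulation_steps
    loss.backward()
    if (itr + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()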
If that doesn't solve your problem you could start narrowing down the issue by doing some of the following:
Create a minimal working example to reproduce the issue.
Check if the problem persists with other, very simple model architectures.
Check if the problem persists with different input data.
Check if the problem persists with a different PyTorch version.

NodeJS uptime and the epoch

The following code calculates the epoch time of boot.
console.log(Math.floor(new Date() /1000) - os.uptime())
It should be constant, but there is a 1-second anomaly.
Anyone knows why?
The OS doesn't necessarily start exactly on a second boundary. While you're getting the OS uptime down to the second, the machine may have come up partway through a second.
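A small illustration of that rounding effect (in Python with made-up numbers, just to show the arithmetic): because the uptime is truncated to whole seconds, the computed boot time can flip by one second depending on when you sample it.

# Suppose the machine booted at t = 1000.6 (a fractional instant).
boot = 1000.6
for now in (1500.2, 1500.9):        # two different sampling moments
    uptime = int(now - boot)        # uptime reported in whole seconds
    print(int(now) - uptime)        # prints 1001, then 1000 - the 1-second anomaly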

Number of digits in Epoch time

I'm working on a machine that runs some code which sets the time when I set the password. The time set is epoch time, but it has 13 digits. When I wrote a simple program to get the epoch time and ran it on my personal computer running Linux, it returned an epoch time with 10 digits. Would anyone know what the extra three digits signify?
Thanks in advance
Probably seconds vs milliseconds.
You'd have to consult the specific documentation though.
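A quick way to see the difference (a Python illustration, not from the original answer): the current Unix time in seconds has 10 digits, while the same instant expressed in milliseconds has 13.

import time

seconds = int(time.time())               # e.g. 1700000000     -> 10 digits
milliseconds = int(time.time() * 1000)   # e.g. 1700000000000  -> 13 digits
print(len(str(seconds)), len(str(milliseconds)))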
