I have an RTX 3060 for deep learning. There is no problem when I run a model with 40 million parameters (batch size 64) on the CPU, but when I run the same model on the GPU I get a ResourceExhaustedError. I bought this GPU specifically to run deep learning algorithms faster, yet on the contrary I can't train models at all. What can I do about this issue?
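A minimal sketch of two common mitigations, assuming TensorFlow 2.x (ResourceExhaustedError is TensorFlow's out-of-memory error); whether they are enough depends on the model and input pipeline, which the question does not show:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving nearly all of it at
# start-up; this must run before any tensors are placed on the GPU.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# If the error persists, the usual fix is a smaller batch size in the
# existing fit() call, e.g. model.fit(..., batch_size=16) instead of 64,
# since a 12 GB card holds far less than typical system RAM.
```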
Is there a way to manage how much memory is reserved for PyTorch? I've been trying to fine-tune a BERT model from Hugging Face, and even on an A100 NVIDIA GPU with 80 GiB of GPU memory from Paperspace I still hit "CUDA out of memory". I'm not training from scratch or anything, and my dataset has fewer than 250,000 examples. I don't know if I'm missing something or if there is a way to reduce the memory allocated by PyTorch.
(Seriously, I just wanna cry now.)
Error specs
I've tried reducing the batch size for both the training and eval datasets, and turning on fp16=True. I even tried gradient accumulation, but nothing seems to work. I've been cleaning up with gc.collect() and torch.cuda.empty_cache() and restarting my kernel. I even tried a smaller model.
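For reference, one hedged combination of those knobs, assuming the Hugging Face Trainer API (output_dir and the batch sizes below are placeholders, not values from the question):

```python
from transformers import TrainingArguments

# Placeholder values: each option trades throughput for memory.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,     # small micro-batch
    gradient_accumulation_steps=16,    # effective train batch of 64
    fp16=True,                         # half-precision forward/backward
    gradient_checkpointing=True,       # recompute activations instead of storing them
    per_device_eval_batch_size=4,
    eval_accumulation_steps=8,         # flush eval predictions to CPU in chunks
)
```

A frequent culprit during evaluation specifically is accumulating every batch's logits on the GPU, which eval_accumulation_steps avoids; truncating inputs to a shorter max_length often matters more than dataset size.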
I have adapted the base Transformer model for my corpus of aligned Arabic-English sentences. The model has trained for 40 epochs, and accuracy (SparseCategoricalAccuracy) is improving by about 0.0004 per epoch.
To achieve good results, my estimate is that the final accuracy needs to reach around 0.5, and accuracy after 40 epochs is 0.0592.
I am running the model on a tesla 2 p80 GPU, and each epoch takes ~2690 seconds.
This implies I need at least 600 epochs and training time would be 15-18 days.
Should I continue with the training, or is there something wrong in my procedure, given that the base Transformer in the research paper was trained on an English-French corpus?
Key highlights:
Byte-pair encoding of sentences
Maxlen_len = 100
batch_size = 64
No pre-trained embeddings were used.
Do you mean a Tesla K80 on an AWS p2.xlarge instance?
If that is the case, those GPUs are very slow. You should use p3 instances on AWS with V100 GPUs; you will get around a 6-7x speedup.
Check out this for more details.
Also, if you are not using the standard model and have made changes to the model or dataset, try tuning the hyperparameters. The simplest thing is to decrease the learning rate and see if you get better results.
Also, first run the standard model on the standard dataset to benchmark the time taken in that case, then make your changes and proceed. See when the model starts converging in the standard case; I feel it should show some results after 40 epochs as well.
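Since the question does not show the optimizer setup, here is a sketch of the warmup learning-rate schedule from the original Transformer paper, written for tf.keras (d_model and warmup_steps are the paper's defaults, not values from the question); if the adapted model uses a fixed learning rate, this is one of the first hyperparameters worth checking:

```python
import tensorflow as tf

class TransformerSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Warmup-then-decay schedule from the original Transformer paper:
    lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)."""
    def __init__(self, d_model=512, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5)

# Adam settings as in the paper; plug the schedule in where a fixed lr was used.
optimizer = tf.keras.optimizers.Adam(
    TransformerSchedule(), beta_1=0.9, beta_2=0.98, epsilon=1e-9)
```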
I have a Keras LSTM model and want to run it on multiple GPUs for a speed improvement, but I have some questions:
1- I found that to really get good speed on the GPU I should define my network using the CuDNNLSTM layer rather than the normal LSTM layer. To use multiple GPUs, I looked at the Keras documentation and wanted to use the multi_gpu_model() function to build a distributed model. However, the sample scripts recommend defining the model on the CPU for easy weight sharing, but my CuDNNLSTM model cannot be placed on the CPU, and an LSTM model will not benefit from the enhancements provided by the GPU. What is the correct approach?
2- So I tried many configurations, including:
Group 1 (using the normal, non-fast LSTM layers): placing the model on the CPU with no copies on the GPU; placing the model on the CPU and then using multi_gpu_model to create GPU copies; placing the model on the default GPU with no copies on the other GPU; placing the model on the default GPU and then using multi_gpu_model to create two GPU copies.
Group 2 (using the CuDNNLSTM layer, and therefore no possibility of placing the model on the CPU): defining a single model (which TensorFlow places on the default GPU); using multi_gpu_model to create two GPU copies.
In all cases, data parallelism (using multi_gpu_model) resulted in slower execution. I didn't change anything else in my code, input data pipeline, or batch sizes. What am I doing wrong?
3- In general, should I only use the CuDNN-type layers to get high-speed GPU computation when programming at the high level of the Keras API?
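Not from the thread, but for reference, a sketch of the pattern the Keras documentation describes for multi_gpu_model, using plain LSTM layers so the template can live on the CPU (Keras 2.x / TF ≤ 2.3 era, where the utility still exists; the shapes are placeholders and the call requires at least two visible GPUs):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense

# Template model on the CPU so its weights live in host memory, as the
# multi_gpu_model docs recommend (plain LSTM, so CPU placement is legal).
with tf.device('/cpu:0'):
    template = keras.Sequential([
        LSTM(256, return_sequences=True, input_shape=(100, 32)),
        LSTM(256),
        Dense(1, activation='sigmoid'),
    ])

# Replicate onto 2 GPUs; each replica gets a slice of every batch.
parallel = keras.utils.multi_gpu_model(template, gpus=2)
parallel.compile(optimizer='adam', loss='binary_crossentropy')
# Train with parallel.fit(...); save weights from `template`, not `parallel`.
```

One common reason the parallel model measures slower is keeping the global batch size fixed: each replica then processes a very small batch, and the overhead of splitting inputs and merging gradients on the CPU dominates, so the global batch size is usually scaled with the number of GPUs.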
I'm trying to train an LSTM network on a corpus of text (~7M), but it's taking extremely long per epoch even though it's running on an NVIDIA Tesla P100.
My model structure is two LSTM layers with 256 units each, interspersed with Dropout, followed by a final fully connected layer. I am splitting the text into 64-character chunks.
Is there any reason for this insanely slow performance? It's almost 7.5 hours per epoch! Could it be due to the CPU-computation warnings? I didn't think those would affect GPU computation.
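Echoing the previous question: one likely factor is that the plain LSTM layer does not map onto the fused cuDNN kernel, and any ops flagged in those CPU warnings run outside the GPU entirely. A sketch of the same architecture with CuDNNLSTM, assuming a TF 1.x-era tf.keras where that layer exists (the 100-symbol vocabulary is a placeholder):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import CuDNNLSTM, Dropout, Dense

seq_len, vocab_size = 64, 100   # 64-character windows; vocabulary size is a guess

model = Sequential([
    # CuDNNLSTM uses the fused cuDNN RNN kernel; the generic LSTM layer
    # can be drastically slower on GPU.
    CuDNNLSTM(256, return_sequences=True, input_shape=(seq_len, vocab_size)),
    Dropout(0.2),
    CuDNNLSTM(256),
    Dropout(0.2),
    Dense(vocab_size, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
```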
We have several CNN-based object detection models, such as Fast R-CNN, Faster R-CNN, YOLO (You Only Look Once), and SSD (Single Shot Detector).
I have tried running Faster R-CNN using Caffe, but the backward pass is not implemented for CPU mode. Is there any CNN-based model that I can train using only the CPU?
Any help will be appreciated.
Faster R-CNN layers for CPU: https://github.com/neuleaf/faster-rcnn-cpu
SSD's original implementation already supports CPU training.
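Not mentioned in the thread, but if switching frameworks is an option: torchvision's Faster R-CNN implements both the forward and backward passes on CPU (slowly). A minimal sketch with dummy data; num_classes=2 and the box values are placeholders:

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone; runs entirely on CPU by default.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=False, num_classes=2)   # 1 object class + background
model.train()

# One dummy image and one dummy box, just to show the shape of a training step.
images = [torch.rand(3, 300, 400)]
targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            "labels": torch.tensor([1])}]

losses = model(images, targets)                 # dict of loss tensors
total = sum(loss for loss in losses.values())

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
optimizer.zero_grad()
total.backward()                                # backward pass works on CPU
optimizer.step()
```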