How do you estimate a minimum GPU requirement for your application?
This is something I’ve never been able to clarify.
My application will use a model for inference; it's a PyTorch model for OCR (basically I'm using easyocr for this) and I'll do single-image inference. I'm running some tests on Colab, calling the following functions after a single prediction:
torch.cuda.max_memory_allocated()
torch.cuda.max_memory_cached()
I always get around 2 GB (1.9 GB for the first and 2.5 GB for the second); the GPU is an NVIDIA T4. A few related questions:
Is this the right way to do it?
Can I safely assume a GPU with 4 GB of memory is my minimum requirement?
Will this still be valid on another GPU model?
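For reference, a minimal sketch of the measurement described above, assuming easyocr and a CUDA device are available ('sample.jpg' is a placeholder image; note that torch.cuda.max_memory_cached() has been renamed to torch.cuda.max_memory_reserved() in recent PyTorch releases):

import torch
import easyocr

reader = easyocr.Reader(['en'], gpu=True)   # loads the detection + recognition models onto the GPU

torch.cuda.reset_peak_memory_stats()        # start peak tracking from a clean slate
result = reader.readtext('sample.jpg')      # single-image inference

peak_allocated = torch.cuda.max_memory_allocated()  # bytes actually held by tensors at the peak
peak_reserved = torch.cuda.max_memory_reserved()    # bytes reserved by the caching allocator
print(f"peak allocated: {peak_allocated / 1024**3:.2f} GiB")
print(f"peak reserved:  {peak_reserved / 1024**3:.2f} GiB")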
Related
Is there a way to manage how much memory PyTorch reserves? I've been trying to fine-tune a BERT model from Hugging Face, and even on an NVIDIA A100 from Paperspace with 80 GiB of GPU memory I still run into "CUDA out of memory". I'm not training from scratch or anything, and my dataset has fewer than 250,000 samples. I don't know if I'm missing something or if there is a way to reduce the memory PyTorch allocates.
(Seriously, I just wanna cry now.)
Error specs
I've tried reducing my batch size for both the training and eval datasets, and turning on fp16=True. I even tried gradient accumulation, but nothing seems to work. I've been cleaning up with gc.collect() and torch.cuda.empty_cache() and restarting my kernel, and I even tried a smaller model.
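For context, a hedged sketch of how the settings mentioned above (smaller batches, fp16, gradient accumulation) are typically expressed with the Hugging Face Trainer; the values are illustrative only:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",                  # placeholder output directory
    per_device_train_batch_size=4,     # smaller batches shrink the activation footprint
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,     # effective batch size of 32 without the memory cost
    fp16=True,                         # half-precision training
)
# Pass training_args to Trainer(model=..., args=training_args, ...) as usual.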
The docs (see also this) for autocast in PyTorch only discuss training. Does it speed things up if I also use autocast for inference?
Yes, it could (though it may not in some cases).
You are processing data at lower precision (e.g. float16 vs. float32).
Your program has to read and process less data in this case.
This might help with cache locality and hardware-specific acceleration (e.g. tensor cores when using CUDA).
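A minimal sketch of autocast at inference time, using an arbitrary torchvision model as a stand-in and assuming a CUDA device:

import torch
import torchvision

model = torchvision.models.resnet18(weights=None).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)   # matmuls and convolutions run in float16 where it is considered safe

print(y.dtype)     # typically torch.float16 under autocast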
I have a CNN with 2 hidden layers. When I use Keras on a CPU with 8 GB of RAM, I sometimes get a "Memory Error", and sometimes the precision for one class is 0 while other classes are at 1.00. If I use Keras on a GPU, will it solve my problem?
You probably don't have enough memory to fit all the images in RAM during training. Using a GPU will only help if it has more memory. If this is happening because you have too many images or their resolution is too high, you can try using Keras' ImageDataGenerator and any of the flow methods to feed your data in batches.
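A hedged sketch of that approach, assuming images arranged in one subdirectory per class under a hypothetical "data/train" folder:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    "data/train",              # hypothetical path: one subdirectory per class
    target_size=(128, 128),    # downscaling also reduces the memory used per batch
    batch_size=32,             # only one batch is held in RAM at a time
    class_mode="categorical",
    subset="training",
)
val_gen = datagen.flow_from_directory(
    "data/train", target_size=(128, 128), batch_size=32,
    class_mode="categorical", subset="validation",
)
# model.fit(train_gen, validation_data=val_gen, epochs=10)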
Theoretical question here. I understand that when dealing with datasets that cannot fit into memory on a single machine, Spark + EMR is a great way to go.
However, I would also like to use TensorFlow instead of Spark's MLlib algorithms to perform deep learning on these large datasets.
From my research I see that I could potentially use a combination of pyspark, elephas and EMR to achieve this. Alternatively, there are BigDL and sparkdl.
Am I going about this the wrong way? What is best practice for deep learning on data that cannot fit into memory? Should I use online learning or batch training instead? This post seems to say that "most high-performance deep learning implementations are single-node only"
Any help to point me in the right direction would be greatly appreciated.
In TensorFlow, you can use tf.data.Dataset.from_generator so you can generate your dataset at runtime without any storage hassles.
See link for example https://www.codespeedy.com/what-is-tf-data-dataset-from_generator-in-tensorflow/
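A minimal sketch of tf.data.Dataset.from_generator, with a placeholder generator standing in for code that would stream samples lazily from disk, S3, or HDFS instead of holding the whole dataset in memory:

import numpy as np
import tensorflow as tf

def sample_generator():
    # Placeholder: in practice, stream records from your storage one at a time.
    for _ in range(1000):
        features = np.random.rand(32).astype(np.float32)
        label = int(np.random.randint(0, 2))
        yield features, label

dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_signature=(
        tf.TensorSpec(shape=(32,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
).batch(64).prefetch(tf.data.AUTOTUNE)
# model.fit(dataset, epochs=5)  # only one batch at a time ever lives in memory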
As you mention "fitting massive dataset to memory", I understand that you are trying to load all the data into memory at once and then start training, so I'll answer based on that assumption.
The general approach is that if you cannot fit the data into your resources, you divide the data into smaller chunks and train iteratively (see the sketch after this list).
1- Load data one chunk at a time instead of trying to load it all at once. If you create an execution workflow of "Load Data -> Train -> Release Data (this can be done automatically by the garbage collector) -> Repeat", you can measure how much memory is needed to train on a single sample.
2- Use mini-batches. Once you have the resource figure from #1, a simple calculation gives you an estimate of the mini-batch size. For example, if training on a single sample consumes 1.5 GB of RAM and your GPU has 8 GB, you can theoretically train with mini-batches of size 5.
3- If the resources are not enough to train even a batch of size 1, you may consider increasing your machine's capacity or reducing your model's capacity / layers / features. Alternatively, you can go for cloud computing solutions.
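A framework-agnostic sketch of that workflow; load_chunk and train_on_batch are hypothetical placeholders for whatever I/O and training-step calls your stack provides:

def load_chunk(path):
    """Hypothetical: read one chunk of the dataset from disk."""
    raise NotImplementedError

def train_on_batch(model, batch):
    """Hypothetical: run one optimization step on a mini-batch."""
    raise NotImplementedError

def train_in_chunks(model, chunk_paths, batch_size):
    for path in chunk_paths:
        chunk = load_chunk(path)              # 1) load only what fits in memory
        for start in range(0, len(chunk), batch_size):
            batch = chunk[start:start + batch_size]
            train_on_batch(model, batch)      # 2) mini-batch training step
        del chunk                             # 3) release the chunk before loading the next one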
I have a CNN model. Requests to use this model, for example to classify a picture, arrive once per second.
I would like to collect the requests as new unsupervised data and keep training my model.
My question is: how can I handle the training task and the classification task effectively?
I will explain why it becomes a problem:
Every training step takes a long time, at least several seconds, on the GPU and is not interruptible. So if my classification tasks also use the GPU, I cannot respond to the requests in time. I would like to run the classification tasks on the CPU, but it looks like Theano does not support two different config.device settings in one process.
Multi-processing is not acceptable, because my memory is limited and Theano costs too much.
Any help or advice would be appreciated.
You could build two separate copies of the same CNN, one on the CPU and one on the GPU. I think this could be done under either the old GPU backend or the new one, but in different ways. Some ideas:
Under the old backend:
Load Theano with device=cpu. Build your inference function and compile it. Then call theano.sandbox.cuda.use('gpu'), build a new copy of your inference function, and take gradients of that one to make your training functions. Now the inference function will execute on the CPU, and the training will happen on the GPU. (I've never done this on purpose, but I've had it happen by accident!)
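A rough sketch of that old-backend approach, assuming the process was started with THEANO_FLAGS="device=cpu,floatX=float32" and old-backend CUDA support is installed; the tiny linear model is only a placeholder:

import numpy as np
import theano
import theano.sandbox.cuda
import theano.tensor as T

# Compiled while device=cpu, so this function stays on the CPU.
x = T.matrix('x')
w_cpu = theano.shared(np.ones((3, 2), dtype='float32'), name='w_cpu')
infer_cpu = theano.function([x], T.dot(x, w_cpu))

# Everything compiled after this call targets the GPU.
theano.sandbox.cuda.use('gpu')

w_gpu = theano.shared(np.ones((3, 2), dtype='float32'), name='w_gpu')
loss = T.dot(x, w_gpu).sum()
grad = T.grad(loss, w_gpu)
train_gpu = theano.function([x], loss, updates=[(w_gpu, w_gpu - 0.01 * grad)])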
Under the new backend:
As far as I know, you have to tell Theano about any GPUs right when importing, not later. In this case, you could use THEANO_FLAGS="contexts=dev0->cuda0", which doesn't force using one device over another. Then build the inference version of your function like normal, and for the training version, again put all the shared variables on the GPU, and the input variables to any of your training functions should also be GPU variables (e.g. input_var_1.transfer('dev0')). When all your functions are compiled, look at the programs using theano.printing.debugprint(function) to see what's on GPU vs CPU. (When compiling the CPU functions, it might give a warning that it cannot infer the context, and as far as I've seen, that lands it on the CPU...not sure if this behavior is safe to depend on.)
In either case, this depends on your GPU-based functions NOT returning anything to the CPU (make sure the output variables are GPU ones). That should allow the training function to run concurrently with your inference function; you then grab what you need back to the CPU afterwards. For example, after a training step, just copy the new values over to your inference network's parameters.
Let us hear what you come up with!