Today I upgraded my account to Colab Pro. Although it reports the RAM as:
Your runtime has 27.3 gigabytes of available RAM
You are using a high-RAM runtime!
when I start training my model, it gives the error below.
RuntimeError: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 15.90 GiB total capacity; 14.75 GiB already allocated; 75.75 MiB free; 14.95 GiB reserved in total by PyTorch)
Hyperparameters of my model:
args_dict = dict(
#data_dir="", # path for data files
output_dir="", # path to save the checkpoints
model_name_or_path='t5-large',
tokenizer_name_or_path='t5-large',
max_seq_length=600,
learning_rate=3e-4,
weight_decay=0.0,
adam_epsilon=1e-8,
warmup_steps=0,
train_batch_size=4,
eval_batch_size=4,
num_train_epochs=2,
gradient_accumulation_steps=16,
n_gpu=1,
early_stop_callback=False,
fp_16=True, # if you want to enable 16-bit training then install apex and set this to true
opt_level='O1', # you can find out more on optimisation levels here https://nvidia.github.io/apex/amp.html#opt-levels-and-properties
max_grad_norm=1.0, # if you enable 16-bit training then set this to a sensible value, 0.5 is a good default
seed=42,
)
Colab Pro is not providing all the RAM. My code only works if train_batch_size = 1. What causes this? Any ideas?
Note: I get the same error when I run the code on Kaggle (16 GB). So what do I actually get with Colab Pro?
Looking at your error, the 16 GB refers to the graphics card's memory, not the RAM.
As far as I know, using colab-pro enables you to use a graphics card with up to 16GB of VRAM.
You can check the VRAM amount by running the following code.
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
    print('and then re-execute this cell.')
else:
    print(gpu_info)
Maybe use a batch size smaller than 4?
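A hedged sketch of that idea, reusing the args_dict from the question (the specific numbers are assumptions, not tested values): lowering train_batch_size while raising gradient_accumulation_steps keeps the effective batch size the same but reduces peak GPU memory.
args_dict.update(
    train_batch_size=1,              # only 1 sample resident on the GPU per step (was 4)
    eval_batch_size=1,
    gradient_accumulation_steps=64,  # 1 * 64 == 4 * 16, so the effective batch is unchanged
)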
Related
I used fewer than 7,000 samples to pretrain BART on eight 3080 graphics cards, so I should have enough memory; however, it reports out of memory. Here is some code:
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_bart.py \
--num-layers 12 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 1 \
--global-batch-size 8 \
As you can see, my batch size has been set very small. The error report follows:
RuntimeError: CUDA out of memory. Tried to allocate 42.00 MiB (GPU 2; 9.78 GiB total capacity; 7.63 GiB already allocated; 4.56 MiB free; 7.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This error occurs 8 times, once per GPU. It always feels as if I'm still training with one GPU instead of eight together, so how do I solve this problem?
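One thing the error message itself suggests is setting max_split_size_mb through the PYTORCH_CUDA_ALLOC_CONF environment variable to reduce allocator fragmentation. A minimal sketch against the launch script above (the 128 MB value is an arbitrary starting point, not something recommended in this thread):
# set the allocator option before launching; it applies to every worker process
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# then run the python -m torch.distributed.launch command above unchanged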
My Google Colab crashes immediately, as soon as it starts training on tiny-imagenet (0.1 million images, 200 classes, 64*64 image size).
The Colab log shows:
WARNING:root:kernel 1fe0be22-c98a-4519-a16a-69c9fb4be1da restarted
KernelRestarter: restarting kernel (1/5), keep random ports
tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10754 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
I am using model.fit_generator with batch sizes from 32 up to 1024 and image sizes from 16 up to 64, but nothing works.
I tried a ResNet-18 architecture (1.8*10^9 params) as well as a custom model with 0.8 million params, but in vain.
I am pasting the link to my Colab in case anybody needs some other info:
https://colab.research.google.com/drive/1QG1mg1zOn6gZaaSv4rrI4F6erdxsxQ8V#scrollTo=Uy0M-VDHivOX
I am building a Keras model to run a simple image recognition task. If I do everything in raw Keras, I don't hit OOM. Strangely, however, when I do it through a mini framework I wrote, which is fairly simple and mainly there so that I can keep track of the hyperparameters and setup I used, I hit OOM. Most of the execution should be the same as running raw Keras. I am guessing I made a mistake somewhere in my code. Note that this same mini framework had no issue running on CPU on my local laptop. I think I will need to debug, but before that, does anyone have any general advice?
Here's a few lines of the errors I got:
Epoch 1/50
2018-05-18 17:40:27.435366: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-05-18 17:40:27.435906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235 pciBusID: 0000:00:04.0 totalMemory: 11.17GiB freeMemory: 504.38MiB
2018-05-18 17:40:27.435992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-05-18 17:40:27.784517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-18 17:40:27.784675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-05-18 17:40:27.784724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-05-18 17:40:27.785072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 243 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
2018-05-18 17:40:38.569609: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 36.00MiB. Current allocation summary follows.
2018-05-18 17:40:38.569702: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (256): Total Chunks: 66, Chunks in use: 66. 16.5KiB allocated for chunks. 16.5KiB in use in bin. 2.3KiB client-requested in use in bin.
2018-05-18 17:40:38.569768: I tensorflow/core/common_runtime/bfc_allocator.cc:630] Bin (512): Total Chunks: 10, Chunks in use: 10. 5.0KiB allocated for chunks. 5.0KiB in use in bin. 5.0KiB client- etc. etc
2018-05-18 17:40:38.573706: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[18432,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
This is caused by running out of GPU memory, as is clear from the warnings.
The first workaround is to allow the GPU memory to grow when possible, by creating this config proto and passing it to tf.Session():
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
Then pass this config to the session that is causing the error, like this:
tf.Session(config = config)
If this doesn't help, you can disable the GPU for the particular session that is causing the error, like this:
config = tf.ConfigProto(device_count ={'GPU': 0})
sess = tf.Session(config=config)
If you are using Keras, you can get the Keras backend and apply these configs by setting its session.
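With old TF1-style standalone Keras this could look roughly like the following sketch (assuming the TensorFlow backend; not taken verbatim from the original answer):
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True     # grab GPU memory on demand instead of all at once
K.set_session(tf.Session(config=config))   # make Keras run on this configured session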
I've run YOLO detection with a trained model using my GPU, an Nvidia 1060 3GB, and everything worked fine.
Now I am trying to generate my own model, with the parameter --gpu 1.0. TensorFlow can see my GPU, as I can read in these startup messages:
"name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705"
"totalMemory: 3.00GiB freeMemory: 2.43GiB"
Anyway, later on, when the program loads data and tries to start learning, I get the following error:
"failed to allocate 832.51M (872952320 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY"
I've checked whether it tries to use my other GPU (Intel 630), but it doesn't.
When I run the training process without the "--gpu" option, it works fine, but slowly.
(I've also tried --gpu 0.8, 0.4, etc.)
Any idea how to fix it?
Problem solved. Changing the batch size and image size in the config file didn't seem to help, as they weren't loaded correctly. I had to go into the defaults.py file and lower them there, to make it possible for my GPU to handle the steps.
It looks like your custom model uses too much memory and the graphics card cannot support it. You only need to use the --batch option to control memory usage.
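If this is darkflow's flow CLI, which the --gpu 1.0 flag suggests (an assumption on my part), the training call could look roughly like this, with a deliberately small batch:
# paths below are placeholders; --batch lowered from the default to reduce GPU memory use
flow --model cfg/my-tiny-yolo.cfg --train --annotation train/annotations --dataset train/images --gpu 0.8 --batch 4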
I have two computers with the same GPU (GTX 1080) and the same copy of the OS and software installed. But when I run my TensorFlow program (an RNN model), the speeds are very different: one is about 1.5x faster than the other.
Here are the key specs of the two:
SystemA: Asus Z170-P, i7 6700T, 32GB RAM, GTX 1080.
SystemB: Asus X99 E-WS, i7 5930K, 128GB RAM, GTX 1080. (the problem machine)
Both are installed with(using the same method):
OS: Ubuntu 16.04
GPU driver version: 378.13
Cuda version: 8.0
cuDNN version: 5.1
Tensorflow: installed via pip install tensorflow-gpu==1.0.1
Python: Anaconda 3.6
Sample code:
import tensorflow as tf
import numpy as np
from tqdm import trange
h,w = 3000, 2000
steps = 1000
x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x,t)
x0 = np.random.random(size=[h, w])
sess = tf.Session()
for i in trange(steps):
    x0 = sess.run(m, feed_dict={x: x0})
SystemA achieves 75 iter/sec while systemB only achieves 50 iter/sec; yes, the machine with the weaker specs is actually faster.
Key observations:
SystemB has far more page faults while running the program.
Monitoring the Volatile GPU-Util value from nvidia-smi (see the one-liner right after this list), systemA sits stably at about 40% while systemB is at about 30%.
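A simple way to watch that utilization continuously (the 1-second interval is an arbitrary choice):
nvidia-smi -l 1    # re-query every second and watch the Volatile GPU-Util column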
Things I have tried on systemB:
Upgrade BIOS to the latest version and reset default settings.
Call Asus customer service for help.
Swap GPU card with system A.
Change PCI-e slot to make sure it running at x16 gen3.
Inject LD_PRELOAD="/usr/lib/libtcmalloc.so" to .bashrc file.
The main differences of the output of /usr/bin/time -v are:
# The first value is for systemB and the second is for systemA.
System time (seconds): 7.28 2.95
Percent of CPU this job got: 85% 106%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.41 0:14.89
Minor (reclaiming a frame) page faults: 684695 97853
Involuntary context switches: 164 91063
File system inputs: 0 24
File system outputs: 8 0
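For reference, a comparison like this can be produced on each machine with something along these lines (the script name is a placeholder for the sample code above):
/usr/bin/time -v python matmul_benchmark.py    # matmul_benchmark.py = the sample code above (placeholder name)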
Can anybody point me to a direction of how to profile/debug this issue? Many thanks in advance!
There is a chance that you may not be using GPUs. To test this use
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
to display what devices you are using.
If you are indeed using the CPU, then you can add the following before your TensorFlow code:
with tf.device('/gpu:0'): # NEW LINE
    x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
    t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
    m = tf.matmul(x,t)
If this isn't the case, add a comment with your results and I'll follow up to see what else I can do.
According to some sources, tf.constant is a GPU memory hog. Try replacing
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
with
t = tf.Variable(np.random.random(size=[w, w]), dtype=tf.float32)
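Note that, unlike tf.constant, a tf.Variable must be initialized before the first sess.run. A minimal TF1-style sketch of how the loop from the question changes (this adjusted snippet is mine, not part of the original answer, and assumes the imports and definitions from the question's code):
t = tf.Variable(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x, t)

sess = tf.Session()
sess.run(tf.global_variables_initializer())  # required once a tf.Variable is in the graph
for i in trange(steps):
    x0 = sess.run(m, feed_dict={x: x0})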
Trying a model without data traffic between the CPU and the GPU:
import tensorflow as tf
import numpy as np
from tqdm import trange
h,w = 3000, 2000
steps = 1000
x = tf.random_normal( [h, w] , dtype=tf.float32 )
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x,t)
s = tf.reduce_mean( tf.reduce_mean( m ) )
sess = tf.Session()
for i in trange(steps):
    sess.run(s)
Results of Experimentation with Xer
After much discussion and troubleshooting, it has become apparent that the two machines are indeed different. The Nvidia cards were swapped, which resulted in no change. The machines have two different CPUs: one with a built-in graphics processor and one without, and one is a higher-end CPU than the other. I suggested that the machine with onboard graphics on the i7 have the OS's graphical windowing system disabled, to make sure the test compared unused GPU vs unused GPU. The problem persisted.
The original problem that was posted creates huge amounts of data traffic across the main bus from the CPU to the Nvidia GPU, as the throughput figures below show:
Tx Throughput : 75000 KB/s
Rx Throughput : 151000 KB/s
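(Figures like these can be read from nvidia-smi's detailed query output on drivers that report PCIe throughput, e.g.:)
nvidia-smi -q | grep -i throughput    # Tx/Rx throughput lines under the PCI section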
We experimented with changing the size of the problem (W=2000, W=1000, and W=200) and found that when W was small enough, the two machines performed nearly identically. W, though, controls not only the size of the problem on the GPU but also the amount of traffic between the CPU and the GPU.
Although we did not find a solution or an exact model, I believe that after much exploration with @Xer I can say with confidence that the two systems are not the same, and their physical differences (bus + CPU) cause the performance difference.