I've been struggling to train Google's Big Bird model using the Hugging Face transformers library due to out-of-memory errors. I have two Tesla V100 GPUs with 32 GB of RAM each. I'm trying to train the google/bigbird-roberta-base model (https://huggingface.co/google/bigbird-roberta-base) on Spider (a natural-language-to-SQL dataset) using the Hugging Face Trainer API. I'm using a batch size of 1 and the smallest version of this model, and I still get OOM errors. According to the Big Bird paper (https://arxiv.org/abs/2007.14062), Big Bird can be trained on chips with 16 GB of memory, so I'm not sure why I'm running into OOM. Has anyone encountered trouble training Big Bird due to memory problems?
Here's the code that does the training:
rouge = datasets.load_metric("rouge")

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    output_dir="./",
    logging_steps=2,
    save_steps=400,
    eval_steps=4,
)
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)

trainer.train()
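For completeness, the two memory-saving knobs I'm aware of but haven't verified for this setup are mixed precision and gradient checkpointing. A sketch of what that would look like with the same arguments (assuming a transformers version where gradient_checkpointing is a TrainingArguments field; otherwise model.gradient_checkpointing_enable() should be equivalent):

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    output_dir="./",
    logging_steps=2,
    save_steps=400,
    eval_steps=4,
    fp16=True,                    # mixed precision: roughly halves activation memory
    gradient_checkpointing=True,  # recompute activations in the backward pass instead of storing them
)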
Here's the exact error I get:
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 31.75 GiB total capacity; 25.14 GiB already allocated; 21.50 MiB free; 26.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
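For reference, the allocator hint from that message can be applied by setting the environment variable before the process touches CUDA; I haven't confirmed it helps in my case, and the 128 value is just an example:

import os

# Must be set before the first CUDA allocation (or exported in the shell instead)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"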
Thanks so much for sharing any experience you have with this!
I had the same issue with an A6000, which has 48 GB of RAM; after dropping the batch size from 128 to 1, the OOM no longer occurred. I suspect the model is bigger than stated.
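For example, with the Seq2SeqTrainingArguments from the question that amounts to something like this (the accumulation value is only an illustrative way to keep a larger effective batch, not something I tuned):

training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    predict_with_generate=True,
    per_device_train_batch_size=1,   # smallest possible per-step memory footprint
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,   # illustrative: recovers an effective batch of 8 per device
)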
I suspect this is a problem with the Hugging Face implementation. There is no OOM issue with the original code at https://github.com/google-research/bigbird.
Related
I am building a Hugging Face Longformer-based classifier. My main code is below:
model = LongformerForSequenceClassification.from_pretrained(
    '/mnt/longformer_official/',
    gradient_checkpointing=False,
    attention_window=512)
tokenizer = LongformerTokenizerFast.from_pretrained('/mnt/longformer_official/', max_length=4000)

train_df_tuning_dataset_tokenized = train_df_tuning_dataset.map(
    tokenization, batched=True, batch_size=len(train_df_tuning_dataset))
training_args = TrainingArguments(
    output_dir="xyz",
    num_train_epochs=5,
    per_device_train_batch_size=4,   # from the Hugging Face example notebook
    gradient_accumulation_steps=16,  # added back; without it I run into memory issues
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    greater_is_better=False,
    disable_tqdm=False,
    weight_decay=0.01,
    optim="adamw_torch",
    run_name='longformer-classification-16March2022',
)
# class weights
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # custom weighted loss (two labels with different weights)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 0.5243])).to(device)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1)).to(device)
        return (loss, outputs) if return_outputs else loss
trainer = CustomTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_df_tuning_dataset_tokenized,
    eval_dataset=val_dataset_tokenized,
)
With max_length=1500 in the tokenizer, the code runs fine. It fails with max_length=4000.
I even tried setting these parameters to:
per_device_train_batch_size = 1, gradient_accumulation_steps = 1, per_device_eval_batch_size = 1
My questions:
1. Is it okay to set per_device_train_batch_size = 1, gradient_accumulation_steps = 1, per_device_eval_batch_size = 1?
2. The error I get is below. Is there any way around this other than getting more memory?
RuntimeError: CUDA out of memory. Tried to allocate 720.00 MiB (GPU 0; 14.76 GiB total capacity; 12.77 GiB already allocated; 111.75 MiB free; 13.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
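One thing I have not tried yet is flipping the gradient_checkpointing flag I already pass to from_pretrained, together with fp16 in TrainingArguments. A rough sketch of that, untested on my setup and assuming my transformers version supports these options here:

# Untested sketch: trade compute for memory with activation checkpointing + mixed precision
model = LongformerForSequenceClassification.from_pretrained(
    '/mnt/longformer_official/',
    gradient_checkpointing=True,     # recompute activations during backward instead of storing them
    attention_window=512)

training_args = TrainingArguments(
    output_dir="xyz",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # illustrative: keeps a larger effective batch
    per_device_eval_batch_size=1,
    fp16=True,                       # mixed precision roughly halves activation memory
)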
Try setting
gradient_accumulation_steps = int(math.ceil(len(tr_inputs) / per_device_train_batch_size) / 1) * epochs
since gradient_accumulation_steps should be derived from the number of epochs and the batch size.
I was doing some work where I wanted to generate 10,000 sentences from the GPT-Neo model. I have a 40 GB GPU and run the model on it, but the code runs out of memory every time. Is there a limit to the number of sentences I can generate? Below is a small snippet of my code.
tokenizer = GPT2Tokenizer.from_pretrained(model)
model = GPTNeoForCausalLM.from_pretrained(model, pad_token_id=tokenizer.eos_token_id)
model.to(device)

# keep the inputs on the same device as the model
input_ids = tokenizer.encode(sentence, return_tensors='pt').to(device)

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    top_k=50,
    num_return_sequences=10000,
)
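The only workaround I can think of is to generate the 10,000 samples in smaller chunks instead of a single call, moving each batch off the GPU before the next one. A sketch of that idea (the chunk size and max_length are arbitrary choices on my part):

import torch

chunk_size = 100                     # illustrative; 100 calls of 100 sequences = 10,000
gen_texts = []
for _ in range(10000 // chunk_size):
    with torch.no_grad():
        out = model.generate(
            input_ids,
            do_sample=True,
            top_k=50,
            max_length=64,           # illustrative cap on sequence length
            num_return_sequences=chunk_size,
        )
    # decode on the CPU so the GPU tensors can be freed before the next chunk
    gen_texts.extend(tokenizer.batch_decode(out.cpu(), skip_special_tokens=True))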
My PyTorch code allocates the same amount of memory on each GPU, even when it doesn't max out a device's memory. I can control the per-device batch size, which determines how much memory is allocated per device. But can I allocate the memory progressively, so that the first GPU is maxed out, then the second, and so on?
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]='1,2,3,4'
max_length = 64
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2).to("cuda")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
training_args = TrainingArguments(
per_device_train_batch_size=64, # batch size per device during training
...
)
trainer = Trainer(
args=training_args, # training arguments, defined above
...
)
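For context, the only control I currently have is which devices are visible at all; e.g. restricting everything to a single GPU is just the line below. What I'd like instead is to fill the first GPU before spilling onto the next.

# Today's workaround: expose a single GPU so all memory is allocated on it
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = '1'   # one id instead of '1,2,3,4'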
I am training an AlexNet CNN on a fairly large dataset (DeepFashion) containing around 300,000 images. I resize the images to 96x96x3 and am using an NVIDIA Tesla K80 GPU with 4 vCPUs and 15 GB of memory on GCP (Google Cloud). Training runs really well for about 500 iterations, but then the speed drops dramatically: GPU utilization starts at around 50% and then falls to 0-7%. I really don't know what might be causing this.
Test set: 209,222 images
Val set: 40,000 images
Train set: 40,000 images
Below is my code snippet; does anyone have any idea what could be causing this dramatic drop in computation speed and how I can fix it?
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(dataset_path, 'train'),
    label_mode='categorical',
    seed=seed,
    image_size=(96, 96),
    batch_size=64)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(dataset_path, 'val'),
    label_mode='categorical',
    seed=seed,
    image_size=(96, 96),
    batch_size=64)

test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(dataset_path, 'test'),
    label_mode='categorical',
    seed=seed,
    image_size=(96, 96),
    batch_size=64)

AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
model = get_uncompiled_AlexNet(img_width=96, img_height=96, img_channel=3, num_classes=46)

model_optimizer = optimizers.Adam(learning_rate=learning_rate)
model_metrics_top_3 = tf.keras.metrics.TopKCategoricalAccuracy(k=3, name='top_3_accuracy')
model_metrics_top_5 = tf.keras.metrics.TopKCategoricalAccuracy(k=5, name='top_5_accuracy')

model.compile(
    loss="categorical_crossentropy",
    optimizer=model_optimizer,
    metrics=["accuracy", model_metrics_top_3, model_metrics_top_5])

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10)
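One thing I'm unsure about is the in-memory cache(): as far as I know, cache() without a filename keeps every decoded image in RAM, so the cache grows over the first epoch. A file-backed cache would be one way to test that theory, i.e. replacing the three cache()/prefetch() lines above with something like this (the cache paths are placeholders I made up):

# Untested variant: cache decoded images to disk instead of RAM
train_ds = train_ds.cache('/tmp/tf_cache_train').prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache('/tmp/tf_cache_val').prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache('/tmp/tf_cache_test').prefetch(buffer_size=AUTOTUNE)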
I am trying to train a model on the Cityscapes dataset for segmentation. I use the torchvision deeplabv3_resnet50 model and its Cityscapes dataset class and transforms. In case it matters, I am running the code in a Jupyter notebook.
The datasets and dataloaders are working. When I attempt to train, I always get this error at the point where the first batch is put through the network (y_ = net(xb) in the one_epoch function):
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 6.00 GiB total capacity; 4.20 GiB already allocated; 6.87 MiB free; 4.20 GiB reserved in total by PyTorch)
What is strange is that no matter what the batch size (bs) is, the amount of memory reported as free in the error is a little less than the amount being requested; e.g., for bs=16 I get:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 6.00 GiB total capacity; 2.90 GiB already allocated; 1.70 GiB free; 2.92 GiB reserved in total by PyTorch)
I have a much more complicated model running that works with bs=16; that model builds everything from scratch. But I really want to be able to use the simplicity that torchvision seems to offer with its model zoo and datasets.
My code is below, not much more than the bare essentials, enough to show whether it is running OK on the GPU.
def one_epoch(net, loss, dl, opt=None, metric=None):
    if opt:
        net.train()  # only affects some layers
    else:
        net.eval()
        rq_stored = []
        for p in net.parameters():
            rq_stored.append(p.requires_grad)
            p.requires_grad = False

    L, M = [], []
    dl_it = iter(dl)
    for xb, yb in tqdm(dl_it, leave=False):
        xb, yb = xb.cuda(), yb.cuda()
        y_ = net(xb)
        l = loss(y_, yb)
        if opt:
            opt.zero_grad()
            l.backward()
            opt.step()
        L.append(l.detach().cpu().numpy())
        if metric: M.append(metric(y_, yb).cpu().numpy())

    if not opt:
        for p, rq in zip(net.parameters(), rq_stored): p.requires_grad = rq

    return L, M
accuracy = lambda y_, yb: (y_.max(dim=1)[1] == yb).float().mean()

def fit(net, tr_dl, val_dl, loss=nn.CrossEntropyLoss(), epochs=3, lr=3e-3, wd=1e-3):
    opt = optim.Adam(net.parameters(), lr=lr, weight_decay=wd)
    Ltr_hist, Lval_hist = [], []
    for epoch in trange(epochs):
        Ltr, _ = one_epoch(net, loss, tr_dl, opt)
        Lval, Aval = one_epoch(net, loss, val_dl, None, accuracy)
        Ltr_hist.append(np.mean(Ltr))
        Lval_hist.append(np.mean(Lval))
        print(f'epoch: {epoch+1}\ttraining loss: {np.mean(Ltr):0.4f}\tvalidation loss: {np.mean(Lval):0.4f}\tvalidation accuracy: {np.mean(Aval):0.2f}')
    return Ltr_hist, Lval_hist
class To3ch(object):
    def __call__(self, pic):
        if pic.shape[0] == 1:
            pic = pic.repeat(3, 1, 1)
        return pic

bs = 1
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

transf = transforms.Compose([
    transforms.ToTensor(),
    To3ch(),
    transforms.Normalize(*imagenet_stats)
])

train_ds = datasets.Cityscapes('C:/cityscapes_ds', split='train', target_type='semantic',
                               transform=transf, target_transform=transf)
val_ds = datasets.Cityscapes('C:/cityscapes_ds', split='val', target_type='semantic',
                             transform=transf, target_transform=transf)

train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True, num_workers=0)
val_dl = DataLoader(val_ds, batch_size=2*bs, shuffle=False, num_workers=0)

net = models.segmentation.deeplabv3_resnet50(num_classes=20)

fit(net.cuda(), train_dl, val_dl, loss=nn.CrossEntropyLoss(), epochs=1, lr=1e-4, wd=1e-4)
You didn't specify, but if you're using the original Cityscapes, this OOM is completely expected.
The original Cityscapes dataset has large images (something like 1024x2048, IIRC), and it looks like you have a 6GB GPU. FYI, I cannot fit batch_size=2 in a 12GB GPU with inputs of this size.
When training DeepLab models, it is common to apply transformations on the input (e.g., random crops, resize, scaling, etc.), and it looks like you don't apply any.
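For example, even just downscaling the inputs changes the memory requirements a lot. A rough sketch with your transform setup (the target size is illustrative, a recent torchvision is assumed, and the mask needs nearest-neighbour interpolation plus no 0-1 rescaling so the class ids survive):

from torchvision import transforms
from torchvision.transforms import InterpolationMode

size = (512, 1024)  # illustrative: half of the original 1024x2048 resolution

img_transf = transforms.Compose([
    transforms.Resize(size),                 # default bilinear interpolation for images
    transforms.ToTensor(),
    To3ch(),
    transforms.Normalize(*imagenet_stats),
])

target_transf = transforms.Compose([
    transforms.Resize(size, interpolation=InterpolationMode.NEAREST),  # keep label ids intact
    transforms.PILToTensor(),                # unlike ToTensor, does not rescale to [0, 1]
])

train_ds = datasets.Cityscapes('C:/cityscapes_ds', split='train', target_type='semantic',
                               transform=img_transf, target_transform=target_transf)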
When you say:
I have a much more complicated model running, that will work with bs=16.
Perhaps you're looking at a different kind of complexity, something that has less impact on memory requirements than you think.