Data collator not set in trainer class? - pytorch

I am training a language model with a Hugging Face model. I am using a RoBERTa model and I am running into a problem during training. This is how I create the Trainer, using a DataCollatorForLanguageModeling as data_collator:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    # prediction_loss_only=True,
)
However, when I inspect trainer.get_train_dataloader().collate_fn, it is a RemoveColumnsCollator. I think this is the reason why the training is not working.

I found out that RemoveColumnsCollator is just a wrapper class around the data collator passed as an argument: it drops dataset columns that the model's forward() does not accept and then delegates to the wrapped collator. The original collator is still reachable via:
trainer.get_train_dataloader().collate_fn.data_collator
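As a quick sanity check (a sketch using the names from the question), you can confirm the wrapper still delegates to the collator you passed in:

from transformers import DataCollatorForLanguageModeling

collate_fn = trainer.get_train_dataloader().collate_fn
print(type(collate_fn).__name__)  # RemoveColumnsCollator
# the wrapper holds the collator that was passed to the Trainer
print(isinstance(collate_fn.data_collator, DataCollatorForLanguageModeling))  # True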

Related

Difference between Simple transformers NERModel pipeline and Transformers FromPreTrained Pipeline

I want to fine-tune a BERT NER model on my dataset with custom labels, and I can't understand the difference between using the Simple Transformers NERModel (https://simpletransformers.ai/docs/ner-model/), where I can easily specify which model I want to train on my dataset, and using Transformers' from_pretrained (https://huggingface.co/docs/transformers/model_doc/auto).
1- Simple Transformers NER Model:
model = NERModel('bert', 'd4data/biomedical-ner-all', labels=custom_labels, args=train_args, use_cuda=False)
model.train_model(train_data, eval_data=dev_data)
result, model_outputs, preds_list = model.eval_model(test_data)
model = NERModel('bert', 'outputs/best_model', labels=custom_labels, args=train_args,use_cuda=False)
2- From Pre Trained
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("d4data/biomedical-ner-all")
model = AutoModelForTokenClassification.from_pretrained("d4data/biomedical-ner-all")
I tried both approaches, but I really don't understand the core difference between them. Can someone please explain when each approach should be used?
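For context, here is a hedged sketch of what the plain Transformers route involves; Simple Transformers' NERModel essentially bundles this wiring (plus tokenization and label alignment) behind train_model()/eval_model(). It assumes train_dataset/eval_dataset are already tokenized with token-level labels aligned to the word pieces, and reuses custom_labels from the question:

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("d4data/biomedical-ner-all")
model = AutoModelForTokenClassification.from_pretrained(
    "d4data/biomedical-ner-all",
    num_labels=len(custom_labels),
    ignore_mismatched_sizes=True,  # swap in a fresh head when the label set differs
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs"),
    train_dataset=train_dataset,  # assumed: tokenized, with aligned `labels`
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()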

Loading a GPU trained BERTopic model on CPU?

I trained a BERTopic model on a GPU, and now for visualization purposes I want to load it on a CPU.
But when I tried to do that I got:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
When I tried the suggested fix, I got the same error:
topic_model = torch.load(args.model, map_location=torch.device('cpu'))
I saw a fix that suggests saving the model without its embedding model, but I don't want to retrain and resave unless it's the last option. I would also love it if someone could explain what this embedding model is and what's going on under the hood.
When you want to save the BERTopic model without the embedding model, you can run the following:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# Train the model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)
# Save the model without the embedding model
topic_model.save("my_model", save_embedding_model=False)
This should prevent any issues with GPU/CPU if you are not using any of the cuML sub-models in BERTopic.
As for what the embedding model is and what's going on under the hood:
The embedding model is typically a pre-trained model that does not actually learn from the input data. There are options to make it learn during training, but that requires a custom component in BERTopic. In other words, when you use a pre-trained embedding model, there is no problem removing it when saving the topic model, as there is no need to re-train it.
Concretely, we would first save our topic model in our GPU environment without the embedding model:
topic_model.save("my_model", save_embedding_model=False)
Then, in our CPU environment, we load the saved BERTopic model and pass in the pre-trained embedding model:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic.load("my_model", embedding_model=embedding_model)
You can learn more about the role of the embedding model here.
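Once loaded this way, the usual visualization calls run fine on CPU; for example (names as in the snippets above):

fig = topic_model.visualize_topics()  # returns a Plotly figure
fig.write_html("topics.html")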

alternative for maximum entropy for NLP model

I am working on an NLP model that identifies the ARGs given the PREDICATE.
I am using MaxEnt for the model.
My model works fine: I trained it on a train dataset, created all the features, and then tested it on a test dataset.
I wanted to try this with some other package, not MaxEnt.
Can someone suggest what else I can use?
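One option, as a hedged sketch: scikit-learn's LogisticRegression (multinomial logistic regression is the same model family as MaxEnt) fed through a DictVectorizer. Here train_features (a list of feature dicts, one per candidate), train_labels, and test_features are hypothetical stand-ins for the features the question already extracts:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# train_features: list of feature dicts; train_labels: the ARG tags
clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_features, train_labels)
predictions = clf.predict(test_features)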

What are differences between AutoModelForSequenceClassification vs AutoModel

We can create a model with the AutoModel (TFAutoModel) class:
from transformers import AutoModel
model = AutoModel.from_pretrained('distilbert-base-uncased')
On the other hand, a model can be created with AutoModelForSequenceClassification (TFAutoModelForSequenceClassification):
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')
As far as I know, both load weights from the distilbert-base-uncased checkpoint.
From the class names, the second one (AutoModelForSequenceClassification) is meant for sequence classification.
But what are the real differences between the two classes? And how do I use them correctly?
(I searched the Hugging Face docs but it is not clear.)
The difference between AutoModel and AutoModelForSequenceClassification is that AutoModelForSequenceClassification has a classification head on top of the base model's outputs, and that head can easily be trained together with the base model. AutoModel, by contrast, returns only the raw hidden states.
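A small sketch makes the difference visible: AutoModel returns hidden states, while AutoModelForSequenceClassification returns classification logits (its head is freshly initialized until you fine-tune it):

from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
inputs = tokenizer('This movie was great!', return_tensors='pt')

base = AutoModel.from_pretrained('distilbert-base-uncased')
clf = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

with torch.no_grad():
    print(base(**inputs).last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
    print(clf(**inputs).logits.shape)              # torch.Size([1, 2])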

How to view the changes in a huggingface model after training?

I trained a BART model (facebook-cnn) for summarization and compared its summaries with those of the pretrained model:
model_before_tuning_1 = AutoModelForSeq2SeqLM.from_pretrained(model_name)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=validation_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
Summaries from model() and model_before_tuning_1() are different, but when I compare the model configs and/or print(model), both give exactly the same output.
How can I tell which parameters this training actually changed?
You can compare the state_dicts of the two models, i.e. model.state_dict() and model_before_tuning_1.state_dict().
The state_dict contains the learnable parameters, which are what change during training. For further details see: https://pytorch.org/tutorials/recipes/recipes/what_is_state_dict.html
Printing the models or their configs gives you identical results because the architecture does not change during training; only the parameter values do.
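A minimal sketch of that comparison, using the variable names from the question:

import torch

before = model_before_tuning_1.state_dict()
after = model.state_dict()

# list every tensor whose values changed during fine-tuning
for name in after:
    if not torch.equal(before[name], after[name]):
        max_diff = (after[name] - before[name]).abs().max().item()
        print(f'{name}: max abs change = {max_diff:.6f}')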
