Accelerate BERT training with HuggingFace Model Parallelism - pytorch

I am currently using SageMaker to train BERT and am trying to improve the training time. I use PyTorch and Hugging Face on an AWS g4dn.12xlarge instance.
However, when I run parallel training, the speedup is far from linear. I'm looking for some hints on distributed training to improve the BERT training time in SageMaker.

You can use SageMaker Distributed Data Parallel (SMDDP) to run training on a multi-node, multi-GPU setup. Please refer to the links below for BERT-based training examples:
https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/bert/pytorch_smdataparallel_bert_demo.ipynb
This one uses Hugging Face: https://github.com/aruncs2005/pytorch-ddp-sm-example
Please refer to the documentation here for step-by-step instructions:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
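Below is a minimal sketch of the kind of change the linked guide describes: registering the SMDDP backend in an otherwise ordinary PyTorch/Hugging Face training script. The model name, batch size, and `train_dataset` are placeholders I made up, not taken from the linked examples.

```python
import os
import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from transformers import BertForSequenceClassification

# Initialize the process group with the SMDDP backend instead of "nccl".
dist.init_process_group(backend="smddp")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased").cuda()
model = DDP(model, device_ids=[local_rank])

# Shard the dataset so each GPU sees a distinct slice of the data.
sampler = DistributedSampler(train_dataset,                 # placeholder dataset
                             num_replicas=dist.get_world_size(),
                             rank=dist.get_rank())
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```

When launching through the SageMaker Python SDK, the estimator is configured with `distribution={"smdistributed": {"dataparallel": {"enabled": True}}}` so that SageMaker starts one process per GPU across the nodes; the details are in the documentation linked above.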

Related

BERT pre-training from scratch on custom text data

I want to pre-train BERT from scratch using the Hugging Face library. Originally, BERT was pre-trained on two tasks: MLM and NSP. I am successful in training it for MLM, but I have been running into issues for weeks now:
Token indices sequence length is longer than the specified maximum sequence length for this model (544 > 512). Running this sequence through the model will result in indexing errors
Can anyone help me pre-train BERT on those two tasks?
I have tried the run_mlm.py script from Hugging Face, which only trains on MLM. I have also tried the original BERT code, but it is not very intuitive to me, so I am sticking with the Hugging Face library.
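For what it's worth, here is one way the two objectives can be wired together for a single training step with `BertForPreTraining`, truncating to the 512-token limit (which is what triggers the error quoted above). The sentence pair and hyperparameters are made up for illustration.

```python
import torch
from transformers import (BertTokenizerFast, BertForPreTraining,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Sentence pair for NSP; truncation keeps us within the 512-token limit.
enc = tokenizer("First sentence from my corpus.",
                "A candidate next sentence.",
                truncation=True, max_length=512)

# The collator randomly masks 15% of the tokens and builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator([enc])

outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                token_type_ids=batch["token_type_ids"],
                labels=batch["labels"],                  # MLM labels
                next_sentence_label=torch.tensor([0]))   # 0 = B really follows A
outputs.loss.backward()   # combined MLM + NSP loss
```

Building the NSP sentence pairs from a raw corpus is the tedious part; older versions of transformers also ship a TextDatasetForNextSentencePrediction helper for that, though it may be marked deprecated depending on your version.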

Scaling out sklearn models / xgboost

I wonder how, or whether, it is possible to train sklearn models / XGBoost on a large dataset.
If I use a dataframe that contains several gigabytes, the machine crashes during training.
Can you assist me, please?
The scikit-learn documentation has an in-depth discussion of different strategies for scaling models to bigger data.
Strategies include:
Streaming instances
Extracting features
Incremental learning (see also the partial_fit entry in the glossary, and the sketch below)
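As a rough illustration of the incremental-learning strategy, the sketch below streams a large CSV in chunks and updates a linear model with partial_fit, so the full dataframe never has to fit in memory. The file name and column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])   # partial_fit needs every class declared up front
clf = SGDClassifier()        # linear model trained with stochastic gradient descent

# Stream the file in manageable chunks instead of loading it all at once.
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)   # update the model chunk by chunk
```

XGBoost is a separate case: it has its own out-of-core options (e.g. external-memory DMatrix construction), which are covered in its own documentation rather than in scikit-learn's scaling guide.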

BERT weight calculation

I am trying to understand the BERT weight calculation. Please suggest some articles that can help me understand the internal workings of BERT. I have read these articles from Medium:
https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1
I am doing a small project to understand BERT pre-training and fine-tuning on different sources. My idea is to calculate the weights of each token in its own source and average all the weights to get a global model. This global model can then be fine-tuned on different sources.
How can I find these weights, and how can I average them across multiple sources?
Can I visualise them? If so, how?
Also, note that I am using the TensorFlow version of the BERT implementation and planning to fine-tune for an NER task.
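If by "weights" you mean the model parameters, one naive way to build a global model from several per-source models is simple parameter averaging (a FedAvg-style technique, not something specific to BERT). The sketch below assumes every checkpoint shares exactly the same architecture; the checkpoint paths are placeholders.

```python
import numpy as np
from transformers import TFBertModel

# Per-source fine-tuned checkpoints (placeholder paths).
checkpoints = ["bert-source-a", "bert-source-b", "bert-source-c"]
models = [TFBertModel.from_pretrained(p) for p in checkpoints]

# get_weights() returns the parameters as a list of NumPy arrays in a fixed
# layer order, so we can average position-wise across the models.
all_weights = [m.get_weights() for m in models]
avg_weights = [np.mean(stack, axis=0) for stack in zip(*all_weights)]

global_model = TFBertModel.from_pretrained(checkpoints[0])
global_model.set_weights(avg_weights)
global_model.save_pretrained("bert-global-avg")   # fine-tune this for NER later
```

For visualising attention weights rather than parameters, the bertviz library is a common starting point and is in the same spirit as the articles linked above.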

Unsupervised finetuning of BERT for embeddings only?

I would like to fine-tune BERT on unlabeled data from a specific domain and use the output layer to check the similarity between texts. How can I do it? Do I need to first fine-tune on a classification task (or question answering, etc.) and then get the embeddings? Or can I just take a pre-trained BERT model without a task and fine-tune it with my own data?
There is no need to fine-tune for classification, especially if you do not have any supervised classification dataset.
You should continue training BERT the same unsupervised way it was originally trained, i.e., continue "pre-training" with the masked-language-model objective and next sentence prediction. Hugging Face's implementation contains the class BertForPreTraining for this.
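As a sketch of that suggestion: if you only need embeddings, a common simplification is to continue pre-training with just the MLM objective (this is what run_mlm.py does); use BertForPreTraining instead if you also want the NSP head. The corpus file name and training arguments below are placeholders.

```python
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Unlabeled in-domain text, one passage per line (placeholder file name).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
trainer.save_model("bert-domain")   # reload with BertModel to extract embeddings
```

Afterwards you can load the saved encoder, mean-pool the token embeddings for each text, and compare them with cosine similarity.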

How to train an LSTM/GRU with unsupervised learning before training it with supervised learning in Keras?

Correct me if I'm wrong, but I saw an article somewhere saying that training a neural net with an unsupervised method before using it for supervised classification gives better results. I assume it is a kind of transfer learning. But if I want to train the LSTM without labels, I can't just pass the data to Keras's fit function. Any idea how to do it?
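One common way to do this in Keras is to pre-train the LSTM as a sequence autoencoder: the input sequences serve as their own targets, so fit() needs no labels, and the trained encoder is then reused in a supervised classifier. The shapes and random data below are placeholders.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, features, latent = 20, 8, 32

# --- Unsupervised stage: reconstruct the input sequence (no labels needed) ---
inputs = keras.Input(shape=(timesteps, features))
encoded = layers.LSTM(latent, name="encoder")(inputs)
decoded = layers.RepeatVector(timesteps)(encoded)
decoded = layers.LSTM(features, return_sequences=True)(decoded)
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

X_unlabeled = np.random.rand(1000, timesteps, features)   # placeholder data
autoencoder.fit(X_unlabeled, X_unlabeled, epochs=5, batch_size=32)

# --- Supervised stage: reuse the pre-trained encoder for classification ---
clf_out = layers.Dense(1, activation="sigmoid")(autoencoder.get_layer("encoder").output)
classifier = keras.Model(inputs, clf_out)
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# classifier.fit(X_labeled, y_labeled, ...)   # now train with your labeled data
```

You can either fine-tune the encoder together with the classifier head or freeze it first by setting its trainable attribute to False, depending on how much labeled data you have.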
