I am trying to understand the BERT weight calculation. Please suggest me some article which can help me to understand the internal workings of BERT. I have read articles from Medium.
https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1
I am doing a small project to understand the Bert pretraining and fine-tuning from different sources. My idea is to calculate the weights of each token in their own sources and find avg of all weights to get a global model. Then this global model can be used to fine-tune in different sources.
How can I find these weights, and how can average these weights from multiple sources?
can I visualise it? Then how?
Also, note that I am trying to use Tensorflow version of the Bert implementation and planning to fine-tune for the NER task.
Related
I'm using Windows 10 machine. Libraries: Keras with Tensorflow 2.0 Embeddings: Glove(100 dimensions).
I am trying to implement an LSTM architecture for multi-label text classification.
I am using different types of fine-tuning to achieve better results but with no luck so far.
The main problem I believe is the difference in class distributions of my dataset but after a lot of tries and errors, I couldn't implement stratified-k-split in Keras.
I am also experimenting with dropout layers, batch sizes, # of layers, learning rates, clip values, validation splits but I get a minimum boost or worst performance sometimes.
For metrics, I use mainly ROC and F1.
I also followed the suggestion from a StackOverflow member who said to delete some of my examples so I can balance my dataset but if I do that I will have a very low number of examples.
What would you suggest to me?
If someone can provide code based on my implementation for
stratified-k-split I would be grateful cause I have checked all the
online resources but can't implement it.
Any tips, suggestions will be really helpful.
Metrics Plots
Dataset form+Embedings form+train-test-split form
Dataset's labels distribution
My LSTM implementation
BERT pre-training of the base-model is done by a language modeling approach, where we mask certain percent of tokens in a sentence, and we make the model learn those missing mask. Then, I think in order to do downstream tasks, we add a newly initialized layer and we fine-tune the model.
However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT base architecture from scratch, train both the additional downstream task specific layer + the base model weights form scratch with this sentence classification dataset only, and still achieve a good result?
Thanks.
BERT can be viewed as a language encoder, which is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and Book corpus, which sums to 3,300M words. BERT-base has 109M model parameters. So, if you think you have large enough data to train BERT, then the answer to your question is yes.
However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.
I am wondering why do you prefer to train BERT from scratch instead of fine-tuning it? Is it because you are afraid of the domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.
Please note, if you want to train BERT from scratch, you may consider a smaller architecture. You may find the following papers useful.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
I can give help.
First of all, MLM and NSP (which are the original pre-training objectives from NAACL 2019) are meant to train language encoders with prior language knowledge. Like a primary school student who read many books in the general domain. Before BERT, many neural networks would be trained from scratch, from a clean slate where the model doesn't know anything. This is like a newborn baby.
So my question is, "is it a good idea to start teaching a newborn baby when you can begin with a primary school student?" My answer is no. This is supported by numerous State-of-The-Arts achieved by the pre-trained models, compared to the old methods of training a neural network from scratch.
As someone who works in the field, I can assure you that it is a much better idea to fine-tune a pre-trained model. It doesn't matter if you have a 200k dataset or a 1mil datapoints. In fact, more fine-tuning data will only make the downstream results better if you use the right hyperparameters.
Though I recommend the learning rate between 2e-6 ~ 5e-5 for sentence classification tasks, you can explore. If your dataset is very, very domain-specific, it's up to you to fine-tune with a higher learning rate, which will deviate the model further away from its "pre-trained" knowledge.
And also, regarding your question on
can we initialize the BERT base architecture from scratch, train both the additional downstream task specific layer + the base model weights form scratch with this sentence classification dataset only, and still achieve a good result?
I'm negative about this idea. Even though you have a dataset with 200k instances, BERT is pre-trained on 3300mil words. BERT is too inefficient to be trained with 200k instances (both size-wise and architecture-wise). If you want to train a neural network from scratch, I'd recommend you look into LSTMs or RNNs.
I'm not saying I recommend LSTMs. Just fine-tune BERT. 200k is not even too big anyways.
All the best luck with your NLP studies :)
I don't have much experience with training neural networks. I have 4 variable vectors as input and I have respectively 3 variable output vector. I want to create a neural network that takes these inputs and outputs which have some unknown correlation(might not be linear) between them and train. So that when I put previously untrained data through it should predict the correlated output.
I was wondering,
What type of model should I use in such scenarios? Is it Restricted boltzmann machine, regression, GAN, etc?
What library is easiest to learn and implement for such a model? eg:- TensorFlow, PyTorch, etc
If images were involved which can be processed as fft arrays, would the model change.
I did find this answer, but I am not satisfied with it.
Please let me know if there are any functions or other points you would like me to know. Any help is much appreciated.
A multilayer perceprton is a good place to start.
Keras is the highest level/easiest to use library I have used.
If you are working with images or spatially structured data a convolutional neural network will probably work best.
I have a set of features and a keras model that was trained over a subset of these features by someone else.
I want to evaluate this model over a new set of data, but I don't know which features were used to train it. I have originally 32 features, but only 27 were used for the training.
My question is: is it possible to somehow obtain the list of input features to the model having only the keras model itself?
Keras models contain only the architecture and the weights, you can know how many features were used (and you already know that), but you can't know specificly what were thoses features.
You need to find an other way to get this information !
I have been following the http://deeplearning.net/tutorial/ tutorial on how to train an ANN to classify the MNIST numbers. I am now at the "Convolutional Neural Networks" chapter. I want to use the trained network on single examples (MNIST images) and get the predictions. Is there a way to do that?
I have looked ahead in the tutorial and on google but can't find anything.
Thanks a lot in advance for any kind of help!
The material in the Theano tutorial in the earlier chapters, before reaching the Convolutional Neural Networks (CNN) chapter, give a good overview of how Theano works and some of the components the CNN sample code uses. It might be reasonable to assume that students reaching this point have developed their understanding of Theano sufficiently to figure out how to modify the code to extract the model's predictions. Here's a few hints.
The CNN's output layer, called layer3, is an instance of the LogisticRegression class, introduced in an earlier chapter.
The LogisticRegression class has an attribute called y_pred. The comments next to the code which assigns that attribute's values says
symbolic description of how to compute prediction as class whose
probability is maximal
Looking for places where y_pred is used in the logistic regression sample will highlight a function called predict(). This does for the logistic regression sample what is desired of the CNN example.
If one follows the same approach, using layer3.y_pred as the output of a new Theano function, the model's predictions will become apparent.