Operations over transformers model weights - NLP

I want to blend some BERT pretrained models.
To do so, I'm thinking of mixing their weights.
Is it possible to build a BERT model from the weights of others?
I tried to read some documentation and forums but didn't find any implementation of this.
I expect to find something like:
new_model.encoder.weights = 0.5 * old_model1.encoder.weights + 0.5 * old_model2.encoder.weights
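For reference, a minimal sketch of what I have in mind, assuming both checkpoints are Hugging Face BertModel instances with identical architectures (the checkpoint names here are just placeholders):

from transformers import BertModel

# Placeholder checkpoints; any two BERT models sharing the same config should work
model_1 = BertModel.from_pretrained("bert-base-uncased")
model_2 = BertModel.from_pretrained("bert-base-uncased")

state_1, state_2 = model_1.state_dict(), model_2.state_dict()

# Average every floating-point parameter with equal weights (0.5 / 0.5);
# copy non-float tensors (e.g. integer buffers) from the first model as-is
blended_state = {
    name: 0.5 * state_1[name] + 0.5 * state_2[name] if state_1[name].is_floating_point() else state_1[name]
    for name in state_1
}

new_model = BertModel.from_pretrained("bert-base-uncased")
new_model.load_state_dict(blended_state)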

Related

sklearn.linear_model.SGDClassifier manual inference for multiclass classification

I have trained a model for 3-class classification using sklearn.linear_model.SGDClassifier. Now I'm looking for a way to do manual inference with the model. The problem is that the model contains three pairs of [coef_, intercept_], so I don't understand how I can do a prediction in C++.
The code for training looks like the sklearn example:
clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3))
clf.fit(train_features, train_labels)
I tried to calculate the values coef_ * sample + intercept_ for each of the classes but didn't understand how to determine the class from those numbers.
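In Python terms, this is the rule I believe applies, assuming SGDClassifier uses a one-vs-rest scheme for multiclass (a sketch to validate against clf.predict before porting it to C++):

import numpy as np

# Extract the fitted pieces from the pipeline above
scaler = clf.named_steps['standardscaler']
sgd = clf.named_steps['sgdclassifier']

def manual_predict(sample):
    # Reproduce StandardScaler: (x - mean) / scale
    x = (np.asarray(sample) - scaler.mean_) / scaler.scale_
    # One score per class: coef_[k] . x + intercept_[k]
    scores = sgd.coef_ @ x + sgd.intercept_
    # Predicted class = the class whose score is largest (one-vs-rest)
    return sgd.classes_[np.argmax(scores)]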

Replacing positional embedding with pre-calculated results in BERT leads to poor prediction result

I'm trying to use BERT for an NER task. To achieve better prediction results, I'm trying to replace the positional embedding in the embedding_postprocessor() function with some pre-calculated results, based on the principle of sinusoidal embedding as presented in the paper "Attention Is All You Need".
Although after about 20 hours of training the model seems to achieve good convergence (the loss drops to about 10^-2 or 10^-3), the test results were pretty bad, with an accuracy of around 20%-30%.
Has anyone tried to replace the positional embedding of BERT with other implementations? Will the idea of using sinusoidal embeddings work in BERT, or can we only stick to the learned positional embeddings in BERT?
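For reference, a sketch of how such a table can be pre-calculated from the formula in the paper (the shapes would have to match BERT's max_position_embeddings and hidden_size, e.g. 512 x 768):

import numpy as np

def sinusoidal_position_embeddings(max_seq_len=512, hidden_size=768):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_seq_len)[:, None]                        # (seq_len, 1)
    div_term = np.power(10000.0, np.arange(0, hidden_size, 2) / hidden_size)
    table = np.zeros((max_seq_len, hidden_size))
    table[:, 0::2] = np.sin(positions / div_term)
    table[:, 1::2] = np.cos(positions / div_term)
    return table                                                        # (seq_len, hidden_size)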

How to adopt multiple different loss functions at each step of an LSTM in Keras

I have a set of sentences and their scores, and I would like to train a marking system that can predict the score for a given sentence. One example looks like this:
(X =Tomorrow is a good day, Y = 0.9)
I would like to use an LSTM to build such a marking system and also consider the sequential relationship between the words in the sentence, so the training example shown above is transformed as follows:
(x1=Tomorrow, y1=is) (x2=is, y2=a) (x3=a, y3=good) (x4=day, y4=0.9)
When training this LSTM, I would like the first three time steps to use a softmax classifier and the final step to use MSE. It is obvious that the loss function used in this LSTM is composed of two different loss functions. In this case, it seems Keras does not provide a way to address my problem directly. In addition, I am not sure whether my method of building the marking system is correct or not.
Keras supports multiple loss functions as well:
model = Model(inputs=inputs,
              outputs=[lang_model, sent_model])
model.compile(optimizer='sgd',
              loss=['categorical_crossentropy', 'mse'],
              metrics=['accuracy'], loss_weights=[1., 1.])
Based on your explanation, I think you need a model that first predicts a token based on the previous tokens (in the NLP domain this is usually called a language model) and then computes a score, which I assume is a sentiment score (the same idea applies to other domains).
To do so, you can train your language model with an LSTM and pick the last output of the LSTM for your scoring task. To this end, you need to define two loss functions: categorical_crossentropy for the language model and MSE for the scoring task.
This tutorial would be helpful: https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/
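Putting this together, a minimal sketch of such a two-output model (layer sizes and inputs are placeholders; the point is that the language-model head sees every time step while the score head only sees the final LSTM state):

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Model

vocab_size, seq_len, emb_dim = 10000, 20, 128   # assumed sizes

inputs = Input(shape=(seq_len,))
x = Embedding(vocab_size, emb_dim)(inputs)
# return_sequences=True keeps every time step for the language-model head;
# return_state=True exposes the final hidden state for the score head
lstm_seq, state_h, state_c = LSTM(256, return_sequences=True, return_state=True)(x)

# Head 1: predict the next token at each step (categorical_crossentropy)
lang_model = TimeDistributed(Dense(vocab_size, activation='softmax'), name='lang')(lstm_seq)
# Head 2: sentence score in [0, 1] from the final state (mse)
sent_model = Dense(1, activation='sigmoid', name='score')(state_h)

model = Model(inputs=inputs, outputs=[lang_model, sent_model])
model.compile(optimizer='sgd',
              loss=['categorical_crossentropy', 'mse'],
              metrics=['accuracy'], loss_weights=[1., 1.])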

Is there a better approach for personality detection from Twitter data?

I have tried different approaches like MultinomialNB, SVM, MLPClassifier, a CNN, as well as an LSTM network to train on a dataset that consists of tweets and labels (the Big Five classes: openness, conscientiousness, extraversion, agreeableness, neuroticism). But the accuracy is around 60% even after using word2vec, NRC features and MRC features. Is there something I can do to improve the accuracy?
Would you please add a few more details about the dataset you are using?
For example, I would add:
Dataset size (number of samples)
Class distribution (are they balanced or not?)
Do you do any preprocessing?
Without the above information I would just be guessing, but if I were you I would try:
Clean the tweets of noise, e.g. usernames, garbage symbols, etc.
If the dataset is small:
try random search over classical models (Naive Bayes, SVM, logistic regression) with various vectorization strategies, e.g. bag of words or tf-idf, and do a hyper-parameter search (see the sketch after this list)
try applying transfer learning from a model trained on tweets, for example for sentiment analysis
If the dataset is large enough:
try a neural network approach:
Embedding (GloVe, word2vec, fastText) + RNN (LSTM, GRU) + Attention
try training your own embeddings
use pretrained ones such as those
Embedding + CNN + RNN
Bag of words + FNN
If the classes are not balanced:
use a weighted loss
try to balance them
Try stacking multiple models (ensemble).
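As a starting point for the random-search suggestion, a rough sklearn sketch (tweets and labels are assumed to be your cleaned texts and Big Five labels; the parameter ranges are only examples):

from scipy.stats import loguniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# tweets: list of cleaned tweet strings, labels: one Big Five class per tweet
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('clf', LogisticRegression(max_iter=1000))])
param_dist = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2, 5],
    'clf__C': loguniform(1e-2, 1e2),
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=5, scoring='accuracy')
search.fit(tweets, labels)   # then inspect search.best_params_ and search.best_score_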
Hope it helps!
Is the main premise of your project to do personality detection? If not, I would recommend using the Google Sentiment API to calculate sentiment of Twitter data.

Spark: Distributed, incremental model training?

I'm looking for distributed, incremental model training in Spark. For example:
A model_1 is trained to classify web text.
Model_1 is saved to a file system.
New texts are classified. Human experts verify the classification results and select the texts that were correctly classified.
Model_2 is trained using the old model_1 and the correctly classified texts selected in the previous step.
Can this be done with Spark MLlib? Are there other ways to do this?
In Spark you can't incrementally retrain or add examples to the training set.
After the experts classify, you can create a new dataset (with the old + new examples) and retrain the model from the beginning.
You can also create an ensemble of the old and new models and weight them accordingly.
As far as I know (I hope someone proves me wrong), there isn't any framework that provides incremental learning out of the box, so you need to implement an incremental mechanism yourself. In the simplest case, an ensemble is a weighted sum of the predictions of a set of models.
Example: you have two binary classifiers that each return two probabilities and a prediction:
(probability of negative; probability of positive) => prediction
The first classifier: (0.40; 0.60) => 1
The second classifier: (0.30; 0.70) => 1
Suppose your ensemble weights both models equally, with weight 0.5.
The ensemble of both classifiers: (0.35; 0.65) => 1
where:
probability of negative = probability of negative of the first model * weight of the first model + probability of negative of the second model * weight of the second model
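In code, that weighted combination is just (a sketch, independent of any particular framework):

import numpy as np

def weighted_ensemble(probs_1, probs_2, w1=0.5, w2=0.5):
    # probs_i = [P(negative), P(positive)] from each classifier
    combined = w1 * np.asarray(probs_1) + w2 * np.asarray(probs_2)
    return combined, int(np.argmax(combined))

probs, prediction = weighted_ensemble([0.40, 0.60], [0.30, 0.70])
# probs -> [0.35, 0.65], prediction -> 1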
