Spark: Distributed, incremental model training? - apache-spark

Looking for distributed, incremental model training in Spark. For example:
Model_1 is trained to classify web text.
Model_1 is saved to a file system.
New texts are classified. Human experts verify the classification results and select the texts that were classified correctly.
Model_2 is trained using the old model_1 plus the correctly classified texts selected in the previous step.
Can this be done with Spark MLlib? Are there other ways to do this?

In Spark you can't incrementally retrain a model or add examples to its training set.
After the experts verify the classifications, you can create a new dataset (old + new examples) and retrain the model from scratch, as in the sketch below.
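A minimal PySpark sketch of that retrain-from-scratch step (the file paths, column names, and pipeline stages are illustrative assumptions, not something the question specifies):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; each dataset has a "text" column and a numeric "label" column.
old_df = spark.read.parquet("old_training_data.parquet")
new_df = spark.read.parquet("expert_validated_texts.parquet")
combined = old_df.union(new_df)  # old + new examples

# Retrain the whole pipeline from the beginning on the combined data.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model_2 = pipeline.fit(combined)
model_2.write().overwrite().save("model_2")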
You can also create an ensemble of the old and new models and weight them accordingly.
As far as I know (I hope someone proves me wrong), there isn't any framework that provides incremental learning out of the box, so you need to implement an incremental mechanism yourself. In the simplest case, an ensemble is a weighted sum of the predictions of a set of models.
Example: you have two binary classifiers, each returning two probabilities and a prediction:
(probability of negative; probability of positive) => prediction
The first classifier: (0.40; 0.60) => 1
The second classifier: (0.30; 0.70) => 1
Suppose your ensemble weights both models equally, with a weight of 0.5 each.
The ensemble of both classifiers: (0.35; 0.65) => 1
where:
probability of negative = probability of negative of the first model * weight of the first model + probability of negative of the second model * weight of the second model (and likewise for the probability of positive).
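A minimal Python sketch of that weighted-sum ensemble, using the illustrative numbers above (the weights and probabilities come from the example, not from any specific Spark API):

import numpy as np

# Class probabilities (negative, positive) from the two classifiers above.
p_model_1 = np.array([0.40, 0.60])
p_model_2 = np.array([0.30, 0.70])
w_1, w_2 = 0.5, 0.5  # equal weights for both models

# Weighted sum of the per-class probabilities.
p_ensemble = w_1 * p_model_1 + w_2 * p_model_2
prediction = int(np.argmax(p_ensemble))

print(p_ensemble)  # [0.35 0.65]
print(prediction)  # 1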

Related

Multilabel text classification with BERT and highly imbalanced training data

I'm trying to train a multilabel text classification model using BERT. Each piece of text can belong to 0 or more of a total of 485 classes. My model consists of a dropout layer and a linear layer added on top of the pooled output from the bert-base-uncased model from Hugging Face. The loss function I'm using is the BCEWithLogitsLoss in PyTorch.
I have millions of labeled observations to train on. But the training data are highly unbalanced, with some labels appearing in less than 10 observations and others appearing in more than 100K observations! I'd like to get a "good" recall.
My first attempt at training without adjusting for data imbalance produced a micro recall rate of 70% (good enough) but a macro recall rate of 45% (not good enough). These numbers indicate that the model isn't performing well on underrepresented classes.
How can I effectively adjust for the data imbalance during training to improve the macro recall rate? I see we can provide label weights to the BCEWithLogitsLoss loss function, but given the very high imbalance in my data, leading to weights in the range of 1 to 1M, can I actually get the model to converge? My initial experiments show that the weighted loss fluctuates up and down during training.
Alternatively, is there a better approach than using BERT + dropout + linear layer for this type of task?
In your case it might be helpful to balance the labels in the training data. You have a lot of data, so you can afford to lose part of it by balancing. But before you do this, I recommend reading this answer about balancing classes in training data.
If you really only care about recall, you could also try to tune your model to maximize recall.
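If you instead go the loss-weighting route mentioned in the question, one way to keep training stable is to cap the per-label weights. A rough sketch, assuming a binary (num_samples, 485) label matrix; the dummy data, variable names, and the cap of 100 are illustrative assumptions:

import numpy as np
import torch

# Dummy stand-in for your real (num_samples, 485) binary label matrix.
label_matrix = (np.random.rand(1000, 485) < 0.01).astype(np.float32)

num_samples = label_matrix.shape[0]
pos_counts = label_matrix.sum(axis=0)
neg_counts = num_samples - pos_counts

# pos_weight = negatives / positives per label, clamped so the extreme
# 1-to-1M range from the question doesn't destabilize training.
pos_weight = torch.tensor(neg_counts / np.clip(pos_counts, 1, None), dtype=torch.float32)
pos_weight = torch.clamp(pos_weight, max=100.0)

loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Example usage with logits and targets of shape (batch, 485):
logits = torch.randn(8, 485)
targets = torch.tensor(label_matrix[:8])
loss = loss_fn(logits, targets)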

How to normalize output from BERT classifier

I've trained a BERT classifier using the HuggingFace transformers.TFBertForSequenceClassification class. It works fine, but the model.predict() method returns outputs that are not normalized to [0, 1]. E.g. I trained the model to classify news articles into fraud and non-fraud categories, then fed the following 4 test examples to the model for prediction:
articles = ['He was involved in the insider trading scandal.',
'Johnny was a good boy. May his soul rest in peace',
'The fraudster stole money using debit card pin',
'Sun rises in the east']
The outputs are:
[[-2.8615277, 2.6811066],
[ 2.8651822, -2.564444 ],
[-2.8276567, 2.4451752],
[ 2.770451 , -2.3713884]]
For me label 0 is non-fraud and label 1 is fraud, so that part is working fine. But how do I derive a confidence score from here? Does normalization using softmax make sense in this context? Also, if I want to look at the predictions where the model is indecisive, how would I do that? In that case, would both values be very close to each other?
Yes, you can use softmax. To be more precise, take an argmax over the softmax to get label predictions of 0 or 1:
y_pred = tf.nn.softmax(model.predict(test_dataset))
y_pred_argmax = tf.math.argmax(y_pred, axis=1)
This blog was helpful for me when I had the same question.
To answer your second question, I would focus on the test instances that your classification model misclassified rather than trying to find where the model was indecisive.
argmax will always return 0 or 1, never 0.5; it is the softmax probabilities being close to 0.5 that would indicate the model is indecisive.
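If you do want to flag indecisive predictions anyway, a small sketch along those lines, using the raw logits from the question (the 0.1 margin around 0.5 is an arbitrary illustrative threshold):

import numpy as np
import tensorflow as tf

# Raw logits from the question (one row per article, columns = [non-fraud, fraud]).
logits = np.array([[-2.8615277, 2.6811066],
                   [ 2.8651822, -2.564444 ],
                   [-2.8276567, 2.4451752],
                   [ 2.770451 , -2.3713884]])

probs = tf.nn.softmax(logits, axis=1).numpy()  # normalized to [0, 1], each row sums to 1
labels = probs.argmax(axis=1)                  # 0 = non-fraud, 1 = fraud
confidence = probs.max(axis=1)                 # probability of the chosen label

# "Indecisive" = the winning probability is close to 0.5.
indecisive = confidence < 0.5 + 0.1
print(probs, labels, confidence, indecisive)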

Would training a BERT Multi-Label Classifier for 100 labels decrease accuracy a lot?

I am trying to train a text classifier which would be able to classify a sentence as being of a certain query type. I have used the BERT model and trained a multi-label classifier which does the job with 90% accuracy for about 20 labels.
My question is that if I have to train the model for 100/200 labels would the accuracy be impacted severely?
If your class distributions do not have a large overlap and you have a good amount of training data representing each class, your accuracy should not be severely impacted. For a data-hungry model like BERT it's all about data: if you have a large amount of data representing your 100/200 classes, you are good to go.

Is there a way to ensemble predictions other than taking the mean average?

Right now I'm just taking the mean of the predictions from 3 models:
predictions_model = [y_pred_xceptionAug, y_pred_Dense121_Aug, y_pred_resnet50Aug]
predictions = np.mean(predictions_model, axis=0)
Is there a better way to ensemble than just taking the mean average?
One neural-network-based approach is to use the 3 models' predictions as input to a further neural network (stacking).
More advanced approaches include bootstrap aggregating (bagging), where each model trains on a subset of the entire dataset before the predictions are aggregated across models.
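A minimal sketch of the stacking idea with scikit-learn's LogisticRegression as the meta-model (the dummy shapes and variable names are assumptions; in practice you would fit the meta-model on held-out validation predictions, not on the data the base models were trained on):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-model class-probability predictions on a held-out set, shape (n_samples, n_classes).
# Dummy stand-ins for y_pred_xceptionAug, y_pred_Dense121_Aug, y_pred_resnet50Aug.
n_samples, n_classes = 500, 10
preds = [np.random.dirichlet(np.ones(n_classes), size=n_samples) for _ in range(3)]
y_true = np.random.randint(0, n_classes, size=n_samples)

# Stack the three prediction matrices side by side as meta-features.
meta_features = np.hstack(preds)       # shape (n_samples, 3 * n_classes)

meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y_true)  # learns how to combine the three models

# At test time, build the same meta-features from the models' test predictions.
ensemble_pred = meta_model.predict(meta_features)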

Anomaly detection in Text Classification

I have built a text classifier using OneClassSVM.
I have a training set that corresponds to only one label, i.e. "Yes", and I don't have data for the other ("No") label. My task is to build a classifier which classifies a new, unseen sentence (test data) as 1 if it is very similar to the training data, and otherwise as -1, i.e. an anomaly.
I have used Word2Vec to build word embeddings for my training data. Then I am using word-vector averaging with a OneClassSVM to build an anomaly-detector classifier.
This classifier currently gives an accuracy of about 50%-55%. I need to enhance it further to build a robust classifier.
Any suggestions for this problem would be helpful.
I'd suggest a very different approach, since you have no training examples for the negative class at all.
You could train a language model on your training data. At inference time, you score the input with the language model and classify it according to some threshold on the perplexity of the input sentence under the LM.
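A rough sketch of that perplexity-threshold idea with a pretrained Hugging Face LM (GPT-2 here is only an illustrative choice, and the threshold value is an assumption to be tuned on your "Yes" sentences; ideally you would first fine-tune the LM on your training data):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence):
    # Score the sentence with the LM; lower perplexity = more "normal" text.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

THRESHOLD = 100.0  # illustrative; tune on held-out "Yes" sentences

def classify(sentence):
    return 1 if perplexity(sentence) <= THRESHOLD else -1  # -1 = anomaly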
