How to get current batch prediction inside keras training cycle? - keras

I am using _loss, _acc = model.train_on_batch(x, y) training approach. Now I want to get current batch model predictions(outputs) to find incorrect predictions and save same to use later in hard negative mining. How to get current outputs?

Related

Can I use BERT as a feature extractor without any finetuning on my specific data set?

I'm trying to solve a multilabel classification task of 10 classes with a relatively balanced training set consists of ~25K samples and an evaluation set consists of ~5K samples.
I'm using the huggingface:
model = transformers.BertForSequenceClassification.from_pretrained(...
and obtain quite nice results (ROC AUC = 0.98).
However, I'm witnessing some odd behavior which I don't seem to make sense of -
I add the following lines of code:
for param in model.bert.parameters():
param.requires_grad = False
while making sure that the other layers of the model are learned, that is:
[param[0] for param in model.named_parameters() if param[1].requires_grad == True]
gives
['classifier.weight', 'classifier.bias']
Training the model when configured like so, yields some embarrassingly poor results (ROC AUC = 0.59).
I was working under the assumption that an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers. So, where do I got it wrong?
From my experience, you are going wrong in your assumption
an out-of-the-box pre-trained BERT model (without any fine-tuning) should serve as a relatively good feature extractor for the classification layers.
I have noticed similar experiences when trying to use BERT's output layer as a word embedding value with little-to-no fine-tuning, which also gave very poor results; and this also makes sense, since you effectively have 768*num_classes connections in the simplest form of output layer. Compared to the millions of parameters of BERT, this gives you an almost negligible amount of control over intense model complexity. However, I also want to cautiously point to overfitted results when training your full model, although I'm sure you are aware of that.
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers. The one instance in which it can be helpful to disable at least partial layers would be the embedding component, depending on the model's vocabulary size (~30k for BERT-base).
I think the following will help in demystifying the odd behavior I reported here earlier –
First, as it turned out, when freezing the BERT layers (and using an out-of-the-box pre-trained BERT model without any fine-tuning), the number of training epochs required for the classification layer is far greater than that needed when allowing all layers to be learned.
For example,
Without freezing the BERT layers, I’ve reached:
ROC AUC = 0.98, train loss = 0.0988, validation loss = 0.0501 # end of epoch 1
ROC AUC = 0.99, train loss = 0.0484, validation loss = 0.0433 # end of epoch 2
Overfitting, train loss = 0.0270, validation loss = 0.0423 # end of epoch 3
Whereas, when freezing the BERT layers, I’ve reached:
ROC AUC = 0.77, train loss = 0.2509, validation loss = 0.2491 # end of epoch 10
ROC AUC = 0.89, train loss = 0.1743, validation loss = 0.1722 # end of epoch 100
ROC AUC = 0.93, train loss = 0.1452, validation loss = 0.1363 # end of epoch 1000
The (probable) conclusion that arises from these results is that working with an out-of-the-box pre-trained BERT model as a feature extractor (that is, freezing its layers) while learning only the classification layer suffers from underfitting.
This is demonstrated in two ways:
First, after running 1000 epochs, the model still hasn’t finished learning (the training loss is still higher than the validation loss).
Second, after running 1000 epochs, the loss values are still higher than the values achieved with the non-freeze version as early as the 1’st epoch.
To sum it up, #dennlinger, I think I completely agree with you on this:
The entire idea of BERT is that it is very cheap to fine-tune your model, so to get ideal results, I would advise against freezing any of the layers.

Beginner: loss.backwards() doesn't work if input had no impact on output

I am a beginner with pytorch and have the following problem.
I want to optimize a complex problem that uses torch.min() multiple times, but even with a simple toy example I can't get it to work. My code has these lines:
output = net(input)
loss = g(output)
loss = torch.min(loss, torch.ones(1))
loss.backward()
To minimize this loss the net should ascertain that the output minimizes g:R^2->R. Here g is a very simple function that has a zero at {-1,-2} and if I delete the third line the neural network finds the solution just fine. However with the posted code and bad initial weights the minimum is attained by 1. This leads to the backward() function not updating the weights and no learning happening at all over arbitrary many epochs.
Is there a way to detect/fix this behaviour in more complex cases? My task uses the minimum function multiple times and in more complex ways, so I think it would be quite hard to track every single one and ascertain that learning actually takes place.
Edit: If I start the optimizer multiple times it rarely happens that the optimization works just fine (e.g. converges to {-1,-2}). My interpretation is that in those cases the inital weights randomly lead to the minimum beeing attained in the first component.
Rewrite your code:
output = net(input)
loss_fn = torch.min
y_hat = g(output)
y = torch.ones(1)
loss = loss_fn(y_hat, y)
loss.backward()
You wrote:
This leads to the backward() function not updating the weights and no learning happening at all over arbitrary many epochs.
Once we calculate the loss we call loss.backward(). The loss.backward() will calculate the gradients automatically. Gradients are needed in the next phase, when we use the optimizer.step() function to improve our model parameters (weights).
You need to have the optimizer something like this:
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
Since PyTorch training loop can be 100 lines of code, I will just reference some material in here perfect for beginners.

Forecasting/prediction using ARIMA in python - how does it work?

I am very confused about how to predict/forecast using ARIMA.
Lets assume we have a series called y_orig that we split into y_train and y_test. Assuming that y_orig is not stationary, we could fit ARIMA using the code below
# fit ARIMA model
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(y_train, order=(2,1,2))
model_fit = model.fit(disp=0)
print(model_fit.summary())
After fitting the model, we can predict using the code below
n_periods = len(`y_test`)
fc, -, - = model_fit.forecast(n_periods, alpha=0.05) # 95% conf
The value fc should give a forecast which i then compare to y_test. Please note that as expected, y_test is not used in the training phase. Also note that i am not looking for a rolling forecast but for a long term forecast where the parameters (once trained) are fixed.
I am very confused because y_test is not used at all in the forecasting phase.
For instance, if we were to use other prediction models (like in Keras or tensorflow). we would be coding it that way.
First, we fit the model in the training phase which i dont show- it does not matter for my question. Then we predict and see how good our fit is in sample using the code below.
y_pred_train=model.predict(y_train)
then we test the model out of sample as below:
y_pred_test=model.predict(y_test)
In this situation, the parameters are not re-estimated and y_test is used in the testing phase to forecast the next value (with fixed parameters).
Hence my confusion with ARIMA. Why do we not do the same with ARIMA model?
Please help me understand as i am very confused.
Thanks so much!!
I think you're a bit confused by the .fit and the y_train in the ARIMA code block. y_train is just a poorly named variable here, it should just be y, the data I want to forecast. The ARIMA model has no training/test phase, it's not self-learning. It does a statistical analysis of the input data, and does a forecast. If you want to do another forecast (on y_test), you need to do another statistical analysis (using model.fit) and do another forecast (using model.forecast). The ARIMA model does not have any weights it trains in a training phase, nothing related to any previous data 'fitted' on is saved in the model. You can't use a "fitted" ARIMA model to forecast other data samples.

How to use cross-validation splitted data with RandomizedSearchCV

I'm trying to transfer my model from single run to hyper-parameter tuning using RandomizedSearchCV.
In my single run case, my data is splitted into train/validation/test data.
When I run RandomizedSearchCV on my train_data with default 3-fold CV, I notice that the length of my train_input is reduced to 66% of train_data (which makes sense in a 3-fold CV...).
So I'm guessing that I should merge my initial train and validation set into a larger train set and let RandomizedSearchCV split it into train and validation sets.
Would that be the right way to go?
My question is: how can I access the remaining 33% of my train_input to feed it to my validation accuracy test function (note that my score function is running on test set)?
Thanks for your help!
Yoann
I'm not sure that my code would help here since my question is rather generic.
This is the answer that I found by going through sklearn's code: the RandomizedSearchCV doesn't return the splited validation data in an easy way and I should definitely merge my initial train and validation set into a larger train set and let RandomizedSearchCV split it into train and validation sets.
The train_data is splitted for CV using a cross-validator into a train/validation set (in my case, the Stratified K-Folds http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)
My estimator is defined as follows:
class DNNClassifier(BaseEstimator, ClassifierMixin):
It needs a score function to be able to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (which returns the the mean accuracy and requires a predict function to be implemented in the Estimator class).
In my case, I implemented a custom score function within my estimator class.
The hyperparameter search and CV fit is done calling the fit function of RandomizedSearchCV.
RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
This fit function runs the estimator's custom fit function on the train set and then the score function on the validation set.
This is done using the _fit_and_score function from the ._validation library.
So I can access the automatically splitted validation set (33% of my train_data input) at the end of my estimator's fit function.
I'd have preferred to access it within my estimator's fit function so that I can use it to plot validation accuracy over training steps and for early stop (I'll keep a separate validation set for that).
I guess I could reconstruct the automatically generated validation set by looking for the missing indexes from my initial train_data (the train_data used in the estimator's fit function has 66% of the indexes of the initial train_data).
If that is something that someone has already done I'd love to hear about it!

keras metric different during training

I have implemented a custom metric based on SIM and when i try the code it works. I have implemented it using tensors and np arrays and both give the same results. However when I start fitting the model the values given back are a lot higher then the values I get when i load the weights generated by the training and applying the same function.
My function is:
def SIM(y_true,y_pred):
n_y_true=y_true/(K.sum(y_true)+K.epsilon())
n_y_pred=y_pred/(K.sum(y_pred)+K.epsilon())
return K.mean(K.sum( K.minimum(n_y_true, n_y_pred)))
When I compile the Keras model I add this to the metrics and during training it gives for example SIM: 0.7092.
When i load the weights and try it the SIM score is around 0.3. The correct weights are loaded (when restarting training with these weights the same values popup). Does anybody know if I am doing anything wrong?
Why are the metrics given back during training so much higher compared to running the function over a batch?

Resources