Interpreting results from LightFM - svm

I built a recommendation model on a user-item transactional dataset where each transaction is represented by 1.
model = LightFM(learning_rate=0.05, loss='warp')
Here are the results
Train precision at k=3: 0.115301
Test precision at k=3: 0.0209936
Train auc score: 0.978294
Test auc score : 0.810757
Train recall at k=3: 0.238312330233
Test recall at k=3: 0.0621618086561
Can anyone help me interpret this result? How is it that I am getting such good auc score and such bad precision/recall? The precision/recall gets even worse for 'bpr' Bayesian personalized ranking.
Prediction task
users = [0]
items = np.array([13433, 13434, 13435, 13436, 13437, 13438, 13439, 13440])
model.predict(users, items)
Result
array([-1.45337546, -1.39952552, -1.44265926, -0.83335167, -0.52803332,
-1.06252205, -1.45194077, -0.68543684])
How do I interpret the prediction scores?
Thanks

When it comes to the difference between precision@K and AUC, you may want to have a look at my answer here: Evaluating the LightFM Recommendation Model.
The scores themselves do not have a defined scale and are not interpretable. They only make sense in the context of defining a ranking over items for a given user, with higher scores denoting a stronger predicted preference.
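For illustration, here is a minimal sketch of turning the raw scores into a ranking, reusing the model and the items array from the question; only the ordering of the scores matters:
import numpy as np
items = np.array([13433, 13434, 13435, 13436, 13437, 13438, 13439, 13440])
# Repeat the user id so there is one (user, item) pair per candidate item.
scores = model.predict(np.repeat(np.int32(0), len(items)), items)
# Higher score = stronger predicted preference, so sort in descending order.
ranked = items[np.argsort(-scores)]
top_3 = ranked[:3]  # a top-3 recommendation list for user 0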

Related

Why is EmbeddingSimilarityEvaluator used to evaluate the similarity between two embeddings?

TL;DR How can the Pearson correlation coefficient between ground truth labels and cosine similarity scores evaluate the performance of a sentence embedding model? A positive/negative linear relationship between the two doesn't necessarily indicate that a model is accurate, just that they move together, which to me doesn't seem like a good way to evaluate the performance of a sentence embedding model.
I'm training a model to be able to tell if two questions are similar or not. I first continue pre-training using MLM (masked language modeling) and finally fine-tune on the STS dataset. For fine-tuning, I'm using this example python file https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py. At the end of the file, it says to "load the stored model and evaluate its performance on STS benchmark dataset", and it uses this file to evaluate the performance of the model https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/evaluation/EmbeddingSimilarityEvaluator.py.
The second file has a few metrics for evaluation (cosine similarity being one of them), and it uses the Pearson correlation coefficient and Spearman correlation coefficient for each metric to evaluate the performance of the model. What I'm not understanding is: how does calculating the relationship (correlation coefficient) between the ground truth labels and the cosine similarity contribute to measuring the performance of the model? Even if the two have similar movement patterns i.e. a high correlation coefficient, that doesn't mean the model is performing well, does it?
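For concreteness, here is a rough sketch of what the evaluator computes, with placeholder data standing in for the real embeddings and labels (none of these names come from the library):
import numpy as np
from scipy.stats import spearmanr
# Placeholder data: emb1/emb2 stand in for the embeddings of the two sentences
# in each pair, gold for the human similarity labels.
rng = np.random.default_rng(0)
emb1 = rng.normal(size=(100, 384))
emb2 = rng.normal(size=(100, 384))
gold = rng.uniform(0, 5, size=100)
# Cosine similarity per pair, then rank-correlate it with the gold labels.
cos_sim = np.sum(emb1 * emb2, axis=1) / (
    np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
rho, _ = spearmanr(gold, cos_sim)
# A high Spearman rho means the model orders pairs the way the labels do.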

OOB evaluation with metrics other than accuracy, e.g., F1 or AUC

I am training a random forest on an imbalanced dataset, so accuracy isn't informative. I want to avoid cross validation and use out-of-bag (OOB) evaluation instead. Is it possible in sklearn (or in Python in general) to evaluate out-of-bag (OOB) F1 or AUC instead of OOB accuracy?
I didn't find a way to do it on these pages:
https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html#sphx-glr-auto-examples-ensemble-plot-ensemble-oob-py
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Or should I just calculate F1 and AUC from the averaged predictions (or the majority vote, for classification) in oob_decision_function_?
From the source, you can see that the accuracy calculation is hard-coded, so you won't be able to just get another score by setting some parameter.
But, as you say, the oob predictions are available, so making the final computation yourself isn't difficult.
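For example, a minimal sketch of computing OOB F1 and AUC yourself from oob_decision_function_, assuming a binary 0/1 target y and a feature matrix X (both names are stand-ins for your data):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
# Averaged class probabilities from the trees that did NOT see each sample.
# Rows can be NaN if a sample was never out of bag (use more trees if so).
oob_proba = rf.oob_decision_function_[:, 1]
oob_pred = (oob_proba >= 0.5).astype(int)
print("OOB F1 :", f1_score(y, oob_pred))
print("OOB AUC:", roc_auc_score(y, oob_proba))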

How to normalize output from BERT classifier

I've trained a BERT classifier using the HuggingFace transformers.TFBertForSequenceClassification class. It's working fine, but when I use the model.predict() method, it gives a tuple as output whose values are not normalized to [0, 1]. E.g. I trained the model to classify news articles into fraud and non-fraud categories. Then I fed the following 4 test data points to the model for prediction:
articles = ['He was involved in the insider trading scandal.',
'Johnny was a good boy. May his soul rest in peace',
'The fraudster stole money using debit card pin',
'Sun rises in the east']
The outputs are:
[[-2.8615277, 2.6811066],
[ 2.8651822, -2.564444 ],
[-2.8276567, 2.4451752],
[ 2.770451 , -2.3713884]]
For me label-0 is for non-fraud, and label-1 is for fraud, so that's working fine. But how do I prepare the scoring confidence from here? Does normalization using softmax make sense in this context? Also, if I want to look at those predictions where the model is kind of indecisive, how would I do that? In that case would both the values be very close to each other?
Yes. You can use softmax. To be more precise, use an argmax over softmax to get label predictions like 0 or 1.
y_pred = tf.nn.softmax(model.predict(test_dataset))
y_pred_argmax = tf.math.argmax(y_pred, axis=1)
This blog was helpful for me when I had the same query.
To answer your second question, I would focus on the test instances your classification model misclassified rather than on trying to find where the model was indecisive. The argmax will always return 0 or 1, never 0.5; if you do want to flag the indecisive cases, look for predictions where the two softmax probabilities are both close to 0.5.
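A hedged sketch of doing that, reusing y_pred (the softmax output computed above):
import tensorflow as tf
# y_pred has shape (n_samples, 2) and each row sums to 1 after the softmax.
margin = tf.abs(y_pred[:, 1] - y_pred[:, 0])   # near 0 when the model is unsure
indecisive_idx = tf.where(margin < 0.2)[:, 0]  # 0.2 is an arbitrary illustrative cutoff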

What do sklearn.cross_validation scores mean?

I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is the same order of magnitude as the predicted values and, more importantly, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the latter is higher, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure neg_mean_squared_error should be the negative of the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect reduces overfitting and thus should improve the CV error, I find that the opposite is the case.
I'm keenly interested to use Cross Validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, I'm not clear that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to estimate the error of a prediction, you should not use the training data. As the name says, that data is used only for training; to evaluate accuracy you have to use data that the model has never seen.
Cross-validation is a systematic way to do this: you divide your data into n groups (folds) and iterate, each time changing which group is held out for testing. With n groups you do n iterations, and each time the training and testing sets are different.
Basically, what you should do is something like this:
Train the model using months 0 to 30 (for example).
Evaluate the predictions made with months 31 to 35 as input.
If the input has to be the same length, split the features in half (about 17 months each).
I hope I understood correctly; otherwise, comment.
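If you want scikit-learn to do this kind of time-ordered splitting and report RMSE directly, one possible sketch (reusing predictor, features and targets from the question, with rows assumed to be in time order; the neg_root_mean_squared_error scorer needs a reasonably recent scikit-learn):
from sklearn.model_selection import TimeSeriesSplit, cross_validate
# Each fold trains on earlier rows only and validates on later rows.
cv = TimeSeriesSplit(n_splits=5)
result = cross_validate(predictor, features, targets, cv=cv,
                        scoring="neg_root_mean_squared_error")
rmse_per_fold = -result["test_score"]  # flip sign: sklearn scorers are "higher is better"
print(rmse_per_fold, rmse_per_fold.mean())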

Does training on the total dataset improve confidence scores?

I'm using SVC(kernel="linear", probability=True) for multiclass classification. When I use 2/3 of my data for training, I get ~72%. But when I try to predict in production, the confidence scores I get are very low. Does training on the total dataset help to improve confidence scores?
Does training on the total dataset help to improve confidence scores?
It might. In general, the more data the better. However evaluating performance should be done on data that the model has not seen before. One way to do this is to set aside a part of the data, a test set, as you have done. Another approach is to use cross-validation, see below.
And when I try to predict in production, the confidence scores I get are very low.
This means that your model does not generalize well. In other words when presented with data it has not seen before the model starts to make more or less random predictions.
To get a better sense of how well your model generalizes you may want to use cross-validation:
from sklearn.model_selection import cross_val_score
clf = SVC()
scores = cross_val_score(clf, X, Y)
This will train and evaluate your classifier using folds of the full data. For each split, the classifier is trained on the remaining folds and validated on the held-out subset. The scores result contains one validation score per split (for SVC, the accuracy by default). If you need more control over which metrics to evaluate, use the cross_validate function, as in the sketch below.
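For example, a sketch of scoring several metrics at once (assuming X and Y are the multiclass data from the question):
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
clf = SVC(kernel="linear", probability=True)
# Evaluate more than one metric per fold; macro-averaged F1 suits multiclass data.
results = cross_validate(clf, X, Y, cv=5, scoring=["accuracy", "f1_macro"])
print(results["test_accuracy"], results["test_f1_macro"])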
to predict in production
In order to improve your model's performance, there are several methods to consider:
Use more training data
Use an ensemble model to reduce prediction variance
Use a different model (algorithm)
