Spark: Measuring performance of ALS

I am using the ALS model from spark.ml to create a recommender system using implicit feedback for a certain collection of items. I have noticed that the output predictions of the model are much lower than 1 and they usually range in the interval of [0,0.1]. Thus, using MAE or MSE does not make any sense in this case.
Therefore I use areaUnderROC (AUC) to measure the performance. I do that with Spark's BinaryClassificationEvaluator, and I get something close to 0.8. But I cannot clearly understand how that is possible, since most of the values lie in [0,0.1].
To my understanding, beyond a certain threshold the evaluator will consider all of the predictions to belong to class 0, which would essentially mean that the AUC equals the percentage of negative samples?
In general, how would you treat such low values if you need to compare your model's performance with, let's say, Logistic Regression?
I train the model as follows:
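from pyspark.ml.recommendation import ALS

rank = 25
alpha = 1.0
numIterations = 10
# Implicit-feedback ALS; train is a DataFrame with id, itemid and response columns
als = ALS(rank=rank, maxIter=numIterations, alpha=alpha, userCol="id", itemCol="itemid", ratingCol="response", implicitPrefs=True, nonnegative=True)
als.setRegParam(0.01)
model = als.fit(train)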

What @shuaiyuancn explained about BinaryClassificationEvaluator isn't completely correct. It is true that using that kind of evaluator is not correct if you don't have binary ratings and a proper threshold.
However, you can treat a recommender system as a binary classification problem when your system considers binary ratings (click-or-not, like-or-not).
In this case, the recommender defines a logistic model, where we assume that the rating y_uv in {-1, 1} that user u gives item v is generated by a logistic response model:
$$P(y_{uv} = 1) = \frac{1}{1 + \exp(-\mathrm{score}_{uv})}$$
where score_uv is the score given by u to v.
For more information about Logistic Models, you can refer to Hastie et al. (2009) - section 4.4
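On the question of small prediction values: AUC depends only on how the scores order positives relative to negatives, so scores confined to [0, 0.1] can still give a high AUC, and passing them through the logistic function does not change it. A minimal pure-Python sketch, assuming made-up labels and scores (the auc helper is illustrative and is not Spark's implementation):

```python
import math

def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: every score lies inside [0, 0.1].
labels = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.09, 0.08, 0.02, 0.07, 0.06, 0.05, 0.01, 0.03, 0.04, 0.005]

auc_raw = auc(labels, scores)
# A strictly increasing transform (here the logistic function) keeps the ranking.
auc_logistic = auc(labels, [1.0 / (1.0 + math.exp(-s)) for s in scores])

print(auc_raw)       # 0.76
print(auc_logistic)  # 0.76
```

The AUC is therefore not the share of negative samples; it only becomes uninformative when the labels themselves are not a meaningful binary target.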
That said, a recommender system can also be treated as a multi-class classification problem, or it can follow some kind of regression model. This always depends on your data and the problem at hand.
Sometimes I choose to evaluate my recommender system using RegressionMetrics, even though textbooks recommend using RankingMetrics-like evaluations to compute metrics such as average precision at K or MAP. It always depends on the task and the data at hand. There is no general recipe for that.
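To make the ranking-style metrics concrete, here is a small pure-Python sketch of precision at K and average precision for one user. The function names and the item ids are made up for illustration, and normalisation conventions differ between libraries, so treat it as a sketch rather than a reproduction of Spark's RankingMetrics:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def average_precision(recommended, relevant):
    """Mean of precision@rank over the ranks where a relevant item appears,
    normalised by the number of relevant items (one common convention)."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical ranked recommendations and ground-truth interactions for one user.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}

p_at_3 = precision_at_k(recommended, relevant, 3)
ap = average_precision(recommended, relevant)
print(p_at_3, ap)
```

Averaging average_precision over all users gives MAP.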
Nevertheless, I strongly advise you to read the Evaluation Metrics official documentation. It will help you understand better what you are trying to measure regarding what you are trying to achieve.
References
Statistical Methods for Recommender Systems - Deepak K. Agarwal, Bee-Chung Chen.
The Elements of Statistical Learning - Hastie et al.
Spark official documentation - Evaluation Metrics.
EDIT: I ran into this answer today. It's an example implementation of a Binary ALS in python. I strongly advise you to take a look at it.

Using BinaryClassificationEvaluator on a recommender is wrong. Usually a recommender selects one or a few items from a collection as its prediction, but BinaryClassificationEvaluator only deals with two labels, hence Binary.
The reason you still get a result from BinaryClassificationEvaluator is that there is a prediction column in your result dataframe, which is then used to compute the ROC. The number doesn't mean anything in your case; don't take it as a measurement of your model's performance.
I have noticed that the output predictions of the model are much lower than 1 and they usually range in the interval of [0,0.1]. Thus, using MAE or MSE does not make any sense in this case.
Why doesn't MSE make any sense? You're evaluating your model by looking at the difference (error) between the predicted rating and the true rating. [0, 0.1] simply means your model predicts the rating to be in that range.

Related

How can I specify confidence in training data?

I am classifying data with categorical variables. It is data where people have provided information.
My training dataset is of varying quality. I have greater confidence in some of the data, i.e. I am more confident that people have provided correct information, whereas for some of the data I am not so sure.
How can I pass this information into a classification algorithm such as Naive Bayes or K nearest neighbour?
Or should I instead look to another algorithm?

I think what you want to do is provide an individual weight (for the importance/confidence) for each data point you have.
For instance, if you are very certain that one data point is of higher quality and should carry more weight than others in which you are less confident, you can specify that when fitting your classifier.
Sklearn provides, for instance, the Gaussian Naive Bayes classifier (GaussianNB) for that.
Here, you can specify sample_weight when calling the fit() method.
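A minimal sketch of that idea, assuming a made-up one-feature dataset in which the last label is the doubtful one. The data, the weight of 0.01 and the query point are all hypothetical; only GaussianNB and the sample_weight argument of fit() come from sklearn:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical data: the point at 0.5 carries a label we do not trust much.
X = np.array([[0.0], [1.0], [2.0], [3.0], [0.5]])
y = np.array([0, 0, 1, 1, 1])

# Every sample counts equally.
equal = GaussianNB().fit(X, y)

# Down-weight the doubtful sample; the others keep weight 1.
weights = np.array([1.0, 1.0, 1.0, 1.0, 0.01])
weighted = GaussianNB().fit(X, y, sample_weight=weights)

query = np.array([[1.2]])
pred_equal = equal.predict(query)[0]
pred_weighted = weighted.predict(query)[0]
print(pred_equal, pred_weighted)
```

With equal weights the doubtful point pulls class 1 towards the query, and down-weighting it changes the prediction. For categorical features, CategoricalNB.fit accepts the same sample_weight argument, which may be a closer match to the data described in the question.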

Is there class weight (or alternative way) for GradientBoostingClassifier in Sklearn when dealing with VotingClassifier or Grid search?

I'm using GradientBoostingClassifier for my unbalanced labeled datasets. It seems like class weight doesn't exist as a parameter for this classifier in Sklearn. I see I can use sample_weight when fit but I cannot use it when I deal with VotingClassifier or GridSearch. Could someone help?

Currently there isn't a way to use class_weight for GB in sklearn.
Don't confuse this with sample_weight.
Sample weights change the loss function and the score that you're trying to optimize. They are often used for survey data where the sampling approach has gaps.
Class weights are used to correct class imbalances as a proxy for over/undersampling. There is no direct way to do that for GB in sklearn (you can do that in Random Forests, though).

Very late, but I hope it can be useful for other members.
In Zichen Wang's article on towardsdatascience.com, point 5 (Gradient Boosting) says:
For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset.
A chart there shows that half of the gradient boosting models have an AUROC over 80%. So, considering how GB models perform and how they are built, it does not seem necessary to introduce a class_weight parameter like the one RandomForestClassifier has in the sklearn package.
The book Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido (2017 edition), page 89, Chapter 2 (Supervised Learning), section Ensembles of Decision Trees, sub-section Gradient boosted regression trees (gradient boosting machines), says:
They are generally a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
Now, if you still have scoring problems caused by imbalanced category proportions in the target variable, check whether your data should be split so that different models can be applied to each part, because it may not be as homogeneous as it seems. I mean there may be a variable missing from your training dataset (a hidden variable) that strongly influences the model results. In that case even the best GB model will struggle to score correctly, because it lacks important information that, for whatever reason, you cannot include in the feature matrix.
Some updates:
I found, by chance, that some libraries implement this as a parameter of their gradient boosting objects. This is the case for H2O, where the documentation of the balance_classes parameter says:
Balance training data class counts via over/under-sampling (for imbalanced data).
Type: bool (default: False).
If you want to stay with sklearn, you should do as HakunaMaData suggested: over/under-sampling, because that is what other libraries ultimately do when the parameter exists.
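As a rough illustration of the over-sampling route, here is a pure-Python sketch that duplicates randomly chosen minority rows until every class is as large as the biggest one. The helper name random_oversample and the toy data are made up; the balanced output could then be passed to GradientBoostingClassifier.fit as usual:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all
    classes have as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out

# Hypothetical imbalanced training data: five negatives, one positive.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Resample only the training portion of each split; if duplicated rows leak into the validation data, the scores will look better than they are.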

Modelling probabilities in a regularized (logistic?) regression model in python

I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.

The question has two issues, penalized estimation and fractional or proportions data as dependent variable. I worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1 regularized Logit and other discrete models like Poisson for some time. In recent months there has been a lot of effort to support more penalization but it is not in statsmodels yet. Elastic net for linear and Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLM like L2 penalization for GAM and splines or SCAD penalization will follow over the next months based on pull requests that still need work.
Two examples for the current L1 fit_regularized for Logit are here: the question "Difference in SGD classifier results and statsmodels results for logistic with l1" and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note, the penalization weight alpha can be a vector with zeros for coefficients like the constant if they should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as Quasi-maximum likelihood estimator. The estimates are consistent if the mean function, logistic, cumulative normal or similar link function, is correctly specified but we should use robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through a fit keyword cov_type='HC0'.
The best documentation is Stata's (http://www.stata.com/manuals14/rfracreg.pdf) and the references therein. I went through those references before Stata had fracreg, and it works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.

logistic regression with sparse predictor variables

I am currently modeling some data using a binary logistic regression. The dependent variable has a good number of positive cases and negative cases - it is not sparse. I also have a large training set (> 100,000) and the number of main effects I'm interested in is about 15 so I'm not worried about a p>n issue.
What I'm concerned about is that many of my predictor variables, if continuous, are zero most of the time, and if nominal, are null most of the time. When these sparse predictor variables take a value > 0 (or not null), I know because of familiarity with the data that they should be of importance in predicting my positive cases. I have been trying to look for information on how the sparseness of these predictors could be affecting my model.
In particular, I would not want the effect of a sparse but important variable to be left out of my model if there is another predictor variable that is not sparse and is correlated with it but actually doesn't do as good a job of predicting the positive cases. To illustrate with an example, if I were trying to model whether or not someone ended up being accepted at a particular Ivy League university and my three predictors were SAT score, GPA, and "donation > $1M" as a binary, I have reason to believe that "donation > $1M", when true, is going to be very predictive of acceptance - more so than a high GPA or SAT - but it is also very sparse. How, if at all, is this going to affect my logistic model, and do I need to make adjustments for this? Also, would another type of model (say decision tree, random forest, etc.) handle this better?
Thanks,
Christie

information criteria for confusion matrices

One can measure goodness of fit of a statistical model using Akaike Information Criterion (AIC), which accounts for goodness of fit and for the number of parameters that were used for model creation. AIC involves calculation of maximized value of likelihood function for that model (L).
How can one compute L, given prediction results of a classification model, represented as a confusion matrix?

It is not possible to calculate the AIC from a confusion matrix since it doesn't contain any information about the likelihood. Depending on the model you are using it may be possible to calculate the likelihood or quasi-likelihood and hence the AIC or QIC.
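A small sketch of why the confusion matrix is not enough: two hypothetical binary classifiers that produce the same hard predictions (and so the same confusion matrix) can have different likelihoods, and therefore different AIC values. The probabilities and the parameter count of 3 are made up for illustration:

```python
import math

def bernoulli_log_likelihood(y_true, p_pred, eps=1e-15):
    """Log-likelihood of binary labels under predicted probabilities of class 1."""
    ll = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # guard against log(0)
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return ll

def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

y_true = [1, 0, 1, 1, 0]
p_a = [0.9, 0.1, 0.8, 0.7, 0.2]   # confident model
p_b = [0.6, 0.4, 0.6, 0.6, 0.4]   # hesitant model

# Same hard predictions at a 0.5 threshold, hence the same confusion matrix.
hard_a = [int(p >= 0.5) for p in p_a]
hard_b = [int(p >= 0.5) for p in p_b]

aic_a = aic(bernoulli_log_likelihood(y_true, p_a), n_params=3)
aic_b = aic(bernoulli_log_likelihood(y_true, p_b), n_params=3)
print(hard_a == hard_b, aic_a, aic_b)
```

Strictly, AIC uses the likelihood maximized on the training data, so the probabilities should come from the fitted model evaluated on the data it was fitted to.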
What is the classification problem that you are working on, and what is your model?
In a classification context often other measures are used to do GoF testing. I'd recommend reading through The Elements of Statistical Learning by Hastie, Tibshirani and Friedman to get a good overview of this kind of methodology.
Hope this helps.

Information-Based Evaluation Criterion for Classifier's Performance by Kononenko and Bratko is exactly what I was looking for:
Classification accuracy is usually used as a measure of classification performance. This measure is, however, known to have several defects. A fair evaluation criterion should exclude the influence of the class probabilities which may enable a completely uninformed classifier to trivially achieve high classification accuracy. In this paper a method for evaluating the information score of a classifier's answers is proposed. It excludes the influence of prior probabilities, deals with various types of imperfect or probabilistic answers and can be used also for comparing the performance in different domains.
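As I understand the paper, the score for a single answer compares the prior probability of the true class with the probability the classifier assigned to it. The sketch below follows that reading and assumes the prior lies strictly between 0 and 1; check the formula against the paper before relying on it:

```python
import math

def information_score(prior, posterior):
    """Information score in bits for one answer.

    prior:     prior probability of the true class (assumed 0 < prior < 1)
    posterior: probability the classifier assigned to the true class
    """
    if posterior >= prior:
        # Useful answer: information gained about the true class.
        return -math.log2(prior) + math.log2(posterior)
    # Misleading answer: negative score.
    return math.log2(1.0 - prior) - math.log2(1.0 - posterior)

def average_information_score(priors, posteriors):
    """Mean score over a test set (one prior/posterior pair per instance)."""
    scores = [information_score(p, q) for p, q in zip(priors, posteriors)]
    return sum(scores) / len(scores)

# An uninformed classifier that just returns the priors scores zero,
# even when its accuracy on the majority class is high.
uninformed = average_information_score([0.9, 0.9, 0.1], [0.9, 0.9, 0.1])
# Raising the probability of a rare true class from 0.1 to 0.8 is rewarded.
informed = information_score(0.1, 0.8)
print(uninformed, informed)
```

Unlike a likelihood, this score can be computed from the class priors and the classifier's probabilistic answers alone, but it still needs those probabilities, not just the confusion matrix.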
