I have been experimenting with Spark MLlib ALS (trainImplicit) for a while now, and I would like to understand:
1. Why am I getting rating values greater than 1 in the predictions?
2. Is there any need to normalize the user-product input?
Sample result:
[Rating(user=316017, product=114019, rating=3.1923),
Rating(user=316017, product=41930, rating=2.0146997092620897)
]
The documentation mentions that the predicted rating values will be somewhere around 0-1.
I know that the rating values can still be used for recommendations, but it would be great to know the reason.
The cost function in ALS trainImplicit() does not constrain the predicted rating values: it only penalizes the (confidence-weighted) squared difference between each prediction and its 0/1 preference. Predictions can therefore exceed 1, and you may also find some negative values. That is why the documentation says the predicted values are around [0, 1], not necessarily within it.
There is an option to enforce non-negative factorization, so that you never get a negative value in the predicted ratings or the feature matrices, but that seemed to hurt performance in our case.
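To see why nothing forces a prediction into [0, 1], here is a minimal NumPy sketch of a single user update in implicit-feedback ALS. It assumes the usual confidence-weighted formulation (preference p = 1 where an interaction exists, confidence c = 1 + alpha * r). The helper name `implicit_user_step`, the item factors `Y`, and all numbers are made up for illustration; this is not Spark's actual code.

```python
import numpy as np

def implicit_user_step(Y, r, alpha=1.0, lam=0.1):
    """Solve the regularized, confidence-weighted least-squares problem for one user.

    Y   : (n_items, rank) item factors, held fixed during the user step
    r   : (n_items,) raw implicit feedback (e.g. click counts)
    """
    p = (r > 0).astype(float)          # binary preference target (0 or 1)
    c = 1.0 + alpha * r                # confidence weight per item
    A = Y.T @ (c[:, None] * Y) + lam * np.eye(Y.shape[1])
    b = Y.T @ (c * p)
    return np.linalg.solve(A, b)

# Hypothetical rank-1 item factors: the second item has a larger factor.
Y = np.array([[1.0], [2.0]])
r = np.array([1.0, 1.0])               # the user interacted with both items

x_u = implicit_user_step(Y, r)
scores = Y @ x_u                       # predicted preferences for this user
print(scores)
```

Both targets are 1, but the best least-squares compromise undershoots the first item and overshoots the second, so one score lands above 1. The ranking of items is still meaningful, which is all a recommender needs.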
I am doing supervised learning on numeric data with a KNN classifier, and I get an error when I call metrics.confusion_matrix(y_test, y_pred). The error message is: Found input variables with inconsistent numbers of samples: [40000, 2000]. Thanks in advance for shedding some light.
The confusion matrix compares the values in the two arrays and tells you how many samples were labelled the same and how many differed. For this, both arrays must have the same number of elements, and that is what the error says: here y_test has 40000 entries while y_pred has 2000.
So make sure that both of them have the same number of elements.
Perhaps if you include previous code that deals with y_test and y_pred, it would be easier to see why the sizes are different.
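Since the original code is not shown, the following is only a sketch of the usual pattern that keeps the two arrays aligned: predict on exactly the feature rows that `y_test` belongs to. The synthetic data, the variable names, and the 80/20 split are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the real features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Predict on X_test (not X_train or X) so y_pred lines up with y_test.
y_pred = knn.predict(X_test)

# Cheap sanity check before computing any metric.
assert len(y_test) == len(y_pred), (len(y_test), len(y_pred))

cm = confusion_matrix(y_test, y_pred)
print(cm)
```

If the length check fails, printing the shapes of `X_test`, `y_test`, and `y_pred` right after each is created usually shows where they diverged.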
I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is balanced enough, with about 300 training examples for each category.
All looks good and the classifier works with acceptable precision on unseen documents. But I am noticing that when classifying a new document, the classifier very often assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).
What are the possible reasons for this?
I would like to add that in Spark ML there is something called "raw prediction", and when I look at it I can see negative numbers of more or less comparable magnitude. So even the category with the high probability has a raw prediction score comparable to the others, but I am having difficulty interpreting these scores.
Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document, and x_i are its features, Naive Bayes returns:
$$\hat{c} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$$
Since P(d) is the same for all classes we can simplify this to
$$\hat{c} = \arg\max_{c \in C} P(d \mid c)\,P(c)$$
where
$$P(d \mid c) = P(x_1, \ldots, x_n \mid c)$$
Since we assume that features are conditionally independent (that is why it is naive) we can further simplify this (with Laplace correction to avoid zeros) to:
$$\hat{c} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)$$
The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid this we use the following property:
$$\log(ab) = \log a + \log b$$
and replace the initial condition with:
$$\hat{c} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right]$$
These are the values you get as the raw predictions. Since each term is negative (the logarithm of a value in (0, 1]), the whole expression is negative as well. As you discovered yourself, these values are further normalized: they are rescaled so that the maximum value is equal to 1, and then divided by the sum of the rescaled values.
It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties. The order and ratios are exactly the same (ignoring possible numerical issues). If no other class gets a prediction close to one, it means that, given the evidence, it is a very strong prediction. So it is actually something you want to see.
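A small numeric illustration of why log scores of "comparable magnitude" still turn into a near-certain prediction: a gap of 10 in log space is a factor of about 22000 in probability space. The scores below are invented, and `log_scores_to_probabilities` is a hypothetical helper that follows the normalization described above, not Spark's own function.

```python
import numpy as np

def log_scores_to_probabilities(log_scores):
    """Turn unnormalized log scores into probabilities that sum to 1."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Subtracting the maximum makes the largest rescaled value exactly 1
    # and avoids underflow when exponentiating very negative numbers.
    rescaled = np.exp(log_scores - log_scores.max())
    return rescaled / rescaled.sum()

# Three classes whose raw log scores differ by only about 1-2 percent.
raw = [-1000.0, -1010.0, -1020.0]
probs = log_scores_to_probabilities(raw)
print(probs)
```

With hundreds of features per document, differences of this size between classes are ordinary, so probabilities close to 0 or 1 are the expected outcome rather than a sign of a bug.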
I'm trying to implement collaborative filtering using sklearn's TruncatedSVD. However, I get a very high RMSE, because I receive very low ratings for every recommendation.
I perform TruncatedSVD on a sparse matrix, and I was wondering whether these low recommendations occur because TruncatedSVD treats unrated movies as movies rated 0. If not, do you know what might cause the low recommendations? Thanks!
So, it turned out that if zero is not a meaningful value in your data set, you cannot apply TruncatedSVD without some adjustments. Movie ratings run from 1 to 5, so you need to mean-center the data; after centering, a zero means "average rating" rather than "worse than the worst rating". Mean-centering the data worked for me, and I started to get reasonable RMSE values.
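A minimal sketch of the mean-centering step, assuming a tiny dense matrix where 0 marks a missing rating and centering is done per user over the observed entries only. The matrix, the choice of per-user (rather than global or per-item) means, and `n_components=2` are illustrative assumptions, not the original poster's setup.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Rows are users, columns are movies; 0 means "not rated".
R = np.array([
    [5, 4, 0, 1],
    [4, 0, 5, 1],
    [1, 2, 0, 5],
    [0, 1, 2, 4],
    [5, 5, 4, 0],
], dtype=float)

observed = R > 0
user_mean = R.sum(axis=1) / observed.sum(axis=1)

# Subtract each user's mean from the observed ratings only.
# Missing entries stay 0, which now means "this user's average rating".
R_centered = np.where(observed, R - user_mean[:, None], 0.0)

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(R_centered)

# Low-rank reconstruction, shifted back to the original rating scale.
R_hat = svd.inverse_transform(user_factors) + user_mean[:, None]
print(np.round(R_hat, 2))
```

Without the centering step, the decomposition tries to reproduce the zeros as real ratings, which drags every prediction toward 0 and inflates the RMSE.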
I am working myself through a book on machine learning right now.
Working on a NaiveBayesClassifier the author is very much in favour of the cross-validation method.
He proposes to split the data into ten buckets (files) and train on nine of them each time withholding a different bucket.
Up to now the only approach I am familiar with is to split the data into a training set and a test set in the ratio of 50%/50% and simply train the classifier all at once.
Could someone explain what the possible advantages of using cross-validation are?
Cross-validation is a way to address the tradeoff between bias and variance.
When you fit a model on a training set, you can always reduce the training error (the bias) by adding more terms, higher-order polynomials, etc. But doing so increases the variance: the model starts to fit the noise in that particular training set.
But your true objective is to predict outcomes for points that your model has never seen. That's what holding out the test set simulates.
You'll create your model on a training set, then try it out on a test set. You will find that there's a combination of variance and bias that gives the lowest test error. The simplest model that achieves it should be your choice.
I'd recommend "An Introduction to Statistical Learning" or "The Elements of Statistical Learning", both co-authored by Hastie and Tibshirani, for more details.
A general rule in machine learning is that the more data you have for training, the better the results you get. This is important to state before I start answering the question.
Cross-validation helps avoid overfitting and gives a more trustworthy estimate of the generalization accuracy, which is the accuracy of the model on unseen future points. When you divide your dataset into only Dtrain and Dtest, there is a problem: if determining the final model needs both the training and the testing data, then you cannot say that your accuracy on future unseen points will be the same as the accuracy you got on your test data. Take k-NN as an example: the nearest neighbours are determined from the training data, while the value of k is chosen using the test data.
But if you use CV, then k can be determined from the CV data and your test data can be treated as genuinely unseen.
Now suppose you divide your dataset into 3 parts: Dtrain (60%), Dcv (20%) and Dtest (20%). You then have only 60% of the data to train with. If you want to use the full 80% for training, you can do this with m-fold cross-validation. In m-fold CV you first divide your data into two parts, Dtrain and Dtest (let's say 80% and 20%).
Let's say the value of m is 4, so you divide the training data randomly into 4 equal parts (d1, d2, d3, d4). Train the model on d1, d2, d3 and use d4 as the CV set to calculate the accuracy; next train on d2, d3, d4 and use d1 as the CV set; and so on until each part has been held out once. Do this for one candidate hyperparameter value (for example k=1 in k-NN), then repeat the same procedure for the next candidate (k=2), and so on. This way you use the entire 80% of Dtrain, and Dtest can be treated as a future unseen dataset.
The advantages are that you make fuller use of your Dtrain data, reduce overfitting, and get a more reliable estimate of the generalization accuracy.
On the downside, the time complexity is high, since the model is trained m times for every candidate setting.
In your case the value of m is 10.
Hope this helps.
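The procedure above (hold out 20% as Dtest, pick k for k-NN by 4-fold CV on the remaining 80%, then score once on Dtest) can be sketched with scikit-learn as follows. The synthetic data and the candidate values of k are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Dtest (20%) is set aside and never used while choosing k.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 4-fold CV on Dtrain for every candidate k; each fold is held out once.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=4,
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
# The best model is refit on all of Dtrain, then scored once on Dtest.
test_accuracy = search.score(X_test, y_test)
print(best_k, test_accuracy)
```

Because Dtest played no part in choosing k, `test_accuracy` is a fair estimate of performance on unseen data.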
The idea is to have the maximum number of points available to train the model, so as to achieve accurate results. Every data point placed in the training set is excluded from the test set. So we use the concept of k and k-1: we first divide the data set into k equal-sized bins, take one bin as the test set, and let the remaining k-1 bins form the training set. We repeat the process until every bin has been selected once as the test set, with the remaining k-1 bins as the training set. This way no data point is left out of training.
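The k and k-1 bookkeeping can be written out by hand in a few lines. This is a minimal NumPy sketch; `k_fold_indices` is a hypothetical helper name, and in practice scikit-learn's `KFold` does the same job.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each bin is the test set exactly once."""
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices, then cut them into k (nearly) equal bins.
    bins = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = bins[i]
        train_idx = np.concatenate([bins[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=100, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With 100 samples and k=10, every round trains on 90 points and tests on the other 10, and across the ten rounds each point is tested exactly once.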
I'm using the SVM classifier in the machine learning scikit-learn package for python.
My features are integers. When I call the fit function, I get the user warning "Scaler assumes floating point values as input, got int32". The SVM still returns its prediction, and I calculate the confusion matrix (I have 2 classes) and the prediction accuracy.
I tried to avoid the user warning by saving the features as floats. Indeed, the warning disappeared, but I got a completely different confusion matrix and prediction accuracy (surprisingly, much less accurate).
Does someone know why it happens? What is preferable, should I send the features as float or integers?
Thanks!
You should convert them to floats, but the way to do it depends on what the integer features actually represent.
What is the meaning of your integers? Are they category membership indicators (for instance: 1 == sport, 2 == business, 3 == media, 4 == people...) or numerical measures with an order relationship (3 is larger than 2, which in turn is larger than 1)? You cannot say that "people" is larger than "media", for instance. It is meaningless, and giving the machine learning algorithm this assumption would confuse it.
Categorical features should hence be expanded into several boolean features (with value 0.0 or 1.0), one for each possible category. Have a look at the DictVectorizer class in scikit-learn to better understand what I mean by categorical features.
If they are numerical values, just convert them to floats and maybe use the Scaler to bring them loosely into the range [-1, 1]. If they span several orders of magnitude (e.g. counts of word occurrences), then taking the logarithm of the counts might yield better results. More documentation on feature preprocessing, with examples, is in this section of the documentation: http://scikit-learn.org/stable/modules/preprocessing.html
Edit: also read this guide that has many more details for features representation and preprocessing: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
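A short sketch of both cases, assuming one categorical feature and one count feature. The field names and values are invented, and `StandardScaler` is used here as the current scikit-learn class for the standardization step (my assumption about what "the Scaler" refers to).

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: "section" is categorical, "word_count" is a count.
rows = [
    {"section": "sport", "word_count": 12},
    {"section": "business", "word_count": 340},
    {"section": "media", "word_count": 5600},
    {"section": "sport", "word_count": 48},
]

# String values become one 0.0/1.0 column per category;
# numeric values are passed through as floats.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)
print(vec.feature_names_)

# The counts span several orders of magnitude, so take the log first,
# then standardize to zero mean and unit variance.
count_col = vec.feature_names_.index("word_count")
X[:, count_col] = np.log1p(X[:, count_col])
X[:, [count_col]] = StandardScaler().fit_transform(X[:, [count_col]])
print(X)
```

The one-hot columns are left unscaled here; whether to scale them as well is a judgment call that depends on the kernel and on the other features.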