When using the LogisitcRegression.predict_proba the docs say that, when multiclass = 'ovr' they return the normalized probability for each class.
Is there a way, without having to calculate it using the model.coef_
like
pred_proba_manually = [1/(1+np.exp(-(intercept + tf_idf_val#coef))) for
coef,intercept in zip(logreg.coef_,logreg.intercept_)]
to get the unnormalized probabilities for each class?
Related
I am aware of the standard process of finding the optimal value of alpha/lambda using Cross Validation technique through GridSearchCV class in sklearn.model_selection library.Here's my code to find that .
alphas=np.arange(0.0001,0.01,0.0005)
cv=RepeatedKFold(n_splits=10,n_repeats=3, random_state=100)
hyper_param = {'alpha':alphas}
model = Lasso()
model_cv = GridSearchCV(estimator = model,
param_grid=hyper_param,
scoring='r2',
cv=cv,
verbose=1,
return_train_score=True
)
model_cv.fit(X_train,y_train)
#checking the bestscore
model_cv.best_params_
This gives me alpha=0.01
Now, looking on LassoCV , as per my understanding , this library creates model by selecting best optimal alpha by the passed alphas list, and please note , I have used the same cross validation scheme for both of them. But when trying sklearn.linear_model.LassoCV with RepeatedKFold cross validation scheme.
alphas=np.arange(0.0001,0.01,0.0005)
cv=RepeatedKFold(n_splits=10,n_repeats=3,random_state=100)
ls_cv_m=LassoCV(alphas,cv=cv,n_jobs=1,verbose=True,random_state=100)
ls_cv_m.fit(X_train_reduced,y_train)
print('Alpha Value %d'%ls_cv_m.alpha_)
print('The coefficients are {}',ls_cv_m.coef_)
I get alpha=0 for the same data and this alpha value in not present in the list of decimal values passed in alphas argument for this.
This has confused me about the actual implementation of LassoCV.
and my doubts are ..
Why do I get optimal alpha as 0 in LassoCV when the list passed to the argument does not has zero in it.
What is the difference between LassoCV and Lasso then, if I have to anyways find most suitable alpha from GridSearchCV only?
First you should pass your alphas as keywords parameters rather then positional parameters since the first positional parameter for LassoCV is eps.
ls_cv_m=LassoCV(alphas=alphas,cv=cv,n_jobs=1,verbose=True,random_state=100)
Then, the model is returning as optimal parameter one of the alphas that you previously defined, however you are simply printing it as an integer number casting the float to int. Replace %d with %f to print it in the float format:
print('Alpha Value %f'%ls_cv_m.alpha_)
Have a look here for more details about Python printing formats and styles.
As for your second question, Lasso is the linear model while LassoCV is an iterative process that allows you to find the optimal parameters for a Lasso model using Cross-validation.
I am using XGBClassifier for multiclass classification(5 classes - [1,2,3,4,5]). I have set objective parameter as 'multi:softmax' but still when I predict using my model I am getting continuous values instead of integers.
I tried with specifying num_class parameter too, but still it predicts continuous values.
model = XGBClassifier(learning_rate = 0.1,n_estimators = 200, objective='multi:softmax')
model.fit(x1, y1, eval_set=[(x1,y1),(x2, y2)], eval_metric='mlogloss')
Expected output = [1,2,3,3,2,3,4,4,5,5,1.... etc] #integers values
Actual Output = [2.334, 1.455, 2.122, 1.76 .... etc] #continuous values
I have a Naive Bayes classifier that I wrote in Python using a Pandas data frame and now I need it in PySpark. My problem here is that I need the feature importance of each column. When looking through the PySpark ML documentation I couldn't find any info on it. documentation
Does anyone know if I can get the feature importance with the Naive Bayes Spark MLlib?
The code using Python is the following. The feature importance is retrieved with .coef_
df = df.fillna(0).toPandas()
X_df = df.drop(['NOT_OPEN', 'unique_id'], axis = 1)
X = X_df.values
Y = df['NOT_OPEN'].values.reshape(-1,1)
mnb = BernoulliNB(fit_prior=True)
y_pred = mnb.fit(X, Y).predict(X)
estimator = mnb.fit(X, Y)
# coef_: For a binary classification problems this is the log of the estimated probability of a feature given the positive class. It means that higher values mean more important features for the positive class.
feature_names = X_df.columns
coefs_with_fns = sorted(zip(estimator.coef_[0], feature_names))
If you're interested in an equivalent of coef_, the property, you're looking for, is NaiveBayesModel.theta
log of class conditional probabilities.
New in version 2.0.0.
i.e.
model = ... # type: NaiveBayesModel
model.theta.toArray() # type: numpy.ndarray
The resulting array is of size (number-of-classes, number-of-features), and rows correspond to consecutive labels.
It is, probably, better to evaluate a difference
log(P(feature_X|positive)) - log(P(feature_X|negative))
as a feature importance.
Because, we are interested in the Discriminative power of each feature_X (sure-sure NB is a generative model).
Extreme example: some feature_X1 has the same value across all + and - samples, so no discriminative power.
So, the probability of this feature value is high for both + and - samples, but the difference of log probabilities = 0.
I am using sklearn's cross_val_predict for training like so:
myprobs_train = cross_val_predict(LogisticRegression(),X = x_old, y=y_old, method='predict_proba', cv=10)
I am happy with the returned probabilities, and would like now to score up a brand-new dataset. I tried:
myprobs_test = cross_val_predict(LogisticRegression(), X =x_new, y= None, method='predict_proba',cv=10)
but this did not work, it's complaining about y having zero shape. Does it mean there's no way to apply the trained and cross-validated model from cross_val_predict on new data? Or am I just using it wrong?
Thank you!
You are looking at a wrong method. Cross validation methods do not return a trained model; they return values that evaluate the performance of a model (logistic regression in your case). Your goal is to fit some data and then generate prediction for new data. The relevant methods are fit and predict of the LogisticRegression class. Here is the basic structure:
logreg = linear_model.LogisticRegression()
logreg.fit(x_old, y_old)
predictions = logreg.predict(x_new)
I have the same concern as #user3490622. If we can only use cross_val_predict on training and testing sets, why y (target) is None as the default value? (sklearn page)
To partially achieve the desired results of multiple predicted probability, one could use the fit then predict approach repeatedly to mimic the cross-validation.
If I run a basic logistic regression with 4 classes, I can get the predict_proba array.
How can i manually calculate the probabilities using the coefficients and intercepts? What are the exact steps to get the same answers that predict_proba generates?
There seem to be multiple questions about this online and several suggestions which are either incomplete or don't match up anyway.
For example, I can't replicate this process from my sklearn model so what is missing?
https://stats.idre.ucla.edu/stata/code/manually-generate-predicted-probabilities-from-a-multinomial-logistic-regression-in-stata/
Thanks,
Because I had the same question but could not find an answer that gave the same results I had a look at the sklearn GitHub repository to find the answer. Using the functions from their own package I was able to create the same results I got from predict_proba().
It appears that sklearn uses a special softmax() function that differs from the usual softmax function in their code.
Let's assume you build a model like this:
from sklearn.linear_model import LogisticRegression
X = ...
Y = ...
model = LogisticRegression(multi_class="multinomial", solver="saga")
model.fit(X, Y)
Then you can calculate the probabilities either with model.predict(X) or use the sklearn function mentioned above to calculate them manually like this.
from sklearn.utils.extmath import softmax,
import numpy as np
scores = np.dot(X, model.coef_.T) + model.intercept_
softmax(scores) # Sklearn implementation
In the documentation for their own softmax() function, they note that
The softmax function is calculated by
np.exp(X) / np.sum(np.exp(X), axis=1)
This will cause overflow when large values are exponentiated. Hence
the largest value in each row is subtracted from each data point to
prevent this.
Replicate sklearn calcs (saw this on a different post):
V = X_train.values.dot(model.coef_.transpose())
U = V + model.intercept_
A = np.exp(U)
P=A/(1+A)
P /= P.sum(axis=1).reshape((-1, 1))
seems slightly different than softmax calcs, or the UCLA stat example, but it works.