Custom cross-validation for Ridge in sklearn - scikit-learn

I have written the following algorithm to implement a Ridge regression and estimate its regularisation parameter via cross-validation. In particular, I wanted to achieve the following:
1. For the purpose of cross-validation, the train set is divided into 10 folds. The first time, the model is estimated on fold 1 and validated on fold 2; the second time it is estimated on folds 1-2 and validated on fold 3, ..., the 9th time it is estimated on folds 1-9 and validated on fold 10.
2. For each of the 9 estimations above, I want the train features to be z-scored, and the validation features to be z-scored using the mean and variance of the fold(s) used in training.
I think I am doing something wrong in the pipeline that implements point 2, but I can't figure out what. Could I have your opinion on the implementation below?
# Create two lists of the indexes of the train and validation sets as per point 1
import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

n_splits = 10
kf = KFold(n_splits=n_splits, shuffle=False)
folds = [idx for _, idx in kf.split(df_train)]
# First split: train on fold 1, validate on fold 2
indexes_train = [folds[0]]
indexes_test = [folds[1]]
# Subsequent splits: train on folds 1..i+1, validate on fold i+2
for i in range(1, n_splits - 1):
    indexes_train.append(np.concatenate((indexes_train[i - 1], folds[i])))
    indexes_test.append(folds[i + 1])

# Tune the model as per point 2: the scaler is refit on each training split,
# so each validation fold is z-scored with the training folds' mean and variance
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('model', Ridge(fit_intercept=True))])
alpha_tune = {'model__alpha': self.alpha_values}
cross_validation = list(zip(indexes_train, indexes_test))
model = GridSearchCV(estimator=pipe, param_grid=alpha_tune, cv=cross_validation,
                     scoring='neg_mean_squared_error', n_jobs=-1).fit(features_train, labels_train)
best_alpha = model.best_params_['model__alpha']
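
As a point of comparison (not part of the original post), the expanding-window scheme of point 1 is also what scikit-learn's built-in TimeSeriesSplit produces; a minimal sketch, noting that fold boundaries can differ slightly from KFold when the number of rows is not a multiple of 10:

from sklearn.model_selection import TimeSeriesSplit

# 9 expanding-window splits: train on chunks 1..k, validate on chunk k+1
tscv = TimeSeriesSplit(n_splits=n_splits - 1)
cross_validation_alt = list(tscv.split(df_train))  # same (train_idx, test_idx) structure as above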

Related

Why is the ROC_AUC from cross_val_score so much higher than when manually using a StratifiedKFold with metrics.roc_auc_score for an XGB classifier?

Method 1 - StratifiedKFold cross validation
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=False)
roc_aucs_temp = []
for train_index, test_index in skf.split(X_train_xgb, y_train_xgb):
    X_train_fold, X_test_fold = X_train_xgb.iloc[train_index], X_train_xgb.iloc[test_index]
    y_train_fold, y_test_fold = y_train_xgb[train_index], y_train_xgb[test_index]
    xgb_temp.fit(X_train_fold, y_train_fold)
    y_pred = xgb_temp.predict(X_test_fold)  # hard class labels
    roc_aucs_temp.append(metrics.roc_auc_score(y_test_fold, y_pred))
print(roc_aucs_temp)
[0.8622474747474748, 0.8497474747474747, 0.9045918367346939, 0.8670918367346939, 0.879591836734694]
Method 2 - cross_val_score
from sklearn.model_selection import cross_val_score

# this uses the same CV object as method 1
print(cross_val_score(xgb, X_train_xgb, y_train_xgb, cv=skf, scoring='roc_auc'))
[0.9614899 0.94861111 0.96045918 0.97270408 0.96977041]
I might be misunderstanding the functionality of cross_val_score, but from my understanding it creates K folds of training and test data, then repeatedly trains the model on K-1 folds and tests on the remaining fold. It should give roughly the same scores as manually creating the folds with StratifiedKFold. Why doesn't it?
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
The documentation for roc_auc_score indicates that its second argument should be the label scores rather than the predicted labels. As its example shows, you probably want something like xgb_temp.predict_proba(X_test_fold)[:, 1] instead of xgb_temp.predict(X_test_fold).
cross_val_score with scoring='roc_auc' already evaluates on the predicted scores, which is why you are seeing the difference.
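
A minimal sketch of the Method 1 loop with that change applied (same variables as above):

roc_aucs_temp = []
for train_index, test_index in skf.split(X_train_xgb, y_train_xgb):
    X_train_fold, X_test_fold = X_train_xgb.iloc[train_index], X_train_xgb.iloc[test_index]
    y_train_fold, y_test_fold = y_train_xgb[train_index], y_train_xgb[test_index]
    xgb_temp.fit(X_train_fold, y_train_fold)
    y_score = xgb_temp.predict_proba(X_test_fold)[:, 1]  # score of the positive class
    roc_aucs_temp.append(metrics.roc_auc_score(y_test_fold, y_score))
print(roc_aucs_temp)  # should now be in line with cross_val_score's roc_auc values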

Custom loss for single-label, multi-class problem

I have a single-label, multi-class classification problem, i.e., a given sample is in exactly one class (say, class 3), but for training purposes, predicting class 2 or 5 is still okay to not penalise the model that heavily.
For example, the ground truth for one sample is [0,1,1,0,1] over 5 classes, instead of a one-hot vector. This implies that the model predicting any one (not necessarily all) of the marked classes (2, 3, or 5) is fine.
For every batch, the predicted output dimension is of the shape bs x n x nc, where bs is the batch size, n is the number of samples per point and nc is the number of classes. The ground truth is also of the same shape as the predicted tensor.
For every batch, I expect my loss function to compare n vectors across nc classes and then average the result over n.
E.g.: when the dimensions are 32 x 8 x 5000, there are 32 batch points in a batch (bs=32), each batch point has 8 vector points, and each vector point has 5000 classes. For a given batch point, I wish to compute the loss across all 8 vector points and average them; the final loss would then aggregate the per-batch-point losses across all 32 batch points.
How can I approach designing such a loss function? Any help would be deeply appreciated
P.S.: Let me know if the question is ambiguous
One way to go about this is to use a sigmoid function on the network output, which removes the implicit interdependency between class scores that a softmax function has.
As for the loss function, you can then calculate the loss based on the highest prediction for any of your target classes and ignore all other class predictions. For your example:
import torch

# your model output
y_out = torch.tensor([[0.1, 0.2, 0.95, 0.1, 0.01]], requires_grad=True)
# class labels
y = torch.tensor([[0, 1, 1, 0, 1]])
Since we only care about the highest class probability, we set the score of every target class to the maximum value achieved among the target classes:
class_mask = y == 1
max_class_score = torch.max(y_out[class_mask])
y_hat = torch.where(class_mask, max_class_score, y_out)
From there we can use a regular cross-entropy loss function (CrossEntropyLoss accepts probability-style targets since PyTorch 1.10):
loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(y_hat, y.float())
loss.backward()
When inspecting the gradients, we see that this only updates the prediction that achieved the highest score, as well as all predictions outside of the target classes:
>>> y_out.grad
tensor([[ 0.3326, 0.0000, -0.6653, 0.3326, 0.0000]])
Predictions for the other target classes do not receive a gradient update. Note that if a very high proportion of the classes are acceptable targets, this might slow down your convergence.
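
To apply this to the bs x n x nc shape from the question, one option (a sketch, not from the original answer; it assumes PyTorch 1.10+ for probability-style targets and that every vector point has at least one target class) is to flatten the bs * n vector points into rows, build y_hat per row, and let the default mean reduction average over all of them:

import torch

def masked_max_ce(y_out, y):
    # y_out, y: (bs, n, nc) -> treat every vector point as one sample
    bs, n, nc = y_out.shape
    y_out_flat = y_out.reshape(bs * n, nc)
    y_flat = y.reshape(bs * n, nc).float()
    class_mask = y_flat == 1
    # per row: the highest score among that row's target classes
    max_target = y_out_flat.masked_fill(~class_mask, float('-inf')).max(dim=1, keepdim=True).values
    y_hat = torch.where(class_mask, max_target, y_out_flat)
    return torch.nn.functional.cross_entropy(y_hat, y_flat)  # mean over the bs * n rows

loss = masked_max_ce(y_out_batch, y_batch)  # y_out_batch, y_batch: placeholders for your tensors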

Viewing model coefficients for a single prediction

I have a logistic regression model housed in a scikit-learn pipeline using the following:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        solver='lbfgs',
        cv=10,
        scoring='roc_auc',
        class_weight='balanced'
    )
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
I can view the model's coefficients for predictions as a whole with this code ...
import matplotlib.pyplot as plt
import pandas as pd

# Look at the model's coefficients to see which features are most important
plt.rcParams['figure.dpi'] = 50
model = pipeline.named_steps['logisticregressioncv']
coefficients = pd.Series(model.coef_[0], X_train.columns)
plt.figure(figsize=(10, 12))
coefficients.sort_values().plot.barh(color='grey');
Which returns a bar plot of the features and their coefficients.
What I'm trying to do is be able to see how different input values for a single observation impact its prediction. The idea is to be able to run predictions on a sample population and examine the group with "low" predictions ... for example if I run predictions for 10 observations, I'd like to see how different input values impacted each of those 10 predictions, individually.
I recalled that I can achieve this via SHAP values using something along the following lines (but using LinearExplainer instead of TreeExplainer):
# Instantiate a model and encoder outside of the pipeline for use with shap
import shap
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=25)
# Fit on train, score on val
model.fit(X_train_encoded, y_train2)
y_pred_shap = model.predict(X_val_encoded)

# Get an individual observation to explain.
row = X_test_encoded.iloc[[-3]]

# Why did the model predict this?
# Look at a Shapley values force plot
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(row)
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value[1],
    shap_values=shap_values[1],
    features=row
)
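
A rough sketch of the LinearExplainer variant for the pipeline above (my adaptation, not verified against the original data; it assumes shap's LinearExplainer and that the model should be explained in the scaled feature space it was trained on):

import shap

scaler = pipeline.named_steps['standardscaler']
logreg = pipeline.named_steps['logisticregressioncv']

# The logistic regression saw scaled inputs, so explain in that space
X_train_scaled = scaler.transform(X_train)
row = X_test.iloc[[-3]]
row_scaled = scaler.transform(row)

explainer = shap.LinearExplainer(logreg, X_train_scaled)
shap_values = explainer.shap_values(row_scaled)

shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value,
    shap_values=shap_values,
    features=row,
    feature_names=list(X_train.columns)
)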

How to add more features in multi-class text classification?

I have a retail dataset with product_description, price, supplier, category as columns.
I used product_description as feature:
from sklearn import model_selection, preprocessing, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the fitted encoder on the validation labels
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['product_description'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)
# predict the labels on the validation dataset
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(predictions, valid_y)  # ~20%, very low
Since the accuracy is very low, I want to add the supplier and price as features too. How can I incorporate this in the code?
I have tried other classifiers like LR, SVM, and Random Forest, but they had (almost) the same outcome.
The TF-IDF vectorizer returns a matrix with one row per example and one column per term score. You can extend this matrix as you wish before feeding it into the classifier:
1. Prepare your additional features as an array of shape (number of examples, number of features).
2. Stack them column-wise onto the TF-IDF matrix; note that the TF-IDF matrix is sparse, so use scipy.sparse.hstack rather than np.concatenate (or densify first and use np.concatenate with axis=1). A sketch follows below.
3. Fit the classifier as you did before.
It is usually a good idea to normalize real-valued features. Also, you can try different classifiers: logistic regression or an SVM might do a better job with real-valued features than Naive Bayes.
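
A minimal sketch along those lines (the column names price and supplier come from the question; the OneHotEncoder/MaxAbsScaler choices are assumptions, and prices are assumed non-negative so MultinomialNB still applies):

import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import OneHotEncoder, MaxAbsScaler

# reuse the split from above; the Series keep their original index
train_idx, valid_idx = train_x.index, valid_x.index

# one-hot encode the supplier column
supplier_enc = OneHotEncoder(handle_unknown='ignore')
supplier_train = supplier_enc.fit_transform(df.loc[train_idx, ['supplier']])
supplier_valid = supplier_enc.transform(df.loc[valid_idx, ['supplier']])

# scale price; MultinomialNB needs non-negative inputs
price_scaler = MaxAbsScaler()
price_train = price_scaler.fit_transform(df.loc[train_idx, ['price']])
price_valid = price_scaler.transform(df.loc[valid_idx, ['price']])

# stack everything column-wise (the TF-IDF matrix is sparse, so use scipy's hstack)
xtrain_all = hstack([xtrain_tfidf, supplier_train, csr_matrix(price_train)])
xvalid_all = hstack([xvalid_tfidf, supplier_valid, csr_matrix(price_valid)])

classifier = naive_bayes.MultinomialNB().fit(xtrain_all, train_y)
predictions = classifier.predict(xvalid_all)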

Sklearn logistic regression - adjust cutoff point

I have a logistic regression model trying to predict one of two classes: A or B.
My model's accuracy when predicting A is ~85%.
Model's accuracy when predicting B is ~50%.
Prediction of B is not important; however, prediction of A is very important.
My goal is to maximize the accuracy when predicting A. Is there any way to adjust the default decision threshold when determining the class?
import numpy as np
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(penalty='l2', solver='saga', multi_class='ovr')
classifier.fit(np.float64(X_train), np.float64(y_train))
As mentioned in the comments, selecting the threshold is done after training. You can find the threshold that maximizes a utility function of your choice, for example:
import numpy as np
from sklearn import metrics

preds = classifier.predict_proba(test_data)
fpr, tpr, thresholds = metrics.roc_curve(test_y, preds[:, 1])
print(thresholds)

accuracy_ls = []
for thres in thresholds:
    y_pred = np.where(preds[:, 1] > thres, 1, 0)
    # Apply the desired utility function to y_pred, for example accuracy.
    accuracy_ls.append(metrics.accuracy_score(test_y, y_pred, normalize=True))
After that, choose the threshold that maximizes the chosen utility function; in your case, one that rewards correct predictions of the class you care about (A).
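
For example, continuing the snippet above, a small sketch that picks the threshold with the best recorded score:

best_threshold = thresholds[np.argmax(accuracy_ls)]
y_pred_final = np.where(preds[:, 1] > best_threshold, 1, 0)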
