I have large imbalance data set and I will apply under sampling + boosting ,sampling + bagging
to balance my data .
my question how apply base algorithms such as Knn on result of balancing
ex:
rusboost = RUSBoostClassifier(random_state=0)
rusboost.fit(X_train, y_train)
RUSBoostClassifier(...)
y_pred = rusboost.predict(X_test)
how take the result from this code as label and feature and inter it in another training
balanced_accuracy_score(y_test, y_pred)
Related
I have a large dataset of around 1M records and testing whether a feature in a linear regression is useful to the model.
I am using python and statsmodels however the p-value for variable1 or variable2 (or any other variable I include) are always low (0.000) due to the large dataset. What's the best assessment metric to use please?
import statsmodels.api as sm
x = df[[Variable1, Variable2]]
y = df[Response]
x= sm.add_constant(x)
model = sm.OLS(y,x).fit()
print_model = model.summary()
print(print_model)
Are the hyper-parameters in Gaussian Process Regresor optimized during the fitting in scikit-learn?
In the page
https://scikit-learn.org/stable/modules/gaussian_process.html
it is said:
"The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based on the passed optimizer"
So, it is not required, for instance, to optimize it by using grid earch?
A hyperparameter is something that you need to specify, usually, the best way to do it is within a pipeline ( series of steps) in which you try many hyperparameters and get the best one. Here is an example of just trying different hyperparameters for a k-means in which you give a list of hyperparameters (n_neighbors for K-Means) in order to see which ones work best! Hope it helps you!
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
# Loop over different values of k
for i, k in enumerate(neighbors):
# Setup a k-NN Classifier with k neighbors: knn
knn = KNeighborsClassifier(n_neighbors= k)
# Fit the classifier to the training data
knn.fit(X_train,y_train)
#Compute accuracy on the training set
train_accuracy[i] = knn.score(X_train, y_train)
knn.predict(X_test)
#Compute accuracy on the testing set
test_accuracy[i] = knn.score(X_test, y_test)
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
I have a task in which I input a 500x500x1 image and get out a 500x500x1 binary segmentation. When working, only a small fraction of the 500x500 should be triggered (small "targets"). I'm using a sigmoid activation at the output. Since such a small fraction is desired to be positive, the training tends to stall with all outputs at zero, or very close. I've written my own loss function that partially deals with it, but I'd like to use binary cross entropy with a class weighting if possible.
My question is in two parts:
If I naively apply binary_crossentropy as the loss to my 500x500x1 output, will it apply on a per pixel basis as desired?
Is there a way for keras to apply class weighting with the single sigmoid output per pixel?
To answer your questions.
Yes, binary_cross_entropy will work per-pixel based, provided you feed to your image segmentation neural network pairs of the form (500x500x1 image(grayscale image) + 500x500x1 (corresponding mask to your image).
By feeding the parameter 'class_weight' parameter in model.fit()
Suppose you have 2 classes with 90%-10% distribution. Then you may want to penalise your algorithm 9 times more when it makes a mistake for the less well represented class(the class with 10% in this case). Suppose you have 900 examples of class 1 and 100 examples of class 2.
Then your class weights dictionary(there are multiple ways to compute it, what is important is to assign a greater weight to the less well represented class),
class_weights = {0:1000/900,1:1000/100}
Example : model.fit(X_train, Y_train, epochs = 30, batch_size=32, class_weight=class_weight)
NOTE: This is available only on 2d cases(class_weight). For 3D or higher dimensional spaces, one should use 'sample_weights'. For segmentation purposes, you would rather use sample_weights parameter.
The biggest gain you will have is by means of other loss functions. Other losses, apart from binary_crossentropy and categorical_crossentropy, inherently perform better on unbalanced datasets. Dice Loss is such a loss function.
Keras implementation:
smooth = 1.
def dice_coef(y_true, y_pred):
y_true_f = K.flatten(y_true)
y_pred_f = K.flatten(y_pred)
intersection = K.sum(y_true_f * y_pred_f)
return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
def dice_coef_loss(y_true, y_pred):
return 1 - dice_coef(y_true, y_pred)
You can also use as a loss function the sum of binary_crossentropy
and other losses if it suits you : i.e. loss = dice_loss + bce
I have a logistic regression model trying to predict one of two classes: A or B.
My model's accuracy when predicting A is ~85%.
Model's accuracy when predicting B is ~50%.
Prediction of B is not important however prediction of A is very important.
My goal is to maximize the accuracy when predicting A. Is there any way to adjust the default decision threshold when determining the class?
classifier = LogisticRegression(penalty = 'l2',solver = 'saga', multi_class = 'ovr')
classifier.fit(np.float64(X_train), np.float64(y_train))
Thanks!
RB
As mentioned in the comments, procedure of selecting threshold is done after training. You can find threshold that maximizes utility function of your choice, for example:
from sklearn import metrics
preds = classifier.predict_proba(test_data)
tpr, tpr, thresholds = metrics.roc_curve(test_y,preds[:,1])
print (thresholds)
accuracy_ls = []
for thres in thresholds:
y_pred = np.where(preds[:,1]>thres,1,0)
# Apply desired utility function to y_preds, for example accuracy.
accuracy_ls.append(metrics.accuracy_score(test_y, y_pred, normalize=True))
After that, choose threshold that maximizes chosen utility function. In your case choose threshold that maximizes 1 in y_pred.
I am using scikit-learn ensemble classifiers for classification.I have separate training and testing data sets.When I use the same data sets and classify using machine learning algorithms I am getting consistent accuracies. Inconsistency is only in case of ensemble classifiers. I have even set random_state to 0.
bag_classifier = BaggingClassifier(n_estimators=10,random_state=0)
bag_classifier.fit(train_arrays,train_labels)
bag_predict = bag_classifier.predict(test_arrays)
bag_accuracy = bag_classifier.score(test_arrays,test_labels)
bag_cm = confusion_matrix(test_labels,bag_predict)
print("The Bagging Classifier accuracy is : " ,bag_accuracy)
print("The Confusion Matrix is ")
print(bag_cm)
You will normally find different results for same model because every time when the model is executed during training, the train/test split is random. You can reproduce the same results by giving the seed value to the train/test split.
train, test = train_test_split(your data , test_size=0.3, random_state=57)
Keep the same random_state value in each turn of training.