How do you assess the significance of features in a regression model where the p-values are always low due to the large dataset? - statistics

I have a large dataset of around 1M records and am testing whether a feature in a linear regression is useful to the model.
I am using Python and statsmodels; however, the p-values for Variable1, Variable2, or any other variable I include are always low (0.000) due to the large dataset. What is the best assessment metric to use, please?
import statsmodels.api as sm
x = df[['Variable1', 'Variable2']]
y = df['Response']
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
print_model = model.summary()
print(print_model)
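With a sample of this size almost any coefficient comes out "statistically significant", so one commonly suggested alternative is to look at effect sizes, confidence intervals, and how much the feature actually improves the fit. A rough sketch, reusing x, y and model from above and dropping Variable2 in the reduced model (not a definitive answer):
print(model.params)      # effect sizes (coefficients): are they practically large?
print(model.conf_int())  # 95% confidence intervals
# Does Variable2 add explanatory power beyond Variable1?
reduced = sm.OLS(y, x[['const', 'Variable1']]).fit()
print(model.rsquared_adj - reduced.rsquared_adj)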

Related

Binary Classification Evaluator AUC Score in Pyspark

I have a dataset with 2 classes (churners and non-churners) in the ratio 1:4. I used the Random Forest algorithm via Spark MLlib. My model is terrible at predicting the churn class; it essentially never predicts it.
I use BinaryClassificationEvaluator to evaluate my model in Pyspark. The default metric for the BinaryClassificationEvaluator is AreaUnderRoc.
My code
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()  # default metric: areaUnderROC
# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="indexedFeatures", numTrees=1000,impurity="entropy")
# Train model with Training Data
rfModel = rf.fit(train_df)
rfModel.featureImportances
# Make predictions on test data using the Transformer.transform() method.
predictions = rfModel.transform(test_df)
# Evaluate the best model: AUC on the test data
auc = evaluator.evaluate(predictions)  # evaluate once and reuse the value
print('Test Area Under Roc', auc)
Test Area Under Roc 0.8672196520652589
Here is the confusion matrix (image not reproduced here; it shows TP = 0 for the churn class).
Since TP = 0, how is that score possible? Could this value be wrong?
I have other models which work fine, but this score makes me wonder whether the others are wrong as well.
Your data might be heavily biased towards one of the classes; I would recommend using precision or the F-measure, since they are better metrics in such situations. Note that the AUC is computed from the ranking of the predicted probabilities over all possible thresholds, so it can be high even though the default 0.5 threshold yields no positive predictions, which is how TP = 0 and an AUC of 0.87 can coexist.
Try using this (Scala API; a PySpark sketch follows the link below):
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
// BinaryClassificationMetrics expects an RDD of (score, label) pairs,
// not the prediction DataFrame itself.
val metrics = new BinaryClassificationMetrics(predictions)
val f1Score = metrics.fMeasureByThreshold
f1Score.collect.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
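Since the question uses PySpark, here is a rough PySpark equivalent of the same idea; it is a sketch rather than the original answer's code, and it assumes predictions is the DataFrame returned by rfModel.transform(test_df) with "prediction" and "label" columns and that the churn class is labeled 1.0:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
# Overall F1 on the test predictions
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1")
print("Test F1", f1_evaluator.evaluate(predictions))
# Per-class precision/recall/F1 for the churn class
pred_and_labels = predictions.select("prediction", "label") \
    .rdd.map(lambda row: (float(row.prediction), float(row.label)))
metrics = MulticlassMetrics(pred_and_labels)
print("Churn precision", metrics.precision(1.0))
print("Churn recall", metrics.recall(1.0))
print("Churn F1", metrics.fMeasure(1.0))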

What is the accuracy of a clustering algorithm?

I have a set of points that I have clustered using a clustering algorithm (k-means in this case). I also know the ground-truth labels and I want to measure how accurate my clustering is. What I need is to find the actual accuracy. The problem, of course, is that the labels given by the clustering do not match the ordering of the original one.
Is there a way to measure this accuracy? The intuitive idea would be to compute the score of the confusion matrix of every combination of labels, and only keep the maximum. Is there a function that does this?
I have also evaluated my results using rand scores and adjusted rand score. How close are these two measures to actual accuracy?
Thanks!
First of all, what does "the labels given by the clustering do not match the ordering of the original ones" mean?
If you know the ground-truth labels, you can re-arrange them to match the order of the X matrix; that way, the KMeans labels can be compared with the true labels after prediction.
In this situation, I suggest the following.
If you have the ground truth labels and you want to see how accurate your model is, then you need metrics such as the Rand index or mutual information between the predicted and true labels. You can do that in a cross-validation scheme and see how the model behaves i.e. if it can predict correctly the classes/labels under a cross-validation scheme. The assessment of prediction goodness can be calculated using metrics like the Rand index.
In summary:
Define a KMeans model, use cross-validation, and in each iteration estimate the Rand index (or mutual information) between the cluster assignments and the true labels. Repeat that for all iterations and finally take the mean of the Rand index scores. If this score is high, the model is good.
Full example:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
import numpy as np
# some data
data = load_iris()
X = data.data
y = data.target  # ground truth labels
loo = LeaveOneOut()
rand_index_scores = []
for train_index, test_index in loo.split(X):  # LOOCV here
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # the model
    kmeans = KMeans(n_clusters=3, random_state=0)
    kmeans.fit(X_train)  # fit using training data
    predicted_labels = kmeans.predict(X_test)  # predict using test data
    rand_index_scores.append(adjusted_rand_score(y_test, predicted_labels))  # goodness of predicted labels
print(np.mean(rand_index_scores))
Since clustering is an unsupervised learning problem, you have specific metrics for it: https://scikit-learn.org/stable/modules/classes.html#clustering-metrics
You can refer to the discussion in the scikit-learn User Guide to have an idea of the differences between the different metrics for clustering: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
For instance, the adjusted Rand index compares pairs of points and checks whether pairs that share a label in the ground truth also share a label in the predictions. Unlike accuracy, it does not require strict label equality, so the arbitrary numbering of the clusters does not matter.
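A tiny illustrative sketch (made-up labels, not from the answer above) of how the adjusted Rand index ignores the cluster numbering while plain accuracy does not:
from sklearn.metrics import adjusted_rand_score, accuracy_score
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 0, 0]  # same grouping, different cluster numbers
print(adjusted_rand_score(y_true, y_pred))  # 1.0: the partition is identical
print(accuracy_score(y_true, y_pred))       # 0.0: no label matches literally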
You can use sklearn.metrics.accuracy_score, as documented in the link below:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
An example can be seen in the question linked below:
sklearn: calculating accuracy score of k-means on the test data set
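Note that accuracy_score only makes sense after the cluster labels have been mapped onto the true labels. One way to do the "best matching" the question alludes to is the Hungarian algorithm on the confusion matrix; this is a sketch under the assumption that both true labels and cluster labels are 0..k-1, not code from the answers above:
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, accuracy_score

def clustering_accuracy(y_true, y_pred):
    # Find the cluster-to-class assignment that maximizes the matches.
    cm = confusion_matrix(y_true, y_pred)
    row_ind, col_ind = linear_sum_assignment(-cm)  # negate to maximize
    mapping = {col: row for row, col in zip(row_ind, col_ind)}
    remapped = np.array([mapping[label] for label in y_pred])
    return accuracy_score(y_true, remapped)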

Sklearn logistic regression - adjust cutoff point

I have a logistic regression model trying to predict one of two classes: A or B.
My model's accuracy when predicting A is ~85%.
Model's accuracy when predicting B is ~50%.
Prediction of B is not important; however, prediction of A is very important.
My goal is to maximize the accuracy when predicting A. Is there any way to adjust the default decision threshold when determining the class?
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', solver='saga', multi_class='ovr')
classifier.fit(np.float64(X_train), np.float64(y_train))
Thanks!
RB
As mentioned in the comments, selecting a threshold is done after training. You can find the threshold that maximizes a utility function of your choice, for example:
from sklearn import metrics
import numpy as np
preds = classifier.predict_proba(test_data)
fpr, tpr, thresholds = metrics.roc_curve(test_y, preds[:, 1])
print(thresholds)
accuracy_ls = []
for thres in thresholds:
    y_pred = np.where(preds[:, 1] > thres, 1, 0)
    # Apply the desired utility function to y_pred, for example accuracy.
    accuracy_ls.append(metrics.accuracy_score(test_y, y_pred, normalize=True))
After that, choose the threshold that maximizes the chosen utility function; in your case, pick the threshold that performs best on the class encoded as 1 in y_pred.
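A short follow-up sketch (reusing the names above) for selecting and applying the chosen threshold:
best_threshold = thresholds[np.argmax(accuracy_ls)]  # or whatever utility you tracked
final_pred = np.where(preds[:, 1] > best_threshold, 1, 0)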

Scikit learn deviations in accuracy

I am using scikit-learn ensemble classifiers for classification. I have separate training and testing data sets. When I use the same data sets and classify with other machine learning algorithms, I get consistent accuracies; the inconsistency appears only with ensemble classifiers. I have even set random_state to 0.
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import confusion_matrix
bag_classifier = BaggingClassifier(n_estimators=10, random_state=0)
bag_classifier.fit(train_arrays, train_labels)
bag_predict = bag_classifier.predict(test_arrays)
bag_accuracy = bag_classifier.score(test_arrays, test_labels)
bag_cm = confusion_matrix(test_labels, bag_predict)
print("The Bagging Classifier accuracy is :", bag_accuracy)
print("The Confusion Matrix is ")
print(bag_cm)
You will normally find different results for the same model because the train/test split is random each time the model is trained. You can reproduce the same results by passing a seed value to the train/test split:
from sklearn.model_selection import train_test_split
train, test = train_test_split(your_data, test_size=0.3, random_state=57)
Keep the same random_state value for every training run.
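A minimal end-to-end sketch (with hypothetical arrays X and y) that fixes the seed in both the split and the ensemble, so reruns give identical accuracy:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=57)
clf = BaggingClassifier(n_estimators=10, random_state=0)  # seed the ensemble too
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # identical on every rerun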

Difference in SGD classifier results and statsmodels results for logistic with l1

As a check on my work, I've been comparing the output of scikit-learn's SGDClassifier logistic implementation with statsmodels' logistic regression. Once I add some l1 penalty in combination with categorical variables, I get very different results. Is this a result of different solution techniques, or am I not using the correct parameter?
The differences are much bigger on my own dataset, but still pretty large using mtcars:
import patsy
import statsmodels.api as sm
from sklearn.linear_model import SGDClassifier
df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am ~ standardize(wt) + standardize(disp) + C(cyl) - 1', df)
logit = sm.Logit(y, X).fit_regularized(alpha=.0035)
# n_iter and loss='log' are called max_iter and loss='log_loss' in newer scikit-learn
clf = SGDClassifier(alpha=.0035, penalty='l1', loss='log', l1_ratio=1,
                    n_iter=1000, fit_intercept=False)
clf.fit(X, y.ravel())  # scikit-learn expects a 1-D target
gives:
sklearn: [-3.79663192 -1.16145654 0.95744308 -5.90284803 -0.67666106]
statsmodels: [-7.28440744 -2.53098894 3.33574042 -7.50604097 -3.15087396]
I've been working through some similar issues. I think the short answer might be that SGD doesn't work so well with only a few samples, but is (much more) performant with larger data. I'd be interested in hearing from sklearn devs. Compare, for example, using LogisticRegression
from sklearn.linear_model import LogisticRegression
# newer scikit-learn needs an l1-capable solver specified explicitly
clf2 = LogisticRegression(penalty='l1', C=1/.0035, solver='liblinear', fit_intercept=False)
clf2.fit(X, y.ravel())
gives coefficients very similar to the l1-penalized Logit:
array([[-7.27275526, -2.52638167, 3.32801895, -7.50119041, -3.14198402]])

Resources