Cross-validation gives unexpected results on Boston Housing without shuffling - scikit-learn

I am getting surprising results for Boston Housing. The following code produces very different results when I apply cross-validation to the original Boston Housing dataset and to its randomly shuffled version:
from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle
boston = load_boston()
knn = KNeighborsRegressor(n_neighbors=1)
print(cross_val_score(knn, boston.data, boston.target))
X, y = shuffle(boston.data, boston.target, random_state=0)
print(cross_val_score(knn, X, y))
The output is:
[-1.07454938 -0.50761407 0.00351173]
[0.30715435 0.36369852 0.51817514]
Even if the order of the original dataset is not random, why are the 1 Nearest Neighbor predictions so poor for it? Thank you.

The order of the original dataset is not random at all. Since I do not normalize the dataset, two features essentially determine the Euclidean distances: feature 9 (with mean around 408) and feature 11 (mean around 357); I think these are PTRATIO (pupil-teacher ratio by town) and LSTAT (% lower status of the population). The graph of PTRATIO is
.
Roughly the last third of all samples have very high values of this feature; so the third fold (and scikit-learn uses 3 folds by default) is highly anomalous. The graph for LSTAT looks less striking, but nevertheless the last third is still highly anomalous. So it's not surprising that the 1NN results are so poor without shuffling.

Related

Sklearn TruncatedSVD not showing explained variance ration in descending order, or first number means something else?

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
digits = datasets.load_digits()
X = digits.data
X = X - X.mean() # centering the data
#### svd
svd = TruncatedSVD(n_components=5)
svd.fit(X)
print(svd.explained_variance_ration)
#### PCA
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)
svd output is:
array([0.02049911, 0.1489056 , 0.13534811, 0.11738598, 0.08382797])
pca output is:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415])
is there a bug in the TruncatedSVD implementation? or why is the first explained variance (0.02...) behaving like this? or what is the meaning
Summary:
That is because TruncatedSVD and PCA use different SVD functions!.
Note: Your case is due to Reason 2 below, yet I included another reason for future readers.
Details:
Reason 1: The solver set by user in each algorithm, is different:
PCA internally uses scipy.linalg.svd which sorts singular values, hence the explained_variance_ratio_ is sorted.
Part of Scikit Implementation of PCA:
# Center data
U, S, Vt = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, Vt = svd_flip(U, Vt)
components_ = Vt
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
Screenshot from the above-mentioned scipy.linalg.svd link:
On the other hand, TruncatedSVD uses scipy.sparse.linalg.svds which relies on the ARPACK solver for decomposition.
Screenshot from the above-mentioned scipy.sparse.linalg.svds link:
Reason 2: The TruncatedSVD operates differently compared to PCA:
In your case you chose randomized as a solver (which is set by default) in both algorithms, yet you obtained different results with regards to the order of the variance.
That is because in PCA, the variance is obtained from the actual singular values (called Sigma or S in Scikit-Learn implementation), which are already sorted:
On the other hand, the variance in TruncatedSVD is obtained from X_transformed which results from multiplying the data matrix by the components. The latter does not necessarily preserve order because data are not centered, nor is it the purpose of TruncatedSVD which it is used in first place for sparse matrices:
Now if you center your data, you will get them sorted (note that you did not center data properly, because centering requires dividing by standard deviation):
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
sc = StandardScaler()
X = sc.fit_transform(X)
### SVD
svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=2021)
svd.fit(X)
print(svd.explained_variance_ratio_)
Output
[0.12033916 0.09561054 0.08444415 0.06498406 0.04860093]
Important: Further read.

Bunch object not callable - scikit-learn rcv1 dataset

I want to split the train and test set for RCV1 inbuilt dataset and apply k-means algorithm, however while trying to split the data, an error is shown saying bunch object not callable
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
x_train = rcv1(subset='train')
Indeed it is not; neither it is a dataframe - see the docs. Some extra info is included in the DESCR attribute:
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()
print(rcv1.DESCR)
Result:
.. _rcv1_dataset:
RCV1 dataset
------------
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually
categorized newswire stories made available by Reuters, Ltd. for research
purposes. The dataset is extensively described in [1]_.
**Data Set Characteristics:**
============== =====================
Classes 103
Samples total 804414
Dimensionality 47236
Features real, between 0 and 1
============== =====================
:func:`sklearn.datasets.fetch_rcv1` will load the following
version: RCV1-v2, vectors, full sets, topics multilabels::
>>> from sklearn.datasets import fetch_rcv1
>>> rcv1 = fetch_rcv1()
It returns a dictionary-like object, with the following attributes:
``data``:
The feature matrix is a scipy CSR sparse matrix, with 804414 samples and
47236 features. Non-zero values contains cosine-normalized, log TF-IDF vectors.
A nearly chronological split is proposed in [1]_: The first 23149 samples are
the training set. The last 781265 samples are the testing set. This follows
the official LYRL2004 chronological split. The array has 0.16% of non zero
values::
>>> rcv1.data.shape
(804414, 47236)
``target``:
The target values are stored in a scipy CSR sparse matrix, with 804414 samples
and 103 categories. Each sample has a value of 1 in its categories, and 0 in
others. The array has 3.15% of non zero values::
>>> rcv1.target.shape
(804414, 103)
``sample_id``:
Each sample can be identified by its ID, ranging (with gaps) from 2286
to 810596::
>>> rcv1.sample_id[:3]
array([2286, 2287, 2288], dtype=uint32)
``target_names``:
The target values are the topics of each sample. Each sample belongs to at
least one topic, and to up to 17 topics. There are 103 topics, each
represented by a string. Their corpus frequencies span five orders of
magnitude, from 5 occurrences for 'GMIL', to 381327 for 'CCAT'::
>>> rcv1.target_names[:3].tolist() # doctest: +SKIP
['E11', 'ECAT', 'M11']
The dataset will be downloaded from the `rcv1 homepage`_ if necessary.
The compressed size is about 656 MB.
.. _rcv1 homepage: http://jmlr.csail.mit.edu/papers/volume5/lewis04a/
.. topic:: References
.. [1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004).
RCV1: A new benchmark collection for text categorization research.
The Journal of Machine Learning Research, 5, 361-397.
So, if you want to stick to the original training & test subsets, as described above, you should simply do:
X_train = rcv1.data[0:23149,]
X.train.shape
# (23149, 47236)
X_test = rcv1.data[23149:,]
X_test.shape
# (781265, 47236)
and similarly for your y_train and y_test, using rcv1.target.
If you want to use a different training & test partition, use:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
rcv1.data, rcv1.target, test_size=0.33, random_state=42)
adjusting your test_size accordingly.

PySpark: Get Threshold (cuttoff) values for each point in ROC curve

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cuttoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point in this curve. Is there a way to find this values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. It's useful for plotting and for selecting the optimal point, but I can't find the threshold value.
I know I can find the threshold values for each point in the ROC curve using H2O (I've done it before), but I'm working on Pyspark.
Here is a post describing how to do it with R... but, again, I need to do it with Pyspark
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
def __init__(self, *args):
super(CurveMetrics, self).__init__(*args)
def _to_list(self, rdd):
points = []
# Note this collect could be inefficient for large datasets
# considering there may be one probability per datapoint (at most)
# The Scala version takes a numBins parameter,
# but it doesn't seem possible to pass this from Python to Java
for row in rdd.collect():
# Results are returned as type scala.Tuple2,
# which doesn't appear to have a py4j mapping
points += [(float(row._1()), float(row._2()))]
return points
def get_curve(self, method):
rdd = getattr(self._java_model, method)().toJavaRDD()
return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
# Returns as a list (false positive rate, true positive rate)
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.plot(x_val, y_val)
Results in:
Here's an example of an F1 score curve by threshold value if you aren't married to ROC:
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels1:
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds to work with roc_curve
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
Notes:
I am not 100% certain that the probabilities vector will always be ordered such that the positive label will be at index 1. However in a binary classification problem, you'll know right away if your AUC is less than 0.5. In that case, just take 1-p for the probabilities (since the class probabilities sum to 1).

Python Simple Regression Plot

I'm learning python and I want to perform a simple linear regression on a .csv dataset. I've successfully imported the data file. If I have data for 8 five year periods and I want to do simple linear regression how would I do this? The data is by county/state. So my headers are county, state, 1980,1985 etc.. Appreciate any help.
Please specify what target label you have in mind.
Anyways, use the sklearn library and pandas.
val= pd.DataFrame(your.data);
val.columns = your.headers;
Assuming you have a target header called "Price".
from sklearn.linear_model import LinearRegression
X = val.drop('Price',axis=1)
X contains the data on which LR will be performed.
Now create a Linear Regression object.
lm = LinearRegression()
Start the fitting:
lm.fit()
Predict your targets:
lm.predict(x)
That's it.
Almost all real world problems that you are going to encounter will have more than two variables, so let's skip the basic linear regression example. Linear regressions involving multiple variables is called "multiple linear regression". The steps to perform multiple linear regression are almost similar to that of simple linear regression. The difference lies in the evaluation. You can use it to find out which factor has the highest impact on the predicted output and how different variables relate to each other.
In this section we will use multiple linear regression to predict the gas consumptions (in millions of gallons) in 48 US states based upon gas taxes (in cents), per capita income (dollars), paved highways (in miles) and the proportion of population that has a driver’s license.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline
dataset = pd.read_csv('C:/your_Path_here/petrol_consumption.csv')
dataset.head()
dataset.describe()
Result:
Index ... Consumption_In_Millions
count 48.00 ... 48.000000
mean 24.50 ... 576.770833
std 14.00 ... 111.885816
min 1.00 ... 344.000000
25% 12.75 ... 509.500000
50% 24.50 ... 568.500000
75% 36.25 ... 632.750000
max 48.00 ... 968.000000
Preparing the Data
The next step is to divide the data into attributes and labels as we did previously. However, unlike last time, this time around we are going to use column names for creating an attribute set and label. Execute the following script:
X = dataset[['Petrol_Tax', 'Average_Income', 'Paved_Highways', 'ProportionWithLicenses']]
y = dataset['Consumption_In_Millions']
Execute the following code to divide our data into training and test sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Training the Algorithm
And finally, to train the algorithm we execute the same code as before, using the fit() method of the LinearRegression class:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
As said earlier, in case of multivariable linear regression, the regression model has to find the most optimal coefficients for all the attributes. To see what coefficients our regression model has chosen, execute the following script:
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df
Result:
Coefficient
Petrol_Tax -40.016660
Average_Income -0.065413
Paved_Highways -0.004741
ProportionWithLicenses 1341.862121
This means that for a unit increase in "petrol_tax", there is a decrease of 24.19 million gallons in gas consumption. Similarly, a unit increase in proportion of population with a driver’s license results in an increase of 1.324 billion gallons of gas consumption. We can see that "Average_income" and "Paved_Highways" have a very little effect on the gas consumption.
Making Predictions
To make pre-dictions on the test data, execute the following script:
y_pred = regressor.predict(X_test)
To compare the actual output values for X_test with the predicted values, execute the following script:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df
Result:
Actual Predicted
29 534 469.391989
4 410 545.645464
26 577 589.668394
30 571 569.730413
32 577 649.774809
37 704 646.631164
34 487 511.608148
40 587 672.475177
7 467 502.074782
10 580 501.270734
Evaluating the Algorithm
The final step is to evaluate the performance of algorithm. We'll do this by finding the values for MAE, MSE and RMSE. Execute the following script:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Results:
Mean Absolute Error: 56.822247479
Mean Squared Error: 4666.34478759
Root Mean Squared Error: 68.3106491522
You can see that the value of root mean squared error is 60.07, which is slightly greater than 10% of the mean value of the gas consumption in all states. This means that our algorithm was not very accurate but can still make reasonably good predictions.
There are many factors that may have contributed to this inaccuracy, a few of which are listed here:
1. Need more data: Only one year worth of data isn't that much, whereas having multiple years worth could have helped us improve the accuracy quite a bit.
2. Bad assumptions: We made the assumption that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.
3. Poor features: The features we used may not have had a high enough correlation to the values we were trying to predict.
Note: the data set used in this example is available from here.
http://people.sc.fsu.edu/~jburkardt/datasets/regression/x16.txt
Finally, see the two links below for more info on this topic.
https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html

Accelerating the prediction

Prediction with the SVM model created with 5 features and 3000 samples using default parameters is taking unexpectedely longer time (more than hour) with 5 features and 100000 samples. Is there way of accelerating the prediction?
A few issues to consider here:
Have you standardized your input matrix X? SVM is not scale-invariant, so it could be difficult for the algo to do classification if they takes a large number of raw inputs without proper scaling.
The choice of parameter C: Higher C allows a more complicated non-smooth decision boundary and it takes much more time to fit under this complexity. So decreasing the value C from default 1 to a lower value could accelerate the process.
It's also recommended to choose a proper value of gamma. This could be done via Grid-Search-Cross-Validation.
Here is the code to do grid-search cross validation. I ignore the test set here for simplicity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
# generate some artificial data
X, y = make_classification(n_samples=3000, n_features=5, weights=[0.1, 0.9])
# make a pipeline for convenience
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='auto'))
# set up parameter space, we want to tune SVC params C and gamma
# the range below is 10^(-5) to 1 for C and 0.01 to 100 for gamma
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
# choose your customized scoring function, popular choices are f1_score, accuracy_score, recall_score, roc_auc_score
my_scorer = make_scorer(roc_auc_score, greater_is_better=True)
# construct grid search
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
# what's the best estimator
gscv.best_params_
Out[20]: {'svc__C': 1.0, 'svc__gamma': 0.21544346900318834}
# what's the best score, in our case, roc_auc_score
gscv.best_score_
Out[22]: 0.86819366014152421
Note: the SVC is still not running very fast. It takes more than 40s to compute 50 possible combinations of params.
%time gscv.fit(X, y)
CPU times: user 42.6 s, sys: 959 ms, total: 43.6 s
Wall time: 43.6 s
Because the number of features is relatively low, I would start with decreasing the penalty parameter. It controls the penalty for mislabeled samples in the train data, and as your data contains 5 features, I guess it is not exactly linearly separable.
Generally, this parameter (C) allows the classifier to have larger margin on the account of higher accuracy (see this for more information)
By default, C=1.0. Start with svm = SVC(C=0.1) and see how it goes.
One reason might be that the parameter gamma is not the same.
By default sklearn.svm.SVC uses RBF kernel and gamma is 0.0, in which case 1/n_features will be used instead. So gamma is different given different number of features.
In terms of suggestions, I agree with Jianxun's answer.

Resources