Getting same feature transformation via PCA for test set fails - python-3.x

In an ML project you first separate out your train and test data set and you carry out all your transformation on the train data set to to make sure information leakage doesn't take place. To be more precise:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)
Once you done the above you carry out all your:
OverSampling, UnderSampling Scale Dimensional Reduction using X_train and y_train
example: from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
pca = PCA(n_components = 0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
I was under the impression that creating the PCA object from from sklearn.decomposition import PCA and first call on the train data set:
pca = PCA(n_components = 0.95)
pca.fit_transform(X_train)
and then on the test set
pca.transform(X_test)
Would or should get me the same dimensions the model was trained but unfortunately when I try calculate test error - model unseen data. I get the following:
X has 39 features, but DecisionTreeClassifier is expecting 10 features as input
Which is really puzzling to me because using the same PCA object it should transform the X_test into exact same dimensions. What am I missing here?
This is how the test error been calculated:
y_pred = tree_model.predict(X_test)
y_pred = tree_model.predict_proba(X_test)[:, 1]

I had this problem for the longest, most frustrating time and a restart to the computer and thus kernel solved it!

Related

evaluate scaled RMSE

I'm new to machine learning and wanted to understand how to evaluate the RMSE when the data is scaled.
I used the California housing dataset and trained it with SVR:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X = housing["data"]
y = housing["target"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
I then scaled the data for the SVR and trained the model:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
from sklearn.svm import LinearSVR
lin_svr = LinearSVR(random_state=42)
lin_svr.fit(X_train_scaled, y_train)
When I wanted to evaluate the RMSE the result was scaled so it didn't make a lot of sense to me:
from sklearn.metrics import mean_squared_error
y_pred = lin_svr.predict(X_train_scaled)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
rmse was 0.976993881287582
How do I make sense of the result? (the y column is in tens of thousands of dollars)
I tried to y_pred by unscaling the data but the result did not make sense:
y_pred = lin_svr.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
np.sqrt(mse)
So the question is, how do I interpret the RMSE when the data is scaled and is there a correct way to unscale it in order to make sense of it
Thanks!
Here you don't scale the target variable, so the unit of the rmse is just the same as the target variable. Because the target variable is in units of 100,000 dollars, rmse a measuring to define the difference between observed and predicted data. That means rmse = 0.976993881287582 => 97,699 dollars.

using sklearn.train_test_split for Imbalanced data

I have a very imbalanced dataset. I used sklearn.train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I used to count number of type1(my data set has 2 categories and types(type1 and tupe2) but approximately all of my train data are type1. So I cant oversample.
Previously I used to split train test datasets with my written code. In that code 0.8 of all type1 data and 0.8 of all type2 data were in the train dataset.
How I can use this method with train_test_split function or other spliting methods in sklearn?
*I should just use sklearn or my own written methods.
You're looking for stratification. Why?
There's a parameter stratify in method train_test_split to which you can give the labels list e.g. :
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.2)
There's also StratifiedShuffleSplit.
It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need and scikit does not offer the functionality you want. You will want to implement your own code.
This is what I came up for my application. Note that I have not had extensive time to debug it but I believe it works from the testing I have done. Hope it helps:
def equal_sampler(classes, data, target, test_frac):
# Find the least frequent class and its fraction of the total
_, count = np.unique(target, return_counts=True)
fraction_of_total = min(count) / len(target)
# split further into train and test
train_frac = (1-test_frac)*fraction_of_total
test_frac = test_frac*fraction_of_total
# initialize index arrays and find length of train and test
train=[]
train_len = int(train_frac * data.shape[0])
test=[]
test_len = int(test_frac* data.shape[0])
# add values to train, drop them from the index and proceed to add to test
for i in classes:
indeces = list(target[target ==i].index.copy())
train_temp = np.random.choice(indeces, train_len, replace=False)
for val in train_temp:
train.append(val)
indeces.remove(val)
test_temp = np.random.choice(indeces, test_len, replace=False)
for val in test_temp:
test.append(val)
# X_train, y_train, X_test, y_test
return data.loc[train], target[train], data.loc[test], target[test]
For the input, classes expects a list of possible values, data expects the dataframe columns used for prediction, target expects the target column.
Take care that the algorithm may not be extremely efficient, due to the triple for-loop(list.remove takes linear time). Despite that, it should be reasonably fast.
You may also look into stratified shuffle split as follows:
# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Prediction with linear regression is very inaccurate

This is the csv that im using https://gist.github.com/netj/8836201 currently, im trying to predict the variety which is categorical data with linear regression but somehow the prediction is very very inaccurate. While you know, the actual label is just combination of 0.0 and 1. but the prediction is 0.numbers and 1.numbers even with minus numbers which in my opinion is very inaccurate, what part did i make the mistake and what is the solution for this inaccuracy? this is the assignment my teacher gave me, he said we could predict the categorical data with linear regression not only logistic regression
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn import metrics
path= r"D:\python projects\iris.csv"
df = pd.read_csv(path)
array = df.values
X = array[:,0:3]
y = array[:,4]
le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder(categorical_features=[0])
y = le.fit_transform(y)
y = y.reshape(-1,1)
y = ohe.fit_transform(y).toarray()
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=0)
sc = preprocessing.StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = sc.fit_transform(y_train)
model = LinearRegression(n_jobs=-1).fit(X_train, y_train)
y_pred = model.predict(X_test)
df = pd.DataFrame({'Actual': X_test.flatten(), 'Predicted': y_pred.flatten()})
the output :
y_pred
Out[46]:
array([[-0.08676055, 0.43120144, 0.65555911],
[ 0.11735424, 0.72384335, 0.1588024 ],
[ 1.17081347, -0.24484483, 0.07403136],
X_test
Out[61]:
array([[-0.09544771, -0.58900572, 0.72247648],
[ 0.14071157, -1.98401928, 0.10361279],
[-0.44968663, 2.66602591, -1.35915595],
Linear Regression is used to predict continuous output data. As you correctly said, you are trying to predict categorical (discrete) output data. Essentially, you want to be doing classification instead of regression - linear regression is not appropriate for this.
As you also said, logistic regression can and should be used instead as it is applicable to classification tasks.

Using python 3 how to get co-variance/variance

I have a simple linear regression model and i need to count the variance and the co-variance. How to calculate variance and co-variance using linear regression ?
Variance, in the context of Machine Learning, is a type of error that occurs due to a model's sensitivity to small fluctuations in the training set.
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([2,3,4,5])
y = np.array([4,3,2,9] )
#train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
# Train the model using the training sets
model = LinearRegression()
model.fit(x_train, y_train)
y_predict = model.predict(X_predict)
Try this for the output vector that you get for variance and co-variance:
y_variance = np.mean((y_predict - np.mean(y_predict))**2)
y_covariace = np.mean(y_predict - y_true_values)
Note: Co-variance here is mean of change of predictions with respect to there true values.

SVR predict average value but train perfectly

I have been using SVR to predict the value of a time series. My dataset is split into two which is train and test and using SVR with RBF kernel to predict the test dataset. While SVR has been perfectly modeled the train data set but always predict the average value of the test dataset.
Have been trying StandardScaller, Normalization and so on but always failed.
here is my code
X = np.array(x).reshape(-1,1)
Y = np.array(y).reshape(-1,1)
sc_y = StandardScaler()
Y = sc_y.fit_transform(Y)
Y = np.array(Y).ravel()
# Fit regression model
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.4, random_state=0, shuffle=False)
from sklearn.model_selection import cross_val_score
svr_rbf = SVR(kernel='rbf', C=10, gamma=9.9999999999999995e-08, epsilon=0.1)
print(X_train.shape)
svr_rbf.fit(X_train, Y_train)
y_rbf = svr_rbf.predict(X_train)
y_rbf1 = svr_rbf.predict(X_test)
and here is my result
The prediction is at the end where a constant value is shown.
Do you know what should I do to make the prediction better?

Resources