Scaling features by using polynomial features

I understand the concept of polynomial regression and the use of PolynomialFeatures in sklearn. But to get a concrete grip on PolynomialFeatures expansion, I want to ask about a scenario: using polynomial features with KNN instead of regression.
This is how we use PolynomialFeatures for regression:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Expand the features, then split the expanded data
poly = PolynomialFeatures(degree=3)
X_T_poly = poly.fit_transform(X_T)
X_train, X_test, y_train, y_test = train_test_split(X_T_poly, y_T, random_state=0)
fin = LinearRegression().fit(X_train, y_train)
Now, what if I do this:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

poly = PolynomialFeatures(degree=3)
X_T_poly = poly.fit_transform(X_T)
X_train, X_test, y_train, y_test = train_test_split(X_T_poly, y_T, random_state=0)
knn = KNeighborsClassifier(n_neighbors=n)  # n: some chosen number of neighbors
knn.fit(X_train, y_train)
In the former we are doing regression (that part I get) and in the latter we are using KNN.
Does it make sense to apply KNN here, and is it right in the first place?
How are the features being transformed in the case of KNN?
Digging into .fit_transform(), I found the following; note that the function has nothing to do with coefficients and intercepts:
def fit_transform(self, X, y=None, **fit_params):
    """Fit to data, then transform it.

    Fits transformer to X and y with optional parameters fit_params
    and returns a transformed version of X.

    Parameters
    ----------
    X : numpy array of shape [n_samples, n_features]
        Training set.
    y : numpy array of shape [n_samples]
        Target values.

    Returns
    -------
    X_new : numpy array of shape [n_samples, n_features_new]
        Transformed array.
    """
    # non-optimized default implementation; override when a better
    # method is possible for a given clustering algorithm
    if y is None:
        # fit method of arity 1 (unsupervised transformation)
        return self.fit(X, **fit_params).transform(X)
    else:
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)
So is it OK to use PolynomialFeatures with KNN as above?
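For reference, here is a minimal sketch of what the expansion actually produces: no coefficients or intercepts, just new feature columns that any estimator, KNN included, can consume.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])               # one sample with features a=2, b=3
poly = PolynomialFeatures(degree=2)  # generates the terms 1, a, b, a^2, ab, b^2
print(poly.fit_transform(X))         # [[1. 2. 3. 4. 6. 9.]]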

Related

Sklearn incorrect support value (number of samples in each class) for classification report

I am fitting an SVM to some data using sklearn. I have 24 samples in total (10 negative, 14 positive).
from sklearn import svm
from sklearn.model_selection import train_test_split

# Set model
clf = svm.SVC(kernel='linear', C=1)

# Create train/test splits and fit the SVM to the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3, stratify=y)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
I stratified by y to keep the class balance in my test set, which seems to have worked; however, the classification report says there are no negative samples.
The signature for classification_report is (y_true, y_pred, ...); you've reversed the inputs.
Here's one of the places where using explicit keyword arguments is a good practice.
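A minimal sketch of the corrected call, reusing the variables from the snippet above:

from sklearn.metrics import classification_report

# Passing by keyword makes the argument order impossible to get wrong
print(classification_report(y_true=y_test, y_pred=y_pred))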

TabNetRegressor not working with reshaped data

I am using the PyTorch implementation of TabNet and cannot figure out why I'm still getting this error. I import the data into a dataframe, use this function to get my X and y, and then do my train-test split:
def get_X_y(df):
    '''This function takes in a dataframe and splits it into the X and y variables.'''
    X = df.drop(['is_goal'], axis=1)
    y = df.is_goal
    return X, y
X,y = get_X_y(df)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
Then I use this to reshape my y_train
y_train.values.reshape(-1,1)
Then create an instance of the model and try to fit it
reg = TabNetRegressor()
reg.fit(X_train, y_train)
and I get this error
ValueError: Targets should be 2D : (n_samples, n_regression) but y_train.shape=(639912,) given.
Use reshape(-1, 1) for single regression.
I understand why I need to reshape it, as this is pretty common, but I cannot understand why it's still giving me this error. I've restarted the kernel in my notebook, so I don't think it's a persistent state issue either.
You have to re-assign it:
y_train = y_train.values.reshape(-1, 1)
reshape returns a new array rather than modifying y_train in place, so without the assignment nothing changes.
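A quick illustration of why the re-assignment matters, using plain numpy:

import numpy as np

y = np.arange(4)      # shape (4,)
y.reshape(-1, 1)      # returns a new (4, 1) array, which is discarded here
print(y.shape)        # still (4,)

y = y.reshape(-1, 1)  # re-assign to keep the reshaped array
print(y.shape)        # now (4, 1)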

using sklearn.train_test_split for Imbalanced data

I have a very imbalanced dataset. I used sklearn's train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I started by counting the number of type1 samples (my dataset has two categories, type1 and type2), but nearly all of my train data is type1, so I can't oversample.
Previously I split the train and test datasets with my own code, which put 0.8 of all type1 data and 0.8 of all type2 data into the train dataset.
How can I use this approach with train_test_split or other splitting methods in sklearn?
*I should only use sklearn or my own written methods.
You're looking for stratification.
There's a stratify parameter in train_test_split to which you can pass the list of labels, e.g.:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.2)
There's also StratifiedShuffleSplit.
It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need, and scikit-learn does not offer the functionality you want, so you will have to implement your own code.
This is what I came up with for my application. Note that I have not had extensive time to debug it, but I believe it works from the testing I have done. Hope it helps:
import numpy as np

def equal_sampler(classes, data, target, test_frac):
    # Find the least frequent class and its fraction of the total
    _, count = np.unique(target, return_counts=True)
    fraction_of_total = min(count) / len(target)

    # Split that fraction further into train and test
    train_frac = (1 - test_frac) * fraction_of_total
    test_frac = test_frac * fraction_of_total

    # Initialize index lists and find the lengths of train and test
    train = []
    train_len = int(train_frac * data.shape[0])
    test = []
    test_len = int(test_frac * data.shape[0])

    # Add values to train, drop them from the index, then add to test
    for i in classes:
        indices = list(target[target == i].index.copy())
        train_temp = np.random.choice(indices, train_len, replace=False)
        for val in train_temp:
            train.append(val)
            indices.remove(val)
        test_temp = np.random.choice(indices, test_len, replace=False)
        for val in test_temp:
            test.append(val)

    # X_train, y_train, X_test, y_test
    return data.loc[train], target[train], data.loc[test], target[test]
For the input, classes expects a list of the possible class values, data expects the dataframe columns used for prediction, and target expects the target column.
Take care that the algorithm may not be extremely efficient due to the nested for-loops (list.remove takes linear time). Despite that, it should be reasonably fast; a hypothetical usage sketch follows.
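A hypothetical usage sketch, assuming a DataFrame df with a binary 'label' column (the names are illustrative, not from the question):

# equal_sampler returns X_train, y_train, X_test, y_test in that order
X_train, y_train, X_test, y_test = equal_sampler(
    classes=[0, 1],
    data=df.drop(columns=['label']),
    target=df['label'],
    test_frac=0.2,
)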
You may also look into StratifiedShuffleSplit, as follows:
# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

creating baseline regression model with average and min values in python

I want to compare the results of my regression analysis with encoded categorical variables against two baseline models, where the baseline predictions are the average or the min values of the groups. I've chosen R-squared and MAE for comparison. Below is a simplified example of my code for illustration. It works in the sense that it gives me an output which I think achieves my goal. Is this the correct and/or best way to do this?
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
df = pd.DataFrame([['a1', 'c1', 10],
                   ['a1', 'c2', 15],
                   ['a1', 'c3', 20],
                   ['a1', 'c1', 15],
                   ['a2', 'c2', 20],
                   ['a2', 'c3', 15],
                   ['a2', 'c1', 20],
                   ['a2', 'c2', 15],
                   ['a3', 'c3', 20],
                   ['a3', 'c3', 15],
                   ['a3', 'c3', 15],
                   ['a3', 'c3', 20]], columns=['aid', 'cid', 'T'])
df_dummies = pd.get_dummies(df, columns=['aid','cid'],prefix_sep='',prefix='')
df_dummies
X = df_dummies
y = df_dummies['T']
# train test split 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
# Baseline model with group average as prediction
y_pred = df.groupby('aid').agg({'T': ['mean']})
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
# Baseline model with group min as prediction
y_pred = df.groupby('aid').agg({'T': ['min']})
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
First of all, I would not reuse the name y_pred all the time, to avoid confusion.
In general:
y_pred = df.groupby('aid').agg({'T': ['mean']})
will give you the mean of 'T' for each group of 'aid'.
And y_pred = df.groupby('aid').agg({'T': ['min']}) will give you the minimum, one aggregated row per group rather than one prediction per test row.
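To make that concrete, printing the aggregation shows one row per group, which does not line up with y_test:

print(df.groupby('aid').agg({'T': ['mean']}))
#          T
#       mean
# aid
# a1   15.0
# a2   17.5
# a3   17.5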
There is an interesting class for you: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html
DummyRegressor is built exactly for this kind of baseline regression and also supports other strategies (mean, median, quantile, constant).
In your case it should work like this:
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

df_dummies = pd.get_dummies(df, columns=['aid', 'cid'], prefix_sep='', prefix='')
X = df_dummies
y = df['T']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

min_value = y_train.min()  # constant used for the min baseline
dummy_min = DummyRegressor(strategy='constant', constant=min_value)
dummy_min.fit(X_train, y_train)
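For the average baseline, the built-in 'mean' strategy works the same way; a sketch reusing the metrics from the question:

from sklearn.dummy import DummyRegressor
from sklearn import metrics

# Baseline that always predicts the mean of y_train
dummy_mean = DummyRegressor(strategy='mean')
dummy_mean.fit(X_train, y_train)
y_pred_mean = dummy_mean.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred_mean))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_mean))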

Using python 3 how to get co-variance/variance

I have a simple linear regression model and I need to compute the variance and the covariance. How can I calculate variance and covariance using linear regression?
Variance, in the context of Machine Learning, is a type of error that occurs due to a model's sensitivity to small fluctuations in the training set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.array([2, 3, 4, 5]).reshape(-1, 1)  # sklearn expects a 2D feature array
y = np.array([4, 3, 2, 9])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train the model using the training set
model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
Try this for the variance and co-variance of the output vector:
y_variance = np.mean((y_predict - np.mean(y_predict))**2)
y_covariance = np.mean(y_predict - y_test)
Note: co-variance here means the mean change of the predictions with respect to their true values; it is not the standard statistical covariance.
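If the standard statistical covariance is what you are after, numpy computes it directly; a minimal sketch using the variables above:

import numpy as np

# np.cov on two 1-D arrays returns the 2x2 sample covariance matrix
cov_matrix = np.cov(y_predict, y_test)
print('variance of predictions:', cov_matrix[0, 0])
print('covariance with true values:', cov_matrix[0, 1])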
