TabNetRegressor not working with reshaped data - python-3.x

I am using the PyTorch implementation of tabnet and cannot figure out why I'm still getting this error. I import the data to a dataframe, I use this function to get my X, and y then my train-test split
def get_X_y(df):
''' This function takes in a dataframe and splits it into the X and y variables
'''
X = df.drop(['is_goal'], axis=1)
y = df.is_goal
return X,y
X,y = get_X_y(df)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
Then I use this to reshape my y_train
y_train.values.reshape(-1,1)
Then create an instance of the model and try to fit it
reg = TabNetRegressor()
reg.fit(X_train, y_train)
and I get this error
ValueError: Targets should be 2D : (n_samples, n_regression) but y_train.shape=(639912,) given.
Use reshape(-1, 1) for single regression.
I understand why I need to reshape it as this is pretty common, but I cannot understand why it's still giving me this error. I've restarted the kernel in notebooks so I don't think it's persistence memory issues either.

You have to re-assign it:
y_train = y_train.values.reshape(-1,1)
Otherwise, it won't change.

Related

using sklearn.train_test_split for Imbalanced data

I have a very imbalanced dataset. I used sklearn.train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I used to count number of type1(my data set has 2 categories and types(type1 and tupe2) but approximately all of my train data are type1. So I cant oversample.
Previously I used to split train test datasets with my written code. In that code 0.8 of all type1 data and 0.8 of all type2 data were in the train dataset.
How I can use this method with train_test_split function or other spliting methods in sklearn?
*I should just use sklearn or my own written methods.
You're looking for stratification. Why?
There's a parameter stratify in method train_test_split to which you can give the labels list e.g. :
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
stratify=y,
test_size=0.2)
There's also StratifiedShuffleSplit.
It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need and scikit does not offer the functionality you want. You will want to implement your own code.
This is what I came up for my application. Note that I have not had extensive time to debug it but I believe it works from the testing I have done. Hope it helps:
def equal_sampler(classes, data, target, test_frac):
# Find the least frequent class and its fraction of the total
_, count = np.unique(target, return_counts=True)
fraction_of_total = min(count) / len(target)
# split further into train and test
train_frac = (1-test_frac)*fraction_of_total
test_frac = test_frac*fraction_of_total
# initialize index arrays and find length of train and test
train=[]
train_len = int(train_frac * data.shape[0])
test=[]
test_len = int(test_frac* data.shape[0])
# add values to train, drop them from the index and proceed to add to test
for i in classes:
indeces = list(target[target ==i].index.copy())
train_temp = np.random.choice(indeces, train_len, replace=False)
for val in train_temp:
train.append(val)
indeces.remove(val)
test_temp = np.random.choice(indeces, test_len, replace=False)
for val in test_temp:
test.append(val)
# X_train, y_train, X_test, y_test
return data.loc[train], target[train], data.loc[test], target[test]
For the input, classes expects a list of possible values, data expects the dataframe columns used for prediction, target expects the target column.
Take care that the algorithm may not be extremely efficient, due to the triple for-loop(list.remove takes linear time). Despite that, it should be reasonably fast.
You may also look into stratified shuffle split as follows:
# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Scaling features by using polynomial features

I understand the concept of polynomial regression and the use of PolynomialFeatures by sklearn.
But to get a concrete hold over the concept of using PolynomialFeature expansion, I just wanted to ask about a scenario.
Here, I want to consider a scenario of polynomial features with KNN instead of regression.
This is how we use polynomialFeatures for regression.
from sklearn.preprocessing import PolynomialFeatures
X_train, X_test, y_train, y_test = train_test_split(X_T, y_T, random_state = 0)
poly = PolynomialFeatures(degree=3)
X_T_poly = poly.fit_transform(X_T)
X_train, X_test, y_train, y_test = train_test_split(X_T_poly, y_T, random_state = 0)
fin = LinearRegression().fit(X_train, y_train)
Now, what if I do this:
from sklearn.preprocessing import PolynomialFeatures
X_train, X_test, y_train, y_test = train_test_split(X_T, y_T, random_state = 0)
poly = PolynomialFeatures(degree=3)
X_T_poly = poly.fit_transform(X_T)
X_train, X_test, y_train, y_test = train_test_split(X_T_poly, y_T, random_state = 0)
knn = KNeighborsClassifier(n_neighbors = n)
knn.fit(X_train, y_train)
In former we are doing regression (I get it. OK) and in later we are using KNN.
Does it make sense to apply KNN here and is right(in the first place)?
How are the features being transformed in case of KNN?
After going into .fit_transform() I found this:
The function has noting to do with coefficients and intercepts.
def fit_transform(self, X, y=None, **fit_params):
"""Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params
and returns a transformed version of X.
Parameters
----------
X : numpy array of shape [n_samples, n_features]
Training set.
y : numpy array of shape [n_samples]
Target values.
Returns
-------
X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
"""
# non-optimized default implementation; override when a better
# method is possible for a given clustering algorithm
if y is None:
# fit method of arity 1 (unsupervised transformation)
return self.fit(X, **fit_params).transform(X)
else:
# fit method of arity 2 (supervised transformation)
return self.fit(X, y, **fit_params).transform(X)
So is it ok to do such practice of using PolynomialFeatures with KNN as above?

How can I fix the train set and test set in python?

Suppose I have a data set with 1000 rows. I want to split it into train and test set. I want to split first 800 row into train set then rest 200 row into test set. Is it possible?
My python test code for train and test splitting is like this:
from sklearn.cross_validation import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20)
There are multiple ways to do this, I will run by a few of them.
Slicing is a powerful method in python and accepts the arguments as data[start:stop:step] in your case if you just want the first 800 copies and your data frame is named as train for input features and Y for output features you can use
X_train = train[0:800]
X_test = train[800:]
y_train = Y[0:800]
y_test = Y[800:]
Iloc function is associated with a dataFrame and is associated with an Index, if your Index is numeric then you can use
X_train = train.iloc[0:800]
X_test = train.iloc[800:]
y_train = Y.iloc[0:800]
y_test = Y.iloc[800:]
If you just have to split the data into two parts, you can even use the df.head() and df.tail() to do it,
X_train = train.head(800)
X_test = train.tail(200)
y_train = Y.head(800)
y_test = Y.tail(200)
There are other ways to do it too, I would recommend using the first method as it is common across multiple datatypes and will also work if you were working with a numpy array. To learn more about slicing I would suggest that you checkout. Understanding slice notation here it is explained for a list, but it works with almost all forms.
You want to set shuffle= False.
from sklearn.cross_validation import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, shuffle = False)

ValueError: Expected 2D array, got 1D array instead:

While practicing Simple Linear Regression Model I got this error,
I think there is something wrong with my data set.
Here is my data set:
Here is independent variable X:
Here is dependent variable Y:
Here is X_train
Here Is Y_train
This is error body:
ValueError: Expected 2D array, got 1D array instead:
array=[ 7. 8.4 10.1 6.5 6.9 7.9 5.8 7.4 9.3 10.3 7.3 8.1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
And this is My code:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
#Spliting the dataset into Training set and Test Set
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#linnear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
Thank you
You need to give both the fit and predict methods 2D arrays. Your x_train and x_test are currently only 1 dimensional. What is suggested by the console should work:
x_train= x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
This uses numpy's reshape to transform your array. For example, x = [1, 2, 3] wopuld be transformed to a matrix x' = [[1], [2], [3]] (-1 gives the x dimension of the matrix, inferred from the length of the array and remaining dimensions, 1 is the y dimension - giving us a n x 1 matrix where n is the input length).
Questions about reshape have been answered in the past, this for example should answer what reshape(-1,1) fully means: What does -1 mean in numpy reshape? (also some of the other below answers explain this very well too)
A lot of times when doing linear regression problems, people like to envision this graph
On the input, we have an X of X = [1,2,3,4,5]
However, many regression problems have multidimensional inputs. Consider the prediction of housing prices. It's not one attribute that determines housing prices. It's multiple features (ex: number of rooms, location, etc. )
If you look at the documentation you will see this
It tells us that rows consist of the samples while the columns consist of the features.
However, consider what happens when he have one feature as our input. Then we need an n x 1 dimensional input where n is the number of samples and the 1 column represents our only feature.
Why does the array.reshape(-1, 1) suggestion work? -1 means choose a number of rows that works based on the number of columns provided. See the image for how it changes in the input.
If you look at documentation of LinearRegression of scikit-learn.
fit(X, y, sample_weight=None)
X : numpy array or sparse matrix of shape [n_samples,n_features]
predict(X)
X : {array-like, sparse matrix}, shape = (n_samples, n_features)
As you can see X has 2 dimensions, where as, your x_train and x_test clearly have one.
As suggested, add:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
Before fitting and predicting the model.
Use
y_pred = regressor.predict([[x_test]])
I would suggest to reshape X at the beginning before you do the split into train and test dataset:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Here is the trick
x = x.reshape(-1,1)
#Spliting the dataset into Training set and Test Set
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#linnear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
This is what I use
X_train = X_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)
This is the solution
regressor.predict([[x_test]])
And for polynomial regression:
regressor_2.predict(poly_reg.fit_transform([[x_test]]))
Modify
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
to
regressor.fit(x_train.values.reshape(-1,1),y_train)
y_pred = regressor.predict(x_test.values.reshape(-1,1))

How to run only one fold of cross validation in sklearn?

I have he following code to run a 10-fold cross validation in SkLearn:
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
scores = model_selection.cross_val_score(MyEstimator(), x_data, y_data, cv=cv, scoring='mean_squared_error') * -1
For debugging purposes, while I am trying to make MyEstimator work, I would like to run only one fold of this cross-validation, instead of all 10. Is there an easy way to keep this code but just say to run the first fold and then exit?
I would still like that data is split into 10 parts, but that only one combination of that 10 parts is fitted and scored, instead of 10 combinations.
No, not with cross_val_score I suppose. You can set n_splits to minimum value of 2, but still that will be 50:50 split of train, test which you may not want.
If you want maintain a 90:10 ration and test other parts of code like MyEstimator(), then you can use a workaround.
You can use KFold.split() to get the first set of train and test indices and then break the loop after first iteration.
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(x_data):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = x_data[train_index], x_data[test_index]
y_train, y_test = y_data[train_index], y_data[test_index]
break
Now use this X_train, y_train to train the estimator and X_test, y_test to score it.
Instead of :
scores = model_selection.cross_val_score(MyEstimator(),
x_data, y_data,
cv=cv,
scoring='mean_squared_error')
Your code becomes:
myEstimator_fitted = MyEstimator().fit(X_train, y_train)
y_pred = myEstimator_fitted.predict(X_test)
from sklearn.metrics import mean_squared_error
# I am appending to a scores list object, because that will be output of cross_val_score.
scores = []
scores.append(mean_squared_error(y_test, y_pred))
Rest assured, cross_val_score will be doing this only internally, just some enhancements for parallel processing.

Resources