I have an image array (loaded from a npy file) of shape (30000, 128,128,3) and a labels data frame of shape (30000, 1). How can I split these into training, test and validation sets so that I can proceed to build a CNN Model?
You can use scikit-learn for this. If your image array is X and your labels are y, import:
>> from sklearn.model_selection import train_test_split
This function splits a dataset into train and test subsets (for a validation set, call it again on the training subset):
>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
ref : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
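Since train_test_split produces only two subsets per call, a three-way split needs two calls. Here is a minimal sketch, assuming an 80/10/10 split; the small stand-in arrays, the ratios and the second seed are illustrative assumptions, not part of the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small stand-in for the (30000, 128, 128, 3) image array and (30000, 1) labels.
X = np.zeros((200, 8, 8, 3), dtype=np.float32)
y = (np.arange(200) % 2).reshape(-1, 1)

# First call: keep 80% for training, hold out 20%.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, random_state=33
)
# Second call: split the held-out 20% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=33
)

print(X_train.shape, X_val.shape, X_test.shape)
```

The image dimensions are untouched; only the first axis (the samples) is split.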
Happy Coding !!
Related
I am using the PyTorch implementation of tabnet and cannot figure out why I'm still getting this error. I import the data to a dataframe, I use this function to get my X, and y then my train-test split
def get_X_y(df):
    ''' This function takes in a dataframe and splits it into the X and y variables
    '''
    X = df.drop(['is_goal'], axis=1)
    y = df.is_goal
    return X, y
X,y = get_X_y(df)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
Then I use this to reshape my y_train
y_train.values.reshape(-1,1)
Then create an instance of the model and try to fit it
reg = TabNetRegressor()
reg.fit(X_train, y_train)
and I get this error
ValueError: Targets should be 2D : (n_samples, n_regression) but y_train.shape=(639912,) given.
Use reshape(-1, 1) for single regression.
I understand why I need to reshape it as this is pretty common, but I cannot understand why it's still giving me this error. I've restarted the kernel in notebooks so I don't think it's persistence memory issues either.
reshape returns a new array and leaves y_train untouched, so you have to assign the result back:
y_train = y_train.values.reshape(-1,1)
Otherwise, y_train keeps its original 1D shape.
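A quick way to see this, as a small sketch in which a plain NumPy array stands in for y_train.values (the array contents are made up):

```python
import numpy as np

y_train = np.arange(6)               # 1D targets, shape (6,)

y_train.reshape(-1, 1)               # result is discarded, y_train is unchanged
shape_without_assignment = y_train.shape
print(shape_without_assignment)      # (6,)

y_train = y_train.reshape(-1, 1)     # assign the reshaped array back
shape_with_assignment = y_train.shape
print(shape_with_assignment)         # (6, 1)
```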
I have a network with two input branches to a neural network. I want to use sklearn's train_test_split function to split my dataset into train, test and validation set. I know if I have one input array then I can do the split as follows:
from sklearn.model_selection import train_test_split
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X, Y, test_size=0.2)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)
But if I have two inputs, X1 and X2, how can I split the data so that both inputs (and the labels) are split in unison? Insights will be appreciated.
The first thing I can think of is zipping both inputs, using train_test_split, and then separating them again (this assumes X1 and X2 have the same per-sample shape, so that they stack into one array):
X = np.array(list(zip(X1, X2)))
X_train, X_test, y_train, y_test = train_test_split(X, y)
X1_train, X2_train = X_train[:, 0], X_train[:, 1]
X1_test, X2_test = X_test[:, 0], X_test[:, 1]
However, this can consume a lot of memory given the amount of data you have. Another approach, in case you are using TensorFlow, is to implement train_test_split using tf.data.Dataset, check this question
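An alternative that avoids building a combined array: train_test_split accepts any number of arrays with the same first dimension and splits them all with the same shuffled indices. A minimal sketch, where the toy shapes, test_size and random_state are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the two input branches and the labels.
X1 = np.arange(20).reshape(10, 2)        # branch 1: 10 samples, 2 features
X2 = np.arange(100, 130).reshape(10, 3)  # branch 2: 10 samples, 3 features
y = np.arange(10)                        # label i belongs to sample i

# Each input array yields a (train, test) pair, in the order the arrays were passed.
X1_train, X1_test, X2_train, X2_test, y_train, y_test = train_test_split(
    X1, X2, y, test_size=0.2, random_state=0
)

# Recover the original sample index from each branch to check the rows stay aligned.
idx_from_X1 = X1_train[:, 0] // 2
idx_from_X2 = (X2_train[:, 0] - 100) // 3
print((idx_from_X1 == y_train).all(), (idx_from_X2 == y_train).all())
```

Because the arrays are split in one call, X1 and X2 may have different feature shapes.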
I have some label data and I am using the classification ML model (SVM, kNN) to train and test the dataset.
My input features look like:
(442, 443, 0.608923884514436), (444, 443, 0.6418604651162789)
The label looks like:
0, 1
Then I used sklearn to train and test (after splitting the dataset 80% for train and 20% for the test). Code sample is given below:
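from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
classifiers = [
    SVC(),
    KNeighborsClassifier(n_neighbors=5)]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
trainingData = X_train
trainingScores = y_train
for item in classifiers:
    print(item)
    clf = item
    clf.fit(trainingData, trainingScores)
    y_pred = clf.predict(X_test)
    # scikit-learn metrics take the true labels first, then the predictions
    print("Accuracy Score:")
    print(accuracy_score(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))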
The SVC Accuracy Score: 0.6639580602883355
The kNN Accuracy Score: 0.7171690694626475
I can guess that the model is predicting some data correctly. My questions are:
How can I save the prediction data, including the label given by the model, in a CSV file?
Is it possible to use cross-validation here? For example, if I want to apply 5-fold cross-validation, how can I do that?
import pandas as pd
labels_df = pd.DataFrame(y_pred ,columns=["predicted_label"])
labels_df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv',index = False)
For the second question, check this
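For the cross-validation part, a minimal sketch with scikit-learn's cross_val_score and cross_val_predict. The synthetic data, the kNN settings and the output file name are assumptions for illustration, not the asker's data:

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the question's three features and 0/1 labels.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: one accuracy value per fold.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())

# Out-of-fold predictions: each sample is predicted by a model that was not trained on it.
y_pred = cross_val_predict(clf, X, y, cv=5)

# Save true and predicted labels side by side (hypothetical file name).
results = pd.DataFrame({"true_label": y, "predicted_label": y_pred})
out_path = os.path.join(tempfile.gettempdir(), "predictions.csv")
results.to_csv(out_path, index=False)
```

Keeping the true label next to the prediction in the CSV makes it easy to inspect the misclassified rows later.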
I have a very imbalanced dataset. I used sklearn's train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I counted the number of type1 samples (my dataset has 2 categories, type1 and type2), but almost all of my train data are type1, so I can't oversample.
Previously I split the train and test datasets with my own code. In that code, 0.8 of all type1 data and 0.8 of all type2 data went into the train dataset.
How can I do this with the train_test_split function or other splitting methods in sklearn?
*I should use only sklearn or my own written methods.
You're looking for stratification. Why?
train_test_split has a stratify parameter to which you can pass the labels, e.g.:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.2)
There's also StratifiedShuffleSplit.
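To see what stratify does, here is a small sketch with made-up imbalanced labels (the 90/10 class counts and random_state are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# Both subsets keep the original 90/10 class ratio.
train_counts = np.bincount(y_train).tolist()
test_counts = np.bincount(y_test).tolist()
print(train_counts)  # [72, 8]
print(test_counts)   # [18, 2]
```

This is the same "0.8 of each type goes to train" behaviour as the hand-written split described in the question.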
It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need and scikit does not offer the functionality you want. You will want to implement your own code.
This is what I came up with for my application. Note that I have not had extensive time to debug it, but I believe it works from the testing I have done. Hope it helps:
import numpy as np
def equal_sampler(classes, data, target, test_frac):
    # Find the least frequent class and its fraction of the total
    _, count = np.unique(target, return_counts=True)
    fraction_of_total = min(count) / len(target)
    # split further into train and test
    train_frac = (1-test_frac)*fraction_of_total
    test_frac = test_frac*fraction_of_total
    # initialize index arrays and find length of train and test
    train=[]
    train_len = int(train_frac * data.shape[0])
    test=[]
    test_len = int(test_frac* data.shape[0])
    # add values to train, drop them from the index and proceed to add to test
    for i in classes:
        indices = list(target[target ==i].index.copy())
        train_temp = np.random.choice(indices, train_len, replace=False)
        for val in train_temp:
            train.append(val)
            indices.remove(val)
        test_temp = np.random.choice(indices, test_len, replace=False)
        for val in test_temp:
            test.append(val)
    # X_train, y_train, X_test, y_test
    return data.loc[train], target.loc[train], data.loc[test], target.loc[test]
For the input, classes expects a list of possible values, data expects the dataframe columns used for prediction, target expects the target column.
Note that the algorithm is not especially efficient because of the nested loops (list.remove takes linear time inside them). Despite that, it should be reasonably fast.
You may also look into stratified shuffle split as follows:
# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
Everywhere I go I see this code. Need help understanding this.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,testsize = 0.20)
What do X_train, X_test, y_train and y_test mean in this context, and which should I put in fit() and predict()?
As the documentation says, train_test_split splits arrays or matrices into random train and test subsets. You can find it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. I believe the right keyword argument is test_size instead of testsize; it represents the proportion of the dataset to include in the test split if it is a float, or the absolute number of test samples if it is an int.
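To illustrate the float-versus-int behaviour of test_size, a small sketch (the toy arrays and random_state are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 toy samples, 2 features each
y = np.arange(100)

# Float: fraction of the samples that goes to the test split.
X_train_frac, X_test_frac, _, _ = train_test_split(X, y, test_size=0.20, random_state=0)
print(len(X_train_frac), len(X_test_frac))  # 80 20

# Int: absolute number of test samples.
X_train_abs, X_test_abs, _, _ = train_test_split(X, y, test_size=15, random_state=0)
print(len(X_train_abs), len(X_test_abs))    # 85 15
```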
X and y are the sequence of indexables with same length / shape[0], so basically the arrays/lists/matrices/dataframes to be split.
So, all in all, the code splits X and y into random train and test subsets (X_train and X_test for X and y_train and y_test for y). Each test subset should contain 20% of the original array entries as test samples.
You should pass the _train subsets (X_train and y_train) to fit() and X_test to predict(); y_test is then used to score the predictions. Hope this helps~
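Put together, the usual flow looks roughly like this. It is a sketch on synthetic data; the classifier choice and all parameter values are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X and y.
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)              # learn from training features and labels
y_pred = model.predict(X_test)           # predict labels for the unseen features only
accuracy = accuracy_score(y_test, y_pred)  # compare predictions with the held-out labels
print(accuracy)
```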
In simple terms, train_test_split divides your dataset into a training dataset and a validation dataset.
The validation set is used to evaluate a given model.
So in this case the validation dataset gives us an idea of the model's performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
The above line splits the data into 4 parts:
X_train - training dataset
y_train - output (labels) of the training dataset
X_test - validation dataset
y_test - output (labels) of the validation dataset
and test_size = 0.2 means you'll have 20% validation data and 80% training data
Basically, this code splits your data into two parts:
X_train and y_train are used for training
X_test and y_test are for testing
With the test_size parameter you can set the size of the testing data.
After dividing the data into two parts, you fit the model on the training data with the fit() method.