Sklearn's train_test_split with two inputs and one output - python-3.x

I have a neural network with two input branches. I want to use sklearn's train_test_split function to split my dataset into train, validation, and test sets. I know that if I have one input array, I can do the split as follows:
from sklearn.model_selection import train_test_split
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X, Y, test_size=0.2)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)
But if I have two inputs, X1 and X2, how can I split the data so that both are split in unison? Insights will be appreciated.

The first thing I can think of is zipping both inputs, using train_test_split, and then separating them again:
X = np.array(list(zip(X1, X2)))
X_train, X_test, y_train, y_test = train_test_split(X, y)
X1_train, X2_train = X_train[:, 0], X_train[:, 1]
X1_test, X2_test = X_test[:, 0], X_test[:, 1]
However, this can consume a lot of memory depending on the amount of data you have. Another approach, in case you are using TensorFlow, is to implement train_test_split using tf.data.Dataset; check this question.
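Worth noting: train_test_split itself accepts any number of index-aligned arrays and applies the same shuffled split to all of them, which keeps the inputs in unison without building the zipped copy. A minimal sketch, assuming X1, X2 and y have the same number of samples:

from sklearn.model_selection import train_test_split

# every array gets the same split, so the two input branches stay aligned
X1_train, X1_test, X2_train, X2_test, y_train, y_test = train_test_split(
    X1, X2, y, test_size=0.2, random_state=42)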

Related

RandomForestClassifier can't give me reproducible results

My issue is that even though I set a random_state for the RandomForestClassifier itself and for the train-test split (although I don't think it is necessary there, because I am working with shuffle=False due to time series data), the results are not reproducible. Please find my code below. I already tried the solution from the following question, but it didn't work: Python sklearn RandomForestClassifier non-reproducible results
Data Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=13)
X_train, X_test, y_train, y_test = np.array(X_train), np.array(X_test), np.array(y_train), np.array(y_test)
print(f"Train and Test Size {len(X_train)}, {len(X_test)}")
Random Forest Classifier
forest = RandomForestClassifier(n_jobs=-1,
                                class_weight=cwts(df),
                                max_depth=5,
                                random_state=random.seed(1234))
forest.fit(X_train, y_train)
My y-variable is 1 or 0 for the time series data, because I am programming a trading strategy that can only go flat or long. Furthermore, in the next step I am using the BorutaPy wrapper, and when looking for the best possible features it always returns different ones, because the RandomForestClassifier isn't deterministic. Does any of you know the solution to this issue?
The function random.seed (like numpy.random.seed) seeds a global random number generator but returns None, so random_state=random.seed(1234) is equivalent to random_state=None and you haven't actually set a fixed seed for consecutive runs of the classifier. Just use an integer for random_state.
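A minimal sketch of the fix (reusing X, y and cwts(df) from the question):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# shuffle=False keeps the time-series order, so random_state has no effect on the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

forest = RandomForestClassifier(n_jobs=-1,
                                class_weight=cwts(df),  # as defined in the question
                                max_depth=5,
                                random_state=1234)      # plain integer, not random.seed(1234)
forest.fit(X_train, y_train)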

Split image array and labels dataframe into train, test and validation sets

I have an image array (loaded from an npy file) of shape (30000, 128, 128, 3) and a labels data frame of shape (30000, 1). How can I split these into training, test, and validation sets so that I can proceed to build a CNN model?
You can use the sklearn package. If your image array is X and your labels are Y, use:
from sklearn.model_selection import train_test_split
This function splits a dataset into train and test sets; call it a second time on the hold-out portion to also get a validation set (see the sketch below):
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=33)
ref : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Happy Coding !!
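To also carve out the validation set asked for in the question, a minimal sketch (assuming X is the (30000, 128, 128, 3) image array and Y the labels) is to call train_test_split twice:

from sklearn.model_selection import train_test_split

# first hold out 20% of the data ...
X_train, X_tmp, y_train, y_tmp = train_test_split(X, Y, test_size=0.20, random_state=33)
# ... then split the hold-out 50/50 into validation and test (10% of the total each)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=33)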

How to split a dataset into train, validate, and test sets correctly, in a simple, clear way?

I have a dataset with 100 samples. I want to split it into 75%, 25%, 25% for Train, Validate, and Test respectively, and then I want to do that again with different ratios such as 80%, 10%, 10%.
For this purpose, I was using the code below, but I think it's not splitting the data correctly in the second step, because the second call takes its 15% out of the remaining 85%, so the final fractions end up being (85% x 85%) for train and (85% x 15%) for test.
My question is that:
Is there a nice, clear way to do the splitting correctly for any given ratios?
from sklearn.model_selection import train_test_split
# Split Train Test Validate
X_, X_val, Y_, Y_val = train_test_split(X, Y, test_size=0.15, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X_, Y_, test_size=0.15, random_state=42)
You could always do it manually. It's a bit messy, but you can create a function:
import numpy as np

def my_train_test_split(X, y, ratio_train, ratio_val, seed=42):
    # shuffle the sample indices reproducibly
    idx = np.arange(X.shape[0])
    np.random.seed(seed)
    np.random.shuffle(idx)
    # cut points for the train and validation portions
    limit_train = int(ratio_train * X.shape[0])
    limit_val = int((ratio_train + ratio_val) * X.shape[0])
    idx_train = idx[:limit_train]
    idx_val = idx[limit_train:limit_val]
    idx_test = idx[limit_val:]
    X_train, y_train = X[idx_train], y[idx_train]
    X_val, y_val = X[idx_val], y[idx_val]
    X_test, y_test = X[idx_test], y[idx_test]
    return X_train, X_val, X_test, y_train, y_val, y_test
The test ratio is assumed to be 1 - (ratio_train + ratio_val).
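For example, to get the 80% / 10% / 10% split mentioned in the question (the toy arrays below are made up just to show the call):

import numpy as np

X = np.random.rand(100, 5)              # 100 samples, 5 features
y = np.random.randint(0, 2, size=100)   # binary labels

X_train, X_val, X_test, y_train, y_val, y_test = my_train_test_split(X, y, 0.8, 0.1)
print(len(X_train), len(X_val), len(X_test))  # 80 10 10

If you prefer to stay with two chained train_test_split calls instead, the second call needs test_size=0.5 so that the 20% remainder ends up as 10% validation and 10% test.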

Using sklearn.train_test_split for imbalanced data

I have a very imbalanced dataset. I used sklearn's train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I counted the number of type1 samples (my dataset has two categories, type1 and type2), but nearly all of my train data is type1, so I can't oversample.
Previously I split the train and test datasets with my own code, in which 0.8 of all type1 data and 0.8 of all type2 data ended up in the train dataset.
How can I do this with the train_test_split function or other splitting methods in sklearn?
*I should only use sklearn or my own written methods.
You're looking for stratification, which preserves the class proportions in each split.
There's a stratify parameter in train_test_split to which you can pass the list of labels, e.g.:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     stratify=y,
                                                     test_size=0.2)
There's also StratifiedShuffleSplit.
It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need, and scikit-learn does not offer all the functionality you want. You will want to implement your own code.
This is what I came up with for my application. Note that I have not had extensive time to debug it, but I believe it works from the testing I have done. Hope it helps:
import numpy as np

def equal_sampler(classes, data, target, test_frac):
    # find the least frequent class and its fraction of the total
    _, count = np.unique(target, return_counts=True)
    fraction_of_total = min(count) / len(target)

    # split that fraction further into train and test
    train_frac = (1 - test_frac) * fraction_of_total
    test_frac = test_frac * fraction_of_total

    # initialize index lists and find the length of train and test per class
    train = []
    train_len = int(train_frac * data.shape[0])
    test = []
    test_len = int(test_frac * data.shape[0])

    # add values to train, drop them from the index, and proceed to add to test
    for i in classes:
        indices = list(target[target == i].index.copy())
        train_temp = np.random.choice(indices, train_len, replace=False)
        for val in train_temp:
            train.append(val)
            indices.remove(val)
        test_temp = np.random.choice(indices, test_len, replace=False)
        for val in test_temp:
            test.append(val)

    # X_train, y_train, X_test, y_test
    return data.loc[train], target[train], data.loc[test], target[test]
For the input, classes expects a list of the possible class values, data expects the dataframe columns used for prediction, and target expects the target column.
Take care that the algorithm may not be extremely efficient, due to the triple for-loop (list.remove takes linear time). Despite that, it should be reasonably fast.
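A hypothetical usage sketch (the DataFrame and column names below are made up for illustration; data and target are assumed to be a pandas DataFrame and Series, since the function indexes them with .loc and .index):

import numpy as np
import pandas as pd

# toy imbalanced data: 90 samples of class 0, 10 samples of class 1
df = pd.DataFrame({
    "feat1": np.random.rand(100),
    "feat2": np.random.rand(100),
    "label": [0] * 90 + [1] * 10,
})

X_train, y_train, X_test, y_test = equal_sampler(
    classes=[0, 1],
    data=df[["feat1", "feat2"]],
    target=df["label"],
    test_frac=0.2,
)
print(y_train.value_counts())  # both classes appear the same number of times
print(y_test.value_counts())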
You may also look into stratified shuffle split as follows:
# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

# each split preserves the class proportions of y
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Specify Index Range on Train Split SciKit-Learn

I am trying to wrap my head around how to use the last 30% of the entries in the dataset as the test samples. Nothing random (intentionally). Is this possible?
Split dataset into train / test:
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.3,random_state=0)
Is it possible to explicitly control the split in such a manner that the test split only selects entries from the end of the dataset?
You will achieve your goal if you substitute the line:
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.3,random_state=0)
with:
idx_train = int((1-.3)* x.shape[0]) # train is (1-.3) of your data
x_train = x[:idx_train,:]
x_test = x[idx_train:, :]
y_train = y[:idx_train]
y_test = y[idx_train:]
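Alternatively, train_test_split itself can produce the same deterministic tail split if you pass shuffle=False:

from sklearn.model_selection import train_test_split

# with shuffle=False the first 70% of rows become train and the last 30% become test, in order
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, shuffle=False)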
