I am trying to wrap my head around using the last 30% of the entries in the dataset as the test samples, with nothing random about the selection (it is intentional). Is this possible?
Split dataset into train / test:
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.3, random_state=0)
Is it possible to explicitly control the split in such a manner that the test split only selects entries from the end of the dataset?
You will achieve your goal if you substitute the line:
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.3, random_state=0)
with:
idx_train = int((1 - 0.3) * x.shape[0])  # train is the first 70% of the data
x_train = x[:idx_train, :]
x_test = x[idx_train:, :]
y_train = y[:idx_train]
y_test = y[idx_train:]
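Alternatively, train_test_split itself can do this if you disable shuffling; a minimal sketch, assuming scikit-learn 0.19 or later:

from sklearn import model_selection

# shuffle=False keeps the rows in their original order, so the
# test set is exactly the last 30% of the dataset
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.3, shuffle=False)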
I have a neural network with two input branches. I want to use sklearn's train_test_split function to split my dataset into train, test, and validation sets. I know that if I have one input array, I can do the split as follows:
from sklearn.model_selection import train_test_split
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X, Y, test_size=0.2)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)
But if I have two inputs, X1 and X2, how can I split the data so that both are split in unison? Insights will be appreciated.
The first thing I can think of is zipping both inputs, using train_test_split, and then separating them:
import numpy as np

X = np.array(list(zip(X1, X2)))
X_train, X_test, y_train, y_test = train_test_split(X, y)
X1_train, X2_train = X_train[:, 0], X_train[:, 1]
X1_test, X2_test = X_test[:, 0], X_test[:, 1]
However, this can consume a lot of memory, depending on the amount of data you have. Another approach, in case you are using TensorFlow, is to implement train_test_split using tf.data.Dataset; check this question.
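A simpler option is to rely on the fact that train_test_split accepts any number of arrays and splits them all with the same indices, so the inputs stay aligned; a minimal sketch:

from sklearn.model_selection import train_test_split

# passing X1, X2 and y together guarantees they are split in unison
X1_train, X1_test, X2_train, X2_test, y_train, y_test = train_test_split(
    X1, X2, y, test_size=0.2)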
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, shuffle=False)
This is how I split the data.
I am trying to check for overfitting or underfitting, so I want to compare the train error to the test error.
I calculate the test error with metrics.mean_squared_error(Y_test, Y_pred).
How can I find the train error? Is there a way using sklearn?
I tried metrics.mean_squared_error(Y_train, Y_pred), but I get an error because the sets have different sizes, which I do understand.
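The train error compares the training targets with predictions made on the training inputs; Y_pred was predicted from X_test, so it cannot be compared with Y_train. A minimal sketch, assuming a fitted estimator named model:

from sklearn import metrics

# predict on the training inputs, then compare with the training targets
Y_train_pred = model.predict(X_train)
train_error = metrics.mean_squared_error(Y_train, Y_train_pred)

# test error, as already computed in the question
test_error = metrics.mean_squared_error(Y_test, Y_pred)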
I have a dataset with 100 samples. I want to split it into 70%, 15%, and 15% for Train, Validate, and Test respectively, and then I want to do it again with different ratios, such as 80%, 10%, 10%.
For this purpose, I was using the code below, but I think it is not splitting the data correctly on the second step, because the second split takes its 15% from the remaining 85%, leaving 85% x 85% for train and 85% x 15% for test.
My question is: is there a nice, clear way to do the splitting correctly for any given ratios?
from sklearn.model_selection import train_test_split
# Split Train Test Validate
X_, X_val, Y_, Y_val = train_test_split(X, Y, test_size=0.15, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X_, Y_, test_size=0.15, random_state=42)
You could always do it manually. It is a bit messy, but you can create a function:
import numpy as np

def my_train_test_split(X, y, ratio_train, ratio_val, seed=42):
    # shuffle the row indices, then carve them into three contiguous chunks
    idx = np.arange(X.shape[0])
    np.random.seed(seed)
    np.random.shuffle(idx)
    limit_train = int(ratio_train * X.shape[0])
    limit_val = int((ratio_train + ratio_val) * X.shape[0])
    idx_train = idx[:limit_train]
    idx_val = idx[limit_train:limit_val]
    idx_test = idx[limit_val:]
    X_train, y_train = X[idx_train], y[idx_train]
    X_val, y_val = X[idx_val], y[idx_val]
    X_test, y_test = X[idx_test], y[idx_test]
    return X_train, X_val, X_test, y_train, y_val, y_test
The test ratio is assumed to be 1 - (ratio_train + ratio_val).
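If you would rather keep using train_test_split, the fix is to rescale the second ratio by what remains after the first split; a minimal sketch for an 80% / 10% / 10% split:

from sklearn.model_selection import train_test_split

# first split: hold out 10% of all data as the test set
X_, X_test, Y_, Y_test = train_test_split(X, Y, test_size=0.10, random_state=42)

# second split: 10% of the original data is 0.10 / 0.90 of the remaining 90%
X_train, X_val, Y_train, Y_val = train_test_split(
    X_, Y_, test_size=0.10 / 0.90, random_state=42)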
I am preparing input to feed into a Keras neural network for a multiclass problem as:
encoder = LabelEncoder()
encoder.fit(y)
encoded_Y = encoder.transform(y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
X_train, X_test, y_train, y_test = train_test_split(X, dummy_y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.06, random_state=42)
After having trained the model, I try to run the following lines to obtain a prediction that reflects the original class names:
y_pred = model.predict_classes(X_test)
y_pred = encoder.inverse_transform(y_pred)
y_test = np.argmax(y_test, axis = 1)
y_test = encoder.inverse_transform(y_test)
However, I obtain a surprisingly low accuracy (0.36), as opposed to training and validation, which reach 0.98. Is this the right way of transforming classes back into the original labels?
I compute accuracies as:
# For training
history.history['acc']
# For testing
accuracy_score(y_test, y_pred)
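The decoding itself is symmetric: predict_classes returns integer class indices, and np.argmax recovers the same kind of indices from the one-hot targets, so applying encoder.inverse_transform to both sides is consistent. A quick sanity check, assuming y_test still holds the one-hot targets from the split, is to score the integer labels before decoding; if that accuracy is also 0.36, the gap comes from the model or the data split, not from the label transformation:

import numpy as np
from sklearn.metrics import accuracy_score

# integer class indices straight from the model and from the one-hot targets
y_pred_int = model.predict_classes(X_test)
y_test_int = np.argmax(y_test, axis=1)

# if this matches the accuracy after inverse_transform, the decoding is fine
print(accuracy_score(y_test_int, y_pred_int))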
I have the following code to run a 10-fold cross-validation in sklearn:
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
scores = model_selection.cross_val_score(MyEstimator(), x_data, y_data, cv=cv, scoring='neg_mean_squared_error') * -1
For debugging purposes, while I am trying to make MyEstimator work, I would like to run only one fold of this cross-validation, instead of all 10. Is there an easy way to keep this code but just say to run the first fold and then exit?
I would still like the data to be split into 10 parts, but with only one of those combinations fitted and scored, instead of all 10.
No, not with cross_val_score, I suppose. You can set n_splits to its minimum value of 2, but that will still be a 50:50 train/test split, which you may not want.
If you want to maintain a 90:10 ratio and test other parts of the code, like MyEstimator(), then you can use a workaround.
You can use KFold.split() to get the first set of train and test indices and then break the loop after first iteration.
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(x_data):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = x_data[train_index], x_data[test_index]
    y_train, y_test = y_data[train_index], y_data[test_index]
    break
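Since cv.split returns a generator, the same thing can be written without the loop; a small sketch:

# next() pulls just the first fold's indices from the generator
train_index, test_index = next(cv.split(x_data))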
Now use this X_train, y_train to train the estimator and X_test, y_test to score it.
Instead of:
scores = model_selection.cross_val_score(MyEstimator(),
                                         x_data, y_data,
                                         cv=cv,
                                         scoring='neg_mean_squared_error')
Your code becomes:
myEstimator_fitted = MyEstimator().fit(X_train, y_train)
y_pred = myEstimator_fitted.predict(X_test)
from sklearn.metrics import mean_squared_error
# append to a scores list, because a list of scores is what cross_val_score outputs
scores = []
scores.append(mean_squared_error(y_test, y_pred))
Rest assured, cross_val_score does essentially this internally, just with some enhancements for parallel processing.