Found input variables with inconsistent numbers of samples: [799996, 199999]

I am splitting a single DataFrame, so why is it reporting an inconsistent number of samples between X_train and X_test (if that is what the error means)?
X_train, X_test = train_test_split(df[categorical_cols + numeric_cols], test_size=0.2, random_state=4)
regression = LinearRegression().fit(X_train, X_test)
regression.score(X)

In your example, the method will do something roughly equivalent to the following:
Generate a random number between 0 and 1 for each record
Put records where the random number is below .2 in the test set
Put the rest in the training set
There is some randomness in how many records end up in each set, because the fraction of random numbers below .2 won't be exactly 20% every time.
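The error itself comes from the fit call: LinearRegression.fit(X, y) requires X and y to have the same number of rows, but X_train and X_test have different lengths by construction. A minimal sketch of a pattern that avoids it (the 'target' column name is hypothetical, and categorical columns would still need encoding before a LinearRegression):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split features and target together so the row counts stay aligned.
X = df[categorical_cols + numeric_cols]   # features (categoricals need encoding first)
y = df['target']                          # hypothetical target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# fit() expects (features, target) with the same number of rows;
# passing (X_train, X_test) mixes two differently sized arrays,
# which is exactly what triggers the "inconsistent numbers of samples" error.
regression = LinearRegression().fit(X_train, y_train)
print(regression.score(X_test, y_test))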


What is the difference between train and test

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt  # needed for the scatter plot below

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=87)
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)
Can someone explain this code to me? What is the difference between train and test, and what do [:, 0] and [:, 1] refer to?
train_test_split() divides the data X (independent variables) and y (dependent variable) into an 80/20 split (train_size = 0.8, test_size = 0.2; you only need to specify test_size). For example, if your dataset consists of 100 rows, then x_train and y_train will consist of 80 random rows from the original 100, and the remaining 20 rows make up x_test and y_test. By setting random_state, you ensure that the random split is reproducible whenever you use the same number. Splitting into separate train and test sets is what prevents data leakage or spill-over during model training. More info can be found in the scikit-learn documentation.
The plot call then creates a scatterplot where the first column of x_train supplies the x-axis coordinates and the second column supplies the y-axis coordinates; the data points are coloured by their y_train values.
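A self-contained sketch of that plot, using generated data (make_blobs is just an assumption so there is something to plot):
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, Y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=87)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=87)

# column 0 on the x-axis, column 1 on the y-axis, coloured by label
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)
plt.show()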

GridSearchCV Pipeline MultiOutputClassifier with XGBClassifier - how to pass early_stopping_rounds and eval_set?

I want to do multioutput prediction of labels and continuous data. My data consists of time series: 10 time points of 30 observables per sample. From this I want to predict 10 binary labels and 5 continuous values.
For the sake of simplicity I have flattened the time series data - ending up with one row per sample.
Since there are many labels to predict about the same system, and since relationships exist between them, I want to use multi-output prediction to do so. My idea is to divide the task into two parts: one for MultiOutputClassification, another for MultiOutputRegression.
I generally like XGBoost and wish to use it for this task, but of course I want to prevent overfitting. I have the following piece of code, and I wish to pass early_stopping_rounds to the fit method of the XGBClassifier, but don't know how:
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
import numpy as np
import pandas as pd

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
pipeline = Pipeline([
    ('imputer', SimpleImputer()),  # XGBoost can deal with NaNs, but MultiOutputClassifier cannot
    ('classifier', MultiOutputClassifier(XGBClassifier()))
])
param_grid = dict(
    classifier__estimator__n_estimators=[100],  # this works
    # classifier__estimator__early_stopping_rounds=[30],  # needs to be passed to .fit
    # classifier__estimator__scale_pos_weight=[scale_pos_weight],  # XGBoostError: Invalid Parameter format for scale_pos_weight expect float
)
clf = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='roc_auc', refit='roc_auc', cv=5, n_jobs=-1)
clf.fit(X_train, y_train[CLASSIFICATION_LABELS])
y_hat_proba = np.array(clf.predict_proba(X_test))
y_hat = pd.DataFrame(np.array([y_hat_proba[:, i, 0] for i in range(y_hat_proba.shape[1])]), columns=CLASSIFICATION_LABELS)
auc_roc_scores = np.array([roc_auc_score(y_test[label], (y_hat[label] > 0.5).astype(int)) for label in y_hat.columns])
print(f'average ROC AUC score: {np.mean(auc_roc_scores).round(3)}+/-{np.std(auc_roc_scores).round(3)}')
>>> average ROC AUC score: 0.499+/-0.002
I tried passing it to fit as follows:
classifier__estimator__early_stopping_rounds=30
classifier__early_stopping_rounds=30
I get ROC AUC scores of 0.5 on the labels, which means this clearly isn't working; hence why I want to pass the early_stopping_rounds parameter and the eval_set. I suppose being able to pass scale_pos_weight could also be useful, but that probably doesn't work for multi-output prediction. At the moment I get the feeling that this is not the way to solve this, and if you agree I would appreciate alternative suggestions.
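For reference, a hedged sketch of how fit-time arguments can be routed through a plain (single-output) Pipeline; X_tr, y_tr, X_val and y_val are hypothetical, and the constructor-level early_stopping_rounds requires a reasonably recent XGBoost (1.6+):
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    # XGBoost >= 1.6 accepts early_stopping_rounds in the constructor,
    # so it survives cloning during a grid search.
    ('classifier', XGBClassifier(n_estimators=1000, early_stopping_rounds=30)),
])

# Fit-time arguments use the '<step>__<param>' routing syntax; note that
# eval_set is NOT passed through the imputer, so it must be clean already.
pipe.fit(X_tr, y_tr, classifier__eval_set=[(X_val, y_val)])
With MultiOutputClassifier this routing breaks down, because each internal estimator is fit on a single label column while a shared eval_set would contain all labels at once; that mismatch is one reason early stopping is hard to use in the multi-output case.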

imbalanced-learn "balanced_batch_generator" for > 2-dimensional data

I am using imbalanced-learn's balanced_batch_generator to try to perform undersampling on an image array that has 4 dimensions. I ran the code below:
training_generator, steps_per_epoch = balanced_batch_generator(x_train, y_train, sampler=NearMiss(), batch_size=10)
and got the following error:
ValueError: Found array with dim 4. Estimator expected <= 2.
I am aware that this function does not accept > 2-dimensional data, but I am wondering if there is a workaround. I could perform the under/oversampling myself by manually splitting the data, but I want to make use of imbalanced-learn's nicely implemented samplers such as NearMiss to intelligently sample my data.
You need to reshape x_train:
x_train.shape  # so you know the values of all 4 dims (1st dim, 2nd dim, 3rd dim, 4th dim)
orig_shape = x_train.shape  # save them for later
x_train = x_train.reshape(x_train.shape[0], -1)
Then use the balanced_batch_generator. The batches it yields are still 2-dimensional, so reshape each one back afterwards:
for x_batch, y_batch in training_generator:
    x_batch = x_batch.reshape(-1, *orig_shape[1:])  # restore the original 2nd, 3rd and 4th dims
    break
Now x_batch contains a balanced batch with its initial dimensions. You don't need to reshape y_train because it only consists of the labels (dim <= 2).
Rather than typing in the dimension values every time, save them in a variable, as done with orig_shape above.
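Putting it together, a minimal end-to-end sketch with dummy data (the shapes and imbalance ratio are made up for illustration, and the imblearn.keras variant of the generator is assumed):
import numpy as np
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import NearMiss

x_train = np.random.rand(100, 32, 32, 3)            # 4-dimensional image array
y_train = np.array([0] * 90 + [1] * 10)             # imbalanced binary labels

orig_shape = x_train.shape
x_flat = x_train.reshape(orig_shape[0], -1)         # flatten to 2 dimensions

training_generator, steps_per_epoch = balanced_batch_generator(
    x_flat, y_train, sampler=NearMiss(), batch_size=10)

x_batch, y_batch = next(training_generator)         # batches come out flattened...
x_batch = x_batch.reshape(-1, *orig_shape[1:])      # ...so restore the image dims
print(x_batch.shape)                                # (10, 32, 32, 3)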

What does this code mean? (train_test_split, scikit-learn)

Everywhere I go I see this code. I need help understanding it.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,testsize = 0.20)
What do X_train, X_test, y_train and y_test mean in this context, and which should I put in fit() and predict()?
As the documentation says, what train_test_split does is split arrays or matrices into random train and test subsets. You can find it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. I believe the right keyword argument is test_size instead of testsize; it represents the proportion of the dataset to include in the test split if it is a float, or the absolute number of test samples if it is an int.
X and y are sequences of indexables with the same length / shape[0], so basically the arrays/lists/matrices/dataframes to be split.
So, all in all, the code splits X and y into random train and test subsets (X_train and X_test for X; y_train and y_test for y). Each test subset contains 20% of the original entries as test samples.
You should pass the _train subsets to fit() and the _test subsets to predict(). Hope this helps~
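A minimal runnable sketch of that workflow (the LogisticRegression model and make_classification data are assumptions, just to have something to fit):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

model = LogisticRegression()
model.fit(X_train, y_train)          # learn from the training subsets
y_pred = model.predict(X_test)       # predict on the held-out features
print(model.score(X_test, y_test))   # compare predictions against y_test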
In simple terms, train_test_split divides your dataset into a training dataset and a validation dataset.
The validation set is used to evaluate a given model, so in this case the validation dataset gives us an idea of model performance.
X_train, X_test, y_train, y_test = train_test_split(X,y,testsize = 0.20)
The above line splits the data into 4 parts:
X_train - training dataset
y_train - output of the training dataset
X_test - validation dataset
y_test - output of the validation dataset
and test_size = 0.2 means you'll have 20% validation data and 80% training data.
Basically this code splits your data into two parts:
X_train and y_train are used for training
X_test and y_test are used for testing
With the test_size argument you can set the size of the testing data.
After dividing the data into two parts, you fit the training data to your model with the fit() method.

Keras - steps_per_epoch calculation not matching the ImageDataGenerator output

I am working on a basic classification task with Keras and I seem to have stumbled upon a problem where I need some assistance.
I have 200 samples for training and 100 for validation, and I intend to use an ImageDataGenerator to increase the number of training samples for my task. I want to verify the total number of training images that are passed to fit_generator().
I know that steps_per_epoch defines the total number of batches we get from the generator, and ideally it should be the number of samples divided by the batch size.
However, this is where things do not add up for me. Here is a snippet of my code:
from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

num_samples = 200
batch_size = 10
gen = ImageDataGenerator(horizontal_flip=True,
                         vertical_flip=True,
                         width_shift_range=0.1,
                         height_shift_range=0.1,
                         zoom_range=0.1,
                         rotation_range=10)
x, y = shuffle(img_data, img_label, random_state=2)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.333, random_state=2)
generator = gen.flow(x_train, y_train, save_to_dir='check_images/sample_run')
new_network.fit_generator(generator, steps_per_epoch=len(x_train)/batch_size, validation_data=(x_test, y_test), epochs=1, verbose=2)
I am saving the augmented images to see how the images turn out from the ImageDataGenerator and also to ascertain the number of images that are generated from it.
After running this code for a single epoch, I get 600 images in my directory, a number I cannot arrive at by calculation, or maybe I am making a mistake.
Any assistance in helping me understand the calculation in this code would be deeply appreciated. Has anyone come across similar problems?
TIA
gen.flow() creates a NumpyArrayIterator internally, which in turn uses Iterator to calculate the number of batches. If steps_per_epoch is None, it is computed as steps_per_epoch = (x.shape[0] + batch_size - 1) // batch_size, which is approximately the same as your calculation.
Not sure why you see a larger number of samples. Could you print x.shape[0] and double-check that your code matches what you described?
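As a quick sanity check of that formula, with the numbers from the question (plain arithmetic, no Keras needed):
n_samples = 200                                     # training samples, from the question
batch_size = 10
steps = (n_samples + batch_size - 1) // batch_size  # ceiling division
print(steps)                                        # 20 -> 20 steps x 10 images = 200 images per epoch
One thing worth double-checking in the original snippet: gen.flow() takes its own batch_size argument (default 32), independent of the batch_size variable used for steps_per_epoch, so the two can silently disagree.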
