Random subsets of a dataset - python-3.x

I would like to compare the classification performance (accuracy) of different classifiers (e.g. CNN, SVM, ...) depending on the size of the training data set.
Given a dataset of images (e.g. MNIST), 80% of the images are randomly selected while preserving class balance. From that subset, the next smaller subset is drawn in the same way, again taking 80% of the images. This is repeated until a small training set of about 1000 images is reached.
Each classifier should then be trained on each of these subsets.
The aim is to be able to make a statement such as: from a training size of 5000 images onward, classifier A is significantly better than classifier B.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2, stratify=y)
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train, random_state=0, test_size=0.2, stratify=y_train)
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_train_2, y_train_2, random_state=0, test_size=0.8, stratify=y_train_2)
.....
.....
.....
My problem is that I am not sure whether this is really random sampling when I use the above code. Would it be better to get the subsets, e.g. using numpy.random.randint?
I would be very grateful for any help.
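For illustration, here is a minimal sketch of the nested stratified subsampling described above, assuming X and y are NumPy arrays and that stratified splitting via train_test_split is acceptable as the sampling mechanism (the 80% fraction and the 1000-image floor are taken from the question):
from sklearn.model_selection import train_test_split

def nested_subsets(X, y, fraction=0.8, min_size=1000, random_state=0):
    """Repeatedly keep a stratified `fraction` of the previous subset
    until fewer than `min_size` samples would remain."""
    subsets = [(X, y)]
    X_cur, y_cur = X, y
    while int(len(X_cur) * fraction) >= min_size:
        # train_test_split with stratify draws a random, class-balanced subset
        X_cur, _, y_cur, _ = train_test_split(
            X_cur, y_cur,
            train_size=fraction,
            stratify=y_cur,
            random_state=random_state,
        )
        subsets.append((X_cur, y_cur))
    return subsets

# Each classifier would then be trained on every (X_sub, y_sub) in `subsets`
# and evaluated on the same held-out test set.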

Related

How to reduce false positives in xgboost?

My dataset is evenly split between the 0 and 1 classes: 100,000 data points in total, with 50,000 classified as 0 and the other 50,000 classified as 1. I did an 80/20 split to train/test the data and got a 98% accuracy score. However, when looking at the confusion matrix I have an awful lot of false positives. I'm new to xgboost and decision trees in general. What settings can I change in the XGBClassifier to reduce the number of false positives, or is it even possible? Thank you.
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)  # 80% training and 20% test

model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                          colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
                          importance_type='gain', interaction_constraints='',
                          learning_rate=0.1, max_delta_step=0, max_depth=9,
                          min_child_weight=1, missing=None, monotone_constraints='()',
                          n_estimators=180, n_jobs=4, num_parallel_tree=1, random_state=0,
                          reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
                          tree_method='exact', use_label_encoder=False,
                          validate_parameters=1, verbosity=None)

model.fit(X_train,
          y_train,
          verbose=True,
          early_stopping_rounds=10,
          eval_metric="aucpr",
          eval_set=[(X_test, y_test)])

plot_confusion_matrix(model,
                      X_test,
                      y_test,
                      values_format='d',
                      display_labels=['Old Forests', 'Not Old Forests'])
Yes
If you are looking for a simple fix, lower the value of scale_pos_weight. This will lower the false positive rate even though your dataset is balanced.
For a more robust fix, you will need to run a hyperparameter tuning search. In particular, you should try different values of scale_pos_weight, alpha, lambda, gamma and min_child_weight, since these are the parameters with the most impact on how conservative the model will be.
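As a rough sketch of such a search (not the answerer's exact code; the parameter ranges below are illustrative assumptions), a GridSearchCV over the parameters mentioned above could look like this:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    'scale_pos_weight': [0.5, 0.75, 1.0],   # values below 1 penalize positive predictions
    'reg_alpha': [0, 0.1, 1],               # L1 regularization
    'reg_lambda': [1, 5, 10],               # L2 regularization
    'gamma': [0, 1, 5],                     # minimum loss reduction required to split
    'min_child_weight': [1, 5, 10],         # minimum sum of instance weights in a child
}

search = GridSearchCV(
    xgb.XGBClassifier(random_state=0),
    param_grid,
    scoring='precision',                    # optimizing precision targets false positives
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)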

Models evaluation and parameter tuning with CV

I am trying to compare three models: SVM, RandomForest and LogisticRegression.
I have an imbalanced dataset. First I split it with an 80%-20% ratio into train and test sets, setting stratify=y.
Next, I used StratifiedKFold only on the train set. What I am trying to do now is fit the models and choose the best one. I also want to use grid search for each of the models to find the best parameters.
My code so far is the following:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=True, stratify=y, random_state=42)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)
for train_index, test_index in skf.split(X_train, y_train):
    X_train_folds, X_test_folds = X_train[train_index], X_train[test_index]
    y_train_folds, y_test_folds = y_train[train_index], y_train[test_index]
    X_train_2, X_test_2, y_train_2, y_test_2 = X[train_index], X[test_index], y[train_index], y[test_index]
How can I fit a model using all the folds? How can I grid search? Should I have a double loop? Can you help?
You can use scikit-learn's GridSearchCV.
You will find an example here of how to evaluate the performance of the various models and assess the statistical significance of the results.
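For illustration, here is a minimal sketch of how GridSearchCV can run the stratified cross-validation on the train split for each model, so no manual fold loop is needed (the parameter grids and the f1 scoring below are illustrative assumptions, not tuned values):
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)

models = {
    'svm': (SVC(), {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}),
    'rf': (RandomForestClassifier(random_state=42), {'n_estimators': [100, 300], 'max_depth': [None, 10]}),
    'logreg': (LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}),
}

best = {}
for name, (estimator, grid) in models.items():
    # GridSearchCV handles the loop over folds and parameter combinations internally
    search = GridSearchCV(estimator, grid, cv=skf, scoring='f1', n_jobs=-1)
    search.fit(X_train, y_train)
    best[name] = (search.best_score_, search.best_params_)

print(best)  # compare mean CV scores; evaluate the chosen model once on X_test, y_test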

What does this Code mean? (Train Test Split Scikitlearn)

Everywhere I go I see this code. Need help understanding this.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,testsize = 0.20)
What do X_train, X_test, y_train, y_test mean in this context, and which should I put in fit() and predict()?
As the documentation says, what train_test_split does is: "Split arrays or matrices into random train and test subsets." You can find it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. I believe the right keyword argument is test_size instead of testsize; it represents the proportion of the dataset to include in the test split if it is a float, or the absolute number of test samples if it is an int.
X and y are the sequence of indexables with same length / shape[0], so basically the arrays/lists/matrices/dataframes to be split.
So, all in all, the code splits X and y into random train and test subsets (X_train and X_test for X and y_train and y_test for y). Each test subset should contain 20% of the original array entries as test samples.
You should pass the _train subsets to fit() and the _test subsets to predict(). Hope this helps~
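As a minimal sketch of that workflow (using a generic scikit-learn classifier on a built-in dataset purely for illustration; any estimator with fit() and predict() works the same way):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% of the rows go to the _train subsets, 20% to the _test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)            # learn from the training subsets
y_pred = model.predict(X_test)         # predict on the held-out features

print(accuracy_score(y_test, y_pred))  # compare predictions with the held-out labels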
In simple terms, train_test_split divides your dataset into training dataset and validation dataset.
The validation set is used to evaluate a given model.
So in this case validation dataset gives us idea about model performance.
X_train, X_test, y_train, y_test = train_test_split(X,y,testsize = 0.20)
The above line splits the data into 4 parts
X_train - training dataset
y_train - o/p of training dataset
X_test - validation dataset
y_test - o/p of validation dataset
and testsize = 0.2 means you'll have 20% validation data and 80% training data
Basically this code splits your data into two parts: one part (X_train, y_train) is used for training and the other part (X_test, y_test) is used for testing.
With the test_size argument you can set the size of the testing data.
After dividing the data into two parts, you fit the training data to your model with the fit() method.

Scaling of stock data

I am trying to apply machine learning to stock prediction, and I have run into a problem regarding the scaling of future unseen (much higher) stock close values.
Let's say I use random forest regression to predict the stock price. I split the data into a train set and a test set.
For the train set, I use StandardScaler and do fit and transform.
Then I fit the regressor.
For the test set, I use StandardScaler and do transform only.
Then I use the regressor to predict and compare against the test labels.
If I plot the predictions and the test labels on a graph, the predictions seem to max out or hit a ceiling. The problem is that the StandardScaler was fitted on the train set, while the test set (later in the timeline) has much higher values, and the algorithm does not know what to do with these extreme data points.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def test(X, y):
    # split the data (no shuffling, to preserve the time order)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])
    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    # preprocessing: fit and transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train labels
    model.fit(X_train, y_train)
    # transform the test data with the scaler fitted on the train data
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))
    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I got when plotting the predicted and actual close price against time:
The problem mainly comes from the model you are using. The RandomForest regressor is built on decision trees: it learns to map an input to an output for the examples in the training set, and its predictions are averages of target values it has already seen. Consequently, it works for values within the training range, but it cannot extrapolate to extreme values it hasn't seen during training, which is exactly the ceiling your picture is showing.
What you want is to learn a function directly, using linear/polynomial regression or more advanced algorithms like ARIMA.
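A minimal sketch of that limitation on synthetic data (not the asker's dataset): the forest's predictions saturate near the maximum target seen in training, while a linear model can follow the upward trend.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# synthetic upward-trending "price": train on the first 80%, test on the later, higher values
t = np.arange(200, dtype=float).reshape(-1, 1)
price = 0.5 * t.ravel() + np.random.RandomState(0).normal(scale=2.0, size=200)
X_train, X_test = t[:160], t[160:]
y_train, y_test = price[:160], price[160:]

rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

# the forest caps out near max(y_train); the linear model extrapolates the trend
print('max train target:', y_train.max())
print('forest predictions:', rf.predict(X_test)[-5:])
print('linear predictions:', lin.predict(X_test)[-5:])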

ML Model not predicting properly

I am trying to create an ML model (regression) using various techniques like SMR, Logistic Regression, and others. With all the techniques, I'm not able to get efficiency more than 35%. Here's what I'm doing:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_data = [X_data_distance]
X_data = np.vstack(X_data).astype(np.float64)
X_data = X_data.T
y_data = X_data_orders
# print(X_data.shape)  # (10000, 1)
# print(y_data.shape)  # (10000,)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.33, random_state=42)
svr_rbf = SVC(kernel='rbf', C=1.0)
svr_rbf.fit(X_train, y_train)
plt.plot(X_data_distance, svr_rbf.predict(X_data), color='red', label='RBF model')
For the plot, I'm getting the following:
I have tried various parameter tuning: changing the parameters C and gamma, and even trying different kernels, but nothing changes the accuracy. I have also tried SVR and Logistic Regression instead of SVC, but nothing helps. I tried different scaling for the training input data, like StandardScaler() and scale().
I used this as a reference
What should I do?
As a rule of thumb, we usually follow this convention:
For little number of features, go with Logistic Regression.
For a lot of features but not a lot of data, go with SVM.
For a lot of features and a lot of data, go with Neural Network.
Because your dataset has 10K cases, it'd be better to use Logistic Regression, because SVM will take forever to finish!
Nevertheless, because your dataset contains a lot of classes, there is a chance of class imbalance in your implementation. Thus I tried to work around this problem by using StratifiedShuffleSplit instead of train_test_split, which doesn't guarantee balanced classes in the splits.
Moreover, I used GridSearchCV with StratifiedKFold to perform cross-validation in order to tune the parameters and try all the different solvers!
So the full implementation is as follows:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit


def getDataset(path, x_attr, y_attr):
    """
    Extract dataset from CSV file
    :param path: location of csv file
    :param x_attr: list of Features Names
    :param y_attr: Y header name in CSV file
    :return: tuple, (X, Y)
    """
    df = pd.read_csv(path)
    X = np.array(df[x_attr]).reshape(len(df), len(x_attr))
    Y = np.array(df[y_attr])
    return X, Y


def stratifiedSplit(X, Y):
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_index, test_index = next(sss.split(X, Y))
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    return X_train, X_test, Y_train, Y_test


def run(X_data, Y_data):
    X_train, X_test, Y_train, Y_test = stratifiedSplit(X_data, Y_data)
    param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
                  'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
    model = LogisticRegression(random_state=0)
    clf = GridSearchCV(model, param_grid, cv=StratifiedKFold(n_splits=10))
    clf.fit(X_train, Y_train)
    print(accuracy_score(Y_train, clf.best_estimator_.predict(X_train)))
    print(accuracy_score(Y_test, clf.best_estimator_.predict(X_test)))


X_data, Y_data = getDataset("data - Sheet1.csv", ['distance'], 'orders')
run(X_data, Y_data)
Despite all the attempts with all the different algorithms, the accuracy didn't exceed 36%!
Why is that?
If you want a person to recognize/classify another person by their T-shirt color, you cannot say: hey, if it's red that means he's John, and if it's red it's Peter, but if it's red it's Aisling! He would say "really, what the heck is the difference?!"
And that's exactly what is in your dataset!
Simply run print(len(np.unique(X_data))) and print(len(np.unique(Y_data))) and you'll find that the numbers are strange; in a nutshell you have:
Number of Cases: 10000 !!
Number of Classes: 118 !!
Number of Unique Inputs (i.e. Features): 66 !!
All the classes share a great deal of information, which makes it impressive to reach even 36% accuracy!
In other words, you have no informative features, which leads to a lack of uniqueness in each class model.
What to do?
I believe you are not allowed to remove some classes, so the only two solutions you have are:
Either live with this very valid result.
Or add more informative feature(s).
Update
Now that you have provided the same dataset but with more features (i.e. the complete set of features), the situation is different.
I recommend you do the following:
Pre-process your dataset (i.e. prepare it by imputing missing values or deleting rows containing missing values, converting dates to some unique values (example), ...etc).
Check which features are most important to the Orders classes; you can achieve that by using forests of trees to evaluate the importance of features, as in the sketch after this list. Here is a complete and simple example of how to do that in Scikit-Learn.
Create a new version of the dataset, but this time keep Orders as the Y response and the above-found features as the X variables.
Follow the same GridSearchCV and StratifiedKFold procedure that I showed you in the implementation above.
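For that feature-importance step, here is a minimal sketch of the forests-of-trees approach (assuming X_data and Y_data are loaded with the full feature set via getDataset above; the estimator and its settings are illustrative):
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# fit a forest of randomized trees and rank features by impurity-based importance
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X_data, Y_data)

importances = forest.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"feature {idx}: importance {importances[idx]:.4f}")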
Hint
As mentioned by Vivek Kumar in the comment below, the stratify parameter has been added to the train_test_split function in a Scikit-learn update.
It works by passing the array-like ground truth, so you don't need my workaround in the function stratifiedSplit(X, Y) above.
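For example (a minimal sketch, assuming X_data and Y_data as above):
from sklearn.model_selection import train_test_split

# stratify=Y_data keeps the class proportions the same in both splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X_data, Y_data, test_size=0.2, random_state=0, stratify=Y_data)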
