cross Validation in Sklearn using a Custom CV - python-3.x

I am dealing with a binary classification problem.
I have 2 lists of indexes listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error in a cross validation process (hold out set approach).
However, I am not be able to find the correct way to pass this to the sklearn GridSearchCV.
The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.
grid_search = GridSearchCV(estimator = model, param_grid = param_grid,cv = custom_cv, n_jobs = -1, verbose = 0,scoring=errorType)
So, my question is how to create custom_cv based on these indexes to be used in this method?
X and y are respectivelly the features matrix and y is the vector of labels.
Example: Supose that I only have one hyperparameter alpha that belongs to the set{1,2,3}. I would like to set alpha=1, estimate the parameters of the model (for instance the coefficients os a regression) using the samples associated with listTrain and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3. Then I choose the alpha that minimizes the error.

EDIT: Actual answer to question. Try passing cv command a generator of the indices:
def index_gen(listTrain, listTest):
yield listTrain, listTest
grid_search = GridSearchCV(estimator = model, param_grid =
param_grid,cv = index_gen(listTrain, listTest), n_jobs = -1,
verbose = 0,scoring=errorType)
EDIT: Before Edits:
As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:
grid_search = GridSearchCV(estimator = model, param_grid = param_grid,cv = 5,
n_jobs = -1, verbose = 0,scoring=errorType)
grid_search.fit(x[listTrain], y[listTrain]
Now, your training set will be split into 5 (you can choose the number here) folds, trained using 4 of those folds on a specific set of hyperparameters, and tested the fold that was left out. This is repeated 5 times, till all of your training examples have been part of a left out set. This whole procedure is done for each hyperparameter setting you are testing (5x3 in this case)
grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:
clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain],
y[listTrain])
Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:
predictions = clf.predict(x[listTest])

Related

Pipeline with XGBoost - Imputer and Scaler prevent Model from learning

I'm trying to build a pipeline for data preprocessing for my XGBoost model. The data contains NaNs and needs to be scaled. This is the relevant code:
xgb_pipe = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', preprocessing.StandardScaler()),
('regressor', xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective = "reg:squarederror"))])
xgb_pipe.fit(train_x.values, train_y.values,
regressor__early_stopping_rounds=20,
regressor__eval_metric = "rmse",
regressor__eval_set = [[train_x.values, train_y.values],[test_x.values, test_y.values]])
The loss immediately increases and the training stops after 20 iterations.
If I remove the imputer and the scaler from the pipeline, it works and trains for the full 100 iterations. If I manually preprocess the data it also works as intended, so I know that the problem is not the data.
What am I missing?
The problem is that the preprocessing doesn't get applied to your eval sets, and so the model performs quite badly on them, and early stopping kicks in very early.
I'm not sure there's a simple way to do this that would keep everything in one pipeline, unfortunately. You need to apply the preprocessing steps of the pipeline to the eval sets, so those need to be fitted before setting that parameter.
Separate preprocessing
As two objects it's no problem:
preproc = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', preprocessing.StandardScaler()),
])
reg = xgboost.XGBRegressor(n_estimators=100, eta=0.1, objective="reg:squarederror")
train_x_preproc = preproc.fit_transform(train_x.values, train_y.values)
test_x_preproc = preproc.transform(test_x)
reg.fit(train_x.values, train_y.values,
regressor__early_stopping_rounds=20,
regressor__eval_metric = "rmse",
regressor__eval_set = [[train_x_preproc, train_y.values], [test_x_preproc, test_y.values]],
)
After fitting you could put these now-fitted estimators together into a pipeline (pipelines don't clone their estimators) for prediction if you'd like.
Custom estimator
There are a lot of ways to go about this, but inheriting from Pipeline means you can initialize the same way you do your current setup, and we just assume the last step is an xgboost model, and the rest are preprocessing that need to apply to the eval sets as well as fitting and predicting sets. I think everything else can be left to the inherited methods from Pipeline?
class PreprocEarlyStoppingXGB(Pipeline):
def fit(self, X, y, eval_set):
preproc = self.steps[:-1]
X_preproc = preproc.fit_transform(X, y)
eval_preproc = []
for eval in eval_set:
eval_preproc.append([preproc.transform(eval[0]), eval[1]])
self.steps[-1].fit(X_preproc, y, eval_set=eval_preproc)
return self
To your usecase from the comments, what happens when you cross-validate with this object? On each training fold, the preprocessing steps are fitted. Those are then applied to the training fold, and all eval sets (the entire training set as well as the external test set), and finally when scoring the test fold. The xgboost model trains on the preprocessed training fold, and watches the score on the entire training set and the external testing set (both having been preprocessed), the latter getting used for early stopping.

Keras fitting setting in TensorFlow Extended (TFX)

I try to construct a TFX pipeline with a trainer component with a Keras model defined like this:
def run_fn(fn_args: components.FnArgs):
transform_output = TFTransformOutput(fn_args.transform_output)
train_dataset = input_fn(fn_args.train_files,
fn_args.data_accessor,
transform_output,
num_batch)
eval_dataset = input_fn(fn_args.eval_files,
fn_args.data_accessor,
transform_output,
num_batch)
history = model.fit(train_dataset,
epochs=num_epochs,
steps_per_epoch=fn_args.train_steps,
validation_data=eval_dataset,
validation_steps=fn_args.eval_steps)
This works. However, if I change fitting to the following, this doesn't work:
history = model.fit(train_dataset,
epochs=num_epochs,
batch_size=num_batch,
validation_split=0.1)
Now, I have two questions:
Why does fitting work only with steps_per_epochs only? I couldn't find any explicit statement supporting this but this is the only way. Somehow I conclude that it must be something TFX specific (TFX handles input data only in a generator-like way?).
Let's say my train_dataset contains 100 instances and steps_per_epoch=1000 (with epochs=1). Is that mean that my 100 input instances are feed 10x each in order to reach the defined 1000 step? Isn't that counter-productive from training perspective?

Keras XOR not training

We have been trying for a while to get this to work. This is probably the easiest example to create and so now we need help. We've been changing the number of epochs in the fit function and that's giving us different results, but never anything good, and when we increase them too much they will always converge on 0.5.
#%%
inputValues = numpy.array([[0,0],[0,1],[1,0],[1,1]])
inputResults = numpy.array([[0],[1],[1],[0]])
print(inputValues)
print(inputResults)
#%%
model = keras.Sequential([
keras.layers.Flatten(input_shape=(2,)),
keras.layers.Dense(units=2, activation=("relu")),
keras.layers.Dense(units=2, activation=("softmax"))
])
model.compile(loss = keras.losses.SparseCategoricalCrossentropy(), optimizer = tensorflow.optimizers.Adam(), metrics=['accuracy'])
model.fit(inputValues, inputResults, epochs=2500)
model.summary()
print(model.weights)
#%%
print(model.predict_proba(inputValues))
print("End of file.")
From my understanding of ANN's we should have 2 inputs in the first layer, specifically for the XOR example. And two outputs for the output (either a 0, or a 1). I assume that since it is not required to say what these outputs are (0 or 1), tensor flow is dealing with this automatically by comparing the results in the fit function? Lastly, we have tried with both a hidden layer (of 2) and without and still don't seem to get any better results.
Could someone let us know what we have done wrong?
Your problem is essentially a binary classification problem, because the output can be either 0 or 1. For this you don't need two ouput neurons; one will do, with a sigmoid function that will return either a 0 or a 1 as output (sigmoid generally works well for binary classification, because its characteristic S-shape will get you values either close to 0 or close to 1).
Another adjustment you need to make is to set the loss function to binary crossentropy; your choice, sparse categorical crossentropy, is suitable for classifications into more than 2 categories.
So the code that I tried is:
model = keras.Sequential([
keras.layers.Flatten(input_shape=(2,)),
keras.layers.Dense(units=4, activation=("sigmoid")),
keras.layers.Dense(units=1, activation=("sigmoid"))
])
model.compile(loss = keras.losses.BinaryCrossentropy(from_logits=False), optimizer = optimizers.Adam(), metrics=['accuracy'])
model.fit(inputValues, inputResults, epochs=2500)
With these settings I got training accuracy to 1.0000. It took a while to get there, and I suppose that could be sped up by playing around with the learning rate, but it should be enough to get the job done.

How to tune weights in Voting Classifier (Sklearn)

I am trying to do the following:
vc = VotingClassifier(estimators=[('gbc',GradientBoostingClassifier()),
('rf',RandomForestClassifier()),('svc',SVC(probability=True))],
voting='soft',n_jobs=-1)
params = {'weights':[[1,2,3],[2,1,3],[3,2,1]]}
grid_Search = GridSearchCV(param_grid = params, estimator=vc)
grid_Search.fit(X_new,y)
print(grid_Search.best_Score_)
In this, I want to tune the parameter weights. If I use GridSearchCV, it is taking a lot of time. Since it needs to fit the model for each iteration. Which is not required, I guess. Better would be use something like prefit used in SelectModelFrom function from sklearn.model_selection.
Is there any other option or I am misinterpreting something?
The following code (in my repo) would do this.
It contains a class VotingClassifierCV. It first makes cross-validated predictions for all classifiers. Then loops over all weights, choosing the best combination, and using pre-calculated predictions.
A compute friendlier way would be to first parameter tune each classifier separately on your training data. Then weight each classifier proportional to your target metric (say accuracy_score) from your validate data.
# parameter tune
models = {
'rf': GridSearchCV(rf_params, RandomForestClassifier()).fit(X_trian, y_train),
'svc': GridSearchCV(svc_params, SVC()).fit(X_train, y_train),
}
# relative weights
model_scores = {
name: sklearn.metrics.accuracy_score(
y_validate,
model.predict(X_validate),
normalized=True
)
for name, model in models.items()
}
total_score = sum(model_scores.values())
# combine the parts
combined_model = VotingClassifier(
list(models.items()),
weights=[
model_scores[name] / total_score
for name in models.keys()
]
).fit(X_learn, y_learn)
Finally, you may fit the combined model with your learning (train + validate) data & evaluate with your test data.

How to apply random forest properly?

I am new to machine learning and python. Now I am trying to apply random forest to predict binary results of a target. In my data I have 24 predictors (1000 observations) where one of them is categorical(gender) and all the others numerical. Among numerical ones, there are two types of values which are volume of money in euros (very skewed and scaled) and numbers (number of transactions from an atm). I have transformed the big scale features and did the imputation. Last, I have checked correlation and collinearity and based on that removed some features (as a result I had 24 features.) Now when I implement RF it is always perfect in the training set while the ratios not so good according to crossvalidation. And even applying it in the test set it gives very very low recall values. How should I remedy this?
def classification_model(model, data, predictors, outcome):
# Fit the model:
model.fit(data[predictors], data[outcome])
# Make predictions on training set:
predictions = model.predict(data[predictors])
# Print accuracy
accuracy = metrics.accuracy_score(predictions, data[outcome])
print("Accuracy : %s" % "{0:.3%}".format(accuracy))
# Perform k-fold cross-validation with 5 folds
kf = KFold(data.shape[0], n_folds=5)
error = []
for train, test in kf:
# Filter training data
train_predictors = (data[predictors].iloc[train, :])
# The target we're using to train the algorithm.
train_target = data[outcome].iloc[train]
# Training the algorithm using the predictors and target.
model.fit(train_predictors, train_target)
# Record error from each cross-validation run
error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))
# Fit the model again so that it can be refered outside the function:
model.fit(data[predictors], data[outcome])
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20)
predictor_var = train.drop('Sold', axis=1).columns.values
classification_model(model,train,predictor_var,outcome_var)
#Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
outcome_var = 'Sold'
model = RandomForestClassifier(n_estimators=20, max_depth=20, oob_score = True)
predictor_var = ['fet1','fet2','fet3','fet4']
classification_model(model,train,predictor_var,outcome_var)
In Random Forest it is very easy to overfit. To resolve this you need to do parameter search a little more rigorously to know the best parameter to use. [Here](http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html
) is the link on how to do this: (from the scikit doc).
It is overfitting and you need to search for the best parameter that will work work on the model. The link provides implementation for Grid and Randomized search for hyper parameter estimation.
And it will also be fun to go through this MIT Artificial Intelligence lecture to get get deep theoretical orientation: https://www.youtube.com/watch?v=UHBmv7qCey4&t=318s.
Hope this helps!

Resources