So I did the following:
MLP = MLPRegressor()
parameter_space = {
    'hidden_layer_sizes': [(32,), (32, 16), (32, 16, 8), (32, 16, 8, 4), (32, 16, 8, 4, 2),
                           (32, 32), (32, 32, 32), (32, 32, 32, 32), (32, 32, 32, 32, 32),
                           (16, 8, 4, 2)],
    'activation': ['relu'],
    'solver': ['adam'],
    'learning_rate_init': [1, 0.1, 0.01, 0.001, 0.0001, 0.00001],
    'max_iter': [5000],
    'shuffle': [True, False],
    'random_state': [0],
    'early_stopping': [True, False],
    'n_iter_no_change': [50],
}
gs_MLP = GridSearchCV(estimator=MLP, param_grid=parameter_space, cv=7, n_jobs=-1)
gs_MLP_fit = gs_MLP.fit(X, y)
gs_MLP.score(X, y)
And I noticed that whenever I change the order of the tuples within hidden_layer_sizes, I get different answers. First it said (16, 8, 4, 2) was the best, and when I moved (16, 8, 4, 2) to the end it said (32, 32, 32, 32) was the best.
I assume this has to do with random_state? Do I have to set it in MLPRegressor() instead, as in MLPRegressor(random_state=0)?
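For reference, seeding the regressor directly (as the question suggests) would look like the sketch below. Note that 'random_state': [0] in the grid already pins the seed for every fit, since GridSearchCV sets grid entries on a clone of the estimator via set_params; exact score ties between candidates, where GridSearchCV reports the first best-ranked one, could also explain the order-dependence:
MLP = MLPRegressor(random_state=0)
# random_state can then be dropped from parameter_space above;
# behaviour should match passing 'random_state': [0] in the grid.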
I am testing the output of tf.keras.losses.CategoricalCrossentropy, and it gives me values that differ from the definition.
My understanding of cross entropy is:
def ce_loss_def(y_true, y_pred):
    return tf.reduce_sum(-tf.math.multiply(y_true, tf.math.log(y_pred)))
And let's say I have values like these:
pred = [0.1, 0.1, 0.1, 0.7]
target = [0, 0, 0, 1]
pred = tf.constant(pred, dtype=tf.float32)
target = tf.constant(target, dtype=tf.float32)

pred_2 = [0.1, 0.3, 0.1, 0.7]
target = [0, 0, 0, 1]
pred_2 = tf.constant(pred_2, dtype=tf.float32)
target = tf.constant(target, dtype=tf.float32)
By the definition, I think it should disregard the probabilities of the non-target classes, like this:
ce_loss_def(y_true=target, y_pred=pred), ce_loss_def(y_true=target, y_pred=pred_2)
(<tf.Tensor: shape=(), dtype=float32, numpy=0.35667497>,
<tf.Tensor: shape=(), dtype=float32, numpy=0.35667497>)
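(In both cases only the target class contributes to the sum, so the expected value is -log(0.7) ≈ 0.35667, independent of the probabilities assigned to the other classes.)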
But tf.keras.losses.CategoricalCrossentropy doesn't give me the same results:
ce_loss_keras = tf.keras.losses.CategoricalCrossentropy()
ce_loss_keras(y_true=target, y_pred=pred), ce_loss_keras(y_true=target, y_pred=pred_2)
outputs:
(<tf.Tensor: shape=(), dtype=float32, numpy=0.35667497>,
<tf.Tensor: shape=(), dtype=float32, numpy=0.5389965>)
What am I missing?
Here is the link to the notebook I used to get this result:
https://colab.research.google.com/drive/1T69vn7MCGMSQ8hlRkyve6_EPxIZC1IKb#scrollTo=dHZruq-PGyzO
I found out what the problem was: the elements of y_pred get rescaled automatically to sum to 1, because the values are treated as probabilities.
import tensorflow as tf

ce_loss = tf.keras.losses.CategoricalCrossentropy()

pred = [0.05, 0.2, 0.25, 0.5]
target = [0, 0, 0, 1]
pred = tf.constant(pred, dtype=tf.float32)
target = tf.constant(target, dtype=tf.float32)

pred_2 = [0.1, 0.3, 0.1, 0.5]  # pred_2 has P(class2) = 0.3, instead of P(class2) = 0.1
target = [0, 0, 0, 1]
pred_2 = tf.constant(pred_2, dtype=tf.float32)
target = tf.constant(target, dtype=tf.float32)

c1, c2 = ce_loss(y_true=target, y_pred=pred), ce_loss(y_true=target, y_pred=pred_2)
print("CE loss at default value: {}. CE loss with different probability of non-target classes: {}".format(c1, c2))
gives
CE loss at default value: 0.6931471824645996. CE loss with different probability of non-target classes: 0.6931471824645996
As intended.
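A quick sanity check of the rescaling explanation: normalizing pred_2 from the first example by hand and plugging it into the raw definition reproduces the 0.5389965 that CategoricalCrossentropy returned above:
import tensorflow as tf

pred_2 = tf.constant([0.1, 0.3, 0.1, 0.7], dtype=tf.float32)
target = tf.constant([0, 0, 0, 1], dtype=tf.float32)

pred_2_norm = pred_2 / tf.reduce_sum(pred_2)  # rescale so the entries sum to 1
manual_ce = -tf.reduce_sum(target * tf.math.log(pred_2_norm))
print(manual_ce)  # ~0.5389965, i.e. -log(0.7 / 1.2), matching the Keras result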
I have some data in a CSV. I preprocessed it, and now it looks like this:
[array([66, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([18, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([26, 34, 9, 41, 19, 23, 29, 30, 1, 0]), array([15, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
That's one row: an array of arrays. And I have an array of labels: 0 if false and 1 if true.
With Keras and a lot of data, I want the result to be a float between 0 and 1.
For now, Keras gives me an error:
ValueError: setting an array element with a sequence.
So I was thinking that my data aren't in the right format. If I take only one column, it works...
Do I have to concatenate all the arrays into one per row, or do I have the wrong Keras model?
Here is my Keras model definition:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

df = dataset.values.tolist()
X = df
y = dataset['result']
X = np.array(X)
y = np.array(y)

model = Sequential()
model.add(Dense(12, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset, then evaluate it
model.fit(X, y, epochs=10)
_, accuracy = model.evaluate(X, y)
If I don't convert the list to a NumPy array, I get this error:
Please provide as model inputs either a single array or a list of arrays.
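As a minimal sketch of the concatenation idea asked about above (assuming every row holds the same number of fixed-length arrays, as in the sample row shown):
import numpy as np

# hypothetical rows shaped like the sample above: a list of fixed-length arrays per row
rows = [
    [np.array([66, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
     np.array([18, 0, 0, 0, 0, 0, 0, 0, 0, 0])],
    [np.array([26, 34, 9, 41, 19, 23, 29, 30, 1, 0]),
     np.array([15, 0, 0, 0, 0, 0, 0, 0, 0, 0])],
]
X = np.array([np.concatenate(row) for row in rows])
print(X.shape)  # (2, 20): one flat feature vector per row, which Keras accepts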
Suppose I have this Pipeline object:
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('my_transform', my_transform()),
    ('estimator', SVC())
])
To pass the hyperparameters to my Support Vector Classifier (SVC) I could do something like this:
pipe_parameters = {
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ('rbf',)
}
Then, I could use GridSearchCV:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters)
grid.fit(X_train, y_train)
We know that a linear kernel does not use gamma as a hyperparameter. So, how could I include the linear kernel in this GridSearch?
For example, in a simple GridSearch (without a Pipeline) I could do:
param_grid = [
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 100, 1000],
     'kernel': ['linear']},
    {'C': [0.1, 1, 10, 100, 1000],
     'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
     'degree': [2, 3],
     'kernel': ['poly']}
]
grid = GridSearchCV(SVC(), param_grid)
Therefore, I need a working version of this sort of code:
pipe_parameters = {
    'bag_of_words__max_features': (None, 1500),
    'estimator__kernel': ('rbf',),
    'estimator__gamma': (0.1, 1),
    'estimator__kernel': ('linear',),
    'estimator__C': (0.1, 1),
}
Meaning that I want to try the following hyperparameter combinations:
kernel = rbf, gamma = 0.1
kernel = rbf, gamma = 1
kernel = linear, C = 0.1
kernel = linear, C = 1
You are almost there. Similar to how you created multiple dictionaries for the SVC model, create a list of dictionaries for the pipeline.
Try this example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
remove = ('headers', 'footers', 'quotes')
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

pipe = Pipeline([
    ('bag_of_words', CountVectorizer()),
    ('estimator', SVC())])
pipe_parameters = [
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1],
     'estimator__gamma': [0.0001, 1],
     'estimator__kernel': ['rbf']},
    {'bag_of_words__max_features': (None, 1500),
     'estimator__C': [0.1, 1],
     'estimator__kernel': ['linear']}
]
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, pipe_parameters, cv=2)
grid.fit(data_train.data, data_train.target)
grid.best_params_
# {'bag_of_words__max_features': None,
# 'estimator__C': 0.1,
# 'estimator__kernel': 'linear'}
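From there, the rest of the search results can be inspected as well, for example (pandas here is just a convenience, not required):
import pandas as pd

print(grid.best_score_)                   # best mean cross-validated score
results = pd.DataFrame(grid.cv_results_)  # one row per parameter combination tried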
I have been working on the script below for random forest classification, and I'm running into some performance problems with the randomized search: it's taking a very long time to complete, and I wonder if I'm doing something wrong or could do something better to make it faster.
Would anybody be able to suggest speed/performance improvements I could make?
Thanks in advance!
forest_start_time = time.time()

model = RandomForestClassifier()
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [200, 300, 500, 1000]
}
bestforest = RandomizedSearchCV(estimator=model,
                                param_distributions=param_grid,
                                cv=3, n_iter=10,
                                n_jobs=available_processor_count)
bestforest.fit(train_features, train_labels.ravel())
forest_score = bestforest.score(test_features, test_labels.ravel())
print(forest_score)

forest_end_time = time.time()
forest_duration = forest_end_time - forest_start_time  # end minus start, so the duration is positive
The only ways to speed this up are to 1) reduce the number of features and/or 2) use more CPU cores with n_jobs = -1:
bestforest = RandomizedSearchCV(estimator=model,
                                param_distributions=param_grid,
                                cv=3, n_iter=10,
                                n_jobs=-1)
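As a sketch of the feature-reduction option, one possible approach (hypothetical, not from the original code) is univariate selection before the search:
from sklearn.feature_selection import SelectKBest, f_classif

# keep only the k most informative features; k=10 is an arbitrary choice
selector = SelectKBest(f_classif, k=10)
train_features_small = selector.fit_transform(train_features, train_labels.ravel())
test_features_small = selector.transform(test_features)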
This is a follow-up to a question answered here, but I believe it deserves its own thread.
In the previous question, we were dealing with "an Ensemble of Ensemble classifiers, where each has its own parameters." Let's start with the example provided by MaximeKan in his answer:
my_est = BaggingClassifier(RandomForestClassifier(n_estimators=100, bootstrap=True,
                                                  max_features=0.5),
                           n_estimators=5, bootstrap_features=False, bootstrap=False,
                           max_features=1.0, max_samples=0.6)
Now say I want to go one level above that: considerations like efficiency, computational cost, etc. aside, and as a general concept, how would I run a grid search with this kind of setup?
I can set up two parameter grids along these lines:
One for the BaggingClassifier:
BC_param_grid = {
    'bootstrap': [True, False],
    'bootstrap_features': [True, False],
    'n_estimators': [5, 10, 15],
    'max_samples': [0.6, 0.8, 1.0]
}
And one for the RandomForestClassifier:
RFC_param_grid = {
    'bootstrap': [True, False],
    'n_estimators': [100, 200, 300],
    'max_features': [0.6, 0.8, 1.0]
}
Now I can call grid search with my estimator:
grid_search = GridSearchCV(estimator = my_est, param_grid = ???)
What do I do with the param_grid parameter in this case? Or more specifically, how do I use both of the parameter grids I set up?
I have to say, it feels like I’m playing with matryoshka dolls.
Following @James Dellinger's comment above, and expanding from there, I was able to get it done. It turns out the "secret sauce" is indeed a mostly-undocumented feature: the __ (double underscore) separator (there are some passing references to it in the Pipeline documentation). Prefixing a base-estimator parameter with the name of the constructor argument that holds the inside estimator (base_estimator here), followed by __, allows you to create a param_grid that covers parameters for both the outside and inside estimators.
So for the example in the question, the outside estimator is BaggingClassifier and the inside/base estimator is RandomForestClassifier. First, import what needs to be imported:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
followed by the param_grid assignment (in this case, using the values from the example in the question):
param_grid = {
    'bootstrap': [True, False],
    'bootstrap_features': [True, False],
    'n_estimators': [5, 10, 15],
    'max_samples': [0.6, 0.8, 1.0],
    'base_estimator__bootstrap': [True, False],
    'base_estimator__n_estimators': [100, 200, 300],
    'base_estimator__max_features': [0.6, 0.8, 1.0]
}
And, finally, your grid search:
grid_search = GridSearchCV(BaggingClassifier(base_estimator=RandomForestClassifier()),
                           param_grid=param_grid, cv=5)
And you're off to the races.
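For completeness, a minimal usage sketch (assuming X_train and y_train exist, as in the earlier examples):
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)  # best combination across both the outer and inner estimator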