GridSearchCV gives different results than LassoCV for optimal alpha - python-3.x

I am aware of the standard process of finding the optimal value of alpha/lambda using cross-validation via the GridSearchCV class in sklearn.model_selection. Here's my code for that.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RepeatedKFold

alphas = np.arange(0.0001, 0.01, 0.0005)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=100)
hyper_param = {'alpha': alphas}
model = Lasso()
model_cv = GridSearchCV(estimator=model,
                        param_grid=hyper_param,
                        scoring='r2',
                        cv=cv,
                        verbose=1,
                        return_train_score=True)
model_cv.fit(X_train, y_train)
# checking the best alpha
model_cv.best_params_
This gives me alpha=0.01
Now, looking at LassoCV: as per my understanding, this class builds the model by selecting the optimal alpha from the passed alphas list, and note that I have used the same cross-validation scheme for both. But here is what happens when I try sklearn.linear_model.LassoCV with the same RepeatedKFold cross-validation scheme:
from sklearn.linear_model import LassoCV

alphas = np.arange(0.0001, 0.01, 0.0005)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=100)
ls_cv_m = LassoCV(alphas, cv=cv, n_jobs=1, verbose=True, random_state=100)
ls_cv_m.fit(X_train_reduced, y_train)
print('Alpha Value %d' % ls_cv_m.alpha_)
print('The coefficients are {}', ls_cv_m.coef_)
I get alpha=0 for the same data, and this alpha value is not present in the list of decimal values passed to the alphas argument.
This has confused me about the actual implementation of LassoCV.
My questions are:
Why do I get an optimal alpha of 0 from LassoCV when the list passed to the alphas argument does not contain zero?
What is the difference between LassoCV and Lasso then, if I anyway have to find the most suitable alpha through GridSearchCV?

First, you should pass your alphas as a keyword argument rather than a positional one, since the first positional parameter of LassoCV is eps.
ls_cv_m=LassoCV(alphas=alphas,cv=cv,n_jobs=1,verbose=True,random_state=100)
Then, the model does return one of the alphas you defined as the optimal parameter; you are simply printing it with an integer format specifier, which casts the float to int. Replace %d with %f to print it in float format:
print('Alpha Value %f'%ls_cv_m.alpha_)
Have a look here for more details about Python printing formats and styles.
As for your second question, Lasso is the linear model itself, while LassoCV is an iterative process that finds the optimal alpha for a Lasso model using cross-validation.
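To make the distinction concrete, here is a minimal sketch (reusing alphas, cv, X_train, y_train and model_cv from the question) of the two routes to a fitted Lasso model; note that for the comparison to be meaningful both estimators should be fit on the same data, whereas the question fits one on X_train and the other on X_train_reduced:
from sklearn.linear_model import Lasso, LassoCV

# Route 1: LassoCV runs its own cross-validated search over the alpha grid
ls_cv = LassoCV(alphas=alphas, cv=cv, random_state=100)
ls_cv.fit(X_train, y_train)
print('LassoCV picked alpha = %f' % ls_cv.alpha_)

# Route 2: plain Lasso needs an alpha found externally, e.g. via GridSearchCV
best_alpha = model_cv.best_params_['alpha']
lasso = Lasso(alpha=best_alpha)
lasso.fit(X_train, y_train)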

Related

How to extract the aggregated gradient from tensorflow_federated?

I have a TensorFlow model like this:
import tensorflow as tf
import tensorflow_federated as tff

def input_spec():
    return (
        tf.TensorSpec([None, 122], tf.float64),
        tf.TensorSpec([None, 5], tf.uint8))

def model_fn():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(64, input_shape=(122,)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(.15),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(.15),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(.15),
        tf.keras.layers.Dense(5, activation='softmax')])
    return tff.learning.from_keras_model(
        model,
        input_spec=input_spec(),
        loss=tf.keras.losses.CategoricalCrossentropy(),
        metrics=[tf.keras.metrics.CategoricalAccuracy()])
I set up the iterative_process as follows:
iterative_process = tff.learning.algorithms.build_weighted_fed_avg(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.Adam(),
    server_optimizer_fn=lambda: tf.keras.optimizers.Adam())
I have learnt that we can obtain the aggregated weights with model_weights = iterative_process.get_model_weights(state), but I still need to know how to obtain the aggregated gradients.
While running the training procedure, the aggregated (pseudo) gradients can in some cases be computed by subtracting the state at the end of the round from the state at the beginning of the round. In the code snippet above, this will not quite literally be true, since the server optimizer is Adam (which rescales the pseudo-gradients and, if I recall correctly, adds a momentum accumulator).
If you are simply using gradient-descent with a learning rate of 1 on the server (traditionally the default setting for FedAvg), code like the following should give you this aggregated pseudogradient:
pseudo_grad = tf.nest.map_structure(
    lambda x, y: x - y,
    previous_state.global_model_weights.trainable,
    state.global_model_weights.trainable)
Some helpful measurements for debugging can alternatively be accessed by wrapping the aggregator parameter of your build_weighted_fed_avg call in an aggregator that adds these debug measurements, if that is the underlying goal here. You would additionally be able to read these values directly if you implemented a tff.templates.AggregationProcess which outputs the averaged pseudo-gradient in the measurements field of its result; these should be passed through directly by the rest of the FedAvg implementation.
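For illustration, a round loop that computes this per round might look like the sketch below; NUM_ROUNDS and federated_train_data are assumed placeholders, and the exact structure returned by next can differ between TFF versions:
state = iterative_process.initialize()
for round_num in range(NUM_ROUNDS):
    previous_state = state
    result = iterative_process.next(state, federated_train_data)
    state = result.state  # recent TFF versions return a LearningProcessOutput
    # Old weights minus new weights gives the aggregated pseudo-gradient
    # (only literally a gradient step if the server optimizer is SGD with lr=1.0).
    pseudo_grad = tf.nest.map_structure(
        lambda old, new: old - new,
        previous_state.global_model_weights.trainable,
        state.global_model_weights.trainable)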

Getting a Scoring Function by Name in scikit-learn

In scikit-learn, there is the notion of a scoring function. If we have some predicted labels and the true labels, we can get the score by calling scoring(y_true, y_predict). An example of such a scoring function is sklearn.metrics.accuracy_score.
A scoring function is not to be confused with a scorer, which is an object that can be called as scorer(estimator, X, y_true).
There are many built-in scorers in scikit-learn. It is possible to get these scorers by their string names. For example, we can get the scorer corresponding to the name 'accuracy' by calling sklearn.metrics.get_scorer("accuracy").
But it turns out that there is no obvious mechanism to access the built-in scoring functions by their names at run time by passing in the name as a string. For example, there is no way to access sklearn.metrics.accuracy_score by its name 'accuracy'.
Concretely, if at run time the program has the name of the scoring function in a variable name, I am looking for a mechanism get_scoring_function() such that get_scoring_function(name) returns a handle to the scoring function. Note that name is not known at scripting time.
Is there any way to access the built-in scoring functions by their names at run time through passing in the names as strings?
You can use the get_scorer() function, which accepts a string as an argument, and then get the _score_func attribute of the returned object.
So for example
from sklearn.metrics import get_scorer
get_scorer('accuracy')._score_func(y_true, y_pred)
is equivalent to
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
I faced this task myself, and I haven't found a better way to access metrics by name than the sklearn.metrics.get_scorer function, but its drawback is that you have to pass an estimator to it, not predictions. I tried the recommendation from @collinb9, but, as you will see, you have to access a protected method, and in my case this led to unpleasant consequences, namely incorrectly calculated metrics.
Here is a short example showing the problem:
from sklearn import datasets, model_selection, linear_model, metrics
features, labels = datasets.make_regression(1000, random_state=123)
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(features, labels, test_size=0.1, random_state=567)
model = linear_model.LinearRegression()
model.fit(train_features, train_labels)
print(f'variant 1 neg_mse = {metrics.get_scorer("neg_mean_squared_error")(model, test_features, test_labels)}')
print(f'variant 1 neg_rmse = {metrics.get_scorer("neg_root_mean_squared_error")(model, test_features, test_labels)}\n')
preds = model.predict(test_features)
print(f'variant 2 mse = {metrics.mean_squared_error(test_labels, preds)}')
print(f'variant 2 rmse = {metrics.mean_squared_error(test_labels, preds, squared=False)}\n')
print(f'protected neg_mse = {metrics.get_scorer("neg_mean_squared_error")._score_func(test_labels, preds)}')
print(f'protected neg_rmse = {metrics.get_scorer("neg_root_mean_squared_error")._score_func(test_labels, preds)}')
The output of this program will be:
variant 1 neg_mse = -2.142587870436064e-25
variant 1 neg_rmse = -4.628809642268803e-13
variant 2 mse = 2.142587870436064e-25
variant 2 rmse = 4.628809642268803e-13
protected neg_mse = 2.142587870436064e-25
protected neg_rmse = 2.142587870436064e-25
You can see that the metrics calculated via the protected method differ. First, we asked for negative values but got positive ones (granted, for the variant 2 metrics we did not expect negative values anyway). Second, the neg_mse and neg_rmse values are equal, but they should be different.
If we go to the source code of sklearn metrics, we will see:
This is how _score_func is called: it is multiplied by a sign factor, and that is where we lose the negative values.
This is how the scorers are defined: neg_root_mean_squared_error_scorer is built with the extra parameter squared=False. This parameter is declared as optional in metrics.mean_squared_error, so omitting it does not raise an error. We can pass it as a keyword argument to _score_func, and then we at least get the correct absolute value:
print(f'protected neg_rmse = {metrics.get_scorer("neg_root_mean_squared_error")._score_func(test_labels, preds, squared=False)}')
protected neg_rmse = 4.628809642268803e-13
In short, I've shown what is, to my knowledge, the only way to get sklearn metrics by name (by the way, you can find the full list of names here), and that it's not safe to use protected methods that you're not supposed to use. I was using sklearn version 0.24.2.
Since the documentation is incomplete, you'll have to go directly to the source code here for the complete list of metric names:
Metric Names
Search for __all__.
The answer from @collinb9 should not be accepted as-is, as it can lead to incorrect calculations.
You need the other arguments (such as squared=False for RMSE) to compute the correct value. They can be accessed via the _kwargs attribute of the _BaseScorer class. If you combine _score_func, _sign, and _kwargs, you get the corresponding scoring function.
The full answer to the question should be:
from sklearn import metrics

def score(scoring_name, y_true, y_pred):
    # look up the scorer by name, then undo its sign convention and reuse its kwargs
    sklearn_scorer = metrics.get_scorer(scoring_name)
    return sklearn_scorer._sign * sklearn_scorer._score_func(
        y_true=y_true, y_pred=y_pred, **sklearn_scorer._kwargs
    )
score("neg_root_mean_squared_error", y_true, y_pred)

GridSearchCV uses predict or predict_proba?

It works for both the 'huber' and 'log' losses; however, only the 'log' loss provides predict_proba. How does it work? I used roc_auc_score.
As written in the O'Reilly book, it works on the mean score; you can access all the mean scores it computed with this code:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
It will print the mean score calculated for each parameter combination, so you can easily see the differences between them.
GridSearchCV has both the predict and the predict_proba methods.
In a binary classification problem, predict returns the values 0 or 1, while predict_proba returns the probability of the sample being 0 or 1.
predict_proba gives an array output like [0.23 0.77].
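A rough sketch of how this plays out with an AUC scorer (the estimator and parameter grid below are placeholders): with scoring='roc_auc', the scorer uses the estimator's decision_function or predict_proba during cross-validation, while grid.predict() still returns hard labels from the refit best estimator.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

grid = GridSearchCV(
    SGDClassifier(loss='log_loss', random_state=0),  # 'log' in older sklearn versions
    param_grid={'alpha': [1e-4, 1e-3]},
    scoring='roc_auc',  # scored from decision_function / predict_proba, not predict
    cv=3)
grid.fit(X, y)

print(grid.predict(X[:3]))        # hard 0/1 labels
print(grid.predict_proba(X[:3]))  # available because the log loss supports it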

ValueError: could not convert string to float: ' '. Is Permutation importance only applicable for numeric features?

I have a DataFrame containing categorical, float, and int dtypes.
X contains features of all three dtypes and y is int.
I've created a pipeline as given below.
def get_imputer():
    ...  # returns some imputing transformer

def get_encoder():
    ...  # returns some encoding transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# model
pipeline = Pipeline(steps=[
    ('imputer', get_imputer()),
    ('encoder', get_encoder()),
    ('regressor', RandomForestRegressor())
])
I need to find the permutation importance of the model. Below is the code for that:
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(pipeline.steps[2][1], random_state=1).fit(X, y)
eli5.show_weights(perm)
But this code is throwing an error as follows:
ValueError: could not convert string to float: ''
Let's briefly understand how PermutationImportance works.
After you have trained your model with all the features, PermutationImportance shuffles the values of one or more columns and checks the effect on the loss function.
E.g. there are 5 features (columns) and n rows:
f1   f2   f3   f4   f5
v1   v2   v3   v4   v5
v6   v7   v8   v9   v10
...
Now, to identify whether column f3 is important or not, it shuffles the values in column f3 (e.g. the value of f3 in row x is swapped with the value of f3 in row y) and then checks the effect on the loss function. That is how it identifies the importance of a feature to the model.
Now, to answer this particular question: any model is trained only once all the features are numerical (an ML model does not understand text directly). So, in the arguments you pass to PermutationImportance, you need to supply columns that are numbers. As you trained the model after converting the categorical/textual features to numbers, you need to apply the same conversion to the data you pass in.
Hence, PermutationImportance should be used only after your data is preprocessed and your DataFrame is entirely numerical.
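A tiny, self-contained illustration of that shuffling idea on made-up numeric data (every name below is hypothetical):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_num = rng.normal(size=(300, 5))
y_num = 3 * X_num[:, 2] + rng.normal(scale=0.1, size=300)  # only f3 matters

model = RandomForestRegressor(random_state=0).fit(X_num, y_num)
baseline = r2_score(y_num, model.predict(X_num))

X_shuffled = X_num.copy()
X_shuffled[:, 2] = rng.permutation(X_shuffled[:, 2])  # permute only column f3
drop = baseline - r2_score(y_num, model.predict(X_shuffled))
print('importance of f3 ~', round(drop, 3))  # a big drop means f3 is important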
For the next poor soul...
I came across this post while having the same problem. While the accepted answer makes total sense, the fact is that in the OP's pipeline the categorical data is already handled by encoders that convert it to numeric.
So it appears that PermutationImportance checks the array for numeric values far too early in the process (before the pipeline runs at all). Instead, it should check after the preprocessing steps and right before fitting the model. This is frustrating, because not working with pipelines makes it hard to use.
I started off having some luck using sklearn's implementation of permutation_importance instead... But then I figured it out.
You need to separate the pipeline again and you should be able to get it to work. It's annoying, but it works!
import eli5
from eli5.sklearn import PermutationImportance

estimator = pipeline.named_steps['regressor']
# all steps except the final estimator (assumes the pipeline is already fitted)
preprocessor = pipeline[:-1]
X2 = preprocessor.transform(X)
perm = PermutationImportance(estimator, random_state=1).fit(X2.toarray(), y)
eli5.show_weights(perm)
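For completeness, the scikit-learn permutation_importance mentioned above can also be applied to the whole pipeline directly, with no manual splitting; a minimal sketch, assuming the pipeline has already been fit on X, y:
from sklearn.inspection import permutation_importance

result = permutation_importance(pipeline, X, y, n_repeats=10, random_state=1)
print(result.importances_mean)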

Default value in SVM prediction with scikit-learn

I am using scikit-learn for SVM classification.
I need a classifier that returns a default value when a given test item doesn't match any of the training-set items, i.e. when the distance is very large. Is that possible?
For example, let's say my training set is
X = [[0.5, 0.5, 2], [4, 4, 16], [16, 16, 64]]
and labels
y = [0, 1, 2]
Then I run training:
clf = svm.SVC()
clf.fit(X, y)
Then I run prediction:
clf.predict([[-100, -100, -200]])
Now, as we can see, the test item [-100, -100, -200] is very far away from all of the training items; in this case the prediction yields [2], which corresponds to the item [16, 16, 64]. Is there any way to make it return something else (i.e. not from the training set)?
I think you can create a label for those extreme values and add such an example to your training set:
X = [[0.5, 0.5, 2], [4, 4, 16], [16, 16, 64], [-100, -100, -200]]
y = [0, 1, 2, 100]
and give it a try.
SVM is supervised learning, which means the output labels have to be specified. If you are not certain about the outputs, do some unsupervised clustering (k-means, for example) to get a rough idea of how many possible outputs to expect.
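A minimal end-to-end sketch of that suggestion, reusing the data from the question and an arbitrary "unknown" label of 100:
from sklearn import svm

X = [[0.5, 0.5, 2], [4, 4, 16], [16, 16, 64], [-100, -100, -200]]
y = [0, 1, 2, 100]

clf = svm.SVC()
clf.fit(X, y)
print(clf.predict([[-100, -100, -200]]))  # now returns the "unknown" label, [100]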
