How to get actual mean of k folds by using GridSearchCV? - scikit-learn

I am using GridSearchCV with cv=KFold(n_splits=10) and scoring='accuracy' on a test SVM (C=1, gamma=1).
For this test, I am using only a vector of 51 values and another of 51 binary responses.
My results look like this:
'split0_test_score': array([ 0.16666667]), 'split1_test_score': array([ 0.4]), 'split2_test_score': array([ 0.8]), 'split3_test_score': array([ 0.6]), 'split4_test_score': array([ 0.2]), 'split5_test_score': array([ 1.]), 'split6_test_score': array([ 0.2]), 'split7_test_score': array([ 0.]), 'split8_test_score': array([ 0.4]), 'split9_test_score': array([ 0.6]),
'mean_test_score': array([ 0.43137255]) ...
The problem is that the mean score is not the actual mean of the fold test scores (it should be 0.4367). Is there a way to get the real mean of all folds from GridSearchCV, or do I have to extract it manually?
Thank you
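For reference, a minimal sketch on synthetic data (the dataset and values here are illustrative, not the original ones) showing how to pull the per-fold scores out of cv_results_ and average them yourself with np.mean:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=51, random_state=0)
grid = GridSearchCV(SVC(), {"C": [1], "gamma": [1]},
                    cv=KFold(n_splits=10), scoring="accuracy")
grid.fit(X, y)

results = grid.cv_results_
# plain (unweighted) mean over the 10 folds
fold_scores = np.array([results["split%d_test_score" % i] for i in range(10)])
print(fold_scores.mean(axis=0))
# sklearn's reported mean, which may be sample-weighted in older versions
print(results["mean_test_score"])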

I also noticed such discrepancies using GridSearchCV from scikit-learn. Using my own test cases, the difference between the average (numpy.mean) over splitX_test_score[i] and mean_test_score from the cv_results_ attribute is noticeable from the 17th decimal with 2 folds. With 10 folds, there are discrepancies from the 6th decimal.
I think this issue may be related to floating-point precision. Please, could someone explain how exactly mean_test_score is computed (which function is used, and with which floating-point precision)? Many thanks in advance.
Edit: I read the answer from Leena in the following topic: sikit learn cv grid scores - Unexpected results. The difference is due to the parameter iid. If it is set to False, then mean_test_score is computed as the plain mean across folds.
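For reference, a hedged sketch of how that iid switch was used (iid was deprecated in scikit-learn 0.22 and removed in 0.24, after which mean_test_score is always the unweighted mean across folds):
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Only on scikit-learn < 0.24: with iid=False, mean_test_score is the plain
# mean of the per-fold scores instead of the sample-weighted average.
grid = GridSearchCV(SVC(), {"C": [1], "gamma": [1]},
                    cv=KFold(n_splits=10), scoring="accuracy", iid=False)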


TokenClassification with BERT: add tensors after text was already tokenized?

I am using Bert for NER TokenClassification.
Since I want to manually truncate the (training) text and add padding and special tokens on my own, the tokenizer function looks like this:
tokenized_text = tokenizer.encode_plus(text, add_special_tokens=False, is_split_into_words=True)
I have successfully trained my model and now want to use it to predict new text.
The Huggingface tutorial suggests doing it as follows:
with torch.no_grad():
    logits = model(**tokenized_text).logits

predicted_token_class_ids = logits.argmax(dim=-1)
predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
My problem is that in order to use the code above, tokenized_text has to be in (PyTorch) tensor format, but I originally did not use the return_tensors="pt" parameter, since I wanted to leave "input_ids", "token_type_ids" and "attention_mask" as lists to manipulate them more easily.
So my question is basically if I can transform an already tokenized text to a tokenized text in the tensor format.
As far as I can tell from the documentation, return_tensors="pt" just returns torch.Tensor objects for "input_ids", "token_type_ids" and "attention_mask".
So I simply tried to use:
tokenized_text["input_ids"] = torch.Tensor(tokenized_text["input_ids"])
tokenized_text["token_type_ids"] = torch.Tensor(tokenized_text["token_type_ids"])
tokenized_text["attention_mask"] = torch.Tensor(tokenized_text["attention_mask"])
This made my tokenized text look like this:
{'input_ids': tensor([ 101., 5911., 26664., ....
'token_type_ids': tensor([0., 0., 0., ....
'attention_mask': tensor([1., 1., 1., .... }
Which is a bit weird, since if I use return_tensors="pt" from the beginning, the tokenized text looks like this (basically it has one more layer of [ ] and no "." after each element):
{'input_ids': tensor([[19770, 30882, 215, ....
'token_type_ids': tensor([[0, 0, 0, ....
'attention_mask': tensor([[1, 1, 1, .... }
I tried that on a custom text just to get the reference; currently it is not really an option for me to use return_tensors="pt" directly during my tokenization.
If I run the prediction code as suggested by Huggingface on the return_tensors="pt" tokenized text, it works just fine, but if I use my manually converted tokenized text I receive the following error:
ValueError: not enough values to unpack (expected 2, got 1)
Does anyone have a suggestion as to what I should change or experienced another way to predict new data with a trained model?
I could solve it after some more digging through the documentation. It turns out that just using torch.Tensor(tokenized_text["input_ids"]) was not enough.
I had to add another dimension so that the tensor has the size of [1,512].
I did this with:
local_copy["input_ids"] = local_copy["input_ids"][None, :]
I had to typecast my tensor from float to int with:
local_copy["input_ids"] = local_copy["input_ids"].type(torch.int64)

What is Mean_test_score and STD_Test_Score used for [duplicate]

Hello, I'm doing a GridSearchCV and I'm printing the results with the cv_results_ attribute from scikit-learn.
My problem is that when I evaluate by hand the mean over all the test-score splits, I obtain a different number compared to what is written in 'mean_test_score'. How is it different from the standard np.mean()?
I attach here the code with the result:
n_estimators = [100]
max_depth = [3]
learning_rate = [0.1]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate)
gkf = GroupKFold(n_splits=7)
grid_search = GridSearchCV(model, param_grid, scoring=score_auc, cv=gkf)
grid_result = grid_search.fit(X, Y, groups=patients)
grid_result.cv_results_
The result of this operation is:
{'mean_fit_time': array([ 8.92773601]),
'mean_score_time': array([ 0.04288721]),
'mean_test_score': array([ 0.83490629]),
'mean_train_score': array([ 0.95167036]),
'param_learning_rate': masked_array(data = [0.1],
mask = [False],
fill_value = ?),
'param_max_depth': masked_array(data = [3],
mask = [False],
fill_value = ?),
'param_n_estimators': masked_array(data = [100],
mask = [False],
fill_value = ?),
'params': ({'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100},),
'rank_test_score': array([1]),
'split0_test_score': array([ 0.74821666]),
'split0_train_score': array([ 0.97564995]),
'split1_test_score': array([ 0.80089016]),
'split1_train_score': array([ 0.95361201]),
'split2_test_score': array([ 0.92876979]),
'split2_train_score': array([ 0.93935856]),
'split3_test_score': array([ 0.95540287]),
'split3_train_score': array([ 0.94718634]),
'split4_test_score': array([ 0.89083901]),
'split4_train_score': array([ 0.94787374]),
'split5_test_score': array([ 0.90926355]),
'split5_train_score': array([ 0.94829775]),
'split6_test_score': array([ 0.82520379]),
'split6_train_score': array([ 0.94971417]),
'std_fit_time': array([ 1.79167576]),
'std_score_time': array([ 0.02970254]),
'std_test_score': array([ 0.0809713]),
'std_train_score': array([ 0.0105566])}
As you can see, taking the np.mean of all the test scores gives a value of approximately 0.8655122606479532, while the 'mean_test_score' is 0.83490629.
Thanks for your help,
Leonardo.
I will post this as a new answer since it's so much code:
The test and train scores of the folds are: (taken from the results you posted in your question)
test_scores = [0.74821666,0.80089016,0.92876979,0.95540287,0.89083901,0.90926355,0.82520379]
train_scores = [0.97564995,0.95361201,0.93935856,0.94718634,0.94787374,0.94829775,0.94971417]
The numbers of training and test samples in those folds are: (taken from the output of print([(len(train), len(test)) for train, test in gkf.split(X, groups=patients)]))
train_len = [41835, 56229, 56581, 58759, 60893, 60919, 62056]
test_len = [24377, 9983, 9631, 7453, 5319, 5293, 4156]
Then the train and test means, weighted by the number of train/test samples per fold, are:
train_avg = np.average(train_scores, weights=train_len)
-> 0.95064898361714389
test_avg = np.average(test_scores, weights=test_len)
-> 0.83490628649308296
So this is exactly the value sklearn gives you. It is also the correct mean accuracy of your classification. The mean of the folds is incorrect in that it depends on the somewhat arbitrary splits/folds you chose.
So in conclusion, both explanations were indeed identical and correct.
If you look at the original code of GridSearchCV in the GitHub repository, they don't use np.mean(); instead they use np.average() with weights. Hence the difference. Here's their code:
n_splits = 3
test_sample_counts = np.array(test_sample_counts[:n_splits], dtype=np.int)
weights = test_sample_counts if self.iid else None
means = np.average(test_scores, axis=1, weights=weights)
stds = np.sqrt(np.average((test_scores - means[:, np.newaxis]) ** 2,
                          axis=1, weights=weights))

cv_results = dict()
for split_i in range(n_splits):
    cv_results["split%d_test_score" % split_i] = test_scores[:, split_i]
cv_results["mean_test_score"] = means
cv_results["std_test_score"] = stds
In case you want to know more about the difference between them, take a look at:
Difference between np.mean() and np.average()
I suppose the reason for the different means is different weighting factors in the mean calculation.
The mean_test_score that sklearn returns is the mean calculated on all samples where each sample has the same weight.
If you calculate the mean by taking the mean of the folds (splits), then you only get the same result if the folds are all of equal size. If they are not, then samples from larger folds automatically have a smaller impact on the mean of the folds than samples from smaller folds, and the other way around.
Small numeric example:
mean([2,3,5,8,9]) = 5.4 # mean over all samples ('mean_test_score')
mean([2,3,5]) = 3.333 # mean of fold 1
mean([8,9]) = 8.5 # mean of fold 2
mean([3.333, 8.5]) = 5.92 # mean of the fold means
5.4 != 5.92
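A quick sketch of the same numeric example, showing that weighting the fold means by the fold sizes (3 and 2 here) recovers the overall sample mean:
import numpy as np

samples = [2, 3, 5, 8, 9]
fold1, fold2 = [2, 3, 5], [8, 9]

print(np.mean(samples))                          # 5.4, mean over all samples
print(np.mean([np.mean(fold1), np.mean(fold2)])) # ~5.92, mean of fold means
print(np.average([np.mean(fold1), np.mean(fold2)],
                 weights=[len(fold1), len(fold2)]))  # 5.4 again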

Using the colon operator to slice columns in numpy when it might be a vector or might be a matrix

I have two general functions, Estability3 and Lstability3, where I would like to evaluate both two-dimensional slices of arrays and one-dimensional ranges of vectors. I have explored the error outside the functions in a Jupyter notebook with some of the arguments to the functions.
These functions compute energy and angular momentum. The position and velocity data needed to compute them are stored in a two-dimensional matrix called xvec, where each row holds the position and velocity of one star, and the three rows correspond to the three stars. xvec0 is the initial data for the simulation (timestep 0).
xvec0
array([[-5.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -2.23606798e+00, 0.00000000e+00],
[ 5.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 2.23606798e+00, 0.00000000e+00],
[ 9.95024876e+02, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 4.46099737e-01, 0.00000000e+00]])
I select the first star of the zeroth timestep by selecting the first row of this matrix. If I were looping over thousands of timesteps as usual, I would use thousands of matrices like these, append them to a list, and then convert that to a numpy array with thousands of columns (so xvec1_0 would have thousands of columns instead of one).
xvec1_0 = xvec0[0]
Since xvec1_0 is one-dimensional, here I am trying to force numpy to recognize it as a matrix. It doesn't work.
np.reshape(xvec1_0,(1,6))
array([[-5. , 0. , 0. , -0. , -2.23606798,
0. ]])
I see that it has two outer brackets, which implies that it is a matrix. But when I try to use the colon index over the one column like I normally do over the thousands of columns, I get an error.
xvec1_0[:,0:3]
IndexError Traceback (most recent call last)
<ipython-input-115-79d26475ac10> in <module>
----> 1 xvec1_0[:,0:3]
IndexError: too many indices for array
Why can't I use the : operator to obtain the first row of this two dimensional array? How can I do that in this more general code that also applies to matrices?
Thanks,
Steven
I think I misread the function definition for reshape. I thought it changed the array in place. It doesn't; I needed to assign the output, like this:
xvec1_0 = np.reshape(xvec1_0, (1, 6))
xvec1_0[:,0:3]
array([[-5., 0., 0.]])
xvec1_0
array([[-5. , 0. , 0. , -0. , -2.23606798,
0. ]])
xvec1_0.shape
(1, 6)
Thanks to a friend's help, I discovered that the following works just fine.
import numpy as np
x = np.zeros((1,6))
print(x.shape)
print(x[:,0:3])
x[:,0:3]
(1, 6)
[[0. 0. 0.]]
array([[0., 0., 0.]])
x = np.zeros((6,))
print(x.shape)
x = np.reshape(x, (1,6))
print(x[:,0:3])
x[:,0:3]
(6,)
[[0. 0. 0.]]
array([[0., 0., 0.]])
Probably I should have thought of some of these tests, but I thought I already had found the most basic test when I saw the output from np.reshape. I really appreciate the help from my friend, and hope my question did not waste anyone's time too badly.
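A minimal sketch of one way to make such a function accept either a single timestep vector or a matrix of timesteps, using np.atleast_2d so the [:, 0:3] slice always works (the function name and the (n_timesteps, 6) layout are illustrative assumptions, not from the original code):
import numpy as np

def positions(xvec):
    # (6,) becomes (1, 6); 2-D input is passed through unchanged
    xvec = np.atleast_2d(xvec)
    return xvec[:, 0:3]   # position columns for every timestep

print(positions(np.zeros(6)).shape)          # (1, 3)
print(positions(np.zeros((1000, 6))).shape)  # (1000, 3)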

sklearn's precision_recall_curve incorrect on small example

Here is a very small example using precision_recall_curve():
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
y_true = [0, 1]
y_predict_proba = [0.25,0.75]
precision, recall, thresholds = precision_recall_curve(y_true, y_predict_proba)
precision, recall
which results in:
(array([1., 1.]), array([1., 0.]))
The above does not match the "manual" calculation which follows.
There are three possible class vectors depending on the threshold: [0,0] (when the threshold is > 0.75), [0,1] (when the threshold is between 0.25 and 0.75), and [1,1] (when the threshold is < 0.25). We have to discard [0,0] because it gives an undefined precision (divide by zero). So, applying precision_score() and recall_score() to the other two:
y_predict_class=[0,1]
precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class)
which gives:
(1.0, 1.0)
and
y_predict_class=[1,1]
precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class)
which gives
(0.5, 1.0)
This seems not to match the output of precision_recall_curve() (which for example did not produce a 0.5 precision value).
Am I missing something?
I know I am late, but I had the same doubt and eventually solved it.
The main point here is that precision_recall_curve() does not output precision and recall values anymore after full recall is obtained the first time; moreover, it concatenates a 0 to the recall array and a 1 to the precision array so that the curve starts at the y-axis.
In your specific example, you'll have effectively two arrays done like this (they are ordered the other way around because of the specific implementation of sklearn):
precision, recall
(array([1., 0.5]), array([1., 1.]))
Then, the values of the two arrays which correspond to the second occurrence of full recall are omitted, and the 1 and 0 values (for precision and recall, respectively) are concatenated as described above:
precision, recall
(array([1., 1.]), array([1., 0.]))
I have tried to explain it here in full detail; another useful link is certainly this one.
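A small sketch that reproduces the point above (the exact arrays may differ slightly between scikit-learn versions): the (precision=0.5, recall=1.0) point for the 0.25 threshold is dropped because full recall is already reached at the 0.75 threshold, and the (1, 0) end point is appended.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 1])
y_scores = np.array([0.25, 0.75])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(precision, recall, thresholds)   # e.g. [1. 1.] [1. 0.] [0.75]

# manual check of both candidate thresholds
for t in [0.75, 0.25]:
    y_pred = (y_scores >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    print(t, tp / (tp + fp), tp / (tp + fn))   # precision, recall
# 0.75 -> precision 1.0, recall 1.0
# 0.25 -> precision 0.5, recall 1.0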

Treat a tuple/list of Tensors as a single Tensor

I'm using PyTorch for some robotics reinforcement learning tasks. I'd like to use both images and information about the state as observations for this task. The implementation I'm using does not directly support this, so I'm making some amendments. Expected observations are either the state, as a 1-dimensional Tensor, or images, as a 3-dimensional Tensor (channels, width, height). In my task I would like the observation to be a tuple of Tensors.
In many places in my codebase, the observation is of course expected to be a single Tensor, not a tuple of Tensors. Is there an easy way to treat a tuple of Tensors as a single Tensor?
For example, I would like:
observation.to(device)
to work as normal when observation is a single Tensor, and call .to(device) on each Tensor when observation is a tuple of Tensors.
It should be simple enough to create a data type that can support this, but I'm wondering does such a data type already exist? I haven't found anything so far.
If your tensors are all of the same size, you can use torch.stack to concatenate them into one tensor with one more dimension.
Example:
>>> import torch
>>> a=torch.randn(2,1)
>>> b=torch.randn(2,1)
>>> c=torch.randn(2,1)
>>> a
tensor([[ 0.7691],
        [-0.0297]])
>>> b
tensor([[ 0.4844],
        [-0.9142]])
>>> c
tensor([[ 0.0210],
        [-1.1543]])
>>> torch.stack((a,b,c))
tensor([[[ 0.7691],
         [-0.0297]],

        [[ 0.4844],
         [-0.9142]],

        [[ 0.0210],
         [-1.1543]]])
You can then use torch.unbind to go the other direction.
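If the tensors have different shapes (e.g. a 1-D state and a 3-D image), stacking is not possible; a minimal sketch of a small helper (not part of PyTorch, the name is illustrative) that applies .to(device) to a single tensor or to every tensor in a tuple:
import torch

def to_device(obs, device):
    # works for a single tensor or for a tuple/list of tensors
    if isinstance(obs, (tuple, list)):
        return type(obs)(t.to(device) for t in obs)
    return obs.to(device)

device = "cuda" if torch.cuda.is_available() else "cpu"
state = torch.randn(8)            # 1-D state observation
image = torch.randn(3, 64, 64)    # 3-D image observation
obs = (state, image)
obs = to_device(obs, device)      # both tensors moved to the device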
