sklearn HalvingGridSearchCV giving inconsistent results - search

The best params are not the same as the highest ranked. Here is the set up.
halving_cv = HalvingGridSearchCV(
knn_model, knn_grid,
scoring="roc_auc",
n_jobs=-1,
min_resources="exhaust",
factor=3,
cv=5, random_state=1234,
refit=True,
)
grid_result = halving_cv.fit(x_trained, y_train)
And here are best params from it.
grid_result.best_params_
{'algorithm': 'ball_tree', 'n_neighbors': 95, 'p': 1, 'weights': 'uniform'}
And here are top 4 ranked combos from grid_result.cv_results_:
params rank_test_score
{'algorithm': 'ball_tree', 'n_neighbors': 55, 'p': 2, 'weights': 'uniform'} 1
{'algorithm': 'ball_tree', 'n_neighbors': 55, 'p': 1, 'weights': 'uniform'} 2
{'algorithm': 'ball_tree', 'n_neighbors': 85, 'p': 2, 'weights': 'uniform'} 3
{'algorithm': 'ball_tree', 'n_neighbors': 15, 'p': 1, 'weights': 'uniform'} 4
The best params are not seen. The question is how best_params_ is obtained?
I was expecting best params to be the highest ranked.

With help of sklearn folks, I believe I have an answer. The table reflects the values as they are computed during halving. Hence, scores are based on sub-samples until the very last iteration. Hence, you might see high test scores on early iterations because of small sample size. It is the results from the last iteration (which uses the full table) that are actually reported as "best".

Related

Make predictions on a dataframe with list categorical columns and other types of data

I have a dataframe that looks like this:
df = {'user_id': [23, 34, 12, 9],
'car_id': [[22, 132, 999], [22, 345, 2], [134], [87, 44, 3, 222]],
'start_date': ['2012-02-17', '2013-11-22', '2013-11-22', '2014-03-15'],
'cat_col1': ['str1', 'str2', 'str3', 'str3'],
'cat_col2': [['str1', 'str2'], ['str4'], ['str5, str1'], ['str6', 'str2']],
'cat_col3': [['str11', 'str22', 'str34'], ['str444'], ['str51, str111'], ['str62', 'str233']],
'num_sold': [23, 43, 111, 23],
'to_predict': [0.4, 0.5, 0.22, 0.9]}
There are around 100 000 unique user_ids and 200 000 unique car_ids and categorical columns have thousands of unique values so OHE is not an option. I need to predict to_predict for a given value of cat_col1, cat_col2, cat_col3 (I need to have their original values at the end for predictions). There is a relationship between those categorical columns but it is not clearly defined. Is it possible to do this in keras with embedding layers perhaps and would that make sense for categorical columns? If so, would it make sense utilise the date column and convert it into time series using LSTMs? Or what would be the best approach for this kind of prediction in general?

Why would a Torchscript trace return different looking encoded_inputs compared to the original Transformer model?

Background
I'm working with a finetuned Mbart50 model that I need sped up for inferencing because using the HuggingFace model as-is is fairly slow with my current hardware. I wanted to use TorchScript because I couldn't get onnx to export this particular model as it seems it will be supported at a later time (I would be glad to be wrong otherwise).
Convert Transformer to a Pytorch trace:
import torch
""" Model data """
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", torchscript= True)
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
tokenizer.src_lang = 'en_XX'
dummy = "To celebrate World Oceans Day, we're swimming through a shoal of jack fish just off the coast of Baja, California, in Cabo Pulmo National Park. This Mexican marine park in the Sea of Cortez is home to the northernmost and oldest coral reef on the west coast of North America, estimated to be about 20,000 years old. Jacks are clearly plentiful here, but divers and snorkelers in Cabo Pulmo can also come across many other species of fish and marine mammals, including several varieties of sharks, whales, dolphins, tortoises, and manta rays."
model.config.forced_bos_token_id=250006
myTokenBatch = tokenizer(dummy, max_length=192, padding='max_length', truncation = True, return_tensors="pt")
torch.jit.save(torch.jit.trace(model, [myTokenBatch.input_ids,myTokenBatch.attention_mask]), "././traced-model/mbart-many.pt")
Inference Step:
import torch
""" Model data """
from transformers import MBart50TokenizerFast
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
model = torch.jit.load('././traced-model/mbart-many.pt')
MAX_LENGTH = 192
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
model.to(device)
model.eval()
tokenizer.src_lang = 'en_XX'
dummy = "To celebrate World Oceans Day, we're swimming through a shoal of jack fish just off the coast of Baja, California, in Cabo Pulmo National Park. This Mexican marine park in the Sea of Cortez is home to the northernmost and oldest coral reef on the west coast of North America, estimated to be about 20,000 years old. Jacks are clearly plentiful here, but divers and snorkelers in Cabo Pulmo can also come across many other species of fish and marine mammals, including several varieties of sharks, whales, dolphins, tortoises, and manta rays."
myTokenBatch = tokenizer(dummy, max_length=192, padding='max_length', truncation = True, return_tensors="pt")
encode, pool , norm = model(myTokenBatch.input_ids,myTokenBatch.attention_mask)
Expected Encoding Output:
These are tokens that can be decoded to words with MBart50TokenizerFast.
tensor([[250004, 717, 176016, 6661, 55609, 7, 10013, 4, 642,
25, 107, 192298, 8305, 10, 15756, 289, 111, 121477,
67155, 1660, 5773, 70, 184085, 111, 118191, 4, 39897,
4, 23, 143740, 21694, 432, 9907, 5227, 5, 3293,
181815, 122084, 9201, 23, 70, 27414, 111, 48892, 169,
83, 5368, 47, 70, 144477, 9022, 840, 18, 136,
10332, 525, 184518, 456, 4240, 98, 70, 65272, 184085,
111, 23924, 21629, 4, 25902, 3674, 47, 186, 1672,
6, 91578, 5369, 10332, 5, 21763, 7, 621, 123019,
32328, 118, 7844, 3688, 4, 1284, 41767, 136, 120379,
2590, 1314, 23, 143740, 21694, 432, 831, 2843, 1380,
36880, 5941, 3789, 114149, 111, 67155, 136, 122084, 21968,
8080, 4, 26719, 40368, 285, 68794, 111, 54524, 1224,
4, 148, 50742, 7, 4, 13111, 19379, 1779, 4,
43807, 125216, 7, 4, 136, 332, 102, 62656, 7,
5, 2, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1]])
Actual Output:
I don't know what this is... print(encode)
(tensor([[[[-9.3383e-02, -2.0395e-01, 4.8226e-03, ..., 1.8068e+00,
1.1528e-01, 7.0406e-02],
[-4.4630e-02, -2.2453e-01, 9.5264e-02, ..., 1.6921e+00,
1.4607e-01, 4.8238e-02],
[-7.8206e-01, 1.2699e-01, 1.6467e+00, ..., -1.7057e+00,
8.7768e-01, 8.2230e-01],
...,
[-1.2145e-02, -2.1855e-03, -6.0966e-03, ..., 2.9296e-02,
2.2141e-03, 3.2074e-02],
[-1.4671e-02, -2.8995e-03, -5.8610e-03, ..., 2.8525e-02,
2.4620e-03, 3.1593e-02],
[-1.5877e-02, -3.5165e-03, -4.8743e-03, ..., 2.8930e-02,
2.9877e-03, 3.3892e-02]]]], grad_fn=<CopyBackwards>))
Found the answer here: https://stackoverflow.com/a/66117248/13568346
You can't directly convert a seq2seq model (encoder-decoder model) using this method. To convert a seq2seq model (encoder-decoder) you have to split them and convert them separately, an encoder to onnx and a decoder to onnx. you can follow this guide (it was done for T5 which is also a seq2seq model). you need to provide a dummy variable to both encoder and to the decoder separately. by default when converting using this method it provides the encoder the dummy variable.

Pytorch how to reshape/reduce the number of filters without altering the shape of the individual filters

With a 3D tensor of shape (number of filters, height, width), how can one reduce the number of filters with a reshape which keeps the original filters together as whole blocks?
Assume the new size has dimensions chosen such that a whole number of the original filters can fit side by side in one of the new filters. So an original size of (4, 2, 2) can be reshaped to (2, 2, 4).
A visual explanation of the side by side reshape where you see the standard reshape will alter the individual filter shapes:
I have tried various pytorch functions such as gather and select_index but not found a way to get to the end result in a general manner (i.e. works for different numbers of filters and different filter sizes).
I think it would be easier to rearrange the tensor values after performing the reshape but could not get a tensor of the pytorch reshaped form:
[[[1,2,3,4],
[5,6,7,8]],
[[9,10,11,12],
[13,14,15,16]]]
to:
[[[1,2,5,6],
[3,4,7,8]],
[[9,10,13,14],
[11,12,15,16]]]
for completeness, the original tensor before reshaping:
[[[1,2],
[3,4]],
[[5,6],
[7,8]],
[[9,10],
[11,12]],
[[13,14],
[15,16]]]
Another option is to construct a list of parts and concatenate them
x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
y = torch.cat([x[i::2] for i in range(2)], dim=2)
print('Before\n', x)
print('After\n', y)
which gives
Before
tensor([[[0, 0],
[0, 0]],
[[1, 1],
[1, 1]],
[[2, 2],
[2, 2]],
[[3, 3],
[3, 3]]])
After
tensor([[[0, 0, 1, 1],
[0, 0, 1, 1]],
[[2, 2, 3, 3],
[2, 2, 3, 3]]])
Or a little more generally we could write a function that takes groups of neighbors along a source dimension and concatenates them along a destination dimension
def group_neighbors(x, group_size, src_dim, dst_dim):
assert x.shape[src_dim] % group_size == 0
return torch.cat([x[[slice(None)] * (src_dim) + [slice(i, None, group_size)] + [slice(None)] * (len(x.shape) - (src_dim + 2))] for i in range(group_size)], dim=dst_dim)
x = torch.arange(4).reshape(4, 1, 1).repeat(1, 2, 2)
# read as "take neighbors in groups of 2 from dimension 0 and concatenate them in dimension 2"
y = group_neighbors(x, group_size=2, src_dim=0, dst_dim=2)
print('Before\n', x)
print('After\n', y)
You could do it by chunking tensor and then recombining.
def side_by_side_reshape(x):
n_pairs = x.shape[0] // 2
filter_size = x.shape[-1]
x = x.reshape((n_pairs, 2, filter_size, filter_size))
return torch.stack(list(map(lambda x: torch.hstack(x.unbind()), k)))
>> p = torch.arange(1, 91).reshape((10, 3, 3))
>> side_by_side_reshape(p)
tensor([[[ 1, 2, 3, 10, 11, 12],
[ 4, 5, 6, 13, 14, 15],
[ 7, 8, 9, 16, 17, 18]],
[[19, 20, 21, 28, 29, 30],
[22, 23, 24, 31, 32, 33],
[25, 26, 27, 34, 35, 36]],
[[37, 38, 39, 46, 47, 48],
[40, 41, 42, 49, 50, 51],
[43, 44, 45, 52, 53, 54]],
[[55, 56, 57, 64, 65, 66],
[58, 59, 60, 67, 68, 69],
[61, 62, 63, 70, 71, 72]],
[[73, 74, 75, 82, 83, 84],
[76, 77, 78, 85, 86, 87],
[79, 80, 81, 88, 89, 90]]])
but I know it's not ideal since there is map, list and unbind which disrupts memory. This is what I offer till I figure out how to do it via view only (so a real reshape)

Input missed data in DF with predicted data

I have a data 200 cols and 30k rows. I have a missing data and I'd like to predict it to fill in the missing data. I want to predict None values and put the predicted data there.
I want to split data by indexes, train model on Known data, predict Unknown values, join Known and Predicted values and return them back to data on exactly the same places.
P.S. Median, dropna and other methods are not interesting, just prediction of missed values.
df = {'First' : [30, 22, 18, 49, 22], 'Second' : [80, 28, 16, 56, 30], 'Third' : [14, None, None, 30, 27], 'Fourth' : [14, 85, 17, 22, 14], 'Fifth' : [22, 33, 45, 72, 11]}
df = pd.DataFrame(df, columns = ['First', 'Second', 'Third', 'Fourth'])
Same DF with all cols comleated by data.
I do not really understand your question as well but I might have an idea for you. Have a look at the fancyimpute package. This package offers you imputation methods based on predictive models (e.g. KNN). Hope this will solve your question.
It is hard to understand the question. However, it seems like you may be interested in this question and the answer.
Using a custom function Series in fillna
Basically (from the link), you would
create a column with predicted values
use fillna with that column as parameter

How to slice a Tensor by another Tensor in Keras?

I have a Keras network with two inputs:
image of shape (128, 128, 3)
bounding-box of shape (4), i.e. (x0, y0, x1, y1)
In my network definition, I need to include the extraction of the image patch defined by the bounding-box from the input image, but I do not know how (or my attempts did not work). Here is my current attempt to achieve this, can someone please help me to understand slicing Tensors by Values of other Tensors in Keras?
# get masked image and bounding box information as inputs
masked_img = Input(shape=self.input_shape)
mask_bounding_box = Input(shape=(4,))
# fill in the masked region and extract the fill-in region
filled_img = self.generator(masked_img)
fill_in = K.slice(filled_img, (int(mask_bounding_box[0]), int(mask_bounding_box[1])),
(int(mask_bounding_box[2]), int(mask_bounding_box[3])))
Does anybody know how to do this? Any hint in the right direction would help me, please ...
Thanks in advance!
here's a native numpy solution.
import numpy as np
a = np.arange(48).reshape(3,4,4)
a
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]],
[[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31]],
[[32, 33, 34, 35],
[36, 37, 38, 39],
[40, 41, 42, 43],
[44, 45, 46, 47]]])
box = (1,1,2,2) # slicing from (1,1) to (2,2)
b = a[:, box[0]:box[2]+1, box[1]:box[3]+1] # slicing on all channels
b
array([[[ 5, 6],
[ 9, 10]],
[[21, 22],
[25, 26]],
[[37, 38],
[41, 42]]])
Keras.backend.slice() requires starts and offsets, so you could do it like this:
import keras.backend as K
start=(0,1,1) # 1st channel, x1, y1
sizes=(3,2,2) # number of channels, x2-x1+1, y2-y1+1
with sess.as_default():
b=K.slice(a, start, sizes)
print(b.eval())
[[[ 5 6]
[ 9 10]]
[[21 22]
[25 26]]
[[37 38]
[41 42]]]

Resources