Hyperopt tuning parameters get stuck - python-3.x

I'm trying to tune the parameters of an SVM with the hyperopt library.
Often, when I execute this code, the progress bar stops and the code gets stuck.
I do not understand why.
Here is my code:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn import svm
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import normalize

X_train = normalize(X_train)

def hyperopt_train_test(params):
    # break_ties=True is only allowed with decision_function_shape='ovr',
    # so force it off when 'ovo' is sampled
    if 'decision_function_shape' in params:
        if params['decision_function_shape'] == "ovo":
            params['break_ties'] = False
    clf = svm.SVC(**params)
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    return precision_recall_fscore_support(y_test, y_pred, average='macro')[0]

space4svm = {
    'C': hp.uniform('C', 0, 20),
    'kernel': hp.choice('kernel', ['linear', 'sigmoid', 'poly', 'rbf']),
    'degree': hp.uniform('degree', 10, 30),
    'gamma': hp.uniform('gamma', 10, 30),
    'coef0': hp.uniform('coef0', 15, 30),
    'shrinking': hp.choice('shrinking', [True, False]),
    'probability': hp.choice('probability', [True, False]),
    'tol': hp.uniform('tol', 0, 3),
    'decision_function_shape': hp.choice('decision_function_shape', ['ovo', 'ovr']),
    'break_ties': hp.choice('break_ties', [True, False])
}

def f(params):
    print(params)
    precision = hyperopt_train_test(params)
    return {'loss': -precision, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4svm, algo=tpe.suggest, max_evals=35, trials=trials)
print('best:')
print(best)

I would suggest restricting your parameter space and seeing if that works. Fix the probability parameter to False and check whether the model trains; probability=True runs an internal cross-validation inside SVC.fit and slows every fit down considerably. Also, according to the documentation gamma should be 'scale', 'auto' or a small positive float; values drawn uniformly from 10-30 (and likewise degree from 10-30 for the poly kernel) can make a single fit take extremely long, which looks like the run being stuck.
Also, print out your params at every iteration to better understand which combination is causing the model to hang.
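As an illustration only (a minimal sketch with assumed, narrower ranges, not the definitive bounds for your data), a restricted space could look like this:
space4svm = {
    'C': hp.uniform('C', 0.1, 10),
    'kernel': hp.choice('kernel', ['linear', 'rbf']),
    'gamma': hp.choice('gamma', ['scale', 'auto']),
    'tol': hp.uniform('tol', 1e-5, 1e-2),
    'shrinking': hp.choice('shrinking', [True, False]),
    'decision_function_shape': hp.choice('decision_function_shape', ['ovo', 'ovr'])
}
# probability is deliberately omitted (it defaults to False): probability=True runs an
# internal cross-validation inside SVC.fit and is a common cause of very slow fits.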

Related

tune hyperparameters of XGBRanker

I'm trying to optimize the hyperparameters of my XGBoost Ranker model, but I can't.
Here is what my table (df in the code) looks like:
query   relevance   features
1       5           5.4.7....
1       3           6........
2       5           3........
2       3           8........
3       2           1........
Then I split my table into train and test sets, with only one query in the test set:
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(test_size=1, n_splits=1).split(df, groups=df['query'])
X_train_inds, X_test_inds = next(gss)
train_data = df.iloc[X_train_inds]
X_train = train_data.drop(columns=["relevance"])
Y_train = train_data.relevance
test_data = df.iloc[X_test_inds]
X_test = test_data.drop(columns=["relevance"])
Y_test = test_data.relevance
and build groups, which holds the number of rows per query:
groups = train_data.groupby('query').size().to_frame('size')['size'].to_numpy()
And then I run my model and try to optimize the hyperparameters with a RandomizedSearchCV:
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
import sklearn.metrics
import xgboost as xgb

param_dist = {'n_estimators': randint(40, 1000),
              'learning_rate': uniform(0.01, 0.59),
              'subsample': uniform(0.3, 0.6),
              'max_depth': [3, 4, 5, 6, 7, 8, 9],
              'colsample_bytree': uniform(0.5, 0.4),
              'min_child_weight': [0.05, 0.1, 0.02]
              }
scoring = sklearn.metrics.make_scorer(sklearn.metrics.ndcg_score, k=10,
                                      greater_is_better=True)
model = xgb.XGBRanker(
    tree_method='hist',
    booster='gbtree',
    objective='rank:ndcg')
clf = RandomizedSearchCV(model,
                         param_distributions=param_dist,
                         cv=5,
                         n_iter=5,
                         scoring=scoring,
                         error_score=0,
                         verbose=3,
                         n_jobs=-1)
clf.fit(X_train, Y_train, group=groups)
Then I get the following error message, which seems to be related to my construction of groups, but I don't see why (note that without the random search the model works):
Check failed: group_ptr_.back() == num_row_ (11544 vs. 9235) : Invalid group structure. Number of rows obtained from groups doesn't equal to actual number of rows given by data.
Same problem as here: (Tuning XGBRanker produces error for groups)
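For context (an illustrative check, not part of the original post): the failed assertion compares the sum of the group sizes against the number of rows actually handed to fit. With RandomizedSearchCV each CV split trains on only a subset of X_train (roughly 4/5 of the rows with cv=5, which matches 9235 out of 11544 above), while group=groups still describes the full training set, so the two counts no longer match:
# sanity check that holds for a plain model.fit(X_train, Y_train, group=groups)
assert sum(groups) == X_train.shape[0]
# inside RandomizedSearchCV each fold fits on about len(X_train) * (cv - 1) / cv rows,
# but the full-length groups array is still passed along, hence the error above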

How to pass group information in sklearn Random Grid Search for XGBRanker?

When I'm trying to perform a random grid search on the XGBRanker model, I keep getting an error as follows:
/workspace/src/objective/rank_obj.cc:52: Check failed: gptr.size() != 0 && gptr.back() == info.labels_.Size(): group structure not consistent with #rows
The error seems to be about the structure of the group information passed. I'm passing the size of each group: if there are N rows and 2 groups, then the array passed would be [g1_size, g2_size].
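(Illustrative only, reusing the 'query' column convention from the previous question: such a group-size array is usually derived from a query-id column, with rows of the same query kept contiguous.)
# rows per query, assuming the training data is sorted by its query id
groups = train_data.groupby('query').size().to_numpy()  # array([g1_size, g2_size, ...])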
I'm not sure where I'm going wrong, since I'm able to fit the model without any issues; I only face this error when I try to perform RandomizedSearchCV. The code snippet is as follows:
import xgboost as xgb
from sklearn.metrics import make_scorer, ndcg_score
from sklearn.model_selection import RandomizedSearchCV

model = xgb.XGBRanker(
    objective="rank:ndcg",
    max_depth=10,
    n_estimators=100,
    verbosity=1)
param_dist = {'n_estimators': [100, 200, 300],
              'learning_rate': [1e-3, 1e-4, 1e-5],
              'subsample': [0.8, 0.9, 1],
              'max_depth': [5, 6, 7]
              }
fit_params = {"group": groups}
scoring = make_scorer(ndcg_score, greater_is_better=True)
clf = RandomizedSearchCV(model,
                         param_distributions=param_dist,
                         cv=5,
                         n_iter=5,
                         scoring=scoring,
                         error_score=0,
                         verbose=3,
                         n_jobs=-1)
clf.fit(X_train, Y_train, **fit_params)

Implementing word dropout in pytorch

I want to add word dropout to my network so that I can have sufficient training examples for training the embedding of the "unk" token. As far as I'm aware, this is standard practice. Let's assume the index of the unk token is 0, and the index for padding is 1 (we can switch them if that's more convenient).
This is a simple CNN network which implements word dropout the way I would have expected it to work:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    def __init__(self, params):
        super(Classifier, self).__init__()
        self.params = params
        self.word_dropout = nn.Dropout(params["word_dropout"])
        self.pad = torch.nn.ConstantPad1d(max(params["window_sizes"]) - 1, 1)
        self.embedding = nn.Embedding(params["vocab_size"], params["word_dim"], padding_idx=1)
        self.convs = nn.ModuleList([nn.Conv1d(1, params["feature_num"], params["word_dim"] * window_size,
                                              stride=params["word_dim"], bias=False)
                                    for window_size in params["window_sizes"]])
        self.dropout = nn.Dropout(params["dropout"])
        self.fc = nn.Linear(params["feature_num"] * len(params["window_sizes"]), params["num_classes"])

    def forward(self, x, l):
        x = self.word_dropout(x)
        x = self.pad(x)
        embedded_x = self.embedding(x)
        embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"])  # [batch_size, 1, seq_len * word_dim]
        features = [F.relu(conv(embedded_x)) for conv in self.convs]
        pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, self.params["feature_num"]) for feat in features]
        pooled = torch.cat(pooled, 1)
        pooled = self.dropout(pooled)
        logit = self.fc(pooled)
        return logit
Don't mind the padding - pytorch doesn't have an easy way of using non-zero padding in CNNs, much less trainable non-zero padding, so I'm doing it manually. Dropout also doesn't let me drop out to a non-zero value, and I want to keep the padding token separate from the unk token. I'm keeping it in my example because it's the reason for this question's existence.
This doesn't work because dropout wants Float Tensors so that it can scale them properly, while my input is Long Tensors that don't need to be scaled.
Is there an easy way of doing this in pytorch? I essentially want to use LongTensor-friendly dropout (bonus: better if it will let me specify a dropout constant that isn't 0, so that I could use zero padding).
Actually I would do it outside of your model, before converting your input into a LongTensor.
This would look like this:
import random

def add_unk(input_token_id, p):
    # random.random() gives you a value between 0 and 1
    # to avoid switching your padding to 0 we add 'input_token_id > 1'
    if random.random() < p and input_token_id > 1:
        return 0
    else:
        return input_token_id

# then you have your input token_id
# for this example I take just a random number, let's say 127
input_token_id = 127
# let p be your probability for UNK
p = 0.01
your_input_tensor = torch.LongTensor([add_unk(input_token_id, p)])
Edit:
There are two options which come to my mind, both of which are actually GPU-friendly. In general both solutions should be much more efficient.
Option one - doing the computation directly in forward():
If you're not using torch.utils and don't plan on using it later, this is probably the way to go.
Instead of doing the computation beforehand, we just do it in the forward() method of the main PyTorch class. However, I see no (simple) way of doing this in torch 0.3.1, so you would need to upgrade to version 0.4.0:
So imagine x is your input vector:
>>> x = torch.tensor(range(10))
>>> x
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
probs is a vector containing uniform probabilities for dropout, so we can check later against our dropout probability:
>>> probs = torch.empty(10).uniform_(0, 1)
>>> probs
tensor([ 0.9793, 0.1742, 0.0904, 0.8735, 0.4774, 0.2329, 0.0074,
0.5398, 0.4681, 0.5314])
Now we apply the dropout probabilities probs on our input x:
>>> torch.where(probs > 0.2, x, torch.zeros(10, dtype=torch.int64))
tensor([ 0, 0, 0, 3, 4, 5, 0, 7, 8, 9])
Note: To see some effect I chose a dropout probability of 0.2 here. In reality you probably want it to be smaller.
You can pick for this any token / id you like, here is an example with 42 as unknown token id:
>>> unk_token = 42
>>> torch.where(probs > 0.2, x, torch.empty(10, dtype=torch.int64).fill_(unk_token))
tensor([ 0, 42, 42, 3, 4, 5, 42, 7, 8, 9])
torch.where comes with PyTorch 0.4.0:
https://pytorch.org/docs/master/torch.html#torch.where
I don't know about the shapes of your network, but your forward() should look something like this then (when using mini-batching you need to flatten the input before applying dropout):
def forward_train(self, x, l):
    # probabilities
    probs = torch.empty(x.size(0)).uniform_(0, 1)
    # applying word dropout
    x = torch.where(probs > 0.02, x, torch.zeros(x.size(0), dtype=torch.int64))
    # continue like before ...
    x = self.pad(x)
    embedded_x = self.embedding(x)
    embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"])  # [batch_size, 1, seq_len * word_dim]
    features = [F.relu(conv(embedded_x)) for conv in self.convs]
    pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, self.params["feature_num"]) for feat in features]
    pooled = torch.cat(pooled, 1)
    pooled = self.dropout(pooled)
    logit = self.fc(pooled)
    return logit
Note: I named the function forward_train(), so you should use another forward() without dropout for evaluation/prediction. But you could also use an if condition on the training flag set by train()/eval(), as sketched below.
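A minimal sketch of that second variant (assuming the same attributes as above; self.training is toggled by model.train() / model.eval()):
def forward(self, x, l):
    if self.training:
        # apply word dropout only in training mode
        probs = torch.empty(x.size(0)).uniform_(0, 1)
        x = torch.where(probs > 0.02, x, torch.zeros(x.size(0), dtype=torch.int64))
    x = self.pad(x)
    embedded_x = self.embedding(x)
    # ... continue exactly as in forward_train() above ...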
Option two: using torch.utils.data.Dataset:
If you're using the Dataset provided by torch.utils, it is very easy to do this kind of pre-processing efficiently. When the Dataset is loaded through a DataLoader with multiple workers, samples are prepared in parallel processes, so the code sample above just has to be executed in the __getitem__ method of your Dataset class.
This could look like this:
def __getitem__(self, index):
    'Generates one sample of data'
    # Select sample
    ID = self.input_tokens[index]
    # Load data and get label
    # using the add_unk function from the code above
    X = torch.LongTensor([add_unk(ID, p=0.01)])
    y = self.targets[index]
    return X, y
This is a bit out of context and doesn't look very elegant, but I think you get the idea. According to this blog post by Shervine Amidi at Stanford, it should be no problem to do more complex pre-processing steps in this function:
Since our code [the Dataset is meant] is designed to be multicore-friendly, note that you can do more complex operations instead (e.g. computations from source files) without worrying that data generation becomes a bottleneck in the training process.
The linked blog post - "A detailed example of how to generate your data in parallel with PyTorch" - provides also a good guide for implementing the data generation with Dataset and DataLoader.
I guess you'll prefer option one - only two lines and it should be very efficient. :)
Good luck!

Custom loss function in Keras, how to deal with placeholders

I am trying to write a custom loss function in TF/Keras. The loss function works if it is run in a session and passed constants; however, it stops working when compiled into a Keras model.
The cost function (thanks to Lior for converting it to TF)
import tensorflow as tf
from keras import backend as K

def ginicTF(actual, pred):
    n = int(actual.get_shape()[-1])
    inds = K.reverse(tf.nn.top_k(pred, n)[1], axes=[0])
    a_s = K.gather(actual, inds)
    a_c = K.cumsum(a_s)
    giniSum = K.sum(a_c) / K.sum(a_s) - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedTF(a, p):
    return -ginicTF(a, p) / ginicTF(a, a)

# Test the cost function
sess = tf.InteractiveSession()
p = [0.9, 0.3, 0.8, 0.75, 0.65, 0.6, 0.78, 0.7, 0.05, 0.4, 0.4, 0.05, 0.5, 0.1, 0.1]
a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ac = tf.placeholder(shape=(len(a),), dtype=K.floatx())
pr = tf.placeholder(shape=(len(p),), dtype=K.floatx())
print(gini_normalizedTF(ac, pr).eval(feed_dict={ac: a, pr: p}))
this prints -0.62962962963, which is the correct value.
Now let's put this into a Keras MLP:
from keras.models import Sequential
from keras import layers

def makeModel(n_feat):
    model = Sequential()
    # hidden layer #1
    model.add(layers.Dense(12, input_shape=(n_feat,)))
    model.add(layers.Activation('selu'))
    model.add(layers.Dropout(0.2))
    # output layer
    model.add(layers.Dense(1))
    model.add(layers.Activation('softmax'))
    model.compile(loss=gini_normalizedTF, optimizer='sgd', metrics=['binary_accuracy'])
    return model

model = makeModel(n_feats)
model.fit(x=Mout, y=targets, epochs=n_epochs, validation_split=0.2, batch_size=batch_size)
This generates the error:
<ipython-input-62-6ade7307336f> in ginicTF(actual, pred)
9 def ginicTF(actual,pred):
10
---> 11 n = int(actual.get_shape()[-1])
12
13 inds = K.reverse(tf.nn.top_k(pred,n)[1],axes=[0])
TypeError: __int__ returned non-int (type NoneType)
I tried to work around it by giving n a default value, etc., but this doesn't seem to be going anywhere.
Can someone explain the nature of this problem and how I can remedy it?
Thank you!
Edit:
I updated things to keep n as a tensor and then cast it:
def ginicTF(actual, pred):
    nT = K.shape(actual)[-1]
    n = K.cast(nT, dtype='int32')
    inds = K.reverse(tf.nn.top_k(pred, n)[1], axes=[0])
    a_s = K.gather(actual, inds)
    a_c = K.cumsum(a_s)
    n = K.cast(nT, dtype=K.floatx())
    giniSum = K.cast(K.sum(a_c) / K.sum(a_s), dtype=K.floatx()) - (n + 1) / 2.0
    return giniSum / n

def gini_normalizedTF(a, p):
    return ginicTF(a, p) / ginicTF(a, a)
It still has the issue of getting None when used as a cost function.
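For context (an illustrative sketch, not from the original post): when the loss is compiled into a model, Keras feeds it symbolic tensors whose dimensions may still be undefined, so static shape lookups return None even though they worked on the hand-built placeholders above; only the dynamic shape (K.shape / tf.shape) is resolved at run time:
# y_true as Keras builds it internally can have unspecified dimensions
y_true = K.placeholder(shape=(None, None), dtype=K.floatx())
print(y_true.get_shape()[-1])    # undefined -> int(...) raises the TypeError above
n = K.shape(y_true)[-1]          # dynamic alternative: a scalar tensor, known only at run time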

How to get all parameters of estimator in PySpark

I have a RandomForestRegressor and a GBTRegressor, and I'd like to get all of their parameters. The only way I found is with several get methods like:
from pyspark.ml.regression import RandomForestRegressor, GBTRegressor
est = RandomForestRegressor()
est.getMaxDepth()
est.getSeed()
But RandomForestRegressor and GBTRegressor have different parameters, so it's not a good idea to hardcode all those methods.
A workaround could be something like this:
get_methods = [method for method in dir(est) if method.startswith('get')]
params_est = {}
for method in get_methods:
    try:
        key = method[3:]
        params_est[key] = getattr(est, method)()
    except TypeError:
        pass
Then the output will look like this:
params_est
{'CacheNodeIds': False,
'CheckpointInterval': 10,
'FeatureSubsetStrategy': 'auto',
'FeaturesCol': 'features',
'Impurity': 'variance',
'LabelCol': 'label',
'MaxBins': 32,
'MaxDepth': 5,
'MaxMemoryInMB': 256,
'MinInfoGain': 0.0,
'MinInstancesPerNode': 1,
'NumTrees': 20,
'PredictionCol': 'prediction',
'Seed': None,
'SubsamplingRate': 1.0}
But I think there should be a better way to do that.
extractParamMap can be used to get all params from every estimator, for example:
>>> est = RandomForestRegressor()
>>> {param[0].name: param[1] for param in est.extractParamMap().items()}
{'numTrees': 20, 'cacheNodeIds': False, 'impurity': 'variance', 'predictionCol': 'prediction', 'labelCol': 'label', 'featuresCol': 'features', 'minInstancesPerNode': 1, 'seed': -5851613654371098793, 'maxDepth': 5, 'featureSubsetStrategy': 'auto', 'minInfoGain': 0.0, 'checkpointInterval': 10, 'subsamplingRate': 1.0, 'maxMemoryInMB': 256, 'maxBins': 32}
>>> est = GBTRegressor()
>>> {param[0].name: param[1] for param in est.extractParamMap().items()}
{'cacheNodeIds': False, 'impurity': 'variance', 'predictionCol': 'prediction', 'labelCol': 'label', 'featuresCol': 'features', 'stepSize': 0.1, 'minInstancesPerNode': 1, 'seed': -6363326153609583521, 'maxDepth': 5, 'maxIter': 20, 'minInfoGain': 0.0, 'checkpointInterval': 10, 'subsamplingRate': 1.0, 'maxMemoryInMB': 256, 'lossType': 'squared', 'maxBins': 32}
As described in How to print best model params in pyspark pipeline, you can get any model parameter that is available in the original JVM object of any model using the following structure:
<yourModel>.stages[<yourModelStage>]._java_obj.<getYourParameter>()
All get-parameters are available here
https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/RandomForestClassificationModel.html
For example, if you want to get MaxDepth of your RandomForest after cross-validation (getMaxDepth is not available in PySpark) you use
cvModel.bestModel.stages[-1]._java_obj.getMaxDepth()
