Hyperparameter search with Gridsearch giving parameter values that don't work - python-3.x

I am running a hyperparameter search with scikit-learn's GridSearch using a CountVectorizer and a RandomForestClassifier. The hyperparameter search grid looks like this:
grid = {
    'vectorizer__ngram_range': [(1, 1)],
    'vectorizer__stop_words': [None, german_stop_words],
    'vectorizer__max_df': [0.25, 0.5, 0.75, 1],
    'vectorizer__min_df': [0.01, 0.1, 1, 5, 10],
    'vectorizer__max_features': [None, 100, 1000, 1500],
    'classifier__class_weight': ['balanced', 'balanced_subsample', None],
    'classifier__n_jobs': [-1],
    'classifier__n_estimators': [100, 190, 250]
}
The grid search runs to the end and gives me a best_params_ result. I have run it several times and get different results each time. During the run I sometimes get warnings like these:
warnings.warn("Estimator fit failed. The score on this train-test"
/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:548: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 330, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 292, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "/root/complex_semantics/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
return self.func(*args, **kwargs)
File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 740, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 1213, in fit_transform
raise ValueError(
ValueError: max_df corresponds to < documents than min_df
Which I assume is normal, since some value combinations don't go well together. But a couple of times, after getting the best params and running the model with them, I get an error telling me that the values of max_df and min_df are invalid because the number of documents selected by max_df is lower than the number selected by min_df.
How come it runs correctly during the hyperparameter search with the same dataset, but not in the normal run?
Any ideas? Is there a way to avoid this?
This is the code for the GridSearch:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])
scoring_function = make_scorer(matthews_corrcoef)
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring=scoring_function, n_jobs=-1, cv=5)
grid_search.fit(X=train_text, y=train_labels)
print("-----------")
print(grid_search.best_score_)
print(grid_search.best_params_)

The problem is that some of your max_df values end up corresponding to fewer documents than your min_df values.
The default max_df is 1.0, which means ignore terms that appear in more than 100% of the documents.
min_df is used for removing terms that appear too infrequently.
Let's see what that translates to in your case.
'vectorizer__max_df': [0.25, 0.5, 0.75, 1],
'vectorizer__min_df': [0.01, 0.1, 1, 5, 10],
Let's see an example.
max_df = 0.25 means "ignore terms that appear in more than 25% of the documents"
min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
The issue I am seeing is with 5 and 10 in min_df.
min_df = 5 means "ignore terms that appear in less than 5 documents".
min_df = 10 means "ignore terms that appear in less than 10 documents".
The error even tells you about this: ValueError: max_df corresponds to < documents than min_df. It most likely comes from using 5 or 10 in min_df: those are absolute document counts, so whenever a fractional max_df such as 0.25 translates to fewer documents than 5 or 10, the check fails.
So I would suggest just sticking to float values (proportions) for both max_df and min_df, and perhaps using [0.01, 0.1, 0.2] for vectorizer__min_df.
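If you do want to keep the absolute (integer) min_df values, one option is to drop the combinations that CountVectorizer would reject before handing the grid to GridSearchCV. This is just a sketch, not from the original question; it assumes 5-fold CV as in your code, so each fit sees roughly 4/5 of train_text:
from sklearn.model_selection import ParameterGrid

# rough size of the training portion of one 5-fold CV split
n_docs_per_fit = int(len(train_text) * 4 / 5)

def is_valid(params, n_docs):
    # mirror CountVectorizer's rule: floats are proportions of the corpus,
    # integers are absolute document counts
    max_df = params['vectorizer__max_df']
    min_df = params['vectorizer__min_df']
    max_doc_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_doc_count = min_df * n_docs if isinstance(min_df, float) else min_df
    return max_doc_count >= min_doc_count

# keep only the parameter combinations that CountVectorizer will accept,
# wrapping each value in a list so GridSearchCV still accepts the grid
valid_grid = [
    {key: [value] for key, value in params.items()}
    for params in ParameterGrid(grid)
    if is_valid(params, n_docs_per_fit)
]

grid_search = GridSearchCV(pipeline, param_grid=valid_grid,
                           scoring=scoring_function, n_jobs=-1, cv=5)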

Related

HuggingFace-Transformers --- NER single sentence/sample prediction

I am trying to predict with the NER model from the huggingface tutorial (it contains only the training + evaluation part).
I am following this exact tutorial here : https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb
The training works flawlessly, but my problems begin when I try to predict on a simple sample.
from transformers import AutoTokenizer, AutoModel

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
loaded_model = AutoModel.from_pretrained('./my_model_own_custom_training.pth',
                                         from_tf=False)
input_sentence = "John Nash is a great mathematician, he lives in France"
tokenized_input_sentence = tokenizer([input_sentence],
                                     truncation=True,
                                     is_split_into_words=False,
                                     return_tensors='pt')
predictions = loaded_model(tokenized_input_sentence["input_ids"])[0]
predictions is of shape (1, 13, 768).
How can I arrive at a final result of the form [JOHN <-> 'B-PER', … France <-> 'B-LOC'], where B-PER and B-LOC are two ground-truth labels, representing the tags for a person and a location respectively?
The result of the prediction is:
torch.Size([1, 13, 768])
If I write:
print(predictions.argmax(axis=2))
tensor([613, 705, 244, 620, 206, 206, 206, 620, 620, 620, 477, 693, 308])
I get the tensor above.
However I would have expected to get the tensor representing the ground truth [0…8] labels from the ground truth annotations.
Summary when loading the model:
loading configuration file ./my_model_own_custom_training.pth/config.json
Model config DistilBertConfig {
  "name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights": true,
  "transformers_version": "4.8.1",
  "vocab_size": 30522
}
The answer is a bit trickier than expected [huge credit to Niels Rogge].
Firstly, loading models in huggingface-transformers can be done in (at least) two ways:
AutoModel.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
AutoModelForTokenClassification.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
It seems that different AutoModel subclasses need to be used depending on the task at hand. In the scenario I posted, it is AutoModelForTokenClassification that has to be used.
After that, a solution to obtain the predictions would be to do the following:
# forward pass
outputs = model(**encoding)
logits = outputs.logits
predictions = logits.argmax(-1)
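Putting the pieces together, a minimal sketch of the full prediction path (the path and sentence are taken from the question; whether you get B-PER etc. instead of LABEL_x depends on the id2label mapping saved with your model):
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained('./my_model_own_custom_training.pth')

encoding = tokenizer("John Nash is a great mathematician, he lives in France",
                     return_tensors="pt")

outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)   # shape (1, seq_len): one class id per token

# map each token to the label string stored in the model config
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
labels = [model.config.id2label[i] for i in predictions[0].tolist()]
print(list(zip(tokens, labels)))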

AxisError: axis 1 is out of bounds for array of dimension 1 when calculating AUC

I have a classification problem where I have the pixel values of an 8x8 image and the number the image represents, and my task is to predict the number (the 'Number' attribute) from the pixel values using RandomForestClassifier. The number can be 0-9.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
forest_model.fit(train_df[input_var], train_df[target])
test_df['forest_pred'] = forest_model.predict_proba(test_df[input_var])[:,1]
roc_auc_score(test_df['Number'], test_df['forest_pred'], average = 'macro', multi_class="ovr")
Here it throws an AxisError.
Traceback (most recent call last):
File "dap_hazi_4.py", line 44, in
roc_auc_score(test_df['Number'], test_df['forest_pred'], average = 'macro', multi_class="ovo")
File "/home/balint/.local/lib/python3.6/site-packages/sklearn/metrics/_ranking.py", line 383, in roc_auc_score
multi_class, average, sample_weight)
File "/home/balint/.local/lib/python3.6/site-packages/sklearn/metrics/_ranking.py", line 440, in _multiclass_roc_auc_score
if not np.allclose(1, y_score.sum(axis=1)):
File "/home/balint/.local/lib/python3.6/site-packages/numpy/core/_methods.py", line 38, in _sum
return umr_sum(a, axis, dtype, out, keepdims, initial, where)
AxisError: axis 1 is out of bounds for array of dimension 1
The error is due to the multi-class problem you are solving, as others suggested. All you need to do is, instead of predicting the class, predict the probabilities. I had this same problem before, and doing this solves it.
Here is how to do it -
# you might be predicting the class this way
pred = clf.predict(X_valid)
# change it to predict the probabilities which solves the AxisError problem.
pred_prob = clf.predict_proba(X_valid)
roc_auc_score(y_valid, pred_prob, multi_class='ovr')
0.8164900342274142
# shape before
pred.shape
(256,)
pred[:5]
array([1, 2, 1, 1, 2])
# shape after
pred_prob.shape
(256, 3)
pred_prob[:5]
array([[0. , 1. , 0. ],
[0.02, 0.12, 0.86],
[0. , 0.97, 0.03],
[0. , 0.8 , 0.2 ],
[0. , 0.42, 0.58]])
Actually, as your problem is multi-class, the labels must be one-hot encoded.
When labels are one-hot encoded, the 'multi_class' argument works.
By providing one-hot encoded labels you can resolve the error.
Suppose you have 100 test labels with 5 unique classes; then your (test label) matrix size must be (100, 5), NOT (100, 1).
Are you sure this [:,1] in test_df['forest_pred'] = forest_model.predict_proba(test_df[input_var])[:,1] is right? It keeps only a single column, so it's probably a 1D array.
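Applied to the setup in the question, the fix would look roughly like this (a sketch, not tested against the original data): pass the full probability matrix instead of one sliced column.
# shape (n_samples, 10) for digits 0-9, one column per class
probs = forest_model.predict_proba(test_df[input_var])
print(roc_auc_score(test_df['Number'], probs, average='macro', multi_class='ovr'))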

Bugs when fitting Multi label text classification models

I am now trying to fit a classification model for a multi-label text classification problem.
I have a train set X_train that contains a list of cleaned texts, like
["I am constructing Markov chains with to states and inferring
transition probabilities empirically by simply counting how many
times I saw each transition in my raw data",
"I know the chips only of the players of my table and mine obviously I
also know the total number of chips the max and min amount chips the
players have and the average stackIs it possible to make an
approximation of my probability of winningI have,
...]
and a train set y of tag lists, one per text in X_train, like
[['hypothesis-testing', 'statistical-significance', 'markov-process'],
['probability', 'normal-distribution', 'games'],
...]
Now I want to fit a model that can predict the tags for a text set X_test that has the same format as X_train.
I have used MultiLabelBinarizer to convert the tags and TfidfVectorizer to convert the cleaned text in the train set.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(y)
Y = multilabel_binarizer.transform(y)
vectorizer = TfidfVectorizer(stop_words=stopWordList)
vectorizer.fit(X_train)
x_train = vectorizer.transform(X_train)
But when I try to fit the model I always get errors. I have tried OneVsRestClassifier and LogisticRegression.
When I fit a OneVsRestClassifier model I got errors like
Traceback (most recent call last):
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/opt/conda/envs/data3/lib/python3.6/socketserver.py", line 696, in __init__
self.handle()
File "/usr/local/spark/python/pyspark/accumulators.py", line 268, in handle
poll(accum_updates)
File "/usr/local/spark/python/pyspark/accumulators.py", line 241, in poll
if func():
File "/usr/local/spark/python/pyspark/accumulators.py", line 245, in accum_updates
num_updates = read_int(self.rfile)
File "/usr/local/spark/python/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
When I fit a LogisticRegression model I got warnings like
/opt/conda/envs/data3/lib/python3.6/site-packages/sklearn/linear_model/sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
"the coef_ did not converge", ConvergenceWarning)
Anyone knows where the problem is and how to solve this? Many thanks.
OneVsRestClassifier fits one classifier per class. You need to tell it which type of base classifier you want (for example, logistic regression).
The following code works for me:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
classifier = OneVsRestClassifier(LogisticRegression())
classifier.fit(x_train, Y)
X_test= ["I play with Markov chains"]
x_test = vectorizer.transform(X_test)
classifier.predict(x_test)
output: array([[0, 1, 1, 0, 0, 1]])
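If you then want the tag names instead of the 0/1 indicator rows, the fitted MultiLabelBinarizer can map the predictions back (a small follow-up sketch, not part of the original answer):
predictions = classifier.predict(x_test)
# each row of 0/1 indicators becomes a tuple of tag names
print(multilabel_binarizer.inverse_transform(predictions))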

scikit-learn: building a learning curve with SVC

I'm trying to graph a learning curve using the SVC classifier. The dataset is kinda skewed, with class sizes of about 150, 1000, 1000, 1000 and 150. I'm running into a problem with fitting the estimator:
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/learning_curve.py", line 135, in learning_curve
for train, test in cv for n_train_samples in train_sizes_abs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 644, in __call__
self.dispatch(function, args, kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 391, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 129, in __init__
self.results = func(*args, **kwargs)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1233, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
X = atleast2d_or_csr(X, dtype=np.float64, order='C')
File "/Users/carrier24sg/.virtualenvs/ml/lib/python2.7/site-packages/sklearn/svm/base.py", line 450, in _validate_targets
% len(cls))
ValueError: The number of classes has to be greater than one; got 1
My code
df = pd.read_csv('../resources/problem2_processed_validate.csv')
data, label = preprocess_text(df)
cv = StratifiedKFold(label, 10)
plt = plot_learning_curve(estimator=SVC(), title="Learning curve", X=data, y=label.values, cv
train_sizes, train_scores, test_scores = learning_curve(
estimator, data, y=label, cv=cv, train_sizes=np.linspace(.1, 1.0, 5))
Even though I use stratified sampling, I still run into this error. I believe it's because the learning curve code doesn't perform stratification when incrementing the dataset size, so at some step all the selected class labels are the same.
How should I resolve this?
You could use StratifiedShuffleSplit instead of StratifiedKFold, and then write the learning curve loop yourself, creating a new CV object at each iteration. StratifiedShuffleSplit allows you to specify a train_size and a test_size which you can increment as you create your learning curve. As long as you let train_size be greater than the number of classes, it will be able to stratify.
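A rough sketch of that manual loop, written against the current sklearn.model_selection API (the question uses the older sklearn.cross_validation module, where the signatures differ slightly):
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.svm import SVC

mean_scores = []
for frac in np.linspace(0.1, 0.7, 5):
    # a fresh CV object per step, with an increasing stratified train_size
    cv = StratifiedShuffleSplit(n_splits=10, train_size=frac, test_size=0.2,
                                random_state=0)
    scores = cross_val_score(SVC(), data, label.values, cv=cv)
    mean_scores.append(scores.mean())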
You are right. learning_curve doesn't perform stratification when creating a smaller data set, it just takes the first bit of the data. Lines 134-136 in learning_curve.py say
train[:n_train_samples] for n_train_samples in train_sizes_abs
You can shuffle your data in advance, so that the slice train[:n_train_samples] may (but is not guaranteed to) include data points from all classes. If you are willing to do some more work, what @eickenberg proposed will work.
PS: This sounds like something that should be included in sklearn. If you do end up writing that code, please send a pull request on GitHub.

How to handle NaNs returned from 'roc_curve' before passing to 'auc'?

I am using 'roc_curve' from the metrics module in scikit-learn. The example shows that 'roc_curve' should be called before 'auc', similar to:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
and then:
metrics.auc(fpr, tpr)
However the following error is returned:
Traceback (most recent call last):
  File "analysis.py", line 207, in <module>
    r = metrics.auc(fpr, tpr)
  File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/metrics/metrics.py", line 66, in auc
    x, y = check_arrays(x, y)
  File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 215, in check_arrays
    _assert_all_finite(array)
  File "/apps/anaconda/1.6.0/lib/python2.7/site-packages/sklearn/utils/validation.py", line 18, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.
What does this mean in terms of results, and is there a way to overcome it?
Are you trying to use roc_curve to evaluate a multiclass classifier? In other words, if you are using roc_curve on a classification problem that is not binary, then this won't work correctly. There is math out there for multidimensional ROC analysis, but the current ROC methods in python don't implement it.
To evaluate multiclass problems, try using methods like confusion_matrix and classification_report from sklearn, and kappa() from skll.
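For reference, a minimal sketch of that evaluation route (the names clf, X_test and y_test are placeholders, not from the question):
from sklearn.metrics import confusion_matrix, classification_report

# hard class predictions from any fitted multiclass classifier
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # per-class error counts
print(classification_report(y_test, y_pred))  # precision/recall/F1 per class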
You state this line:
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
which leads to the conclusion that you may have copied the sklearn example which also uses "pos_label=2".
However, in most cases you want the "pos_label" to be 1. So if your code outputs probabilities and they are between 0 and 1, then your pos_label should be 1.
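For the binary case, a short sketch of what that looks like (assuming y holds 0/1 labels and pred holds predicted probabilities, as in the question):
from sklearn import metrics

# with 0/1 labels and probability scores, the positive class is 1
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=1)
print(metrics.auc(fpr, tpr))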
