HuggingFace-Transformers --- NER single sentence/sample prediction - python-3.x

I am trying to predict with an NER model, following the tutorial from HuggingFace (which covers only the training and evaluation part).
I am following this exact tutorial: https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb
The training works flawlessly, but my problems begin when I try to predict on a single sample.
from transformers import AutoTokenizer, AutoModel

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
loaded_model = AutoModel.from_pretrained('./my_model_own_custom_training.pth',
                                         from_tf=False)

input_sentence = "John Nash is a great mathematician, he lives in France"
tokenized_input_sentence = tokenizer([input_sentence],
                                     truncation=True,
                                     is_split_into_words=False,
                                     return_tensors='pt')
predictions = loaded_model(tokenized_input_sentence["input_ids"])[0]
The predictions tensor is of shape (1, 13, 768).
How can I arrive at a final result of the form [John <-> 'B-PER', …, France <-> 'B-LOC'], where B-PER and B-LOC are two ground-truth labels, representing the tags for a person and a location respectively?
The result of the prediction is:
torch.Size([1, 13, 768])
If I write:
print(predictions.argmax(axis=2))
tensor([613, 705, 244, 620, 206, 206, 206, 620, 620, 620, 477, 693, 308])
I get the tensor above.
However, I would have expected a tensor of label indices in the range [0…8], corresponding to the ground-truth annotations.
Summary when loading the model:
loading configuration file ./my_model_own_custom_training.pth/config.json
Model config DistilBertConfig {
  "name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights": true,
  "transformers_version": "4.8.1",
  "vocab_size": 30522
}

The answer is a bit trickier than expected [huge credits to Niels Rogge].
Firstly, loading models in huggingface-transformers can be done in (at least) two ways:
AutoModel.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
AutoModelForTokenClassification.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
It seems that, depending on the task at hand, different AutoModel subclasses need to be used. In the scenario I posted, it is AutoModelForTokenClassification that has to be used.
After that, a solution to obtain the predictions would be to do the following:
# forward pass
outputs = model(**encoding)
logits = outputs.logits
predictions = logits.argmax(-1)
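To map those class indices back to tag names, you can look them up in the model config's id2label dictionary and pair them with the tokens. A minimal sketch, assuming the fine-tuned checkpoint from the question; note that with the config as posted, id2label still contains the generic LABEL_0 … LABEL_8 names, so you would also need to set model.config.id2label to your actual tag names (B-PER, B-LOC, etc.) to see them in the output:

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained('./my_model_own_custom_training.pth')

encoding = tokenizer("John Nash is a great mathematician, he lives in France",
                     return_tensors="pt")
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1)  # shape (1, seq_len): one class id per token

# Map class ids to label strings and pair them with the word pieces.
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
labels = [model.config.id2label[idx.item()] for idx in predictions[0]]
for token, label in zip(tokens, labels):
    print(token, "<->", label)

Keep in mind that the tokens include the special [CLS] and [SEP] markers and possible word-piece splits, so aligning predictions back to whole words may need an extra grouping step.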

Related

Hyperparameter search with Gridsearch giving parameter values that don't work

I am running a hyperparameter search with scikit-learn's GridSearchCV, using a CountVectorizer and a RandomForestClassifier. The hyperparameter search grid looks like this:
grid = {
    'vectorizer__ngram_range': [(1, 1)],
    'vectorizer__stop_words': [None, german_stop_words],
    'vectorizer__max_df': [0.25, 0.5, 0.75, 1],
    'vectorizer__min_df': [0.01, 0.1, 1, 5, 10],
    'vectorizer__max_features': [None, 100, 1000, 1500],
    'classifier__class_weight': ['balanced', 'balanced_subsample', None],
    'classifier__n_jobs': [-1],
    'classifier__n_estimators': [100, 190, 250]
}
The grid search runs to the end and gives me a best_params_ result. I have run it several times and different results come out. During the run I sometimes get these errors:
/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:548: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 330, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 292, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/root/complex_semantics/lib/python3.8/site-packages/joblib/memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py", line 740, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/feature_extraction/text.py", line 1213, in fit_transform
    raise ValueError(
ValueError: max_df corresponds to < documents than min_df
  warnings.warn("Estimator fit failed. The score on this train-test"
I assume this is normal, since some combinations of values are invalid. But a couple of times, after getting the best params and running the model with them, I get an error telling me that the values of max_df and min_df are incorrect, because the number of documents selected by max_df is lower than the number selected by min_df.
How come it runs correctly during the hyperparameter search with the same dataset, but not in a normal run?
Any ideas? Is there a way to avoid this?
This is the code for the grid search:
pipeline = Pipeline([('vectorizer', CountVectorizer()),('classifier', RandomForestClassifier())])
scoring_function = make_scorer(matthews_corrcoef)
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring=scoring_function, n_jobs=-1, cv=5)
grid_search.fit(X=train_text, y=train_labels)
print("-----------")
print(grid_search.best_score_)
print(grid_search.best_params_)
Some of the values in your max_df correspond to fewer documents than the values in your min_df.
The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents".
min_df is used for removing terms that appear too infrequently.
Let's see what that translates to in your case:
'vectorizer__max_df': [0.25, 0.5, 0.75, 1],
'vectorizer__min_df': [0.01, 0.1, 1, 5, 10],
For example:
max_df = 0.25 means "ignore terms that appear in more than 25% of the documents".
min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
The issue I am seeing is with 5 and 10 in min_df.
min_df = 5 means "ignore terms that appear in fewer than 5 documents".
min_df = 10 means "ignore terms that appear in fewer than 10 documents".
The error even tells you about this: ValueError: max_df corresponds to < documents than min_df. It most likely comes from combinations where a fractional max_df (say, 25% of your corpus) corresponds to fewer documents than an absolute min_df of 5 or 10.
So I would suggest sticking to float values (proportions) for both max_df and min_df, and perhaps using the values [0.01, 0.1, 0.2] for vectorizer__min_df.
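To see the conflict concretely, here is a minimal reproduction on a hypothetical 8-document corpus (the corpus and parameter values are illustrative only):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog ran", "a cat ran", "the bird flew",
        "a dog sat", "the cat ran", "a bird sat", "the dog flew"]

# max_df=0.25 keeps only terms appearing in at most 2 of the 8 documents,
# while min_df=5 demands terms appearing in at least 5 documents -> contradiction.
vectorizer = CountVectorizer(max_df=0.25, min_df=5)
try:
    vectorizer.fit(docs)
except ValueError as err:
    print(err)  # max_df corresponds to < documents than min_df

This also explains the asymmetry you observed: GridSearchCV swallows the exception as a FitFailedWarning and scores that combination as nan, so the search finishes, whereas a standalone fit with the same parameters raises.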

Reading test images for resnet18

I am trying to read an image file and classify the image.
My model is resnet18. I trained it previously and plan to use a different .py script to classify a list of images. This is my network:
import torch
import torch.nn as nn
from torchvision import models

PATH = './net.pth'
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 16)  # 16 output classes
model_ft.load_state_dict(torch.load(PATH))
model_ft.eval()
And I am trying to read images this way:
imsize = 256
loader = transforms.Compose([transforms.Scale(imsize), transforms.ToTensor()])

def image_loader(image_name):
    # load image, returns cuda tensor
    image = Image.open(image_name)
    image = loader(image).float()
    image = Variable(image, requires_grad=True)
    image = image.unsqueeze(0)
    return image.cuda()

image = image_loader("dataset/test/9673.png")
model_ft(image)
I am getting this error:
"Given groups=1, weight of size [64, 3, 7, 7], expected input[1, 4, 676, 256] to have 3 channels, but got 4 channels instead"
It was recommended that I remove the unsqueeze for resnet18; after doing that, I got the following error:
"Expected 4-dimensional input for 4-dimensional weight [64, 3, 7, 7], but got 3-dimensional input of size [4, 676, 256] instead"
I do not quite understand the problem I am dealing with. How should I read my test set? I will need to write the class IDs and the file names into a .txt afterwards.
You are using a PNG image, which has 4 channels (RGBA); your network expects 3 channels.
Convert to RGB and you should be fine. In your image_loader, simply do:
image = Image.open(image_name).convert('RGB')
I think your image input is of the shape batch_size*4*height*width instead of batch_size*3*height*width; that's why you get the error. Can you print(image.shape) after the call to image_loader() and report the shape?
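Putting the fix together, the loader might look like this (a sketch; note that transforms.Scale is deprecated in newer torchvision in favour of transforms.Resize, and Variable is not needed for inference):

import torch
from PIL import Image
from torchvision import transforms

imsize = 256
loader = transforms.Compose([transforms.Resize(imsize), transforms.ToTensor()])

def image_loader(image_name):
    # Drop the alpha channel by converting the PNG to 3-channel RGB.
    image = Image.open(image_name).convert('RGB')
    image = loader(image).float()
    image = image.unsqueeze(0)  # add the batch dimension: (1, 3, H, W)
    return image.cuda()

with torch.no_grad():
    output = model_ft(image_loader("dataset/test/9673.png"))
    predicted_class = output.argmax(dim=1).item()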

Multiple Entity recognition with Spacy python Error

I am stuck on a problem and seeking help from you. I am trying to train multiple entity types using spaCy.
Following is my training data:
response = [
    ('java developer with java and html css javascript ',
     {'entities': [(0, 14, 'jobtitle'),
                   (0, 4, 'skills'),
                   (34, 37, 'skills'),
                   (38, 49, 'skills')]}),
    ('looking for software engineer with java python',
     {'entities': [(12, 29, 'jobtitle'),
                   (40, 46, 'skills'),
                   (35, 39, 'skills')]})
]
Here is the training code where I have the issue:
import random
import spacy

TRAIN_DATA = response  # the training examples shown above

nlp = spacy.blank("en")
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
Error:
ValueError: [E103] Trying to set conflicting doc.ents: '(0, 14, 'jobtitle')' and '(0, 4, 'skills')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
As the error message explains, spaCy's NER model does not support overlapping entity spans, so you can't train a model on these annotations.
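One way forward is to drop the overlapping spans so that every token belongs to at most one entity. A sketch of the training data with the overlap resolved (which span to keep, 'jobtitle' or 'skills', is a modelling choice):

response = [
    ('java developer with java and html css javascript ',
     {'entities': [(0, 14, 'jobtitle'),   # (0, 4, 'skills') removed: it overlapped
                   (34, 37, 'skills'),
                   (38, 49, 'skills')]}),
    ('looking for software engineer with java python',
     {'entities': [(12, 29, 'jobtitle'),
                   (35, 39, 'skills'),
                   (40, 46, 'skills')]})
]

If you genuinely need overlapping spans, the NER component is the wrong tool; spaCy 3.1+ offers the SpanCategorizer component, which does allow overlapping spans.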

How to get words from output of XLNet using Transformers library

I am using Hugging Face's Transformers library to work with different NLP models. The following code does masking with XLNet. It outputs a tensor of numbers. How do I convert the output back to words?
import torch
from transformers import XLNetModel, XLNetTokenizer, XLNetLMHeadModel
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetLMHeadModel.from_pretrained('xlnet-base-cased')
# We show how to setup inputs to predict a next token using a bi-directional context.
input_ids = torch.tensor(tokenizer.encode("I went to <mask> York and saw the <mask> <mask> building.")).unsqueeze(0) # We will predict the masked token
print(input_ids)
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -1] = 1.0 # Previous tokens don't see last token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float) # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, -1] = 1.0 # Our first (and only) prediction will be the last token of the sequence (the masked token)
outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
next_token_logits = outputs[0] # Output has shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
The current output I get is:
tensor([[[ -5.1466, -17.3758, -17.3392,  ..., -12.2839, -12.6421, -12.4505]]],
       grad_fn=<AddBackward0>)
The output you have is a tensor of size 1 by 1 by vocabulary size. The meaning of the nth number in this tensor is the estimated log-odds of the nth vocabulary item. So, if you want to get out the word that the model predicts to be most likely to come in the final position (the position you specified with target_mapping), all you need to do is find the word in the vocabulary with the maximum predicted log-odds.
Just add the following to the code you have:
predicted_index = torch.argmax(next_token_logits[0][0]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
So predicted_token is the token the model predicts as most likely in that position.
Note that, by default, XLNetTokenizer.encode() adds the special tokens <sep> and <cls> to the end of a string of tokens when it encodes it. The code you have given masks and predicts the final token, which, after running through tokenizer.encode(), is the special token <cls>, which is probably not what you want.
That is, when you run
tokenizer.encode("I went to <mask> York and saw the <mask> <mask> building.")
the result is a list of token ids,
[35, 388, 22, 6, 313, 21, 685, 18, 6, 6, 540, 9, 4, 3]
which, if you convert back to tokens (by calling tokenizer.convert_ids_to_tokens() on the above id list), you will see has two extra tokens added at the end,
['▁I', '▁went', '▁to', '<mask>', '▁York', '▁and', '▁saw', '▁the', '<mask>', '<mask>', '▁building', '.', '<sep>', '<cls>']
So, if the word you are meaning to predict is 'building', you should use perm_mask[:, :, -4] = 1.0 and target_mapping[0, 0, -4] = 1.0.
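Putting that together, a sketch of the adjusted masking code that targets the '▁building' position (fourth from the end, once the trailing <sep> and <cls> tokens are counted):

# Target position -4 ('▁building') instead of -1 ('<cls>').
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -4] = 1.0  # previous tokens don't see the target token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)
target_mapping[0, 0, -4] = 1.0

outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)
predicted_index = torch.argmax(outputs[0][0][0]).item()
print(tokenizer.convert_ids_to_tokens(predicted_index))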

Building my own tf.Estimator, how did model_params overwrite model_dir? RuntimeWarning?

Recently I built a customized deep neural net model using TFLearn, which claims to bring deep learning to the scikit-learn estimator API. I could train models and make predictions, but I couldn't get the scoring (evaluate) function to work, so I couldn't do cross-validation. I tried to ask questions about TFLearn in various places, but I got no responses.
It appears that TensorFlow itself has an estimator class. So I am putting TFLearn aside, and I'm trying to follow the guide at https://www.tensorflow.org/extend/estimators. Somehow I'm managing to get variables where they don't belong. Can anyone spot my problem? I will post code and the output.
Note: Of course, I can see the RuntimeWarning at the top of the output. I have found references to this warning online, but so far everyone claims it's harmless. Maybe it is not...
CODE:
import tensorflow as tf
from my_library import Database, l2_angle_distance

def my_model_function(topology, params):
    # This function will eventually be a function factory. This should
    # allow easy exploration of hyperparameters. For now, this just
    # returns a single, fixed model_fn.
    def model_fn(features, labels, mode):
        # Input layer
        net = tf.layers.conv1d(features["x"], topology[0], 3, activation=tf.nn.relu)
        net = tf.layers.dropout(net, 0.25)
        # The core of the network is here (convolutional layers only for now).
        for nodes in topology[1:]:
            net = tf.layers.conv1d(net, nodes, 3, activation=tf.nn.relu)
            net = tf.layers.dropout(net, 0.25)
        sh = tf.shape(features["x"])
        net = tf.reshape(net, [sh[0], sh[1], 3, 2])
        predictions = tf.nn.l2_normalize(net, dim=3)
        # PREDICT EstimatorSpec
        if mode == tf.estimator.ModeKeys.PREDICT:
            return tf.estimator.EstimatorSpec(mode=mode,
                                              predictions={"vectors": predictions})
        # TRAIN or EVAL EstimatorSpec
        loss = l2_angle_distance(labels, predictions)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=params["learning_rate"])
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, predictions, loss, train_op)
    return model_fn

##===================================================================

window = "whole"
encoding = "one_hot"
db = Database("/home/bwllc/Documents/Files for ML/compact")
traindb, testdb = db.train_test_split()
train_features, train_labels = traindb.values(window, encoding)
test_features, test_labels = testdb.values(window, encoding)

# Create the model.
tf.logging.set_verbosity(tf.logging.INFO)
LEARNING_RATE = 0.01
topology = (60, 40, 20)
model_params = {"learning_rate": LEARNING_RATE}
model_fn = my_model_function(topology, model_params)
model = tf.estimator.Estimator(model_fn, model_params)
print("\nmodel_dir? No? Why not? ", model.model_dir, "\n")  # This documents the error

# Input function.
my_input_fn = tf.estimator.inputs.numpy_input_fn({"x": train_features}, train_labels, shuffle=True)

# Train the model.
model.train(input_fn=my_input_fn, steps=20)
OUTPUT
/usr/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': {'learning_rate': 0.01}, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0b55279048>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
model_dir? No? Why not? {'learning_rate': 0.01}
INFO:tensorflow:Create CheckpointSaverHook.
Traceback (most recent call last):
  File "minimal_estimator_bug_example.py", line 81, in <module>
    model.train(input_fn=my_input_fn, steps=20)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 302, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/estimator/estimator.py", line 756, in _train_model
    scaffold=estimator_spec.scaffold)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 411, in __init__
    self._save_path = os.path.join(checkpoint_dir, checkpoint_basename)
  File "/usr/lib/python3.6/posixpath.py", line 78, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not dict
------------------
(program exited with code: 1)
Press return to continue
I can see exactly what went wrong: model_dir (which I left as the default) somehow got bound to the value I intended for model_params. How did this happen in my code? I can't see it.
If anyone has advice or suggestions, I would greatly appreciate them. Thanks!
Simply because you're feeding your model_params as the model_dir when you construct your Estimator.
From the TensorFlow documentation, the Estimator __init__ signature is:
__init__(
    model_fn,
    model_dir=None,
    config=None,
    params=None
)
Notice how the second positional argument is model_dir. If you want to specify only params, you need to pass it as a keyword argument:
model = tf.estimator.Estimator(model_fn, params=model_params)
Or specify all the preceding positional arguments:
model = tf.estimator.Estimator(model_fn, None, None, model_params)
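Either way, a quick check (hypothetical, just to confirm the fix) shows model_dir falling back to an auto-generated temporary directory instead of the params dict:

model = tf.estimator.Estimator(model_fn, params=model_params)
print(model.model_dir)  # e.g. a /tmp/... directory chosen automatically
print(model.params)     # {'learning_rate': 0.01}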
