How to set the label names when using the Huggingface TextClassificationPipeline?

I am using a Huggingface model fine-tuned on my company's data with the TextClassificationPipeline to make class predictions. The labels that this pipeline predicts default to LABEL_0, LABEL_1 and so on. Is there a way to supply the label mappings to the TextClassificationPipeline object so that the output reflects them?
Env:
tensorflow==2.3.1
transformers==4.3.2
Sample Code:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # or any {'0', '1', '2'}
from transformers import TextClassificationPipeline, TFAutoModelForSequenceClassification, AutoTokenizer
MODEL_DIR = "path\to\my\fine-tuned\model"
# Feature extraction pipeline
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
pipeline = TextClassificationPipeline(model=model,
                                      tokenizer=tokenizer,
                                      framework='tf',
                                      device=0)
result = pipeline("It was a good watch. But a little boring.")[0]
Output:
In [2]: result
Out[2]: {'label': 'LABEL_1', 'score': 0.8864616751670837}

The simplest way to add such a mapping is to edit the model's config.json to contain an id2label field, as below:
{
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "id2label": {
    "0": "negative",
    "1": "positive"
  },
  "attention_dropout": 0.1,
  ...
}
An in-code way to set this mapping is to pass the id2label param in the from_pretrained call, as below:
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_DIR, id2label={0: 'negative', 1: 'positive'})
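With the mapping in place, the same pipeline call should return the human-readable label. A sketch of the expected output (the score is taken from the run above, since the mapping only renames the label):
In [3]: result
Out[3]: {'label': 'positive', 'score': 0.8864616751670837}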
Here is the GitHub issue I raised to get this added to the documentation of transformers.XForSequenceClassification.

Related

resize_token_embeddings on a pretrained model with a different embedding size

I would like to ask about the way to change the embedding size of a trained model.
I have a trained model models/BERT-pretrain-1-step-5000.pkl.
Now I am adding a new token [TRA] to the tokenizer and trying to use resize_token_embeddings on the pretrained one.
import torch
from pytorch_pretrained_bert_inset import BertModel  # BertTokenizer
from transformers import AutoTokenizer
from torch.nn.utils.rnn import pad_sequence
import tqdm

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model_bert = BertModel.from_pretrained('bert-base-uncased', state_dict=torch.load('models/BERT-pretrain-1-step-5000.pkl', map_location=torch.device('cpu')))
#print(tokenizer.all_special_tokens)  #--> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
#print(tokenizer.all_special_ids)     #--> [100, 102, 0, 101, 103]
num_added_toks = tokenizer.add_tokens(['[TRA]'], special_tokens=True)
model_bert.resize_token_embeddings(len(tokenizer))  # --> Embedding(30523, 768)
print('[TRA] token id: ', tokenizer.convert_tokens_to_ids('[TRA]'))  # --> 30522
But I encountered the error:
AttributeError: 'BertModel' object has no attribute 'resize_token_embeddings'
I assume this is because the model_bert (BERT-pretrain-1-step-5000.pkl) I have has a different embedding size.
I would like to know if there is any way to fit the embedding size of my modified tokenizer to the model I would like to use as the initial weights.
Thanks a lot!!
resize_token_embeddings is a Hugging Face transformers method. You are using the BertModel class from pytorch_pretrained_bert_inset, which does not provide such a method. Looking at the code, it seems they copied the BERT code from Hugging Face some time ago.
You can either wait for an update from INSET (maybe create a GitHub issue) or write your own code to extend the word_embeddings layer:
from torch import nn

embedding_layer = model.embeddings.word_embeddings
old_num_tokens, old_embedding_dim = embedding_layer.weight.shape
num_new_tokens = 1

# Creating a new embedding layer with more entries
new_embeddings = nn.Embedding(old_num_tokens + num_new_tokens, old_embedding_dim)

# Setting device and dtype accordingly
new_embeddings.to(
    embedding_layer.weight.device,
    dtype=embedding_layer.weight.dtype,
)

# Copying the old entries
new_embeddings.weight.data[:old_num_tokens, :] = embedding_layer.weight.data[:old_num_tokens, :]

model.embeddings.word_embeddings = new_embeddings
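As a quick check after swapping in the new layer (the expected values assume the bert-base-uncased vocabulary of 30522 tokens plus the one added [TRA] token, as in the question):
print(model.embeddings.word_embeddings)          # expected: Embedding(30523, 768)
print(tokenizer.convert_tokens_to_ids('[TRA]'))  # expected: 30522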

Creating a custom DependencyParser from scratch in spaCy 3

I am trying to implement my own DependencyParser from scratch in spaCy 3. I create a blank model, create an empty DependencyParser, train it, and save its configuration. But when I try to load my custom parser config again, I can only do so successfully if the model is blank. If I use a non-blank model, I keep getting this error: ValueError: could not broadcast input array from shape (106,64) into shape (27,64).
import spacy
import random
from spacy.tokens import Doc
from spacy.training import Example
from spacy.pipeline import DependencyParser
from typing import List, Tuple

PARSER_CONFIG = 'parser.cfg'

TRAINING_DATA = [
    ('find a high paying job with no experience', {
        'heads': [0, 4, 4, 4, 0, 7, 7, 4],
        'deps': ['ROOT', '-', 'QUALITY', 'QUALITY', 'ACTIVITY', '-', 'QUALITY', 'ATTRIBUTE']
    }),
    ('find good workout classes near home', {
        'heads': [0, 3, 3, 0, 5, 3],
        'deps': ['ROOT', 'QUALITY', 'QUALITY', 'ACTIVITY', 'QUALITY', 'ATTRIBUTE']
    })
]

def create_training_examples(training_data: List[Tuple]) -> List[Example]:
    """ Create list of training examples """
    examples = []
    nlp = spacy.load('en_core_web_md')
    for text, annotations in training_data:
        print(f"{text} - {annotations}")
        examples.append(Example.from_dict(nlp(text), annotations))
    return examples

def save_parser_config(parser: DependencyParser):
    print(f"Save parser config to '{PARSER_CONFIG}' ... ", end='')
    parser.to_disk(PARSER_CONFIG)
    print("DONE")

def load_parser_config(parser: DependencyParser):
    print(f"Load parser config from '{PARSER_CONFIG}' ... ", end='')
    parser.from_disk(PARSER_CONFIG)
    print("DONE")

def main():
    nlp = spacy.blank('en')
    # Create new parser
    parser = nlp.add_pipe('parser', first=True)
    for text, annotations in TRAINING_DATA:
        for label in annotations['deps']:
            if label not in parser.labels:
                parser.add_label(label)
    print(f"Added labels: {parser.labels}")
    examples = create_training_examples(TRAINING_DATA)
    # Training
    # NOTE: The 'lambda: examples' part is mandatory in spaCy 3 - https://spacy.io/usage/v3#migrating-training-python
    optimizer = nlp.initialize(lambda: examples)
    print("Training ... ", end='')
    for i in range(25):
        print(f"{i} ", end='')
        random.shuffle(examples)
        nlp.update(examples, sgd=optimizer)
    print("... DONE")
    save_parser_config(parser)
    # I can load the parser config into a blank model ...
    nlp = spacy.blank('en')
    parser = nlp.add_pipe('parser')
    # ... but I cannot load the parser config into an already existing model
    # Raises -> ValueError: could not broadcast input array from shape (106,64) into shape (27,64)
    # nlp = spacy.load('en_core_web_md')
    # parser = nlp.get_pipe('parser')
    load_parser_config(parser)
    print(f"Current pipeline is {nlp.meta['pipeline']}")
    doc = nlp(u'find a high paid job with no degree')
    print(f"Arcs: {[(w.text, w.dep_, w.head.text) for w in doc if w.dep_ != '-']}")

if __name__ == '__main__':
    main()
The custom parser itself works as expected. You can test this by commenting out all the code from save_parser_config(parser) to load_parser_config(parser) (inclusive) and running the code again; you will see that the new labels are assigned as needed. This is why I think the root of the problem is the inability to load the parser configuration of a blank model into a non-blank model. But how can I get around this?
I contacted the developers and this is what they answered - https://github.com/explosion/spaCy/discussions/9239
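Judging from the shape mismatch, the saved parser weights were built against the blank pipeline's vocabulary, so they cannot be broadcast into a parser created on top of en_core_web_md. One workaround to try (my own sketch, not a summary of the linked discussion) is to keep the base pipeline identical between training/saving and loading:
import spacy

# Sketch (assumption): train and reload the parser inside the same base
# pipeline so the internal weight shapes match.
nlp = spacy.load('en_core_web_md', exclude=['parser'])  # drop the stock parser
parser = nlp.add_pipe('parser')  # fresh parser; add labels, initialize and train as in main()
# ... training ...
parser.to_disk(PARSER_CONFIG)

# Later: rebuild the SAME pipeline before loading the saved parser config
nlp2 = spacy.load('en_core_web_md', exclude=['parser'])
parser2 = nlp2.add_pipe('parser')
parser2.from_disk(PARSER_CONFIG)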

Cannot export PyTorch model to ONNX

I am trying to convert a pre-trained torch model to ONNX, but I receive the following error:
RuntimeError: step!=1 is currently not supported
I'm trying this on a pre-trained colorization model: https://github.com/richzhang/colorization
Here is the code I ran in Google Colab:
!git clone https://github.com/richzhang/colorization.git
cd colorization/
import torch  # needed below for torch.randn and torch.onnx.export
import colorizers
model = colorizer_siggraph17 = colorizers.siggraph17(pretrained=True).eval()
input_names = ["input"]
output_names = ["output"]
dummy_input = torch.randn(1, 1, 256, 256, device='cpu')
torch.onnx.export(model, dummy_input, "test_converted_model.onnx", verbose=True,
                  input_names=input_names, output_names=output_names)
I appreciate any help :)
UPDATE 1: @Proko's suggestion solved the ONNX export issue. Now I have a new, possibly related problem when I try to convert the ONNX model to TensorRT. I get the following error:
[TensorRT] ERROR: Network must have at least one output
Here is the code I used:
import torch
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt
import onnx

TRT_LOGGER = trt.Logger()

def build_engine(onnx_file_path):
    # initialize TensorRT engine and parse ONNX model
    builder = trt.Builder(TRT_LOGGER)
    builder.max_workspace_size = 1 << 25
    builder.max_batch_size = 1
    if builder.platform_has_fast_fp16:
        builder.fp16_mode = True
    network = builder.create_network()
    parser = trt.OnnxParser(network, TRT_LOGGER)
    # parse ONNX
    with open(onnx_file_path, 'rb') as model:
        print('Beginning ONNX file parsing')
        parser.parse(model.read())
    print('Completed parsing of ONNX file')
    # generate TensorRT engine optimized for the target platform
    print('Building an engine...')
    engine = builder.build_cuda_engine(network)
    context = engine.create_execution_context()
    print("Completed creating Engine")
    return engine, context

ONNX_FILE_PATH = 'siggraph17.onnx'  # Exported using the code above
engine, _ = build_engine(ONNX_FILE_PATH)
I tried to force the build_engine function to mark the output of the network with:
network.mark_output(network.get_layer(network.num_layers - 1).get_output(0))
but it did not work.
I appreciate any help!
As I have mentioned in a comment, this is because slicing in torch.onnx supports only step = 1, but the model uses 2-step slicing:
self.model2(conv1_2[:,:,::2,::2])
Your only option for now is to rewrite the slicing as other ops. You can do that by using range and reshape to obtain the proper indices. Consider the following "step-less arange" function (I hope it is generic enough for anyone with a similar problem):
def sla(x, step):
    diff = x % step
    x += (diff > 0) * (step - diff)  # pad the length so it can be reshaped properly
    return torch.arange(x).reshape((-1, step))[:, 0]
Usage:
>>> sla(11, 3)
tensor([0, 3, 6, 9])
Now you can replace every such slice like this:
conv2_2 = self.model2(conv1_2[:,:,self.sla(conv1_2.shape[2], 2),:][:,:,:, self.sla(conv1_2.shape[3], 2)])
NOTE: you should optimize this. The indices are calculated on every call, so it might be wise to pre-compute them; a sketch follows at the end of this answer.
I have tested it with my fork of the repo and I was able to save the model:
https://github.com/prokotg/colorization
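Following up on the note about pre-computing, here is a minimal sketch of computing the indices once at construction time (the StridedSlice wrapper and its constructor arguments are hypothetical names of mine, not part of the colorization repo):
import torch
from torch import nn

def sla(x, step):
    diff = x % step
    x += (diff > 0) * (step - diff)
    return torch.arange(x).reshape((-1, step))[:, 0]

class StridedSlice(nn.Module):
    """Equivalent of x[:, :, ::step, ::step], but with precomputed indices
    so the module stays exportable to ONNX."""
    def __init__(self, height, width, step=2):
        super().__init__()
        # buffers move with the module across devices and dtypes
        self.register_buffer('idx_h', sla(height, step))
        self.register_buffer('idx_w', sla(width, step))

    def forward(self, x):
        return x[:, :, self.idx_h, :][:, :, :, self.idx_w]
This assumes a fixed input resolution (as with the 1x1x256x256 dummy input above), since the indices are tied to the feature-map size chosen at construction time.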
What worked for me was to add opset_version=11 to torch.onnx.export.
I first tried opset_version=10, but the API suggested 11, and that worked.
So your call should be:
torch.onnx.export(model, dummy_input, "test_converted_model.onnx", verbose=True, opset_version=11,
                  input_names=input_names, output_names=output_names)

Hyperparameter tuning on a pipeline object

I have this pipeline,
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ("selector", get_numeric_data),
            ])),
            ('text_features', Pipeline([
                ("selector", get_text_data),
                ("vectorizer", HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC,
                                                 non_negative=True, norm=None, binary=False,
                                                 ngram_range=(1, 2))),
                ('dim_red', SelectKBest(chi2, chi_k))
            ]))
        ])),
    ("clf", LogisticRegression())
])
When I try to do
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

c_space = np.logspace(-5, 8, 15)
param_grid = {"C": c_space, "penalty": ['l1', 'l2']}
logreg_cv = GridSearchCV(pl, param_grid=param_grid, cv=5)
logreg_cv.fit(X_train, y_train)
It throws me
ValueError: Invalid parameter penalty for estimator
Pipeline(memory=None,
     steps=[('union', FeatureUnion(n_jobs=1,
        transformer_list=[('numeric_features', Pipeline(memory=None,
           steps=[('selector', FunctionTransformer(accept_sparse=False,
              func=<function get_numeric_data at 0x00000190ECB49488>, inv_kw_args=None,
              inverse_func=None, kw_args=None, pass_y=...ty='l2', random_state=None,
              solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False))]). Check the list of available parameters
with estimator.get_params().keys().
Although "C" and "penalty" legit parameters in this case. Please help me hoe to go about it.
"C" and "penalty" are legit parameters of LogisticRegression, not Pipeline object that you send to GridSearchCV.
Your pipeline currently have two components, "union" and "clf". Now the pipeline dont know which part to send the paramters. You need to append these names used in pipeline with params, so that it can identify them and send them to correct object.
Do this:
param_grid = {"clf__C": c_space,"clf__penalty": ['l1', 'l2']}
Note that there are two underscores between the name of the object in the pipeline and the parameter name.
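As the error message itself suggests, you can list every parameter name the pipeline accepts to check the exact prefixes:
# prints keys such as 'clf__C', 'clf__penalty',
# 'union__text_features__vectorizer__ngram_range', ...
print(sorted(pl.get_params().keys()))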
It's mentioned in the documentation of Pipeline and FeatureUnion here:
Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax
with various examples demonstrating the usage.
Following this, if you want to, say, change the ngram_range of the HashingVectorizer, you would do this:
"union__text_features__vectorizer__ngram_range": [(1, 3)]

Using the same LabelEncoder on the test dataset? Or a new LabelEncoder?

I'm a total novice with scikit-learn.
I want to know whether I should use the same LabelEncoder instance that was used on the training dataset when I want to convert the same feature's categorical data in the test dataset. That is, something like this:
from sklearn import preprocessing
# training data label encoding
le_blood_type = preprocessing.LabelEncoder()
df_training['BLOOD_TYPE'] = le_blood_type.fit_transform(df_training['BLOOD_TYPE'])  # labeling from string
....
1. Using the same label encoder
df_test['BLOOD_TYPE'] = le_blood_type.fit_transform(df_test['BLOOD_TYPE'])
2. Using a different label encoder
le_for_test_blood_type = preprocessing.LabelEncoder()
df_test['BLOOD_TYPE'] = le_for_test_blood_type.fit_transform(df_test['BLOOD_TYPE'])
Which one is correct? Or does it not matter which I choose, because the training dataset's categorical data and the test dataset's categorical data should end up the same anyway?
The problem is really in how you use it.
Since LabelEncoder associates each nominal value with a numeric index, you should fit it once and then only transform once the object has been fitted. Don't forget that you need all your nominal values present in the training phase.
The right way to use it is to take your nominal feature, do a fit on it, and from then on use only the transform method.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
(from the official docs)
I think RPresle has already given the answer. I just want to relate it a little more directly to the situation in the question:
In general, you just need to fit the LabelEncoder once (on the feature in the training set) and use it to transform that feature in the test set. But if your test set has feature values that are not in the training set, fit the label encoder on the union of the training and test values for that feature.
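Applied to the question's dataframes, a minimal sketch of option 1 done correctly (note transform, not fit_transform, on the test set):
from sklearn import preprocessing

le_blood_type = preprocessing.LabelEncoder()
# fit once on the training column ...
df_training['BLOOD_TYPE'] = le_blood_type.fit_transform(df_training['BLOOD_TYPE'])
# ... then reuse the SAME fitted encoder to transform the test column
df_test['BLOOD_TYPE'] = le_blood_type.transform(df_test['BLOOD_TYPE'])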
