Reduce inference time for BERT

Reduce inference time for BERT - pytorch

I want to further improve the inference time from BERT.
Here is the code below:
for sentence in list(data_dict.values()):
tokens = {'input_ids': [], 'attention_mask': []}
new_tokens = tokenizer.encode_plus(sentence, max_length=512,
truncation=True, padding='max_length',
return_tensors='pt',
return_attention_mask=True)
tokens['input_ids'].append(new_tokens['input_ids'][0])
tokens['attention_mask'].append(new_tokens['attention_mask'][0])
# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
outputs = model(**tokens)
embeddings = outputs[0]
Is there a way to provide batches (like in training) instead of the whole dataset?

There are several optimizations that we can do here, which are (mostly) natively supported by the Huggingface tokenizer.
TL;DR, an optimized version would be this one, I have explained the ideas behind each change below.
def chunker(seq, batch_size=16):
return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))
for sentence_batch in chunker(list(data_dict.values())):
tokenized_sentences = tokenizer(sentence_batch, max_length=512,
truncation=True, padding=True,
return_tensors="pt", return_attention_mask=True)
with torch.no_grad():
outputs = model(**tokenized_sentences)
The first optimization is to batch together several samples at the same time. For this, it is helpful to have a closer look at the actual __call__ function of the tokenizer, see here (bold highlight by me):
text (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded [...].
This means it is enough if we can simply pass several samples at the same time, and we already get the readily processed batch back. I want to personally note that it would be in theory possible to pass the entire list of samples at once, but there are also some drawbacks that we go into later.
To actually pass a decently sized number of samples to the tokenizer, we need a function that can aggregate several samples from the dictionary (our batch-to-be) in a single iteration. I've used another Stackoverflow answer for this, see this post for several valid answers.
I've chosen the highest-voted answer, but do note that this creates and explicit copy, and might therefore not be the most memory-efficient solution. Then you can simply iterate over the batches, like so:
def chunker(seq, batch_size=16):
return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))
for sentence_batch in chunker(list(data_dict.values())):
...
The next optimization is in the way you can call your tokenizer. Your code does this with many several steps, which can be aggregated into a single call. For the sake of clarity, I also point out which of these arguments are not required in your call (this often improves your code readability).
tokenized_sentences = tokenizer(sentence_batch, max_length=512,
truncation=True, padding=True,
return_tensors="pt", return_attention_mask=True)
with torch.no_grad(): # Just to be sure
outputs = model(**tokenized_sentences)
I want to comment on the use of some of the arguments as well:
max_length=512: This is only required if your value differs from the model's default max_length. For most models, this will otherwise default to 512.
return_attention_mask: Will also default to the model-specific values, and in most cases does not need to be set explicitly.
padding=True: If you noticed, this is different from your version, and arguably what gives you the most "out-of-the-box" speedup. By using padding=max_length, each sequence will be computing quite a lot of unnecessary tokens, since each input is 512 tokens long. For most real-world data I have seen, inputs tend to be much shorter, and therefore you only need to consider the longest sequence length in your batch. padding=True does exactly that. For actual (CPU inference) speedups, I have played around with some different sequence lengths myself, see my repository on Github. Noticeably, for the same CPU and different batch sizes, there is a 10x speedup possible.
Edit: I've added the torch.no_grad() here, too, just in case somebody else wants to use this snippet. I generally recommend to use it right before the piece of code that is actually affected by it, just so that nothing gets overlooked by accident.
Also, there are some more possible optimizations that require you to have a bit more insights into your data samples:
If the variance of sample lengths is quite drastic, you can get an even higher speedup if you sort your samples by length (ideally, tokenized length, but character length / word count will also give you an approximate idea). That way, when batching several samples together, you minimize the amount of padding that is required.

Maybe you might be interested in Intel OpenVINO backend for inference execution on CPU? It's currently work in progress on branch https://github.com/huggingface/transformers/pull/14203

I had the same issue of time inference with Bert on the CPU. I started using HuggingFace Pipelines for inference, and the Trainer for training.
It's well documented on HuggingFace.
The pipeline makes it simple to perform inference on batches. On one pass, you can get the inference done instead of looping on a sequence of single texts.

Related

Having trouble training Word2Vec iteratively on Gensim

I'm attempting to train multiple texts supplied by myself iteratively. However, I keep running into an issue when I train the model more than once:
ValueError: You must specify either total_examples or total_words, for proper learning-rate and progress calculations. If you've just built the vocabulary using the same corpus, using the count cached in the model is sufficient: total_examples=model.corpus_count.
I'm currently initiating my model like this:
model = Word2Vec(sentences, min_count=0, workers=cpu_count())
model.build_vocab(sentences, update=False)
model.save('firstmodel.model')
model = Word2Vec.load('firstmodel.model')
and subsequently training it iteratively like this:
model.build_vocab(sentences, update = True)
model.train(sentences, totalexamples=model.corpus_count, epochs=model.epochs)
What am I missing here?
Somehow, it worked when I just trained one other model, so not sure why it doesn't work beyond two models...

First, the error message says you need to supply either the total_examples or total_words parameter to train() (so that it has an accurate estimate of the total training-corpus size).
Your code, as currently shown, only supplies totalexamples – a parameter name missing the necessary _. Correcting this typo should remedy the immediate error.
However, some other comments on your usage:
repeatedly calling train() with different data is an expert technique highly subject to error or other problems. It's not the usual way of using Word2Vec, nor the way most published results were reached. You can't count on it to always improve the model with new words; it might make the model worse, as new training sessions update some-but-not-all words, and alter the (usual) property that the vocabulary has one consistent set of word-frequencies from one single corpus. The best course is to train() once, with all available data, so that the full vocabulary, word-frequencies, & equally-trained word-vectors are achieved in a single consistent session.
min_count=0 is almost always a bad idea with word2vec: words with few examples in the corpus should be discarded. Trying to learn word-vectors for them not only gets weak vectors for those words, but dilutes/distracts the model from achieving better vectors for surrounding more-common words.
a count of workers up to your local cpu_count() only reliably helps up to about 4-12 workers, depending on other parameters & the efficiency of your corpus-reading, then more workers can hurt, due to inefficiencies in the Python GIL & Gensim corpus-to-worker handoffs. (inding the actual best count for your setup is, unfortunately, still just a matter of trial and error. But if you've got 16 (or more) cores, your setting is almost sure to do worse than a lower workers number.

Gensim doc2vec file stream training worse performance

Recently I switched to gensim 3.6 and the main reason was the optimized training process, which streams the training data directly from file, thus avoiding the GIL performance penalties.
This is how I used to trin my doc2vec:
training_iterations = 20
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)
for epoch in range(training_iterations):
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.iter)
d2v.alpha -= 0.0002
d2v.min_alpha = d2v.alpha
And it is classifying documents quite well, only draw back is that when it is trained CPUs are utilized at 70%
So the new way:
corpus_fname = "spped.data"
save_as_line_sentence(corpus, corpus_fname)
# Choose num of cores that you want to use (let's use all, models scale linearly now!)
num_cores = cpu_count()
# Train models using all cores
d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores, dm=0, vector_size=200, epochs=50)
Now all CPUs are utilized at 100%
but the model is performing very poorly.
According to the documentation, I should not use the train method also, I should use only epoch count and not iterations, also the min_aplpha and aplha values should not be touched.
The configuration of both Doc2Vec looks the same to me so is there an issue with my new set up or configuration, or there is something wrong with the new version of gensim?
P.S I am using the same corpus in both cases, also I tried epoch count = 100, also with smaller numbers like 5-20, but I had no luck
EDIT: First model was doing 20 iterations 5 epoch each, second was doing 50 epoch, so having the second model make 100 epochs made it perform even better, since I was no longer managing the alpha by myself.
About the second issue that popped up: when providing file with line documents, the doc ids were not always corresponding to the lines, I didn't manage to figure out what could be causing this, it seems to work fine for small corpus, If I find out what I am doing wrong I will update this answer.
The final configuration for corpus of size 4GB looks like this
d2v = Doc2Vec(vector_size=200, workers=cpu_count(), alpha=0.025, min_alpha=0.00025, dm=0)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=100)

Most users should not be calling train() more than once in their own loop, where they try to manage the alpha & iterations themselves. It is too easy to do it wrong.
Specifically, your code where you call train() in a loop is doing it wrong. Whatever online source or tutorial you modeled this code on, you should stop consulting, as it's misleading or outdated. (The notebooks bundled with gensim are better examples on which to base any code.)
Even more specifically: your looping code is actually doing 100 passes over the data, 20 of your outer loops, then the default d2v.iter 5 times each call to train(). And your first train() call is smoothly decaying the effective alpha from 0.025 to 0.00025, a 100x reduction. But then your next train() call uses a fixed alpha of 0.0248 for 5 passes. Then 0.0246, etc, until your last loop does 5 passes at alpha=0.0212 – not even 80% of the starting value. That is, the lowest alpha will have been reached early in your training.
Call the two options exactly the same except for the way the corpus_file is specified, instead of an iterable corpus.
You should get similar results from both corpus forms. (If you had a reproducible test case where the same corpus gets very different-quality results, and there wasn't some other error, that could be worth reporting to gensim as a bug.)
If the results for both aren't as good as when you were managing train() and alpha wrongly, it would likely be because you aren't doing a comparable amount of total training.

adding and accessing auxiliary tf.Dataset attributes with Keras

I use a tf.py_func call to parse data (features, labels and sample_weights) from file to a tf.Dataset:
dataset = tf.data.Dataset.from_tensor_slices((records, labels, sample_weights))
dataset = dataset.map(
lambda filename, label, sample_weight: tuple(tf.py_func(
self._my_parse_function, [filename, label, sample_weights], [tf.float32, label.dtype, tf.float32])))
The data is variable-length 1-D sequences, so I also pad the sequences to a fixed length in my_parse_function.
I use tensorflow.python.keras.models.Sequential.fit(...) to train the data (which now accepts datasets as input, including datasets with sample_weights) and tensorflow.python.keras.models.Sequential.predict to predict outputs.
Once I have predictions I would like to do some post-processing to make sense of the outputs. For example, I'd like to truncate the padded data to the actual sequence length. Also, I'd like to know for sure which file the data came from, since I am not sure that ordering is guaranteed with dataset iterators, especially if batching is used (I do batch the dataset as well) or multi-GPU or multi-workers are involved (I hope to try the multi- scenarios). Even if order was 'guaranteed' this is a decent sanity check.
This information, filename (i.e, a string) and sequence length (i.e, an integer), is not currently conveniently accessible, so I'd like to add these two attributes to the dataset elements and be able to retrieve them during/after the call to predict.
What is the best approach to do this?
Thanks

As a workaround, I store this auxiliary information in a 'global' dictionary in my_parse_fn, so it stores (and re-stores) on every iteration through the tf.Dataset. This is ok for now since there are only about 1000 examples in the training set, so storing 1000 strings and integers is not a problem. But if this auxiliary information were larger or the training set were larger, this approach would not be very scalable. In my case, the input data for each training example is significantly large, about 50MB in size, which is why reading a tf.Dataset from file (i.e., on every epoch) is important.
I still think that it would be helpful to be able to more conveniently extend a tf.Dataset with this information. Also I noticed that when I adding a field to a tf.Dataset like dataset.tag to identify, say, dataset.tag = 'training', dataset.tag ='validation' or dataset.tag = 'test' sets, the field did not survive the iterations of training.
So again in this case I'm wondering how a tf.Dataset can be extended.
On the other question, it looks like the order of tf.Dataset elements is respected through iterations, so predictions, say, from tensorflow.python.keras.models.Sequential.predict(...) are ordered as the file ids were presented to my_parse_fn (at least batching respects this ordering, but I still don't know about whether a multi-GPU scenario would as well).
Thanks for any insights.

PyTorch: Relation between Dynamic Computational Graphs - Padding - DataLoader

As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length. But, if I want to use PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors - given that me as a total beginner does not want to build some customized collate_fn.
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end (in the case of word ids), will it have any negative effect on my training since PyTorch may not be optimized for computations with padded sequences (since the whole premise is that it can work with variable sequence lengths in the dynamic graphs), or does it simply not make any difference?
I will also post this question in the PyTorch Forum...
Thanks!

In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length.
This means that you don't need to pad sequences unless you are doing data batching which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (which is described in detail in this paper) that does batching on the graph operations instead of the data, so this might be what you want to look into.
But, if I want to use PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors - given that me as a total beginner does not want to build some customized collate_fn.
You can use the DataLoader given you write your own Dataset class and you are using batch_size=1. The twist is to use numpy arrays for your variable length sequences (otherwise default_collate will give you a hard time):
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
class FooDataset(Dataset):
def __init__(self, data, target):
assert len(data) == len(target)
self.data = data
self.target = target
def __getitem__(self, index):
return self.data[index], self.target[index]
def __len__(self):
return len(self.data)
data = [[1,2,3], [4,5,6,7,8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']
ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)
print(list(enumerate(dl)))
# [(0, [
# 1 2 3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
# 4 5 6 7 8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Fair point but the main strength of dynamic computational graphs are (at least currently) mainly the possibility of using debugging tools like pdb which rapidly decrease your development time. Debugging is way harder with static computation graphs. There is also no reason why PyTorch would not implement further just-in-time optimizations or a concept similar to DyNet's auto-batching in the future.
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end [...], will it have any negative effect on my training [...]?
Yes, both in runtime and for the gradients. The RNN will iterate over the padding just like normal data which means that you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pad_packed_sequence and pack_padded_sequence. These will let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (or at least not if you don't add support for it manually).

Understanding Input Sequences of Unlimited Length for RNNs in Keras

I have been looking into an implementation of a certain architecture of deep learning model in keras when I came across a technicality that I could not grasp. In the code, the model is implemented as having two inputs; the first is the normal input that goes through the graph (word_ids in the sample code below), while the second is the length of that input, which seems to be involved nowhere other than the inputs argument in the keras Model instant (sequence_lengths in the sample code below).
word_ids = Input(batch_shape=(None, None), dtype='int32')
word_embeddings = Embedding(input_dim=embeddings.shape[0],
output_dim=embeddings.shape[1],
mask_zero=True,
weights=[embeddings])(word_ids)
x = Bidirectional(LSTM(units=64, return_sequences=True))(word_embeddings)
x = Dense(64, activation='tanh')(x)
x = Dense(10)(x)
sequence_lengths = Input(batch_shape=(None, 1), dtype='int32')
model = Model(inputs=[word_ids, sequence_lengths], outputs=[x])
I think this is done to make the network accept a sequence of any length. My questions are as follow:
Is what I think correct?
If yes, then, I feel like there is a bit of
magic going on under the hood. Any suggestions on how to wrap
one's head around this?
Does this mean that using this method, one doesn't need to pad his sequences (neither in training nor in prediction), and that keras will somehow know how to pad them automatically?

Do you need to pass sequence_lengths as an input?
No, it's absolutely not necessary to pass the sequence lengths as inputs, either if you're working with fixed or with variable length sequences.
I honestly don't understand why that model in the code uses this input if it's not sent to any of the model layers to be processed.
Is this really the complete model?
Why would one pass the sequence lengths as an input?
Well, maybe they want to perform some custom calculations with those. It might be an interesting option, but none of these calculations are present (or shown) in the code you posted. This model is doing absolutely nothing with this input.
How to work with variable sequence length?
For that, you've got two options:
Pad the sequences, as you mentioned, to a fixed size, and add Masking layers to the input (or use the mask_zeros=True option in the embedding layer).
Use the length dimension as None. This is done with one of these:
batch_shape=(batch_size, None)
input_shape=(None,)
PS: these shapes are for Embedding layers. An input that goes directly into recurrent networks would have an additional last dimension for input features
When using the second option (length = None), you should process each batch separately, because you are not able to put all sequences with different lengths in the same numpy array. But there is no limitation in the model itself, and no padding is necessary in this case.
How to work with "unlimited" length
The only way to work with unlimited length is using stateful=True.
In this case, every batch you pass will not be seen as "another group of sequences", but "additional steps of the previous batch".

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string