PyTorch: Relation between Dynamic Computational Graphs - Padding - DataLoader - nlp

As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length. But, if I want to use PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors - given that me as a total beginner does not want to build some customized collate_fn.
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end (in the case of word ids), will it have any negative effect on my training since PyTorch may not be optimized for computations with padded sequences (since the whole premise is that it can work with variable sequence lengths in the dynamic graphs), or does it simply not make any difference?
I will also post this question in the PyTorch Forum...
Thanks!

In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length.
This means that you don't need to pad sequences unless you are doing data batching which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (which is described in detail in this paper) that does batching on the graph operations instead of the data, so this might be what you want to look into.
But, if I want to use PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors - given that me as a total beginner does not want to build some customized collate_fn.
You can use the DataLoader given you write your own Dataset class and you are using batch_size=1. The twist is to use numpy arrays for your variable length sequences (otherwise default_collate will give you a hard time):
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
class FooDataset(Dataset):
def __init__(self, data, target):
assert len(data) == len(target)
self.data = data
self.target = target
def __getitem__(self, index):
return self.data[index], self.target[index]
def __len__(self):
return len(self.data)
data = [[1,2,3], [4,5,6,7,8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']
ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)
print(list(enumerate(dl)))
# [(0, [
# 1 2 3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
# 4 5 6 7 8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Fair point but the main strength of dynamic computational graphs are (at least currently) mainly the possibility of using debugging tools like pdb which rapidly decrease your development time. Debugging is way harder with static computation graphs. There is also no reason why PyTorch would not implement further just-in-time optimizations or a concept similar to DyNet's auto-batching in the future.
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end [...], will it have any negative effect on my training [...]?
Yes, both in runtime and for the gradients. The RNN will iterate over the padding just like normal data which means that you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pad_packed_sequence and pack_padded_sequence. These will let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (or at least not if you don't add support for it manually).

Related

Reduce inference time for BERT

I want to further improve the inference time from BERT.
Here is the code below:
for sentence in list(data_dict.values()):
tokens = {'input_ids': [], 'attention_mask': []}
new_tokens = tokenizer.encode_plus(sentence, max_length=512,
truncation=True, padding='max_length',
return_tensors='pt',
return_attention_mask=True)
tokens['input_ids'].append(new_tokens['input_ids'][0])
tokens['attention_mask'].append(new_tokens['attention_mask'][0])
# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
outputs = model(**tokens)
embeddings = outputs[0]
Is there a way to provide batches (like in training) instead of the whole dataset?
There are several optimizations that we can do here, which are (mostly) natively supported by the Huggingface tokenizer.
TL;DR, an optimized version would be this one, I have explained the ideas behind each change below.
def chunker(seq, batch_size=16):
return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))
for sentence_batch in chunker(list(data_dict.values())):
tokenized_sentences = tokenizer(sentence_batch, max_length=512,
truncation=True, padding=True,
return_tensors="pt", return_attention_mask=True)
with torch.no_grad():
outputs = model(**tokenized_sentences)
The first optimization is to batch together several samples at the same time. For this, it is helpful to have a closer look at the actual __call__ function of the tokenizer, see here (bold highlight by me):
text (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded [...].
This means it is enough if we can simply pass several samples at the same time, and we already get the readily processed batch back. I want to personally note that it would be in theory possible to pass the entire list of samples at once, but there are also some drawbacks that we go into later.
To actually pass a decently sized number of samples to the tokenizer, we need a function that can aggregate several samples from the dictionary (our batch-to-be) in a single iteration. I've used another Stackoverflow answer for this, see this post for several valid answers.
I've chosen the highest-voted answer, but do note that this creates and explicit copy, and might therefore not be the most memory-efficient solution. Then you can simply iterate over the batches, like so:
def chunker(seq, batch_size=16):
return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))
for sentence_batch in chunker(list(data_dict.values())):
...
The next optimization is in the way you can call your tokenizer. Your code does this with many several steps, which can be aggregated into a single call. For the sake of clarity, I also point out which of these arguments are not required in your call (this often improves your code readability).
tokenized_sentences = tokenizer(sentence_batch, max_length=512,
truncation=True, padding=True,
return_tensors="pt", return_attention_mask=True)
with torch.no_grad(): # Just to be sure
outputs = model(**tokenized_sentences)
I want to comment on the use of some of the arguments as well:
max_length=512: This is only required if your value differs from the model's default max_length. For most models, this will otherwise default to 512.
return_attention_mask: Will also default to the model-specific values, and in most cases does not need to be set explicitly.
padding=True: If you noticed, this is different from your version, and arguably what gives you the most "out-of-the-box" speedup. By using padding=max_length, each sequence will be computing quite a lot of unnecessary tokens, since each input is 512 tokens long. For most real-world data I have seen, inputs tend to be much shorter, and therefore you only need to consider the longest sequence length in your batch. padding=True does exactly that. For actual (CPU inference) speedups, I have played around with some different sequence lengths myself, see my repository on Github. Noticeably, for the same CPU and different batch sizes, there is a 10x speedup possible.
Edit: I've added the torch.no_grad() here, too, just in case somebody else wants to use this snippet. I generally recommend to use it right before the piece of code that is actually affected by it, just so that nothing gets overlooked by accident.
Also, there are some more possible optimizations that require you to have a bit more insights into your data samples:
If the variance of sample lengths is quite drastic, you can get an even higher speedup if you sort your samples by length (ideally, tokenized length, but character length / word count will also give you an approximate idea). That way, when batching several samples together, you minimize the amount of padding that is required.
Maybe you might be interested in Intel OpenVINO backend for inference execution on CPU? It's currently work in progress on branch https://github.com/huggingface/transformers/pull/14203
I had the same issue of time inference with Bert on the CPU. I started using HuggingFace Pipelines for inference, and the Trainer for training.
It's well documented on HuggingFace.
The pipeline makes it simple to perform inference on batches. On one pass, you can get the inference done instead of looping on a sequence of single texts.

What are some ways to speed up data loading on large sparse arrays (~1 million x 1 million, density ~0.0001) in Pytorch?

I am working on a binary classification problem. I have ~1.5 million data points, and the dimensionality of the feature space is 1 million. This dataset is stored as a sparse array, with a density of ~0.0001. For this post, I'll limit the scope to assume that the model is a shallow feedforward neural network, and also assume that the dimensionality has already been optimized (so cannot be reduced below 1 million). Naiive approaches to create mini-batches out of this data to feed into the network would take a lot of time (As an example, a basic approach of creating a TensorDataset (map style) from a torch.sparse.FloatTensor representation of the input array, and wrapping a DataLoader around it, means ~20s to get a mini-batch of 32 to the network, as opposed to say ~0.1s to perform the actual training). I am looking for ways to speed this up.
What I've tried
I first figured that reading from such a large sparse array in every iteration of the DataLoader was computationally intensive, so I broke down this sparse array into smaller sparse arrays
For the DataLoader to read from these multiple sparse arrays in an iterative fashion, I replaced the map style dataset that I had inside the DataLoader with an IterableDataset, and streamed these smaller sparse arrays into this IterableDataset like so:
from itertools import chain
from scipy import sparse
class SparseIterDataset(torch.utils.data.IterableDataset):
def __init__(self, fpaths):
super(SparseIter).__init__()
self.fpaths = fpaths
def read_from_file(self, fpath):
data = sparse.load_npz(fpath).toarray()
for d in data:
yield torch.Tensor(d)
def get_stream(self, fpaths):
return chain.from_iterable(map(self.read_from_file, fpaths))
def __iter__(self):
return self.get_stream(self.fpaths)
With this approach, I was able to bring down the time from the naiive base case of ~20s to ~0.2s per minibatch of 32. However, given that my dataset has ~1.5 million samples, this still implies a lot of time spent on even making one pass through the dataset. (As a comparison, even though it's slightly apples to oranges, running a logistic regression on scikit-learn on the original sparse array takes about ~6s per iteration through the whole dataset. With pytorch, with the approach I just outlined, it would take ~3000s just to load all the minibatches in an epoch)
One thing which I am aware of but yet to try is using multiprocess data loading by setting the num_workers argument in the DataLoader. I believe this has its own catches in the case of iterable style datasets though. Plus even a 10x speedup would still mean ~300s per epoch in loading mini batches. I feel I'm being inordinately slow! Are there any other approaches/improvements/best practices that you could suggest?
Your dataset in un-sparsified form would be 1.5M x 1M x 1 byte = 1.5TB as uint8, or 1.5M x 1M x 4 byte = 6TB as float32. Simply reading 6TB from memory to CPU could take 5-10 minutes on a modern CPU (depending on the architecture), and transfer speeds from CPU to GPU would be a bit slower than that (NVIDIA V100 on PCIe has 32GB/s theoretical).
Approaches:
Benchmark everything individually - eg in jupyter
%%timeit data = sparse.load_npz(fpath).toarray()
%%timeit dense = data.toarray() # un-sparsify for comparison
%%timeit t = torch.tensor(data) # probably about the same as the line above
Also print out the shapes and datatypes of everything to make sure they are as expected. I haven't tried running your code but I am pretty sure that (a) sparse.load_npz is extremely fast and unlikely to be a bottleneck, but (b) torch.tensor(data) produces a dense tensor and is also quite slow here
Use torch.sparse. I think torch sparse tensors can be used as regular tensors in most cases. You'd have to do some data prep to convert from scipy.sparse to torch.sparse:
A sparse tensor is represented as a pair of dense tensors: a tensor of
values and a 2D tensor of indices. A sparse tensor can be constructed by
providing these two tensors, as well as the size of the sparse tensor
You mention torch.sparse.FloatTensor but I'm pretty sure you're not making sparse tensors in your code - there is no reason to expect those would be constructed simply from passing a scipy.sparse array to a regular tensor constructor, since that's not how they're usually made.
If you figure out a good way to do this, I recommend you post it as a project or git on github, it would be quite useful.
If torch.sparse doesn't work out, think of other ways to either convert the data to dense only on the GPU, or avoid converting it entirely.
See also:
https://towardsdatascience.com/sparse-matrices-in-pytorch-be8ecaccae6
https://github.com/rusty1s/pytorch_sparse

Padding sequences of 2D elements in keras

I have a set of samples, each being a sequence of a set of attributes (for example a sample can comprise of 10 sequences each having 5 attributes). The number of attributes is always fixed, but the number of sequences (which are timestamps) can vary from sample to sample. I want to use this sample set for training an LSTM network in Keras for a classification problem and therefore I should pad the input size for all batch samples to be the same. But the pad_sequences processor in keras gets a fixed number of sequences with variable attributes and pad the missing attributes in each sequence, while I need to add more sequences of a fixed attribute length to each sample. So I think I can not use it and therefore I padded my samples separately and made a unified datset and then fed my network with it. But is there a shortcut with Keras functions to do this?
Also I heard about masking the padded input data during learning but I am not sure if I really need it as my classifier assigns one class label after processing the whole sample sequence. do I need it? And if yes, could you please help me with a simple example on how to do that?
Unfortunately, the documentation is quite missleading, but pad_sequences does exactly what you want. For example, this code
length3 = np.random.uniform(0, 1, size=(3,2))
length4 = np.random.uniform(0, 1, size=(4,2))
pad_sequences([length3, length4], dtype='float32', padding='post')
results in
[[[0.0385175 0.4333343 ]
[0.332416 0.16542904]
[0.69798684 0.45242336]
[0. 0. ]]
[[0.6518417 0.87938637]
[0.1491589 0.44784057]
[0.27607143 0.02688376]
[0.34607577 0.3605469 ]]]
So, here we have two sequences of different lengths, each timestep having two features, and the result is one numpy array where the shorter of the two sequences got padded with zeros.
Regarding your other question: Masking is a tricky topic, in my experience. But LSTMs should be fine with it. Just use a Masking() layer as your very first one. By default, it will make the LSTMs ignore all zeros, so in you case exactly the ones you added via padding. But you can use any value for masking, just as you can use any value for padding. If possible, choose a value that does not occur in your dataset.
If you don't use masking, that will yield the danger that your LSTM learns that the padded values do have some meaning while in reality they don't.
For example, if during training you feed in the the sequence
[[1,2],
[2,1],
[0,0],
[0,0],
[0,0]]
and later on the trained network you only feed in
[[1,2],
[2,1]]
You could get unexpected results (not necessarily, though). Masking avoids that by excluding the masked value from training.

Understanding Input Sequences of Unlimited Length for RNNs in Keras

I have been looking into an implementation of a certain architecture of deep learning model in keras when I came across a technicality that I could not grasp. In the code, the model is implemented as having two inputs; the first is the normal input that goes through the graph (word_ids in the sample code below), while the second is the length of that input, which seems to be involved nowhere other than the inputs argument in the keras Model instant (sequence_lengths in the sample code below).
word_ids = Input(batch_shape=(None, None), dtype='int32')
word_embeddings = Embedding(input_dim=embeddings.shape[0],
output_dim=embeddings.shape[1],
mask_zero=True,
weights=[embeddings])(word_ids)
x = Bidirectional(LSTM(units=64, return_sequences=True))(word_embeddings)
x = Dense(64, activation='tanh')(x)
x = Dense(10)(x)
sequence_lengths = Input(batch_shape=(None, 1), dtype='int32')
model = Model(inputs=[word_ids, sequence_lengths], outputs=[x])
I think this is done to make the network accept a sequence of any length. My questions are as follow:
Is what I think correct?
If yes, then, I feel like there is a bit of
magic going on under the hood. Any suggestions on how to wrap
one's head around this?
Does this mean that using this method, one doesn't need to pad his sequences (neither in training nor in prediction), and that keras will somehow know how to pad them automatically?
Do you need to pass sequence_lengths as an input?
No, it's absolutely not necessary to pass the sequence lengths as inputs, either if you're working with fixed or with variable length sequences.
I honestly don't understand why that model in the code uses this input if it's not sent to any of the model layers to be processed.
Is this really the complete model?
Why would one pass the sequence lengths as an input?
Well, maybe they want to perform some custom calculations with those. It might be an interesting option, but none of these calculations are present (or shown) in the code you posted. This model is doing absolutely nothing with this input.
How to work with variable sequence length?
For that, you've got two options:
Pad the sequences, as you mentioned, to a fixed size, and add Masking layers to the input (or use the mask_zeros=True option in the embedding layer).
Use the length dimension as None. This is done with one of these:
batch_shape=(batch_size, None)
input_shape=(None,)
PS: these shapes are for Embedding layers. An input that goes directly into recurrent networks would have an additional last dimension for input features
When using the second option (length = None), you should process each batch separately, because you are not able to put all sequences with different lengths in the same numpy array. But there is no limitation in the model itself, and no padding is necessary in this case.
How to work with "unlimited" length
The only way to work with unlimited length is using stateful=True.
In this case, every batch you pass will not be seen as "another group of sequences", but "additional steps of the previous batch".

Keras Lambda Layer for Custom Loss

I am attempting to implement a Lambda layer that will produce a custom loss function. In the layer, I need to be able to compare every element in a batch to every other element in the batch in order to calculate the cost. Ideally, I want code that looks something like this:
for el_1 in zip(y_pred, y_true):
for el_2 in zip(y_pred, y_true):
if el_1[1] == el_2[1]:
# Perform a calculation
else:
# Perform a different calculation
When I true this, I get:
TypeError: TensorType does not support iteration.
I am using Keras version 2.0.2 with a Theano version 0.9.0 backend. I understand that I need to use Keras tensor functions in order to do this, but I can't figure out any tensor functions that do what I want.
Also, I am having difficulty understanding precisely what my Lambda function should return. Is it a tensor of the total cost for each sample, or is it just a total cost for the batch?
I have been beating my head against this for days. Any help is deeply appreciated.
A tensor in Keras commonly has at least 2 dimensions, the batch and the neuron/unit/node/... dimension. A dense layer with 128 units trained with a batch size of 64 would therefore yields a tensor with shape (64,128).
Your LambdaLayer processes tensors as any other layer does, plugging it in after your dense layer from before will give you a tensor with shape (64,128) to process. Processing a tensor works similar to how calculations on numpy arrays works (or any other vector processing library really): you specify one operation to broadcast over all elements in the data structure.
For example, your custom cost is the difference for each value in the batch, you would implement it like so:
cost_layer = LambdaLayer(lambda a,b: a - b)
The - operation is broadcasted over a and b and will return a suitable result provided the dimensions match. The takeaway is that you really only can specify one operation for every value. If you want to do more complex tasks, for example computations based on the value you need single operations that take two operations and apply the correct one accordingly, i.e. the switch operation.
The syntax for K.switch is
K.switch(condition, then_expression, else_expression)
For example, if you want to subtract both values when a != b but add them when they are equal, you would write:
import keras.backend as K
cost_layer = LambdaLayer(lambda a,b: K.switch(a != b, a - b, a + b))

Resources