Must the input to a Keras LSTM be of fixed length (pad_sequences)?
What should I do if the variation in input length is large (for example, some sequences have tens of thousands of steps while others have only dozens)? Specifying such a large maxlen uses too much memory.
Related
Is it possible to train a model with batches that have unequal lengths during an epoch? I am new to PyTorch.
If you take a look at the DataLoader documentation, you'll see a drop_last parameter, which explains that when the dataset size is not divisible by the batch size, the last batch ends up with a different size. So the answer is basically yes: it is possible, it happens often, and it does not affect (too much) the training of a neural network.
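As a quick illustration of that behaviour (a toy dataset of 10 samples with batch size 4, purely hypothetical numbers):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))  # 10 samples

loader = DataLoader(dataset, batch_size=4, drop_last=False)
print([batch[0].shape[0] for batch in loader])  # [4, 4, 2] -- last batch is smaller

loader = DataLoader(dataset, batch_size=4, drop_last=True)
print([batch[0].shape[0] for batch in loader])  # [4, 4] -- incomplete last batch dropped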
However you must be a bit careful: some PyTorch layers deal poorly with very small batch sizes. For example, if you happen to have batchnorm layers and you get a batch of size 1, you'll get errors, since batchnorm cannot estimate batch statistics from a single sample (its variance estimate divides by len(batch) - 1). More generally, training a network that has batchnorms generally requires batches of significant size, say at least 16 (the literature generally aims for 32 or 64). So if you happen to have variable-size batches, take the time to check whether your layers have requirements in terms of batch size for optimal training and convergence. But except in particular cases, your network will train anyway, no worries.
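A minimal reproduction of that failure mode (the sizes here are arbitrary):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(8)
bn.train()

bn(torch.randn(4, 8))        # fine: batch statistics can be estimated
try:
    bn(torch.randn(1, 8))    # a single sample has no batch variance
except ValueError as err:
    print(err)               # "Expected more than 1 value per channel when training, ..."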
As for how to make your batches with custom sizes, I suggest you look at and take inspiration from the PyTorch implementation of DataLoader and Sampler. You may want to implement something similar to BatchSampler and pass it through the batch_sampler argument of DataLoader, as in the sketch below.
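A rough sketch of what that could look like, assuming a hypothetical lengths list (one entry per sample) and a my_pad_collate function that pads each batch to its own maximum length:

import torch
from torch.utils.data import DataLoader, Sampler

class LengthGroupedBatchSampler(Sampler):
    # Sorts sample indices by length so each batch holds sequences of similar
    # length, which keeps per-batch padding small.
    def __init__(self, lengths, batch_size):
        self.batch_size = batch_size
        self.sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i])

    def __iter__(self):
        for start in range(0, len(self.sorted_indices), self.batch_size):
            yield self.sorted_indices[start:start + self.batch_size]

    def __len__(self):
        return (len(self.sorted_indices) + self.batch_size - 1) // self.batch_size

# Usage, assuming `dataset`, `lengths` and `my_pad_collate` already exist:
# loader = DataLoader(dataset,
#                     batch_sampler=LengthGroupedBatchSampler(lengths, batch_size=32),
#                     collate_fn=my_pad_collate)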
I am switching from TensorFlow to PyTorch and I am having some trouble with my net.
I have made a collator (for the DataLoader) that pads each tensor (originally a sentence) in a batch to the maximum length of that batch,
so I have a different input size for each batch.
My network consists of LSTM -> LSTM -> Dense.
My question is: how can I specify this variable input size to the LSTM?
I assume that in TensorFlow I would do Input((None, x)) before the LSTM.
Thank you in advance.
The input size of the LSTM is not how long a sample is. Let's say you have a batch with three samples: the first has length 10, the second 12 and the third 15. What you already did is pad them all with zeros so that all three have length 15. The next batch may have been padded to 16 instead, and that is fine.
But this 15 is not the input size of the LSTM. The input size is the size of one element of a sample in the batch, and that should always be the same.
For example, when you want to classify names:
the inputs are names, for example "joe", "mark", "lucas".
But what the LSTM takes as input are the characters: first "j", then "o", and so on. So as the input size you have to put how many dimensions one character has.
If you use embeddings, that is the embedding size; if you use one-hot encoding, it is the vector size (probably 26). The LSTM consumes the characters of a word one at a time, not the entire word at once.
self.lstm = nn.LSTM(input_size=embedding_size, ...)
I hope this answered your question; if not, please clarify it! Good luck!
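To make the distinction concrete, here is a small sketch (all sizes are made up): the LSTM is defined once with input_size equal to the per-timestep feature dimension, and the same module accepts batches padded to different lengths.

import torch
import torch.nn as nn

embedding_size = 32   # per-character feature dimension -> this is input_size
lstm = nn.LSTM(input_size=embedding_size, hidden_size=64, batch_first=True)

batch_a = torch.randn(3, 15, embedding_size)  # batch padded to length 15
batch_b = torch.randn(3, 16, embedding_size)  # next batch padded to length 16

out_a, _ = lstm(batch_a)
out_b, _ = lstm(batch_b)
print(out_a.shape, out_b.shape)  # torch.Size([3, 15, 64]) torch.Size([3, 16, 64])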
I am working on a binary classification problem. I have ~1.5 million data points, and the dimensionality of the feature space is 1 million. This dataset is stored as a sparse array, with a density of ~0.0001. For this post, I'll limit the scope to assume that the model is a shallow feedforward neural network, and also assume that the dimensionality has already been optimized (so it cannot be reduced below 1 million). Naive approaches to create mini-batches out of this data to feed into the network take a lot of time (as an example, a basic approach of creating a TensorDataset (map style) from a torch.sparse.FloatTensor representation of the input array, and wrapping a DataLoader around it, means ~20s to get a mini-batch of 32 to the network, as opposed to, say, ~0.1s to perform the actual training). I am looking for ways to speed this up.
What I've tried
I first figured that reading from such a large sparse array in every iteration of the DataLoader was computationally intensive, so I broke this sparse array down into smaller sparse arrays.
For the DataLoader to read from these multiple sparse arrays in an iterative fashion, I replaced the map-style dataset that I had inside the DataLoader with an IterableDataset, and streamed these smaller sparse arrays into this IterableDataset like so:
import torch
from itertools import chain
from scipy import sparse

class SparseIterDataset(torch.utils.data.IterableDataset):
    def __init__(self, fpaths):
        super().__init__()
        self.fpaths = fpaths

    def read_from_file(self, fpath):
        # Load one shard, densify it, and yield its rows one sample at a time.
        data = sparse.load_npz(fpath).toarray()
        for d in data:
            yield torch.Tensor(d)

    def get_stream(self, fpaths):
        return chain.from_iterable(map(self.read_from_file, fpaths))

    def __iter__(self):
        return self.get_stream(self.fpaths)
With this approach, I was able to bring the time down from the naive base case of ~20s to ~0.2s per mini-batch of 32. However, given that my dataset has ~1.5 million samples, this still implies a lot of time spent even making one pass through the dataset. (As a comparison, even though it's slightly apples to oranges, running a logistic regression in scikit-learn on the original sparse array takes about ~6s per iteration through the whole dataset. With PyTorch, with the approach I just outlined, it would take ~3000s just to load all the mini-batches in an epoch.)
One thing which I am aware of but have yet to try is multiprocess data loading by setting the num_workers argument of the DataLoader. I believe this has its own catches in the case of iterable-style datasets, though. Plus, even a 10x speedup would still mean ~300s per epoch spent loading mini-batches. I feel I'm being inordinately slow! Are there any other approaches/improvements/best practices that you could suggest?
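(The catch alluded to above is that, with num_workers > 0, each worker runs its own copy of the iterable dataset and would replay the full stream, duplicating every sample. A minimal sketch of the usual fix, sharding the file list per worker; it reuses the SparseIterDataset class defined above and is only an assumption about the intended setup:)

import math
import torch

class ShardedSparseIterDataset(SparseIterDataset):
    # Same streaming logic as above, but each DataLoader worker only iterates
    # over its own slice of the file list, so samples are not duplicated.
    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:                  # single-process data loading
            paths = self.fpaths
        else:                                    # split the .npz files per worker
            per_worker = math.ceil(len(self.fpaths) / worker_info.num_workers)
            start = worker_info.id * per_worker
            paths = self.fpaths[start:start + per_worker]
        return self.get_stream(paths)

# loader = DataLoader(ShardedSparseIterDataset(fpaths), batch_size=32, num_workers=4)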
Your dataset in un-sparsified form would be 1.5M x 1M x 1 byte = 1.5TB as uint8, or 1.5M x 1M x 4 byte = 6TB as float32. Simply reading 6TB from memory to CPU could take 5-10 minutes on a modern CPU (depending on the architecture), and transfer speeds from CPU to GPU would be a bit slower than that (NVIDIA V100 on PCIe has 32GB/s theoretical).
Approaches:
Benchmark everything individually, e.g. in Jupyter:
data = sparse.load_npz(fpath)
dense = data.toarray()

%timeit sparse.load_npz(fpath)   # load the sparse array
%timeit data.toarray()           # un-sparsify, for comparison
%timeit torch.tensor(dense)      # probably about the same as the line above
Also print out the shapes and datatypes of everything to make sure they are as expected. I haven't tried running your code, but I am pretty sure that (a) sparse.load_npz is extremely fast and unlikely to be a bottleneck, and (b) torch.tensor(dense) produces a dense tensor and is also quite slow here.
Use torch.sparse. I think torch sparse tensors can be used as regular tensors in most cases. You'd have to do some data prep to convert from scipy.sparse to torch.sparse:
A sparse tensor is represented as a pair of dense tensors: a tensor of values and a 2D tensor of indices. A sparse tensor can be constructed by providing these two tensors, as well as the size of the sparse tensor.
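A minimal sketch of that conversion, assuming a scipy COO matrix (the shape and density here are made up):

import numpy as np
import torch
from scipy import sparse

scipy_mat = sparse.random(4, 10, density=0.2, format='coo', dtype=np.float32)

indices = torch.tensor(np.vstack([scipy_mat.row, scipy_mat.col]), dtype=torch.long)
values = torch.tensor(scipy_mat.data)
sparse_t = torch.sparse_coo_tensor(indices, values, size=scipy_mat.shape)

print(sparse_t.to_dense().shape)  # densify only when (and where) it is actually needed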
You mention torch.sparse.FloatTensor but I'm pretty sure you're not making sparse tensors in your code - there is no reason to expect those would be constructed simply from passing a scipy.sparse array to a regular tensor constructor, since that's not how they're usually made.
If you figure out a good way to do this, I recommend you post it as a project or repo on GitHub; it would be quite useful.
If torch.sparse doesn't work out, think of other ways to either convert the data to dense only on the GPU, or avoid converting it entirely.
See also:
https://towardsdatascience.com/sparse-matrices-in-pytorch-be8ecaccae6
https://github.com/rusty1s/pytorch_sparse
In An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, the authors state that TCN networks, a specific type of 1D CNNs applied to sequential data, "can also take in inputs of arbitrary lengths by sliding the 1D convolutional kernels", just like Recurrent Nets. I am asking myself how this can be done.
For an RNN, it is straightforward: the same function is applied as many times as the input is long. However, for CNNs (or any feed-forward NN in general), one must prespecify the number of input neurons. So the only way I can see TCNs dealing with arbitrary-length inputs is by specifying a fixed-length input neuron space and then zero-padding the arbitrary-length inputs up to it.
Am I correct in my understanding?
If you have a fully convolutional neural network, there is no reason to have a fully specified input shape. You definitely need a fixed rank, and the last dimension probably should be the same, but otherwise you can specify an input shape which in TensorFlow would look like Input((None, 10)) in the case of 1D CNNs.
Indeed, the shape of the convolution kernel doesn't depend on the length of the input along the temporal dimension (it typically depends on the last dimension, i.e. the number of channels), so you can apply it to any input with the same rank (and same last dimension).
For example, let's say you are applying a single 1D convolution with a kernel that sums two neighbouring elements (kernel = (1, 1)). This operation can be applied to an input of any length, since the kernel only ever sees two elements at a time.
However, when confronted with a sequence-to-label task that requires further operations in the stack such as a fully-connected layer, the inputs must be of fixed length (or must be made so through zero padding).
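A small sketch of the Input((None, 10)) construction described above (the layer sizes are arbitrary): the same fully convolutional model runs on inputs of two different lengths.

import numpy as np
from tensorflow import keras

inputs = keras.Input(shape=(None, 10))  # (time, features), time left unspecified
x = keras.layers.Conv1D(16, kernel_size=3, padding='causal', activation='relu')(inputs)
x = keras.layers.Conv1D(16, kernel_size=3, padding='causal', activation='relu')(x)
model = keras.Model(inputs, x)

print(model(np.random.rand(1, 50, 10).astype('float32')).shape)   # (1, 50, 16)
print(model(np.random.rand(1, 200, 10).astype('float32')).shape)  # (1, 200, 16)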
I am designing an embedding layer where the vocabulary is ~4000 and most training examples have a short length of less than 10. However, some examples have a length of 100 or possibly even several hundred, and I would like to avoid zero-padding every single example to length 100+ just to maintain a constant input length across all examples.
To remedy this I would like to only pad based on the max length within the batch, so that almost all batches would only have input length ~10 with only a few batches having a lot of padding. How do I load in each batch with a different input length into the Embedding layer?
One possible way is to set the input_length argument to None. But if you are going to use Dense and Flatten layers after this layer, they might not work. For more details, see the Keras documentation for the Embedding layer:
... This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).
model = keras.models.Sequential(
    [
        keras.layers.Embedding(voc_size, embedding_dim, input_length=None)
    ]
)
Now the model can accept variable length sequences.
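For example, reusing the model above (and assuming, say, voc_size = 4000 and embedding_dim = 64), each batch can be padded only to its own maximum length:

from tensorflow.keras.preprocessing.sequence import pad_sequences

batch_short = pad_sequences([[5, 42, 7], [3, 9]], padding='post')   # shape (2, 3)
batch_long  = pad_sequences([[1] * 120, [2] * 80], padding='post')  # shape (2, 120)

print(model(batch_short).shape)  # (2, 3, 64)
print(model(batch_long).shape)   # (2, 120, 64)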