Understanding Input Sequences of Unlimited Length for RNNs in Keras

I have been looking into an implementation of a certain deep learning architecture in Keras and came across a technicality that I could not grasp. In the code, the model is implemented with two inputs: the first is the normal input that goes through the graph (word_ids in the sample code below), while the second is the length of that input, which seems to be involved nowhere other than the inputs argument of the Keras Model instance (sequence_lengths in the sample code below).
word_ids = Input(batch_shape=(None, None), dtype='int32')
word_embeddings = Embedding(input_dim=embeddings.shape[0],
                            output_dim=embeddings.shape[1],
                            mask_zero=True,
                            weights=[embeddings])(word_ids)
x = Bidirectional(LSTM(units=64, return_sequences=True))(word_embeddings)
x = Dense(64, activation='tanh')(x)
x = Dense(10)(x)
sequence_lengths = Input(batch_shape=(None, 1), dtype='int32')
model = Model(inputs=[word_ids, sequence_lengths], outputs=[x])
I think this is done to make the network accept a sequence of any length. My questions are as follows:
Is what I think correct?
If yes, then I feel like there is a bit of magic going on under the hood. Any suggestions on how to wrap one's head around this?
Does this mean that with this method one doesn't need to pad the sequences (neither in training nor in prediction), and that Keras will somehow know how to pad them automatically?

Do you need to pass sequence_lengths as an input?
No, it's absolutely not necessary to pass the sequence lengths as inputs, whether you're working with fixed- or variable-length sequences.
I honestly don't understand why that model in the code uses this input if it's not sent to any of the model layers to be processed.
Is this really the complete model?
Why would one pass the sequence lengths as an input?
Well, maybe they want to perform some custom calculations with those. It might be an interesting option, but none of these calculations are present (or shown) in the code you posted. This model is doing absolutely nothing with this input.
How to work with variable sequence length?
For that, you've got two options:
Pad the sequences, as you mentioned, to a fixed size, and add a Masking layer after the input (or use the mask_zero=True option in the Embedding layer).
Use the length dimension as None. This is done with one of these:
batch_shape=(batch_size, None)
input_shape=(None,)
PS: these shapes are for Embedding layers. An input that goes directly into recurrent layers would have an additional last dimension for the input features.
When using the second option (length = None), you should process each batch separately, because you are not able to put all sequences with different lengths in the same numpy array. But there is no limitation in the model itself, and no padding is necessary in this case.
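A minimal sketch of what "process each batch separately" can look like, assuming the data is a plain Python list of token-id lists. Grouping sequences by length gives batches that are rectangular without any padding. The helper name bucket_by_length is hypothetical, not a Keras function.

```python
from collections import defaultdict


def bucket_by_length(sequences):
    """Group sequences so every bucket holds sequences of one length."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    return dict(buckets)


# Made-up token-id sequences of different lengths.
sequences = [[4, 7], [1, 2, 3], [9, 8], [5, 6, 7], [3]]
batches = bucket_by_length(sequences)

for length, batch in sorted(batches.items()):
    # Each batch is rectangular (len(batch) x length), so it could be
    # converted to a NumPy array and passed to model.train_on_batch alone.
    print(length, batch)
```

Each bucket can then be fed to the model as its own batch, since the length dimension is None and only has to be consistent within a batch.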
How to work with "unlimited" length
The only way to work with unlimited length is using stateful=True.
In this case, every batch you pass will not be seen as "another group of sequences", but "additional steps of the previous batch".
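The idea can be illustrated without Keras. The following is a toy scalar recurrence (the weights w and u and the function run_rnn are made up for illustration, this is not how Keras implements it). It shows that feeding a long sequence in chunks while carrying the state over gives the same final state as feeding it in one go, which is the behaviour stateful=True provides between batches.

```python
import math


def run_rnn(inputs, state=0.0, w=0.5, u=0.8):
    """Toy scalar RNN: h_t = tanh(w * x_t + u * h_{t-1})."""
    for x in inputs:
        state = math.tanh(w * x + u * state)
    return state


long_sequence = [0.1, -0.4, 0.9, 0.3, -0.7, 0.2]

# One pass over the whole sequence.
full = run_rnn(long_sequence)

# Same sequence in chunks of 2 steps, carrying the state over
# (the analogue of consecutive batches with stateful=True).
state = 0.0
for start in range(0, len(long_sequence), 2):
    state = run_rnn(long_sequence[start:start + 2], state=state)

# Starting from a fresh state for the last chunk (the stateless default)
# loses the context of the earlier steps.
stateless = run_rnn(long_sequence[-2:])

print(full, state, stateless)
```

The first two printed values are identical, while the third differs because the earlier steps were forgotten.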

Related

The size of Logits of Roberta model is weird

My input size is [8,22]. A batch with 8 tokenized sentences with a length of 22.
I don't want to use the default classifier.
model = RobertaForSequenceClassification.from_pretrained("xlm-roberta-large")
model.classifier=nn.Identity()
After calling model(batch), the size of the result is torch.Size([8, 22, 1024]). I have no idea why. Shouldn't it be [8, 1024]?
The model.classifier object you have replaced used to be an instance of RobertaClassificationHead. If you take a look at its source code[1], the layer is hard-coded to index the first item of the second dimension of its input, which is supposed to be the [CLS] token.
By replacing it with an Identity you miss out on the indexing operation, hence your output shape.
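To see what was lost, here is the indexing step in isolation, on nested Python lists standing in for the [8, 22, 1024] tensor. This is a sketch of the operation only, not the Hugging Face code; on a PyTorch tensor the equivalent slice would be features[:, 0, :].

```python
batch_size, seq_len, hidden_size = 8, 22, 1024

# Stand-in for the encoder output: one vector per token per sentence.
hidden_states = [
    [[0.0] * hidden_size for _ in range(seq_len)]
    for _ in range(batch_size)
]

# Keep only the first token's vector of every sentence.
first_token = [sentence[0] for sentence in hidden_states]

print(len(hidden_states), len(hidden_states[0]), len(hidden_states[0][0]))  # 8 22 1024
print(len(first_token), len(first_token[0]))                                # 8 1024
```

So if you want [8, 1024] without the default classifier, the replacement module has to perform this first-token selection itself.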
Long story short, don't assume functionality you haven't verified when it comes to code that isn't your own, huggingface in particular (lots of ad-hoc classes and spaghetti interfaces, at least as far as I'm concerned).
[1] source

Reduce inference time for BERT

I want to further improve the inference time from BERT.
Here is the code below:
for sentence in list(data_dict.values()):
    tokens = {'input_ids': [], 'attention_mask': []}
    new_tokens = tokenizer.encode_plus(sentence, max_length=512,
                                       truncation=True, padding='max_length',
                                       return_tensors='pt',
                                       return_attention_mask=True)
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])
    # reformat list of tensors into single tensor
    tokens['input_ids'] = torch.stack(tokens['input_ids'])
    tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
    outputs = model(**tokens)
    embeddings = outputs[0]
Is there a way to provide batches (like in training) instead of the whole dataset?
There are several optimizations that we can do here, which are (mostly) natively supported by the Huggingface tokenizer.
TL;DR: an optimized version would be the one below. I have explained the ideas behind each change further down.
def chunker(seq, batch_size=16):
    return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))

for sentence_batch in chunker(list(data_dict.values())):
    tokenized_sentences = tokenizer(sentence_batch, max_length=512,
                                    truncation=True, padding=True,
                                    return_tensors="pt", return_attention_mask=True)
    with torch.no_grad():
        outputs = model(**tokenized_sentences)
The first optimization is to batch together several samples at the same time. For this, it is helpful to have a closer look at the actual __call__ function of the tokenizer, see here (bold highlight by me):
text (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded [...].
This means it is enough if we can simply pass several samples at the same time, and we already get the readily processed batch back. I want to personally note that it would be in theory possible to pass the entire list of samples at once, but there are also some drawbacks that we go into later.
To actually pass a decently sized number of samples to the tokenizer, we need a function that can aggregate several samples from the dictionary (our batch-to-be) in a single iteration. I've used another Stack Overflow answer for this, see this post for several valid answers.
I've chosen the highest-voted answer, but do note that this creates an explicit copy, and might therefore not be the most memory-efficient solution. Then you can simply iterate over the batches, like so:
def chunker(seq, batch_size=16):
    return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))

for sentence_batch in chunker(list(data_dict.values())):
    ...
The next optimization is in the way you call your tokenizer. Your code does this in several steps, which can be aggregated into a single call. For the sake of clarity, I also point out which of these arguments are not required in your call (leaving them out often improves readability).
tokenized_sentences = tokenizer(sentence_batch, max_length=512,
                                truncation=True, padding=True,
                                return_tensors="pt", return_attention_mask=True)
with torch.no_grad():  # Just to be sure
    outputs = model(**tokenized_sentences)
I want to comment on the use of some of the arguments as well:
max_length=512: This is only required if your value differs from the model's default max_length. For most models, this will otherwise default to 512.
return_attention_mask: Will also default to the model-specific values, and in most cases does not need to be set explicitly.
padding=True: As you may have noticed, this differs from your version, and it is arguably what gives you the most "out-of-the-box" speedup. With padding='max_length', every input is padded to 512 tokens, so the model spends a lot of computation on unnecessary padding tokens. For most real-world data I have seen, inputs tend to be much shorter, so you only need to pad to the longest sequence in your batch, and padding=True does exactly that. For actual (CPU inference) speedups, I have played around with different sequence lengths myself, see my repository on GitHub. Notably, for the same CPU and different batch sizes, a 10x speedup is possible.
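A back-of-the-envelope sketch of why the padding strategy matters, using made-up token counts for one batch (the numbers are illustrative, not measurements):

```python
# Hypothetical tokenized lengths of the sentences in one batch.
lengths = [12, 48, 31, 7, 64, 25, 19, 40]

# padding='max_length': every row is padded to max_length.
max_length = 512
tokens_fixed = len(lengths) * max_length

# padding=True: every row is padded to the longest row in the batch.
tokens_dynamic = len(lengths) * max(lengths)

print(tokens_fixed, tokens_dynamic, tokens_fixed / tokens_dynamic)  # 4096 512 8.0
```

In this example the model has to process 8x fewer token positions for the same batch.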
Edit: I've added the torch.no_grad() here, too, just in case somebody else wants to use this snippet. I generally recommend to use it right before the piece of code that is actually affected by it, just so that nothing gets overlooked by accident.
Also, there are some more possible optimizations that require you to have a bit more insights into your data samples:
If the variance of sample lengths is quite drastic, you can get an even higher speedup if you sort your samples by length (ideally, tokenized length, but character length / word count will also give you an approximate idea). That way, when batching several samples together, you minimize the amount of padding that is required.
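A small sketch of the sorting idea, using word count as a cheap proxy for tokenized length. The sentences and the helper padding_needed are made up for illustration; chunker is the same function as above.

```python
def chunker(seq, batch_size=16):
    return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))


def padding_needed(sentences, batch_size):
    """Count pad positions when each batch is padded to its longest item."""
    total = 0
    for batch in chunker(sentences, batch_size):
        longest = max(len(s.split()) for s in batch)
        total += sum(longest - len(s.split()) for s in batch)
    return total


sentences = [
    "a b c d e f g h",
    "a",
    "a b c d e f g",
    "a b",
    "a b c d e f",
    "a b c",
]

# Sort by (approximate) length before batching.
by_length = sorted(sentences, key=lambda s: len(s.split()))

print(padding_needed(sentences, batch_size=2))  # 15
print(padding_needed(by_length, batch_size=2))  # 5
```

If you sort, keep track of the original indices so you can map the outputs back to the original order afterwards.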
You might be interested in the Intel OpenVINO backend for inference on CPU. It is currently a work in progress in this pull request: https://github.com/huggingface/transformers/pull/14203
I had the same issue with BERT inference time on the CPU. I started using HuggingFace Pipelines for inference, and the Trainer for training.
It's well documented on HuggingFace.
The pipeline makes it simple to perform inference on batches. On one pass, you can get the inference done instead of looping on a sequence of single texts.

Keras regression - Should my first/last layer have an activation function?

I keep seeing examples floating around the internet where the input and/or output layer have either no activation function, a linear activation function, or None. What I'm confused about is when to use one, and how to know if you should? I also am confused about what the number of nodes should be for the input layer.
Right now I have a regression problem: I'm trying to predict a real value based on an array of inputs (about 54). Should I be using relu as the activation function for the input layer? Should I have linear as my output activation? My data is linearly scaled from 0 to 1 for each feature independently, as they're in different units. I was also unsure of the number of nodes I should use for my input layer, as I see some examples pick an arbitrary number not related to their input shape, and other examples saying to specifically set it to the number of inputs, or the number of inputs plus one for a bias. But none of the examples so far have explained the reasoning behind their choices.
Since my model isn't performing very well, I thought asking what the architecture should be could help me fine tune it more.

Padding sequences of 2D elements in keras

I have a set of samples, each being a sequence of a set of attributes (for example, a sample can consist of 10 sequences, each having 5 attributes). The number of attributes is always fixed, but the number of sequences (which are timestamps) can vary from sample to sample. I want to use this sample set to train an LSTM network in Keras for a classification problem, and therefore I need to pad the inputs so that all samples in a batch have the same size. But the pad_sequences processor in Keras seems to take a fixed number of sequences with variable attributes and pad the missing attributes in each sequence, while I need to add more sequences of a fixed attribute length to each sample. So I think I cannot use it, and therefore I padded my samples separately, built a unified dataset, and fed my network with it. But is there a shortcut with Keras functions to do this?
Also, I heard about masking the padded input data during learning, but I am not sure if I really need it, as my classifier assigns one class label after processing the whole sample sequence. Do I need it? And if yes, could you please help me with a simple example of how to do that?
Unfortunately, the documentation is quite misleading, but pad_sequences does exactly what you want. For example, this code
length3 = np.random.uniform(0, 1, size=(3,2))
length4 = np.random.uniform(0, 1, size=(4,2))
pad_sequences([length3, length4], dtype='float32', padding='post')
results in
[[[0.0385175  0.4333343 ]
  [0.332416   0.16542904]
  [0.69798684 0.45242336]
  [0.         0.        ]]

 [[0.6518417  0.87938637]
  [0.1491589  0.44784057]
  [0.27607143 0.02688376]
  [0.34607577 0.3605469 ]]]
So, here we have two sequences of different lengths, each timestep having two features, and the result is one numpy array where the shorter of the two sequences got padded with zeros.
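For reference, this is roughly what post-padding does here, written out in plain Python. The function pad_post is a hypothetical stand-in to illustrate the behaviour, not the Keras implementation, and the sample values are made up.

```python
def pad_post(samples, value=0.0):
    """Append rows of `value` so all samples have the same number of timesteps."""
    max_steps = max(len(sample) for sample in samples)
    n_features = len(samples[0][0])
    return [
        [list(step) for step in sample]
        + [[value] * n_features for _ in range(max_steps - len(sample))]
        for sample in samples
    ]


length3 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]              # 3 timesteps, 2 features
length4 = [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2], [1.3, 1.4]]  # 4 timesteps, 2 features

padded = pad_post([length3, length4])
print(padded[0])  # [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
```

The number of features per timestep is untouched; only whole timesteps are appended to the shorter sample.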
Regarding your other question: Masking is a tricky topic, in my experience. But LSTMs should be fine with it. Just use a Masking() layer as your very first one. By default, it will make the LSTMs skip every timestep whose values are all zero, so in your case exactly the ones you added via padding. But you can use any value for masking, just as you can use any value for padding. If possible, choose a value that does not occur in your dataset.
If you don't use masking, there is a danger that your LSTM learns that the padded values have some meaning while in reality they don't.
For example, if during training you feed in the sequence
[[1,2],
[2,1],
[0,0],
[0,0],
[0,0]]
and later on the trained network you only feed in
[[1,2],
[2,1]]
You could get unexpected results (not necessarily, though). Masking avoids that by excluding the masked value from training.
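A toy illustration of that effect, not the Keras implementation: a hand-written recurrence (the weights and the function name run are made up) in which masked timesteps leave the state untouched, so the padded and unpadded inputs end in the same state.

```python
import math


def run(sequence, mask_value=None):
    """Toy recurrence over 2-feature timesteps; optionally skip masked steps."""
    state = 0.0
    for step in sequence:
        if mask_value is not None and all(x == mask_value for x in step):
            continue  # masked timestep: state is carried over unchanged
        state = math.tanh(0.3 * step[0] - 0.2 * step[1] + 0.9 * state)
    return state


padded = [[1, 2], [2, 1], [0, 0], [0, 0], [0, 0]]
unpadded = [[1, 2], [2, 1]]

# Without masking, the zero rows keep modifying the state.
print(run(padded), run(unpadded))

# With masking, both inputs end in the same state.
print(run(padded, mask_value=0), run(unpadded, mask_value=0))
```

Without the mask the two results differ; with it they are identical, which is the property you want between training on padded data and predicting on unpadded data.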

Audio classification with Keras: presence of human voice

I'd like to create an audio classification system with Keras that simply determines whether a given sample contains human voice or not. Nothing else. This would be my first machine learning attempt.
This audio preprocessor exists. It claims not to be done, but it's been forked a few times:
https://github.com/drscotthawley/audio-classifier-keras-cnn
I don't understand how this one would work, but I'm ready to give it a try:
https://github.com/keunwoochoi/kapre
But let's say I got one of those to work, would the rest of the process be similar to image classification? Basically, I've never fully understood when to use Softmax and when to use ReLu. Would this be similar with sound as it would with images once I've got the data mapped as a tensor?
Sounds can be seen as 1D images and processed with 1D convolutions.
Dilated convolutions often work well, see WaveNet.
Sounds can also be seen as sequences and processed with RNN layers (but they may contain too much data for that).
For your case, you need only one output with a 'sigmoid' activation at the end and a 'binary_crossentropy' loss.
Result = 0 -> no voice
Result = 1 -> there's voice
When to use 'softmax'?
The softmax function is good for multiclass problems (not your case) where you want only one class as a result. All the results of a softmax function will sum 1. It's intended to be like a probability of each class.
It's mainly used at the final layer, because you only get classes as the final result.
It's good for cases when only one class is correct. And in this case, it goes well with the loss categorical_crossentropy.
Relu and other activations in the middle of the model
There are no strict rules for these. There are lots of possibilities. I often see relu in image convolutional models.
The important thing to know is their "ranges": what are the limits of their outputs?
Sigmoid: from 0 to 1 -- at the end of the model this will be the best option for your presence/absence classification. Also good for models where several classes can be present at the same time.
Tanh: from -1 to 1
Relu: from 0 to limitless (it simply cuts negative values)
Softmax: from 0 to 1, but making sure the sum of all values is 1. Good at the end of models that want only 1 class among many classes.
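These ranges are easy to check with plain Python implementations of the four functions (a small sketch for illustration; in Keras you would just pass the activation name to the layer):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def softmax(values):
    # Subtract the maximum before exponentiating for numerical stability.
    shifted = [math.exp(v - max(values)) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]


logits = [-2.0, 0.0, 3.0]

print([round(sigmoid(v), 3) for v in logits])    # each value in (0, 1)
print([round(math.tanh(v), 3) for v in logits])  # each value in (-1, 1)
print([relu(v) for v in logits])                 # negatives become 0
print(sum(softmax(logits)))                      # sums to 1 (up to rounding)
```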
Oftentimes it is useful to preprocess the audio into a spectrogram.
Using this as input, you can use classical image classification approaches (like convolutional neural networks). In your case you could divide the input audio into frames of around 20 ms to 100 ms (depending on the time resolution you need) and convert those frames to spectrograms. Convolutional networks can also be combined with recurrent units to take a larger time context into account.
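A minimal sketch of the framing step, assuming a 16 kHz mono signal and 20 ms non-overlapping frames (both values are assumptions for illustration). The naive DFT is only there to keep the example dependency-free; in practice you would use an FFT routine from an audio or numerics library.

```python
import cmath
import math

sample_rate = 16000                         # assumed sampling rate in Hz
frame_ms = 20                               # assumed frame length in ms
frame_len = sample_rate * frame_ms // 1000  # 320 samples per frame

# Synthetic test signal: 0.1 s of a 1 kHz sine wave.
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(1600)]

# Split into non-overlapping frames.
frames = [signal[i:i + frame_len]
          for i in range(0, len(signal) - frame_len + 1, frame_len)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for the non-negative frequency bins."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


# One spectrum per frame; stacked over time they form a spectrogram.
spectrogram = [magnitude_spectrum(frame) for frame in frames]

peak_bin = max(range(len(spectrogram[0])), key=spectrogram[0].__getitem__)
print(len(frames), peak_bin * sample_rate / frame_len)  # 5 1000.0
```

The resulting 2D array (time frames by frequency bins) is what would be fed to the convolutional network like an image.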
It is also possible to train neural networks on raw waveforms using 1D Convolutions. However research has shown that preprocessing approaches using a frequency transformation achieve better results in general.

Resources