Keras SimpleRNN input shape and masking - nlp

Newbie to Keras alert!!!
I've got some questions related to Recurrent Layers in Keras (over theano)
How is the input supposed to be formatted with regard to timesteps? Say, for instance, I want a layer that will look at 3 timesteps: 1 in the past, 1 current and 1 in the future. I see some answers and the API proposing padding and using the Embedding layer, or shaping the input using a time window (3 in this case), and in any case I can't make heads or tails of the API, and SimpleRNN examples are scarce and don't seem to agree.
How would the input time window formatting work with a masking layer?
Some related answers propose performing the masking with an embedding layer. What does masking have to do with embedding layers anyway? Aren't embedding layers basically 1-hot word embeddings? (My application would use phonemes or characters as input.)

I can start an answer, but this question is very broad, so I would appreciate suggestions for improving my answer.
Keras SimpleRNN expects an input of size (num_training_examples, num_timesteps, num_features).
For example, suppose I have sequences of hourly counts of cars driving by an intersection (a small example just to illustrate):
import numpy as np
X = np.array([[10, 14, 2, 5], [12, 15, 1, 4], [13, 10, 0, 0]])
Aside: Notice that I took observations over four hours, and in the third sequence the last two hours had no cars driving by. That's an example of zero-padding the input, which means making all of the sequences the same length by appending 0s to the end of shorter sequences to match the length of the longest one.
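For reference, Keras can do this kind of padding for you with pad_sequences; a minimal sketch (the ragged third sequence here is made up just to show the effect):
from keras.preprocessing.sequence import pad_sequences

sequences = [[10, 14, 2, 5], [12, 15, 1, 4], [13, 10]]           # last sequence is shorter
X = pad_sequences(sequences, maxlen=4, padding='post', value=0)  # pad with 0s at the end
# X == [[10 14  2  5]
#       [12 15  1  4]
#       [13 10  0  0]]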
Keras would expect the following input shape: (X.shape[0], X.shape[1], 1), which means I could do this:
X_train = np.reshape(X, (X.shape[0], X.shape[1], 1))
And then I could feed that into the RNN:
from keras.models import Sequential
from keras.layers import SimpleRNN
model = Sequential()
model.add(SimpleRNN(units=10, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2])))
You'd add more layers, or add regularization, etc., depending on the nature of your task.
For your specific application, I would think you would need to reshape your input so that each sample has 3 timesteps (previous, current, next).
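A rough sketch of that windowing, assuming a 1-D series x and that every window holds the previous, current and next value (the names here are just for illustration):
import numpy as np

x = np.arange(10, dtype=np.float32)                       # toy series
window = 3
# row i holds [x[i], x[i+1], x[i+2]] -> (past, current, future) around element i+1
X_windows = np.stack([x[i:i + window] for i in range(len(x) - window + 1)])
X_windows = X_windows[..., np.newaxis]                    # -> (num_windows, 3, 1) for SimpleRNN
print(X_windows.shape)                                    # (8, 3, 1)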
I don't know much about the masking layers, but here is a good place to start.
As far as I know, embeddings are independent of masking, but you can mask an embedding.
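A minimal sketch of both options, assuming 0 is the padding value (layer sizes are placeholders):
from keras.models import Sequential
from keras.layers import Masking, Embedding, SimpleRNN

# Option 1: mask padded timesteps in a plain float input
model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(4, 1)))    # timesteps that are all 0 get skipped
model.add(SimpleRNN(units=10))

# Option 2: let an Embedding layer produce the mask (index 0 reserved for padding)
model2 = Sequential()
model2.add(Embedding(input_dim=1000, output_dim=8, mask_zero=True, input_length=4))
model2.add(SimpleRNN(units=10))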
Hope that provides a good starting point!

Related

Multivariate time series prediction - Conv1D and loss function - pytorch

I have a couple of questions.
I have data of the following shape:
(32, 64, 11)
where 32 is the batch size, 64 is the sequence length and 11 is the number of features. Each sample of mine is 64x11 and has a label of 0 or 1.
I’d like to predict when a sequence has a label of “1”.
I’m trying to use a simple architecture with
conv1D → ReLU → flatten → linear → sigmoid.
For the Conv1D, since this is multivariate time series prediction and each row in my data is a second, I think the number of in channels should be the number of features, so that all of the features are processed concurrently. (There is nothing spatial in my data; it doesn't matter whether a column is at index 0 or 9, the way it matters for pixels in an image.)
I can't decide how to "initialize" the Conv1D parameters. Currently I think the number of in channels should be the number of features and not 1, for the reason I just explained, but I am unsure of it.
Secondly, should the loss function be BCELoss or something else, given that my labels are 0 or 1 and I want the model to output the probability of belonging to the class with label 1?
A lot of thanks.
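A minimal sketch of the architecture described above (treating the 11 features as Conv1d input channels; the filter count and kernel size are placeholders, and BCEWithLogitsLoss stands in for sigmoid + BCELoss for numerical stability):
import torch
import torch.nn as nn

class SeqClassifier(nn.Module):
    def __init__(self, n_features=11, seq_len=64, n_filters=32, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=n_features, out_channels=n_filters,
                              kernel_size=kernel_size)
        self.relu = nn.ReLU()
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(n_filters * (seq_len - kernel_size + 1), 1)

    def forward(self, x):
        # x arrives as (batch, seq_len, n_features); Conv1d wants (batch, channels, seq_len)
        x = x.permute(0, 2, 1)
        x = self.relu(self.conv(x))
        x = self.flatten(x)
        return self.fc(x)                      # raw logits, one per sample

model = SeqClassifier()
criterion = nn.BCEWithLogitsLoss()             # or nn.BCELoss() after an explicit sigmoid
x = torch.randn(32, 64, 11)                    # dummy batch
labels = torch.randint(0, 2, (32,)).float()
loss = criterion(model(x).squeeze(1), labels)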

Doesn't keras.layers.Flatten lose information?

Brand new to keras and ML in general. I'm looking at https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/, and it uses Flatten between Embedding and Dense because Embedding produces a 2D vector but Dense requires a single dimension.
I'm sure I'm missing something obvious here, but why doesn't this lose which words are in which input vectors? How are we able to still know that input #3 was "nice work" and is associated with label #3, 1, for "positive"?
I guess the original dimensions are retained from the original input and then somehow restored for Dense's output? Or am I just totally missing a major conceptual aspect?
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Thanks for any guidance!
The Embedding layer gives you a vector for each word token, so the per-sample output is 2-D. We need to flatten it before a plain Dense classifier block.
There is some information lost. For example, when we use convolutional layers and then flatten the feature maps, the spatial information is lost; but by then the Conv layers have already extracted the most important features, and we feed those features to the fully connected layers.
In your example, the temporal dimension is no longer maintained; usually it's desirable to pass the output of the embedding matrix to an RNN/Conv layer for further feature extraction.
Flatten is only applied to the non-batch dimensions, meaning the examples are still kept separate (if that is what you mean).
For each sample, say "nice work", we get 2 vectors (1 for "nice", 1 for "work"). Since we only want the overall sentiment of the sentence, once we have extracted the features we can apply Flatten.
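To make the shapes concrete, a small sketch (using placeholder values for vocab_size and max_length, and the older Keras API from the question):
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size, max_length = 50, 4                 # placeholder values
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
print(model.output_shape)                      # (None, 4, 8): one 8-d vector per word
model.add(Flatten())
print(model.output_shape)                      # (None, 32): words concatenated, samples still separate
model.add(Dense(1, activation='sigmoid'))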

CNN-LSTM Image Classification

Is it possible to reshape a 512x512 RGB image to (timestep, dim)? In other words, I am trying to convert this reshape layer, Reshape((23, 3887)), to work with 512 instead of 299. Also, is there any documentation explaining how to determine input_dim and timestep for Keras?
It seems like your problem is similar to one that I had earlier today. Look at it here: Keras functional API: Combine CNN model with a RNN to look at sequences of images
Now, to add to the answer from the question I linked to: let number_of_images be n. In your case the original data format would be (n, 512, 512, 3). All you then need to do is decide how many images you want per sequence. Say you want sequences of 5 images and have 5000 images in total. Then reshaping to (1000, 5, 512, 512, 3) should do it. This way the model sees 1000 sequences of 5 images.
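A minimal sketch of that reshape (with a small n and random data standing in for the real images; the same pattern applies for n = 5000):
import numpy as np

n, seq_len = 20, 5
images = np.random.rand(n, 512, 512, 3).astype(np.float32)      # (n, 512, 512, 3)
sequences = images.reshape(n // seq_len, seq_len, 512, 512, 3)  # group consecutive images
print(sequences.shape)                                          # (4, 5, 512, 512, 3)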

Understanding Keras LSTM input shape

I am learning to use the Keras LSTM model. I have looked at this tutorial, this tutorial and this tutorial, and am feeling unsure about my understanding of the LSTM model's input shape. My question is: if one shapes one's data like the first tutorial, (8760, 1, 8), and the data is fed to the network 1 timestep at a time, i.e. input_shape=(1, 8), does the network learn the temporal dependencies between samples?
It only makes sense to have batches of 1 timestep when you're using stateful=True. Otherwise there is no temporal dependency, as you presumed.
The difference is:
stateful=False, input_shape=(1,any):
first batch of shape (N, 1, any): contains N different sequences of length 1
second batch: contains another N different sequences of length 1
total of the two batches: 2N sequences of length 1
more batches: more independent sequences
yes, there is no point in using steps=1 when stateful=False
stateful=True, input_shape=(1,any):
first batch of shape (N, 1, any): contains the first step of N different sequences
second batch: contains the second step of the same N sequences
total of the two batches: N sequences of length 2
more batches = more steps of the same sequences, until you call model.reset_states()
Usually, it's more complicated to handle stateful=True layers, and if you can put entire sequences in a batch, like input_shape=(allSteps, any), there is no reason to turn stateful on.
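A rough sketch of the stateful pattern in the older Keras API (layer sizes and the random data are placeholders; with stateful=True Keras also needs a fixed batch size):
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

N, features, steps = 32, 8, 20                      # placeholder sizes
model = Sequential()
model.add(LSTM(16, stateful=True, batch_input_shape=(N, 1, features)))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')

# toy data: step t of all N sequences lives in X_steps[t], shape (N, 1, features)
X_steps = np.random.rand(steps, N, 1, features).astype(np.float32)
Y_steps = np.random.rand(steps, N, 1).astype(np.float32)

for epoch in range(5):
    for x_step, y_step in zip(X_steps, Y_steps):    # feed the steps in order, no shuffling
        model.train_on_batch(x_step, y_step)
    model.reset_states()                            # sequences end here; clear the state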
If you want a detailed explanation of RNNs on Keras, see this answer

Quantifying Text Keywords for Neural Network Analysis

I am working on a small research project. I am looking to write a program that
a) Takes a large number of short texts (~100 words / several thousand texts)
b) Identifies keywords in the texts
c) Presents all of them to a group of users who indicate if they found them interesting or not
d) Has the software learn which keywords or combinations are likely to be preferable. Let's assume that the target group is uniform for this example.
Now, there are two main challenges. The first one I have an answer to, the second one I am looking for help with.
1) Keyword identification.
Reverse frequency analysis seems to be the way to go here: identify those words that occur proportionally often in a given text when compared to all others. This has some drawbacks, though; for example, very common keywords may be overlooked.
2) How to prepare the data-set to be numeric. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
How would this problem commonly be addressed?
This is a way to start (a small sketch follows this list):
clean your input text (remove special tokens etc.)
use n-grams as features (you can just start with 1-grams).
treat the user's feedback "preferable or not" as a binary label.
learn a binary classifier (whatever model is fine: naive Bayes, logistic regression).
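A minimal sketch of that pipeline with sklearn (the texts and labels are made-up placeholders):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great keynote on robotics", "quarterly budget meeting"]   # placeholder data
labels = [1, 0]                                                      # 1 = found interesting

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 1), lowercase=True),   # 1-grams; use (1, 2) to add bigrams
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["robotics budget review"]))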
1) Keyword identification. Reverse frequency analysis seems to be the way to go here. Identify those words that occur proportionally often in a given text when compared to all others. This has some drawbacks though as for example very common keywords may be overlooked.
You can skip this part in the first model you build. Treat the sentence as a bag of words (n-grams) to simplify the first working model. If you want, you can add this as a feature weight later.
2) How to prepare the data-set to be numeric. I could map keywords to input neurons and then adjust the value based on their relative frequency, but that limits the model and makes it hard to add new keywords. It also quickly becomes computationally expensive if we want to scale beyond a few dozen keywords.
You can just use a dictionary mapping n-grams to integer ids. For each training example the features are sparse, so your training examples look like the ones below:
34, 68, 79293, 23232 -> 0 (negative label)
340, 608, 3, 232 -> 1 (positive label)
Imagine you have a dictionary (or vocabulary) mapping:
3: foo
34: movie
68: in-stock
232: bar
340: barz
...
To use neural networks, you will need an embedding layer to turn the sparse features into dense features by aggregating (for instance, averaging) the embedding vectors of all features.
Using the same example as above, suppose we use 4-dimensional embeddings:
34 -> [0.1, 0.2, -0.3, 0]
68 -> [0, 0.1, -0.1, 0.2]
79293 -> [0.3, 0.0, 0.12, 0]
23232 -> [0.4, 0.0, 0.0, 0]
------------------------------- sum
sum -> [0.8, 0.3, -0.28, 0.2]
------------------------------- L1-normalize
l1 -> [0.8, 0.3, -0.28, 0.2] ./ (0.8 + 0.3 + 0.28 + 0.2)
-> [0.51,0.19,-0.18,0.13]
At prediction time, you will need to use the same dictionary and the same feature-extraction steps (cleanup / n-gram generation / mapping n-grams to ids) so that your model understands the input.
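A small sketch of that embedding-averaging idea in Keras (the vocabulary size, embedding dimension and sequence length are placeholders, and mean pooling stands in for the L1-normalized sum above):
from keras.models import Sequential
from keras.layers import Embedding, GlobalAveragePooling1D, Dense

vocab_size, embed_dim, max_ngrams = 100000, 4, 50    # placeholders
model = Sequential()
model.add(Embedding(vocab_size, embed_dim, input_length=max_ngrams))  # n-gram ids -> dense vectors
model.add(GlobalAveragePooling1D())                                   # aggregate over the n-grams
model.add(Dense(1, activation='sigmoid'))                             # interesting or not
model.compile(loss='binary_crossentropy', optimizer='adam')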
You can simply use sklearn to learn a TF-IDF bag-of-words model of your texts, which returns a sparse matrix of shape n_samples x n_features, like this:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(smooth_idf=False)   # works directly on raw texts
X_train = vectorizer.fit_transform(list_of_texts)
print(X_train.shape)
X_train is a scipy CSR sparse matrix. If your NN implementation doesn't support sparse matrices you can convert it to a dense numpy matrix, but that might fill your RAM; it's better to use an implementation that supports sparse input (e.g. I know Lasagne/Theano does).
After training, you can use the parameters of the NN to find out which features have a high/low weight and so are more/less important for the particular label.
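As a sketch of that inspection with a linear model standing in for the network (y_train is a hypothetical array of 0/1 interest labels; in recent sklearn versions get_feature_names_out maps columns back to n-grams):
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train, y_train)     # y_train: hypothetical 0/1 labels
weights = clf.coef_.ravel()
terms = vectorizer.get_feature_names_out()
top = np.argsort(weights)[-10:]                      # the 10 most "interesting" n-grams
print(list(zip(terms[top], weights[top])))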
