Relationship between memory cell and time step in LSTM - keras

I'm studying the LSTM model.
Does one memory cell of the hidden layer in an LSTM correspond to one time step?
Example code: model.add(LSTM(128, input_shape=(4, 1)))
When implementing LSTMs in Keras, you can set the number of memory cells, as in the example code, regardless of the number of time steps. In the example it is 128.
But a typical LSTM diagram shows the memory cells corresponding 1:1 with the time steps. Which is correct?

In an LSTM, we supply the input in the following shape:
[samples, timesteps, features]
samples is the number of training examples you want to feed at a time.
timesteps is how many consecutive values from the sequence you want to use per sample.
Say you set timesteps=3.
Then the values at t, t-1 and t-2 are used to predict the value at t+1.
features is how many dimensions you supply at each time step.
An LSTM does have memory cells, but I am only explaining the code part so as not to confuse you.
I hope this helps
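To make the [samples, timesteps, features] shape concrete, here is a minimal pure-Python sketch of the windowing described above. The helper make_windows and the toy series are assumptions for illustration only, not part of Keras; in practice you would convert the result to a NumPy array before passing it to a model.

```python
def make_windows(series, timesteps):
    """Split a 1-D series into (X, y) pairs for one-step-ahead prediction.

    X has the nested shape [samples][timesteps][features] with features == 1.
    """
    X, y = [], []
    for start in range(len(series) - timesteps):
        window = series[start:start + timesteps]
        X.append([[value] for value in window])  # one feature per time step
        y.append(series[start + timesteps])      # the value right after the window
    return X, y


series = [10, 20, 30, 40, 50, 60]  # toy data
X, y = make_windows(series, timesteps=3)

print(len(X), len(X[0]), len(X[0][0]))  # 3 3 1 -> (samples, timesteps, features)
print(X[0], y[0])                       # [[10], [20], [30]] 40
```

With timesteps=3, each sample holds the values at t-2, t-1 and t, and the target is the value at t+1.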

As I understand it, the number of time steps is the length of the sequence processed at once (the window size). Depending on the parameter return_sequences=True/False, the layer returns either one output per time step or a single output for the whole window, as explained and shown here.
The explanation here seems to be better.
Concerning the memory cell: the definition here, "A part of a NN that preserves some state across time steps is called a memory cell.", makes me think of a memory cell as, probably, a "container" that holds temporary weights for each variable in the window series until they are updated during further backpropagation (when stateful=True).
It is better to see it once: there is a picture of the memory cell here, and the logic of how it works is described here.
For the usage of the whole input shape, see here: time_steps is used for backpropagation.
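Coming back to the original question, a small sketch may help. It assumes the standard LSTM layout (four gate blocks, each with an input kernel, a recurrent kernel and a bias); the function names are hypothetical helpers, not Keras API. It shows that the number of weights depends on the units argument and the number of features, but not on the number of time steps, and how return_sequences changes the output shape.

```python
def lstm_param_count(units, input_dim):
    """Trainable weights of one LSTM layer with biases.

    Four blocks (input, forget, output gates and the candidate), each with
    an input kernel (input_dim x units), a recurrent kernel (units x units)
    and a bias (units).
    """
    return 4 * (units * input_dim + units * units + units)


def lstm_output_shape(samples, timesteps, units, return_sequences=False):
    """Output shape of the layer for a batch of `samples` sequences."""
    if return_sequences:
        return (samples, timesteps, units)  # one hidden vector per time step
    return (samples, units)                 # only the last hidden vector


# LSTM(128, input_shape=(4, 1)) from the question: 4 time steps, 1 feature.
print(lstm_param_count(units=128, input_dim=1))  # 66560, for any number of time steps
print(lstm_output_shape(32, 4, 128))                         # (32, 128)
print(lstm_output_shape(32, 4, 128, return_sequences=True))  # (32, 4, 128)
```

The time-step count never enters lstm_param_count: the same 128-dimensional cell is applied once per time step, which is why diagrams that draw one block per time step are showing the same cell unrolled over time.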

Related

Neural network regression evaluation based on target range

I am currently fitting a neural network to predict a continuous target from 1 to 10. However, the samples are not evenly distributed over the entire data set: samples with target ranging from 1-3 are quite underrepresented (only account for around 5% of the data). However, they are of big interest, since the low range of the target is kind of the critical range.
Is there any way to know how my model predicts these low range samples in particular? I know that when doing multiclass classification I can examine the recall to get a taste of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
So far I have used typical metrics like MAE, MSE and RMSE and get satisfying results. I would, however, like to know how the model performs on the "critical" samples.
From my point of view, I would compute the test metrics (MSE, RMSE) for the whole test set, which covers the whole range of values (1-10). Then, of course, I would compute them separately for the specific range that you consider critical (let's say 1-3) and compare the two populations. You can even test whether the difference between the two populations is statistically significant (Wilcoxon tests etc.).
Maybe this link could be useful for your comparisons. Since you are doing regression, you can compare MSE and RMSE.
What you need to do is find identifiers for these critical samples. Often times row indices are used for this. Once you have predicted all of your samples, use those stored indices to find the critical samples in your predictions and run whatever automatic metric over those filtered samples. I hope this answers your question.
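A minimal sketch of that filtering step, using only the standard library. The toy targets and predictions are made up for illustration, and the critical range 1-3 is taken from the question.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def select(values, indices):
    """Keep only the entries at the given row indices."""
    return [values[i] for i in indices]


# Toy test set: the first three targets fall in the critical range.
y_true = [1.5, 2.0, 2.75, 5.0, 7.0, 9.0]
y_pred = [2.5, 3.0, 3.75, 5.0, 7.5, 8.5]

# Store the row indices of the critical samples, then filter both lists.
critical_idx = [i for i, t in enumerate(y_true) if 1.0 <= t <= 3.0]

mae_all = mae(y_true, y_pred)
mae_critical = mae(select(y_true, critical_idx), select(y_pred, critical_idx))
rmse_all = rmse(y_true, y_pred)
rmse_critical = rmse(select(y_true, critical_idx), select(y_pred, critical_idx))

print(f"MAE  all={mae_all:.3f}  critical={mae_critical:.3f}")
print(f"RMSE all={rmse_all:.3f}  critical={rmse_critical:.3f}")
```

In this made-up example the overall error looks acceptable while the error on the critical range is clearly larger, which is exactly the situation an aggregate metric can hide.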

What are the best normalization technique and LSTM structure for forecasting an output with jumps (outliers)?

I have a time series forecasting case with ten features (inputs) and only one output. I'm using 22 timesteps (history of features) for one-step-ahead prediction using an LSTM. Also, I apply MinMaxScaler for input normalization, but I don't normalize the output. The output contains some rare jumps (such as 20, 50, or more than 100), but the other values are between 0 and ~5 (all values are positive). In this case, it's important to forecast both normal and outlier outputs correctly, so I don't want to miss the jumps in my forecasting model. I think if I use MinMaxScaler for the output, most of the values will be near zero but the others (outliers) will be near one.
What is the best way to normalize the output? Should I leave it without normalization?
What is the best LSTM structure to handle this issue? (Currently, I'm using an LSTM with relu and a Dense layer with relu as the last layer, so the output will be a positive value.) I think I should select the activation functions correctly for this case.
I think first of all, you should decide on a metric to measure performance. For example, do you want to use MAE or MSE? Or some other metric you decide based on the task at hand. For example, you may tolerate greater error for the "rare jumps", but not for the normal cases, or vice versa. Once you are decided on the error metric, ideally, you should set that as the cost function that the LSTM network would be minimizing.
Now the goal would be to minimize the desired error metric you set. If this were a convex problem, the scaling of the output would not matter. But we know that this is not the case with complex deep learning architectures. What this means is that while minimizing the cost function with gradient descent, the optimizer might get stuck in a local minimum or converge very slowly. In this case, normalizing the output might help. How?
Assume that your output has a mean value of 5. With the last layer's parameters initialized around zero and a bias value of zero (i.e. the linear transformation before the relu), the network needs to learn that the bias should be around 5. Depending on the complexity of the network, this could take some epochs. However, if you normalize the data, or initialize the bias at 5, then your network starts with a good estimate of the bias and thus converges faster.
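As a rough sketch of the "make the output zero mean" idea, assuming you only center the target (no rescaling); the helper names and the toy values are hypothetical:

```python
def center_target(y):
    """Subtract the training mean so the target is zero mean."""
    mean = sum(y) / len(y)
    return [v - mean for v in y], mean


def restore_target(y_centered, mean):
    """Map predictions made in the centered space back to the original scale."""
    return [v + mean for v in y_centered]


# Toy target with one rare jump, similar in spirit to the question.
y_train = [0.5, 1.0, 2.0, 4.5, 50.0, 2.0]
y_centered, y_mean = center_target(y_train)
print(y_mean)  # 10.0

# A network whose last layer starts near zero predicts about 0 in the
# centered space, which already corresponds to the mean on the original scale.
centered_predictions = [0.0, 40.0]  # made-up model outputs
print(restore_target(centered_predictions, y_mean))  # [10.0, 50.0]
```

Compute the mean on the training data only and reuse it for validation and test predictions. The alternative mentioned above, starting the bias at the mean, could probably be done in Keras through the bias_initializer argument of the Dense layer, but I have not tested that here.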
Now back to your questions:
I would at least make the output zero mean and use Dense layer with linear output.
The architecture you have seems fine, you can try stacking 2-4 LSTM layers if you think your input has complex time dependencies.
Feel free to update the OP with the code and the performance you get and we can discuss what else can be improved.

Batch size for panel data for LSTM in Keras

I have repeated measurements on subjects, which I have structured as input to an LSTM model in Keras as follows:
batch_size = 1
model = Sequential()
model.add(LSTM(50, batch_input_shape=(batch_size, time_steps, features), return_sequences=True))
where time_steps is the number of measurements on each subject, and features is the number of available features on each measurement. Each row of the data is one subject.
My question is regarding the batch size with this type of data.
Should I only use a batch size of 1, or can the batch contain more than one subject?
Related to that, would I benefit from setting stateful to True? Meaning that learning from one batch would inform the other batches too. Please also correct me if my understanding of this is wrong.
Great question! Using a batch size greater than 1 is possible with this sort of data and setup, provided that your rows are individual experiments on subjects and that your observations for each subject are ordered sequentially through time (e.g. Monday comes before Tuesday). Make sure that your observations are not split randomly between train and test and that they are ordered sequentially by subject in each, and you can apply batch processing. Because of this, set shuffle to False if using Keras, as Keras shuffles the training samples by default.
In regards to setting stateful to True: with a stateful model, all the states are propagated to the next batch. This means that the state of the sample located at index i, X_i, will be used in the computation of the sample X_{i+bs} in the next batch, where bs is the batch size. In the case of time series, this generally makes sense. If you believe that a subject measurement S_i influences the state of the next subject measurement S_{i+1}, then try setting stateful to True. It may be worth trying stateful set to False as well, to better understand whether a previous observation in time influences the following observation for a particular subject.
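To see which rows end up linked by the carried-over state, here is a small index-only sketch. It assumes the rows are fed in order without shuffling; the helper names and subject labels are made up for illustration.

```python
def split_into_batches(rows, batch_size):
    """Cut the ordered rows into consecutive batches."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def state_links(num_rows, batch_size):
    """Pairs (i, i + batch_size): with stateful=True, the final state of
    row i is the initial state of row i + batch_size in the next batch."""
    return [(i, i + batch_size) for i in range(num_rows - batch_size)]


subjects = ["A", "B", "C", "D", "E", "F"]  # one row per subject
batch_size = 2

print(split_into_batches(subjects, batch_size))
# [['A', 'B'], ['C', 'D'], ['E', 'F']]

for src, dst in state_links(len(subjects), batch_size):
    print(subjects[src], "->", subjects[dst])
# A -> C
# B -> D
# C -> E
# D -> F
```

If each row is a complete, independent subject, these links connect unrelated subjects, which is a reason to keep stateful set to False for that layout. The links only make sense when row i + bs really is the continuation of row i.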
Hope this helps!

Flow of states in LSTM

I read that the internal state of LSTMs flows as follows:
it is always passed within a batch, so from the last timestamp of the i-th sample to the first of the i+1st
if the LSTM is stateful then the state is passed between batches, so the memory at the last timestamp of batch_k[i] is passed to the first timestamp of batch_{k+1}[i], for all indices i.
For me, this raises several questions. (Please correct me if my understanding is wrong)
Does this mean that the first timestamp of the (i+1)st sample needs to be the successor of the last timestamp of sample i? (for all i)
Along the same lines, does the first timestamp of the i-th sample in batch k+1 have to be the successor of the last timestamp of the i-th sample in batch k?
If the first two conclusions are correct, then for stateful LSTMs we can NEVER shuffle anything and for the non-stateful ones we can at most shuffle the batches, but not the samples within batches, correct?
Why do we split the batch in samples of more than one timestep, anyway? If the above is correct, then the procedure 'within a sample' is the same as 'within a batch', so we might as well use samples of one timestep each.
Question 1
Not true. Sample s is not related to sample s+1 in the same batch. They're independent.
Instead, sample s of batch b+1 needs to be the successor of sample s of batch b.
The samples will be processed in parallel, and batches must keep the same order. (That's why the documentation says you need shuffle=False when training stateful=True layers).
Question 2
This is true :)
Question 3
Partially correct. With stateful=True we cannot shuffle the batches (if there is going to be more than one batch).
But with stateful=False this really doesn't matter, because none of the samples will be related to each other. (Each sample in the batch is completely independent)
Question 4
Since "samples" in a batch are independent of each other, there is one main reason to have many samples in a batch:
You have many independent sequences instead of just one sequence
But you may want to divide each sequence into many batches along the "length/timesteps" dimension. You would do this if:
Your sequences are way too long to fit your memory, so you load them partially and process them partially
Your model is predicting the future indefinitely, and you need it to predict the step t(n+1) to pass it as an input before it can produce the step t(n+2).
So, you can indeed use samples of one timestep each in stateful=True layers.
These answers may also help you:
https://stackoverflow.com/a/46331227/2097240
https://stackoverflow.com/a/47719094/2097240
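The layout described under Question 1 and Question 4 can be sketched in plain Python. This is only an illustration under the assumption that all sequences have the same length and that the length is a multiple of the chunk length; make_stateful_batches is a hypothetical helper, not a Keras function.

```python
def make_stateful_batches(sequences, chunk_len):
    """Cut every sequence into consecutive chunks of chunk_len time steps.

    Batch b holds chunk b of each sequence, so sample s of batch b+1 is
    the direct successor of sample s of batch b.
    """
    total_len = len(sequences[0])
    assert all(len(seq) == total_len for seq in sequences), "equal lengths assumed"
    assert total_len % chunk_len == 0, "length must be a multiple of chunk_len"
    batches = []
    for start in range(0, total_len, chunk_len):
        batches.append([seq[start:start + chunk_len] for seq in sequences])
    return batches


# Two independent toy sequences of six time steps each.
sequences = [list(range(0, 6)), list(range(100, 106))]

for b, batch in enumerate(make_stateful_batches(sequences, chunk_len=2)):
    print(b, batch)
# 0 [[0, 1], [100, 101]]
# 1 [[2, 3], [102, 103]]
# 2 [[4, 5], [104, 105]]
```

Within a batch the two samples are unrelated; across batches, position 0 always continues the first sequence and position 1 the second. The batches therefore have to be fed in this order (no shuffling), and chunk_len=1 gives the "one timestep per sample" case mentioned above.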

Keras LSTM: first argument

In Keras, if you want to add an LSTM layer with 10 units, you use model.add(LSTM(10)). I've heard that number 10 referred to as the number of hidden units here and as the number of output units (line 863 of the Keras code here).
My question is, are those two things the same? Is the dimensionality of the output the same as the number of hidden units? I've read a few tutorials (like this one and this one), but none of them state this explicitly.
The answers seem to refer to multi-layer perceptrons (MLP), in which the hidden layer can be of a different size and often is. For LSTMs, the hidden dimension is the same as the output dimension by construction:
h is the output for a given time step, and the cell state c is tied to the hidden size because of the element-wise multiplications. The addition of terms to compute the gates requires that both the input kernel W and the recurrent kernel U map to the same dimension. This is certainly the case for the Keras LSTM as well, and it is why you only provide a single units argument.
To get a good intuition for why this makes sense, remember that the LSTM's job is to encode a sequence into a vector (maybe a gross oversimplification, but it's all we need). The size of that vector is specified by hidden_units, and the output is:
  seq vector            RNN weights
(1 x input_dim)  *  (input_dim x hidden_units),
which has shape 1 x hidden_units (a row vector representing the encoding of your input sequence). Thus, the names in this case are used synonymously.
Of course, RNNs require more than one multiplication, and Keras implements RNNs as a sequence of matrix-matrix multiplications instead of the vector-matrix multiplication shown above.
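To check the shapes, here is a toy LSTM step written with plain Python lists. It is a sketch of the textbook LSTM equations under my own assumptions (random weights, zero biases, hypothetical helper names), not the Keras implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_lstm(input_dim, units, seed=0):
    """Random weights for the four blocks: input kernel W (units x input_dim),
    recurrent kernel U (units x units) and bias b (units)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {name: (mat(units, input_dim), mat(units, units), [0.0] * units)
            for name in ("i", "f", "o", "g")}


def lstm_step(params, x, h_prev, c_prev):
    """One time step: returns the new hidden state h and cell state c."""
    pre = {}
    for name, (W, U, b) in params.items():
        # W x and U h must have the same length (units) to be added.
        pre[name] = [wx + uh + bias
                     for wx, uh, bias in zip(matvec(W, x), matvec(U, h_prev), b)]
    i = [sigmoid(v) for v in pre["i"]]    # input gate
    f = [sigmoid(v) for v in pre["f"]]    # forget gate
    o = [sigmoid(v) for v in pre["o"]]    # output gate
    g = [math.tanh(v) for v in pre["g"]]  # candidate cell values
    c = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c


units, input_dim, timesteps = 10, 3, 5
params = init_lstm(input_dim, units)
h, c = [0.0] * units, [0.0] * units
sequence = [[0.1 * t] * input_dim for t in range(timesteps)]
for x in sequence:  # the same weights are reused at every time step
    h, c = lstm_step(params, x, h, c)

print(len(h), len(c))  # 10 10
```

Both h and c have length units regardless of input_dim or the number of time steps, which is why a single units argument is enough.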
The number of hidden units is not the same as the number of output units.
The number 10 controls the dimension of the output hidden state (the source code for the LSTM constructor method can be found here; 10 specifies the units argument). In one of the tutorials you have linked to (colah's blog), the units argument would control the dimension of the vectors h_{t-1}, h_t, and h_{t+1}: RNN image.
If you want to control the number of LSTM blocks in your network, you need to specify this through the input to the LSTM layer. The input shape to the layer is (nb_samples, timesteps, input_dim) (see the Keras documentation). timesteps controls how many LSTM blocks your network contains. Referring to the tutorial on colah's blog again, in the RNN image, timesteps would control how many green blocks the network contains.

Resources