How do LSTMs utilize input data in practice? - keras

I am learning about neural networks and I am trying to implement and understand LSTMs and other recurrent NNs with Keras.
I have been trying to understand them by reading articles and books, in particular: this. But I am having trouble connecting the theory to real examples.
For example, I have time-series data which I have reformatted into a three-dimensional array of shape (12000, 60, 1), and the goal is to predict the next step. My understanding is that my number of timesteps is then 60.
How is this data, in particular the time-step, utilized by the LSTM structure?
My current idea is that, in reference to the diagram, the LSTM takes the first 60-step array and uses the first element as X_0; it then 'does what LSTM cells do', and the updated cell state is passed on to the next cell, where X_1 is input and the process repeats.
Once each of the 60 elements has passed through its cell, we have 60 hidden outputs (h0 to h59), which feed into an output node to predict the next step. The final cell state then becomes the first cell state of the next array, and the next array of 60 is run through in the same manner.
Is this correct? I am doubtful of my understanding, in particular as to whether the final cell state gets carried over to the next array.
If all of this is correct, what does the 50 in LSTM(50) indicate relative to my understanding?

Yes, your explanation is correct: the state is kept and updated across timesteps.
The first parameter of the LSTM layer is the number of neurons, or, more precisely, the dimensionality of the output and the hidden state. Remember that the hidden state is a vector, and the dimensions of the internal weight matrices that transform from input to hidden state, hidden state to hidden state (recurrent), and hidden state to output are determined by this parameter.
So, as in a Dense layer, LSTM(50) will have a 50-dimensional output vector, and the hidden state of the recurrent layer will also be 50-dimensional.
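For concreteness, here is a minimal Keras sketch matching the shapes you describe; the random arrays below are just placeholders for your actual series, not real data:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# placeholder data: 12000 windows of 60 timesteps, 1 feature each
X = np.random.rand(12000, 60, 1)
y = np.random.rand(12000, 1)          # next-step target for each window

model = Sequential([
    LSTM(50, input_shape=(60, 1)),    # 50-dimensional hidden state / output vector
    Dense(1)                          # single output node predicting the next step
])
model.compile(optimizer="adam", loss="mse")
model.summary()                       # LSTM output: (None, 50), Dense output: (None, 1)

model.summary() confirms the point above: the LSTM layer emits one 50-dimensional vector per input window, which the Dense layer maps to the single predicted value.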

Related

Standardize or subtract constant to data for regression

I am attempting to create a prediction model using multiple linear regression.
One of the predictor variables I want to use is a percentage, so it ranges from 0 to 100. I hypothesize that when it is below 50% there will be a negative effect on the target variable, and when it is above 50% a positive effect.
The mean of the predictor variable isn't exactly 50 in my data set, so I am unsure whether to centre or standardize this variable, or just subtract 50 from it to create the split I am looking for.
I am very new to statistics and teaching myself at the moment; any help is greatly appreciated.
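For illustration, here is a small sketch of the three transformations being weighed; the values in pct are made up, not from the data set described above:

import numpy as np

pct = np.array([12.0, 47.0, 53.0, 88.0])          # made-up percentage values
centered_at_mean = pct - pct.mean()               # centre at the sample mean
standardized = (pct - pct.mean()) / pct.std()     # z-scores: mean 0, sd 1
shifted_at_50 = pct - 50                          # negative below 50%, positive above

Note that subtracting a constant such as 50 only shifts the interpretation of the intercept and leaves the slope unchanged, whereas standardizing also rescales the coefficient.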

Why does k=1 in KNN give the best accuracy?

I am using Weka IBk for text classification. Each document is basically a short sentence. The training dataset contains 15,000 documents. While testing, I can see that k=1 gives the best accuracy. How can this be explained?
If you are querying your learner with the same dataset you trained on and k=1, the output values should be perfect, barring data points with identical features but different outcome values. Do some reading on overfitting as it applies to KNN learners.
When you query with the same dataset you trained with, each query comes in with some given feature values. Because that exact point exists in the training data, the learner will match it as the closest point and therefore output whatever Y value that training point had, which in this case is the same as the point you queried with.
The possibilities are:
The training data and the test data are the same data
The test data are highly similar to the training data
The boundaries between the classes are very clear
The optimal value of k depends on the data. In general, a larger value of k reduces the effect of noise on the classification, but makes the boundaries between classes more blurred.
If your outcome variable contains values of 0 or 1, make sure you are using as.factor (in R); otherwise the model might interpret the data as continuous.
Accuracy should generally be calculated on points that are not in the training dataset, i.e. unseen data points, because only the accuracy computed on unseen values tells you how the model performs on new data.
If you calculate accuracy on the training dataset with k=1, you get 100%, because those values have already been seen by the model and the decision boundary formed for k=1 passes through every training point. When you then calculate accuracy on unseen data, the model performs very badly: the training error is very low but the actual (test) error is very high. So it is better to choose an optimal k. To do that, plot the error against the value of k for the unseen (test) data and choose the k where the error is lowest.
To answer your question now,
1) You might have used the entire dataset as the training set and then chosen a subset of that same data as the test set.
(or)
2) You might have calculated accuracy on the training dataset.
If neither of these is the case, then please check the accuracy values for higher k; you will likely get better accuracy for k>1 on the unseen (test) data.
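Here is a small scikit-learn sketch of the train/test comparison described above; it uses synthetic data and KNeighborsClassifier as a stand-in for Weka's IBk, so the numbers are only illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic data standing in for the 15,000 documents
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 3, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)   # 1.0 for k=1: every point is its own nearest neighbour
    test_acc = knn.score(X_test, y_test)      # the accuracy that actually matters
    print(k, round(train_acc, 3), round(test_acc, 3))

Plotting the test error against k is the curve referred to above for picking an optimal k.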

2DPooling in Keras doesn't pool last column

When performing 2D pooling in Keras over an input with even dimensions, say 8x24x128, the output is appropriately 4x12x128 if 2x2 pooling is used. When the input has an odd dimension, say 8x25x128, the output is 4x12x128: the pooling does NOT operate on the last column (25) of the input. I would like to zero pad the input to 8x26x128 with an extraneous zero column. Is this possible?
In general terms: what is the proper etiquette for pooling over odd-dimensional inputs?
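One hedged sketch of two ways this is commonly handled in Keras, assuming a channels_last input of shape (8, 25, 128) as in the question: pad the odd axis explicitly with ZeroPadding2D, or pass padding='same' to the pooling layer.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ZeroPadding2D, MaxPooling2D

# Option 1: add one zero column on the right, then pool -> output (4, 13, 128)
model = Sequential([
    ZeroPadding2D(padding=((0, 0), (0, 1)), input_shape=(8, 25, 128)),
    MaxPooling2D(pool_size=(2, 2)),
])

# Option 2: let the pooling layer pad as needed -> also (4, 13, 128)
model2 = Sequential([
    MaxPooling2D(pool_size=(2, 2), padding="same", input_shape=(8, 25, 128)),
])

Both variants produce 4x13x128 instead of silently discarding the 25th column.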

Excel - Iteration based on changing cell value, pasting result

So I have set up a linear model in Excel of passenger numbers and population. There are two decay parameters which change the forecasts of passenger numbers for the different types of transport.
Manually, I can change the decay factors over 0.1 to 1.0 for every combination to see how the fit of the model changes. I would like to find the combination of parameters that gives the best model fit at 0.01 resolution. Any ideas how?
Essentially, the passenger forecasts change when the parameters are changed, which in turn changes the model fit. I need an easy way to see how the model fit changes as the parameters change. Thanks.
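In case it helps, here is a rough Python sketch of the brute-force grid search being described; model_fit is a hypothetical placeholder for however the spreadsheet scores the forecast against the observed passenger numbers (e.g. a sum of squared errors), so it would need to be replaced with the real calculation:

import numpy as np

def model_fit(decay_a, decay_b):
    # placeholder objective: substitute the spreadsheet's actual fit measure here
    return (decay_a - 0.37) ** 2 + (decay_b - 0.81) ** 2

best = None
for a in np.arange(0.10, 1.0001, 0.01):       # first decay parameter, step 0.01
    for b in np.arange(0.10, 1.0001, 0.01):   # second decay parameter, step 0.01
        score = model_fit(a, b)
        if best is None or score < best[0]:
            best = (score, round(a, 2), round(b, 2))

print("best fit and decay parameters:", best)

Within Excel itself, the same loop could be written in VBA, or the built-in Solver add-in could be pointed at the fit cell with the two parameter cells as the variables to change.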

Bootstrapping with Replacement

I'm reading a paper and am confused by the bootstrap method it describes. The text says:
the uncertainties associated with each stacked flux density are obtained via the bootstrap method, during which random subsamples (with replacement) of sources are chosen and re-stacked. The number of sources in each subsample is equal to the original number of sources in the stack. This process is repeated 10000 times in order to determine the representative spread in the properties of the population being stacked.
So, say I have 50 values. I find the average of these values. According to this method, I would get a subsample from this original population of 50, find that average, and repeat this 10,000 times. Now, how would I get a subsample "equal to the original number of sources in the stack" without my subsample BEING EXACTLY THE SAME AS THE ORIGINAL, AND THUS HAVING THE EXACT SAME MEAN, WHICH WOULD TELL US NOTHING!?
You can reuse values. So if I have ABCDE as my values, I can bootstrap with AABCD, etc. I can use values twice; that is the key.
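A minimal NumPy sketch of that procedure, using made-up values in place of the 50 measured flux densities:

import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=2.0, size=50)   # stand-in for the 50 original values

# resample with replacement, same size as the original, 10,000 times
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(10_000)
])

print("original mean:", values.mean())
print("bootstrap spread (std of resampled means):", boot_means.std())

Because each resample draws with replacement, some values appear more than once and others not at all, so the resampled means differ from one another, and their spread is the bootstrap estimate of the uncertainty.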
