When performing 2D pooling in Keras over an input with even dimensions, say 8x24x128, the output is appropriately 4x12x128 if 2x2 pooling is used. When the input has an odd dimension, say 8x25x128, the output is still 4x12x128: the pooling does NOT operate on the last column (the 25th) of the input. I would like to zero-pad the input to 8x26x128 with an extraneous zero column. Is this possible?
In general terms: what is the proper etiquette for pooling over odd-dimensional inputs?
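For what it's worth, Keras can do this either with a `ZeroPadding2D(((0, 0), (0, 1)))` layer inserted before the pooling layer, or by passing `padding='same'` to `MaxPooling2D`, which zero-pads so no columns are dropped. A minimal NumPy sketch of the effect (the layer names above are real Keras APIs; the functions below are illustrative stand-ins, not the Keras implementation):

```python
import numpy as np

def pad_to_even_width(x):
    """Append a zero column if the width is odd
    (mimics ZeroPadding2D(((0, 0), (0, 1))))."""
    h, w, c = x.shape
    if w % 2 == 1:
        x = np.concatenate([x, np.zeros((h, 1, c), dtype=x.dtype)], axis=1)
    return x

def max_pool_2x2(x):
    """2x2, stride-2 max pooling over the first two axes."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

x = np.random.rand(8, 25, 128)
y = max_pool_2x2(pad_to_even_width(x))
print(y.shape)  # (4, 13, 128)
```

Note the padded output is 4x13x128 rather than 4x12x128: the extra column's pooled values come from max-ing the real 25th column against zeros.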
I am trying to calculate the emotion of paragraphs by measuring the emotion of their individual sentences. But this results in vectors of varying length, as a paragraph might be as short as 1 sentence or as long as 30 sentences. How would you suggest converting these vectors to scalars?
The first option is taking the average, but this biases the results: it turns out shorter paragraphs have a higher score and longer ones a score around the mean.
The second option is summing up the values, but this biases the results again, as longer paragraphs will have bigger scores.
The third option is using a method used in VADER, which is summing up and then normalizing, but I could not find a reliable resource that explains how the results are normalized. The only thing I found is the following formula from VADER code:
norm_score = score / math.sqrt((score * score) + alpha)
VADER sets alpha to 15, but how should this number be changed, and based on what? Also, where does this normalization method come from?
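The formula quoted above squashes any real-valued score into (-1, 1); alpha only controls how quickly the result approaches those asymptotes. A small sketch of its behaviour (alpha=15 is the value from the VADER source; the function name is mine):

```python
import math

def vader_normalize(score, alpha=15):
    """Squash an unbounded sum of valence scores into (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Zero stays zero; the sign is preserved.
print(vader_normalize(0))    # 0.0
# The larger alpha is, the more a given raw score is pulled toward 0.
print(vader_normalize(4))
print(vader_normalize(4, alpha=50))
```

So raising alpha would be a way of saying "a raw sum of 4 is less extreme than VADER assumes", but as the question notes, the original choice of 15 is not documented.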
While applying min-max scaling to normalize your features, do you apply it to the entire dataset before splitting it into training, validation, and test data?
Or do you split first and then apply min-max scaling to each set, using the min and max values from that specific set?
Lastly, when making a prediction on a new input, should the features of that input be normalized using the min and max values from the training data before being fed into the network?
Split it, then scale. Imagine it this way: you have no idea what real-world data looks like, so you couldn't scale the training data to it. Your test data is the surrogate for real-world data, so you should treat it the same way.
To reiterate: Split, scale your training data, then use the scaling from your training data on the testing data.
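A minimal NumPy sketch of that recipe (scikit-learn's `MinMaxScaler`, fitted on the training split only, does the same thing; the array sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_train, X_test = X[:80], X[80:]

# Fit the scaling parameters on the training split ONLY.
mn, mx = X_train.min(axis=0), X_train.max(axis=0)

def minmax_scale(x):
    return (x - mn) / (mx - mn)

X_train_s = minmax_scale(X_train)  # exactly in [0, 1] per feature
X_test_s = minmax_scale(X_test)    # may stray slightly outside [0, 1]; that is expected
```

A new input at prediction time goes through the same `minmax_scale`, i.e. it is normalized with the training-set min and max, exactly as the last part of the question suggests.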
I am learning about neural networks and I am trying to implement and understand LSTMs and other recurrent NNs with Keras.
I have been trying to understand them by reading articles and books, in particular: this. But I am having trouble connecting the theory to real examples.
For example I have time-series data which I have reformatted into a three dimensional array. My array has size (12000,60,1) and the goal is to predict the next step. My understanding is that my time-step is then 60.
How is this data, in particular the time-step, utilized by the LSTM structure?
My current idea is that, in reference to the diagram, the LSTM takes the first 60-step array and uses the first element as X_0, it then 'does what LSTM cells do' and the updated cell state is passed onto the next cell where X_1 is inputted and the process is repeated.
Now when each of the 60 elements has passed through each of their cells we then have 60 nodes (h0 to h59) which then feed into an output node to predict the next step. The final cell state is then the first cell state of the next array and the next array of 60 is run through in the same manner.
Is this correct? I am doubtful of my understanding, in particular as to whether the final cell state gets carried over to the next array.
If all of this is correct, what does the 50 in LSTM(50) indicate relative to my understanding?
Yes, your explanation is correct, the state is kept and updated across timesteps.
The first parameter of the LSTM layer is the number of neurons, or better said, the dimensionality of the output and the hidden state. Remember the hidden state is a vector, and the dimensions of the internal weight matrices that transform from input to hidden state, hidden to hidden state (recurrent), and hidden state to output are determined by this parameter.
So, as in a Dense layer, an LSTM(50) will have a 50-dimensional output vector, and additionally the hidden state of the recurrent layer will also be 50-dimensional.
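One way to check this reading of LSTM(50) is the parameter count: each of the four gates owns an input weight matrix, a recurrent weight matrix, and a bias, all sized by the layer's units. A quick sketch (the formula matches what Keras' `model.summary()` reports for an LSTM layer with default settings):

```python
def lstm_param_count(input_dim, units):
    # Four gates (input, forget, cell, output), each with:
    #   input weights:     input_dim x units
    #   recurrent weights: units x units
    #   bias:              units
    return 4 * (input_dim * units + units * units + units)

# For the (12000, 60, 1) data above: one feature per timestep, 50 units.
print(lstm_param_count(1, 50))  # 10400
```

Note the count does not depend on the 60 timesteps at all: the same weights are reused at every step, which is exactly the unrolling described in the question.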
I am new to language processing, and sorry if this looks like a very basic question.
Given a training sequence, for example: "aaabbcddecbbaaaaabbabbbabccddbbcdaaaaaa" (the real sequence is much longer), I can use recurrent neural networks such as LSTMs to learn patterns and dependencies in the sequence and generate the next characters (a single character or several characters). For example, feeding a sample sequence "aaabb" will generate "c". It is worth mentioning that my alphabet contains only 6 ordered characters {a,b,c,d,e,f}.
My question is: how to compute probability of a particular combination of next characters? For example, given a sequence "aabcdcbbaa" what will be the probability of obtaining "cc" ?
Many thanks in advance!
UPD
While writing the question, I realised that the probability of a combination of the next characters might be computed as a "tensor product" of single-character probabilities. What I mean is: given a test sample, the LSTM outputs a vector (through the softmax function) with probabilities of each character, and then these probabilities are converted into a single character (the most probable outcome). For example: the sequence "aabcdcbbaa" will generate a 6-dim vector p1 = (0.1, 0.07, 0.23, 0.15, 0.31, 0.14) which corresponds to the characters (a, b, c, d, e, f). Then by using each of these characters we can compute the probabilities of the next (the second) character p2. Then by multiplying these two probability vectors p1Xp2 we can compute the joint probability of obtaining two characters: aa, ab, ac, ad, ...
Am I correct?
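The standard version of this computation is the chain rule, where the second distribution is conditioned on the context with the first character appended: P("cc" | ctx) = P("c" | ctx) * P("c" | ctx + "c"). A sketch with a hypothetical stand-in for the trained network (`next_char_probs` is a toy distribution, not an LSTM; a real model would be queried the same way):

```python
ALPHABET = "abcdef"

def next_char_probs(context):
    # Toy stand-in: favour the character after the last one seen.
    # Purely illustrative; a real LSTM's softmax output goes here.
    base = [1.0] * len(ALPHABET)
    if context:
        base[(ALPHABET.index(context[-1]) + 1) % len(ALPHABET)] += 5.0
    total = sum(base)
    return {ch: p / total for ch, p in zip(ALPHABET, base)}

def sequence_prob(context, continuation):
    """P(continuation | context) via the chain rule:
    P(c1 | ctx) * P(c2 | ctx + c1) * ..."""
    prob = 1.0
    for ch in continuation:
        prob *= next_char_probs(context)[ch]
        context += ch
    return prob

print(sequence_prob("aabcdcbbaa", "cc"))
```

The outer-product idea in the update gives the joint probability of *all* two-character continuations at once, but each entry should use the second-step distribution conditioned on that particular first character, which is what the loop above does one continuation at a time.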
This paper contains confusion matrices for spelling errors in a noisy channel. It describes how to correct the errors based on conditional probabilities.
The conditional probability computation is on page 2, left column. In footnote 4, page 2, left column, the authors say: "The chars matrices can be easily replicated, and are therefore omitted from the appendix." I cannot figure out how can they be replicated!
How can they be replicated? Do I need the original corpus? Or did the authors mean they could be recomputed from the material in the paper itself?
Looking at the paper, you just need to calculate them using a corpus, either the same one or one relevant to your application.
In replicating the matrices, note that they implicitly define two different chars matrices: a vector and an n-by-n matrix. For each character x, the vector chars contains a count of the number of times the character x occurred in the corpus. For each character sequence xy, the matrix chars contains a count of the number of times that sequence occurred in the corpus.
chars[x] represents a look-up of x in the vector; chars[x,y] represents a look-up of the sequence xy in the matrix. Note that chars[x] = the sum over chars[x,y] for each value of y.
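A tiny sketch of the two chars structures on a toy corpus (the real counts would come from the 1988 AP Newswire text, counted the same way; the variable names here are mine):

```python
from collections import Counter

corpus = "the cat sat on the mat"

# Vector: chars[x] = number of times character x occurs.
chars_vec = Counter(corpus)

# Matrix: chars[x, y] = number of times the sequence xy occurs.
chars_mat = Counter(zip(corpus, corpus[1:]))

print(chars_vec["a"])         # 3
print(chars_mat[("a", "t")])  # 3  ("at" in cat, sat, mat)

# chars[x] equals the sum of chars[x, y] over y,
# except when x is the corpus's final character (it starts no pair).
assert chars_vec["a"] == sum(c for (x, _), c in chars_mat.items() if x == "a")
```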
Note that their counts are all based on the 1988 AP Newswire corpus (available from the LDC). If you can't use their exact corpus, I don't think it would be unreasonable to use another text from the same genre (i.e. another newswire corpus) and scale your counts such that they fit the original data. That is, the frequency of a given character shouldn't vary too much from one text to another if they're similar enough, so if you've got a corpus of 22 million words of newswire, you could count characters in that text and then double them to approximate their original counts.