How to calculate the number of layers for GoogLeNet - conv-neural-network

I know I am asking a silly question, but I can find no way to get the answer and I really wish to know it. People always say that GoogLeNet has 22 layers. However, no matter how I count, I do not get 22 layers. How should the number of layers be calculated?

I think the correct way to look at this is that GoogLeNet has 22 convolutional layers. Count the blue columns in the architecture diagram, counting only the convolutional ones, and you will obtain that number. In this counting, the parallel convolutions inside an Inception module are not considered separate layers. Max pooling, concatenation and softmax are not really considered layers here, as they have no learnable parameters. The dense layers are left out because we're only talking about the convolutional part. All that said, I agree that saying the model has 22 convolutional layers is confusing, given the complexity of the model.
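As an aside, you can count a model's convolutional layers programmatically. A minimal sketch in Keras (GoogLeNet itself is not bundled with Keras, so InceptionV3 stands in here purely to illustrate the counting approach):

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Conv2D

model = InceptionV3(weights=None)  # weights=None skips the pretrained download
n_conv = sum(isinstance(layer, Conv2D) for layer in model.layers)
print(f"Convolutional layers: {n_conv}")
```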

Related

Keras regression - Should my first/last layer have an activation function?

I keep seeing examples floating around the internet where the input and/or output layer has either no activation function, a linear activation function, or None. What I'm confused about is when to use each, and how to know which one I should use. I am also confused about how many nodes the input layer should have.
Right now I have a regression problem: I'm trying to predict a real value from an array of about 54 inputs. Should I be using relu as the activation function for the input layer? Should my output activation be linear? My data is scaled linearly from 0 to 1 for each feature independently, as the features are in different units. I was also unsure how many nodes to use for my input layer, since some examples pick an arbitrary number unrelated to their input shape, while others say to set it to the number of inputs, or the number of inputs plus one for a bias. But none of the examples so far have explained the reasoning behind their choices.
Since my model isn't performing very well, I thought asking what the architecture should be might help me fine-tune it.

What are the best normalization technique and LSTM structure for forecasting an output with jumps (outliers)?

I have a time-series forecasting problem with ten features (inputs) and only one output. I'm using 22 timesteps (the history of the features) for one-step-ahead prediction with an LSTM. I apply MinMaxScaler to normalize the inputs, but I don't normalize the output. The output contains some rare jumps (values such as 20, 50, or more than 100), while the other values lie between 0 and ~5 (all values are positive). It is important to forecast both the normal values and the outliers correctly, so I don't want my forecasting model to miss the jumps. I think that if I apply MinMaxScaler to the output, most values will end up near zero while the outliers will be near one.
What is the best way to normalize the output? Should I leave it unnormalized?
What is the best LSTM structure to handle this issue? (Currently I'm using an LSTM with relu and a Dense layer with relu as the last layer, so that the output is positive.) I suspect I need to choose the activation functions carefully for this case.
First of all, you should decide on a metric to measure performance: for example, MAE, MSE, or some other metric suited to the task at hand. You might tolerate a greater error on the rare jumps but not on the normal cases, or vice versa. Once you have decided on the error metric, ideally you should set it as the cost function that the LSTM network minimizes.
The goal is then to minimize the desired error metric you set. If this were a convex problem, the scaling of the output would not matter; but we know that is not the case with complex deep learning architectures. This means that while minimizing the cost function with gradient descent, the optimizer might get stuck in a local minimum or converge very slowly. In this case, normalizing the output can help. How?
Assume that your output has a mean value of 5. With the last layer's weights initialized around zero and its bias at zero (i.e., in the linear regime of relu), the network first needs to learn that the bias should be around 5. Depending on the complexity of the network, this can take several epochs. However, if you normalize the data, or initialize the bias at 5, your network starts with a good estimate of the bias and therefore converges faster.
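A minimal sketch of that bias trick, using the shapes from the question (22 timesteps, ten features); the mean of 5 is just the running example, not a recommendation:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(32, input_shape=(22, 10)),            # 22 timesteps, 10 features, as in the question
    Dense(1, bias_initializer=Constant(5.0)),  # linear output; bias starts at the target mean
])
model.compile(optimizer="adam", loss="mae")
```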
Now back to your questions:
I would at least make the output zero-mean and use a Dense layer with a linear output.
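For instance, a minimal sketch of zero-centering the target; the synthetic y_train here is hypothetical, mimicking the rare jumps described in the question:

```python
import numpy as np

# Hypothetical target: mostly small values with rare jumps, as described.
rng = np.random.default_rng(0)
y_train = np.concatenate([rng.uniform(0, 5, 990), rng.uniform(20, 100, 10)])

y_mean = y_train.mean()
y_centered = y_train - y_mean   # fit the network on the zero-mean target
# model.fit(X_train, y_centered, ...)
# At prediction time, add the mean back:
# y_pred = model.predict(X_test).ravel() + y_mean
```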
The architecture you have seems fine; you can try stacking 2-4 LSTM layers if you think your input has complex time dependencies.
Feel free to update the OP with the code and the performance you get, and we can discuss what else can be improved.

Using LSTM for large text

I have a dataset for detecting fake news that I got from Kaggle (https://www.kaggle.com/c/fake-news/data).
I want to use an LSTM for the classification.
The mean length of a single article is about 750 words. I have tried removing punctuation, stop words, and numbers. Preprocessing the text is also taking a very long time.
I'd like a method to feed large texts into an LSTM using Keras. What should I do to reduce computation time without losing much accuracy?
There are some things you could try to speed things up:
1. Use the cuDNN version of the LSTM
It is usually faster; check the available layers in the Keras documentation. keras.layers.CuDNNLSTM is what you are after.
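A minimal sketch, assuming the TF 1.x-era standalone Keras where CuDNNLSTM is a separate layer (in TF 2.x, keras.layers.LSTM dispatches to the cuDNN kernel automatically when its arguments allow it); the vocabulary size is hypothetical:

```python
from keras.models import Sequential
from keras.layers import Embedding, CuDNNLSTM, Dense

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),  # hypothetical vocabulary size
    CuDNNLSTM(64),                               # GPU-only, cuDNN-backed LSTM
    Dense(1, activation="sigmoid"),
])
```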
2. Use Conv1D to create features
You can use a 1-dimensional convolution, with kernel_size specifying how many words are taken into account at once and strides specifying the step of the moving window. With kernel_size=3, strides=3, and padding="same", it would shrink your sequence length by a factor of three.
You may stack more convolutional layers.
On top of that you can still employ LSTM normally.
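A minimal sketch of this combination; the vocabulary size and layer widths are hypothetical, and 750 is the article length from the question:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, LSTM, Dense

model = Sequential([
    Embedding(input_dim=20000, output_dim=128, input_length=750),
    Conv1D(filters=64, kernel_size=3, strides=3, padding="same",
           activation="relu"),           # sequence length 750 -> 250
    LSTM(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```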
3. Drop LSTM altogether
You may go with 1D convolutions and pooling for classification; RNNs are not the only way, as the sketch below shows.
On the upside, you will not encounter vanishing gradients (which can also be mitigated somewhat by a bidirectional LSTM).
On the downside, you will lose the strict ordering dependence between words, though that shouldn't be much of a problem for binary classification (which I suppose is your goal).
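A minimal sketch of a convolution-only classifier, again with hypothetical dimensions:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(input_dim=20000, output_dim=128, input_length=750),
    Conv1D(64, kernel_size=5, activation="relu"),
    GlobalMaxPooling1D(),               # collapse the time dimension entirely
    Dense(1, activation="sigmoid"),     # fake vs. real
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```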

Is there a relation between the number of LSTM units and the length of the sequence to be trained?

I have programmed a Keras neural network to train on sequences. Does the choice of the number of LSTM units in Keras depend on the length of the sequence?
There isn't a set way of determining how many units you should have based on your input.
More units make the model more complex. Generally speaking, if the look-back period for your neural network is longer, each sample carries more information to learn from, which means a more complex model is better suited to learning your data.
Personally, I like to use the number of timesteps in each sample as my number of units, and I decrease this number as I move deeper into the network.
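A minimal sketch of that heuristic, with hypothetical dimensions (30 timesteps, 8 features):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps, n_features = 30, 8   # hypothetical
model = Sequential([
    LSTM(timesteps, return_sequences=True, input_shape=(timesteps, n_features)),
    LSTM(timesteps // 2),       # fewer units deeper in the network
    Dense(1),
])
```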
I encountered this problem when I designed a sports-betting prediction engine with an LSTM RNN.
There's a rule of thumb that helps for supervised learning problems; please check this link.
But in my opinion, there is still no exact method or formula for deriving the number of neurons per layer and the number of hidden layers from the training dataset.

Audio classification with Keras: presence of human voice

I'd like to create an audio classification system with Keras that simply determines whether a given sample contains human voice or not. Nothing else. This would be my first machine learning attempt.
This audio preprocessor exists. It claims to be unfinished, but it has been forked a few times:
https://github.com/drscotthawley/audio-classifier-keras-cnn
I don't understand how this one would work, but I'm ready to give it a try:
https://github.com/keunwoochoi/kapre
But let's say I get one of those to work: would the rest of the process be similar to image classification? Basically, I've never fully understood when to use softmax and when to use ReLU. Would this be similar with sound as it is with images once I've got the data mapped as a tensor?
Sounds can be seen as a 1D image and worked on with 1D convolutions.
Often, dilated convolutions do a good job; see WaveNet.
Sounds can also be seen as sequences and worked on with RNN layers (but raw audio may contain too many timesteps for that to be practical).
For your case, you need only one output with a 'sigmoid' activation at the end and a 'binary_crossentropy' loss.
Result = 0 -> no voice
Result = 1 -> there's voice
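A minimal sketch of that setup; the input shape (one second of 16 kHz mono audio) and layer sizes are assumptions:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Conv1D(16, kernel_size=9, activation="relu", input_shape=(16000, 1)),
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),   # near 1 -> voice, near 0 -> no voice
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```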
When to use 'softmax'?
The softmax function is good for multiclass problems (not your case) where you want exactly one class as the result. All the outputs of a softmax sum to 1, so they can be read as a probability for each class.
It's mainly used in the final layer, because that is where you read off the classes.
It's good for cases where exactly one class is correct, and in that case it goes well with the categorical_crossentropy loss.
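A small numeric illustration of that sum-to-1 property:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))            # ~[0.659 0.242 0.099], which sums to 1
```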
Relu and other activations in the middle of the model
These are not strictly ruled; there are lots of possibilities. I often see relu in convolutional image models.
The important thing to know is their ranges: what are the limits of their outputs?
Sigmoid: from 0 to 1. At the end of the model, this is the best option for your presence/absence classification. Also good for models where many classes may be present together.
Tanh: from -1 to 1.
Relu: from 0 upward, unbounded (it simply zeroes out negative values).
Softmax: from 0 to 1, but making sure the sum of all values is 1. Good at the end of models that want exactly one class among many.
Oftentimes it is useful to preprocess the audio into a spectrogram.
Using a spectrogram as input, you can use classical image-classification approaches (such as convolutional neural networks). In your case, you could divide the input audio into frames of around 20-100 ms (depending on the time resolution you need) and convert those frames to spectrograms. Convolutional networks can also be combined with recurrent units to take a larger temporal context into account.
It is also possible to train neural networks on raw waveforms using 1D convolutions. However, research has shown that preprocessing approaches using a frequency transformation generally achieve better results.
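A minimal sketch of that preprocessing, using librosa (the library choice, file name, and parameters are assumptions, not something the post prescribes):

```python
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)         # hypothetical audio file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel)                 # log scale, as is common
x = log_mel[np.newaxis, ..., np.newaxis]           # batch and channel dims for a CNN
```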
