What is models.common.C3 in yolov5 model? - conv-neural-network

[Yolo model summary][1]
Also, can someone explain the values in the arguments column?
[1]: https://i.stack.imgur.com/weBPt.png

I am studying the YOLOv5 architecture right now, so do not take my answer as absolute truth, but to my understanding the C3 layer is a CSP bottleneck that includes 3 convolutional layers. Essentially, it runs the input tensor through a convolution followed by a series of bottleneck layers (with e=1), runs the same input through a second convolution, concatenates the two results, and then passes the whole thing through a third convolution layer. CSP stands for Cross Stage Partial.
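For reference, here is a simplified PyTorch sketch of that structure. It is only an illustration: the real definition lives in models/common.py of the ultralytics/yolov5 repo, where Conv is actually convolution + batch norm + SiLU, and the channel counts below are placeholders.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # Standard bottleneck: 1x1 conv then 3x3 conv, with an optional residual add.
    def __init__(self, c1, c2, shortcut=True):
        super().__init__()
        c_ = c2  # e=1: hidden channels equal output channels
        self.cv1 = nn.Conv2d(c1, c_, 1, 1)
        self.cv2 = nn.Conv2d(c_, c2, 3, 1, padding=1)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    # CSP bottleneck with 3 convolutions.
    def __init__(self, c1, c2, n=1, e=0.5):
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = nn.Conv2d(c1, c_, 1, 1)      # branch 1: conv, then n bottlenecks
        self.cv2 = nn.Conv2d(c1, c_, 1, 1)      # branch 2: conv only
        self.cv3 = nn.Conv2d(2 * c_, c2, 1, 1)  # conv applied after the concat
        self.m = nn.Sequential(*(Bottleneck(c_, c_) for _ in range(n)))

    def forward(self, x):
        # concat(bottleneck branch, plain conv branch) -> final conv
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

x = torch.randn(1, 64, 32, 32)
print(C3(64, 128, n=2)(x).shape)  # torch.Size([1, 128, 32, 32])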
As for the first column, it is used in the forward function of the model to determine which tensor to use as the input of each layer. The majority of the layers have '-1', meaning they take the previous layer's output as their input, but there are Concat layers that take outputs from different levels as their inputs in order to recreate the PANet architecture in the neck.
For further questions, I suggest you ask in the YOLOv5 GitHub issues section, as the maintainers are often quick to answer.

Related

Conv2D filters and CNN architecture

I am currently pursuing an undergraduate degree and am working on a CNN model to recognize Telugu characters.
This question has two parts:
I have Telugu character images of shape (32, 32, 1) and I want to train my CNN model to recognize the characters. What should my model architecture be, and how do I decide on the architecture, the number of parameters, and the hidden layers? I know my case is essentially the same as handwritten digit recognition, but I want to know how to decide those parameters. Is there any common practice for building such an architecture?
The operation Conv2D(32, (5,5)) means 32 filters of size 5x5 are applied to the input. My question is: are these filters all the same or different? If different, what kind of filters are they initialized to, and who decides them?
I tried searching the internet, but everywhere I go the answer I get is that the Conv2D operation applies filters to the input and performs the convolution operation.
To decide which model architecture would be best, you need to experiment. That's the only way. Since you want to classify, a VGG-style architecture would be a good starting point, I believe. You need to experiment with the number of parameters, as it depends on your problem. You can use Keras Tuner for that: https://keras.io/keras_tuner/
As for kernel initialization, as far as I know convolutional layers in Keras use Glorot uniform initialization by default, but you can change that with the kernel_initializer parameter. Long story short, convolutional layers are initialized from a distribution function, and as training goes on the filters change the values inside them, which is the learning process. https://keras.io/api/layers/initializers
Edit: I forgot to mention that I suggest a VGG-style architecture only if you downsize the model a lot. Your input shape is small, so if your model is too deep, you will overfit really quickly.
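As a minimal sketch of what a downsized VGG-style model for (32, 32, 1) inputs might look like (num_classes is a placeholder for the number of distinct Telugu characters; all layer sizes are things to tune, e.g. with Keras Tuner):

from tensorflow import keras
from tensorflow.keras import layers

num_classes = 50  # placeholder: number of distinct characters in your dataset

model = keras.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(32, (3, 3), padding="same", activation="relu",
                  kernel_initializer="glorot_uniform"),  # the default initializer, written out explicitly
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()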

Doesn't keras.layers.Flatten lose information?

Brand new to keras and ML in general. I'm looking at https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/, and it uses Flatten between Embedding and Dense because Embedding produces a 2D output (one vector per word) while Dense expects a single dimension.
I'm sure I'm missing something obvious here, but why doesn't this lose which words are in which input vectors? How are we able to still know that input #3 was "nice work" and is associated with label #3, 1, for "positive"?
I guess the original dimensions are retained from the original input and then somehow restored for Dense's output? Or am I just totally missing a major conceptual aspect?
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))  # vocab_size and max_length are defined earlier in the tutorial
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
Thanks for any guidance!
The Embedding layer gives you a vector for each word token, so its output is 2-D (time steps x embedding dimensions). We need to flatten it before any dense classifier block.
Some information is lost; for example, when we use convolutional layers and then flatten the feature maps, the spatial information is lost. But we have already extracted the most important features with the Conv layers, and we feed those features to the fully connected layers.
In your example, the temporal dimension is no longer maintained; usually, it's desirable to pass the output of the embedding matrix to an RNN/Conv layer for further feature extraction.
Flatten is only applied to the non-batch dimensions, meaning the examples are still kept separate (if that is what you are asking).
For each sample, say "nice work", we get 2 vectors (1 for "nice", 1 for "work"). We only want to know the overall sentiment of the sentence, so once we have extracted the features, we can apply Flatten.
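A minimal sketch to illustrate the shapes, assuming a vocabulary of 50 tokens, padded sentences of length 4, and an 8-dimensional embedding (all placeholder values):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(input_dim=50, output_dim=8),  # per sample: (4 time steps, 8 features)
    layers.Flatten(),                              # per sample: (32,)
])

x = np.array([[3, 7, 0, 0],    # e.g. the encoded, padded sentence "nice work"
              [12, 5, 9, 0]])  # a second sample in the same batch
print(model.predict(x).shape)  # (2, 32): each sample keeps its own row after Flatten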

Keras regression - Should my first/last layer have an activation function?

I keep seeing examples around the internet where the input and/or output layer has either no activation function, a linear activation function, or None. What I'm confused about is when to use each, and how to know which I should use. I'm also confused about what the number of nodes should be for the input layer.
Right now I have a regression problem: I'm trying to predict a real value from an array of inputs (about 54). Should I be using relu as the activation function for the input layer? Should I have linear as my output activation? My data is linearly scaled from 0 to 1 for each feature independently, as the features are in different units. I was also unsure how many nodes to use for the input layer, since some examples pick an arbitrary number unrelated to their input shape, while others say to set it to the number of inputs, or to the number of inputs plus one for a bias. But none of the examples so far have explained the reasoning behind their choices.
Since my model isn't performing very well, I thought asking what the architecture should be could help me fine-tune it.
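For reference, the pattern those examples typically follow, sketched with placeholder layer sizes (in Keras, activation=None and activation='linear' on a Dense layer mean the same thing: no activation is applied, which is the usual choice for a regression output):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(54,)),               # one input node per feature
    layers.Dense(64, activation="relu"),     # hidden layer widths are arbitrary and tunable
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="linear"),    # regression output: linear, i.e. no activation
])
model.compile(optimizer="adam", loss="mse")
model.summary()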

TimeDistributed Layers vs. ConvLSTM-2D

Could anyone explain the differences between TimeDistributed layers (the Keras wrapper) and ConvLSTM2D (convolutional LSTM), in terms of purpose, usage, etc.?
Both apply to a sequence of data.
TimeDistributed is a very straightforward layer wrapper that simply applies a layer (usually a Dense layer) at each time step. You need it when you want to change the shape of the output tensor, especially the feature dimension, while leaving the sample size and the time steps untouched.
ConvLSTM2D is much more complex. You need to understand CNN and RNN layers first; LSTM is one of the most popular RNNs. An LSTM by itself is applied to a sequence of tensors, which is used for NLP and time series, and at each time step the input is 1-dimensional. A CNN, the conv part, is usually used to learn from images, which are 2-dimensional but have no sequence (time steps). Combined, ConvLSTM is used to learn from a sequence of images, like video.
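A minimal sketch contrasting the two, with illustrative placeholder shapes:

from tensorflow import keras
from tensorflow.keras import layers

# TimeDistributed: the same Dense layer is applied independently at every time step.
# (batch, 10 time steps, 32 features) -> (batch, 10 time steps, 16 features)
td_model = keras.Sequential([
    layers.Input(shape=(10, 32)),
    layers.TimeDistributed(layers.Dense(16)),
])
td_model.summary()

# ConvLSTM2D: an LSTM whose internal transformations are convolutions, so it
# consumes a sequence of images (e.g. video frames).
# (batch, 10 frames, 64, 64, 1) -> (batch, 62, 62, 8) with return_sequences=False
clstm_model = keras.Sequential([
    layers.Input(shape=(10, 64, 64, 1)),
    layers.ConvLSTM2D(filters=8, kernel_size=(3, 3), return_sequences=False),
])
clstm_model.summary()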

keras RNN w/ local support and shared weights

I would like to understand how Keras sets up weights to be shared. Specifically, I would like to use a convolutional 1D layer for processing a time-frequency representation of an audio signal and feed it into an RNN (perhaps a GRU layer) that has:
local support (e.g. like the Conv1D layer with a specified kernel size). Things that are far away in frequency from an output are unlikely to affect the output.
Shared weights, that is I train only a single set of weights across all of the neurons in the RNN layer. Similar inferences should work at lower or higher frequencies.
Essentially, I'm looking for many of the properties that we find in the 2D RNN layers. I've been looking at some of the Keras source code for the convnets to try to understand how weight sharing is implemented, but when I see the weight allocation code in the layer build methods (e.g. in the _Conv class), it's not clear to me how the code is specifying that the weights for each filter are shared. Is this buried in the backend? I see that the backend call is to a specific 1D, 2D, or 3D convolution.
Any pointers in the right direction would be appreciated.
Thank you - Marie
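Not a full answer to the internals question, but a minimal sketch showing that the sharing is a consequence of the convolution operation itself: the layer allocates one kernel of shape (kernel_size, in_channels, filters) in its build method, and the backend conv op reuses that same kernel at every position (the shapes below are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 1)),             # e.g. 128 frequency bins, 1 channel
    layers.Conv1D(filters=4, kernel_size=5),  # local support of 5 bins
])
kernel, bias = model.layers[-1].get_weights()
print(kernel.shape)  # (5, 1, 4): a single shared kernel, slid over all 128 positions
print(bias.shape)    # (4,)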
