NCHW input matrix to Dm conversion logic for convolution in cuDNN - conv-neural-network

I have been trying to understand the convolution lowering operation shown in the cuDNN paper. I was able to understand most of it by reading through and mapping various parameters to the image below. However, I am unable to understand how the original input data (NCHW) was converted into the Dm matrix shown in red.
The ordering of the elements of the Dm matrix does not make sense. Can someone please explain this?

Each column of Dm corresponds to a tile of the original image. Two examples are shown below:
There is no simple mathematical description of how to extract these tiles (the authors call the layout "non-trivial"), but there are some general comments in section 3.1.
A couple of notes:
The exact layout of data in Dm and Fm is flexible: you could permute the rows of Dm as long as you permute the columns of Fm to match, and vice versa.
cuDNN does not actually construct Dm in full; rather, it lazily generates columns of Dm as they are needed (see section 3.1 of the paper).
Convolution or cross-correlation? The classical definition of a convolution requires that the filters are flipped (along both axes) before applying them to the image. Modern machine-learning frameworks tend not to do this, and mathematical pedants call this a cross-correlation rather than a convolution. From a machine-learning perspective, it doesn't matter which one you use, but filter-flipping gives convolution nice algebraic properties (e.g. commutativity) and matches the definition of convolution used in mathematics (side note: convolve means to fold or roll). In this cuDNN paper the filters are flipped.
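For a concrete picture, here is a minimal im2col-style sketch in NumPy of how one column of Dm can be built. This is only an illustration, not cuDNN's code (which generates these columns lazily); stride 1, no padding, a single image, and the particular row ordering within a column are all simplifying assumptions.

import numpy as np

def lower_input(x, R, S):
    # x: one input image in CHW layout (channels, height, width); stride 1, no padding.
    C, H, W = x.shape
    P, Q = H - R + 1, W - S + 1                 # output height and width
    Dm = np.empty((C * R * S, P * Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            # One column of Dm = one R x S tile taken from every channel, flattened.
            tile = x[:, p:p + R, q:q + S]       # shape (C, R, S)
            Dm[:, p * Q + q] = tile.reshape(-1)
    return Dm

# With Fm holding one flattened (C, R, S) filter per row (flipped along both spatial
# axes if you want a true convolution rather than a cross-correlation), the whole
# operation collapses to a single matrix multiply: Om = Fm @ Dm, of shape (K, P * Q).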

Related

What process does happen in 'Input Transform' in PointNet architecture?

I am reading papers to understand methods that convert raw point cloud data into a machine-learning-readable dataset. I have a question about the research paper PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In the PointNet architecture (shown in the picture below), in the first step, after the raw point cloud data is fed into the algorithm, it goes into the 'Input transform' part, where some processing happens in a T-Net (transformation network) followed by a matrix multiplication. My question is: what happens in the 'Input transform' and 'Feature transform' parts? What is the input data and what is the output data? Kindly give an explanation, as that is my main question.
You can find the research paper by the doi: 10.1109/CVPR.2017.16
I'm trying to work this out as well, so consider this an incomplete answer. I think the input transformer, with its 3x3 matrix, acts to spatially transform (via some affine transformation) the nx3 inputs (3-dimensional: think x, y, z). Intuitively you may think of it this way: if you give it a rotated object (say an upside-down chair), it would de-rotate the object to a canonical representation (an upright chair). It's a 3x3 matrix so as to preserve the dimensionality of the input. That way the input becomes invariant to changes of pose (perspective). After this, the shared MLPs (essentially a 1x1 conv) increase the number of features from nx3 to nx64, and the next T-Net does the same as the first one: it moves the higher-dimensional feature space into a canonical form. As to exactly how the box works, I'm reading the code and will let you know.
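As a rough sketch of that idea in PyTorch (not the authors' code; the layer widths follow the paper but everything else is simplified, and the feature transform is the same construction with a 64x64 matrix acting on the per-point 64-dimensional features):

import torch
import torch.nn as nn

class TNet3(nn.Module):
    # Predicts a 3x3 transform from the point cloud itself; the caller then applies
    # it to the points, giving a learned, input-dependent alignment.
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(                  # shared per-point MLP (1x1 convs)
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 9),                     # 9 numbers = a 3x3 matrix
        )

    def forward(self, points):                     # points: (B, N, 3)
        x = self.mlp(points.transpose(1, 2))       # (B, 1024, N)
        x = torch.max(x, dim=2).values             # symmetric max-pool over the points
        m = self.fc(x).view(-1, 3, 3)
        # Bias towards the identity so training starts near "no transform".
        return m + torch.eye(3, device=m.device)

points = torch.rand(8, 1024, 3)                    # a batch of 8 clouds, 1024 points each
aligned = torch.bmm(points, TNet3()(points))       # (B, N, 3): the "matrix multiply" box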

Cross-Correlation between 3d fields numerically

I have two 3D variables for each time step (so I have N 3D matrices var(Nx,Ny,Nz) for each variable). I want to construct the two-point statistics, but I guess I'm doing something wrong.
Two-point statistics formula, where x_r is the reference point and x is the independent variable
I know that the theoretical formulation of a two-point cross-correlation is the one written above.
For the sake of simplicity let's ignore the normalization, so I'm focusing on the numerator, which is the part I'm struggling with.
So, my two variables are two 3D matrices, with the following notation: phi(x,y,z) = phi(i,j,k), and the same for psi.
My aim is to compute a 3D correlation given a certain reference point Reference_Point = (xr,yr,zr), but I guess I'm doing something wrong. I'm trying to do this in MATLAB, but my results are not accurate. From some research online it seems that I should use convolutions or FFTs, but I can't find a theoretical framework that explains how to do that, or why the formulation above should in practice be implemented with a conv or an fft. Moreover, I would like to implement my cross-correlation in the spatial domain rather than the frequency domain, and with the convolution approach I don't understand how to choose the reference point.
Thank you so much in advance for your reply.
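For reference, here is a minimal NumPy sketch of the direct, spatial-domain evaluation of the numerator, i.e. R(r) = < phi(x_r) psi(x_r + r) > with x_r held fixed. The averaging over the N snapshots and the periodic wrap-around of the shift are assumptions that may not match the actual setup.

import numpy as np

def two_point_numerator(phi, psi, ref):
    # phi, psi: arrays of shape (N, Nx, Ny, Nz), i.e. N snapshots of each 3D field.
    # ref: index (ir, jr, kr) of the fixed reference point x_r.
    # Returns R(dx, dy, dz) = < phi(x_r) * psi(x_r + r) >, averaged over the N snapshots.
    N = phi.shape[0]
    ir, jr, kr = ref
    R = np.zeros(phi.shape[1:])
    for n in range(N):
        a = phi[n, ir, jr, kr]                       # field 1 at the reference point
        # Shift psi so that entry (dx, dy, dz) of the rolled array is psi at x_r + r
        # (with periodic wrap-around at the boundaries).
        b = np.roll(psi[n], shift=(-ir, -jr, -kr), axis=(0, 1, 2))
        R += a * b
    return R / N

The FFT/convolution route computes the same products for every reference point at once (a correlation is a convolution with one of the signals reversed), but for a single fixed x_r the direct loop above is the literal transcription of the numerator.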

How can I build a good approximation of an unknown distribution when only having samples from it in order to draw from it in torch?

Say I just have random samples from the distribution and no other data, e.g. a list of numbers like [1, 15, 30, 4, ...]. What's the best way to estimate the distribution so I can draw more samples from it in PyTorch?
I am currently assuming that all samples come from a normal distribution and just using the mean and std of the samples to build it and draw from it. The true distribution, however, could be anything.
import torch
from torch.distributions import Normal

# Fit a normal by plugging in the sample mean and standard deviation, then draw from it.
samples = torch.tensor([1., 2., 3., 4., 3., 2., 2., 1.])
Normal(samples.mean(), samples.std()).sample()
If you have enough samples (and preferably the sample dimension is higher than 1), you could model the distribution using a Variational Autoencoder or a Generative Adversarial Network (though I would stick with the first approach, as it's simpler).
Basically, after correct implementation and training you would get a deterministic decoder able to decode a hidden code you pass to it (say, a vector of size 10 taken from a normal distribution) into a value from your target distribution.
Note it might not be reliable at all, though, and it would be even harder if your samples are only 1D.
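A very rough sketch of that VAE idea for 1D samples (a toy illustration with arbitrary layer sizes and training settings, not a recipe that will necessarily model this particular data well):

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    # Minimal 1D VAE: encoder -> (mu, logvar), reparameterize, decoder -> reconstruction.
    def __init__(self, latent=2):
        super().__init__()
        self.latent = latent
        self.enc = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2 * latent))
        self.dec = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

samples = torch.tensor([1., 2., 3., 4., 3., 2., 2., 1.]).unsqueeze(1)
vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-2)
for _ in range(2000):
    recon, mu, logvar = vae(samples)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    loss = ((recon - samples) ** 2).sum() + kl
    opt.zero_grad(); loss.backward(); opt.step()

# After training, new draws come from decoding random latent codes with the decoder only.
with torch.no_grad():
    new_samples = vae.dec(torch.randn(5, vae.latent))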
The best way depends on what you want to achieve. If you don't know the underlying distribution, you need to make assumptions about it and then fit a suitable distribution (one that you know how to sample from) to your samples. You may start with something simple like a mixture of Gaussians (several normal distributions with different weightings).
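For example, here is a minimal sketch that fits a mixture of Gaussians with scikit-learn and rebuilds it as a torch distribution to sample from (the two-component choice is an arbitrary assumption):

import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal
from sklearn.mixture import GaussianMixture

samples = torch.tensor([1., 2., 3., 4., 3., 2., 2., 1.])

# Fit a 2-component Gaussian mixture to the 1D samples with scikit-learn.
gmm = GaussianMixture(n_components=2).fit(samples.reshape(-1, 1).numpy())

# Rebuild the fitted mixture as a torch distribution so we can .sample() from it.
mix = Categorical(probs=torch.tensor(gmm.weights_, dtype=torch.float))
comp = Normal(torch.tensor(gmm.means_.ravel(), dtype=torch.float),
              torch.tensor(gmm.covariances_.ravel(), dtype=torch.float).sqrt())
dist = MixtureSameFamily(mix, comp)
new_samples = dist.sample((10,))   # 10 fresh draws from the fitted mixture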
Another way is to define a discrete distribution over the values you have. You will give each value the same probability, say p(x)=1/N. When you sample from it, you simply draw a random integer from [0,N) that points to one of your samples.
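In torch, that empirical approach is essentially a one-liner:

import torch

samples = torch.tensor([1., 2., 3., 4., 3., 2., 2., 1.])
# Draw 10 indices uniformly from [0, N) and look up the corresponding values.
new_samples = samples[torch.randint(len(samples), (10,))]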

Why a CNN learns different feature maps

I understand (and please correct me if my understanding is wrong) that the primary purpose of a CNN is to reduce the number of parameters from what you would need if you were to use a fully connected NN. A CNN achieves this by extracting "features" of images.
A CNN can do this because, in a natural image, there are small features such as lines and elementary curves that may occur in an "invariant" fashion and constitute the image much like elementary building blocks.
My question is: when we create, say, 5 feature maps by sliding a window of size, say, 5x5 over an image of, say, 100x100 pixels, these feature maps start out as random weight matrices and must progressively adjust their weights via gradient descent, right? But then, if we obtain these feature maps using windows of exactly the same size, sliding in exactly the same way (sharing the same starting point and the same stride value), over exactly the same image, how can these maps learn different features of the image? Won't they all come out the same, say, a line or a curve?
Is it due to the different initial values of the weight matrices? (I.e., are some weight matrices more receptive to learning a certain particular feature than others?)
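For reference, PyTorch does give each of the 5 filters its own independently, randomly initialized 5x5 weight matrix; that difference in starting points is what lets gradient descent push them towards different features (if all filters started identical they would receive identical gradients and stay identical).

import torch.nn as nn

# 1 input channel, 5 feature maps, 5x5 kernels: the weight tensor has shape
# (5, 1, 5, 5), i.e. five separate 5x5 matrices, each with its own random values.
conv = nn.Conv2d(in_channels=1, out_channels=5, kernel_size=5)
print(conv.weight.shape)                                 # torch.Size([5, 1, 5, 5])
print((conv.weight[0] - conv.weight[1]).abs().max())     # nonzero: different starting points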
Thanks!! I wrote my 4 questions/opinions and indexed them, for the ease of addressing them separately!

Canny algorithm is enough for creating a feature descriptor image and giving for SVM?

I retrieve contours from images using the Canny algorithm. Is that enough to build a descriptor image, feed it to an SVM and find similarities? Or do I necessarily need other features such as elongation, perimeter or area?
I ask because, inspired by this example: http://scikit-learn.org/dev/auto_examples/plot_digits_classification.html, I fed in my images first in greyscale and then as Canny edge maps, and in both cases my confusion matrix was full of zeros, as were the precision, recall, f1-score and support measures.
My advice is:
unless you have a low number of images in your database and/or the recognition is going to be really specific (not a random thing, for example), I would highly recommend applying one or more feature extractors such as SIFT, Fourier descriptors, Haralick features or the Hough transform to extract more details, which can be summarised in a short vector.
Then you could apply the SVM on top of all this in order to get more accuracy.
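A hedged sketch of that pipeline on the digits data from the linked example, using scikit-image's HOG descriptor as a stand-in for the extractors mentioned above (the descriptor choice and every parameter here are illustrative assumptions, not a prescription):

import numpy as np
from skimage.feature import canny, hog
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                     # same data as the linked example (8x8 images)

def describe(image):
    # The raw edge map alone is a weak SVM input, so summarise it with a
    # fixed-length descriptor (HOG here, standing in for SIFT / Fourier / Haralick).
    edges = canny(image)
    return hog(edges.astype(float), orientations=8,
               pixels_per_cell=(4, 4), cells_per_block=(1, 1))

X = np.array([describe(img) for img in digits.images])
X_train, X_test, y_train, y_test = train_test_split(X, digits.target, test_size=0.5)
clf = SVC(gamma="scale").fit(X_train, y_train)
print(clf.score(X_test, y_test))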
