What happens in the 'Input Transform' in the PointNet architecture? - conv-neural-network

I am reading papers to understand how raw point cloud data is converted into a form that machine learning models can consume. I have a question about the paper PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In the PointNet architecture (shown in the picture below), the raw point cloud first goes into the 'Input transform' block, where a T-Net (transformation network) and a matrix multiplication are applied. My question is: what happens in the 'Input transform' and 'Feature transform' blocks? What is the input and what is the output of each? An explanation of this is my main question.
The paper's DOI is 10.1109/CVPR.2017.16.

I'm trying to work this out as well, so consider this an incomplete answer. I think the input transform, with its 3x3 matrix, spatially transforms (via some affine transformation) the nx3 inputs (3 dimensions: think x, y, z). Intuitively you may think of it this way: say you give it a rotated object (say an upside-down chair); it would de-rotate the object to a canonical representation (an upright chair). It's a 3x3 matrix so that the dimensionality of the input is preserved. That way the network becomes invariant to changes of pose (perspective). After this, the shared MLPs (essentially 1x1 convolutions) increase the number of features per point from 3 to 64 (nx3 to nx64). The next T-Net does the same thing as the first, but with a 64x64 matrix: it moves the higher-dimensional feature space into a canonical form. As to exactly how the box works, I'm reading the code and will let you know.
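To make the input and output of each block concrete, here is a minimal NumPy sketch of the data flow described above. It is not the authors' implementation: the layer sizes, the random untrained weights, and the helper names `shared_mlp`, `t_net` and `init_t_net` are all assumptions for illustration. The real T-Net is deeper and is trained end to end with the rest of the network.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(x, w, b):
    """Apply the same linear layer + ReLU to every point (a 1x1 convolution)."""
    return np.maximum(x @ w + b, 0.0)

def init_t_net(k, hidden=32):
    """Hypothetical toy parameters: one shared layer and one output layer."""
    return (rng.normal(scale=0.1, size=(k, hidden)), np.zeros(hidden),
            rng.normal(scale=0.01, size=(hidden, k * k)), np.zeros(k * k))

def t_net(x, params, k):
    """Toy T-Net: (n, k) points or features in, one (k, k) matrix out."""
    w1, b1, w2, b2 = params
    per_point = shared_mlp(x, w1, b1)        # (n, hidden)
    global_feat = per_point.max(axis=0)      # (hidden,), max pool over points
    offset = (global_feat @ w2 + b2).reshape(k, k)
    return np.eye(k) + offset                # starts close to the identity

points = rng.normal(size=(1024, 3))          # raw cloud: n x 3

# Input transform: predict a 3x3 matrix from the cloud, then multiply.
params_in = init_t_net(3)
T_in = t_net(points, params_in, 3)           # 3 x 3
aligned = points @ T_in                      # still n x 3

# Shared MLP lifts every point from 3 to 64 features.
w, b = rng.normal(scale=0.1, size=(3, 64)), np.zeros(64)
feats = shared_mlp(aligned, w, b)            # n x 64

# Feature transform: same idea, but with a 64x64 matrix.
T_feat = t_net(feats, init_t_net(64), 64)    # 64 x 64
aligned_feats = feats @ T_feat               # still n x 64
```

So each transform block takes an n x k array, predicts a single k x k matrix from the whole cloud, and returns an n x k array. Because of the max pooling, the predicted matrix does not depend on the order of the points.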

Related

Cross-Correlation between 3d fields numerically

I have two 3D variables at each time step (so for each variable I have N 3D arrays var(Nx,Ny,Nz)). I want to compute the two-point statistics, but I guess I'm doing something wrong.
Two-point statistics formula, where x_r is the reference point and x is the independent variable
I know that the theoretical formulation of a two-point cross correlation is the one written above.
For the sake of simplicity, let's ignore the normalization, so I'm focusing on the numerator, which is the part I'm struggling with.
My two variables are two 3D arrays, with the notation phi(x,y,z) = phi(i,j,k), and the same for psi.
My aim is to compute a 3D correlation with respect to a given reference point Reference_Point = (xr,yr,zr). I'm trying this in MATLAB, but my results are not accurate. From some research online it seems that I should use convolutions or an FFT, but I can't find any theoretical framework that explains how to do that, or why the formulation above should in practice be implemented with a convolution or an FFT. Moreover, I would like to implement my cross-correlation in the spatial domain and not in the frequency domain, and with the convolution I don't understand how to choose the reference point.
Thank you so much in advance for reply
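As a rough spatial-domain sketch (in Python/NumPy rather than MATLAB), and assuming the numerator is the time average of the fluctuation of phi at the reference point multiplied by the fluctuation of psi at every other point, a fixed reference point needs no convolution at all. The formula image is not reproduced here, so that definition, the array layout (time, x, y, z) and the function name `two_point_numerator` are assumptions.

```python
import numpy as np

def two_point_numerator(phi, psi, ref):
    """Assumed numerator: R(i,j,k) = mean_t[ phi'(ref,t) * psi'(i,j,k,t) ].

    phi, psi: arrays of shape (Nt, Nx, Ny, Nz); ref: index tuple (ir, jr, kr).
    """
    # Fluctuations about the time mean at each grid point.
    phi_fluct = phi - phi.mean(axis=0)
    psi_fluct = psi - psi.mean(axis=0)
    ir, jr, kr = ref
    phi_ref = phi_fluct[:, ir, jr, kr]               # time series, shape (Nt,)
    # Multiply by psi' everywhere and average over time.
    return np.einsum('t,tijk->ijk', phi_ref, psi_fluct) / phi.shape[0]

rng = np.random.default_rng(0)
phi = rng.normal(size=(50, 4, 5, 6))
psi = rng.normal(size=(50, 4, 5, 6))
R = two_point_numerator(phi, psi, ref=(1, 2, 3))     # shape (4, 5, 6)
```

Convolutions and FFTs usually enter when the field is statistically homogeneous in some direction and the correlation is averaged over all reference points as a function of the separation only. With a single fixed reference point, the direct product above is already the spatial-domain computation.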

How to get three dimensional vector embedding for a list of words

I have been asked to create three dimensional vector embeddings for a series of words. Although I understand what an embedding is and that word2vec will be able to create the vector embeddings, I cannot find a resource that shows me how to create a three dimensional vector (all the resources show many more dimensions than this).
The format I have to create the file in is:
house 34444 0.3232 0.123213 1.231231
dog 14444 0.76762 0.76767 1.45454
which is in the format <token>\t<word_count>\t<vector_embedding_separated_by_spaces>
Can anyone point me towards a resource that will show me how to create the desired file format given some training text?
Once you've decided on a programming language and a word2vec library, the library's documentation will likely highlight a configurable parameter that lets you specify the dimensionality of the vectors it trains. So, you just need to change that parameter from its typical values, like 100 or 300, to 3.
(Note, though, that 3-dimensional word-vectors are unlikely to show the interesting & useful properties of higher-dimensional vectors.)
Once you've used such a library to create the vectors-in-memory, writing them out in your specified format becomes just a file-IO problem, unrelated to word2vec itself. In typical languages, you'd open a new file for writing, loop over your data printing each line properly, then close the file.
(To get a more detailed answer from StackOverflow, you'd want to pick a specific language/library, show what you've already tried with actual code, and show how the results/errors achieved fall short of your goal.)
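As an illustration of the file-IO step only, here is a minimal Python sketch that writes the `<token>\t<word_count>\t<vector>` format from the question. The vectors are random placeholders standing in for whatever a word2vec library would produce, and the function name `write_embeddings` and the file name `embeddings.tsv` are made up for this example. If the library were gensim, the dimensionality parameter is `vector_size` in version 4 (`size` in older versions).

```python
import random
from collections import Counter

def write_embeddings(path, counts, vectors):
    """Write one line per token: token, count, then space-separated vector."""
    with open(path, "w", encoding="utf-8") as f:
        for token, count in counts.most_common():
            vec = " ".join(f"{x:.6f}" for x in vectors[token])
            f.write(f"{token}\t{count}\t{vec}\n")

text = "the dog saw the house and the dog barked"
counts = Counter(text.split())

# Placeholder 3-dimensional vectors; a trained model would supply these.
random.seed(0)
vectors = {tok: [random.uniform(-1, 1) for _ in range(3)] for tok in counts}

write_embeddings("embeddings.tsv", counts, vectors)
```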

interpretation of SVD for text mining topic analysis

Background
I'm learning about text mining by building my own text mining toolkit from scratch - the best way to learn!
SVD
The Singular Value Decomposition is often cited as a good way to:
Visualise high dimensional data (word-document matrix) in 2d/3d
Extract key topics by reducing dimensions
I've spent about a month learning about the SVD .. I must admit much of the online tutorials, papers, university lecture slides, .. and even proper printed textbooks are not that easy to digest.
Here's my understanding so far: SVD demystified (blog)
I think I have understood the following:
Any (real) matrix can be decomposed uniquely into 3 multiplied matrices using SVD, A=U⋅S⋅V^T
S is a diagonal matrix of singular values, in descending order of magnitude
U and V^T are matrices of orthonormal vectors
I understand that we can reduce the dimensions by filtering out less significant information by zero-ing the smaller elements of S, and reconstructing the original data. If I wanted to reduce dimensions to 2, I'd only keep the 2 top-left-most elements of the diagonal S to form a new matrix S'
My Problem
To see the documents projected onto the reduced dimension space, I've seen people use S'⋅V^T. Why? What's the interpretation of S'⋅V^T?
Similarly, to see the topics, I've seen people use U⋅S'. Why? What's the interpretation of this?
My limited school maths tells me I should look at these as transformations (rotation, scale) ... but that doesn't help clarify it either.
** Update **
I've added an update to my blog explanation at SVD demystified (blog) which reflects the rationale from one of the textbooks I looked at to explain why S'.V^T is a document view, and why U.S' is a word view. Still not really convinced ...
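One way to see the interpretation is to check numerically that S'⋅V^T equals U'^T⋅A, where U' holds the first k columns of U, and that U⋅S' equals A⋅V'. In other words, S'⋅V^T holds the coordinates of each document (column of A) in the basis of the top-k left singular vectors, and U⋅S' holds the coordinates of each word (row of A) in the basis of the top-k right singular vectors. The sketch below uses a made-up word-document matrix with words as rows and documents as columns, which is an assumption about the layout.

```python
import numpy as np

# Toy word-document matrix: 5 words (rows) x 4 documents (columns).
A = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.],
              [1., 1., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k = U[:, :k]            # top-k left singular vectors ("topics" in word space)
S_k = np.diag(s[:k])      # S' restricted to its non-zero k x k block
Vt_k = Vt[:k, :]          # top-k right singular vectors

doc_coords = S_k @ Vt_k   # (k, n_docs): one column of coordinates per document
word_coords = U_k @ S_k   # (n_words, k): one row of coordinates per word

# Document view: project every document onto the top-k topic directions.
print(np.allclose(doc_coords, U_k.T @ A))     # True
# Word view: project every word onto the top-k right singular directions.
print(np.allclose(word_coords, A @ Vt_k.T))   # True
```

Both identities follow from A = U⋅S⋅V^T and the orthonormality of the columns of U and V, so the "transformation" is just a change of basis followed by dropping the less significant coordinates.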

Why a CNN learns different feature maps

I understand (and please correct me if my understanding is wrong) that the primary purpose of a CNN is to reduce the number of parameters from what you would need if you were to use a fully connected NN. And CNN achieves this by extracting "features" of images.
CNN can do this because in a natural image, there are small features such as lines and elementary curves that may occur in an "invariant" fashion, and constitute the image much like elementary building blocks.
My question is: suppose we create a layer of, say, 5 feature maps, each obtained by sliding a window of size, say, 5x5 over an image of, say, 100x100 pixels. Initially the weight matrices behind these feature maps are filled with random numbers, and the weights are then progressively adjusted by gradient descent, right? But if we get these feature maps by using windows of exactly the same size, sliding in exactly the same way (sharing the same starting point and the same stride value) over exactly the same image, how can these maps learn different features of the image? Won't they all come out the same, say, a line or a curve?
Is it due to the different initial values of the weight matrices? (I.e. some weight matrices are more receptive to learning a certain particular feature than others?)
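The role of the initial values can be checked with a small NumPy experiment. The tiny network below (two 5x5 filters, ReLU, global average, summed output, squared error) is an assumption chosen only to keep the gradient short enough to write by hand, and `filter_grads` is a hypothetical helper. It shows that two filters with identical initial weights receive identical gradients, so they would stay identical forever, while differently initialised filters receive different gradients.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter_grads(image, filters, target):
    """Gradient of 0.5 * (pred - target)**2 with respect to each filter.

    pred = sum over filters of mean(ReLU(filter slid over image)).
    """
    patches = sliding_window_view(image, filters.shape[1:])   # (H', W', kh, kw)
    fmap = np.einsum('xyij,mij->mxy', patches, filters)       # (M, H', W')
    pred = np.maximum(fmap, 0.0).mean(axis=(1, 2)).sum()
    err = pred - target
    mask = (fmap > 0).astype(float)                           # ReLU derivative
    n_pos = fmap.shape[1] * fmap.shape[2]
    return err * np.einsum('mxy,xyij->mij', mask, patches) / n_pos

rng = np.random.default_rng(0)
image = rng.normal(size=(20, 20))

same_init = np.repeat(rng.normal(size=(1, 5, 5)), 2, axis=0)  # two copies
diff_init = rng.normal(size=(2, 5, 5))                        # two random filters

g_same = filter_grads(image, same_init, target=1.0)
g_diff = filter_grads(image, diff_init, target=1.0)

print(np.allclose(g_same[0], g_same[1]))   # identical filters, identical updates
print(np.allclose(g_diff[0], g_diff[1]))   # different filters, different updates
```

Random initialisation is what breaks this symmetry: each filter starts at a different point and follows its own gradient from there.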
Thanks!! I wrote my 4 questions/opinions and indexed them, for the ease of addressing them separately!

Image Categorization Using Gist Descriptors

I created a multi-class SVM model using libSVM for categorizing images. I optimized for the C and G parameters using grid search and used the RBF kernel.
The classes are 1) animal 2) floral 3) landscape 4) portrait.
My training set is 100 images from each category, and for each image, I extracted a 920-length vector using Lear's Gist Descriptor C code: http://lear.inrialpes.fr/software.
Upon testing my model on 50 images/category, I achieved ~50% accuracy, which is twice as good as random (25% since there are four classes).
I'm relatively new to computer vision, but familiar with machine learning techniques. Any suggestions on how to improve accuracy effectively?
Thanks so much and I look forward to your responses!
This is a very, very open research challenge, and there isn't necessarily a single answer that is theoretically guaranteed to be better.
Given your categories, it's not a bad start, but keep in mind that Gist was originally designed as a global descriptor for scene classification (although it has empirically proven useful for other image categories). On the representation side, I recommend trying color-based features like patch-based histograms as well as popular low-level gradient features like SIFT. If you're just beginning to learn about computer vision, then I would say an SVM is plenty for what you're doing, depending on the variability in your image set, e.g. illumination, view angle, focus, etc.
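As one possible reading of "patch-based histograms", here is a minimal NumPy sketch that splits an RGB image into a grid of patches and concatenates a normalised per-channel histogram for each patch. The grid size, the number of bins and the function name `patch_color_histogram` are arbitrary choices for illustration, not a recommendation from any particular paper.

```python
import numpy as np

def patch_color_histogram(image, grid=(4, 4), bins=8):
    """image: (H, W, 3) uint8 array -> 1-D vector of length grid[0]*grid[1]*3*bins."""
    h, w, _ = image.shape
    feats = []
    for rows in np.array_split(np.arange(h), grid[0]):
        for cols in np.array_split(np.arange(w), grid[1]):
            patch = image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
            for c in range(3):
                hist, _ = np.histogram(patch[..., c], bins=bins, range=(0, 256))
                feats.append(hist / patch[..., c].size)   # normalise per patch
    return np.concatenate(feats)

rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
feature = patch_color_histogram(fake_image)               # length 4*4*3*8 = 384
```

A vector like this could be concatenated with the Gist descriptor before training the SVM. Whether it helps on these four categories would have to be checked with cross-validation.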
