PDF of the product of two i.i.d. circularly symmetric Gaussian random variables

What is the distribution of the product of two circularly symmetric Gaussian random variables?
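Since no closed form is stated in the thread, a quick Monte Carlo sketch (assuming unit variance) can at least be used to check candidate densities. The product again has uniform phase (the phases add mod 2π), and its modulus is the product of two independent Rayleigh variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sigma = 1.0

# Circularly symmetric complex Gaussian CN(0, sigma^2): independent real and
# imaginary parts, each N(0, sigma^2 / 2), so E|Z|^2 = sigma^2.
def cn_samples(n):
    return (rng.normal(0, sigma / np.sqrt(2), n)
            + 1j * rng.normal(0, sigma / np.sqrt(2), n))

w = cn_samples(n) * cn_samples(n)   # product of two i.i.d. CN(0, 1) variables

print(np.abs(np.mean(w)))        # close to 0: the product is zero-mean
print(np.mean(np.abs(w) ** 2))   # close to sigma^4 = 1: E|W|^2 = E|Z1|^2 E|Z2|^2
```

A histogram of `np.abs(w)` can then be compared against any proposed modulus density.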

Related

How to find the center point position of each Gaussian after the superposition of multiple Gaussian distributions?

Suppose the target is formed by the superposition of multiple Gaussian distributions with the same variance. To separate out each small Gaussian distribution inside, is there a simple solution?
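One standard approach, sketched below under the assumption that samples from the superposition are available, is to fit a Gaussian mixture with EM and read off the component means; `covariance_type='tied'` in scikit-learn enforces the shared variance stated in the question. The true means used here are synthetic stand-ins.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
true_means = np.array([[-3.0], [0.0], [4.0]])
# superposition of three equal-variance Gaussians, 500 samples each
X = np.concatenate([rng.normal(m, 1.0, size=(500, 1)) for m in true_means])

# 'tied' forces a single shared covariance across all components
gmm = GaussianMixture(n_components=3, covariance_type='tied',
                      random_state=0).fit(X)
print(np.sort(gmm.means_.ravel()))   # recovered centers, close to [-3, 0, 4]
```

Note this assumes the number of components is known; otherwise it can be selected with BIC/AIC.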

Gaussian Mixture Models for pixel clustering

I have a small set of aerial images where the different terrains visible in each image have been labelled by human experts. For example, an image may contain vegetation, river, rocky mountains, farmland, etc. Each image may have one or more of these labelled regions. Using this small labelled dataset, I would like to fit a Gaussian mixture model for each of the known terrain types. After this is complete, I would have N GMMs, one for each of the N terrain types I might encounter in an image.
Now, given a new image, I would like to determine, for each pixel, which terrain it belongs to by assigning the pixel to the most probable GMM.
Is this the correct line of thought? And if yes, how can I go about clustering an image using GMMs?
It's not clustering if you use labeled training data!
You can, however, use the labeling function of GMM clustering easily.
For this, compute the prior probabilities, the means, and the covariance matrices (and invert the covariance matrices). Then classify each pixel of the new image by the maximum probability density (weighted by the prior probabilities) using the multivariate Gaussians from the training data.
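A minimal sketch of this scheme, assuming pixel feature vectors grouped by class label (the class names and array shapes are hypothetical): fit one `GaussianMixture` per terrain class, then label each new pixel by the class with the highest prior-weighted log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_models(pixels_by_class, n_components=3):
    """pixels_by_class: dict mapping class name -> (n_pixels, n_features) array."""
    total = sum(len(p) for p in pixels_by_class.values())
    models, log_priors = {}, {}
    for name, pixels in pixels_by_class.items():
        models[name] = GaussianMixture(n_components=n_components,
                                       random_state=0).fit(pixels)
        log_priors[name] = np.log(len(pixels) / total)  # class prior
    return models, log_priors

def classify_pixels(pixels, models, log_priors):
    names = list(models)
    # score_samples gives the per-pixel log-density under each class's GMM;
    # adding the log prior implements the weighting described above
    scores = np.stack([models[n].score_samples(pixels) + log_priors[n]
                       for n in names])
    return np.array(names)[scores.argmax(axis=0)]
```

Applied to an image, `pixels` would be the image reshaped to `(height * width, n_features)`, and the returned labels reshaped back.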
Intuitively, your thought process is correct. If you already have the labels, that makes this a lot easier.
For example, let's pick a very well-known non-parametric algorithm: k-Nearest Neighbors (KNN) https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
In this algorithm, for each new pixel you find the k closest training pixels, where "closest" is determined by some distance function (usually Euclidean). You then assign the new pixel the most frequently occurring class label among those k neighbors.
I am not sure if you are looking for a specific algorithm recommendation, but KNN would be a very good algorithm to begin testing this type of exercise with. I saw you tagged sklearn; scikit-learn has a very good KNN implementation I suggest you read up on.
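A minimal KNN sketch with scikit-learn, as suggested above; the feature vectors and class names are synthetic stand-ins for labelled pixels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# synthetic labelled training pixels for two well-separated classes
X_train = np.vstack([rng.normal(0, 1, (100, 3)),    # e.g. "river" pixels
                     rng.normal(5, 1, (100, 3))])   # e.g. "rock" pixels
y_train = np.array(['river'] * 100 + ['rock'] * 100)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
pred = knn.predict(rng.normal(5, 1, (10, 3)))       # new pixels near "rock"
```

In practice the feature vector per pixel might be its color channels plus local texture statistics.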

How to generate a random number from a weird distribution

I am studying a random walk with drift and an absorbing boundary. The system is well understood theoretically. My task is to simulate it numerically, in particular to generate random numbers from this distribution, see the formula. It is the distribution of the coordinate x at time t given the starting point x_0, the noise intensity \sigma and the drift \mu. The question is how to generate random numbers from this distribution. I can of course use inverse transform sampling, but it is slow. Maybe I can make use of the fact that the probability density function is the difference of two Gaussian functions? Can I somehow relate my distribution to the normal distribution?
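Since the exact formula is only referenced, here is a hedged generic sketch: if the density has the form f(x) = g1(x) - c·g2(x) with g1 a Gaussian pdf and c·g2 ≥ 0, then f ≤ g1 pointwise, so g1 is a valid rejection-sampling envelope and no numerical CDF inversion is needed. The `mu`, `sigma`, and `f` below are placeholders for the question's parameters.

```python
import numpy as np
from scipy import stats

def sample_by_rejection(f, mu, sigma, n, rng):
    """Draw n samples from density f via rejection sampling, assuming the
    N(mu, sigma^2) pdf dominates f pointwise. For a density that is the
    difference of two Gaussian functions, the first (positive) Gaussian
    term is such a dominating envelope."""
    g = stats.norm(mu, sigma)
    out = np.empty(0)
    while out.size < n:
        x = rng.normal(mu, sigma, size=n)   # propose from the envelope
        u = rng.uniform(size=n)
        out = np.concatenate([out, x[u * g.pdf(x) <= f(x)]])  # accept w.p. f/g
    return out[:n]
```

The acceptance rate equals the total mass of f relative to the envelope, so this stays fast as long as the subtracted term does not remove most of the mass.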

How to select the most important features? Feature engineering

I used the function for Gower distance from this link: https://sourceforge.net/projects/gower-distance-4python/files/. My data (df) is such that each row is a trade and each column is a feature. Since it contains a lot of categorical data, I converted the data using Gower distance to measure "similarity"... I hope this is correct (as below):
D = gower_distances(df)
distArray = ssd.squareform(D)
hierarchal_cluster=scipy.cluster.hierarchy.linkage(distArray, method='ward', metric='euclidean', optimal_ordering=False)
I then plot the hierarchal_cluster from above as a dendrogram:
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
dendrogram(
hierarchal_cluster,
truncate_mode='lastp', # show only the last p merged clusters
p=15, # show only the last p merged clusters
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True # to get a distribution impression in truncated branches
)
I cannot show it, since I do not have enough privilege points, but on the dendrogram I can see separate colors.
What is the main discriminator separating them?
How can I find this out?
How can I use PCA to extract useful features?
Do I pass my 'hierarchal_cluster' into a PCA function?
Something like the below..?
pca = PCA().fit(hierarchal_cluster.T)
plt.plot(np.arange(1,len(pca.explained_variance_ratio_)+1,1),pca.explained_variance_ratio_.cumsum())
Note that PCA works only for continuous data. Since you mentioned there are many categorical features, it appears from what you have written that you have mixed data.
A common practice when dealing with mixed data is to separate the continuous and categorical features/variables, then find the Euclidean distance between data points for the continuous (or numerical) features and the Hamming distance for the categorical features [1].
This will enable you to find similarity between continuous and categorical features separately. Now, while you are at this, apply PCA on the continuous variables to extract important features, and apply Multiple Correspondence Analysis (MCA) on the categorical features. Thereafter, you can combine the obtained relevant features and apply any clustering algorithm.
So essentially, I'm suggesting feature selection/feature extraction before clustering.
[1] Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data mining and knowledge discovery, 2(3), pp.283-304.
Quoting the documentation of scipy on the matter of Ward linkage:
Methods ‘centroid’, ‘median’ and ‘ward’ are correctly defined only if Euclidean pairwise metric is used. If y is passed as precomputed pairwise distances, then it is a user responsibility to assure that these distances are in fact Euclidean, otherwise the produced result will be incorrect.
So you can't use Ward linkage with Gower!
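A small sketch of a linkage that is valid for precomputed non-Euclidean dissimilarities such as Gower; the condensed distance vector here is a random stand-in for the real `distArray`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# stand-in for a condensed vector of precomputed Gower distances (6 points)
rng = np.random.default_rng(0)
condensed = rng.uniform(0.1, 1.0, size=15)

# 'average' (UPGMA) and 'complete' linkage only use the pairwise
# dissimilarities themselves, unlike 'ward'/'centroid'/'median'
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
```

Note also that the `metric` argument is irrelevant when a condensed distance vector is passed; it only applies when raw observations are given.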

Training of SVM classifier using SIFT features

I would like to classify a set of images into 4 classes using SIFT descriptors and an SVM. Using the SIFT extractor, I get a different number of keypoints per image, e.g. img1 has 100 keypoints and img2 has 55. How do I build histograms that give fixed-size vectors in MATLAB?
In this case, perhaps dense SIFT is a good choice.
There are two main stages:
Stage 1: Creating a codebook.
Divide the input image into a set of sub-images.
Apply SIFT on each sub-image. Each keypoint will have a 128-dimensional feature vector.
Encode these vectors into a codebook by applying k-means clustering with a chosen k. Each image i (i <= n, where n is the number of images used to create the codebook) produces a matrix V_i of size 128 * m_i, where m_i is the number of keypoints gathered from that image. The input to k-means is therefore the big matrix V created by horizontal concatenation of all the V_i. The output of k-means is a codebook matrix C of size 128 * k.
Stage 2: Calculating Histograms.
For each image in the dataset, do the following:
Create a histogram vector h of size k and initialize it to zeros.
Apply dense SIFT as in step 2 of stage 1.
For each keypoint's vector, find the index of its "best match" vector in the codebook matrix C (e.g. the minimum Euclidean distance).
Increase the corresponding bin to this index in h by 1.
Normalize h by the L1 or L2 norm.
Now h is ready for classification.
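The two stages above can be sketched as follows (in Python with scikit-learn rather than MATLAB). The random 128-D descriptors are stand-ins for real dense SIFT output, and scikit-learn stores the codebook as k * 128 rather than the 128 * k orientation used in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_list, k=50):
    """Stage 1: descriptor_list holds one (m_i, 128) array per training image;
    k-means over the stacked descriptors yields the k codewords."""
    V = np.vstack(descriptor_list)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(V)

def bow_histogram(descriptors, codebook):
    """Stage 2: fixed-size k-bin histogram for one image, L1-normalised."""
    idx = codebook.predict(descriptors)   # nearest codeword per keypoint
    h = np.bincount(idx, minlength=codebook.n_clusters).astype(float)
    return h / h.sum()
```

Every image then maps to a length-k vector regardless of its keypoint count, which is exactly the fixed-size input an SVM needs.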
Another possibility is to use Fisher vectors instead of a codebook: https://hal.inria.fr/file/index/docid/633013/filename/jegou_aggregate.pdf
You will always get a different number of keypoints for different images, but the size of the feature vector of each descriptor remains the same, i.e. 128. People often use vector quantization or k-means clustering to build a bag-of-words histogram. You can have a look at this thread.
Using the conventional SIFT approach you will never have the same number of keypoints in every image. One way of achieving that is to sample the descriptors densely, using dense SIFT, which places a regular grid on top of the image. If all images have the same size, then you will have the same number of keypoints per image.
