Measure difference between two distributions - statistics

I have distance vectors from a sample program and am trying to quantify how similar the samples are. I have been using the Euclidean distance between sample groups (each value belongs to a bucket, and we compare bucket by bucket), which works fine, but too many comparisons need to be done for a large number of samples (a small sketch of this comparison follows the samples below).
I was wondering if there is an efficient way to build an index to compare the samples. The samples look like this:
Sample:1 = {25 0 17 3 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
Sample:2 = {25 1 16 2 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
Sample:3 = {25 3 16 2 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
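For concreteness, a minimal sketch of the bucket-by-bucket Euclidean comparison in R, assuming the samples are stored as the rows of a matrix:
samples <- rbind(c(25, 0, 17, 3, 5, rep(0, 16)),
                 c(25, 1, 16, 2, 6, rep(0, 16)),
                 c(25, 3, 16, 2, 4, rep(0, 16)))
# all pairwise Euclidean distances, compared bucket by bucket
dist(samples)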

There exist many ways to characterise the "difference between two distributions". A specific and targeted answer requires more details concerning e.g. the underlying probability distribution(s).
It all depends on how you define a difference between two distributions. To give you two ideas:
A Kolmogorov-Smirnov test is a non-parametric test that measures the "distance" between two cumulative/empirical distribution functions (a small sketch follows below).
The Kullback-Leibler divergence measures the "distance" between two distributions in the language of information theory as a change in entropy.
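For instance, a minimal sketch of the two-sample KS test in base R, using two hypothetical continuous samples:
set.seed(1)
x <- rnorm(100)            # hypothetical sample 1
y <- rnorm(100, mean = 1)  # hypothetical sample 2
# two-sample Kolmogorov-Smirnov test on the empirical CDFs
ks.test(x, y)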
Update [a year later]
Upon revisiting this post it might be important to emphasise a few things:
The standard two-sample Kolmogorov-Smirnov (KS) test assumes the underlying distribution to be continuous. For discrete data (which the data from the original post seems to be), an alternative may be to use a bootstrap version of the two-sample KS test, as in Matching::ks.boot. More details can be found on e.g. Cross Validated: Can I use Kolmogorov-Smirnov to compare two empirical distributions? and on Wikipedia: Two-sample Kolmogorov–Smirnov test.
Provided the sample data in the original post is representative, I don't think there will be a very meaningful answer from either a KS-statistic based test or the KL divergence (or really any other test for that matter). The reason is that the values from every sample are essentially all zero (to be precise, >80% of the values are zero). That, in combination with the small sample size of 21 values per sample, means that there is really not a lot "left" to characterise any underlying distribution.
In more general terms (and ignoring the limitations pointed out in the previous point), to calculate the KL divergence for all pairwise combinations one could do the following:
library(entropy)
library(tidyverse)
# the samples from the original post, collected in a list of count vectors
lst <- list(c(25, 0, 17, 3, 5, rep(0, 16)),
            c(25, 1, 16, 2, 6, rep(0, 16)),
            c(25, 3, 16, 2, 4, rep(0, 16)))
expand.grid(Var1 = seq_along(lst), Var2 = seq_along(lst)) %>%
    rowwise() %>%
    mutate(KL = KL.empirical(lst[[Var1]], lst[[Var2]]))
Since the KL divergence is not symmetric, both the upper and the lower triangular parts of the pairwise KL divergence matrix need to be calculated. To reduce compute time, one could instead use a symmetrised KL divergence, which only needs to be calculated for the upper or the lower triangular part of that matrix (the symmetrised versions themselves still involve both directed divergences, i.e. KL(1->2) and KL(2->1), but this can be done in one optimised routine); see the sketch below.
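As a minimal sketch of one symmetrised variant (a hypothetical helper, not part of the entropy package), reusing the lst defined above:
library(entropy)
# symmetrised KL divergence as the mean of the two directed divergences
sym_KL <- function(y1, y2) {
    0.5 * (KL.empirical(y1, y2) + KL.empirical(y2, y1))
}
# samples 2 and 3 have the same support, so both directed divergences are finite
sym_KL(lst[[2]], lst[[3]])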

Related

How to create a table of random 0's and 1's (with weights) in Excel?

I would like to create a random, square table of 0's and 1's like so
0 1 0
0 0 1
0 1 0
but only bigger (around 14 x 14). The diagonal should be all 0's and the table should be symmetric across the diagonal. For example, this would not be good:
0 1 1
0 0 0
0 1 0
I would also like to have control over the number of 1's that appear, or at least the probability that 1's appear.
I only need to make a few such tables, so this does not have to be fully automated by any means. I do not mind at all doing a lot of the work by hand.
If possible, I would greatly prefer doing this without coding in VBA (small code in cells is okay of course) since I do not know it at all.
Edit: amended so as to return a symmetrical array, as requested by the OP.
=LET(λ,RANDARRAY(ξ,ξ),IF(1-MUNIT(ξ),GESTEP(MOD(MMULT(λ,TRANSPOSE(λ)),1),1-ζ),0))
where ξ is the side length of the returned square array and ζ is the approximate probability of a non-diagonal entry within that array being unity.
As ξ increases, so does the accuracy of ζ.
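For example, for a 14 x 14 table in which each off-diagonal entry has roughly a 30% chance of being 1, one would set ξ = 14 and ζ = 0.3, giving (entered in a single cell as a dynamic array formula)
=LET(λ,RANDARRAY(14,14),IF(1-MUNIT(14),GESTEP(MOD(MMULT(λ,TRANSPOSE(λ)),1),0.7),0))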

In postgis, what is the difference between Z-Dimension and M-Dimension

Just as the title says: what is the difference between the Z-Dimension and the M-Dimension in PostGIS?
It is said that the Z-Dimension generally stores elevation information while the M-Dimension stores some other information.
However, is there any substantial difference between the Z-Dimension and the M-Dimension? For example, is there any difference in respect of the functions offered by PostGIS?
There is a difference in that PostGIS treats the Z coordinate differently in three-dimensional functions.
Consider this example:
SELECT st_3ddistance(
'SRID=4326;POINT ZM(0 0 0 100)'::geometry,
'SRID=4326;POINT ZM(0 0 0 0)'::geometry
);
st_3ddistance
═══════════════
0
(1 row)
SELECT st_3ddistance(
'SRID=4326;POINT ZM(0 0 100 0)'::geometry,
'SRID=4326;POINT ZM(0 0 0 0)'::geometry
);
st_3ddistance
═══════════════
100
(1 row)
The M dimension can be any measure associated with each point, for example the intensity of radioactive radiation at that location. Obviously, that does not factor into distance measurements. It is used in other functions like ST_InterpolatePoint, which interpolates the M dimension at a given point:
SELECT st_interpolatepoint(
'SRID=4326;LINESTRING ZM(0 0 0 100,50 50 10 0)'::geometry,
'SRID=4326;POINT(20 20)'::geometry
);
st_interpolatepoint
═════════════════════
60
(1 row)
Here the Z coordinate is ignored, since ST_InterpolatePoint is a function for 2-dimensional geometries.

What if my dataset contains only 0 and 1? Can I check for correlation for them and get the significant results in Excel?

My data set looks like this:
P T O
1 1 0
1 0 1
1 1 1
0 1 0
1 1 0
My doubt is that we only have two values, i.e. zero and one. That would logically mean that correlation cannot compute the level of significance. My assumption is that in this case the coefficient would be calculated based on the occurrence of the four combinations, i.e. {(1,1),(1,0),(0,1),(0,0)}, rather than on the magnitude of change in the variables. But conversely, if the coefficient works only on the magnitude of change, then is this the right method for my data set?
Could anyone tell me if I am on the right track of thought, or whether calculating such a coefficient yields no significance?
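To make the question concrete, here is a minimal sketch outside Excel (in R, with two made-up 0/1 columns). The Pearson correlation of two binary variables is the phi coefficient, which indeed depends only on the counts of the four combinations:
p <- c(1, 1, 1, 0, 1, 0, 0, 1, 1, 0)   # hypothetical binary column
o <- c(1, 0, 1, 1, 1, 0, 1, 1, 0, 0)   # hypothetical binary column
cor(p, o)                              # ordinary Pearson correlation
tab <- table(p, o)                     # counts of the four combinations
# phi coefficient from the 2 x 2 counts; equals cor(p, o)
(tab[2, 2] * tab[1, 1] - tab[2, 1] * tab[1, 2]) /
    sqrt(prod(rowSums(tab)) * prod(colSums(tab)))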

Canonical Pose of PointCloud

I made a strange observation and I'm asking myself if someone has an explanation.
Suppose we have a PointCloud, for example a perfect cylinder. This cylinder is somehow placed and rotated in space. Now for this cylinder we compute its centroid and eigenvectors with a principal component analysis.
Now I can describe its pose as an affine transformation of the form:
Eigen::Affine3d pose = Eigen::Affine3d::Identity();
pose.translation() << centroid[0], centroid[1], centroid[2];
pose.linear() = evecs;  // rotation part taken from the PCA eigenvectors
Now let's assume I want to transform the object into its unambiguous, canonical pose; then I would do the following:
obj.transform(obj.getPose().inverse());
By this I would transform the object into its local coordinate system and thus its canonical pose. And obj.getPose() would construct its new pose for me, again after recomputing its eigenvectors and centroid position (which should then be at (0,0,0)).
After I did this, however, it turned out that the object's new, canonical pose was sometimes this:
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
So the identity matrix. I would have expected this, but sometimes I got this:
-1 0 0 0
0 -1 0 0
0 0 1 0
0 0 0 1
I can't really explain this behaviour to myself, and it's causing problems for me as I'm comparing poses. Do you have an idea why this is happening?
Eigenvectors are not uniquely defined: if v is an eigenvector, then -v is also an eigenvector. This is exactly what you observe; flipping the sign of the first two eigenvectors while keeping the third gives diag(-1, -1, 1), i.e. a 180° rotation about the third axis.
Moreover, if you have a perfectly symmetric cylinder, then two eigenvalues should be equal, and the two respective eigenvectors can be arbitrarily chosen as a pair of orthogonal vectors within the plane orthogonal to the third eigenvector (axis of the cylinder).
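A quick numerical illustration of the sign ambiguity (an R sketch with a hypothetical random point cloud, outside the Eigen/PCL setting):
set.seed(42)
pts <- matrix(rnorm(300), ncol = 3)   # hypothetical point cloud, 100 x 3
C <- cov(pts)                         # covariance matrix used by PCA
e <- eigen(C)
v <- e$vectors[, 1]
# v and -v both satisfy C v = lambda v, so PCA may return either sign
all.equal(drop(C %*% v), e$values[1] * v)
all.equal(drop(C %*% (-v)), e$values[1] * (-v))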

Proper use of *apply with a function that takes strings as inputs

I am interested in applying the levenshteinSim function from the RecordLinkage package to vectors of strings (there's a good discussion of the function here).
Imagine that I have a vector called codes: "A", "B", "C", "D", etc.,
and a vector called tests: "A", "B", "C", "D", etc.
Using sapply to test a particular value in 'tests' against the vector of codes,
sapply(codes,levenshteinSim,str2=tests[1])
I would expect to get a list or vector (my apologies if I make terminological mistakes): [score1] [score2] [score3].
Unfortunately, the output is a test of the value in tests[1] against c("A","B","C","D", ...) -- a single value.
Ultimately, I want to *apply the two vectors against one another to produce a len1 x len2 matrix -- but I don't want to move forward until I understand what I'm doing wrong.
Can anyone provide guidance?
I'm not sure where the problem lies:
library(RecordLinkage)
codes <- c("A", "B", "C", "D")   # as in the question
test  <- c("A", "B", "C", "D")
sapply(codes, levenshteinSim, str2 = test)
     A B C D
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
When str2 is just one item, you get a length-4 vector.
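For the full len1 x len2 matrix the question is after, the same sapply pattern already does the job; a small sketch with labelled dimensions (hypothetical example vectors):
library(RecordLinkage)
codes <- c("ABC", "ABD", "XYZ")   # hypothetical codes
tests <- c("ABC", "XY", "ABDD")   # hypothetical tests
# one column per code, one row per value of tests, as in the output above
sim <- sapply(codes, levenshteinSim, str2 = tests)
rownames(sim) <- tests
sim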
