Finding deviation of coordinates placed on boundaries of a rectilinear shape - statistics

I have elements placed on the boundary of a rectilinear shape, as depicted in the diagram. The intention is to find a metric (e.g. average, standard deviation) which tells me how well the elements are clustered together or spread apart overall (as a whole for the orange elements, not per line grouping), and the factor by which each element loosely/strongly follows the grouping.
Ideally standard deviation would have done the job, since it provides the factor (x times) by which a value deviates from the mean (beautifully explained here: std-dev), but in this case standard deviation is not a good metric for the grouping, since the elements are not fundamentally meant to follow a center.
The data provided are the coordinates of the elements (the orange ones): (x1,y1),(x2,y2),(x3,y3),...,(xn,yn), and the coordinates of the rectilinear shape.

Compute the Euclidean distance from each point to its closest boundary line. Then consider either the mean distance or the root-mean-squared distance. (Both are zero in case of perfect alignment.)
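As a concrete sketch of this in Python, assuming the rectilinear shape is given as a list of boundary segments; the names `segments` and `points` are illustrative, not from the question:

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    p, a, b = map(np.asarray, (p, a, b))
    ab = b - a
    denom = np.dot(ab, ab)
    if denom == 0.0:  # degenerate segment: a == b
        return np.linalg.norm(p - a)
    t = np.clip(np.dot(p - a, ab) / denom, 0.0, 1.0)  # clamp projection onto segment
    return np.linalg.norm(p - (a + t * ab))

def alignment_metrics(points, segments):
    """Mean and RMS of each point's distance to its closest boundary segment."""
    d = np.array([min(point_segment_distance(p, a, b) for a, b in segments)
                  for p in points])
    return d.mean(), np.sqrt((d ** 2).mean())  # both are 0 for perfect alignment

# Example: a unit-square boundary, one point on it and one slightly inside.
square = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1)), ((0, 1), (0, 0))]
mean_d, rms_d = alignment_metrics([(0.5, 0.0), (0.5, 0.2)], square)
print(mean_d, rms_d)  # 0.1, ~0.1414
```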

Related

Unsupervised Outlier detection

I have 6 points in each row and around 20k such rows. The points in each row are actually points on a curve, and the nature of the curve is the same for every row (say a sigmoidal curve or a straight line, etc.). The 6 points may have different x-values in each row. I also know a point (a,b) for each row which that curve should pass through. How should I go about finding the rows which may be anomalous or show unexpected behaviour compared to the other rows? I was thinking of curve fitting, but I only have 6 points for each curve. All I know is that the majority of rows have the same kind of curve, so I can perhaps fit a general curve for all the rows and use a distance threshold for outlier detection.
What happens if you just treat the 6 points as a 12-dimensional vector and run any of the usual outlier detection methods, such as LOF or LoOP?
It's easy to see the relationship between the Euclidean distance on the 12-dimensional vectors and the 6 point-wise Euclidean distances, so this will compare the similarity of the curves.
You can of course also define a more complex distance function for LOF.
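As a minimal sketch of that suggestion, assuming the rows are available as a NumPy array of shape (20000, 6, 2) (the data here is randomly generated purely for illustration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
rows = rng.normal(size=(20000, 6, 2))     # 20k rows, 6 (x, y) points each
X = rows.reshape(len(rows), -1)           # flatten -> shape (20000, 12)

lof = LocalOutlierFactor(n_neighbors=20)  # default contamination='auto'
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_    # higher = more anomalous

print("flagged rows:", np.flatnonzero(labels == -1)[:10])
```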

Why do Principal Components of the Covariance matrix capture maximum variance of the variables?

I am trying to understand PCA; I went through several tutorials. So far I understand that the eigenvectors of a matrix give the directions in which vectors are scaled rather than rotated when multiplied by that matrix, in proportion to the eigenvalues. Hence the eigenvector associated with the maximum eigenvalue defines the direction of maximum scaling. I understand that along the principal component the variation is maximal and the reconstruction error is minimal. What I do not understand is:
why does finding the eigenvectors of the covariance matrix correspond to the axes along which the original variables are best represented?
In addition to tutorials, I reviewed other answers here including this and this. But still I do not understand it.
Your premise is incorrect. PCA (and eigenvectors of a covariance matrix) certainly don't represent the original data "better".
Briefly, the goal of PCA is to find some lower dimensional representation of your data (X, which is in n dimensions) such that as much of the variation is retained as possible. The upshot is that this lower dimensional representation is an orthogonal subspace and it's the best k dimensional representation (where k < n) of your data. We must find that subspace.
Another way to think about this: given a data matrix X, find a matrix Y such that Y is a k-dimensional projection of X. To find the best projection, we can minimize the difference between X and Y, which in matrix-speak means minimizing ||X - Y||^2.
Since Y is just a projection of X onto fewer dimensions, we can write Y = X v v^T, where v v^T is a lower-rank projection matrix. (Google "matrix rank" if this doesn't make sense.) We know X v has a lower dimension than X, but we don't know which direction it points.
To find that direction, we look for the v that minimizes ||X - X v v^T||^2. This is equivalent to maximizing ||X v||^2 = v^T X^T X v, and X^T X is (up to a constant factor) the sample covariance matrix of your data. This is mathematically why we care about the covariance of the data. It also turns out that the v that does this best is an eigenvector. There is one eigenvector for each dimension of the lower dimensional projection/approximation, and these eigenvectors are orthogonal.
Remember, if v_i and v_j are orthogonal eigenvectors of the covariance matrix, then the covariance between the projections X v_i and X v_j is 0. So think of a matrix with non-zero entries on the diagonal and zeros off the diagonal: that is the covariance matrix of the data expressed in the eigenvector basis.
Hopefully that helps bridge the connection between covariance matrix and how it helps to yield the best lower dimensional subspace.
Again, the eigenvectors don't "better define" our original variables. The axes determined by applying PCA to a dataset are linear combinations of our original variables that exhibit maximum variance and produce the closest possible approximation to our original data (as measured by the l2 norm).
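A small numeric sketch of this argument, with made-up data: center X, eigendecompose its sample covariance, and reconstruct from the top k eigenvectors (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated data
X = X - X.mean(axis=0)                                   # PCA assumes centered data

C = (X.T @ X) / (len(X) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric; eigenvalues ascending
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order]                 # columns = eigenvectors, by decreasing variance

k = 2
Vk = V[:, :k]
X_hat = X @ Vk @ Vk.T                 # project to k dims and reconstruct
print("reconstruction error:", np.linalg.norm(X - X_hat) ** 2)
print("variance captured:", eigvals[order][:k].sum() / eigvals.sum())
```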

Determine if square cell is inside polygon

For instance, I want the grid cells (squares) inside or partially inside polygon #14 to be associated with #14. Is there an algorithm that will compute if a square is inside a polygon efficiently?
I have the coordinates of the edges that make up the polygon.
If I understand it correctly, this is an implementation of Fortune's algorithm in JavaScript which takes a set of 2-d points (called sites) and returns a structure containing the data of a Voronoi diagram computed for these points. It returns the polygons in a list called cells. It seems to use Euclidean distance as the measurement; if so, we know that the polygons are always convex (see the Formal definition section of the Voronoi wiki page).
Now, these are options to solve this problem (from hardest to simplest):
1. Polygon Clipping:
Form a polygon for the square.
Find its intersection with all cells.
Calculate the area of each intersection.
Find the largest intersection.
2. Point in Polygon:
You can also simply find the cell in which the center of the square lies. Ray casting is a robust PIP algorithm, although there's a simpler approach for convex polygons (see the Convex Polygons section here); a minimal sketch of this option appears at the end of this answer.
3. Distance Between Points:
If you know the site associated with each cell, then you just need to calculate the distance from the center of the square to all sites. Regardless of which distance measurement was used to compute the Voronoi diagram, the center of the square lies inside the cell whose associated site is nearest, since this is precisely the idea behind partitioning the plane in a Voronoi diagram.
Exceptions:
The first approach is computationally expensive but the most accurate. The second and third options work fine in most cases; however, there are exceptions where they fail to decide correctly.
The second and third are pretty much alike, but the downside of PIP is the case where a point lies on an edge of the polygon, which costs extra overhead to detect.
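For reference, here is a minimal sketch of options 2 and 3 in Python; `polygon`, `sites`, and the example values are illustrative assumptions, not data from the question:

```python
def point_in_polygon(p, polygon):
    """Ray casting: count how often a ray from p to the right crosses an edge."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y-coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:       # crossing lies to the right of p
                inside = not inside
    return inside

def nearest_site(p, sites):
    """Option 3: the square's center belongs to the cell of the closest site."""
    return min(range(len(sites)),
               key=lambda i: (p[0] - sites[i][0]) ** 2 + (p[1] - sites[i][1]) ** 2)

# Example: assign a grid cell by its center point.
cell_center = (2.5, 1.5)
poly_14 = [(0, 0), (5, 0), (5, 3), (0, 3)]
print(point_in_polygon(cell_center, poly_14))       # True
print(nearest_site(cell_center, [(0, 0), (4, 2)]))  # 1
```

As noted above, points lying exactly on an edge would need extra handling in the PIP version.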

Principal Component Analysis: why are we finding the max of the vectors?

I'm attempting to understand PCA but I'm stuck on a specific part. After it was referenced in the Harvard data science course, I looked it up here: https://en.wikipedia.org/wiki/Principal_component_analysis
Under Details, just below "First component", they say "the first loading vector w(1) thus has to satisfy" the line below, and I want to understand why it is true:

w(1) = arg max_{||w|| = 1} { Σ_i (x_(i) · w)^2 } = arg max_{||w|| = 1} ||X w||^2

The arg max with ||w|| = 1 means we're finding the maximum value of the summation when w is a unit vector. But I don't understand why we want that, or how the values are expected to change, given a fixed matrix X. Unless we're trying to optimize which weights to dot with each row?
Or do we just do this to get it into Rayleigh quotient form, so we can then use the eigenvalues to find the eigenvector associated with the largest eigenvalue of the matrix?
And why do we want that direction in the first place? In our transformed axes, are we just showing the largest variance in each dimension? Wouldn't we want to transform all points and attempt to see some correlation?
In a sense, the eigenvector with the largest eigenvalue is pointing in the direction of the maximum variance. The one with the second largest eigenvalue is pointing in the direction of maximum variance that is left after accounting for the first one. The eigenvector with the second largest eigenvalue will be orthogonal to the one with the largest eigenvalue. Take another look at the Wikipedia article you referenced, and look for the diagram in the upper right. The longer line is for the eigenvector with the largest eigenvalue and it points along the maximum variance in the data. The shorter line is the eigenvector that has the second largest eigenvalue, and it points along the maximum remaining variance, orthogonal to the first line.
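A quick numeric check of this, with made-up data: the variance of the data projected onto the eigenvector with the largest eigenvalue is at least the variance along any other unit direction:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=2000)
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
w1 = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue

# Compare variance along w1 with variance along many other unit directions.
var_w1 = np.var(X @ w1)
thetas = np.linspace(0, np.pi, 180)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
print(var_w1 >= np.var(X @ dirs.T, axis=0).max() - 1e-9)  # True
```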

Find contour of 2D unorganized pointcloud

I have a set of 2D points, unorganized, and I want to find the "contour" of this set (not the convex hull). I can't use alpha shapes because I have a speed objective (less than 10ms on an average computer).
My first approach was to compute a grid and find the outline squares (squares which have an empty square as a neighbor). I think this efficiently downsized my number of points (from roughly 22,000 to 3,000). But I still need to refine this new set.
My question is: how do I find the real outline points among my green points?
After a weekend full of reflection, I may have found a convenient solution.
So we need a grid, we need to fill it with our points, no difficulty here.
We have to decide which squares are considered "Contour". Our criterion is: at least one empty neighbor and at least 3 non-empty neighbors.
We lack connectivity information, so we choose a "Contour" square which has 2 "Contour" neighbors or fewer, and we pick one of those neighbors. From there, we can start the expansion: we circle around the current square to find the next "Contour" square, knowing the previous "Contour" squares. Our contour criterion prevents us from hitting a dead end.
We now have vectors of connected squares, and normally, if our shape doesn't have a hole, only one vector of connected squares!
Now, for each square, we need to find the best point for the contour. We select the one which is farthest from the barycenter of the whole plane; it works for most shapes. Another technique is to compute the barycenter of the empty neighbors of the selected square and choose the nearest point. A sketch of the grid stage appears below.
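A minimal sketch of this grid stage in Python, omitting the connectivity/ordering step; the cell size and all names are illustrative:

```python
import numpy as np

def contour_points(points, cell=0.05):
    points = np.asarray(points)
    ij = np.floor(points / cell).astype(int)  # grid index of every point
    occupied = set(map(tuple, ij))

    def neighbors(c):
        i, j = c
        return [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)]

    # "Contour" squares: at least one empty and at least 3 non-empty neighbors.
    contour = {c for c in occupied
               if any(n not in occupied for n in neighbors(c))
               and sum(n in occupied for n in neighbors(c)) >= 3}

    # Per contour square, keep the point farthest from the overall barycenter.
    barycenter = points.mean(axis=0)
    best = {}
    for p, c in zip(points, map(tuple, ij)):
        if c in contour:
            d = np.linalg.norm(p - barycenter)
            if c not in best or d > best[c][0]:
                best[c] = (d, p)
    return np.array([p for _, p in best.values()])

pts = np.random.default_rng(3).random((22000, 2))  # illustrative point cloud
print(len(contour_points(pts)))                    # candidate outline points
```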
The red points are the contour of the green ones. The technique used here is the plane-barycenter one.
For a set of 28,000 points, this technique takes 8 ms. CGAL's alpha shapes would take 125 ms on average for 28,000 points.
PS: I hope I made myself clear; English is not my mother tongue :s
You really should use alpha shapes. Maybe use only the green points as input to the alpha shape algorithm.