k-means with ellipsoids

I have n points in R^3 that I want to cover with k ellipsoids or cylinders (I don't really care; whichever is easier). I want to approximately minimize the union of the volumes. Let's say n is tens of thousands and k is a handful. Development time (i.e. simplicity) is more important than runtime.
Obviously I can run k-means and use perfect balls for my ellipsoids. Or I can run k-means, then use minimum enclosing ellipsoids per cluster rather than covering with balls, though in the worst case that's no better. I've seen talk of handling anisotropy with k-means but the links I saw seemed to think I had a tensor in hand; I don't, I just know the data will be a union of ellipsoids. Any suggestions?
[Edit: There are a couple of votes for fitting a mixture of multivariate Gaussians, which seems like a viable thing to try. Firing up an EM code to do that won't minimize the volume of the union, but of course k-means doesn't minimize volume either.]

So you likely know k-means is NP-hard, and this problem is even more general (harder). Since you want ellipsoids, it might make a lot of sense to fit a mixture of k multivariate Gaussian distributions. You would probably want to find a maximum-likelihood solution, which is a non-convex optimization, but at least it's easy to formulate and there is likely code available.
Other than that, you're likely to have to write your own heuristic search algorithm from scratch, which is a huge undertaking.
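For concreteness, here is a minimal sketch of that idea using scikit-learn's GaussianMixture; the points array, k, and the n_std coverage radius are placeholders rather than anything from the question, and (as noted in the edit) this maximizes likelihood, it does not minimize the volume of the union.

import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_ellipsoids(points, k, n_std=2.5):
    # Fit k full-covariance Gaussians by EM, then read a covering ellipsoid off
    # each component: eigenvectors of the covariance give the axes, and
    # n_std * sqrt(eigenvalue) gives the semi-axis lengths.
    gmm = GaussianMixture(n_components=k, covariance_type="full").fit(points)
    ellipsoids = []
    for mean, cov in zip(gmm.means_, gmm.covariances_):
        eigvals, eigvecs = np.linalg.eigh(cov)
        ellipsoids.append((mean, eigvecs, n_std * np.sqrt(eigvals)))
    return ellipsoids

# Synthetic usage example: two elongated blobs in R^3.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(c, (2.0, 0.5, 0.3), size=(5000, 3))
                    for c in ((0, 0, 0), (10, 0, 0))])
for center, axes, radii in gaussian_ellipsoids(points, k=2):
    print(center.round(2), radii.round(2))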

I did something similar with multivariate Gaussians using this method. The authors use kurtosis as the split measure, and I found it to be a satisfactory method for my application: clustering points obtained from a laser range finder (i.e. computer vision).

If the ellipsoids can overlap a lot, then methods like k-means that try to assign points to single clusters won't work very well. Part of each ellipsoid has to fit the surface of your object, but the rest may be inside it, don't-cares. That is, covering algorithms seem to me quite different from clustering / splitting algorithms; unions are not splits.
Gaussian mixtures with lots of overlaps? No idea, but see the picture and code in Numerical Recipes, p. 845.
Coverings are hard even in 2d, see find-near-minimal-covering-set-of-discs-on-a-2-d-plane.

Related

How do I analyze the change in the relationship between two variables?

I'm working on a simple project in which I'm trying to describe the relationship between two positively correlated variables and determine if that relationship is changing over time, and if so, to what degree. I feel like this is something people probably do pretty often, but maybe I'm just not using the correct terminology because google isn't helping me very much.
I've plotted the variables on a scatter plot and know how to determine the correlation coefficient and plot a linear regression. I thought this might be a good first step because the linear regression tells me what I can expect y to be for a given x value. This means I can quantify how "far away" each data point is from the regression line (I think this is called the squared error?). Now I'd like to see what the error looks like for each data point over time. For example, if I have 100 data points and the most recent 20 are much farther away from where the regression line/function says they should be, maybe I could say that the relationship between the variables is showing signs of changing? Does that make any sense at all or am I way off base?
I have a suspicion that there is a much simpler way to do this and/or that I'm going about it in the wrong way. I'd appreciate any guidance you can offer!
I can suggest two strands of literature that study changing relationships over time. Typing these names into google should provide you with a large number of references so I'll stick to more concise descriptions.
(1) Structural break modelling. As the name suggests, this assumes that there has been a sudden change in parameters (e.g. a correlation coefficient). This is applicable if there has been a policy change, a change in measurement device, etc. The estimation approach is indeed very close to the procedure you suggest: you would estimate the squared error (or some other measure of fit) on the full sample and on the two sub-samples (before and after the break). If the gains in fit are large when dividing the sample, then you would favour the model with the break and use different coefficients before and after the structural change (see the sketch after point (2)).
(2) Time-varying coefficient models. This approach is more subtle as coefficients will now evolve more slowly over time. These changes can originate from the time evolution of some observed variables or they can be modeled through some unobserved latent process. In the latter case the estimation typically involves the use of state-space models (and thus the Kalman filter or some more advanced filtering techniques).
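As a rough illustration of (1), the sketch below (variable names, the synthetic data and the candidate break index are made up, not taken from your data) compares the squared error of one regression over the full sample with the combined error of separate fits before and after a candidate break.

import numpy as np

def sse(x, y):
    # Sum of squared errors of a least-squares line y ~ a*x + c.
    a, c = np.polyfit(x, y, 1)
    return np.sum((y - (a * x + c)) ** 2)

# Synthetic example: the slope changes at observation 60.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = np.r_[1.0 * x[:60], 2.5 * x[60:]] + rng.normal(scale=0.3, size=100)

b = 60  # candidate break index
full_fit = sse(x, y)
split_fit = sse(x[:b], y[:b]) + sse(x[b:], y[b:])
print(f"SSE, single regression: {full_fit:.1f}; SSE with break at {b}: {split_fit:.1f}")
# A large drop in SSE when allowing the break favours the model with different
# coefficients before and after (the Chow test formalizes this comparison).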
I hope this helps!

Finding powerlines in LIDAR point clouds with RANSAC

I'm trying to find powerlines in LIDAR point clouds with skimage.measure's ransac() function. This is my very first time meddling with these modules in Python, so bear with me.
So far all I knew how to do reliably was filtering low or 'ground' points from the cloud to reduce the number of points to deal with.
import laspy

def filter_Z(las, threshold):
    # Keep only points higher than the lowest Z value plus a hand-picked threshold.
    filtered = laspy.create(point_format=las.header.point_format, file_version=las.header.version)
    filtered.points = las.points[las.Z > las.Z.min() + threshold]
    print(f'original size: {len(las.points)}')
    print(f'filtered size: {len(filtered.points)}')
    filtered.write('filtered_points2.las')
    return filtered
The threshold is something I put in by hand, since the las files I worked with contain some nasty outliers that prevent me from calculating it dynamically.
The filtered point cloud, or at least one of them, looks like this:
Note the evil red outliers on top, maybe they're birds or something. Along with them are trees and roofs of buildings. If anyone wants to take a look at the .las files, let me know. I can't put a wetransfer link in the body of the question.
A top down view:
I've looked into it as much as I could, and found the skimage.measure module and the ransac function that comes with it. I played around a bit to get a feel for it and currently I'm stumped on how to continue.
from skimage.measure import LineModelND, ransac

def ransac_linefit_sklearn(points):
    # Fit a single robust line to the whole cloud; residual_threshold is in the
    # same units as the point coordinates.
    model_robust, inliers = ransac(points, LineModelND, min_samples=2, residual_threshold=1000, max_trials=1000)
    return model_robust, inliers
The result is quite predictable (I ran ransac on a 2D view of the cloud just to make it a bit easier on the PC).
Using this doesn't really yield any good results in examples like the one I posted. The vegetation clusters have too many points and the line is fitted through it because it has the highest point density.
I tried DBSCAN() to cluster up the points but it didn't work. I also attempted OPTICS() but as I write it still hasn't finished running.
From what I've read on various articles, the best course of action would be to cluster up the points and perform RANSAC on each individual cluster to find lines, but I'm not really sure on how to do that or what clustering method to use in situations like these.
One thing I'm also curious about is simply filtering out the big blobs of trees that mess with model fitting.
Inadequacy of RANSAC
RANSAC works best when your data fits a mono-modal distribution around your model. In the case of this point cloud, it works best when there is only one line with outliers, but there are at least 5 lines when viewed from a bird's-eye perspective. Check out this older SO post that discusses your problem. Francesco's response suggests an iterative RANSAC-based approach.
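I don't have Francesco's code, but a minimal sketch of that iterative idea with skimage (fit one line, peel off its inliers, repeat) could look like this; the thresholds and stopping rule are placeholders you would have to tune:

import numpy as np
from skimage.measure import LineModelND, ransac

def iterative_line_ransac(points, max_lines=10, residual_threshold=0.5, min_inliers=100):
    # Repeatedly fit a robust line, store it, and remove its inliers so the
    # next iteration can find the next power line.
    lines = []
    remaining = np.asarray(points)
    for _ in range(max_lines):
        if len(remaining) < min_inliers:
            break
        model, inliers = ransac(remaining, LineModelND, min_samples=2,
                                residual_threshold=residual_threshold, max_trials=1000)
        if inliers.sum() < min_inliers:
            break  # nothing sufficiently line-like is left
        lines.append((model, remaining[inliers]))
        remaining = remaining[~inliers]
    return lines

# Usage: lines = iterative_line_ransac(xyz) with xyz an (N, 3) array of filtered points.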
Octrees and SVD
Colleagues worked on a similar problem in my previous job. I am not fluent in the approach, but I know enough to provide some hints.
Their approach resembled Francesco's suggestion. They partitioned the point-cloud into octrees and calculated the singular value decomposition (SVD) within each partition. The three resulting singular values will correspond to the geometric distribution of the data.
If the first singular value is significantly greater than the other two, then the points are line-like.
If the first and second singular values are significantly greater than the third, then the points are plane-like.
If all three values are of similar magnitude, then the data is just a "glob" of points.
They used these rules iteratively to rule out which points were most likely NOT part of the lines.
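I am guessing at the details, but the core of that test is easy to sketch; the ratio threshold below is made up for illustration:

import numpy as np

def classify_partition(points, ratio=5.0):
    # Label one partition of 3D points as line-, plane- or glob-like by
    # comparing its singular values (sorted so that s[0] >= s[1] >= s[2]).
    centered = points - points.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    if s[0] > ratio * s[1]:
        return "line"   # one dominant direction
    if s[1] > ratio * s[2]:
        return "plane"  # two dominant directions
    return "glob"       # no dominant direction

# The partitions could come from an octree or simply from a coarse voxel grid;
# keep only partitions classified as "line" before running RANSAC.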
Literature
If you want to look into published methods, maybe this paper is a good starting point. Power lines are modeled as hyperbolic functions.

Separating points following different linear regressions

Given two variables with the same number of observations, the scatter plot apparently shows that they follow three linear regressions. How could you separate the observations into three groups with different linear fits?
There exist specialized clustering algorithms for this.
Google for "correlation clustering".
If they all go through 0, then it may be easier to apply the proper feature transformation to make them separable. So don't neglect preprocessing. It's the most important part.
I would calculate the slope of the segment between every pair of points, so with n points you get n(n-1)/2 slope values, and then use a clustering algorithm.
It is the same idea that underlies the Theil–Sen estimator.
It just came to my mind and seems worth a try (see the sketch below).
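A minimal sketch of that idea on synthetic data; a plain histogram stands in for the clustering step, which is my simplification:

import numpy as np
from itertools import combinations

# Synthetic data: three lines with slopes 0.7, 2.3 and -1.5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 150)
true_slopes = np.r_[np.full(50, 0.7), np.full(50, 2.3), np.full(50, -1.5)]
intercepts = np.r_[np.full(50, 2.0), np.full(50, -1.0), np.full(50, 5.0)]
y = true_slopes * x + intercepts + rng.normal(scale=0.05, size=150)

# Slope of every pair: within-line pairs pile up at the three true slopes,
# cross-line pairs form a diffuse background.
i, j = np.array(list(combinations(range(len(x)), 2))).T
dx = x[j] - x[i]
keep = np.abs(dx) > 1e-9  # drop (near-)vertical pairs
slopes = (y[j] - y[i])[keep] / dx[keep]

counts, edges = np.histogram(slopes, bins=100, range=(-10, 10))
top = np.sort(np.argsort(counts)[-3:])  # the three most populated bins
print("candidate slopes:", (edges[top] + edges[top + 1]) / 2)
# Real data may need a smoothed density or a proper clustering of the slope values.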
This seems to be a mixture of regressions. There are several packages to do this; one of them is FlexMix, though I did not find it very satisfying. I show what I got and what I expected below.
I think I solved the problem, at least partly. We can use the R package flexmix to achieve this, as the lowest panel shows. The package also works fine with another two groups of data whose fits are known; the separation rate can reach as high as 90%, with fitted coefficients close to the known ones.

Interpolation technique for weirdly spaced point data

I have a spatial dataset consisting of a large number of point measurements (n = 10^4) that were taken along regular grid lines (500 m x 500 m) and along some arbitrary lines and blocks in between. Single measurements were taken with a spacing of about 0.3-1.0 m (varying) along these lines (see the example showing every 10th point).
The data can be assumed to be normally distributed but show strong small-scale variability in some regions. There is also a trend with elevation (r = 0.5) that can easily be removed.
Regardless of the coding platform, I'm looking for a good or "the optimal" way to interpolate these points to a regular 25 x 25 m grid over the entire area of interest (5000 x 7000 m). I know about the wide range of kriging techniques, but I wondered whether somebody has a specific idea on how to handle the "oversampling along lines" combined with the rather large gaps between the lines.
Thank you for any advice!
Leo
Kriging does not perform well when the points to interpolate are taken on a regular grid, because you need a wide range of different inter-point distances to estimate the covariance model well.
Your case is a bit particular... The oversampling along the lines is not a problem at all. The main problem is the big holes you have in your grid. I think these holes will create problems whatever interpolation technique you use.
However, it is difficult to predict a priori whether kriging will behave well, so I advise you to try it anyway.
Kriging is only suited for interpolating. You cannot extrapolate with a kriging metamodel, so you won't be able to predict values in, for example, the bottom-left part of your figure (because you have no points there).
To perform kriging, I advise you to use the following tools (depending on the language you're most familiar with):
DiceKriging package in R (the one I use preferably)
fields package in R (which is more specialized in spatial fields)
DACE toolbox in MATLAB
Bonus: a link to a reference book about kriging which is available online: http://www.gaussianprocess.org/
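If you prefer Python, scikit-learn's GaussianProcessRegressor gives a comparable kriging-style interpolator; a minimal sketch with made-up coordinates and a kernel you would need to tune for your data:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic placeholder measurements (locations in metres, arbitrary values).
rng = np.random.default_rng(0)
xy = rng.uniform(0, 5000, size=(500, 2))
z = np.sin(xy[:, 0] / 800) + 0.1 * rng.normal(size=500)

kernel = 1.0 * RBF(length_scale=500.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(xy, z)

# Predict onto a regular 25 m grid; the returned standard deviation highlights
# the big gaps between survey lines where the interpolation is unreliable.
gx, gy = np.meshgrid(np.arange(0, 5000, 25), np.arange(0, 7000, 25))
grid = np.c_[gx.ravel(), gy.ravel()]
z_hat, z_std = gp.predict(grid, return_std=True)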
PS: This type of question is more statistics oriented than programming and may be better suited to the stats.stackexchange.com website.

How do I measure the distribution of an attribute of a given population?

I have a catalog of 900 applications.
I need to determine how their reliability is distributed as a whole. (i.e. is it normal).
I can measure the reliability of an individual application.
How can I determine the reliability of the group as a whole without measuring each one?
That's a pretty open-ended question! Overall, distribution fitting can be quite challenging and works best with large samples (hundreds or even thousands). It's generally better to pick a modeling distribution based on known characteristics of the process you're attempting to model than to try purely empirical fitting.
If you're going to go empirical, for a start you could take a random sample, measure the reliability scores (whatever you're using for that) of your sample, sort them, and plot them vs normal quantiles. If they fall along a relatively straight line the normal distribution is a plausible model, and you can estimate sample mean and variance to parameterize it. You can apply the same idea of plotting vs quantiles from other proposed distributions to see if they are plausible as well.
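A minimal sketch of that quantile plot in Python, with simulated placeholder scores rather than your real measurements:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder sample of reliability scores (bounded and skewed on purpose).
scores = np.random.default_rng(0).beta(8, 2, size=100)

stats.probplot(scores, dist="norm", plot=plt)  # sorted data vs. normal quantiles
plt.title("Normal Q-Q plot of sampled reliability scores")
plt.show()
# Points hugging the straight line support a normal model; systematic curvature
# (likely here, since the scores are bounded and skewed) argues for another distribution.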
Watch out for behavior in the tails, in particular. Pretty much by definition the tails occur rarely and may be under-represented in your sample. Like all things statistical, the larger the sample size you can draw on the better your results will be.
I'd also add that my prior belief is that a normal distribution wouldn't be a great fit. Your reliability scores probably fall on a bounded range and tend towards one side or the other of that range. If they tend towards the high end, I'd predict that they get lopped off there and have a long tail to the low side, and vice versa if they tend towards the low end.
