Calculating the distance between each pair of a set of points

Calculating the distance between each pair of a set of points - geometry

So I'm working on simulating a large number of n-dimensional particles, and I need to know the distance between every pair of points. Allowing for some error, and given the distance isn't relevant at all if exceeds some threshold, are there any good ways to accomplish this? I'm pretty sure if I want dist(A,C) and already know dist(A,B) and dist(B,C) I can bound it by [dist(A,B)-dist(B,C) , dist(A,B)+dist(B,C)], and then store the results in a sorted array, but I'd like to not reinvent the wheel if there's something better.
I don't think the number of dimensions should greatly affect the logic, but maybe for some solutions it will. Thanks in advance.

If the problem was simply about calculating the distances between all pairs, then it would be a O(n^2) problem without any chance for a better solution. However, you are saying that if the distance is greater than some threshold D, then you are not interested in it. This opens the opportunities for a better algorithm.
For example, in 2D case you can use the sweep-line technique. Sort your points lexicographically, first by y then by x. Then sweep the plane with a stripe of width D, bottom to top. As that stripe moves across the plane new points will enter the stripe through its top edge and exit it through its bottom edge. Active points (i.e. points currently inside the stripe) should be kept in some incrementally modifiable linear data structure sorted by their x coordinate.
Now, every time a new point enters the stripe, you have to check the currently active points to the left and to the right no farther than D (measured along the x axis). That's all.
The purpose of this algorithm (as it is typically the case with sweep-line approach) is to push the practical complexity away from O(n^2) and towards O(m), where m is the number of interactions we are actually interested in. Of course, the worst case performance will be O(n^2).
The above applies to 2-dimensional case. For n-dimensional case I'd say you'll be better off with a different technique. Some sort of space partitioning should work well here, i.e. to exploit the fact that if the distance between partitions is known to be greater than D, then there's no reason to consider the specific points in these partitions against each other.

If the distance beyond a certain threshold is not relevant, and this threshold is not too large, there are common techniques to make this more efficient: limit the search for neighbouring points using space-partitioning data structures. Possible options are:
Binning.
Trees: quadtrees(2d), kd-trees.
Binning with spatial hashing.
Also, since the distance from point A to point B is the same as distance from point B to point A, this distance should only be computed once. Thus, you should use the following loop:
for point i from 0 to n-1:
for point j from i+1 to n:
distance(point i, point j)
Combining these two techniques is very common for n-body simulation for example, where you have particles affect each other if they are close enough. Here are some fun examples of that in 2d: http://forum.openframeworks.cc/index.php?topic=2860.0
Here's a explanation of binning (and hashing): http://www.cs.cornell.edu/~bindel/class/cs5220-f11/notes/spatial.pdf

Related

How to approximate coordinates basing on azimuths?

Suppose I have a series of (imperfect) azimuth readouts, giving me vague angles between a number of points. Lines projected from points A, B, C obviously [-don't-always-] never converge in a single point to define the location of point D. Hence, angles as viewed from A, B and C need to be adjusted.
To make it more fun, I might be more certain of the relative positions of specific points (suppose I locate them on a satellite image, or I know for a fact they are oriented perfectly north-south), so I might want to use that certainty in my calculations and NOT adjust certain angles at all.
By what technique should I average the resulting coordinates, to achieve a "mostly accurate" overall shape?
I considered treating the difference between non-adjusted and adjusted angles as "tension" and trying to "relieve" it in subsequent passes, but that approach gives priority to points calculated earlier.
Another approach could be to calculate the total "tension" in the set, then shake all angles by a random amount, see if that resulted in less tension, and repeat for possibly improved results, trying to evolve a possibly better solution.

As I understand it you have a bunch of unknown points (p[] say) and a number of measurements of azimuths, say Az[i,j] of p[j] from p[i]. You want to find the coordinates of the points.
You'll need to fix one point. This is because if the values of p[] is a solution -- i.e. gave the measured azimuths -- so too is q[] where for some fixed x,
q[i] = p[i] + x
I'll suppose you fix p[0].
You'll also need to fix a distance. This is because if p[] is a solution, so too is q[] where now for some fixed s,
q[i] = p[0] + s*(p[i] - p[0])
I'll suppose you fix dist(p[0], p[1]), and that there is and azimuth Az[1,2]. You'd be best to choose p[0] p[1] so that there is a reliable azimuth between them. Then we can compute p[1].
The usual way to approach such problems is least squares. That is we seek p[] to minimise
Sum square( (Az[i,j] - Azimuth( p[i], p[j]))/S[i,j])
where Az[i,j] is your measurement data
Azimuth( r, s) is the function that gives the azimuth of the point s from the point r
S[i,j] is the 'sd' of the measurement A[i,j] -- the higher the sd of a particular observation is, relative to the others, the less it affects the final result.
The above is a non linear least squares problem. There are many solvers available for this, but generally speaking as well as providing the data -- the Az[] and the S[] -- and the observation model -- the Azimuth function -- you need to provide an initial estimate of the state -- the values sought, in your case p[2] ..
It is highly likely that if your initial estimate is wrong the solver will fail.
One way to find this estimate would be to start with a set K of known point indices and seek to expand it. You would start with K being {0,1}. Then look for points that have as many azimuths as possible to points in K, and for such points estimate geometrically their position from the known points and the azimuths, and add them to K. If at the end you have all the points in K, then you can go on to the least squares. If it isn't its possible that a different pair of initial fixed points might do better, or maybe you are stuck.
The latter case is a real possibility. For example suppose you had points p[0],p[1],p[2],p[3] and azimuths A[0,1], A[1,2], A[1,3], A[2,3].
As above we fix the positions of p[0] and p[1]. But we can't compute positions of p[2] and p[3] because we do not know the distances of 2 or 3 from 1. The 1,2,3 triangle could be scaled arbitrarily and still give the same azimuths.

What to do when KMeans returns fewer than K clusters?

I've implemented K-Means in Java and have a bit of a head scratcher. I select my initial centroids by choosing a random value in each dimension within the range of values of the data points. I've run into cases where this results in one or more of these centroids not ending up being the closet centroid of any data point. So what do I do for the next iteration? Just leave it at its original randomized value? Pick a new random value? Compute as an average of the other centroids? Seems like this isn't accounted for in the original algorithm, but probably I've just missed something.

Most implementations of k-means define initial centroids using actual data points, not random points in the bounding box drawn by the variables. However, some suggestions for solving your actual problem are below.
You could take another data-point at random and make it a new cluster centroid. This is very simple and fast to implement, and shouldn't affect the algorithm adversely.
You could also try making a smarter initial selection of cluster centroids using kmeans++. This algorithm chooses the first centroid randomly, and picks the remaining K-1 centroids to try and maximize the inter-centroid distance. By picking smarter centroids, you are much less likely to encounter the problem of a centroid being assigned zero data points.
If you wanted to be slightly more clever clever, you could use the kmeans++ algorithm to make a new centroid whenever a centroid gets assigned zero data points.

The way I've used it, the initial values were taken as random points from the data set, not random points in the spanned space. That means each cluster has at least one point in it initially. You could still get unlucky with outliers but with any luck you'll be able to detect this and restart with different points. (Provided "K clusters of points" is an adequate description of your data)

Instead of picking random values (which can be pretty meaningless if the space of possible values is large in comparison to the clusters), many implementations pick random points from the dataset as the initial centroids.

RayTracing: When to Normalize a vector?

I am rewriting my ray tracer and just trying to better understand certain aspects of it.
I seem to have down pat the issue regarding normals and how you should multiply them by the inverse of the transpose of a transformation matrix.
What I'm confused about is when I should be normalizing my direction vectors?
I'm following a certain book and sometimes it'll explicitly state to Normalize my vector and other cases it doesn't and I find out that I needed to.
Normalized vector is in the same direction with just unit length 1? So I'm unclear when it is necessary?
Thanks

You never need to normalize a vector unless you are working with the angles between vectors, or unless you are rotating a vector.
That's it.
In the former case, all of your trig functions require your vectors to land on a unit circle, which means the vectors are normalized. In the latter case, you are dividing out the magnitude, rotating the vector, making sure it stays a unit, and then multiplying the magnitude back in. Normalization just goes with the territory.
If someone tells you that coordinate system are defined by n unit vectors, know that i-hat, j-hat, k-hat, and so on can be any arbitrary vector(s) of any length and direction, so long as none of them are parallel. This is the heart of affine transformations.
If someone tries to tell you that the dot product requires normalized vectors, shake your head and smile. The dot product only needs normalized vectors when you are using it to get the angle between two vectors.
But doesn't normalization make the math "simpler"?
Not really -- It adds a magnitude computation and a division. Numbers between 0..1 are no different than numbers between 0..x.
Having said that, you sometimes normalize in order to play well with others. But if you find yourself normalizing vectors as a matter of principle before calling methods, consider using a flag attached to the vector to save yourself a step. Mathematically, it is unimportant, but practically, it can make a huge difference in performance.
So again... it's all about rotating a vector or measuring its angle against another vector. If you aren't doing that, don't waste cycles.

tl;dr: Normalized vectors simplify your math. They also reduce the number of very hard to diagnose visual artifacts in your images.
Normalized vector is in the same direction with just unit length 1? So
I'm unclear when it is necessary?
You almost always want all vectors in a ray tracer to be normalized.
The simplest example is that of the intersection test: where does a bouncing ray hit another object.
Consider a ray where:
p(t) = p_0 + v * t
In this case, a point anywhere along that ray p(t) is defined as an offset from the original point p_0 and an offset along a particular direction v. For every increment of parameter t, the resulting p(t) will move another increment of length equal to the length of the vector v.
Remember, you know p_0 and v. When you are trying to find the point where this ray next hits another object, you have to solve for that t. It is obviously more convenient, if not always obviously necessary, to use normalized vector vs in that representation.
However, that same vector v is used in lighting calculations. Imagine that we have another direction vector u that points towards a lighting source. For the purpose of a very simple shading model, we can define the light at a particular point to be the dot product between those two vectors:
L(p) = v * u
Admittedly, this is a very uninteresting reflection model but it captures the high points of the discussion. A spot on a surface is bright if reflection points towards the light and dim if not.
Now, remember that another way of writing this dot product is the product of the magnitudes of the vectors times the cosine of the angle between them:
L(p) = ||v|| ||u|| cos(theta)
If u and v are of unit length (normalized), then the equation will evaluate to be proportional to the angle between the two vectors. However, if v is not of unit length, say because you didn't bother to normalize after reflecting the vector in the ray model above, now your lighting model has a problem. Spots on the surface using a larger v will be much brighter than spots that do not.

It is necessary to normalize a direction vector whenever you use it in some math that is influenced by its length.
The prime example is the dot product, which is used in most lighting equations. You also sometimes need to normalize vectors that you use in lighting calculations, even if you believe that they are normal.
For example, when using an interpolated normal on a triangle. Common sense tells you that since the normals at the vertices are normal, the vectors you get by interpolating are too. So much for common sense... the truth is that they will be shorter unless they incidentially all point into the same direction. Which means that you will shade the triangle too dark (to make matters worse, the effect is more pronounced the closer the light source gets to the surface, which is a... very funny result).
Another example where a vector might or might not be normalized is the cross product, depending on what you are doing. For example, when using the two cross products to build an orthonormal base, then you must at least normalize once (though if you do it naively, you end up doing it more often).
If you only care about the direction of the resulting "up vector", or about the sign, you don't need to normalize.

I'll answer the opposite question. When do you NOT need to normalize? Almost all calculations related to lighting require unit vectors - the dot product then gives you the cosine of the angle between vectors which is really useful. Some equations can still cope but become more complex (essentially doing the normalization in the equation) That leaves mostly intersection tests.
Equations for many intersection tests can be simplified if you have unit vectors. Some do not require it - for example if you have a plane equation (with a unit normal) you can find the ray-plane intersection without normalizing the ray direction vector. The distance will be in terms of the ray direction vectors length. This might be OK if all you want is to intersect a bunch of those planes (the relative distances will all be correct). But as soon as you want to compare with a different distance - calculated using the normalized ray direction - the distance values will not compare properly.
You might think about normalizing a direction vector AFTER doing some work that does not require it - maybe you have an acceleration structure that can be traversed without a normalized vector. But that isn't relevant either because eventually the ray will hit something and you're going to want to do a lighting/shading calculation with it. So you may as well normalize them from the start...
In other words, any specific calculation may not require a normalized direction vector, but a given direction vector will almost certainly need to be normalized at some point in the process.

Vectors are used to store two conceptually different elements: points in space and directions:
If you are storing a point in space (for example the position of the camera, the origin of the ray, the vertices of triangles) you don't want to normalize, because you would be modifying the value of the vector, and losing the specific position.
If you are storing a direction (for example the camera up, the ray direction, the object normals) you want to normalize, because in this case you are interested not in the specific value of the point, but on the direction it represents, so you don't need the magnitude. Normalization is useful in this case because it simplifies some operations, such as calculating the cosine of two vectors, something that can be done with a dot product if both are normalized.

Insert a point into a finite 2D region with maximum distance to existing points

I have a set of 2D points inside a finite 2D region of space (let's say a world-aligned rectangle to keep things simple for now). What would be an exceedingly efficient way to insert a new point into the set that has a relatively large distance to its new closest neighbour?
I could slowly build a Delaunay triangulation and limit my search to the largest triangles only, but I was hoping someone has a different (better) idea.
Goodwill,
David
Edit:
Forgot to mention that I need to do this thousands of times, every time taking all the previous points into account. I'm looking for an algorithm that doesn't slow down to a crawl as my point set grows.

Use the Bowyer-Watson or other incremental algorithm to maintain the Voronoi diagram. The vertexes of the Voronoi diagram are candidate points, keep all the candidate points in a priority queue ordered by distance to the source points. That should be pretty fast, and optimal (at least, optimal at each step).
Were you looking for something faster and less optimal?

Decomposition to Convex Polygons

This question is a little involved. I wrote an algorithm for breaking up a simple polygon into convex subpolygons, but now I'm having trouble proving that it's not optimal (i.e. minimal number of convex polygons using Steiner points (added vertices)). My prof is adamant that it can't be done with a greedy algorithm such as this one, but I can't think of a counterexample.
So, if anyone can prove my algorithm is suboptimal (or optimal), I would appreciate it.
The easiest way to explain my algorithm with pictures (these are from an older suboptimal version)
What my algorithm does, is extends the line segments around the point i across until it hits a point on the opposite edge.
If there is no vertex within this range, it creates a new one (the red point) and connects to that:
If there is one or more vertices in the range, it connects to the closest one. This usually produces a decomposition with the fewest number of convex polygons:
However, in some cases it can fail -- in the following figure, if it happens to connect the middle green line first, this will create an extra unneeded polygon. To this I propose double checking all the edges (diagonals) we've added, and check that they are all still necessary. If not, remove it:
In some cases, however, this is not enough. See this figure:
Replacing a-b and c-d with a-c would yield a better solution. In this scenario though, there's no edges to remove so this poses a problem. In this case I suggest an order of preference: when deciding which vertex to connect a reflex vertex to, it should choose the vertex with the highest priority:
lowest) closest vertex
med) closest reflex vertex
highest) closest reflex that is also in range when working backwards (hard to explain) --
In this figure, we can see that the reflex vertex 9 chose to connect to 12 (because it was closest), when it would have been better to connect to 5. Both vertices 5 and 12 are in the range as defined by the extended line segments 10-9 and 8-9, but vertex 5 should be given preference because 9 is within the range given by 4-5 and 6-5, but NOT in the range given by 13-12 and 11-12. i.e., the edge 9-12 elimates the reflex vertex at 9, but does NOT eliminate the reflex vertex at 12, but it CAN eliminate the reflex vertex at 5, so 5 should be given preference.
It is possible that the edge 5-12 will still exist with this modified version, but it can be removed during post-processing.
Are there any cases I've missed?
Pseudo-code (requested by John Feminella) -- this is missing the bits under Figures 3 and 5
assume vertices in `poly` are given in CCW order
let 'good reflex' (better term??) mean that if poly[i] is being compared with poly[j], then poly[i] is in the range given by the rays poly[j-1], poly[j] and poly[j+1], poly[j]
for each vertex poly[i]
if poly[i] is reflex
find the closest point of intersection given by the ray starting at poly[i-1] and extending in the direction of poly[i] (call this lower bound)
repeat for the ray given by poly[i+1], poly[i] (call this upper bound)
if there are no vertices along boundary of the polygon in the range given by the upper and lower bounds
create a new vertex exactly half way between the lower and upper bound points (lower and upper will lie on the same edge)
connect poly[i] to this new point
else
iterate along the vertices in the range given by the lower and upper bounds, for each vertex poly[j]
if poly[j] is a 'good reflex'
if no other good reflexes have been found
save it (overwrite any other vertex found)
else
if it is closer then the other good reflexes vertices, save it
else
if no good reflexes have been found and it is closer than the other vertices found, save it
connect poly[i] to the best candidate
repeat entire algorithm for both halves of the polygon that was just split
// no reflex vertices found, then `poly` is convex
save poly
Turns out there is one more case I didn't anticipate: [Figure 5]
My algorithm will attempt to connect vertex 1 to 4, unless I add another check to make sure it can. So I propose stuffing everything "in the range" onto a priority queue using the priority scheme I mentioned above, then take the highest priority one, check if it can connect, if not, pop it off and use the next. I think this makes my algorithm O(r n log n) if I optimize it right.
I've put together a website that loosely describes my findings. I tend to move stuff around, so get it while it's hot.

I believe the regular five pointed star (e.g. with alternating points having collinear segments) is the counterexample you seek.
Edit in response to comments
In light of my revised understanding, a revised answer: try an acute five pointed star (e.g. one with arms sufficiently narrow that only the three points comprising the arm opposite the reflex point you are working on are within the range considered "good reflex points"). At least working through it on paper it appears to give more than the optimal. However, a final reading of your code has me wondering: what do you mean by "closest" (i.e. closest to what)?
Note
Even though my answer was accepted, it isn't the counter example we initially thought. As #Mark points out in the comments, it goes from four to five at exactly the same time as the optimal does.
Flip-flop, flip flop
On further reflection, I think I was right after all. The optimal bound of four can be retained in a acute star by simply assuring that one pair of arms have collinear edges. But the algorithm finds five, even with the patch up.
I get this:
removing dead ImageShack link
When the optimal is this:
removing dead ImageShack link

I think your algorithm cannot be optimal because it makes no use of any measure of optimality. You use other metrics like 'closest' vertices, and checking for 'necessary' diagonals.
To drive a wedge between yours and an optimal algorithm, we need to exploit that gap by looking for shapes with close vertices which would decompose badly. For example (ignore the lines, I found this on the intertubenet):
concave polygon which forms a G or U shape http://avocado-cad.wiki.sourceforge.net/space/showimage/2007-03-19_-_convexize.png
You have no protection against the centre-most point being connected across the concave 'gap', which is external to the polygon.
Your algorithm is also quite complex, and may be overdoing it - just like complex code, you may find bugs in it because complex code makes complex assumptions.
Consider a more extensive initial stage to break the shape into more, simpler shapes - like triangles - and then an iterative or genetic algorithm to recombine them. You will need a stage like this to combine any unnecessary divisions between your convex polys anyway, and by then you may have limited your possible decompositions to only sub-optimal solutions.
At a guess something like:
decompose into triangles
non-deterministically generate a number of recombinations
calculate a quality metric (number of polys)
select the best x% of the recombinations
partially decompose each using triangles, and generate a new set of recombinations
repeat from 4 until some measure of convergence is reached

but vertex 5 should be given preference because 9 is within the range given by 4-5 and 6-5
What would you do if 4-5 and 6-5 were even more convex so that 9 didn't lie within their range? Then by your rules the proper thing to do would be to connect 9 to 12 because 12 is the closest reflex vertex, which would be suboptimal.

Found it :( They're actually quite obvious.
*dead imageshack img*
A four leaf clover will not be optimal if Steiner points are allowed... the red vertices could have been connected.
*dead imageshack img*
It won't even be optimal without Steiner points... 5 could be connected to 14, removing the need for 3-14, 3-12 AND 5-12. This could have been two polygons better! Ouch!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string