Using distance matrix to find coordinate points of set of points - geometry

Given a distance matrix and a set of points, how do you figure out the coordinates of these points?
Edit: This is on a plane.
This question was answered here, but when I tried different distance matrices I couldn't actually use that answer: the M matrix had negative entries, as did my eigenvectors, so taking the square root made the program (in R) output "NaN" for the associated entries.
I'm guessing this will happen every time D(i,j)^2 is greater than D(1,j)^2 + D(i,1)^2.
For example, say I have a distance matrix:
0 73 102 496 432 184
73 0 303 392 436 233
102 303 0 366 207 353
496 392 366 0 172 103
432 436 207 172 0 352
184 233 353 103 352 0
Using the equation M(i,j) = (0.5)(D(1,j)^2 + D(i,1)^2 - D(i,j)^2), I get the following matrix, which already has negative entries:
0 0.0 0.0 0.0 0.0 0.0
0 5329.0 -38038.0 48840.5 928.5 -7552.0
0 -38038.0 10404.0 61232.0 77089.5 -40174.5
0 48840.5 61232.0 246016.0 201528.0 134631.5
0 928.5 77089.5 201528.0 186624.0 48288.0
0 -7552.0 -40174.5 134631.5 48288.0 33856.0
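For reference, here is a minimal numpy sketch of this construction (Python rather than the R I used); it reproduces the matrix above and, below, the negative eigenvalues:
import numpy as np

D = np.array([[  0,  73, 102, 496, 432, 184],
              [ 73,   0, 303, 392, 436, 233],
              [102, 303,   0, 366, 207, 353],
              [496, 392, 366,   0, 172, 103],
              [432, 436, 207, 172,   0, 352],
              [184, 233, 353, 103, 352,   0]], dtype=float)

# M(i,j) = 0.5 * (D(1,j)^2 + D(i,1)^2 - D(i,j)^2), with point 1 as the origin
M = 0.5 * (D[0, :] ** 2 + D[:, :1] ** 2 - D ** 2)
eigenvalues, eigenvectors = np.linalg.eigh(M)  # M is symmetric
print(eigenvalues)  # some are negative, which is where the NaNs come from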
Then I get the non-zero eigenvalues and eigenvectors:
477718.27 101845.63 16474.30 -13116.72 -100692.49
[,1] [,2] [,3] [,4] [,5]
0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
-0.05928626 0.3205747 0.84148945 0.04869546 -0.42806691
-0.16650486 -0.5670946 -0.04507520 -0.58222690 -0.55647098
-0.73371713 0.2827320 0.07386302 -0.45957443 0.40627254
-0.59727407 -0.4623603 0.07806418 0.64968004 -0.03617241
-0.27144823 0.5309625 -0.52755471 0.15920983 -0.58372335
Since some of the eigenvalues and eigenvector entries are negative, computing
sqrt(eigenvector(i) * eigenvalue(i)) yields square roots of negative numbers.
Here is my final output:
[,1] [,2] [,3] [,4] [,5]
0 0.0000 0.00000 0.00000 0.00000
NaN 180.6907 117.74103 NaN 207.61291
NaN NaN NaN 87.38939 236.71174
NaN 169.6910 34.88326 77.64089 NaN
NaN NaN 35.86158 NaN 60.35139
NaN 232.5429 NaN NaN 242.43877
Is this the only clear way of computing the coordinate points without using angles?
If it is, do we have to fix the distance matrix so that D(i,j)^2 is not greater than D(1,j)^2 + D(i,1)^2?
Thanks.

Your data is inconsistent
Your distances are not consistent with positions of points in ℝ⁴, let alone a space of lower dimension. You can tell by computing the Menger determinant of your squared distance matrix:
D <- as.matrix(read.table(textConnection("\
0 73 102 496 432 184
73 0 303 392 436 233
102 303 0 366 207 353
496 392 366 0 172 103
432 436 207 172 0 352
184 233 353 103 352 0")))
n <- nrow(D)
det(rbind(cbind(D^2, 1), c(rep(1, n), 0)))
# Result: 3.38761e+25
If your distances really came from points in a space of dimension less than five, then this determinant would have to be zero. As it is not, your distances are inconsistent, or the points form a simplex in a space of sufficiently high dimension.
But no matter the dimension, your data is still inconsistent, since it violates the triangle inequality in several cases:
a b c:  ac  >  ab+bc  =  ab  +  bc
1 2 4: 496 > 465 = 73 + 392
1 3 4: 496 > 468 = 102 + 366
1 3 5: 432 > 309 = 102 + 207
1 6 4: 496 > 287 = 184 + 103
2 1 3: 303 > 175 = 73 + 102
2 6 4: 392 > 336 = 233 + 103
3 1 6: 353 > 286 = 102 + 184
5 4 6: 352 > 275 = 172 + 103
Going from a to c directly can never take longer than going via b, but according to your data it does.
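These violations can also be found programmatically; here is a small Python sketch, with D holding the distances above as a nested list or numpy array:
import itertools

def triangle_violations(D):
    # yield (a, b, c) where the direct path a -> c is longer than a -> b -> c
    n = len(D)
    for a, c in itertools.combinations(range(n), 2):
        for b in range(n):
            if b != a and b != c and D[a][c] > D[a][b] + D[b][c]:
                yield a + 1, b + 1, c + 1  # 1-based, as in the table above

# for a, b, c in triangle_violations(D): print(a, b, c)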
Simple planar approach
If you had data consistent with points in the plane (i.e. all Menger determinants for combinations of four points evaluate to zero), you could use the following to obtain coordinates:
distance2coordinates <- function(D) {
  n <- nrow(D)
  maxDist <- which.max(D)
  p1 <- ((maxDist - 1) %% n) + 1   # row index of a maximal distance
  p2 <- ((maxDist - 1) %/% n) + 1  # column index of a maximal distance
  x2 <- D[p1, p2]                  # p1 goes to the origin, p2 to (x2, 0)
  r1sq <- D[p1, ]^2
  r2sq <- D[p2, ]^2
  x <- (r1sq - r2sq + x2^2) / (2 * x2)  # x from the two-circle intersection below
  y <- sqrt(r1sq - x^2)                 # y is determined only up to sign
  p3 <- which.max(y)                    # third reference point to fix the signs
  x3 <- x[p3]
  y3 <- y[p3]
  plus <- abs(D[p3, ]^2 - (x3 - x)^2 - (y3 - y)^2)
  minus <- abs(D[p3, ]^2 - (x3 - x)^2 - (y3 + y)^2)
  y[minus < plus] <- -y[minus < plus]
  coords <- data.frame(x = x, y = y)
  return(coords)
}
The idea is that you choose two points with maximal distance as starting points. You place one at the origin and the other on the positive x axis. Then you can compute all other x coordinates from this, as the intersection of two circles, following the equations
I: x² + y² = r₁²
II: (x - x₂)² + y² = r₂²
I-II: 2*x*x₂ = r₁² - r₂² + x₂²
Given these x coordinates, you can obtain y coordinates as well, up to sign. You then choose a third point, sufficiently far away from either of these two starting points, to decide on the sign.
This approach makes no attempt at all to handle imprecise input. It assumes exact data, and will only use part of the distance matrix to find the points. It will not find the point set most closely matching all of the input data.
On your data, this will fail, since some arguments to the square root will be negative. This means that the two circles involved don't intersect at all, hence the triangle inequality is violated.
If it is, do we have to fix the distance matrix so D(i,j)^2 is not greater than D(1,j)^2 + D(i,1)^2?
D(i,j) ≤ D(i,k) + D(k,j) would help, i.e. for all triples and without squares. This would ensure that the triangle inequality holds everywhere. The result still need not be planar; for that you'd have to fix all those Menger determinants.
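One standard repair, sketched below in Python rather than R, is to replace every entry by its shortest-path distance (Floyd-Warshall), which enforces exactly this condition; planarity would still require fixing the Menger determinants:
import numpy as np

def metric_closure(D):
    # Floyd-Warshall all-pairs shortest paths: afterwards
    # D[i, j] <= D[i, k] + D[k, j] holds for every triple
    D = np.array(D, dtype=float)
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D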

This is a simple Python function to calculate what you need, by solving intersections of hyperspheres.
import sympy
import numpy as np

def give_coords(distances):
    """Give coordinates of points for which distances are given.

    Coordinates are given relative to one another: the 1st point is placed
    at the origin, the 2nd on the x-axis, the 3rd in the x-y plane, and so
    on, using at most n - 1 dimensions, where n is the number of points.

    Args:
        distances (list): an n x n 2d array where distances[i][j] gives the
            distance from i to j; assumes distances[i][j] == distances[j][i].

    Returns:
        numpy.ndarray: coordinates, one row per point.

    Examples:
        >>> a = sympy.sqrt(2)
        >>> distances = [[0, 1, 1, 1, 1, 1],
        ...              [1, 0, a, a, a, a],
        ...              [1, a, 0, a, a, a],
        ...              [1, a, a, 0, a, a],
        ...              [1, a, a, a, 0, a],
        ...              [1, a, a, a, a, 0]]
        >>> give_coords(distances)
        array([[0, 0, 0, 0, 0],
               [1, 0, 0, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [0, 0, 0, 0, 1]], dtype=object)

        >>> give_coords([[0, 3, 4], [3, 0, 5], [4, 5, 0]])
        array([[0, 0],
               [3, 0],
               [0, 4]], dtype=object)
    """
    distances = np.array(distances)
    n = len(distances)
    X = sympy.symarray('x', (n, n - 1))
    # point i only uses its first i coordinates; zero out the rest
    for row in range(n):
        X[row, row:] = [0] * (n - 1 - row)
    # solve for each point in turn, using its distances to all earlier points
    for point2 in range(1, n):
        expressions = []
        for point1 in range(point2):
            expression = np.sum((X[point1] - X[point2]) ** 2)
            expression -= distances[point1, point2] ** 2
            expressions.append(expression)
        # solve returns both mirror images; take the second
        # (the non-negative branch in the examples above)
        X[point2, :point2] = sympy.solve(expressions, list(X[point2, :point2]))[1]
    return X

This is Solvable
If you would like to see Cartesian-type coordinates which satisfy the distance matrix you provided in your question, see the values below.
[Image: distance matrix and coordinates]
Your input matrix gives the distances between 6 nodes, which we shall call a, b, c, d, e, and f. A total of 5 dimensions is required to assign coordinates to all six nodes that satisfy your distance matrix. Two of these dimensions are imaginary valued, which is a consequence of violating the triangle inequality. The results were arrived at by using the law of cosines and a bit of number crunching.
a (0, 0, 0, 0, 0)
b (73, 0, 0, 0, 0)
c (-521.07, 510.99i, 0, 0, 0)
d (669.05, -802.08i, 664.62, 0, 0)
e (12.72, -163.83i, 488.13, 158.01i, 0)
f (-103.45, 184.11i, 84.52, 138.06i, 262.62)
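A quick numpy check (a sketch; small deviations stem from the rounded values above) confirms that these coordinates reproduce the distance matrix when squared differences are summed with complex arithmetic:
import numpy as np

pts = np.array([
    [0, 0, 0, 0, 0],                             # a
    [73, 0, 0, 0, 0],                            # b
    [-521.07, 510.99j, 0, 0, 0],                 # c
    [669.05, -802.08j, 664.62, 0, 0],            # d
    [12.72, -163.83j, 488.13, 158.01j, 0],       # e
    [-103.45, 184.11j, 84.52, 138.06j, 262.62],  # f
], dtype=complex)

for p in pts:
    # complex "squared distance"; its square root recovers each matrix entry
    row = [abs(np.sqrt(np.sum((p - q) ** 2))) for q in pts]
    print([round(d) for d in row])  # 0 73 102 496 ... up to rounding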

Related

K means clustering and finding points closest to the centroid

I am trying to apply k-means to cluster actors based on the information in the following columns:
Actors Movies TvGuest Awards Shorts Special LiveShows
Robert De Niro 111 2 6 0 0 0
Jack Nicholson 70 2 4 0 5 0
Marlon Brando 64 2 5 0 0 28
Denzel Washington 25 2 3 24 0 0
Katharine Hepburn 90 1 2 0 0 0
Humphrey Bogart 105 2 1 0 0 52
Meryl Streep 27 2 2 5 0 0
Daniel Day-Lewis 90 2 1 0 71 22
Sidney Poitier 63 2 3 0 0 0
Clark Gable 34 2 4 0 3 0
Ingrid Bergman 22 2 2 3 0 4
Tom Hanks 82 11 6 21 11 22
# began by scaling my data
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = StandardScaler().fit_transform(data)

# used an elbow plot to find the optimal k value
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    k_means = KMeans(n_clusters=k)
    model = k_means.fit(X)
    sum_of_squared_distances.append(k_means.inertia_)
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.show()

# found yhat for the calculated k value
kmeans = KMeans(n_clusters=3)
model = kmeans.fit(X)
yhat = kmeans.predict(X)
I am unable to figure out how to create scatter plots of the actors.
EDIT:
Is there a way to find which actors are closest to the centroids, if the centroids were also plotted using the following? (The kmeans here refers to Eric's solution below.)
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], color='purple', marker='*', label='centroid')
K means clustering in Pandas - Scatter plot
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=['Actors', 'Movies', 'TvGuest', "Awards", "Shorts"])
df.loc[0] = ["Robert De Niro", 111, 2, 6, 0]
df.loc[1] = ["Jack Nicholson", 70, 2, 4, 0]
df.loc[2] = ["Marlon Brando", 64, 4, 5, 0]
df.loc[3] = ["Denzel Washington", 25, 2, 3, 24]
df.loc[4] = ["Katharine Hepburn", 90, 1, 2, 0]
df.loc[5] = ["Humphrey Bogart", 105, 2, 1, 0]
df.loc[6] = ["Meryl Streep", 27, 3, 2, 5]
df.loc[7] = ["Daniel Day-Lewis", 90, 2, 1, 0]
df.loc[8] = ["Sidney Poitier", 63, 2, 3, 0]
df.loc[9] = ["Clark Gable", 34, 2, 4, 0]
df.loc[10] = ["Ingrid Bergman", 22, 5, 2, 3]
kmeans = KMeans(n_clusters=4)
y = kmeans.fit_predict(df[['Movies', 'TvGuest', 'Awards']])
df['Cluster'] = y
plt.scatter(df.Movies, df.TvGuest, c=df.Cluster, alpha = 0.6)
plt.title('K-means Clustering 2 dimensions and 4 clusters')
plt.show()
This shows: [scatter plot of Movies vs TvGuest, colored by cluster]
Notice that the data points presented on the 2-dimensional scatterplot are Movies and TvGuest; however, the KMeans fit was given 3 variables: Movies, TvGuest, Awards. Imagine an additional dimension going into the screen, which is also used to calculate membership in a cluster.
Source links:
https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
https://datascience.stackexchange.com/questions/48693/perform-k-means-clustering-over-multiple-columns
https://towardsdatascience.com/visualizing-clusters-with-pythons-matplolib-35ae03d87489
You can calculate the Euclidean distance between each point and a centroid; the minimum distance indicates the point closest to that centroid:
dist = numpy.linalg.norm(centroid-point)
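For the edited question, here is a short sketch along those lines, reusing X and kmeans from the snippets above, that reports the row closest to each centroid:
import numpy as np

centers = kmeans.cluster_centers_
for c, center in enumerate(centers):
    d = np.linalg.norm(X - center, axis=1)  # distance of every point to this centroid
    print(f"cluster {c}: closest row {np.argmin(d)}, distance {d.min():.3f}")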

How to write values over matplotlib bar charts without distorted figures

NOTE
The answers here produce a distorted figure (the resulting image is omitted); here is the modified code that produces the bad result:
def plot_compare_bar(col1, col2, frame, fig_prefix=''):
    frame = frame.sort_values(by=col1)
    ind = np.arange(len(frame))
    width = 0.4
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.barh(ind, frame[col1], width, color='red', label=col1)
    ax.barh(ind + width, frame[col2], width, color='blue', label=col2)
    ax.set(
        yticks=ind + width, yticklabels=frame['Class Name'],
        ylim=[2 * width - 1, len(frame)], title=(
            f'{fig_prefix} {col1} vs {col2} evaluation results'))
    for i, v in enumerate(frame[col1].values):
        ax.text(v + 3, i + .25, str(v), color='red', fontweight='bold')
    for i, v in enumerate(frame[col2].values):
        ax.text(v + 3, i + .25, str(v), color='blue', fontweight='bold')
    ax.legend()
The original question:
import matplotlib.pyplot as plt
import numpy as np

def plot_compare_bar(col1, col2, frame, fig_prefix=''):
    frame = frame.sort_values(by=col1)
    ind = np.arange(len(frame))
    width = 0.4
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.barh(ind, frame[col1], width, color='red', label=col1)
    ax.barh(ind + width, frame[col2], width, color='blue', label=col2)
    ax.set(
        yticks=ind + width, yticklabels=frame['Class Name'],
        ylim=[2 * width - 1, len(frame)], title=(
            f'{fig_prefix} {col1} vs {col2} evaluation results'))
    ax.legend()
frame is a pandas DataFrame that looks like the following:
Class Name  Average Precision  Actual  Detections  True Positives  False Positives  Combined
2 Palm Tree 91.152760 1379 1428 1292 141 1433
5 Traffic Lights 71.026533 1269 1036 948 88 1036
3 Street Lamp 63.871910 995 848 727 121 848
0 Car 57.535491 3153 1955 1881 104 1985
1 Street Sign 56.925982 1109 704 658 46 704
6 Pedestrian 55.243564 1418 887 835 73 908
13 Road Block 52.182602 363 215 203 14 217
4 Minivan 51.786659 68 41 38 3 41
11 Bus 36.805556 43 18 16 2 18
9 Trash Can 14.444444 90 13 13 0 13
10 Bicycle 5.882353 17 1 1 0 1
8 Flag 5.000000 124 10 7 3 10
7 Fire Hydrant 1.923077 52 1 1 0 1
12 Pickup Truck 0.000000 20 0 0 0 0
14 Delivery Truck 0.000000 4 0 0 0 0
15 Motorcycle 0.000000 3 0 0 0 0
The function I defined above produces the following plot (image omitted). I need every bar's value written next to it, which might look like this (image omitted). How can I modify the function above to do that?
You should not hard-code the text. Instead, extract the values from the patches:
def plot_compare_bar(col1, col2, frame, fig_prefix=''):
    frame = frame.sort_values(by=col1)
    ind = np.arange(len(frame))
    width = 0.4
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.barh(ind, frame[col1], width, color='red', label=col1)
    ax.barh(ind + width, frame[col2], width, color='blue', label=col2)
    ax.set(
        yticks=ind + width, yticklabels=frame['Class Name'],
        ylim=[2 * width - 1, len(frame)], title=(
            f'{fig_prefix} {col1} vs {col2} evaluation results'))
    # annotation here
    for patch in ax.patches:
        # extract information from the patch
        pw = patch.get_width()
        _, y = patch.get_xy()
        color = patch.get_facecolor()
        ax.text(pw + 3, y + width / 2, str(pw),
                color=color, verticalalignment='center')
    ax.legend(loc='lower right')
Output: [annotated figure omitted]
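As an aside, on matplotlib 3.4 or newer you can let ax.bar_label annotate each bar container instead of positioning the text by hand:
# inside plot_compare_bar, in place of the patch loop above:
for container in ax.containers:
    ax.bar_label(container, padding=3)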

Setting specific bin length in python list

I have a straightforward question, but I'm running into issues with the conversion.
I have a pandas dataframe column which I converted to a list. It has both positive and negative values:
bin_length = 5
list = [-200, -112, -115, 0, 50, 120, 250]
I need to group these numbers into bins of length 5.
For example:
-100 to -95 should have a value of -100
-95 to -90 should have a value of -95
Similarly for positive values:
0 to 5 should be 5
5 to 10 should be 10
What I have tried until now:
df = pd.DataFrame(dataframe['rd2'].values.tolist(), columns = ['values'])
bins = np.arange(0, df['values'].max() + 5, 5)
df['bins'] = pd.cut(df['values'], bins, include_lowest = True)
But this doesn't account for negative values, and then I have problems converting the pandas intervals into separate columns for the list.
Any help would be amazing.
Setting up the correct lower limit with np.arange:
bins = np.arange(df["values"].min(), df['values'].max() + 5, 5)
df['bins'] = pd.cut(df['values'], bins, include_lowest = True)
print (df)
values bins
0 -200 (-200.001, -195.0]
1 -112 (-115.0, -110.0]
2 -115 (-120.0, -115.0]
3 0 (-5.0, 0.0]
4 50 (45.0, 50.0]
5 120 (115.0, 120.0]
6 250 (245.0, 250.0]
Convert the intervals back to a list:
s = pd.IntervalIndex(df["bins"])
print ([[x,y] for x,y in zip(s.left, s.right)])
[[-200.001, -195.0], [-115.0, -110.0], [-120.0, -115.0], [-5.0, 0.0], [45.0, 50.0], [115.0, 120.0], [245.0, 250.0]]
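If a single representative value per bin is wanted instead of an interval, the question's examples (positives snap to the upper edge, negatives to the lower edge) amount to rounding away from zero to a multiple of bin_length; here is a sketch, on the assumption that values already on a multiple stay put:
import numpy as np
import pandas as pd

bin_length = 5
values = pd.Series([-200, -112, -115, 0, 50, 120, 250])
# round away from zero to the nearest multiple of bin_length
bin_value = np.sign(values) * np.ceil(values.abs() / bin_length) * bin_length
print(bin_value.astype(int).tolist())  # [-200, -115, -115, 0, 50, 120, 250]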

Performing pair-wise comparisons of some pandas dataframe rows as efficiently as possible

For a given pandas dataframe df, I would like to compare every sample (row) with each other.
For bigger datasets this would lead to too many comparisons (n**2). Therefore, it is necessary to perform these comparisons only for smaller groups (i.e. for all of those which share the same id) and as efficiently as possible.
I would like to construct a dataframe (df_pairs), which contains in every row one pair. Additionally, I would like to get all pair indices (ideally as a Python set).
First, I construct an example dataframe:
import numpy as np
import pandas as pd
from functools import reduce
from itertools import product, combinations
n_samples = 10_000
suffixes = ["_1", "_2"] # for df_pairs
id_str = "id"
df = pd.DataFrame({id_str: np.random.randint(0, 10, n_samples),
                   "A": np.random.randint(0, 100, n_samples),
                   "B": np.random.randint(0, 100, n_samples),
                   "C": np.random.randint(0, 100, n_samples)}, index=range(0, n_samples))
columns_df_pairs = ([elem + suffixes[0] for elem in df.columns] +
                    [elem + suffixes[1] for elem in df.columns])
In the following, I am comparing 4 different options with the corresponding performance measures:
Option 1
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [set(product(elem.tolist(), repeat=2)) for _, elem in groups.items()] # determine pairs per group
set_of_pairs = reduce(set.union, pairs_per_group) # convert all groups into one set
idcs1, idcs2 = zip(*[(e1, e2) for e1, e2 in set_of_pairs])
df_pairs = pd.DataFrame(np.hstack([df.values[idcs1, :], df.values[idcs2, :]]),  # construct the dataframe of pairs
                        columns=columns_df_pairs,
                        index=pd.MultiIndex.from_tuples(set_of_pairs, names=('index 1', 'index 2')))
df_pairs.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 1 takes 34.2 s ± 1.28 s.
Option 2
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array(np.meshgrid(elem.values, elem.values)).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs2 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs2.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 2 takes 13 s ± 1.34 s.
Option 3
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array([np.tile(elem.values, len(elem.values)), np.repeat(elem.values, len(elem.values))]).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs3 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs3.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 3 takes 12.1 s ± 347 ms.
Option 4
df_pairs4 = pd.merge(left=df, right=df, how="inner", on=id_str, suffixes=suffixes)
# here, I do not know how to get the MultiIndex in
df_pairs4.drop([id_str], inplace=True, axis=1)
Option 4 is computed the quickest with 1.41 s ± 239 ms. However, I do not have the paired indices in this case.
I could improve the performance a little by using combinations instead of product from itertools. I could also build the comparison matrix, use only its upper triangular part, and construct my dataframe from there. This, however, does not seem to be more efficient than performing the cartesian product and removing the self-references as well as the inverse comparisons ((a, b) vs (b, a)).
Could you tell me a more efficient way to get pairs for comparison (ideally as a set to be able to use set operations)?
Could I use merge or another pandas function to construct my desired dataframe with the multi-indices?
An inner merge will destroy the index in favor of a new Int64Index. If the index is important, bring it along as a column via reset_index, then set those columns back as the index.
df_pairs4 = (pd.merge(left=df.reset_index(), right=df.reset_index(),
                      how="inner", on=id_str, suffixes=suffixes)
             .set_index(['index_1', 'index_2']))
id A_1 B_1 C_1 A_2 B_2 C_2
index_1 index_2
0 0 4 92 79 10 92 79 10
13 4 92 79 10 83 68 69
24 4 92 79 10 67 73 90
25 4 92 79 10 22 31 35
36 4 92 79 10 64 44 20
... .. ... ... ... ... ... ...
9993 9971 7 20 65 92 47 65 21
9977 7 20 65 92 50 35 27
9980 7 20 65 92 43 36 62
9992 7 20 65 92 99 2 17
9993 7 20 65 92 20 65 92
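Since the question also asked for the pair indices, ideally as a Python set, note that once the MultiIndex is in place this is a one-liner:
pair_set = set(df_pairs4.index)  # set of (index_1, index_2) tuples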

adding a new column to a data frame with X Y coordinates, classifying them according to fixed locations

I have a dataframe with X Y coordinates and another dataframe with fixed locations, and I would like to add a new column to the first dataframe that classifies each X Y coordinate based on the minimum distance to the fixed locations (n = 10).
I tried to loop through the first dataframe, and I need another embedded for loop that compares the X Y coordinates with the fixed locations (for example a1 = 200, 180). Based on the minimum distance, the X Y coordinates should be classified according to location. How can this be done?
df1[:10]
X Y
0 224 234
1 258 232
2 225 235
3 257 232
4 226 235
5 257 232
6 226 235
7 257 232
8 224 234
9 258 232
df2 (fixed location)
seats = [[260, 165], [315, 165], [370, 165], [423, 165], [485, 200], [420, 240] , [370, 240], [315, 240] ,[260, 240], [200,200]]
Based on the minimum distances, a new column should be added to the first dataframe classifying the data points according to seats 0-9.
import numpy as np

# assumes df2 = pd.DataFrame(seats), i.e. one row per fixed location
minDist = []
for i in range(df1.shape[0]):
    dist_list = []
    for f in range(df2.shape[0]):
        # use .iloc for df2 as well, and compare raw values so that
        # differing column labels don't cause pandas to misalign them
        dist = np.linalg.norm(df1.iloc[i, :].to_numpy() - df2.iloc[f, :].to_numpy())
        dist_list.append(dist)
    # keep the index of the closest seat, not the distance itself
    minDist.append(dist_list.index(min(dist_list)))
df1['closest_seat'] = minDist
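For reference, the same classification can be done without the explicit loops; here is a vectorized sketch using scipy's cdist (the column name closest_seat is our choice, and df1 is assumed to contain only the X and Y columns):
import numpy as np
from scipy.spatial.distance import cdist

seats = np.array([[260, 165], [315, 165], [370, 165], [423, 165], [485, 200],
                  [420, 240], [370, 240], [315, 240], [260, 240], [200, 200]])
# one distance matrix (rows: points, columns: seats), then argmin per row
df1['closest_seat'] = cdist(df1[['X', 'Y']].to_numpy(), seats).argmin(axis=1)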
