adding a new column to a data frame with X Y coordinates, classifying them according to fixed locations - python-3.x

I have a dataframe with X Y coordinates and another dataframe with fixed locations, and I would like to add a new column to the first dataframe that classifies the X Y coordinates based on the minimum distance to the fixed locations (n=10).
I tried to loop through the first dataframe, and I need a nested for loop that compares the X Y coordinates with the fixed locations (for example a1 = 200, 180). Based on the minimum distance, the X Y coordinates should be classified according to location. How can I do this?
df1[:10]
X Y
0 224 234
1 258 232
2 225 235
3 257 232
4 226 235
5 257 232
6 226 235
7 257 232
8 224 234
9 258 232
df2 (fixed location)
seats = [[260, 165], [315, 165], [370, 165], [423, 165], [485, 200], [420, 240] , [370, 240], [315, 240] ,[260, 240], [200,200]]
Based on the minimum distances, a new column should be added to the first dataframe classifying each data point according to seats 0-9.

import numpy as np

minDist = []
for i in range(df1.shape[0]):
    dist_list = []
    for f in range(df2.shape[0]):
        # df2.iloc[f, :] (not df2[f, :]) selects row f of the fixed-locations dataframe
        dist = np.linalg.norm(df1.iloc[i, :] - df2.iloc[f, :])
        dist_list.append(dist)
    # note: this stores the index of the nearest seat, not the distance itself
    minDist.append(dist_list.index(min(dist_list)))
df1['distance'] = minDist
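A vectorized sketch of the same idea, assuming df2 is a DataFrame built from the seats list above with columns X and Y, and that scipy is available (the column name 'seat' is my own choice):

import pandas as pd
from scipy.spatial.distance import cdist

# hypothetical setup: build df2 from the seats list above
df2 = pd.DataFrame(seats, columns=['X', 'Y'])

# pairwise Euclidean distances between every row of df1 and every seat
dist = cdist(df1[['X', 'Y']].values, df2[['X', 'Y']].values)

# index (0-9) of the nearest seat for every row of df1
df1['seat'] = dist.argmin(axis=1)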

Related

test/train splits in pycaret using a column for grouping rows that should be in the same split

My dataset contains a column that I need to use for splitting by groups, in such a way that rows belonging to the same group should not be divided between train and test but sent as a whole to one of the splits, using PyCaret.
10 row sample for clarification:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
2 23 34 233
2 623 22 888
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
Every unique group_id should be sent to one of the splits in full, like this (using 80/20):
TRAIN SET:
group_id measure1 measure2 measure3
1 3455 3425 345
1 6455 825 945
1 6444 225 145
3 3455 3425 345
3 6155 525 645
3 6434 325 845
4 93 345 233
4 693 222 808
TEST SET:
group_id measure1 measure2 measure3
2 23 34 233
2 623 22 888
You can try the following per the documentation
https://pycaret.readthedocs.io/en/latest/api/classification.html
fold_strategy = "groupkfold"
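A minimal sketch of how that option might be passed to PyCaret's setup, assuming a classification task with a target column named 'target' and that your PyCaret version exposes the fold_groups argument (check the linked documentation for your version). Note that this controls the cross-validation folds, not the initial train/test split itself:

from pycaret.classification import setup

clf = setup(
    data=df,
    target='target',              # placeholder for your own target column
    fold_strategy='groupkfold',   # group-aware cross-validation
    fold_groups='group_id',       # column whose values define the groups
)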
One solution could look like this:
import numpy as np
import pandas as pd
from itertools import combinations
def is_possible_sum(numbers, n):
    for r in range(len(numbers)):
        for combo in combinations(numbers, r + 1):
            if sum(combo) == n:
                return combo
    print('Desired split not possible')
    raise ArithmeticError

def train_test_split(table: pd.DataFrame, train_fraction: float, col_identifier: str):
    train_ids = []
    occurrences = table[col_identifier].value_counts().to_dict()
    required = sum(occurrences.values()) * train_fraction
    lengths = is_possible_sum(occurrences.values(), required)
    for i in lengths:
        for key, value in occurrences.items():
            if value == i:
                train_ids.append(key)
                del occurrences[key]  # prevents the same ID from being selected twice
                break
    train = table[table[col_identifier].isin(train_ids)]
    test = table[~table[col_identifier].isin(train_ids)]
    return train, test

if __name__ == '__main__':
    df = pd.DataFrame()
    df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    df['Measurement'] = np.random.random(10)
    train_part, test_part = train_test_split(df, 0.8, 'Group_ID')
Some remarks:
This is probably the least elegant way to do it...
It uses an ungodly amount of for loops and is probably slow for larger dataframes. It also doesn't randomize the split.
Much of this is because the dictionary mapping each group_id to its sample count can't simply be inverted, as several group_ids may share the same count. You could probably do this with numpy arrays as well, but I doubt that the overall structure would be much different.
First function taken from here: How to check if a sum is possible in array?
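If scikit-learn is acceptable, GroupShuffleSplit performs the same group-preserving split with randomization built in; a sketch using the Group_ID example from above (note that the split is taken over groups, so the realized row fraction may deviate slightly from 80/20):

import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame()
df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
df['Measurement'] = np.random.random(10)

# one random split that never separates rows sharing a Group_ID
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df['Group_ID']))
train_part, test_part = df.iloc[train_idx], df.iloc[test_idx]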

Pandas Aggregate columns dynamically

My goal is to aggregate data similarly to SAS's proc summary with types. My starting pandas dataframe could look like the one below, where the database has already done the original group by on all dimension/classification variables and applied an aggregate function to the measures.
So in sql this would look like
select gender, age, sum(height), sum(weight)
from db.table
group by gender, age
gender  age  height  weight
F       19   70      123
M       24   72      172
I then would like to summarize the data using pandas to calculate summary rows based on different group bys to come out with this.
gender  age  height  weight
.       .    142     295
.       19   70      123
.       24   72      172
F       .    70      123
M       .    72      172
F       19   70      123
M       24   72      172
The first row is the aggregate with no group by,
rows 2 and 3 are aggregated by age only,
rows 4 and 5 by gender only,
and then come the normal rows.
My current code looks like this
# normally dynamic, just hard coded for this example
measures = {'height':{'stat':'sum'}, 'weight':{'stat':'sum'}}
msr_config_dict = {}
for measure in measures:
    if measure in message_measures:
        stat = measures[measure]['stat']
        msr_config_dict[measure] = pd.NamedAgg(measure, stat)

# compute agg with no group by as starting point
df = self.df.agg(**msr_config_dict)

dimensions = ['gender', 'age']  # dimensions is also dynamic in real life
dim_vars = []
for dim in dimensions:
    dim_vars.append(dim)
    if len(dim_vars) > 1:
        # compute agg of compound dimensions
        df_temp = self.df.groupby(dim_vars, as_index=False).agg(msr_config_dict)
        df = df.append(df_temp, ignore_index=True)
    # always compute agg of solo dimension
    df_temp = self.df.groupby(dim, as_index=False).agg(msr_config_dict)
    df = df.append(df_temp, ignore_index=True)
With this code I get AttributeError: 'height' is not a valid function for 'Series' object
For the input to the agg function I have also tried
{'height':[('height', 'sum')], 'weight':[('weight', 'sum')]}, where I am trying to compute the sum of all heights and name the output height; this also raised an attribute error.
I know I will only ever compute one aggregate function per measure, so I would like to dynamically build the input to the pandas agg function and always rename the stat to itself, so that I can just append the result to the dataframe of summary rows I am building.
I am new to pandas coming from SAS background.
Any help would be much appreciated.
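For reference, a sketch of how a NamedAgg mapping like the one above can be passed without the error: groupby(...).agg accepts named aggregations as keyword arguments rather than as a positional dict (column names here assume the gender/height/weight example):

msr_config_dict = {'height': pd.NamedAgg('height', 'sum'),
                   'weight': pd.NamedAgg('weight', 'sum')}
# ** unpacks the dict into keyword arguments, i.e. named aggregation
df_temp = df.groupby('gender', as_index=False).agg(**msr_config_dict)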
IIUC:
cols = ['height', 'weight']
out = pd.concat([df[cols].sum(0).to_frame().T,
                 df.groupby('age')[cols].sum().reset_index(),
                 df.groupby('gender')[cols].sum().reset_index(),
                 df], ignore_index=True)[df.columns].fillna('.')
Output:
>>> out
gender age height weight
0 . . 142 295
1 . 19.0 70 123
2 . 24.0 72 172
3 F . 70 123
4 M . 72 172
5 F 19.0 70 123
6 M 24.0 72 172
Here is a more flexible solution, extending the solution of @Corralien. You can use itertools.combinations to create all the combinations of dimensions, for every possible combination length.
from itertools import combinations

# your input
measures = {'height':{'stat':'sum'}, 'weight':{'stat':'min'}}
dimensions = ['gender', 'age']

# flatten the nested dictionary
msr_config_dict = {key: val['stat'] for key, val in measures.items()}

# concat all possible aggregations
res = pd.concat(
    # case with everything aggregated
    [df.agg(msr_config_dict).to_frame().T]
    # cases with at least one column to aggregate over
    + [df.groupby(list(_dimCols)).agg(msr_config_dict).reset_index()
       # for combinations of length 1, 2, ... depending on the number of dimensions
       for nb_cols in range(1, len(dimensions))
       # all combinations of the specific length
       for _dimCols in combinations(dimensions, nb_cols)]
    # original dataframe
    + [df],
    ignore_index=True)[df.columns].fillna('.')
print(res)
# gender age height weight
# 0 . . 142 123
# 1 F . 70 123
# 2 M . 72 172
# 3 . 19.0 70 123
# 4 . 24.0 72 172
# 5 F 19.0 70 123
# 6 M 24.0 72 172

Performing pair-wise comparisons of some pandas dataframe rows as efficiently as possible

For a given pandas dataframe df, I would like to compare every sample (row) with every other one.
For bigger datasets this would lead to too many comparisons (n**2). Therefore, it is necessary to perform these comparisons only within smaller groups (i.e. between rows that share the same id) and as efficiently as possible.
I would like to construct a dataframe (df_pairs), which contains in every row one pair. Additionally, I would like to get all pair indices (ideally as a Python set).
First, I construct an example dataframe:
import numpy as np
import pandas as pd
from functools import reduce
from itertools import product, combinations
n_samples = 10_000
suffixes = ["_1", "_2"] # for df_pairs
id_str = "id"
df = pd.DataFrame({id_str: np.random.randint(0, 10, n_samples),
                   "A": np.random.randint(0, 100, n_samples),
                   "B": np.random.randint(0, 100, n_samples),
                   "C": np.random.randint(0, 100, n_samples)}, index=range(0, n_samples))
columns_df_pairs = ([elem + suffixes[0] for elem in df.columns] +
                    [elem + suffixes[1] for elem in df.columns])
In the following, I am comparing 4 different options with the corresponding performance measures:
Option 1
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [set(product(elem.tolist(), repeat=2)) for _, elem in groups.items()] # determine pairs per group
set_of_pairs = reduce(set.union, pairs_per_group) # convert all groups into one set
idcs1, idcs2 = zip(*[(e1, e2) for e1, e2 in set_of_pairs])
df_pairs = pd.DataFrame(np.hstack([df.values[idcs1, :], df.values[idcs2, :]]),  # construct the dataframe of pairs
                        columns=columns_df_pairs,
                        index=pd.MultiIndex.from_tuples(set_of_pairs, names=('index 1', 'index 2')))
df_pairs.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 1 takes 34.2 s ± 1.28 s.
Option 2
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array(np.meshgrid(elem.values, elem.values)).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs2 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs2.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 2 takes 13 s ± 1.34 s.
Option 3
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array([np.tile(elem.values, len(elem.values)), np.repeat(elem.values, len(elem.values))]).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs3 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs3.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 3 takes 12.1 s ± 347 ms.
Option 4
df_pairs4 = pd.merge(left=df, right=df, how="inner", on=id_str, suffixes=suffixes)
# here, I do not know how to get the MultiIndex in
df_pairs4.drop([id_str], inplace=True, axis=1)
Option 4 is computed the quickest with 1.41 s ± 239 ms. However, I do not have the paired indices in this case.
I could improve the performance a little by using combinations from itertools instead of product. I could also build the comparison matrix, use only its upper triangle, and construct my dataframe from there. This, however, does not seem to be more efficient than performing the cartesian product and removing the self-references as well as the inverse comparisons (a, b) = (b, a).
Could you tell me a more efficient way to get pairs for comparison (ideally as a set to be able to use set operations)?
Could I use merge or another pandas function to construct my desired dataframe with the multi-indices?
An inner merge will destroy the index in favor of a new Int64Index. If the index is important, bring it along as columns with reset_index, then set those columns back as the index.
df_pairs4 = (pd.merge(left=df.reset_index(), right=df.reset_index(),
                      how="inner", on=id_str, suffixes=suffixes)
             .set_index(['index_1', 'index_2']))
id A_1 B_1 C_1 A_2 B_2 C_2
index_1 index_2
0 0 4 92 79 10 92 79 10
13 4 92 79 10 83 68 69
24 4 92 79 10 67 73 90
25 4 92 79 10 22 31 35
36 4 92 79 10 64 44 20
... .. ... ... ... ... ... ...
9993 9971 7 20 65 92 47 65 21
9977 7 20 65 92 50 35 27
9980 7 20 65 92 43 36 62
9992 7 20 65 92 99 2 17
9993 7 20 65 92 20 65 92
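As a follow-up, the pair indices the question asks for can be read straight off this MultiIndex; a small sketch assuming df_pairs4 was built as above:

# set of (index_1, index_2) tuples, ready for set operations
pair_set = set(df_pairs4.index)

# optionally drop self-comparisons (i, i)
pair_set = {(i, j) for i, j in pair_set if i != j}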

Using distance matrix to find coordinate points of set of points

Given a distance matrix and a set of points, how do you figure out the coordinates of these points?
Edit: This is on a plane.
This question was answered here, but in trying different distance matrices I really couldn't use that answer, because the M matrix had negative values, as did my eigenvectors. So when the square root was taken, the program (in R) output "NaN" for those entries.
I'm guessing this will happen every time D(i,j)^2 is greater than D(1,j)^2 + D(i,1)^2.
For example, say I have a distance matrix:
0 73 102 496 432 184
73 0 303 392 436 233
102 303 0 366 207 353
496 392 366 0 172 103
432 436 207 172 0 352
184 233 353 103 352 0
Using the equation M(i,j) = (0.5)(D(1,j)^2+D(i,1)^2-D(i,j)^2), I get (which already has negative entries):
0 0.0 0.0 0.0 0.0 0.0
0 5329.0 -38038.0 48840.5 928.5 -7552.0
0 -38038.0 10404.0 61232.0 77089.5 -40174.5
0 48840.5 61232.0 246016.0 201528.0 134631.5
0 928.5 77089.5 201528.0 186624.0 48288.0
0 -7552.0 -40174.5 134631.5 48288.0 33856.0
Then I get non-zero eigenvalues and eigenvectors:
477718.27 101845.63 16474.30 -13116.72 -100692.49
[,1] [,2] [,3] [,4] [,5]
0.00000000 0.0000000 0.00000000 0.00000000 0.00000000
-0.05928626 0.3205747 0.84148945 0.04869546 -0.42806691
-0.16650486 -0.5670946 -0.04507520 -0.58222690 -0.55647098
-0.73371713 0.2827320 0.07386302 -0.45957443 0.40627254
-0.59727407 -0.4623603 0.07806418 0.64968004 -0.03617241
-0.27144823 0.5309625 -0.52755471 0.15920983 -0.58372335
Since there are both negative eigenvalues and negative eigenvector entries, when we compute
sqrt(eigenvector(i)*eigenvalue(i)) we end up taking square roots of negative values.
Here is my final output:
[,1] [,2] [,3] [,4] [,5]
0 0.0000 0.00000 0.00000 0.00000
NaN 180.6907 117.74103 NaN 207.61291
NaN NaN NaN 87.38939 236.71174
NaN 169.6910 34.88326 77.64089 NaN
NaN NaN 35.86158 NaN 60.35139
NaN 232.5429 NaN NaN 242.43877
Is this the only clear way of computing the coordinate points without using angles?
If it is, do we have to fix the distance matrix so D(i,j)^2 is not greater than D(1,j)^2 + D(i,1)^2?
Thanks.
Your data is inconsistent
Your coordinates are not consistent with positions of points in ℝ⁴, let alone a space of lower dimension. You can tell that fact by computing the Menger determinant of your squared distance matrix:
D <- as.matrix(read.table(textConnection("\
0 73 102 496 432 184
73 0 303 392 436 233
102 303 0 366 207 353
496 392 366 0 172 103
432 436 207 172 0 352
184 233 353 103 352 0")))
n <- nrow(D)
det(rbind(cbind(D^2, 1), c(rep(1, n), 0)))
# Result: 3.38761e+25
If your coordinates really came from points in a space of dimension less than five, then this determinant would have to be zero. As it is not, your distances are inconsistent, or the points form a simplex in a space of sufficiently high dimension.
But no matter the dimension, your data is still inconsistent, since it violates the triangle inequality in several cases:
a b c ac abc ab bc
1 2 4: 496 > 465 = 73 + 392
1 3 4: 496 > 468 = 102 + 366
1 3 5: 432 > 309 = 102 + 207
1 6 4: 496 > 287 = 184 + 103
2 1 3: 303 > 175 = 73 + 102
2 6 4: 392 > 336 = 233 + 103
3 1 6: 353 > 286 = 102 + 184
5 4 6: 352 > 275 = 172 + 103
Going from a to c directly can never take longer than going via b, but according to your data it does.
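For readers not using R, a small Python sketch of the same check; it scans every ordered triple of the distance matrix above and prints violations like the ones listed (a given violation may show up for more than one intermediate point b):

import numpy as np

D = np.array([[  0,  73, 102, 496, 432, 184],
              [ 73,   0, 303, 392, 436, 233],
              [102, 303,   0, 366, 207, 353],
              [496, 392, 366,   0, 172, 103],
              [432, 436, 207, 172,   0, 352],
              [184, 233, 353, 103, 352,   0]])

n = len(D)
for a in range(n):
    for b in range(n):
        for c in range(n):
            if len({a, b, c}) == 3 and D[a, c] > D[a, b] + D[b, c]:
                # going from a to c directly is longer than the detour via b
                print(f"{a + 1} {b + 1} {c + 1}: {D[a, c]} > {D[a, b] + D[b, c]} = {D[a, b]} + {D[b, c]}")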
Simple planar approach
If you had data consistent with points in the plane (i.e. all Menger determinants for combinations of four points evaluate to zero), you could use the following to obtain coordinates:
distance2coordinates <- function(D) {
  n <- nrow(D)
  maxDist <- which.max(D)
  p1 <- ((maxDist - 1) %% n) + 1
  p2 <- ((maxDist - 1) %/% n) + 1
  x2 <- D[p1, p2]
  r1sq <- D[p1,]^2
  r2sq <- D[p2,]^2
  x <- (r1sq - r2sq + x2^2)/(2*x2)
  y <- sqrt(r1sq - x^2)
  p3 <- which.max(y)
  x3 <- x[p3]
  y3 <- y[p3]
  plus <- abs(D[p3,]^2 - (x3 - x)^2 - (y3 - y)^2)
  minus <- abs(D[p3,]^2 - (x3 - x)^2 - (y3 + y)^2)
  y[minus < plus] <- -y[minus < plus]
  coords <- data.frame(x = x, y = y)
  return(coords)
}
The idea is that you choose two points with maximal distance as starting points. You place one at the origin and the other on the positive x axis. Then you can compute all other x coordinates from this, as the intersection of two circles, following the equations
I: x² + y² = r₁²
II: (x - x₂)² + y² = r₂²
I-II: 2*x*x₂ = r₁² - r₂² + x₂²
Given these x coordinates, you can obtain y coordinates as well, up to sign. You then choose a third point, sufficiently far away from either of these two starting points, to decide on the sign.
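As a tiny numeric illustration of equations I and II (my own example, not from the answer), take the 3-point distance matrix [[0, 3, 4], [3, 0, 5], [4, 5, 0]] and place point 1 at the origin and point 2 at (3, 0):

import math

x2 = 3.0                   # distance between the two anchor points
r1, r2 = 4.0, 5.0          # distances from point 3 to points 1 and 2

# I - II:  2*x*x2 = r1^2 - r2^2 + x2^2
x = (r1**2 - r2**2 + x2**2) / (2 * x2)   # -> 0.0
y = math.sqrt(r1**2 - x**2)              # -> 4.0
print(x, y)                # point 3 lands at (0, 4): the 3-4-5 triangle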
This approach makes no attempt at all to handle imprecise input. It assumes exact data, and will only use part of the distance matrix to find the points. It will not find the point set most closely matching all of the input data.
On your data, this will fail, since some arguments to the square root will be negative. This means that the two circles involved don't intersect at all, hence the triangle inequality is violated.
If it is, do we have to fix the distance matrix so D(i,j)^2 is not greater than D(1,j)^2 + D(i,1)^2.
D(i,j) ≤ D(i,k) + D(k,j) would help, i.e. for all triples and without squares. This would ensure that the triangle inequality holds everywhere. The result still need not be planar; for that you'd have to fix all those Menger determinants.
This is a simple Python function to calculate what you need, by solving intersections of hyperspheres.
import sympy
import numpy as np
def give_coords(distances):
    """Give coordinates of points for which distances are given.

    Coordinates are given relatively: 1st point at the origin, 2nd on the
    x-axis, 3rd in the x-y plane and so on. Maximum n-1 dimensions, where n
    is the number of points.

    Args:
        distances (list): an n x n 2d array where distances[i][j] gives the
            distance from i to j; assumed distances[i][j] == distances[j][i]

    Returns:
        numpy.ndarray: coordinates in list form, n dim

    Examples:
        >>> a = sympy.sqrt(2)
        >>> distances = [[0,1,1,1,1,1],
                         [1,0,a,a,a,a],
                         [1,a,0,a,a,a],
                         [1,a,a,0,a,a],
                         [1,a,a,a,0,a],
                         [1,a,a,a,a,0]]
        >>> give_coords(distances)
        array([[0, 0, 0, 0, 0],
               [1, 0, 0, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [0, 0, 0, 0, 1]], dtype=object)

        >>> give_coords([[0, 3, 4], [3, 0, 5], [4, 5, 0]])
        array([[0, 0],
               [3, 0],
               [0, 4]], dtype=object)
    """
    distances = np.array(distances)
    n = len(distances)
    X = sympy.symarray('x', (n, n - 1))
    for row in range(n):
        X[row, row:] = [0] * (n - 1 - row)
    for point2 in range(1, n):
        expressions = []
        for point1 in range(point2):
            expression = np.sum((X[point1] - X[point2]) ** 2)
            expression -= distances[point1, point2] ** 2
            expressions.append(expression)
        X[point2, :point2] = sympy.solve(expressions, list(X[point2, :point2]))[1]
    return X
This is Solvable
If you would like to see cartesian-type coordinates which satisfy the distance matrix you provided in your question, please view the following image.
[image: distances matrix and coordinates]
Your input matrix gives the distances between 6 nodes, which we shall call a, b, c, d, e, and f. A total of 5 dimensions are required to assign coordinates to all six nodes satisfying your distance matrix. Two of these dimensions are imaginary valued -- a consequence of breaking the triangle inequality. The results were arrived at using the law of cosines and a bit of number crunching.
a (0, 0, 0, 0, 0)
b (73, 0, 0, 0, 0)
c (-521.07, 510.99i, 0, 0, 0)
d (669.05, -802.08i, 664.62, 0, 0)
e (12.72, -163.83i, 488.13, 158.01i, 0)
f (-103.45, 184.11i, 84.52, 138.06i, 262.62)

How can I plot an iteration coordinate(x,y) data with only label in GNUplot?

I have a set of data like this:
0 268 195
1 353 199
2 318 209
3 268 232
4 370 238
5 326 253
6 246 265
7 372 284
8 313 290
9 258 297
0 268 196
1 353 199
2 318 209
3 268 233
4 370 238
5 325 253
6 246 265
7 372 284
8 313 290
9 258 297
I would like to use the first column for the label and the second and third columns for the (x,y) plot; however, I would like to plot each label only once, not on every iteration. How should I do this?
Thank you for help.
Do you want something like:
plot 'datafile' u 2:3:1 with labels
... I'm not really sure what you mean by "plot each label only once, not on every iteration" ...
It looks to me like you want to keep only one instance of each unique label, e.g. only one label that is 0, only one that is 1, and so on. For simplicity, I'll take the first occurrence with a small Python script:
#test.py
import sys

seen = set()
with open(sys.argv[1]) as f:
    for line in f:
        num, rest = line.split(None, 1)
        if num not in seen:
            seen.add(num)
            sys.stdout.write(line)
Now we can plot our file in gnuplot:
plot '< python test.py yourdatafile' u 2:3:1 w labels
Here's a version of test.py which will average the positions of all the labels with the same "value".
import sys
from collections import defaultdict

d = defaultdict(list)
with open(sys.argv[1]) as f:
    for line in f:
        num, x, y = map(int, line.split())
        d[num].append((x, y))

# now average
for k, v in d.items():
    x, y = zip(*v)
    avg_x = float(sum(x)) / len(x)
    avg_y = float(sum(y)) / len(y)
    print(k, avg_x, avg_y)
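The averaged file can then be plotted exactly as before:
plot '< python test.py yourdatafile' u 2:3:1 w labels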
