I have a working code to make some calculations and create a dataframe, however it take a considerable amount of time when the number if id:s considered grows (actually, time increases exponentially).
So, here if the situation: I have a dataframe consisting of vectors, one for each id:
id vector
0 A4070270297516241 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,...
1 A4060461064716279 [0, 2, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 A4050500015016271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, ...
3 A4050494283416274 [15, 13, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
4 A4050500876316279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 A4050494111016270 [6, 10, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
6 A4050470673516272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 A4060461035616276 [0, 0, 0, 11, 0, 15, 13, 0, 5, 3, 0, 0, 0, 0, ...
8 A4050500809916271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 A4050500822216279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, ...
10 A4050494817416277 [0, 0, 0, 0, 0, 4, 9, 0, 5, 8, 0, 15, 0, 0, 8,...
11 A4060462005116279 [15, 12, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
12 A4050500802316278 [0, 0, 0, 0, 0, 1, 2, 0, 2, 2, 0, 15, 12, 0, 8...
13 A4050500841416272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 5, ...
14 A4050494856516271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, ...
15 A4060462230216270 [0, 0, 2, 2, 15, 15, 10, 0, 0, 0, 0, 0, 0, 0, ...
16 A4090150867216273 [0, 0, 0, 0, 0, 0, 0, 13, 6, 3, 0, 2, 0, 15, 4...
17 A4060464010916275 [0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
18 A4139311891213540 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
19 A4050500938416279 [0, 10, 11, 6, 6, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0...
20 A4620451871516274 [0, 0, 0, 0, 15, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
21 A4060460331116279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 15, 0, 2,...
I provide a dict at the end of the question to avoid clutter.
Now, what I do is that I determine, for each id which other id is the closest by calculating a weighted distance between each vector and creating a dataframe storing the infomation:
ids = list(set(df1.id))
Closest_identifier = pd.DataFrame(columns = ['id','Identifier','Closest identifier','distance'])
and my code goes like this:
import time
t = time.process_time()
for idnr in ids:
df_identifier = df1[df1['id'] == idnr]
identifier = df_identifier['vector'].to_list()
base_identifier = np.array(df_identifier['vector'].to_numpy().tolist())
Number_of_devices =len(np.nonzero(identifier)[1])
df_other_identifier = df1[df1['id'] != idnr]
other_id = list(set(df_other_identifier.id))
for id_other in other_id:
gf_identifier = df_other_identifier[df_other_identifier['id']==id_other]
identifier_other = np.array(gf_identifier['vector'].to_numpy().tolist())
dist = np.sqrt(np.sum((base_identifier - identifier_other)**2/Number_of_devices))
Closest_identifier = Closest_identifier.append({'id':id_other,'Identifier':base_identifier, 'Closest identifier':identifier_other, 'distance':dist},ignore_index=True)
elapsed_time = time.process_time() - t
print(elapsed_time)
6.0625
To explain what is happening: In the fisrt part of the code, I choose an id and set up all the infomation I need. The number of devices is the number of non zero values of the vector associated to that id (i.e., the number of devices that detected the object with that id). In the second part I compute the distance of that id to all other objects.
So, for each id, I have n-1 rows, where n is the length of my id set. So, for 50 ids, I have 50*50-50 = 2450 rows
The time given here is for 50 ids. For 200, the time for the loops to finish is 120 s, for 400 the time is 871 s. As you can see, time grows exponentially here. I have 1700 ids and it'll take days for this to complete.
My questions is thus: Is there a more efficient way to do this?
Grateful for insights.
Test data
{'id': {0: 'A4070270297516241',
1: 'A4060461064716279',
2: 'A4050500015016271',
3: 'A4050494283416274',
4: 'A4050500876316279',
5: 'A4050494111016270',
6: 'A4050470673516272',
7: 'A4060461035616276',
8: 'A4050500809916271',
9: 'A4050500822216279',
10: 'A4050494817416277',
11: 'A4060462005116279',
12: 'A4050500802316278',
13: 'A4050500841416272',
14: 'A4050494856516271',
15: 'A4060462230216270',
16: 'A4090150867216273',
17: 'A4060464010916275',
18: 'A4139311891213540',
19: 'A4050500938416279',
20: 'A4620451871516274',
21: 'A4060460331116279',
22: 'A4060454590916277',
23: 'A4060454778016276',
24: 'A4060462019716270',
25: 'A4050500945416277',
26: 'A4050494267716279',
27: 'A4090281644816244',
28: 'A4050500929516270',
29: 'N4010442537213363',
30: 'A4050500938216277',
31: 'A4060454598916275',
32: 'A4050494086216273',
33: 'A4060462859616271',
34: 'A4060454600116271',
35: 'A4050494551816276',
36: 'A4610490015816279',
37: 'A4060454605416279',
38: 'A4060454665916270',
39: 'A4060454579316278',
40: 'A4060464023516275',
41: 'A4050500588616272',
42: 'A4050500905516274',
43: 'A4070262442416243',
44: 'A4050500946716271',
45: 'A4070271195016244',
46: 'A4060454663216271',
47: 'A4060454590416272',
48: 'A4060461993616279',
49: 'N4010442139713366'},
'vector': {0: [0,
0,
0,
0,
0,
0,
0,
0,
7,
4,
0,
0,
0,
13,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
1: [0,
2,
0,
0,
6,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
9,
14],
2: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
3,
0,
0,
0,
5,
12,
15,
2,
0,
0,
0,
0,
0,
0],
3: [15,
13,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
5],
4: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
15,
15,
0,
0,
0,
0,
0,
0],
5: [6,
10,
1,
0,
2,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
5,
10,
13,
15],
6: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
15,
7,
2,
0,
0,
0],
7: [0,
0,
0,
11,
0,
15,
13,
0,
5,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
8: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
3,
10,
2,
0,
0],
9: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
8,
15,
5,
0,
0,
0,
0,
0,
0,
0,
0,
0],
10: [0,
0,
0,
0,
0,
4,
9,
0,
5,
8,
0,
15,
0,
0,
8,
2,
0,
0,
0,
0,
0,
0,
0,
5,
0,
0],
11: [15,
12,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
0,
6,
2],
12: [0,
0,
0,
0,
0,
1,
2,
0,
2,
2,
0,
15,
12,
0,
8,
1,
9,
2,
0,
0,
0,
0,
0,
0,
0,
0],
13: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
6,
0,
5,
3,
11,
15,
11,
1,
0,
0,
0,
0,
0,
0],
14: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
5,
0,
3,
0,
7,
12,
14,
1,
0,
0,
0,
0,
0,
0],
15: [0,
0,
2,
2,
15,
15,
10,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
15,
2,
15],
16: [0,
0,
0,
0,
0,
0,
0,
13,
6,
3,
0,
2,
0,
15,
4,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
17: [0,
0,
0,
0,
7,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
15,
15,
8,
2],
18: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
12,
9,
2,
0,
0],
19: [0,
10,
11,
6,
6,
2,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4],
20: [0,
0,
0,
0,
15,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
4,
14,
13,
11],
21: [0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
0,
5,
15,
0,
2,
3,
10,
9,
0,
0,
0,
0,
0,
0,
0,
0],
22: [0,
0,
0,
2,
7,
15,
15,
0,
2,
3,
0,
15,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
0,
4],
23: [2,
15,
15,
4,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
11],
24: [0,
0,
0,
0,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
6,
15,
14,
2],
25: [0,
9,
13,
15,
6,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4],
26: [0,
0,
0,
1,
2,
8,
15,
0,
1,
4,
0,
15,
1,
0,
6,
0,
0,
0,
0,
0,
0,
0,
0,
8,
0,
0],
27: [0,
0,
0,
0,
0,
0,
0,
0,
2,
1,
0,
0,
0,
5,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
28: [0,
7,
9,
6,
6,
4,
7,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
5],
29: [8,
6,
2,
2,
6,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
7,
11],
30: [0,
10,
11,
15,
6,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4],
31: [6,
15,
6,
0,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
3,
8],
32: [0,
0,
0,
0,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
2,
11,
8,
9,
2],
33: [0,
0,
0,
0,
0,
11,
15,
0,
2,
4,
0,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
34: [4,
15,
15,
5,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
2,
1,
8],
35: [2,
1,
1,
0,
12,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
9,
15,
14,
12],
36: [0,
0,
0,
0,
0,
0,
5,
15,
4,
2,
0,
0,
0,
1,
2,
0,
3,
0,
0,
0,
0,
0,
0,
0,
0,
0],
37: [0,
0,
0,
0,
0,
11,
15,
0,
15,
15,
0,
14,
0,
3,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4],
38: [0,
0,
0,
0,
0,
3,
14,
0,
10,
15,
0,
14,
0,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
39: [0,
0,
2,
0,
8,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
15,
15,
5],
40: [0,
0,
0,
3,
0,
4,
10,
5,
15,
14,
0,
2,
2,
13,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
41: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
15,
10,
0,
0,
0,
0,
0,
0],
42: [0,
0,
0,
0,
0,
0,
0,
1,
0,
2,
0,
2,
0,
10,
15,
14,
6,
2,
0,
0,
0,
0,
0,
0,
0,
0],
43: [0,
0,
0,
0,
0,
0,
3,
2,
7,
8,
0,
2,
0,
15,
8,
2,
4,
0,
0,
0,
0,
0,
0,
0,
0,
0],
44: [0,
0,
0,
0,
0,
0,
0,
0,
0,
5,
0,
11,
15,
0,
3,
0,
13,
12,
0,
0,
0,
0,
0,
0,
0,
0],
45: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
3,
11,
0,
0,
0,
1,
0,
0,
0,
0,
0,
0,
0,
0,
0],
46: [0,
0,
0,
0,
0,
3,
11,
0,
15,
15,
0,
15,
2,
9,
5,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0],
47: [0,
2,
3,
0,
15,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
4,
15,
15,
6],
48: [0,
0,
0,
0,
2,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
3,
0,
11,
7,
9,
3],
49: [0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
12,
15,
9,
0,
0,
0]}}
Try:
# id1: [0, 0, 0, 1, 1, 1]
m = np.repeat(np.vstack(df1['vector']), df1.shape[0], axis=0)
# id2: [0, 1, 0, 1, 0, 1]
n = np.tile(np.vstack(df1['vector']), (df1.shape[0], 1))
# number of devices for each vector of m
d = np.count_nonzero(m, axis=1, keepdims=True)
# compute the distance
dist = np.sqrt(np.sum((m - n)**2/d, axis=-1))
# create the final dataframe
mi = pd.MultiIndex.from_product([df1['id']] * 2, names=['id1', 'id2'])
out = pd.DataFrame({'vector1': m.tolist(),
'vector2': n.tolist(),
'distance': dist}, index=mi).reset_index()
Output:
>>> out
id1 id2 vector1 vector2 distance
0 A4070270297516241 A4070270297516241 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... 0.000000
1 A4070270297516241 A4060461064716279 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 2, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 13.747727
2 A4070270297516241 A4050500015016271 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, ... 14.628739
3 A4070270297516241 A4050494283416274 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [15, 13, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... 15.033296
4 A4070270297516241 A4050500876316279 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 15.099669
... ... ... ... ... ...
2495 N4010442139713366 A4070271195016244 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 11, 0, 0,... 13.916417
2496 N4010442139713366 A4060454663216271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 3, 11, 0, 15, 15, 0, 15, 2, 9,... 21.330729
2497 N4010442139713366 A4060454590416272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 2, 3, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 19.304576
2498 N4010442139713366 A4060461993616279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 12.288206
2499 N4010442139713366 N4010442139713366 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 0.000000
[2500 rows x 5 columns]
Performance
%timeit loop_op()
3.75 s ± 88.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vect_corralien()
4.16 ms ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can calculate the squared distances with broadcasting. After you have that you just have to find num_devices for every row and use it to calculate your custom distance. After filling the diagonal with infinite values you can get the minimum index of every row which gives you the closest device.
arr = np.array([value for value in data['vector'].values()])
squared_distances = np.power((arr[:,None,:] - arr[None,:,:]).sum(axis=-1), 2)
num_devices = (arr != 0).sum(axis=1)
distances = np.sqrt(squared_distances / num_devices )
np.fill_diagonal(distances, np.inf)
closest_indentifiers = distances.argmin(axis=0)
You can format the output of the program as you desire