I'm trying to train a model in Keras to suggest the best possible next move when presented with a pawn chess board. The board is represented as a list of 64 integers (0 for empty, 1 for player, 2 for enemy). The output is a field plus the direction that the figure on that field should move in, which means I need two output layers of size 64 (number of fields) and 5 (number of possible move directions, including two forward moves and a "no move" for when the game is over).
I have a list of boards and a list of solutions. When I try to fit the model, however, I get the error below.
The exact error message is:
Epoch 1/75
Traceback (most recent call last):
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\main.py", line 75, in <module>
model.fit(train_fig_starts, train_fig_moves, epochs=75)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\lulll\AppData\Local\Temp\__autograph_generated_filej0zia4d5.py", line 15, in tf__train_function
retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
ValueError: in user code:
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1249, in train_function *
return step_function(self, iterator)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1233, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1222, in run_step **
outputs = model.train_step(data)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1024, in train_step
loss = self.compute_loss(x, y, y_pred, sample_weight)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\training.py", line 1082, in compute_loss
return self.compiled_loss(
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\engine\compile_utils.py", line 265, in __call__
loss_value = loss_obj(y_t, y_p, sample_weight=sw)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\losses.py", line 152, in __call__
losses = call_fn(y_true, y_pred)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\losses.py", line 284, in call **
return ag_fn(y_true, y_pred, **self._fn_kwargs)
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\losses.py", line 2176, in binary_crossentropy
backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
File "C:\Users\lulll\Documents\CodeStuff\tfTesting\venv\lib\site-packages\keras\backend.py", line 5688, in binary_crossentropy
bce = target * tf.math.log(output + epsilon())
ValueError: Dimensions must be equal, but are 2 and 64 for '{{node binary_crossentropy/mul}} = Mul[T=DT_FLOAT](binary_crossentropy/Cast, binary_crossentropy/Log)' with input shapes: [?,2], [?,64].
I have absolutely no idea what is causing this. I've searched for the error already, but the only mentions I've found seem to be describing a completely different scenario.
Since it probably helps, here's the code used to create and fit the model:
inputs = tf.keras.layers.Input(shape=64)
x = tf.keras.layers.Dense(32, activation='relu')(inputs)
out_field = tf.keras.layers.Dense(64, name="field")(x)
out_movement = tf.keras.layers.Dense(5, name="movement")(x)
model = tf.keras.Model(inputs=inputs, outputs=[out_field, out_movement])
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
model.fit(train_fig_starts, train_fig_moves, epochs=75) #train_fig_starts and moves are defined above
EDIT 1: Here's a sample of the dataset I'm using (the whole thing is too long for the character limit)
train_fig_starts = [[0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 0, 1, 2, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 2, 1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 0], [0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 2, 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 2, 2, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
train_fig_moves = [[0, 0], [0, 0], [0, 0], [0, 0], [15, 2], [15, 2]]
EDIT 2:
I changed the loss to SparseCategoricalCrossentropy since that seems more like what I'm looking for. This is now the model code:
inputs = tf.keras.layers.Input(shape=64)
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
out_field = tf.keras.layers.Dense(64, activation="relu", name="field")(x)
out_field = tf.keras.layers.Dense(64, activation="softmax", name="field_softmax")(out_field)
out_movement = tf.keras.layers.Dense(5, activation="relu", name="movement")(x)
out_movement = tf.keras.layers.Dense(5, activation="softmax", name="movement_softmax")(out_movement)
model = tf.keras.Model(inputs=inputs, outputs=[out_field, out_movement])
print(model.summary())
tf.keras.utils.plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
model.compile(optimizer='adam',
              loss=[tf.keras.losses.SparseCategoricalCrossentropy(),
                    tf.keras.losses.SparseCategoricalCrossentropy()],
              metrics=['accuracy'])
It still throws an error; this time it's the following:
Node: 'sparse_categorical_crossentropy_1/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits'
logits and labels must have the same first dimension, got logits shape [32,5] and labels shape [64]
[[{{node sparse_categorical_crossentropy_1/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]] [Op:__inference_train_function_1666]
I have no idea why it's like that. Output logits and labels should both be [64, 2]. Since I'm using sparse crossentropy, I should be able to use integers in my training data to signify the "index" of the output neuron with the highest logit, right? Correct me if I'm wrong. If it helps, here's a diagram of my model:
plot of the model
So I fixed the issue by myself now. Honestly it was a pretty simple mistake to make, but the error messages didn't really explain what was going on. I swapped the outputs to one-hot encoding and changed the loss to CategoricalCrossentropy, which is also more fitting for a categorisation problem (Sparse didn't work with my integers for some reason). After that I needed to change the label list from a 1-dim list containing lists of len = 2 to a 2-dim list containing the field one-hots and the move one-hots in separate lists. If anyone runs into a similar issue and can't make sense of it, maybe this will help.
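A minimal, self-contained sketch of that fix (the boards here are random stand-ins; the labels and layer sizes follow EDIT 2 above):

```python
import numpy as np
import tensorflow as tf

# Stand-in boards (6 samples of 64 fields) and the [field, move] labels from the question
train_fig_starts = np.random.randint(0, 3, size=(6, 64)).astype('float32')
train_fig_moves = np.array([[0, 0], [0, 0], [0, 0], [0, 0], [15, 2], [15, 2]])

# One-hot encode each label component separately
field_onehot = tf.keras.utils.to_categorical(train_fig_moves[:, 0], num_classes=64)
move_onehot = tf.keras.utils.to_categorical(train_fig_moves[:, 1], num_classes=5)

inputs = tf.keras.layers.Input(shape=(64,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
out_field = tf.keras.layers.Dense(64, activation='softmax', name='field')(x)
out_movement = tf.keras.layers.Dense(5, activation='softmax', name='movement')(x)
model = tf.keras.Model(inputs=inputs, outputs=[out_field, out_movement])
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])

# The labels are now a list of two arrays, shapes (6, 64) and (6, 5),
# one per output head -- not a single list of [field, move] pairs
model.fit(train_fig_starts, [field_onehot, move_onehot], epochs=2, verbose=0)
```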
I have working code that does some calculations and creates a dataframe; however, it takes a considerable amount of time when the number of ids considered grows (in fact, time increases quadratically).
So, here is the situation: I have a dataframe consisting of vectors, one for each id:
id vector
0 A4070270297516241 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,...
1 A4060461064716279 [0, 2, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 A4050500015016271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, ...
3 A4050494283416274 [15, 13, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
4 A4050500876316279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 A4050494111016270 [6, 10, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
6 A4050470673516272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 A4060461035616276 [0, 0, 0, 11, 0, 15, 13, 0, 5, 3, 0, 0, 0, 0, ...
8 A4050500809916271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 A4050500822216279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, ...
10 A4050494817416277 [0, 0, 0, 0, 0, 4, 9, 0, 5, 8, 0, 15, 0, 0, 8,...
11 A4060462005116279 [15, 12, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
12 A4050500802316278 [0, 0, 0, 0, 0, 1, 2, 0, 2, 2, 0, 15, 12, 0, 8...
13 A4050500841416272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 5, ...
14 A4050494856516271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, ...
15 A4060462230216270 [0, 0, 2, 2, 15, 15, 10, 0, 0, 0, 0, 0, 0, 0, ...
16 A4090150867216273 [0, 0, 0, 0, 0, 0, 0, 13, 6, 3, 0, 2, 0, 15, 4...
17 A4060464010916275 [0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
18 A4139311891213540 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
19 A4050500938416279 [0, 10, 11, 6, 6, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0...
20 A4620451871516274 [0, 0, 0, 0, 15, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
21 A4060460331116279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 15, 0, 2,...
I provide a dict at the end of the question to avoid clutter.
Now, what I do is determine, for each id, which other id is the closest by calculating a weighted distance between each pair of vectors, and I create a dataframe storing the information:
ids = list(set(df1.id))
Closest_identifier = pd.DataFrame(columns = ['id','Identifier','Closest identifier','distance'])
and my code goes like this:
import time
t = time.process_time()
for idnr in ids:
    df_identifier = df1[df1['id'] == idnr]
    identifier = df_identifier['vector'].to_list()
    base_identifier = np.array(df_identifier['vector'].to_numpy().tolist())
    Number_of_devices = len(np.nonzero(identifier)[1])
    df_other_identifier = df1[df1['id'] != idnr]
    other_id = list(set(df_other_identifier.id))
    for id_other in other_id:
        gf_identifier = df_other_identifier[df_other_identifier['id'] == id_other]
        identifier_other = np.array(gf_identifier['vector'].to_numpy().tolist())
        dist = np.sqrt(np.sum((base_identifier - identifier_other)**2 / Number_of_devices))
        Closest_identifier = Closest_identifier.append({'id': id_other, 'Identifier': base_identifier, 'Closest identifier': identifier_other, 'distance': dist}, ignore_index=True)
elapsed_time = time.process_time() - t
print(elapsed_time)
6.0625
To explain what is happening: in the first part of the code, I choose an id and set up all the information I need. The number of devices is the number of non-zero values of the vector associated with that id (i.e., the number of devices that detected the object with that id). In the second part, I compute the distance from that id to all other objects.
So, for each id, I have n-1 rows, where n is the length of my id set. So, for 50 ids, I have 50*50-50 = 2450 rows
The time given here is for 50 ids. For 200, the loops take 120 s to finish; for 400, 871 s. As you can see, time grows roughly quadratically, since the number of pairs is O(n²). I have 1700 ids, so this would take days to complete.
My question is thus: is there a more efficient way to do this?
Grateful for insights.
Test data
{'id': {0: 'A4070270297516241', 1: 'A4060461064716279', 2: 'A4050500015016271',
        3: 'A4050494283416274', 4: 'A4050500876316279', 5: 'A4050494111016270',
        6: 'A4050470673516272', 7: 'A4060461035616276', 8: 'A4050500809916271',
        9: 'A4050500822216279', 10: 'A4050494817416277', 11: 'A4060462005116279',
        12: 'A4050500802316278', 13: 'A4050500841416272', 14: 'A4050494856516271',
        15: 'A4060462230216270', 16: 'A4090150867216273', 17: 'A4060464010916275',
        18: 'A4139311891213540', 19: 'A4050500938416279', 20: 'A4620451871516274',
        21: 'A4060460331116279', 22: 'A4060454590916277', 23: 'A4060454778016276',
        24: 'A4060462019716270', 25: 'A4050500945416277', 26: 'A4050494267716279',
        27: 'A4090281644816244', 28: 'A4050500929516270', 29: 'N4010442537213363',
        30: 'A4050500938216277', 31: 'A4060454598916275', 32: 'A4050494086216273',
        33: 'A4060462859616271', 34: 'A4060454600116271', 35: 'A4050494551816276',
        36: 'A4610490015816279', 37: 'A4060454605416279', 38: 'A4060454665916270',
        39: 'A4060454579316278', 40: 'A4060464023516275', 41: 'A4050500588616272',
        42: 'A4050500905516274', 43: 'A4070262442416243', 44: 'A4050500946716271',
        45: 'A4070271195016244', 46: 'A4060454663216271', 47: 'A4060454590416272',
        48: 'A4060461993616279', 49: 'N4010442139713366'},
 'vector': {0: [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            1: [0, 2, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 9, 14],
            2: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 5, 12, 15, 2, 0, 0, 0, 0, 0, 0],
            3: [15, 13, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 5],
            4: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 15, 0, 0, 0, 0, 0, 0],
            5: [6, 10, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 10, 13, 15],
            6: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 15, 7, 2, 0, 0, 0],
            7: [0, 0, 0, 11, 0, 15, 13, 0, 5, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            8: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 10, 2, 0, 0],
            9: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 15, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            10: [0, 0, 0, 0, 0, 4, 9, 0, 5, 8, 0, 15, 0, 0, 8, 2, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0],
            11: [15, 12, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 6, 2],
            12: [0, 0, 0, 0, 0, 1, 2, 0, 2, 2, 0, 15, 12, 0, 8, 1, 9, 2, 0, 0, 0, 0, 0, 0, 0, 0],
            13: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 5, 3, 11, 15, 11, 1, 0, 0, 0, 0, 0, 0],
            14: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, 0, 7, 12, 14, 1, 0, 0, 0, 0, 0, 0],
            15: [0, 0, 2, 2, 15, 15, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 2, 15],
            16: [0, 0, 0, 0, 0, 0, 0, 13, 6, 3, 0, 2, 0, 15, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            17: [0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 15, 8, 2],
            18: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 12, 9, 2, 0, 0],
            19: [0, 10, 11, 6, 6, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
            20: [0, 0, 0, 0, 15, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 14, 13, 11],
            21: [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 15, 0, 2, 3, 10, 9, 0, 0, 0, 0, 0, 0, 0, 0],
            22: [0, 0, 0, 2, 7, 15, 15, 0, 2, 3, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 4],
            23: [2, 15, 15, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 11],
            24: [0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 15, 14, 2],
            25: [0, 9, 13, 15, 6, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
            26: [0, 0, 0, 1, 2, 8, 15, 0, 1, 4, 0, 15, 1, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0],
            27: [0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 5, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            28: [0, 7, 9, 6, 6, 4, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5],
            29: [8, 6, 2, 2, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 11],
            30: [0, 10, 11, 15, 6, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
            31: [6, 15, 6, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 8],
            32: [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 11, 8, 9, 2],
            33: [0, 0, 0, 0, 0, 11, 15, 0, 2, 4, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            34: [4, 15, 15, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 8],
            35: [2, 1, 1, 0, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 15, 14, 12],
            36: [0, 0, 0, 0, 0, 0, 5, 15, 4, 2, 0, 0, 0, 1, 2, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            37: [0, 0, 0, 0, 0, 11, 15, 0, 15, 15, 0, 14, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
            38: [0, 0, 0, 0, 0, 3, 14, 0, 10, 15, 0, 14, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            39: [0, 0, 2, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 15, 15, 5],
            40: [0, 0, 0, 3, 0, 4, 10, 5, 15, 14, 0, 2, 2, 13, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            41: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 10, 0, 0, 0, 0, 0, 0],
            42: [0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 2, 0, 10, 15, 14, 6, 2, 0, 0, 0, 0, 0, 0, 0, 0],
            43: [0, 0, 0, 0, 0, 0, 3, 2, 7, 8, 0, 2, 0, 15, 8, 2, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            44: [0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 11, 15, 0, 3, 0, 13, 12, 0, 0, 0, 0, 0, 0, 0, 0],
            45: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 11, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            46: [0, 0, 0, 0, 0, 3, 11, 0, 15, 15, 0, 15, 2, 9, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            47: [0, 2, 3, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 15, 15, 6],
            48: [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 11, 7, 9, 3],
            49: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 15, 9, 0, 0, 0]}}
Try:
# id1: [0, 0, 0, 1, 1, 1]
m = np.repeat(np.vstack(df1['vector']), df1.shape[0], axis=0)
# id2: [0, 1, 0, 1, 0, 1]
n = np.tile(np.vstack(df1['vector']), (df1.shape[0], 1))
# number of devices for each vector of m
d = np.count_nonzero(m, axis=1, keepdims=True)
# compute the distance
dist = np.sqrt(np.sum((m - n)**2/d, axis=-1))
# create the final dataframe
mi = pd.MultiIndex.from_product([df1['id']] * 2, names=['id1', 'id2'])
out = pd.DataFrame({'vector1': m.tolist(),
                    'vector2': n.tolist(),
                    'distance': dist}, index=mi).reset_index()
Output:
>>> out
id1 id2 vector1 vector2 distance
0 A4070270297516241 A4070270297516241 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... 0.000000
1 A4070270297516241 A4060461064716279 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 2, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 13.747727
2 A4070270297516241 A4050500015016271 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, ... 14.628739
3 A4070270297516241 A4050494283416274 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [15, 13, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... 15.033296
4 A4070270297516241 A4050500876316279 [0, 0, 0, 0, 0, 0, 0, 0, 7, 4, 0, 0, 0, 13, 0,... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 15.099669
... ... ... ... ... ...
2495 N4010442139713366 A4070271195016244 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 11, 0, 0,... 13.916417
2496 N4010442139713366 A4060454663216271 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 3, 11, 0, 15, 15, 0, 15, 2, 9,... 21.330729
2497 N4010442139713366 A4060454590416272 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 2, 3, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 19.304576
2498 N4010442139713366 A4060461993616279 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 12.288206
2499 N4010442139713366 N4010442139713366 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 0.000000
[2500 rows x 5 columns]
Performance
%timeit loop_op()
3.75 s ± 88.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit vect_corralien()
4.16 ms ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
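Since the question ultimately asks for the closest identifier per id, `out` can be reduced to one row per id with a groupby. A sketch on a toy frame with the same columns (in the real `out`, the self-pairs with distance 0 should be dropped first with `out[out['id1'] != out['id2']]`):

```python
import pandas as pd

# Toy stand-in for `out`, self-pairs already excluded
out = pd.DataFrame({
    'id1': ['a', 'a', 'b', 'b', 'c', 'c'],
    'id2': ['b', 'c', 'a', 'c', 'a', 'b'],
    'distance': [1.5, 0.7, 1.5, 2.0, 0.7, 2.0],
})

# For each id1, keep the row with the smallest distance
closest = out.loc[out.groupby('id1')['distance'].idxmin()]
print(closest)
```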
You can calculate the squared distances with broadcasting. After you have that you just have to find num_devices for every row and use it to calculate your custom distance. After filling the diagonal with infinite values you can get the minimum index of every row which gives you the closest device.
arr = np.array(list(data['vector'].values()))
# square the element-wise differences first, then sum over the vector axis
squared_distances = np.power(arr[:, None, :] - arr[None, :, :], 2).sum(axis=-1)
num_devices = (arr != 0).sum(axis=1)
# divide each row by the device count of its base id, matching the original formula
distances = np.sqrt(squared_distances / num_devices[:, None])
np.fill_diagonal(distances, np.inf)
closest_identifiers = distances.argmin(axis=1)
You can then format the output of the program as you desire.
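For example, the argmin indices can be mapped back to ids roughly like this (a sketch on a three-id toy dict standing in for `data['vector']`):

```python
import numpy as np

# Toy vectors keyed by id
vectors = {'id_a': [1, 0, 2], 'id_b': [1, 0, 3], 'id_c': [9, 9, 9]}
ids = list(vectors)
arr = np.array(list(vectors.values()), dtype=float)

# Squared differences summed over the vector axis, then the weighted distance
squared_distances = ((arr[:, None, :] - arr[None, :, :]) ** 2).sum(axis=-1)
num_devices = (arr != 0).sum(axis=1)
distances = np.sqrt(squared_distances / num_devices[:, None])

# Exclude self-distances, then map each row's argmin back to an id
np.fill_diagonal(distances, np.inf)
closest = {ids[i]: ids[j] for i, j in enumerate(distances.argmin(axis=1))}
print(closest)  # {'id_a': 'id_b', 'id_b': 'id_a', 'id_c': 'id_b'}
```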