The documentation for scipy's binned_statistic_2d function gives an example for a 2D histogram:
from scipy import stats
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
Makes sense, but I'm now trying to implement a custom function. The custom function description is given as:
function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
I wasn't sure exactly how to implement this, so I thought I'd check my understanding by writing a custom function that reproduces the count option. I tried
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, None, custom_func, bins=[binx, biny])
but this generates an error like so:
556 # Make sure `values` match `sample`
557 if(statistic != 'count' and Vlen != Dlen):
558 raise AttributeError('The number of `values` elements must match the '
559 'length of each `sample` dimension.')
561 try:
562 M = len(bins)
AttributeError: The number of `values` elements must match the length of each `sample` dimension.
How is this custom function supposed to be defined?
The reason for this error is that when using a custom statistic function (or any non-count statistic), you have to pass some array or list of arrays to the values parameter (with the number of elements matching the number in x). You can't just leave it as None as in your example, even though it is irrelevant and does not get used when computing counts of data points in each bin.
So, to match the results, you can just pass the same x object to the values parameter:
def custom_func(values):
    return len(values)
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = stats.binned_statistic_2d(x, y, x, custom_func, bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
The result matches that of the count statistic:
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=[binx, biny])
print(ret)
# BinnedStatistic2dResult(statistic=array([[2., 1.],
# [1., 0.]]), x_edge=array([0. , 0.5, 1. ]), y_edge=array([2. , 2.5, 3. ]), binnumber=array([5, 6, 5, 9]))
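The values argument starts to matter once the statistic depends on the data in each bin rather than on the count. A minimal sketch, assuming a made-up values list and a hypothetical value_range helper (neither comes from the question). It also shows an empty bin coming back as NaN, because max([]) raises an error:

```python
import numpy as np
from scipy import stats

def value_range(values):
    # Spread of the values that fall in one bin. On an empty bin max() raises
    # ValueError, which binned_statistic_2d reports as NaN.
    return max(values) - min(values)

x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
values = [10.0, 20.0, 30.0, 40.0]  # illustrative data, one value per (x, y) point
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]

ret = stats.binned_statistic_2d(x, y, values, value_range, bins=[binx, biny])
print(ret.statistic)
```

The bin holding 10.0 and 30.0 gets 20.0, the two single-point bins get 0.0, and the empty bin is NaN.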
I have a tensor 'input_sentence_embed' with shape torch.Size([1, 768])
There is a dataframe 'matched_df' which looks like
INCIDENT_NUMBER enc_rep
0 INC000030884498 [[tensor(-0.2556), tensor(0.0188), tensor(0.02...
1 INC000029956111 [[tensor(-0.3115), tensor(0.2535), tensor(0.20..
2 INC000029555353 [[tensor(-0.3082), tensor(0.2814), tensor(0.24...
3 INC000029555338 [[tensor(-0.2759), tensor(0.2604), tensor(0.21...
Shape of each tensor element in dataframe looks like
matched_df['enc_rep'].iloc[0].size()
torch.Size([1, 768])
I want to find the Euclidean / cosine similarity between 'input_sentence_embed' and each row of 'matched_df' efficiently.
If they were scalar values, I could easily have broadcast 'input_sentence_embed' as a new column in 'matched_df' and then computed the cosine similarity between the two columns.
I am struggling with two problems:
How to broadcast 'input_sentence_embed' as a new column to the 'matched_df'
How to find cosine similarity between tensors stored in two columns
Maybe someone can also suggest an easier way to reach the end goal: efficiently finding the similarity between one tensor and all the tensors stored in a dataframe column.
Input data:
import pandas as pd
import numpy as np
from torch import tensor
match_df = pd.DataFrame({'INCIDENT_NUMBER': ['INC000030884498',
'INC000029956111',
'INC000029555353',
'INC000029555338'],
'enc_rep': [[[tensor(0.2971), tensor(0.4831), tensor(0.8239), tensor(0.2048)]],
[[tensor(0.3481), tensor(0.8104) , tensor(0.2879), tensor(0.9747)]],
[[tensor(0.2210), tensor(0.3478), tensor(0.2619), tensor(0.2429)]],
[[tensor(0.2951), tensor(0.6698), tensor(0.9654), tensor(0.5733)]]]})
input_sentence_embed = [[tensor(0.0590), tensor(0.3919), tensor(0.7821) , tensor(0.1967)]]
How to broadcast 'input_sentence_embed' as a new column to the 'matched_df'
match_df["input_sentence_embed"] = [input_sentence_embed] * len(match_df)
How to find cosine similarity between tensors stored in two columns
a = np.vstack(match_df["enc_rep"])
b = np.hstack(input_sentence_embed)
# Divide each row's dot product by that row's own norm (axis=1), not by the norm of the whole matrix
match_df["cosine_similarity"] = a.dot(b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b))
Output result:
INCIDENT_NUMBER enc_rep input_sentence_embed cosine_similarity
0 INC000030884498 [[tensor(0.2971), tensor(0.4831), tensor(0.823... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.971748
1 INC000029956111 [[tensor(0.3481), tensor(0.8104), tensor(0.287... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.624403
2 INC000029555353 [[tensor(0.2210), tensor(0.3478), tensor(0.261... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.820259
3 INC000029555338 [[tensor(0.2951), tensor(0.6698), tensor(0.965... [[tensor(0.0590), tensor(0.3919), tensor(0.782... 0.952969
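For the "efficiently" part of the question, one option is to stack the embeddings into a single matrix and compute everything in one pass, without adding a broadcast column at all. A sketch with NumPy, using plain float rows as stand-ins for the tensors; the helper name similarity_to_query is made up for illustration:

```python
import numpy as np

def similarity_to_query(rows, query):
    """Return (cosine similarity, Euclidean distance) of each row to `query`."""
    rows = np.asarray(rows, dtype=float)            # shape (n, d)
    query = np.asarray(query, dtype=float).ravel()  # shape (d,)
    # Per-row norms (axis=1) so every row is compared on its own scale.
    cosine = rows @ query / (np.linalg.norm(rows, axis=1) * np.linalg.norm(query))
    euclidean = np.linalg.norm(rows - query, axis=1)
    return cosine, euclidean

rows = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]  # toy data, not the question's embeddings
cosine, euclidean = similarity_to_query(rows, [1.0, 0.0])
print(cosine)     # cosine similarities: 1.0, 0.0, 0.6
print(euclidean)
```

With real [1, 768] tensors, the rows would first have to be combined into one array, for example with torch.cat(list(matched_df['enc_rep'])), assuming each cell really holds a tensor of that shape.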
Basics
I suppose you are trying to calculate the similarity or closeness of two vectors via:
euclidean distance between vectors or
cosine between vectors
Cosine similarity
For cosine similarity, you need:
Norm of each vector -> You can use linalg.norm
Cosine of vectors -> You can use dot product (inner or dot)
https://en.wikipedia.org/wiki/Cosine_similarity
For instance A = [0.8, 0.9] and B = [1.0, 0.0], then the cosine similarity of A and B is:
A = np.array([0.8, 0.9])
B = np.array([1.0, 0.0])
EA = np.linalg.norm(A)
EB = np.linalg.norm(B)
NA = A / EA
NB = B / EB
COS_A_B = np.dot(NA, NB)
COS_A_B
---
0.6643638388299198
So if we can get two vectors (rows) A and B from the enc_rep column, then we can calculate the cosine between them.
Pandas
We need to figure out how to run those cosine calculations on the same column.
C = np.array([0.5, 0.3])
df = pd.DataFrame(columns=['ID','enc_rep'])
df.loc[0] = [1, A]
df.loc[1] = [2, B]
df.loc[2] = [3, C]
df
---
ID enc_rep
0 1 [0.8, 0.9]
1 2 [1.0, 0.0]
2 3 [0.5, 0.3]
One naive way is to create a cartesian product of the enc_rep column itself.
cartesian_df = df['enc_rep'].to_frame().merge(df['enc_rep'], how='cross')
cartesian_df
---
enc_rep_x enc_rep_y
0 [0.8, 0.9] [0.8, 0.9]
1 [0.8, 0.9] [1.0, 0.0]
2 [0.8, 0.9] [0.5, 0.3]
3 [1.0, 0.0] [0.8, 0.9]
4 [1.0, 0.0] [1.0, 0.0]
5 [1.0, 0.0] [0.5, 0.3]
6 [0.5, 0.3] [0.8, 0.9]
7 [0.5, 0.3] [1.0, 0.0]
8 [0.5, 0.3] [0.5, 0.3]
Take the cosine between enc_rep_x and enc_rep_y.
def f(x, y):
    nx = x / np.linalg.norm(x)
    ny = y / np.linalg.norm(y)
    return np.dot(nx, ny)
cartesian_df['cosine'] = cartesian_df.apply(lambda row: f(row.enc_rep_x, row.enc_rep_y), axis=1)
cartesian_df
---
enc_rep_x enc_rep_y cosine
0 [0.8, 0.9] [0.8, 0.9] 1.000000
1 [0.8, 0.9] [1.0, 0.0] 0.664364
2 [0.8, 0.9] [0.5, 0.3] 0.954226
3 [1.0, 0.0] [0.8, 0.9] 0.664364
4 [1.0, 0.0] [1.0, 0.0] 1.000000
5 [1.0, 0.0] [0.5, 0.3] 0.857493
6 [0.5, 0.3] [0.8, 0.9] 0.954226
7 [0.5, 0.3] [1.0, 0.0] 0.857493
8 [0.5, 0.3] [0.5, 0.3] 1.000000
However, if the number of rows is large, this creates a huge dataframe with duplicate pairs. If size is not an issue, you can drop one column and keep the unique rows.
Hope this gives you an idea of how to do it. The remaining details, such as the vectors being 2-dimensional rather than 1-dimensional, are left for you to work out.
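If the cross join gets too large, the same pairwise cosines can be computed as one matrix product on row-normalized vectors. A sketch reusing the A, B and C vectors from above; the names M, unit and pairwise are just for illustration:

```python
import numpy as np

A = np.array([0.8, 0.9])
B = np.array([1.0, 0.0])
C = np.array([0.5, 0.3])

M = np.vstack([A, B, C])                             # shape (3, 2), one vector per row
unit = M / np.linalg.norm(M, axis=1, keepdims=True)  # scale every row to length 1
pairwise = unit @ unit.T                             # pairwise[i, j] = cosine(row i, row j)
print(pairwise.round(6))
```

Entry [0, 1] is the cosine of A and B, so it should agree with the 0.664364 in the table above, and the diagonal is 1 because every vector is compared with itself.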
In a nested list (list of lists), how can I remove the elements whose entries are all zero?
For instance: values =
[[1.1, 3.0], [2.5, 5.2], [4.7, 8.2], [69.2, 36.6], [0.7, 0.0], [0.0, 0.0], [0.4, 17.9], [14.7, 29.1], [6.8, 0.0], [0.0, 0.0]]
should change to
[[1.1, 3.0], [2.5, 5.2], [4.7, 8.2], [69.2, 36.6], [0.7, 0.0], [0.4, 17.9], [14.7, 29.1], [6.8, 0.0]]
Note: nested list could have n number of elements, not just 2.
Trying to do this to crop two other lists. something like:
for label, color, value in zip(labels, colors, values):
    if any(value) in values: #this check needs update
        new_labels.append(label)
        new_colors.append(color)
Take advantage of the fact that 0.0 is falsey and filter using any().
result = [sublist for sublist in values if any(sublist)]
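The same any() check can replace the condition in the loop from the question, so that labels and colors are cropped in step with values. A small sketch; the label and color entries are placeholders, since the question does not show them:

```python
values = [[1.1, 3.0], [0.0, 0.0], [0.7, 0.0], [0.0, 0.0]]
labels = ["a", "b", "c", "d"]            # placeholder labels
colors = ["red", "green", "blue", "k"]   # placeholder colors

new_labels, new_colors, new_values = [], [], []
for label, color, value in zip(labels, colors, values):
    if any(value):  # True as soon as one entry of the sublist is non-zero
        new_labels.append(label)
        new_colors.append(color)
        new_values.append(value)

print(new_labels)  # ['a', 'c']
print(new_colors)  # ['red', 'blue']
print(new_values)  # [[1.1, 3.0], [0.7, 0.0]]
```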
I am trying to get the columns from dataframe whose correlation with another column is greater than certain values like below.
df.loc[:, (df.corr()['col'] <= -0.05) | (df.corr()['col'] >= 0.05)]
But I am getting below error,
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
Also, if I try to select the columns with variance > 1, I get the same error:
df.loc[:, df.var() > 1]
Why am I getting an indexing error? I want to filter out the columns of the dataframe whose correlation with another column is between -0.05 and 0.05.
Can someone help me resolve this issue? I am not sure where I am going wrong.
I think I found your problem.
First I tried to build my own testing set, unfortunately everything worked nicely:
df = pd.DataFrame({
"col": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
"A": [1.1, 1.0, 1.0, 1.0, 1.0, 1.1],
"B": [1.0, 2.1, 3.0, 3.9, 5.0, 6.0]
})
df.loc[:, (df.corr()['col'] <= -0.05) | (df.corr()['col'] >= 0.05)]
I got :
col B
0 1.0 1.0
1 2.0 2.1
2 3.0 3.0
3 4.0 3.9
4 5.0 5.0
5 6.0 6.0
But then, after reading your error again, I thought that maybe some columns in your data are simply ignored by the corr() method, such as columns with an object dtype.
If I build a new testing set with textual columns, I get the same error as you:
df = pd.DataFrame({
"col": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
"A": [1.1, 1.0, 1.0, 1.0, 1.0, 1.1],
"B": [1.0, 2.1, 3.0, 3.9, 5.0, 6.0],
"C": ["A", "B", "C", "D", "E", "F"]
})
df.corr()['col'] >= 0.05
df.loc[:, (df.corr()['col'] <= -0.05) | (df.corr()['col'] >= 0.05)]
Then I got:
pandas.core.indexing.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
One way of fixing this is by doing so:
df = df.drop(columns=df.corr().query("-0.05 < col < 0.05").index)
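Another option is to keep the boolean-mask style from the question and re-align the mask with the full set of columns before indexing. A sketch, assuming a pandas version where corr() accepts numeric_only (1.5 or later, as far as I know); the names corr, keep, mask and selected are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "col": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "A": [1.1, 1.0, 1.0, 1.0, 1.0, 1.1],
    "B": [1.0, 2.1, 3.0, 3.9, 5.0, 6.0],
    "C": ["A", "B", "C", "D", "E", "F"],
})

# Correlate numeric columns only, so the text column "C" is skipped explicitly.
corr = df.corr(numeric_only=True)["col"]
keep = corr.abs() >= 0.05
# Re-align the mask with every column of df; columns missing from corr become False.
mask = keep.reindex(df.columns, fill_value=False)
selected = df.loc[:, mask]
print(selected.columns.tolist())
```

Because the mask now has one entry per column of df, the indexer and the dataframe line up and the alignment error should not occur. Text columns such as "C" are dropped here; use fill_value=True to keep them instead.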
Note: Please keep in mind that you'll get quicker and more relevant answers if you provide a full sample of the non-working code, so that your error can be reproduced easily ;)
I'm trying to write simple code that will take floats in two lists, find the mean between the two numbers in the same position in each list, and generate a new list with the updated means. For example, with list_1 and list_2,
list_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
list_2 = [6.0, 7.0, 8.0, 9.0, 10.0]
list_3 = []
for i in list_1:
    for x in list_2:
        list_3.append((x+i)/2)
print(list_3)
Find the mean between floats in two lists and create a new list such that:
list_3 = [3.5, 4.5, 5.5, 6.5, 7.5]
I tried creating a for loop, but (for obvious reasons) it doesn't iterate the way that I want it to. The output is:
[3.5, 4.0, 4.5, 5.0, 5.5, 4.0, 4.5, 5.0, 5.5, 6.0, 4.5, 5.0, 5.5, 6.0, 6.5, 5.0, 5.5, 6.0, 6.5, 7.0, 5.5, 6.0, 6.5, 7.0, 7.5]
Any help would be greatly appreciated!
You can do that with a list comprehension like:
Code:
[sum(x)/len(x) for x in zip(list_1, list_2)]
How:
The function zip() allows easy iteration through multiple lists at the same time. From there these values can be fed into sum() and len() as shown.
Test Code:
list_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
list_2 = [6.0, 7.0, 8.0, 9.0, 10.0]
list_3 = [sum(x)/len(x) for x in zip(list_1, list_2)]
print(list_3)
Results:
[3.5, 4.5, 5.5, 6.5, 7.5]
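A variant of the same idea, in case it reads more clearly: statistics.fmean (standard library, Python 3.8+) computes the mean of each pair directly. This is just an alternative sketch, not a change to the approach above:

```python
from statistics import fmean

list_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
list_2 = [6.0, 7.0, 8.0, 9.0, 10.0]

# zip() pairs items by position and stops at the end of the shorter list.
list_3 = [fmean(pair) for pair in zip(list_1, list_2)]
print(list_3)  # [3.5, 4.5, 5.5, 6.5, 7.5]
```

Note that if the two lists differ in length, zip() silently ignores the extra items of the longer one.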