Performing pair-wise comparisons of some pandas dataframe rows as efficiently as possible - python-3.x

For a given pandas dataframe df, I would like to compare every sample (row) with every other row.
For bigger datasets this leads to too many comparisons (n**2). Therefore, it is necessary to perform these comparisons only within smaller groups (i.e. among all rows that share the same id) and as efficiently as possible.
I would like to construct a dataframe (df_pairs) which contains one pair per row. Additionally, I would like to get all pair indices (ideally as a Python set).
First, I construct an example dataframe:
import numpy as np
import pandas as pd
from functools import reduce
from itertools import product, combinations
n_samples = 10_000
suffixes = ["_1", "_2"] # for df_pairs
id_str = "id"
df = pd.DataFrame({id_str: np.random.randint(0, 10, n_samples),
                   "A": np.random.randint(0, 100, n_samples),
                   "B": np.random.randint(0, 100, n_samples),
                   "C": np.random.randint(0, 100, n_samples)}, index=range(0, n_samples))
columns_df_pairs = ([elem + suffixes[0] for elem in df.columns] +
                    [elem + suffixes[1] for elem in df.columns])
In the following, I am comparing 4 different options with the corresponding performance measures:
Option 1
groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [set(product(elem.tolist(), repeat=2)) for _, elem in groups.items()]  # determine pairs per group
set_of_pairs = reduce(set.union, pairs_per_group)  # combine all groups into one set
idcs1, idcs2 = zip(*[(e1, e2) for e1, e2 in set_of_pairs])
df_pairs = pd.DataFrame(np.hstack([df.values[idcs1, :], df.values[idcs2, :]]),  # construct the dataframe of pairs
                        columns=columns_df_pairs,
                        index=pd.MultiIndex.from_tuples(set_of_pairs, names=('index 1', 'index 2')))
df_pairs.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 1 takes 34.2 s ± 1.28 s.
Option 2
groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [np.array(np.meshgrid(elem.values, elem.values)).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs2 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs2.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 2 takes 13 s ± 1.34 s.
Option 3
groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [np.array([np.tile(elem.values, len(elem.values)),
                             np.repeat(elem.values, len(elem.values))]).T.reshape(-1, 2)
                   for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs3 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs3.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 3 takes 12.1 s ± 347 ms.
Option 4
df_pairs4 = pd.merge(left=df, right=df, how="inner", on=id_str, suffixes=suffixes)
# here, I do not know how to get the MultiIndex in
df_pairs4.drop([id_str], inplace=True, axis=1)
Option 4 is computed the quickest with 1.41 s ± 239 ms. However, I do not have the paired indices in this case.
I could improve the performance a little by using itertools.combinations instead of product. I could also build the comparison matrix, use only its upper triangle, and construct my dataframe from there. This, however, does not seem to be more efficient than performing the Cartesian product and removing the self-references as well as the inverse comparisons (a, b) == (b, a).
Could you tell me a more efficient way to get pairs for comparison (ideally as a set to be able to use set operations)?
Could I use merge or another pandas function to construct my desired dataframe with the multi-indices?
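For reference, a minimal sketch of the combinations-based variant mentioned above (unordered pairs only, without self-references), assuming the df and id_str from the setup; pair_set is an illustrative name:
from itertools import chain, combinations

groups = df.groupby(id_str).groups
# unordered pairs per id group; drops (a, a) and keeps only one of (a, b)/(b, a)
pair_set = set(chain.from_iterable(
    combinations(idx.tolist(), 2) for idx in groups.values()))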

An inner merge will destroy the index in favor of a new Int64Index. If the index is important, bring it along as columns via reset_index, then set those columns back as the index.
df_pairs4 = (pd.merge(left=df.reset_index(), right=df.reset_index(),
                      how="inner", on=id_str, suffixes=suffixes)
             .set_index(['index_1', 'index_2']))
                 id  A_1  B_1  C_1  A_2  B_2  C_2
index_1 index_2
0       0         4   92   79   10   92   79   10
        13        4   92   79   10   83   68   69
        24        4   92   79   10   67   73   90
        25        4   92   79   10   22   31   35
        36        4   92   79   10   64   44   20
...              ..  ...  ...  ...  ...  ...  ...
9993    9971      7   20   65   92   47   65   21
        9977      7   20   65   92   50   35   27
        9980      7   20   65   92   43   36   62
        9992      7   20   65   92   99    2   17
        9993      7   20   65   92   20   65   92
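Since the question also asks for the pair indices as a Python set, they can be read straight off the MultiIndex of the merged result. A minimal follow-up sketch, assuming the df_pairs4 built above (the variable names are illustrative):
# all (index_1, index_2) pairs as a set of tuples, ready for set operations
pair_set = set(df_pairs4.index)

# optionally drop self-pairs and mirrored duplicates (a, b)/(b, a)
unordered_pairs = {(a, b) for a, b in pair_set if a < b}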

Related

Pandas Aggregate columns dynamically

My goal is to aggregate data similarly to SAS's "proc summary using types". My starting pandas dataframe could look like this, where the database has already done the original group by over all dimensions/classification variables and applied some aggregate function to the measures.
So in SQL this would look like:
select gender, age, sum(height), sum(weight)
from db.table
group by gender, age
gender  age  height  weight
F       19   70      123
M       24   72      172
I would then like to summarize the data using pandas, calculating summary rows based on different group bys, to come out with this:
gender  age  height  weight
.       .    142     295
.       19   70      123
.       24   72      172
F       .    70      123
M       .    72      172
F       19   70      123
M       24   72      172
Here the first row is the aggregate with no group by, rows 2 and 3 are aggregated by age only, rows 4 and 5 by gender only, and then come the normal (fully grouped) rows.
My current code looks like this
# normally dynamic; just hard-coded for this example
measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'sum'}}
msr_config_dict = {}
# message_measures and self.df come from the surrounding class context
for measure in measures:
    if measure in message_measures:
        stat = measures[measure]['stat']
        msr_config_dict[measure] = pd.NamedAgg(measure, stat)

# compute agg with no group by as a starting point
df = self.df.agg(**msr_config_dict)

dimensions = ['gender', 'age']  # dimensions is also dynamic in real life
dim_vars = []
for dim in dimensions:
    dim_vars.append(dim)
    if len(dim_vars) > 1:
        # compute agg of compound dimensions
        df_temp = self.df.groupby(dim_vars, as_index=False).agg(msr_config_dict)
        df = df.append(df_temp, ignore_index=True)
    # always compute agg of the solo dimension
    df_temp = self.df.groupby(dim, as_index=False).agg(msr_config_dict)
    df = df.append(df_temp, ignore_index=True)
With this code I get: AttributeError: 'height' is not a valid function for 'Series' object
For the input to the agg function I have also tried {'height': [('height', 'sum')], 'weight': [('weight', 'sum')]}, where I am trying to compute the sum of all heights and name the output height. That also raised an attribute error.
I know I will only ever be computing one aggregate function per measure, so I would like to dynamically build the input to the pandas agg function and always rename the stat to itself, so that I can just append it to the dataframe I am building with the summary rows.
I am new to pandas, coming from a SAS background.
Any help would be much appreciated.
IIUC:
cols = ['height', 'weight']
out = pd.concat([df[cols].sum(0).to_frame().T,
                 df.groupby('age')[cols].sum().reset_index(),
                 df.groupby('gender')[cols].sum().reset_index(),
                 df], ignore_index=True)[df.columns].fillna('.')
Output:
>>> out
  gender   age height weight
0      .     .    142    295
1      .  19.0     70    123
2      .  24.0     72    172
3      F     .     70    123
4      M     .     72    172
5      F  19.0     70    123
6      M  24.0     72    172
Here is a more flexible solution, extending the solution of @Corralien. You can use itertools.combinations to create all combinations of dimensions, for every possible combination length.
from itertools import combinations

# your input
measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'min'}}
dimensions = ['gender', 'age']

# flatten the nested dictionary into {measure: stat}
msr_config_dict = {key: val['stat'] for key, val in measures.items()}

# concat all possible aggregations
res = pd.concat(
    # case with everything aggregated
    [df.agg(msr_config_dict).to_frame().T]
    # cases with at least one dimension to aggregate over
    + [df.groupby(list(_dimCols)).agg(msr_config_dict).reset_index()
       # for combinations of length 1, 2, ... depending on the number of dimensions
       for nb_cols in range(1, len(dimensions))
       # all combinations of the specific length
       for _dimCols in combinations(dimensions, nb_cols)]
    # original dataframe
    + [df],
    ignore_index=True)[df.columns].fillna('.')
print(res)
#   gender   age height weight
# 0      .     .    142    123
# 1      F     .     70    123
# 2      M     .     72    172
# 3      .  19.0     70    123
# 4      .  24.0     72    172
# 5      F  19.0     70    123
# 6      M  24.0     72    172

Iterating over a list of arrays to use as input in a function

In the code below, how can I iterate over y to get all the 5 groups of 2 arrays each to use as input to func?
I know I could just do:
func(y[0], y[1])
func(y[2], y[3]) ...etc.
But I can't write these calls out by hand, because I can have hundreds of arrays in y.
import numpy as np
import itertools
# creating an array with 100 samples
array = np.random.rand(100)
# making the array an iterator
iter_array = iter(array)
# creating a list of lists to store 10 lists of 10 elements each
n = 10
result = [[] for _ in range(n)]
# filling the lists
for _ in itertools.repeat(None, 10):
    for i in range(n):
        result[i].append(next(iter_array))
# casting the lists to arrays
y = np.array([np.array(xi) for xi in result], dtype=object)
# list to store the results of the calculation below
result_func = []
# applying a function that takes 2 arrays as input
# I have 10 arrays within y, so I need to perform the function below 5 times: [0,1],[2,3],[4,5],[6,7],[8,9]
a = func(y[0], y[1])
# saving the result
result_func.append(a)
You could use a list comprehension:
result_func = [func(y[i], y[i+1]) for i in range(0, 10, 2)]
or the general for loop:
for i in range(0, 10, 2):
    result_func.append(func(y[i], y[i+1]))
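As a variation, slicing plus zip avoids the explicit index arithmetic; a sketch assuming the y and func defined in the question:
# pair y[0] with y[1], y[2] with y[3], ... via even/odd slices
result_func = [func(a, b) for a, b in zip(y[::2], y[1::2])]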
Because of numpy's fill-order when reshaping, you could reshape the array to have
a variable depth (depending on the number of arrays)
a height of two
the same width as the number of elements in each input row
Thus when filling it will fill two rows before needing to increase the depth by one.
Iterating over this array results in a series of matrices (one for each depthwise layer). Each matrix has two rows, which comes out to be y[0], y[1], y[2], y[3], and so on.
For example's sake, say the inner arrays each have length 6 and that there are 8 of them in total (so that there are 4 function calls):
import numpy as np
elems_in_row = 6
y = np.array(
    [[ 1,  2,  3,  4,  5,  6],
     [ 7,  8,  9, 10, 11, 12],
     [13, 14, 15, 16, 17, 18],
     [19, 20, 21, 22, 23, 24],
     [25, 26, 27, 28, 29, 30],
     [31, 32, 33, 34, 35, 36],
     [37, 38, 39, 40, 41, 42],
     [43, 44, 45, 46, 47, 48]])
# the `-1` makes the number of rows be inferred from the input array.
y2 = y.reshape((-1,2,elems_in_row))
for ar1, ar2 in y2:
    print("1st:", ar1)
    print("2nd:", ar2)
    print("")
Output:
1st: [1 2 3 4 5 6]
2nd: [ 7 8 9 10 11 12]
1st: [13 14 15 16 17 18]
2nd: [19 20 21 22 23 24]
1st: [25 26 27 28 29 30]
2nd: [31 32 33 34 35 36]
1st: [37 38 39 40 41 42]
2nd: [43 44 45 46 47 48]
As a sidenote, if your function outputs simple values (like integers or floats) and does not have side-effects like IO, it may perhaps be possible to use apply_along_axis to create the output array directly without explicitly iterating over the pairs.
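A minimal sketch of that idea, reusing y and elems_in_row from the example above; func here is a hypothetical scalar-returning stand-in (a plain dot product), not part of the original question:
import numpy as np

def func(a, b):  # hypothetical stand-in returning a single float
    return float(np.dot(a, b))

flat_pairs = y.reshape(-1, 2 * elems_in_row)  # one (ar1, ar2) pair per row
results = np.apply_along_axis(
    lambda row: func(row[:elems_in_row], row[elems_in_row:]), 1, flat_pairs)
print(results)  # one scalar per pair, four in total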

Appending DataFrame to empty DataFrame in {Key: Empty DataFrame (with columns)}

I am struggling to understand this one.
I have a regular df (same columns as the empty df in the dict) and an empty df which is a value in a dictionary (the keys in the dict are variable based on certain inputs, so there can be just one key/value pair or multiple ones; I think this might be relevant). The dict structure is essentially:
{key: [[Empty DataFrame
        Columns: [list of columns]
        Index: []]]}
I am using the following code to try and add the data:
dict[key].append(df, ignore_index=True)
The error I get is:
temp_dict[product_match].append(regular_df, ignore_index=True)
TypeError: append() takes no keyword arguments
Is this error due to me mis-specifying the value I am attempting to append the df to (like am I trying to append the df to the key instead here) or something else?
Your dictionary contains a list of lists at the key; we can see this in the shown output:
{key: [[Empty DataFrame Columns: [list of columns] Index: []]]}
#     ^^ list starts                                       ^^ list ends
For this reason dict[key].append is calling list.append, as mentioned by @nandoquintana.
To append to the DataFrame, access the specific element in the list:
temp_dict[product_match][0][0].append(df, ignore_index=True)
Notice there is no inplace version of append. append always produces a new DataFrame:
Sample Program:
import numpy as np
import pandas as pd

temp_dict = {
    'key': [[pd.DataFrame()]]
}
product_match = 'key'

np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))

temp_dict[product_match][0][0].append(df, ignore_index=True)
print(temp_dict)
Output (temp_dict was not updated):
{'key': [[Empty DataFrame
Columns: []
Index: []]]}
The new DataFrame will need to be assigned to the correct location.
Either a new variable:
some_new_variable = temp_dict[product_match][0][0].append(df, ignore_index=True)
some_new_variable
    0   1   2   3
0  99  78  61  16
1  73   8  62  27
2  30  80   7  76
3  15  53  80  27
4  44  77  75  65
Or back to the list:
temp_dict[product_match][0][0] = (
    temp_dict[product_match][0][0].append(df, ignore_index=True)
)
temp_dict
{'key': [[    0   1   2   3
0  99  78  61  16
1  73   8  62  27
2  30  80   7  76
3  15  53  80  27
4  44  77  75  65]]}
Assuming the DataFrame is actually an empty DataFrame, append is unnecessary, since simply setting the value at the key to the new DataFrame works:
temp_dict[product_match] = df
temp_dict
{'key':     0   1   2   3
0  99  78  61  16
1  73   8  62  27
2  30  80   7  76
3  15  53  80  27
4  44  77  75  65}
Or if the list-of-lists structure is needed:
temp_dict[product_match] = [[df]]
temp_dict
{'key': [[    0   1   2   3
0  99  78  61  16
1  73   8  62  27
2  30  80   7  76
3  15  53  80  27
4  44  77  75  65]]}
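Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same update is spelled with pd.concat, e.g. for the sample program above:
# pandas >= 2.0 equivalent of the append call above
temp_dict[product_match][0][0] = pd.concat(
    [temp_dict[product_match][0][0], df], ignore_index=True)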
Maybe you have an empty list at dict[key]?
Remember that "append" list method (unlike Pandas dataframe one) only receives one parameter:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
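For reference, the TypeError from the question is reproducible with any plain list, because the built-in list.append accepts exactly one positional argument and no keywords:
# any list reproduces it; the DataFrame argument is irrelevant
[].append(regular_df, ignore_index=True)
# TypeError: append() takes no keyword arguments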

How to sort a pandas dataframe by the standard deviations of its columns?

Given the following example:
import pandas as pd
from statistics import stdev

d = pd.DataFrame({"a": [45, 55], "b": [5, 95], "c": [30, 70]})
stds = [stdev(d[c]) for c in d.columns]
With output:
In [87]: d
Out[87]:
    a   b   c
0  45   5  30
1  55  95  70

In [91]: stds
Out[91]: [7.0710678118654755, 63.63961030678928, 28.284271247461902]
I would like to be able to sort the columns of the dataframe by their standard deviations, resulting in the following:
    b   c   a
0   5  30  45
1  95  70  55
You are looking for:
d.iloc[:,(-d.std()).argsort()]
Out[8]:
    b   c   a
0   5  30  45
1  95  70  55
You can get the column order like this:
column_order = d.std().sort_values(ascending=False).index
# >>> column_order
# Index(['b', 'c', 'a'], dtype='object')
And then sort the columns like this:
d[column_order]
    b   c   a
0   5  30  45
1  95  70  55
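As a side note, DataFrame.std uses the sample standard deviation (ddof=1), the same convention as statistics.stdev, so both orderings agree. A quick check against the stds list from the question:
import numpy as np

# DataFrame.std defaults to ddof=1, matching statistics.stdev
assert np.allclose(d.std().to_numpy(), stds)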

Remove index from dataframe using Python

I am trying to create a pandas DataFrame from a string using the following code:
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
I am getting the following result:
   0     1     2
0  A     B     C
1  0    34    88
2  2    45   200
3  3    47    65
4  4    32   140
5     None  None
But I need something like the following:
A B C
0 34 88
2 45 200
3 47 65
4 32 140
I added "index = False" while creating the dataframe like -
df = pd.DataFrame([x.split(';') for x in data.split('\n')],index = False)
But it gives me an error:
TypeError: Index(...) must be called with a collection of some kind, False was passed
How is this achievable?
Use read_csv with io.StringIO and the index_col parameter to set the first column as the index:
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
from io import StringIO

df = pd.read_csv(StringIO(input_string), sep=';', index_col=0)
print (df)
    B    C
A
0  34   88
2  45  200
3  47   65
4  32  140
Your solution should be changed: split on the default parameter (arbitrary whitespace), pass all lists except the first to DataFrame with the first one as the columns parameter, and if you need the first column as the index, add DataFrame.set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index('A')
print (df)
    B    C
A
0  34   88
2  45  200
3  47   65
4  32  140
For a general solution, use the first value of the first list in set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index(L[0][0])
EDIT:
You can set the A value as the columns name instead of the index name:
df = df.rename_axis(df.index.name, axis=1).rename_axis(None)
print (df)
A   B    C
0  34   88
2  45  200
3  47   65
4  32  140
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split()])
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
df.set_index('A', inplace=True)
df
Output:
    B    C
A
0  34   88
2  45  200
3  47   65
4  32  140
