Pandas Aggregate columns dynamically - python-3.x

My goal is to aggregate data similarly to SAS's "proc summary" using types. My starting pandas dataframe could look like the one below, where the database has already done the original group by over all dimensions/classification variables and applied some aggregate function to the measures.
In SQL this would look like:
select gender, age, sum(height), sum(weight)
from db.table
group by gender, age
gender  age  height  weight
F       19   70      123
M       24   72      172
I would then like to summarize the data with pandas, computing summary rows for different group-bys, so that I end up with this:
gender  age  height  weight
.       .    142     295
.       19   70      123
.       24   72      172
F       .    70      123
M       .    72      172
F       19   70      123
M       24   72      172
Here the first row is the aggregate with no group by, rows 2 and 3 are aggregated by age, rows 4 and 5 by gender only, and the remaining rows are the ordinary detail rows.
My current code looks like this:
# normally dynamic, just hard coded for this example
measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'sum'}}
msr_config_dict = {}
for measure in measures:
    if measure in message_measures:
        stat = measures[measure]['stat']
        msr_config_dict[measure] = pd.NamedAgg(measure, stat)

# compute agg with no group by as starting point
df = self.df.agg(**msr_config_dict)

dimensions = ['gender', 'age']  # dimensions is also dynamic in real life
dim_vars = []
for dim in dimensions:
    dim_vars.append(dim)
    if len(dim_vars) > 1:
        # compute agg of compound dimensions
        df_temp = self.df.groupby(dim_vars, as_index=False).agg(msr_config_dict)
        df = df.append(df_temp, ignore_index=True)
    # always compute agg of solo dimension
    df_temp = self.df.groupby(dim, as_index=False).agg(msr_config_dict)
    df = df.append(df_temp, ignore_index=True)
With this code I get: AttributeError: 'height' is not a valid function for 'Series' object.
As input to the agg function I have also tried {'height': [('height', 'sum')], 'weight': [('weight', 'sum')]}, where I am trying to compute the sum of all heights and name the output height. That also raised an AttributeError.
I know I will only ever be computing one aggregate function per measure, so I would like to dynamically build the input to the pandas agg function and always name the stat after the measure itself, so I can just append it to the dataframe of summary rows that I am building.
I am new to pandas coming from SAS background.
Any help would be much appreciated.

IIUC:
cols = ['height', 'weight']

out = pd.concat([df[cols].sum(0).to_frame().T,
                 df.groupby('age')[cols].sum().reset_index(),
                 df.groupby('gender')[cols].sum().reset_index(),
                 df], ignore_index=True)[df.columns].fillna('.')
Output:
>>> out
gender age height weight
0 . . 142 295
1 . 19.0 70 123
2 . 24.0 72 172
3 F . 70 123
4 M . 72 172
5 F 19.0 70 123
6 M 24.0 72 172

Here is a more flexible solution, extending @Corralien's answer. You can use itertools.combinations to create all the combinations of dimensions, for every possible combination length.
from itertools import combinations

# your input
measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'min'}}
dimensions = ['gender', 'age']

# flatten the nested dictionary
msr_config_dict = {key: val['stat'] for key, val in measures.items()}

# concat all possible aggregations
res = pd.concat(
    # case with everything aggregated
    [df.agg(msr_config_dict).to_frame().T]
    # cases with at least one dimension to group over
    + [df.groupby(list(_dimCols)).agg(msr_config_dict).reset_index()
       # for combinations of length 1, 2, ... depending on the number of dimensions
       for nb_cols in range(1, len(dimensions))
       # all combinations of that specific length
       for _dimCols in combinations(dimensions, nb_cols)]
    # original dataframe
    + [df],
    ignore_index=True)[df.columns].fillna('.')
print(res)
# gender age height weight
# 0 . . 142 123
# 1 F . 70 123
# 2 M . 72 172
# 3 . 19.0 70 123
# 4 . 24.0 72 172
# 5 F 19.0 70 123
# 6 M 24.0 72 172
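As for the AttributeError in the original code: it appears to come from passing the pd.NamedAgg objects inside a positional dict, while named aggregation expects them as keyword arguments. A minimal sketch of the dynamic build the question describes, assuming one statistic per measure and keeping each output column named after its measure (the sample df here is made up just to keep the snippet runnable):

import pandas as pd

df = pd.DataFrame({'gender': ['F', 'M'], 'age': [19, 24],
                   'height': [70, 72], 'weight': [123, 172]})

measures = {'height': {'stat': 'sum'}, 'weight': {'stat': 'sum'}}

# output column -> NamedAgg(input column, stat), one entry per measure
msr_config_dict = {m: pd.NamedAgg(column=m, aggfunc=cfg['stat'])
                   for m, cfg in measures.items()}

# pass the NamedAgg objects as keyword arguments (named aggregation),
# not as a positional dict
per_gender = df.groupby('gender', as_index=False).agg(**msr_config_dict)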

Related

filtering rows in one dataframe based on two columns of another dataframe

I have two data frames. One dataframe (dfA) looks like:
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (dfB) looks like
Name position string
Peter 89 aa
Jennie 568 bb
Jennie 90 cc
I want to filter the data so that a position from dfB falls within the interval given in dfA (start_coordinate to end_coordinate) and the Name matches as well. For example, the position in row 1 of dfB falls within the interval in row 1 of dfA and the Name is the same, so I want this row. In contrast, the position in row 3 of dfB also falls within the interval of row 1 of dfA, but the Name is different, so I don't want that record.
The expected output therefore becomes:
##new_dfA
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
##new_dfB
Name position string
Peter 89 aa
Jennie 568 bb
In reality, dfB has shape (443068765, 10) and dfA has shape (100000, 3), so I don't want to use numpy broadcasting because I run into memory errors. Is there a way to deal with this problem within the pandas framework? Insights will be appreciated.
If you have that many rows, pandas might not be well suited for your application.
That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:
dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]
output:
>>> dfA_new
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
>>> dfB_new
Name position string
0 Peter 89 aa
1 Jennie 568 bb
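If dfB is really too large to merge in one pass, the same merge-and-filter idea can be applied chunk by chunk so that only one block of dfB is expanded at a time. A rough sketch, assuming dfA and dfB as above (the chunk size is arbitrary; tune it to your memory budget):

import pandas as pd

chunk_size = 1_000_000
parts = []
for start in range(0, len(dfB), chunk_size):
    # merge only a slice of dfB, then keep rows whose position falls in the interval
    part = dfB.iloc[start:start + chunk_size].merge(dfA, on='Name')
    part = part[part['position'].between(part['start_coordinate'],
                                         part['end_coordinate'])]
    parts.append(part)

dfC = pd.concat(parts, ignore_index=True)
dfA_new = dfC[dfA.columns].drop_duplicates()
dfB_new = dfC[dfB.columns]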
Use pandasql:
from pandasql import sqldf

sqldf("select dfA.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
pd.sql("select df2.* from df1 inner join df2 on df2.name=df1.name and df2.position between df1.start_coordinate and df1.end_coordinate",globals())
Name position string
0 Peter 89 aa
1 Jennie 568 bb

pandas long (very long + index) to wide format conversion

I have a dataframe with a single column which contains both the index (virus) and the data to tabulate, and I wish to convert it to wide format.
Input data
virus1
AGCTGAGTGAG # sequence
40.1 # score 1
23 # score 2
102 # score 3
virus2
AGCTGAGTGAG # sequence
43.4 # score 1
32 # score 2
101 # score 3
virus3
AGTTGAGTGAG # sequence
41.3 # score 1
35 # score 2
100 # score 3
.... >100 inputs
Dataframe output
        sequence     score1  score2  score3
virus1  AGCTGAGTGAG  40.1    23      102
virus2  AGCTGAGTGAG  43.4    32      101
virus3  AGTTGAGTGAG  41.3    35      100
I attempted to import the data into a single dataframe and move the rows into columns of a new dataframe
Code
df = pd.read_csv(file, sep='\n', header=None)
index_labels = df.iloc[::5].astype(str)
dfvirus = pd.DataFrame(index=index_labels)
dfvirus['sequence'] = df.iloc[1::5].astype(str)
dfvirus['score1'] = df.iloc[2::5].astype(float)
dfvirus['score2'] = df.iloc[3::5].astype(int)
dfvirus['score3'] = df.iloc[4::5].astype(int)
The above didn't work: I get NaN or nan for the values of e.g. dfvirus['sequence'].head(), depending on whether the input is a number or a string. I could do this by constructing a hierarchical index, but that would mean looping a very long index into a list.
Moving from long to wide format is a common issue, so I would be grateful if you could show a simpler solution or point out where I'm going wrong here.
You can do:
df = pd.read_csv(file, sep='\n', header=None)
new_df = pd.DataFrame(df.values.reshape(-1, 5),
                      columns=['virus', 'sequence', 'score1', 'score2', 'score3'])
Output
virus sequence score1 score2 score3
0 virus1 AGCTGAGTGAG 40.1 23 102
1 virus2 AGCTGAGTGAG 43.4 32 101
2 virus3 AGTTGAGTGAG 41.3 35 100
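Because the whole file is read as a single object column, the scores in new_df will typically come out as strings. If you also want virus as the index and numeric dtypes, a small follow-up sketch (not part of the original answer):

new_df = new_df.set_index('virus')
# convert the score columns from strings to numbers
new_df[['score1', 'score2', 'score3']] = new_df[['score1', 'score2', 'score3']].apply(pd.to_numeric)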

Performing pair-wise comparisons of some pandas dataframe rows as efficiently as possible

For a given pandas dataframe df, I would like to compare every sample (row) with each other.
For bigger datasets this would lead to too many comparisons (n**2). Therefore, it is necessary to perform these comparisons only for smaller groups (i.e. for all of those which share the same id) and as efficiently as possible.
I would like to construct a dataframe (df_pairs), which contains in every row one pair. Additionally, I would like to get all pair indices (ideally as a Python set).
First, I construct an example dataframe:
import numpy as np
import pandas as pd
from functools import reduce
from itertools import product, combinations
n_samples = 10_000
suffixes = ["_1", "_2"] # for df_pairs
id_str = "id"
df = pd.DataFrame({id_str: np.random.randint(0, 10, n_samples),
                   "A": np.random.randint(0, 100, n_samples),
                   "B": np.random.randint(0, 100, n_samples),
                   "C": np.random.randint(0, 100, n_samples)}, index=range(0, n_samples))
columns_df_pairs = ([elem + suffixes[0] for elem in df.columns] +
                    [elem + suffixes[1] for elem in df.columns])
In the following, I am comparing 4 different options with the corresponding performance measures:
Option 1
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [set(product(elem.tolist(), repeat=2)) for _, elem in groups.items()] # determine pairs per group
set_of_pairs = reduce(set.union, pairs_per_group) # convert all groups into one set
idcs1, idcs2 = zip(*[(e1, e2) for e1, e2 in set_of_pairs])
df_pairs = pd.DataFrame(np.hstack([df.values[idcs1, :], df.values[idcs2, :]]),  # construct the dataframe of pairs
                        columns=columns_df_pairs,
                        index=pd.MultiIndex.from_tuples(set_of_pairs, names=('index 1', 'index 2')))
df_pairs.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 1 takes 34.2 s ± 1.28 s.
Option 2
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array(np.meshgrid(elem.values, elem.values)).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs2 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs2.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 2 takes 13 s ± 1.34 s.
Option 3
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array([np.tile(elem.values, len(elem.values)), np.repeat(elem.values, len(elem.values))]).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs3 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs3.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 3 takes 12.1 s ± 347 ms.
Option 4
df_pairs4 = pd.merge(left=df, right=df, how="inner", on=id_str, suffixes=suffixes)
# here, I do not know how to get the MultiIndex in
df_pairs4.drop([id_str], inplace=True, axis=1)
Option 4 is computed the quickest with 1.41 s ± 239 ms. However, I do not have the paired indices in this case.
I could improve the performance a little by using itertools.combinations instead of itertools.product. I could also build the comparison matrix, use only its upper triangle, and construct my dataframe from there. That, however, does not seem to be more efficient than performing the cartesian product and removing the self-references as well as the inverse comparisons (a, b) = (b, a).
Could you tell me a more efficient way to get pairs for comparison (ideally as a set to be able to use set operations)?
Could I use merge or another pandas function to construct my desired dataframe with the multi-indices?
An inner merge will destroy the index in favor of a new Int64Index. If the index is important, bring it along as a column with reset_index, then set those columns back as the index.
df_pairs4 = (pd.merge(left=df.reset_index(), right=df.reset_index(),
                      how="inner", on=id_str, suffixes=suffixes)
               .set_index(['index_1', 'index_2']))
id A_1 B_1 C_1 A_2 B_2 C_2
index_1 index_2
0 0 4 92 79 10 92 79 10
13 4 92 79 10 83 68 69
24 4 92 79 10 67 73 90
25 4 92 79 10 22 31 35
36 4 92 79 10 64 44 20
... .. ... ... ... ... ... ...
9993 9971 7 20 65 92 47 65 21
9977 7 20 65 92 50 35 27
9980 7 20 65 92 43 36 62
9992 7 20 65 92 99 2 17
9993 7 20 65 92 20 65 92
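If you also want to drop the self-pairs and keep only one of each (a, b)/(b, a) pair, as discussed in the question, you can filter on the MultiIndex afterwards; a small sketch building on df_pairs4:

lvl1 = df_pairs4.index.get_level_values('index_1')
lvl2 = df_pairs4.index.get_level_values('index_2')
df_pairs4 = df_pairs4[lvl1 < lvl2]   # drops (i, i) and one of each (a, b)/(b, a)
pair_set = set(df_pairs4.index)      # the pair indices as a Python set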

Get total of Pandas column and row

I have a Pandas data frame, as shown below,
a b c
A 100 60 60
B 90 44 44
A 70 50 50
Now, I would like to get the total of column and row, skip c, as shown below,
a b sum
A 170 110 280
B 90 44 134
I do not know how to do this; any help would be much appreciated.
My example dataframe is:
df = pd.DataFrame(dict(a=[100, 90,70], b=[60, 44,50],c=[60, 44,50]),index=["A", "B","A"])
(
    df.groupby(level=0)[['a', 'b']].sum()
      .assign(sum=lambda x: x.sum(axis=1))
)
Use:
#remove unnecessary column
df = df.drop('c', axis=1)
#get sum of rows
df['sum'] = df.sum(axis=1)
#get sum per index
df = df.sum(level=0)
print (df)
a b sum
A 170 110 280
B 90 44 134
df["sum"] = df[["a","b"]].sum(axis=1) #Column-wise sum of "a" and "b"
df[["a", "b", "sum"]] #show all columns but not "c"
The pandas way is:
#create sum column
df['sum'] = df['a']+df['b']
#remove column c
df = df[['a', 'b', 'sum']]
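Note that with the example index (two "A" rows), the last two snippets still leave duplicate index labels; to collapse them as in the first answers, aggregate by the index level afterwards:

df = df.groupby(level=0).sum()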

How to take values in the column as the columns in the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can I get it? I've tried using pivot_table but I didn't get my expected output. Is there any way to get my expected output?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or:
df = pd.pivot(df.index, df['Term'], df['value'])
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If there are duplicate Name/Term pairs, aggregation is necessary, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)
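For reference, since the question mentions pivot_table: the same aggregated reshape can also be written with pivot_table (a sketch assuming Name is the index, as in the question's data):

df = df.reset_index().pivot_table(index='Name', columns='Term',
                                  values='value', aggfunc='sum', fill_value=0)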
