Aggregating using a custom function and several columns in pandas - pandas-groupby

Suppose I have the following data frame:
group num value
a 3 20
a 5 5
b 5 10
b 10 5
b 2 25
Now, I want to compute the weighted average of columns num and value grouping by column group. Using tidyverse packages in R, this is straightforward:
> library(tidyverse)
> df <- tribble(
~group , ~num , ~value,
"a" , 3 , 20,
"a" , 5 , 5,
"b" , 5 , 10,
"b" , 10 , 5,
"b" , 2 , 25
)
> df %>%
group_by(group) %>%
summarise(new_value = sum(num * value) / sum(num))
# A tibble: 2 x 2
group new_value
<chr> <dbl>
1 a 10.6
2 b 8.82
Using pandas in Python, I can do all the intermediate computation beforehand, use sum() to add up the variables, and then perform the division using transform(), like this:
import pandas as pd
from io import StringIO
data = StringIO(
"""
group,num,value
a,3,20
a,5,5
b,5,10
b,10,5
b,2,25
""")
df = pd.read_csv(data)
df["tmp_value"] = df["num"] * df["value"]
df = df.groupby(["group"]) \
[["num", "tmp_value"]] \
.sum() \
.transform(lambda x : x["tmp_value"] / x["num"], axis="columns")
print(df)
# group
# a 10.625000
# b 8.823529
# dtype: float64
Note that we first need to explicitly subset the columns of interest ([["num", "tmp_value"]]), compute the sums (sum()), and then perform the division using transform(). In R, this is a single step, which is much more compact and readable, IMHO.
Now, how can I achieve that elegance using pandas? In other words, can this be written as cleanly and readably as in R?

@an_drade: There is a very similar Stack Overflow question that provides the solution:
Pandas DataFrame aggregate function using multiple columns
Based on that post, the solution to your question is to write a Python function:
df = pd.DataFrame([['a', 3, 20], ['a', 5, 5], ['b', 5, 10], ['b', 10, 5], ['b', 2, 25]], columns=['group', 'num', 'value'])
def wavg(group):
    d = group['num']
    w = group['value']
    return (d * w).sum() / d.sum()
final = df.groupby("group").apply(wavg)
group
a 10.625000
b 8.823529
dtype: float64
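For comparison, here is a sketch that avoids apply altogether, assuming the same data as above. It precomputes the product with assign and then divides the grouped sums. The helper column name weighted and the result name new_value are illustrative choices, not something the linked post prescribes.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b", "b"],
        "num": [3, 5, 5, 10, 2],
        "value": [20, 5, 10, 5, 25],
    }
)

# Sum num * value and num per group, then divide the two sums.
sums = (
    df.assign(weighted=df["num"] * df["value"])  # "weighted" is a helper column
    .groupby("group")[["weighted", "num"]]
    .sum()
)
new_value = (sums["weighted"] / sums["num"]).rename("new_value")
print(new_value)
```

Because only built-in sum aggregations are used, this usually scales better than a Python-level apply on large frames, at the cost of one temporary column.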

This is the "R way" you wanted:
>>> from datar import f
>>> from datar.tibble import tribble
>>> from datar.dplyr import group_by, summarise
>>> from datar.base import sum
>>> # or if you are lazy:
>>> # from datar.all import *
>>>
>>> df = tribble(
... f.group , f.num , f.value,
... "a" , 3 , 20,
... "a" , 5 , 5,
... "b" , 5 , 10,
... "b" , 10 , 5,
... "b" , 2 , 25
... )
>>> df >> \
... group_by(f.group) >> \
... summarise(new_value = sum(f.num * f.value) / sum(f.num))
group new_value
<object> <float64>
0 a 10.625000
1 b 8.823529
I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.

Related

featuretools: manual derivation of the features generated by dfs?

Code example:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
# Normalized one more time
es = es.normalize_entity(
new_entity_id="device",
base_entity_id="sessions",
index="device",
)
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_entity="customers",
agg_primitives=["std",],
groupby_trans_primitives=['cum_count'],
max_depth=2
)
I'd like to look more closely at the STD(sessions.CUM_COUNT(device) by customer_id) feature.
I tried to generate this feature manually but got a different result:
import pandas as pd
df = ft.demo.load_mock_customer(return_single_table=True)
a = df.groupby("customer_id")['device'].cumcount()
a.name = "cumcount_device"
a = pd.concat([df, a], axis=1)
b = a.groupby("customer_id")['cumcount_device'].std()
>>> b
customer_id
1 36.517
2 26.991
3 26.991
4 31.610
5 22.949
Name: cumcount_device, dtype: float64
What am I missing?
Thanks for the question. The calculation needs to be based on the data frame from sessions.
df = es['sessions'].df
cumcount = df['device'].groupby(df['customer_id']).cumcount()
std = cumcount.groupby(df['customer_id']).std()
std.round(3).loc[feature_matrix.index]
customer_id
5 1.871
4 2.449
1 2.449
3 1.871
2 2.160
dtype: float64
You should get the same output as in DFS.
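Why the row granularity matters can be seen on a toy table. This is only a sketch with made-up data (the sessions frame below is hypothetical, not the mock customer dataset): the cumulative count runs over one row per session, so a table with several rows per session would give larger counts and therefore a larger standard deviation.

```python
import pandas as pd

# Hypothetical sessions table: one row per session.
sessions = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "device": ["mobile", "desktop", "mobile", "tablet", "mobile"],
    }
)

# Running count of sessions within each customer: 0, 1, 2, ...
cumcount = sessions.groupby("customer_id").cumcount()

# Sample standard deviation (ddof=1) of that count per customer.
std = cumcount.groupby(sessions["customer_id"]).std()
print(std)
```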

How to get only different words from two pandas.DataFrame columns

I have a DataFrame with columns id, keywords1 and keywords2. I would like to get only the words from column keywords2 that are not in column keywords1. I also need to remove meaningless words such as phph, wfgh... from the new column; I'm only interested in English words.
Example:
data = [[1, 'detergent', 'detergent for cleaning stains'], [2, 'battery charger', 'wwfgh, old, glass'], [3, 'sunglasses, black, metal', 'glass gggg jik xxx,'], [4, 'chemicals, flammable', 'chemicals, phph']]
df = pd.DataFrame(data, columns = ['id', 'keywords1','keywords2'])
df
Try:
import numpy as np
# split on every single character that is not a word character or "+";
# consecutive separators (e.g. ", ") therefore leave empty strings in the result
df["keywords1"] = df["keywords1"].str.split(r"[^\w+]").map(set)
df["keywords2"] = df["keywords2"].str.split(r"[^\w+]").map(set)
# (keywords1 XOR keywords2) AND keywords2 keeps the words that occur only in keywords2
df["keywords3"] = np.bitwise_and(np.bitwise_xor(df["keywords1"], df["keywords2"]), df["keywords2"])
# optional: keep the result as a string rather than a set
df["keywords3"] = df["keywords3"].str.join(", ")
Outputs:
id ... keywords3
0 1 ... cleaning, for, stains
1 2 ... , wwfgh, glass, old
2 3 ... jik, xxx, glass, gggg
3 4 ... phph
Let's try:
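def words_diff(words1, words2):
    # each argument is a plain string here, so use str.split rather than the .str accessor
    kw1 = words1.split()
    kw2 = words2.split()
    # keep the words of keywords2 that do not occur in keywords1
    diff = [x for x in kw2 if x not in kw1]
    return diff
df['diff'] = df.apply(lambda x: words_diff(x['keywords1'], x['keywords2']), axis=1)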
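Neither snippet above filters out the non-English tokens the question mentions. A minimal sketch of that step follows, assuming some vocabulary of accepted words is available. ENGLISH_WORDS here is a tiny hypothetical stand-in (a real word list would come from a file or a library), and new_english_words is an illustrative name.

```python
import re
import pandas as pd

df = pd.DataFrame(
    [
        [1, "detergent", "detergent for cleaning stains"],
        [2, "battery charger", "wwfgh, old, glass"],
        [3, "sunglasses, black, metal", "glass gggg jik xxx,"],
        [4, "chemicals, flammable", "chemicals, phph"],
    ],
    columns=["id", "keywords1", "keywords2"],
)

# Hypothetical vocabulary, for illustration only.
ENGLISH_WORDS = {"detergent", "for", "cleaning", "stains", "old", "glass", "chemicals"}

def new_english_words(row):
    # Tokenise on runs of letters so commas and spaces never end up inside a word.
    seen = set(re.findall(r"[a-z]+", row["keywords1"].lower()))
    candidates = re.findall(r"[a-z]+", row["keywords2"].lower())
    # Keep the original order; drop words already in keywords1 or not in the vocabulary.
    return [w for w in candidates if w not in seen and w in ENGLISH_WORDS]

df["diff"] = df.apply(new_english_words, axis=1)
print(df[["id", "diff"]])
```

With this toy vocabulary, tokens such as wwfgh, gggg and phph are dropped, while old and glass are kept.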

Python Passing Dynamic Table Name in For Loop

table_name = []
counter = 0
for year in ['2017', '2018', '2019']:
    table_name.append(f'temp_df_{year}')
    print(table_name[counter])
    table_name[counter] = pd.merge(table1, table2.loc[table2.loc[:, 'year'] == year, :], left_on='col1', right_on='col1', how='left')
    counter += 1
temp_df_2017
The print statement outputs are correct:
temp_df_2017,
temp_df_2018,
temp_df_2019
However, when I try to see what's in temp_df_2017, I get an error: name 'temp_df_2017' is not defined
I would like to create those three tables. How can I make this work?
PS: ['2017', '2018', '2019'] list will vary. It can be a list of quarters. That's why I want to do this in a loop, instead of using the merge statement 3x.
I think the easiest and most practical approach is to create a dictionary that maps each name to its DataFrame.
import pandas as pd
import numpy as np
# Create dummy data
data = np.arange(9).reshape(3,3)
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
df
Out:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df_year_names = ['2017', '2018', '2019']
dict_of_dfs = {}
for year in df_year_names:
    df_name = f'some_name_year_{year}'
    dict_of_dfs[df_name] = df
dict_of_dfs.keys()
Out:
dict_keys(['some_name_year_2017', 'some_name_year_2018', 'some_name_year_2019'])
Then to access a particular year:
dict_of_dfs['some_name_year_2018']
Out:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
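Applied to the merge from the question, the same idea could look like the sketch below. The contents of table1 and table2 are made up here, since the question does not show them; only the column names col1 and year come from the question, and val is an invented column.

```python
import pandas as pd

# Made-up stand-ins for table1 and table2 from the question.
table1 = pd.DataFrame({"col1": [1, 2, 3]})
table2 = pd.DataFrame(
    {
        "col1": [1, 2, 1, 3],
        "year": ["2017", "2017", "2018", "2019"],
        "val": [10, 20, 30, 40],
    }
)

years = ["2017", "2018", "2019"]  # could equally be a list of quarters

# One merged frame per year, stored under the name the question wanted.
merged = {
    f"temp_df_{year}": pd.merge(
        table1, table2.loc[table2["year"] == year], on="col1", how="left"
    )
    for year in years
}

print(merged["temp_df_2017"])
```

The frames are then looked up by key, for example merged["temp_df_2018"], instead of through dynamically created variable names.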

Using non-zero values from columns in function - pandas

I have the dataframe below and would like to calculate, inside a function, the difference between columns 'animal1' and 'animal2' divided by their sum, taking into account only the rows where the values in both 'animal1' and 'animal2' are greater than 0.
How could I do this?
import pandas as pd
animal1 = pd.Series({'Cat': 4, 'Dog': 0,'Mouse': 2, 'Cow': 0,'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3,'Mouse': 0, 'Cow': 1,'Chicken': 2})
data = pd.DataFrame({'animal1':animal1, 'animal2':animal2})
def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()
print(data)
I believe you need to check that all values in a row are greater than 0 using DataFrame.gt together with DataFrame.all, and then filter with boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()
df = data[data.gt(0).all(axis=1)].copy()
#alternative for not equal 0
#df = data[data.ne(0).all(axis=1)].copy()
print (df)
animal1 animal2
Cat 4 2
Chicken 3 2
print(animals(df))
Cat
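A variant that leaves the input frame untouched is sketched below: instead of filtering rows first, it masks the computed ratio so that rows failing the condition become NaN. The function name largest_relative_diff is illustrative only.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "animal1": {"Cat": 4, "Dog": 0, "Mouse": 2, "Cow": 0, "Chicken": 3},
        "animal2": {"Cat": 2, "Dog": 3, "Mouse": 0, "Cow": 1, "Chicken": 2},
    }
)

def largest_relative_diff(frame):
    # True only for rows where both columns are strictly positive.
    both_positive = (frame[["animal1", "animal2"]] > 0).all(axis=1)
    ratio = (frame["animal1"] - frame["animal2"]) / (frame["animal1"] + frame["animal2"])
    # Rows that fail the condition become NaN, which idxmax skips by default.
    return ratio.where(both_positive).abs().idxmax()

print(largest_relative_diff(data))
```

Since nothing is assigned back to the frame, no copy() is needed and no anim_diff column is added to data.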

How to iterate over dfs and append data with combine names

I have this problem to solve. It is a continuation of a previous question, How to iterate over pandas df with a def function variable function. The answer given there worked perfectly, but now I have to append all the data into a two-column dataframe (Adduct_name and mass).
This is from the previous question:
My goal: I have to calculate the "adducts" for a given "Compound". Both are represented by numbers, but for each "Compound" there are 46 different "Adducts".
Each adduct is calculated as follows:
Adduct 1 = [Exact_mass*M/Charge + Adduct_mass]
where Exact_mass is a number, M and Charge are numbers (1, 2, 3, etc.) that depend on the type of adduct, and Adduct_mass is a number (positive or negative) that depends on the adduct.
My data: two data frames. One holds the adduct names, M, Charge and Adduct_mass. The other holds the Compound_name and Exact_mass of the compounds I want to iterate over (I only include a small data set here).
Adducts: df_al
import pandas as pd
data = [["M+3H", 3, 1, 1.007276], ["M+3Na", 3, 1, 22.989], ["M+H", 1, 1,
1.007276], ["2M+H", 1, 2, 1.007276], ["M-3H", 3, 1, -1.007276]]
df_al = pd.DataFrame(data, columns=["Ion_name", "Charge", "M", "Adduct_mass"])
Compounds: df
import pandas as pd
data1 = [[1, "C3H64O7", 596.465179], [2, "C30H42O7", 514.293038], [4,
"C44H56O8", 712.397498], [4, "C24H32O6S", 448.191949], [5, "C20H28O3",
316.203834]]
df = pd.DataFrame(data1, columns=["CdId", "Formula", "exact_mass"])
The solution to this problem was:
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
# Defining general function
def Adduct(x, i):
    return x*df_M[i]/df_div[i] + df_mass[i]
# Applying general function for i from 0 to 4.
for i in range(5):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
Output
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
Those are the right calculations, but I now need a file where:
- there are only 2 columns (Name and Mass)
- all the different adducts are appended one after another
Desired output:
Name Mass
a_M+3H 199.82902
a_M+3Na 221.810726
a_M+H 597.472455
a_2M+H 1193.937634
a_M-3H 197.814450
b_M+3H 514.293038
.
.
.
c_M+3H
and so on.
I also need to combine the name of the respective compound with the ion form (M+3H, M+H, etc.).
At this point I have no code for that.
I would appreciate any advice, including a better approach from the beginning.
This part is an update of the question above:
Is it possible to obtain an output like this one:
Name Mass RT
a_M+3H 199.82902 1
a_M+3Na 221.810726 1
a_M+H 597.472455 1
a_2M+H 1193.937634 1
a_M-3H 197.814450 1
b_M+3H 514.293038 3
.
.
.
c_M+3H 2
The RT is the same value for all forms of a compound; in this example, RT is 1 for a, 3 for b, 2 for c, etc.
Is it possible to keep this column from the data set df (updated below)? As you can see, df has more columns, such as "Formula" and "RT", which disappear after the calculations.
import pandas as pd
data1 = [["a", "C3H64O7", 596.465179, 1], ["b", "C30H42O7", 514.293038, 3], ["c",
"C44H56O8", 712.397498, 2], ["d", "C24H32O6S", 448.191949, 4], ["e", "C20H28O3",
316.203834, 1.5]]
df = pd.DataFrame(data1, columns=["Name", "Formula", "exact_mass", "RT"])
Part three! (Sorry, and thank you.)
This is a trial I did on a small data set (df) using the code below, with the same df_al as above.
Code:
#Defining variables for calculation
df_name = df_al["Ion_name"]
df_mass = df_al["Adduct_mass"]
df_div = df_al["Charge"]
df_M = df_al["M"]
df_ID= df["Name"]
#Defining the RT dictionary
RT = dict(zip(df["Name"], df["RT"]))
#Removing RT column
df=df.drop(columns=["RT"])
# Defining general function
def Adduct(x, i):
    return x*df_M[i]/df_div[i] + df_mass[i]
# Applying general function in a range from 0 to 46.
for i in range(47):
    df[df_name.loc[i]] = df['exact_mass'].map(lambda x: Adduct(x, i))
df
#Melting
df = pd.melt(df, id_vars=['Name'], var_name = "Adduct", value_name= "Exact_mass", value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
df['name'] = df.apply(lambda x:x[0] + "_" + x[1], axis=1)
df['RT'] = df.Name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
del df['Name']
del df['Adduct']
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)
df
In the resulting output, the RT column is NaN. Why?
Here is how I would go about it; pandas.melt comes to the rescue:
import pandas as pd
import numpy as np
from io import StringIO
s = StringIO('''
Name exact_mass M+3H M+3Na M+H 2M+H M-3H
0 a 596.465179 199.829002 221.810726 597.472455 1193.937634 197.814450
1 b 514.293038 172.438289 194.420013 515.300314 1029.593352 170.423737
2 c 712.397498 238.473109 260.454833 713.404774 1425.802272 236.458557
3 d 448.191949 150.404592 172.386316 449.199225 897.391174 148.390040
4 e 316.203834 106.408554 128.390278 317.211110 633.414944 104.39400
''')
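df = pd.read_csv(s, sep=r"\s+")
df = pd.melt(df, id_vars=['Name'], value_vars=[x for x in df.columns if 'Name' not in x and 'exact' not in x])
# combine the compound name with the adduct name, e.g. "a_M+3H"
df['name'] = df['Name'] + "_" + df['variable']
del df['Name']
del df['variable']
# RT values as given in the question
RT = {'a': 1, 'b': 3, 'c': 2, 'd': 4, 'e': 1.5}
# x[0] is the first character of the combined name, i.e. the single-letter compound name
df['RT'] = df.name.apply(lambda x: RT[x[0]] if x[0] in RT else np.nan)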
df
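On the NaN values: one likely cause, assuming the real compound names are longer than a single character, is that RT[x[0]] looks up only the first character of each name, which is then not a key of the RT dictionary. A sketch that sidesteps the lookup is shown below. It pairs every compound with every adduct through a cross join, so RT simply travels along as a column. It uses a shortened df_al and df for brevity and assumes pandas 1.2 or later for how="cross".

```python
import pandas as pd

# Shortened versions of the two frames from the question.
df_al = pd.DataFrame(
    [["M+3H", 3, 1, 1.007276], ["M+H", 1, 1, 1.007276], ["2M+H", 1, 2, 1.007276]],
    columns=["Ion_name", "Charge", "M", "Adduct_mass"],
)
df = pd.DataFrame(
    [["a", 596.465179, 1], ["b", 514.293038, 3]],
    columns=["Name", "exact_mass", "RT"],
)

# Cross join: every compound row is paired with every adduct row.
long = df.merge(df_al, how="cross")

# Exact_mass * M / Charge + Adduct_mass, computed for all pairs at once.
long["Mass"] = long["exact_mass"] * long["M"] / long["Charge"] + long["Adduct_mass"]

# Combine compound and ion names, e.g. "a_M+3H"; RT is already on each row.
long["Name"] = long["Name"] + "_" + long["Ion_name"]
result = long[["Name", "Mass", "RT"]]
print(result)
```

The result is already in long format with one row per compound and adduct, so no melt step and no dictionary lookup are needed, and it can be written out with result.to_csv(...).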
