Pandas pivot table group summary

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'group':['s','s','s','p','p','p'],
                   'section':['a','b','b','a','a','b']})
group section
0 s a
1 s b
2 s b
3 p a
4 p a
5 p b
I'd like a count of the number of distinct sections per group, plus the maximum and minimum number of rows per section within each group. Like this:
group  count  max  min
s      2      2    1
p      2      2    1

IIUC you can use:
import pandas as pd
import numpy as np
df = pd.DataFrame({'group':['s','s','s','s','p','p','p','p','p'],
                   'section':['b','b','b','a','a','b','a','a','b']})
print (df)
group section
0 s b
1 s b
2 s b
3 s a
4 p a
5 p b
6 p a
7 p a
8 p b
print (df.groupby(['group', 'section']).size() )
group section
p a 3
b 2
s a 1
b 3
dtype: int64
print (df.groupby(['group', 'section']).size().groupby(level=1).agg([len, min, max]) )
len min max
section
a 2 1 3
b 2 2 3
Or maybe you can change len to nunique:
print (df.groupby(['group', 'section']).size().groupby(level=1).agg(['nunique', min, max]) )
nunique min max
section
a 2 1 3
b 2 2 3
Or, if you need it by the first level of the MultiIndex (group):
print (df.groupby(['group', 'section']).size().groupby(level=0).agg([len, min, max]) )
len min max
group
p 2 2 3
s 2 1 3
print (df.groupby(['group', 'section']).size().groupby(level=0).agg(['nunique', min, max]) )
nunique min max
group
p 2 2 3
s 2 1 3
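To reproduce the exact table from the question, the per-group aggregation can be given the asked-for column names via named aggregation (a minimal sketch; assumes pandas >= 0.25, and uses 'size' rather than 'nunique' to count sections, since 'nunique' counts distinct section sizes and only matches when all sizes differ):
sizes = df.groupby(['group', 'section']).size()
# one row per group: number of sections, largest and smallest section
summary = (sizes.groupby(level='group')
                .agg(count='size', max='max', min='min')
                .reset_index())
print (summary)
  group  count  max  min
0     p      2    3    2
1     s      2    3    1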

You can achieve this by grouping on 'group', generating the value_counts, and then grouping again:
In [91]:
df.groupby('group')['section'].apply(pd.Series.value_counts).groupby(level=1).agg(['nunique','max','min'])
Out[91]:
nunique max min
a 2 2 1
b 2 2 1
To get close to the desired result you can do this:
In [102]:
df.groupby('group')['section'].apply(pd.Series.value_counts).reset_index().drop('level_1', axis=1).groupby('group',as_index=False).agg(['nunique','max','min'])
Out[102]:
section
nunique max min
group
p 2 2 1
s 2 2 1

Related

Check whether the two string columns contain each other in Python

Given a small dataset as follows:
id a b
0 1 lol lolec
1 2 rambo ram
2 3 ki pio
3 4 iloc loc
4 5 strip rstrip
5 6 lambda lambda
I would like to create a new column c based on the following criterion:
If a is equal to b or a substring of b, or vice versa, then create a new column c with value 1; otherwise set it to 0.
How could I do that in Pandas or Python?
The expected result:
id a b c
0 1 lol lolec 1
1 2 rambo ram 1
2 3 ki pio 0
3 4 iloc loc 1
4 5 strip rstrip 1
5 6 lambda lambda 1
To check whether a is in b or b is in a, we can use:
df.apply(lambda x: x.a in x.b, axis=1)
df.apply(lambda x: x.b in x.a, axis=1)
Use zip and list comprehension:
df['c'] = [int(a in b or b in a) for a, b in zip(df.a, df.b)]
df
id a b c
0 1 lol lolec 1
1 2 rambo ram 1
2 3 ki pio 0
3 4 iloc loc 1
4 5 strip rstrip 1
5 6 lambda lambda 1
Or use apply, just combine both conditions with or:
df['c'] = df.apply(lambda r: int(r.a in r.b or r.b in r.a), axis=1)
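One caveat worth noting (an assumption beyond the original question): if either column can contain NaN, the in test raises a TypeError. A minimal sketch that treats any non-string pair as no match:
# Treat missing/non-string values as "no containment" (0) -- an assumption.
df['c'] = [int(isinstance(a, str) and isinstance(b, str) and (a in b or b in a))
           for a, b in zip(df.a, df.b)]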

Calculation using shifting is not working in a for loop

The problem consists of calculating the column "accumulated" in a dataframe, using the columns "accumulated" and "weekly". The formula is: accumulated[t] = weekly[t] + accumulated[t-1].
The desired result should be:
weekly accumulated
2 0
1 1
4 5
2 7
The result I'm obtaining is:
weekly accumulated
2 0
1 1
4 4
2 2
What I have tried is:
for key, value in df_dic.items():
df_aux = df_dic[key]
df_aux['accumulated'] = 0
df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
#df_aux["accumulated"] = df_aux.iloc[:,2] + df_aux.iloc[:,3].shift(1)
df_aux.iloc[0,3] = 0 #I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0,3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes; it looks like {0: df1, 1: df2, 2: df3, ...}. All the dataframes have the same size and the same column names, so I use the for loop to apply the same calculation to every dataframe in the dictionary.
EDIT2: I tried doing the computation outside the for loop and it is not working either.
What I'm doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with the dictionary interaction...
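For context, here is a minimal sketch of why the shift-based assignment cannot build a running total: the right-hand side is evaluated once, against the column of zeros assigned just before, so shift(1) shifts zeros rather than the running values.
import pandas as pd
df = pd.DataFrame({'weekly': [2, 1, 4, 2]})
df['accumulated'] = 0
# The RHS below is computed in one vectorized pass; it never sees updated values.
df['accumulated'] = df.weekly + df.accumulated.shift(1)
print (df.accumulated.tolist())  # [nan, 1.0, 4.0, 2.0] -- not a running total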
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1, df2, df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0,0]
    print (d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
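Applied to the asker's dictionary of dataframes, the same idea is a one-line loop body (a sketch assuming each frame has a 'weekly' column as above):
for key, d in df_dic.items():
    # cumulative sum, minus the first week's value so that row 0 is 0
    d['accumulated'] = d['weekly'].cumsum() - d['weekly'].iat[0]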
To solve for 1 dataframe
You need to use cumsum and then subtract the value from the first row. That will give you the desired result. Here's how to do it:
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
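An equivalent formulation (a sketch giving the same result): since the question forces the first accumulated value to 0, you can zero out the first weekly value and take a plain cumsum.
import pandas as pd
df = pd.DataFrame({'weekly': [2, 1, 4, 2]})
w = df['weekly'].copy()
w.iloc[0] = 0                     # force accumulated[0] to be 0
df['accumulated'] = w.cumsum()    # 0, 1, 5, 7
print (df)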

Python create a column based on the values of each row of another column

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER': ["A", "A", "A", "B", "B", "B"],
                   'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1",
                             "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"],
                   'VAL': [1, 3, 8, 5, 8, 10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" as the sum of 'VAL' over rows whose GROUP name is the same except for an 'H' character at the end. So, for example, the 'VAL' values for the first two rows will be added because the only difference between their 'GROUP' values is that the 2nd row ends in H. Row 3 will remain as it is, rows 4 and 6 will get added, and row 5 will remain the same.
My expected output
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15
Try with replace then transform
df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
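Note that str.replace('H','') removes every 'H' in the group name, not just a trailing one; the sample data happens to be safe, but if 'H' can appear elsewhere, anchoring the pattern is more robust (a sketch using a regex; regex=True is required on recent pandas, where the default is literal replacement):
key = df.GROUP.str.replace(r'H$', '', regex=True)   # strip only a trailing H
df['CAL'] = df.groupby(key).VAL.transform('sum')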

In Pandas, how to filter against other dataframe with Multi-Index

I have two dataframes. The first one (df1) has a Multi-Index A,B.
The second one (df2) has those fields A and B as columns.
How do I filter df2, for a large dataset (2 million rows in each), to get only the rows in df2 where A and B are not in the MultiIndex of df1?
import pandas as pd
df1 = pd.DataFrame([(1,2,3),(1,2,4),(1,2,4),(2,3,4),(2,3,1)],
                   columns=('A','B','C')).set_index(['A','B'])
df2 = pd.DataFrame([(7,7,1,2,3),(7,7,1,2,4),(6,6,1,2,4),
                    (5,5,6,3,4),(2,7,2,2,1)],
                   columns=('X','Y','A','B','C'))
df1:
     C
A B
1 2  3
  2  4
  2  4
2 3  4
  3  1
df2 before filtering:
X Y A B C
0 7 7 1 2 3
1 7 7 1 2 4
2 6 6 1 2 4
3 5 5 6 3 4
4 2 7 2 2 1
df2 wanted result:
X Y A B C
3 5 5 6 3 4
4 2 7 2 2 1
Create a MultiIndex in df2 from the A and B columns and filter with Index.isin, using ~ to invert the boolean mask for boolean indexing:
df = df2[~df2.set_index(['A','B']).index.isin(df1.index)]
print (df)
X Y A B C
3 5 5 6 3 4
4 2 7 2 2 1
Another similar solution with MultiIndex.from_arrays:
df = df2[~pd.MultiIndex.from_arrays([df2['A'],df2['B']]).isin(df1.index)]
Another solution, by @Sandeep Kadapa:
df = df2[df2[['A','B']].ne(df1.reset_index()[['A','B']]).any(axis=1)]
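For completeness, a left merge with indicator=True is another common anti-join pattern (a sketch, not from the original answers; note the drop_duplicates, since df1's index contains repeated (A, B) pairs):
keys = df1.reset_index()[['A','B']].drop_duplicates()
df = (df2.merge(keys, on=['A','B'], how='left', indicator=True)
         .query('_merge == "left_only"')
         .drop(columns='_merge'))
print (df)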

Pandas Conditionally Combine (and sum) Rows

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A':['A','A','A','B','B','B'],
                   'B':[1,1,2,1,1,1],
                   'C':[2,4,6,3,5,7]})
df
A B C
0 A 1 2
1 A 1 4
2 A 2 6
3 B 1 3
4 B 1 5
5 B 1 7
Wherever there are duplicate rows with respect to columns 'A' and 'B', I'd like to combine those rows and sum the values under column 'C', like this:
A B C
0 A 1 6
2 A 2 6
3 B 1 15
So far, I can at least identify the duplicates like this:
df['Dup']=df.duplicated(['A','B'],keep=False)
Thanks in advance!
Use groupby() and sum():
In [94]: df.groupby(['A','B']).sum().reset_index()
Out[94]:
A B C
0 A 1 6
1 A 2 6
2 B 1 15
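Equivalently, passing as_index=False avoids the reset_index step, and selecting 'C' keeps the sum from touching any other numeric columns (a minor variant):
In [95]: df.groupby(['A','B'], as_index=False)['C'].sum()
Out[95]:
   A  B   C
0  A  1   6
1  A  2   6
2  B  1  15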
