In Pandas, how to compute value counts on bins and sum value in 1 other column - python-3.x

I have a Pandas dataframe like:
df =
col1 col2
23 75
25 78
22 120
I want to specify bins 0-100 and 100-200, divide col2 into those bins, compute the value counts per bin, and sum col1 for the rows whose col2 falls in each bin.
So:
df_output:
col2_range count col1_cum
0-100 2 48
100-200 1 22
Getting the col2_range and count is pretty simple:
import numpy as np
bins = np.arange(0, 300, 100).tolist()  # [0, 100, 200] -> two bins
counts = df['col2'].value_counts(bins=bins, sort=False)
How do I sum col1 per bin, though?

IIUC, try using pd.cut to create bins and groupby those bins:
g = pd.cut(df['col2'],
           bins=[0, 100, 200, 300, 400],
           labels=['0-99', '100-199', '200-299', '300-399'])
df.groupby(g, observed=True)['col1'].agg(['count', 'sum']).reset_index()
Output:
col2 count sum
0 0-99 2 48
1 100-199 1 22
I think I misread the original post:
g = pd.cut(df['col2'],
           bins=[0, 100, 200, 300, 400],
           labels=['0-99', '100-199', '200-299', '300-399'])
df.groupby(g, observed=True).agg(col1_count=('col1', 'count'),
                                 col2_sum=('col2', 'sum'),
                                 col1_sum=('col1', 'sum')).reset_index()
Output:
col2 col1_count col2_sum col1_sum
0 0-99 2 153 48
1 100-199 1 120 22
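For reference, a self-contained version of the pd.cut-plus-groupby approach above, using the three-row df from the question (the column names count and col1_cum match the asker's expected output; they are chosen here, not produced automatically):

```python
import pandas as pd

df = pd.DataFrame({"col1": [23, 25, 22], "col2": [75, 78, 120]})

# Bin col2, then aggregate col1 within each bin.
g = pd.cut(df["col2"], bins=[0, 100, 200], labels=["0-100", "100-200"])
out = (df.groupby(g, observed=False)["col1"]
         .agg(count="count", col1_cum="sum")
         .reset_index()
         .rename(columns={"col2": "col2_range"}))
print(out)
```

observed=False keeps empty bins in the result, which is usually what you want for fixed bin edges.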

Related

How to filter a dataframe using a cumulative sum of a column as parameter

I have this df:
df=pd.DataFrame({'Name':['John','Mike','Lucy','Mary','Andy'],
'Age':[10,23,13,12,15],
'%':[20,20,10,25,25]})
I want to filter this df by taking rows 0 through n until the sum of the % column reaches 50.
I don't want to sort the % column or the df; I just need the first rows whose % values sum to 50.
The output is:
filtered=pd.DataFrame({'Name':['John','Mike','Lucy'],'Age':[10,23,13],'%':[20,20,10]})
cumsum, build a boolean index, and slice with the iloc accessor:
df.iloc[:(df['%'].cumsum()==50).idxmax()+1,:]
Name Age %
0 John 10 20
1 Mike 23 20
2 Lucy 13 10
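A boolean-mask variant of the same idea, as a sketch: it keeps rows while the running total stays at or below 50, which matches the idxmax version here because the cumulative sum hits exactly 50 on Lucy's row.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mike', 'Lucy', 'Mary', 'Andy'],
                   'Age': [10, 23, 13, 12, 15],
                   '%': [20, 20, 10, 25, 25]})

# Keep rows while the running total of % is <= 50.
filtered = df[df['%'].cumsum().le(50)]
print(filtered)
```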

Applying "percentage of group total" to a column in a grouped dataframe

I have a dataframe from which I generate another dataframe using the following code:
df.groupby(['Cat','Ans']).agg({'col1':'count','col2':'sum'})
This gives me following result:
Cat Ans col1 col2
A Y 100 10000.00
N 40 15000.00
B Y 80 50000.00
N 40 10000.00
Now, I need percentage of group totals for each group (level=0, i.e. "Cat") instead of count or sum.
For getting count percentage instead of count value, I could do this:
df['Cat'].value_counts(normalize=True)
But here I have sub-group "Ans" under the "Cat" group. And I need the percentage to be on each Cat group level and not the whole total.
So, expectation is:
Cat Ans col1 .. col3
A Y 100 .. 71.43 #(100/(100+40))*100
N 40 .. 28.57
B Y 80 .. 66.67
N 40 .. 33.33
Similarly, col4 will be percentage of group-total for col2.
Is there a function or method available for this?
How do we do this in an efficient way for large data?
You can use the level argument of DataFrame.sum (to perform a groupby) and have pandas take care of the index alignment for the division.
df['col3'] = df['col1']/df['col1'].sum(level='Cat')*100
col1 col2 col3
Cat Ans
A Y 100 10000.0 71.428571
N 40 15000.0 28.571429
B Y 80 50000.0 66.666667
N 40 10000.0 33.333333
For multiple columns you can loop the above, or have pandas align those too. I add a suffix to distinguish the new columns from the original columns when joining back with concat.
df = pd.concat([df, (df/df.sum(level='Cat')*100).add_suffix('_pct')], axis=1)
col1 col2 col1_pct col2_pct
Cat Ans
A Y 100 10000.0 71.428571 40.000000
N 40 15000.0 28.571429 60.000000
B Y 80 50000.0 66.666667 83.333333
N 40 10000.0 33.333333 16.666667
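Note that Series.sum(level=...) and DataFrame.sum(level=...) were deprecated and later removed (pandas 2.0). A sketch of the same computation on current pandas, using groupby(...).transform('sum') to broadcast the group totals back to the original shape so the division aligns row by row:

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [100, 40, 80, 40],
     "col2": [10000.0, 15000.0, 50000.0, 10000.0]},
    index=pd.MultiIndex.from_tuples(
        [("A", "Y"), ("A", "N"), ("B", "Y"), ("B", "N")],
        names=["Cat", "Ans"]))

# transform('sum') returns a frame the same shape as df, filled with
# each row's group total, so df / totals is a plain aligned division.
pct = df.div(df.groupby(level="Cat").transform("sum")).mul(100).add_suffix("_pct")
df = pd.concat([df, pct], axis=1)
print(df)
```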

In pandas dataframe, how to make one column act on all the others?

Consider the small following dataframe:
import pandas as pd
value1 = [15, 20, 50, 70]
value2 = [15, 80, 45, 30]
base = [175, 150, 200, 125]
df = pd.DataFrame({"val1": value1, "val2": value2, "base": base})
df
val1 val2 base
0 15 15 175
1 20 80 150
2 50 45 200
3 70 30 125
Actually, there are many more rows and many more val*** columns...
I would like to express the figures in the val*** columns as a percentage of their corresponding base (in the same row). For example, 70 (last in val1) should become (70/125)*100 = 56, and 30 (last in val2) should become (30/125)*100 = 28; and so on for every figure.
I am sure the solution lies in a correct use of assign or apply and lambda, but I can't find how to do it ...
We can filter the val-like columns, divide them by the base column along axis=0, then multiply by 100 to get the percentage:
df.filter(like='val').div(df['base'], axis=0).mul(100).add_suffix('%')
val1% val2%
0 8.571429 8.571429
1 13.333333 53.333333
2 25.000000 22.500000
3 56.000000 24.000000

Get total of Pandas column and row

I have a Pandas data frame, as shown below,
a b c
A 100 60 60
B 90 44 44
A 70 50 50
Now, I would like to get the totals by column and by row, skipping c, as shown below,
a b sum
A 170 110 280
B 90 44 134
I don't know how to do this; any help is appreciated.
My example dataframe is:
df = pd.DataFrame(dict(a=[100, 90,70], b=[60, 44,50],c=[60, 44,50]),index=["A", "B","A"])
(
df.groupby(level=0)[['a', 'b']].sum()
  .assign(sum=lambda x: x.sum(axis=1))
)
Use:
#remove unnecessary column
df = df.drop('c', axis=1)
#get sum of rows
df['sum'] = df.sum(1)
#get sum per index
df = df.groupby(level=0).sum()
print (df)
a b sum
A 170 110 280
B 90 44 134
df["sum"] = df[["a","b"]].sum(axis=1) #Column-wise sum of "a" and "b"
df[["a", "b", "sum"]] #show all columns but not "c"
The pandas way is:
#create sum column
df['sum'] = df['a']+df['b']
#remove column c
df = df[['a', 'b', 'sum']]
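Putting the first answer together as a self-contained sketch (double brackets select the list of columns, which avoids the long-deprecated tuple indexing on a groupby):

```python
import pandas as pd

df = pd.DataFrame({"a": [100, 90, 70], "b": [60, 44, 50],
                   "c": [60, 44, 50]}, index=["A", "B", "A"])

# Sum the duplicate index labels first, then add a row-wise
# total of a and b as a new 'sum' column.
out = (df.groupby(level=0)[["a", "b"]].sum()
         .assign(sum=lambda x: x.sum(axis=1)))
print(out)
```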

How to take values in the column as the columns in the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can I get it? I've tried using pivot_table but didn't get my expected output. Is there any way to get it?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or:
df = df.reset_index().pivot(index='Name', columns='Term', values='value')
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If there are duplicate Name/Term pairs, aggregation is necessary, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)
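A self-contained version of the first approach, assuming Name is the index and Term/value are columns as shown in the question:

```python
import pandas as pd

df = pd.DataFrame({"Term": [1, 2, 3, 1, 2, 3],
                   "value": [35, 40, 50, 20, 45, 50]},
                  index=pd.Index(["A", "A", "A", "B", "B", "B"], name="Name"))

# Move Term into the index alongside Name, then unstack it into columns.
out = df.set_index("Term", append=True)["value"].unstack()
print(out)
```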
