Applying "percentage of group total" to a column in a grouped dataframe - python-3.x

I have a dataframe from which I generate another dataframe using the following code:
df.groupby(['Cat','Ans']).agg({'col1':'count','col2':'sum'})
This gives me the following result:
         col1      col2
Cat Ans
A   Y     100  10000.00
    N      40  15000.00
B   Y      80  50000.00
    N      40  10000.00
Now, I need the percentage of the group total for each group (level=0, i.e. "Cat") instead of the count or sum.
To get the count percentage instead of the count value, I could do this:
df['Cat'].value_counts(normalize=True)
But here I have a sub-group "Ans" under the "Cat" group, and I need the percentage within each Cat group, not over the whole total.
So, the expected output is:
Cat Ans  col1  ..   col3
A   Y     100  ..  71.43   # (100/(100+40))*100
    N      40  ..  28.57
B   Y      80  ..  66.67
    N      40  ..  33.33
Similarly, col4 will be percentage of group-total for col2.
Is there a function or method available for this?
How can we do this efficiently for large data?

You can use the level argument of DataFrame.sum (to perform a groupby) and have pandas take care of the index alignment for the division.
df['col3'] = df['col1']/df['col1'].sum(level='Cat')*100
         col1     col2       col3
Cat Ans
A   Y     100  10000.0  71.428571
    N      40  15000.0  28.571429
B   Y      80  50000.0  66.666667
    N      40  10000.0  33.333333
For multiple columns you can loop the above, or have pandas align those too. I add a suffix to distinguish the new columns from the original columns when joining back with concat.
df = pd.concat([df, (df/df.sum(level='Cat')*100).add_suffix('_pct')], axis=1)
         col1     col2   col1_pct   col2_pct
Cat Ans
A   Y     100  10000.0  71.428571  40.000000
    N      40  15000.0  28.571429  60.000000
B   Y      80  50000.0  66.666667  83.333333
    N      40  10000.0  33.333333  16.666667
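Note that the level argument of DataFrame.sum was deprecated in pandas 1.3 and removed in 2.0; on newer versions the same index-aligned division can be done with a groupby transform. A minimal sketch, assuming the grouped count/sum dataframe from the top of this answer:
# per-'Cat' totals, broadcast back to the original shape via transform
totals = df[['col1', 'col2']].groupby(level='Cat').transform('sum')
df = pd.concat([df, (df[['col1', 'col2']] / totals * 100).add_suffix('_pct')], axis=1)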

Related

In Pandas, how to compute value counts on bins and sum values in one other column

I have a Pandas dataframe like:
df =
   col1  col2
     23    75
     25    78
     22   120
I want to specify bins 0-100 and 100-200, divide col2 into those bins, compute its value counts, and sum col1 for the rows that fall in each bin.
So:
df_output:
  col2_range  count  col1_cum
       0-100      2        48
     100-200      1        22
Getting the col2_range and count is pretty simple:
import numpy as np
a = np.arange(0, 300, 100)   # [0, 100, 200]
bins = a.tolist()
counts = df['col2'].value_counts(bins=bins, sort=False)
How do I get the sum of col1 though?
IIUC, try using pd.cut to create bins and groupby those bins:
g = pd.cut(df['col2'],
           bins=[0, 100, 200, 300, 400],
           labels=['0-99', '100-199', '200-299', '300-399'])
df.groupby(g, observed=True)['col1'].agg(['count', 'sum']).reset_index()
Output:
      col2  count  sum
0     0-99      2   48
1  100-199      1   22
I think I misread the original post:
g = pd.cut(df['col2'],
           bins=[0, 100, 200, 300, 400],
           labels=['0-99', '100-199', '200-299', '300-399'])
df.groupby(g, observed=True).agg(col1_count=('col1', 'count'),
                                 col2_sum=('col2', 'sum'),
                                 col1_sum=('col1', 'sum')).reset_index()
Output:
      col2  col1_count  col2_sum  col1_sum
0     0-99           2       153        48
1  100-199           1       120        22
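If you want the result labelled exactly like the question's df_output, here is a sketch using the question's own bin edges (assuming the df above; note that pd.cut uses right-closed intervals by default, so 75 and 78 land in 0-100 and 120 in 100-200):
# bin col2 with the exact edges from the question, then count rows and sum col1
g = pd.cut(df['col2'], bins=[0, 100, 200], labels=['0-100', '100-200'])
df_output = (df.groupby(g, observed=True)
               .agg(count=('col2', 'count'), col1_cum=('col1', 'sum'))
               .reset_index()
               .rename(columns={'col2': 'col2_range'}))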

Counting multiple columns, listing the counts in separate columns, and retaining a column

I have the following Dataframe:
    id         coord_id  val1  val2 record  val3
0  snp     chr15_1-1000  1.0    0.9   xx12     2
1  snv     chr15_1-1000  1.0    0.7   yy12    -4
2  ins     chr15_1-1000  0.01   0.7   jj12    -4
3  ins     chr15_1-1000  1.0    1.5   zzy1    -5
4  ins     chr15_1-1000  1.0    1.5   zzy1    -5
5  del  chr10_2000-4000  0.1    1.2   j112    12
6  del  chr10_2000-4000  0.4    1.1   jh12    15
I am trying to count the number of times each coord_id appears for each id, while keeping the val1 column in the resulting table, but only as a range of the values in that column. For instance, I am trying to accomplish the following result:
id               snp  snv  ins  del  total      val1
chr15_1-1000       1    1    3    0      5  0.01-1.0
chr10_2000-4000    0    0    0    2      2   0.1-0.4
I want to sort it in ascending order by the total column.
Thanks very much in advance.
First pivot into id columns with count aggregation and margin sums. Then join() with the val1 min-max strings:
(df.pivot_table(index='coord_id', columns='id', values='val1',
                aggfunc='count', fill_value=0,
                margins=True, margins_name='total')
   .join(df.groupby('coord_id').val1.agg(lambda x: f'{x.min()}-{x.max()}'))
   .sort_values('total', ascending=False)
   .drop('total'))
#                  del  ins  snp  snv  total      val1
# coord_id
# chr15_1-1000       0    3    1    1      5  0.01-1.0
# chr10_2000-4000    2    0    0    0      2   0.1-0.4
I suggest making two computations separately: get the range and count the frequency.
temp = test_df.groupby(['coord_id']).agg({'val1': ['min', 'max']})
temp.columns = temp.columns.get_level_values(1)
temp['val1'] = temp['min'].astype(str) + '-' + temp['max'].astype(str)
Then,
temp2 = test_df.groupby(['coord_id', 'id']).size().unstack('id', fill_value=0)
And, finally, merging
answer = pd.concat([temp, temp2], axis=1)
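The question also asks for a total column and an ascending sort on it, which the snippet above does not produce; a small sketch to add those, assuming the temp, temp2 and answer objects defined above:
# row-wise sum of the per-id counts gives the 'total' column
answer['total'] = temp2.sum(axis=1)
# drop the helper min/max columns and sort ascending by total, as requested
answer = answer.drop(columns=['min', 'max']).sort_values('total')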

Is there a Pandas way to group dates that are less than 2 minutes apart in a dataframe?

I would like to group my data so that dates that are close together (less than 2 minutes apart) end up in the same group.
Here is an example of what I have:
datas = [['A', 51, 'id1', '2020-05-27 05:50:43.346'],
         ['A', 51, 'id2', '2020-05-27 05:51:08.347'],
         ['B', 45, 'id3', '2020-05-24 17:23:55.142'],
         ['B', 45, 'id4', '2020-05-24 17:23:30.141'],
         ['C', 34, 'id5', '2020-05-23 17:31:10.341']]

df = pd.DataFrame(datas, columns=['col1', 'col2', 'cold_id', 'dates'])
The first 2 rows have close dates, the same goes for the 3rd and 4th rows, and the 5th row is alone.
I would like to get something like this:
datas = [['A', 51, 'id1 id2', 'date_1'],
         ['B', 45, 'id3 id4', 'date_2'],
         ['C', 34, 'id5', 'date_3']]

df = pd.DataFrame(datas, columns=['col1', 'col2', 'col_id', 'dates'])
Doing it in plain Python is not that hard, but I have to do it on a big dataframe, so a pandas approach using the groupby method would be much more efficient.
After applying a datetime conversion to the dates column I tried:
df.groupby([df['dates'].dt.date]).agg(','.join)
but the .dt.date accessor groups by calendar day, not by 2-minute proximity.
Do you have a solution?
Thank you.
Looking at the expected output, it seems we are trying to group the dates into 2-minute frequency bins together with col1 and col2.
Code
df['dates'] = pd.to_datetime(df.dates)
df.groupby([pd.Grouper(key='dates', freq='2 min'), 'col1', 'col2']) \
  .agg(','.join).reset_index().sort_values('col1').reset_index(drop=True)
Output
                dates col1  col2  cold_id
0 2020-05-27 05:50:00    A    51  id1,id2
1 2020-05-24 17:22:00    B    45  id3,id4
2 2020-05-23 17:30:00    C    34      id5
Using only the dates, this is what I did to classify your rows.
First of all, convert the dates to timestamps so they can be compared easily:
from datetime import datetime
import time
df["dates"] = df["dates"].apply(lambda x : int(time.mktime(datetime.strptime(x,"%Y-%m-%d %H:%M:%S.%f").timetuple())))
Then, sort them by date:
df = df.sort_values("dates")
Finally, following the approach from another answer, I create a group column in order to identify close dates. The first line puts a 1 in the group column whenever a row is more than 120 seconds after the previous one, i.e. whenever a new group starts. The second line then fills in the group column, replacing the NaN values and assigning group numbers via a cumulative sum:
df.loc[(df.dates.shift() < df.dates - 120),"group"] = 1
df['group'] = df['group'].cumsum().ffill().fillna(0)
This gives me:
  col1  col2 cold_id       dates  group
4    C    34     id5  1590247870   0.00
3    B    45     id4  1590333810   1.00
2    B    45     id3  1590333835   1.00
0    A    51     id1  1590551443   2.00
1    A    51     id2  1590551468   2.00
Now, to concatenate your cold_id values, group by the group column and join the cold_id values of each group using transform:
df["cold_id"] = df.groupby(["group"],as_index=False)["cold_id"].transform(lambda x: ','.join(x))
df = df.drop_duplicates(subset=["cold_id"])
This finally gives you this dataframe:
  col1  col2  cold_id       dates  group
4    C    34      id5  1590247870   0.00
3    B    45  id4,id3  1590333810   1.00
0    A    51  id1,id2  1590551443   2.00
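A sketch combining the two ideas, doing the gap-based grouping directly on datetimes with a Timedelta instead of converting to Unix timestamps (assuming the df from the question; the 'grp' name is arbitrary):
df['dates'] = pd.to_datetime(df['dates'])
df = df.sort_values('dates')

# start a new group whenever the gap to the previous row exceeds 2 minutes
grp = (df['dates'].diff() > pd.Timedelta(minutes=2)).cumsum().rename('grp')

out = (df.groupby(grp)
         .agg({'col1': 'first', 'col2': 'first',
               'cold_id': ' '.join, 'dates': 'first'})
         .reset_index(drop=True))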

select the first n largest groups from grouped data frames

Data frame (df) structure:
col1  col2
x     3131
y     9647
y     9648
z     9217
y     9652
x       23
Grouping:
grouped = df.groupby('col1')
I want to select the first 2 largest groups, i.e.,
y 9647
y 9648
y 9652
and
x 3131
x 23
How can I do that using pandas? I've achieved it using lists, but that gets clumsy because I end up with a list of tuples and have to convert them back to dataframes.
Use value_counts, take the first 2 values of its index, and filter rows with isin in boolean indexing:
df1 = df[df['col1'].isin(df['col1'].value_counts().index[:2])]
print (df1)
  col1  col2
0    x  3131
1    y  9647
2    y  9648
4    y  9652
5    x    23
If you need separate DataFrames for the top groups, use a dictionary comprehension with enumerate:
dfs = {i: df[df['col1'].eq(x)] for i, x in enumerate(df['col1'].value_counts().index[:2], 1)}
print (dfs)
{1: col1 col2
1 y 9647
2 y 9648
4 y 9652, 2: col1 col2
0 x 3131
5 x 23}
print (dfs[1])
col1 col2
1 y 9647
2 y 9648
4 y 9652
print (dfs[2])
col1 col2
0 x 3131
5 x 23
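If you prefer to work from the grouped object itself, here is a sketch using group sizes (assuming the df above):
grouped = df.groupby('col1')

# names of the 2 largest groups by row count
top = grouped.size().nlargest(2).index

# one DataFrame per top group, keyed by the group name
dfs = {name: grouped.get_group(name) for name in top}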

Looking for the Max Sum, based on Criteria and Unique Values

Col1  Col2  Col3
a        3     x
b        2     x
c        2     x
a        1     x
b        3     x
c        1     y
a        2     y
b        1     y
c        3     y
Using the table above, can anyone give me a formula to find:
The max of the sums of Col2 per unique value of Col1, when Col3 = x
(The answer should be 5; it would be 4 based on Col3 = y.)
Create a PivotTable with Col3 as FILTERS (select x), Col1 for ROWS and Sum of Col2 for VALUES. Uncheck Show grand totals for Columns and then for whichever column contains Sum of Col2 take the maximum, say:
=MAX(F:F)
Well it's not ideal but it works:
In column D, put in an array formula for MAX of IF:
in D2: =MAX(IF($C$2:$C$10=C2,SUM(IF($A$2:$A$10=A2,IF($C$2:$C$10=C2,$B$2:$B$10)))))
Change the ranges obviously.
Then in E2 put this: =MAX(IF($C$2:$C$10=C2,$D$2:$D$10))
These are both array formulas so after inputting them you must press CTRL-SHIFT-ENTER not just enter.
Then drag down.
There may be a way to combine these, but my array formula knowledge is limited.
Here are the results:
Col1  Col2  Col3  D: sum of Col2 per Col1 within Col3  E: max of col D per Col3
a        3     x                                    4                         5
b        2     x                                    5                         5
c        2     x                                    2                         5
a        1     x                                    4                         5
b        3     x                                    5                         5
c        1     y                                    4                         4
a        2     y                                    2                         4
b        1     y                                    1                         4
c        3     y                                    4                         4
If you don't use CTRL-SHIFT-ENTER you will get 18 and 5 all the way down.
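For readers coming from the pandas questions above, the same computation can be sketched in pandas, assuming the table is loaded into a dataframe df with columns Col1, Col2 and Col3:
# sum Col2 per Col1 for the rows where Col3 == 'x', then take the max of those sums
max_sum = df.loc[df['Col3'] == 'x'].groupby('Col1')['Col2'].sum().max()   # -> 5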
