How to group a dataframe by multiple columns, sum and sort the totals in descending order?

Given the following dataframe:
user_id col1 col2
1 A 4
1 A 22
1 A 112
1 B -0.22222
1 B 9
1 C 0
2 A -1
2 A -5
2 K NA
I want to group by user_id and col1, count the rows in each group, and then sort the counts within each user_id in descending order.
Here is what I'm trying, but I don't get the right output:
df[["user_id", "col1"]]. \
groupby(["user_id", "col1"]). \
agg(counts=("col1","count")). \
reset_index(). \
sort_values(["user_id", "col1", "counts"], ascending=False)
Please advise what I should change to make it work.
Expected output:
user_id col1 counts
1 A 3
B 2
C 1
2 A 2
K 1

Use GroupBy.size:
In [199]: df.groupby(['user_id', 'col1']).size()
Out[199]:
user_id col1
1 A 3
B 2
C 1
2 A 2
K 1
OR:
In [201]: df.groupby(['user_id', 'col1']).size().reset_index(name='counts')
Out[201]:
user_id col1 counts
0 1 A 3
1 1 B 2
2 1 C 1
3 2 A 2
4 2 K 1
EDIT:
In [206]: df.groupby(['user_id', 'col1']).agg({'col2': 'size'})
Out[206]:
col2
user_id col1
1 A 3
B 2
C 1
2 A 2
K 1
EDIT-2: For sorting, use:
In [213]: df.groupby(['user_id', 'col1'])['col2'].size().sort_values(ascending=False)
Out[213]:
user_id col1
1 A 3
2 A 2
1 B 2
2 K 1
1 C 1
Name: col2, dtype: int64
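One detail worth checking in the answer above is the difference between size and count, since the sample data has an NA in col2 for the (2, K) group. The following is a minimal sketch, assuming the question's table is rebuilt by hand with NaN standing in for NA; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN stands in for the "NA" entry.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "col1": ["A", "A", "A", "B", "B", "C", "A", "A", "K"],
    "col2": [4, 22, 112, -0.22222, 9, 0, -1, -5, np.nan],
})

grouped = df.groupby(["user_id", "col1"])
sizes = grouped.size()             # number of rows per group, NaN rows included
counts = grouped["col2"].count()   # number of non-missing col2 values per group

print(sizes.loc[(2, "K")])    # 1
print(counts.loc[(2, "K")])   # 0
```

If the goal is the number of rows per group, size is the safer choice; count answers a different question whenever the counted column contains missing values.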

Using the main idea from Mayank's answer:
df.groupby(["user_id", "col1"]).size().reset_index(name="counts").sort_values(["user_id", "counts"], ascending=[True, False])
This solved my issue.
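For reference, here is a self-contained sketch of the whole pipeline that reproduces the expected output from the question. It assumes the column is named user_id as in the sample table, and it sorts user_id ascending and counts descending, so the counts are ordered within each user.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "col1": ["A", "A", "A", "B", "B", "C", "A", "A", "K"],
})

result = (
    df.groupby(["user_id", "col1"])
      .size()                                  # rows per (user_id, col1) pair
      .reset_index(name="counts")              # back to a flat DataFrame
      .sort_values(["user_id", "counts"],      # users ascending,
                   ascending=[True, False])    # counts descending within a user
      .reset_index(drop=True)
)
print(result)
```

Passing a list to ascending gives each sort key its own direction, which a single ascending=False cannot do.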

Related

Group by and drop duplicates in pandas dataframe

I have a pandas dataframe as below. I want to group by all three columns and, for each col1/col2 combination, retain only the group with the max col3.
import pandas as pd
df = pd.DataFrame({'col1':['A', 'A', 'A', 'A', 'B', 'B'], 'col2':['1', '1', '1', '1', '2', '3'], 'col3':['5', '5', '2', '2', '2', '3']})
df
col1 col2 col3
0 A 1 5
1 A 1 5
2 A 1 2
3 A 1 2
4 B 2 2
5 B 3 3
My expected output
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
I tried the code below, but it returns the last row of each group. Instead, I want to sort by col3 and keep the group with the max col3.
df.drop_duplicates(keep='last', subset=['col1','col2','col3'])
col1 col2 col3
1 A 1 5
3 A 1 2
4 B 2 2
5 B 3 3
For example, here I want to drop the first group because 2 < 5, so I want to keep the group with col3 equal to 5:
df.sort_values(by=['col1', 'col2', 'col3'], ascending=False)
a_group = df.groupby(['col1', 'col2', 'col3'])
for name, group in a_group:
    group = group.reset_index(drop=True)
    print(group)
col1 col2 col3
0 A 1 2
1 A 1 2
col1 col2 col3
0 A 1 5
1 A 1 5
col1 col2 col3
0 B 2 2
col1 col2 col3
0 B 3 3
You can't group on all columns, because the column whose max you want to keep has different values within each group. Instead, leave that column out of the grouping keys and group on the others:
col_to_max = 'col3'
i = df.columns.difference([col_to_max])
out = df[df[col_to_max] == df.groupby(list(i))[col_to_max].transform('max')]
print(out)
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
So we can do
out = df[df.col3==df.groupby(['col1','col2'])['col3'].transform('max')]
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
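I believe you can use groupby with nlargest(2). Note that this keeps the two largest values per group rather than every row tied for the max, so it matches the expected output only because each max occurs at most twice here. Also make sure that 'col3' is numeric.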
>>> df['col3'] = df['col3'].astype(int)
>>> df.groupby(['col1','col2'])['col3'].nlargest(2).reset_index().drop('level_2',axis=1)
col1 col2 col3
0 A 1 5
1 A 1 5
2 B 2 2
3 B 3 3
You can get the index of the rows that don't hold the col3 max value, get the index of the duplicated rows, and drop the intersection of the two:
ind = df.assign(max = df.groupby("col1")["col3"].transform("max")).query("max != col3").index
ind2 = df[df.duplicated(keep=False)].index
df.drop(set(ind).intersection(ind2))
col1 col2 col3
0 A 1 5
1 A 1 5
4 B 2 2
5 B 3 3
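The transform-based approach from the answers above can be run end to end as follows. This is a minimal sketch, assuming col3 should be compared numerically (the question stores it as strings), so it is converted to int first.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["A", "A", "A", "A", "B", "B"],
    "col2": ["1", "1", "1", "1", "2", "3"],
    "col3": ["5", "5", "2", "2", "2", "3"],
})

# Compare numbers, not strings ("10" < "9" as strings).
df["col3"] = df["col3"].astype(int)

# transform("max") returns a Series aligned with df, holding each row's group max.
group_max = df.groupby(["col1", "col2"])["col3"].transform("max")

# Keep every row whose col3 equals the max of its (col1, col2) group.
out = df[df["col3"] == group_max]
print(out)
```

Unlike nlargest(2), this keeps all rows tied for the maximum, however many there are, and it preserves the original index.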

Pandas: Create different dataframes from an unique multiIndex dataframe

I would like to know how to go from a MultiIndex dataframe like this:
A B
col1 col2 col1 col2
1 2 12 21
3 1 2 0
to two separate dataframes. df_A:
col1 col2
1 2
3 1
df_B:
col1 col2
12 21
2 0
Thank you for the help
I think it is better here to use DataFrame.xs to select by the first level:
print (df.xs('A', axis=1, level=0))
col1 col2
0 1 2
1 3 1
What you need is not recommended, but it is possible to create DataFrames by group:
for i, g in df.groupby(level=0, axis=1):
    globals()['df_' + str(i)] = g.droplevel(level=0, axis=1)
print (df_A)
col1 col2
0 1 2
1 3 1
A better option is to create a dictionary of DataFrames:
d = {i: g.droplevel(level=0, axis=1) for i, g in df.groupby(level=0, axis=1)}
print (d['A'])
col1 col2
0 1 2
1 3 1
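Recent pandas releases emit a deprecation warning for groupby(..., axis=1), so the dictionary can also be built with DataFrame.xs alone. The sketch below assumes the frame from the question is reconstructed with MultiIndex.from_product; the name frames is illustrative.

```python
import pandas as pd

# Rebuild the two-level column layout from the question.
columns = pd.MultiIndex.from_product([["A", "B"], ["col1", "col2"]])
df = pd.DataFrame([[1, 2, 12, 21], [3, 1, 2, 0]], columns=columns)

# One sub-frame per top-level label; xs drops the selected level by default.
frames = {
    key: df.xs(key, axis=1, level=0)
    for key in df.columns.get_level_values(0).unique()
}

print(frames["A"])
print(frames["B"])
```

Looking frames up by key (frames["A"], frames["B"]) avoids writing into globals() and works for any number of top-level labels.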

Add rows to dataframe using the string values from column

I want to add rows to a dataframe based on the string values in one column: for each row, a value such as 1:2:3 should be split, and each piece should become its own row, as described in the example below:
I have this kind of data:
Col1 | Col2
1 | 1:2:3
2 | 4:5
I want to transform it to look like this:
Col1 | Col2
1 | 1
1 | 2
1 | 3
2 | 4
2 | 5
I know that this can be done using nested for loops, but I'm sure there's a better way to do it.
Use split and explode:
df=df.assign(Col2=df.Col2.str.split(':')).explode('Col2')
Out[161]:
Col1 Col2
0 1 1
0 1 2
0 1 3
1 2 4
1 2 5
df = pd.DataFrame({'Col1':[1,2],'Col2':['1:2:3','4:5']})
Split the values in Col2 so they are lists and explode.
>>> df['Col2'] = df.apply(lambda x: x['Col2'].split(':'), axis = 1)
>>> df.explode('Col2')
Col1 Col2
0 1 1
0 1 2
0 1 3
1 2 4
1 2 5
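Both answers leave Col2 as strings and repeat the original index labels (0, 0, 0, 1, 1). If integers and a fresh index are wanted, a possible follow-up is sketched below; the conversion to int is an assumption about the desired result, not something the question states.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2], "Col2": ["1:2:3", "4:5"]})

out = (
    df.assign(Col2=df["Col2"].str.split(":"))  # "1:2:3" -> ["1", "2", "3"]
      .explode("Col2")                         # one row per list element
      .astype({"Col2": int})                   # strings -> integers
      .reset_index(drop=True)                  # replace the repeated index labels
)
print(out)
```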

Shuffle pandas columns

I have the following data frame:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
2 7 8 9 2
and I would like to have a shuffled output like :
Col3 Col1 Col2 Type
0 3 1 2 1
1 6 4 5 1
2 9 7 8 2
How to achieve this?
Use DataFrame.sample with axis=1:
df = df.sample(frac=1, axis=1)
If the last column needs to keep its position:
a = df.columns[:-1].to_numpy()
np.random.shuffle(a)
print (a)
['Col3' 'Col1' 'Col2']
df = df[np.append(a, ['Type'])]
print (df)
Col3 Col1 Col2 Type
0 3 1 2 1
1 6 4 5 1
2 9 7 8 2
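A variant of the same idea is sketched below using NumPy's Generator API. Generator.permutation returns a shuffled copy, so the original column labels are never modified in place; the fixed seed is only there to make the run repeatable and is an assumption, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Col1": [1, 4, 7],
    "Col2": [2, 5, 8],
    "Col3": [3, 6, 9],
    "Type": [1, 1, 2],
})

rng = np.random.default_rng(seed=0)  # drop the seed for a different order each run

# Shuffle every label except the last; permutation works on a copy.
shuffled = rng.permutation(df.columns[:-1].to_numpy())

# Reassemble: shuffled columns first, "Type" kept in last position.
out = df[[*shuffled, "Type"]]
print(out)
```

Each column's values travel with its label, so only the column order changes.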

How to perform arithmetic operations with specific elements of a dataframe?

I am trying to understand how to perform arithmetic operations on a dataframe in python.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':[2,38,7,5],'col2':[1,3,2,4]})
print (df.sum())
This is what I'm getting (in terms of the output), but I want to have more control over which sum I am getting.
col1 52
col2 10
dtype: int64
Just wondering how I would add individual elements in the dataframe together.
Your question is not very clear, but I will try to cover the possible scenarios. Each example below starts again from the original df.
Input:
df
col1 col2
0 2 1
1 38 3
2 7 2
3 5 4
If you want the sum of columns,
df.sum(axis = 0)
Output:
col1 52
col2 10
dtype: int64
If you want the sum of rows,
df.sum(axis = 1)
0 3
1 41
2 9
3 9
dtype: int64
If you want to add a list of numbers into a column,
num = [1, 2, 3, 4]
df['col1'] = df['col1'] + num
df
Output:
col1 col2
0 3 1
1 40 3
2 10 2
3 9 4
If you want to add a list of numbers into a row,
num = [1, 2]
df.loc[0] = df.loc[0] + num
df
Output:
col1 col2
0 3 3
1 38 3
2 7 2
3 5 4
If you want to add a single number to a column,
df['col1'] = df['col1'] + 2
df
Output:
col1 col2
0 4 1
1 40 3
2 9 2
3 7 4
If you want to add a single number to a row,
df.loc[0] = df.loc[0] + 2
df
Output:
col1 col2
0 4 3
1 38 3
2 7 2
3 5 4
If you want to add a number to a single element (the element at row i and column j),
df.iloc[1,1] = df.iloc[1,1] + 5
df
Output:
col1 col2
0 2 1
1 38 8
2 7 2
3 5 4
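To add individual elements together, as the question asks, label-based accessors can pick out the cells first. This is a small sketch on the same df; the particular cells chosen are arbitrary examples.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 38, 7, 5], "col2": [1, 3, 2, 4]})

# Add two specific cells: row 0 of col1 and row 3 of col2.
pair_sum = df.at[0, "col1"] + df.at[3, "col2"]   # 2 + 4 = 6

# Sum a chosen subset of one column: rows 1 and 2 of col1.
subset_sum = df.loc[[1, 2], "col1"].sum()        # 38 + 7 = 45

# Sum every cell in the frame.
total = df.to_numpy().sum()                      # 52 + 10 = 62

print(pair_sum, subset_sum, total)
```

df.at reads a single cell by row and column label, while df.loc accepts lists or slices when several cells are needed.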
