Sorting pivot table (multi index) - python-3.x

I'm trying to sort a pivot table's values in descending order after putting two "row labels" (Excel term) on the pivot.
sample data:
x = pd.DataFrame({'col1':['a','a','b','c','c', 'a','b','c', 'a','b','c'],
'col2':[ 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
'col3':[ 1,.67,0.5, 2,.65, .75,2.25,2.5, .5, 2,2.75]})
print(x)
col1 col2 col3
0 a 1 1.00
1 a 1 0.67
2 b 1 0.50
3 c 1 2.00
4 c 1 0.65
5 a 2 0.75
6 b 2 2.25
7 c 2 2.50
8 a 3 0.50
9 b 3 2.00
10 c 3 2.75
To create the pivot, I'm using the following function:
pt = pd.pivot_table(x, index = ['col1', 'col2'], values = 'col3', aggfunc = np.sum)
print(pt)
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 1 0.50
2 2.25
3 2.00
c 1 2.65
2 2.50
3 2.75
In words, this variable pt is first sorted by col1, then by values of col2 within col1 then by col3 within all of those. This is great, but I would like to sort by col3 (the values) while keeping the groups that were broken out in col2 (this column can be any order and shuffled around).
The target output would look something like this (col3 in descending order with any order in col2 with that group of col1):
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 2 2.25
3 2.00
1 0.50
c 3 2.75
1 2.65
2 2.50
I have tried the code below, but this just sorts the entire pivot table values and loses the grouping (I'm looking for sorting within the group).
pt.sort_values(by = 'col3', ascending = False)
For guidance, a similar question was asked (and answered) here, but I was unable to get a successful output with the provided output:
Pandas: Sort pivot table
The error I get from that answer is ValueError: all keys need to be the same shape

You need reset_index for DataFrame, then sort_values by col1 and col3 and last set_index for MultiIndex:
df = df.reset_index()
.sort_values(['col1','col3'], ascending=[True, False])
.set_index(['col1','col2'])
print (df)
col3
col1 col2
a 1 1.67
2 0.75
3 0.50
b 2 2.25
3 2.00
1 0.50
c 3 2.75
1 2.65
2 2.50

Related

How to reorder values across columns in a pandas DataFrame?

How can I reorder the values, in each column for each row in ascending order?
My DataFrame:
data = pd.DataFrame({'date': ['1/1/2021','1/1/2021','1/2/2021'],
'col1': [7,2,6],
'col2': [2,4,8],
'col3': [1,2,7]
})
print(data)
date col1 col2 col3
0 1/1/2021 7 2 1
1 1/1/2021 2 4 2
2 1/2/2021 6 8 7
However, I need to reorder the values in each row to be in ascending order, across the columns. So, the end result needs to look like;
date col1 col2 col3
0 1/1/2021 1 2 7
1 1/1/2021 2 2 4
2 1/2/2021 6 7 8
You can use np.sort along axis=1 to sort the columns col1, col2 and col3 of the dataframe in the ascending order:
cols = ['col1', 'col2', 'col3']
data.loc[:, cols] = np.sort(data[cols], axis=1)
>>> data
date col1 col2 col3
0 1/1/2021 1 2 7
1 1/1/2021 2 2 4
2 1/2/2021 6 7 8
you can np.sort and then df.join
data[['date']].join(pd.DataFrame(np.sort(data.drop('date',1),axis=1)).add_prefix('col'))
date col0 col1 col2
0 1/1/2021 1 2 7
1 1/1/2021 2 2 4
2 1/2/2021 6 7 8

groupby column in pandas

I am trying to groupby columns value in pandas but I'm not getting.
Example:
Col1 Col2 Col3
A 1 2
B 5 6
A 3 4
C 7 8
A 11 12
B 9 10
-----
result needed grouping by Col1
Col1 Col2 Col3
A 1,3,11 2,4,12
B 5,9 6,10
c 7 8
but I getting this ouput
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000025BEB4D6E50>
I am getting using excel power query with function group by and count all rows, but I canĀ“t get the same with python and pandas. Any help?
Try this
(
df
.groupby('Col1')
.agg(lambda x: ','.join(x.astype(str)))
.reset_index()
)
it outputs
Col1 Col2 Col3
0 A 1,3,11 2,4,12
1 B 5,9 6,10
2 C 7 8
Very good I created solution between 0 and 0:
df[df['A'] != 0].groupby((df['A'] == 0).cumsum()).sub()
It will group column between 0 and 0 and sum it

pandas fill column with random numbers with a total for each row

I've got a pandas dataframe like this:
id foo
0 A col1
1 A col2
2 B col1
3 B col3
4 D col4
5 C col2
I'd like to create four additional columns based on unique values in foo column. col1,col2, col3, col4
id foo col1 col2 col3 col4
0 A col1 75 20 5 0
1 A col2 20 80 0 0
2 B col1 82 10 8 0
3 B col3 5 4 80 11
4 D col4 0 5 10 85
5 C col2 12 78 5 5
The logic for creating the columns is as follows:
if foo = col1 then col1 contains a random number between 75-100 and the other columns (col2, col3, col4) contains random numbers, such that the total for each row is 100
I can manually create a new column and assign a random number, but I'm unsure how to include the logic of sum for each row of 100.
Appreciate any help!
My two cents
d=[]
s=np.random.randint(75,100,size=6)
for x in 100-s:
a=np.random.randint(100, size=3)
b=np.random.multinomial(x, a /a.sum())
d.append(b.tolist())
s=[np.random.choice(x,4,replace= False) for x in np.column_stack((s,np.array(d))) ]
df=pd.concat([df,pd.DataFrame(s,index=df.index)],1)
df
id foo 0 1 2 3
0 A col1 16 1 7 76
1 A col2 4 2 91 3
2 B col1 4 4 1 91
3 B col3 78 8 8 6
4 D col4 8 87 3 2
5 C col2 2 0 11 87
IIUC,
df['col1'] = df.apply(lambda x: np.where(x['foo'] == 'col1', np.random.randint(75,100), np.random.randint(0,100)), axis=1)
df['col2'] = df.apply(lambda x: np.random.randint(0,100-x['col1'],1)[0], axis=1)
df['col3'] = df.apply(lambda x: np.random.randint(0,100-x[['col1','col2']].sum(),1)[0], axis=1)
df['col4'] = 100 - df[['col1','col2','col3']].sum(1).astype(int)
df[['col1','col2','col3','col4']].sum(1)
Output:
id foo col1 col2 col3 col4
0 A col1 92 2 5 1
1 A col2 60 30 0 10
2 B col1 89 7 3 1
3 B col3 72 12 0 16
4 D col4 41 52 3 4
5 C col2 72 2 22 4
My Approach
import numpy as np
def weird(lower, upper, k, col, cols):
first_num = np.random.randint(lower, upper)
delta = upper - first_num
the_rest = np.random.rand(k - 1)
the_rest = the_rest / the_rest.sum() * (delta)
the_rest = the_rest.astype(int)
the_rest[-1] = delta - the_rest[:-1].sum()
key = lambda x: x != col
return dict(zip(sorted(cols, key=key), [first_num, *the_rest]))
def f(c): return weird(75, 100, 4, c, ['col1', 'col2', 'col3', 'col4'])
df.join(pd.DataFrame([*map(f, df.foo)]))
id foo col1 col2 col3 col4
0 A col1 76 2 21 1
1 A col2 11 76 11 2
2 B col1 75 4 10 11
3 B col3 0 1 97 2
4 D col4 5 4 13 78
5 C col2 9 77 6 8
If we subtract the numbers between 75-100 by 75, the problem become generating a table of random number between 0-25 whose each row sums to 25. That can be solve by reverse cumsum:
num_cols = 4
# generate random number and sort them in each row
a = np.sort(np.random.randint(0,25, (len(df), num_cols)), axis=1)
# create a dataframe and attach a last column with values 25
new_df = pd.DataFrame(a)
new_df[num_cols] = 25
# compute the difference, which are our numbers and add to the dummies:
dummies = pd.get_dummies(df.foo) * 75
dummies += new_df.diff(axis=1).fillna(new_df[0]).values
And dummies is
col1 col2 col3 col4
0 76.0 13.0 2.0 9.0
1 1.0 79.0 2.0 4.0
2 76.0 5.0 8.0 9.0
3 1.0 3.0 79.0 10.0
4 1.0 2.0 1.0 88.0
5 1.0 82.0 1.0 7.0
which can be concatenated to the original dataframe.

match multiple columns within the same row

Table 1. I have a table that looks like this:
X Y Z
1 a p
2 a p
6 b p
7 c p
9 c p
Table 2. I have a different table that looks like this:
Col1 Col2 Col3 Col4
Row1 p p p
Row2 a b c
Row3 1
Row4 2
Row5 3
Row6 4
Row7 5
Row8 6
Row9 7
Row10 8
Row11 9
I want to mark "TRUE" when rows of table 1 match with values of its column in Table 1. As a result for example:
Col1 Col2 Col3 Col4
Row1 p p p
Row2 a b c
Row3 1 TRUE
Row4 2 TRUE
Row5 3
Row6 4
Row7 5
Row8 6 TRUE
Row9 7 TRUE
Row10 8
Row11 9 TRUE
Here is what I have tried so far. This is the formula for Col2 Row3:
=IFERROR(IF(AND(AND(MATCH(Col1Row3,X:X,0), MATCH(Col2Row1,Z:Z,0)), MATCH(Col2Row2,Y:Y,0)), "TRUE", ""),"")
I think it's not working because I am not containing the matches within the same row. How can I achieve my result?
Also, I do not want to specify a specific row in the formula because I have thousands of rows in Table 1, and Table 2 has to select values among those thousands of rows.
Use COUNTIFS
=IF(COUNTIFS($F:$F,$A3,$G:$G,B$2,$H:$H,B$1),TRUE,"")

Pandas: Calculate Median of Group over Columns

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'COL1': ['A', 'A','A','A','B','B'],
'COL2' : ['AA','AA','BB','BB','BB','BB'],
'COL3' : [2,3,4,5,4,2],
'COL4' : [0,1,2,3,4,2]})
df
COL1 COL2 COL3 COL4
0 A AA 2 0
1 A AA 3 1
2 A BB 4 2
3 A BB 5 3
4 B BB 4 4
5 B BB 2 2
I would like, as efficiently as possible (i.e. via groupby and lambda x or better), to find the median of columns 3 and 4 for each distinct group of columns 1 and 2.
The desired result is as follows:
COL1 COL2 COL3 COL4 MEDIAN
0 A AA 2 0 1.5
1 A AA 3 1 1.5
2 A BB 4 2 3.5
3 A BB 5 3 3.5
4 B BB 4 4 3
5 B BB 2 2 3
Thanks in advance!
You already had the idea -- groupby COL1 and COL2 and calculate median.
m = df.groupby(['COL1', 'COL2'])[['COL3','COL4']].apply(np.median)
m.name = 'MEDIAN'
print df.join(m, on=['COL1', 'COL2'])
COL1 COL2 COL3 COL4 MEDIAN
0 A AA 2 0 1.5
1 A AA 3 1 1.5
2 A BB 4 2 3.5
3 A BB 5 3 3.5
4 B BB 4 4 3.0
5 B BB 2 2 3.0
df.groupby(['COL1', 'COL2']).median()[['COL3','COL4']]

Resources