Pandas Pivot Table Slice Off Level 0 of Index - python-3.x

Given the following data frame and pivot table:
import pandas as pd
df = pd.DataFrame({'A': ['a','a','a','a','a','b','b','b','b'],
                   'B': ['x','y','z','x','y','z','x','y','z'],
                   'C': ['a','b','a','b','a','b','a','b','a'],
                   'D': [7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')
table
       D
A B C
a x a  7
    b  4
  y a  1
    b  5
  z a  3
b x a  5
  y b  3
  z a  1
    b  6
I want the pivot table exactly how it is, minus index level 0, like this:
     D
B C
x a  7
  b  4
y a  1
  b  5
z a  3
x a  5
y b  3
z a  1
  b  6
Thanks in advance!

You can selectively drop an index level using reset_index with param drop=True:
In [95]:
table.reset_index('A', drop=True)
Out[95]:
     D
B C
x a  7
  b  4
y a  1
  b  5
z a  3
x a  5
y b  3
z a  1
  b  6

You can use droplevel on index:
table.index = table.index.droplevel(0)
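For comparison, here is a minimal sketch that rebuilds the pivot table from the question and drops level 'A' in three ways. The variable names (via_reset, via_index, via_method) are only illustrative, and DataFrame.droplevel assumes a reasonably recent pandas (0.24 or later).

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("aaaaabbbb"),
    "B": list("xyzxyzxyz"),
    "C": list("ababababa"),
    "D": [7, 5, 3, 4, 1, 6, 5, 3, 1],
})
table = pd.pivot_table(df, index=["A", "B", "C"], aggfunc="sum")

# 1) reset_index with drop=True discards the level instead of turning it into a column
via_reset = table.reset_index("A", drop=True)

# 2) droplevel on the index itself (work on a copy so `table` is left untouched)
via_index = table.copy()
via_index.index = via_index.index.droplevel(0)

# 3) DataFrame.droplevel returns a new frame, addressed by level name or position
via_method = table.droplevel("A")

print(via_method)
```

All three leave the row order as it was and keep only the B and C levels in the index.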

Related

Order the rows of one dataframe (column with duplicates) based on a column of another dataframe in Python

I have two dataframes, df1 and df2. I want to order the rows of df1 by its SET column (SET contains duplicates; the other columns do not), following the order of the SETf column in df2.
df1 :-
SET Date cust_ID TYPE amt total flag LEVEL
A 6/10/2019 113252981 R 1317 16237 Y 3
C 6/18/2019 112010871 R 4582 12455 Y 2
B 6/22/2019 204671333 S 2364 24311 Y 1
B 6/22/2019 202770598 S 4721 10582 Y 1
B 6/22/2019 202706466 S 1904 25343 N 2
B 6/22/2019 202669668 S 3713 25166 N 1
B 6/22/2019 202754932 T 4792 16888 Y 2
D 6/7/2019 120304631 P 4968 25297 Y 2
D 6/7/2019 112353651 P 1622 14384 Y 3
D 6/7/2019 112349221 P 4721 15878 Y 3
D 6/8/2019 111197161 P 4490 25489 N 2
E 6/8/2019 137049981 Q 4409 10842 Y 2
A 6/8/2019 137281821 Q 1060 24085 Y 2
C 6/8/2019 136390501 Q 1649 13626 N 2
C 6/9/2019 136326431 Q 3822 13599 N 2
df2 :-
s_no SETf
1 B
2 D
3 C
4 A
5 E
I want to sort rows of df1 based on the same order of SETf of df2.
What I tried :-
df1 =df1.set_index('SET')
df1= df1.reindex(df2.index['SETf'])
df1= df1.reset_index()
It does not work because SET has duplicates in df1. In addition, I want the rows ordered by ascending LEVEL within each SET and flag.
If the s_no column in your second dataframe is unique and ascending [1, 2, 3, ...], merge the two dataframes, sort by the merged-in s_no column (followed by flag and LEVEL), and then drop it:
df1 = pd.merge(df1, df2[['SETf', 's_no']].rename({'SETf':'SET'}, axis=1), how='left',on='SET')
df1 = df1.sort_values(['s_no', 'flag', 'LEVEL']).drop('s_no', axis=1)
df1
Out[490]:
SET Date cust_ID TYPE amt total flag LEVEL
5 B 6/22/2019 202669668 S 3713 25166 N 1
4 B 6/22/2019 202706466 S 1904 25343 N 2
2 B 6/22/2019 204671333 S 2364 24311 Y 1
3 B 6/22/2019 202770598 S 4721 10582 Y 1
6 B 6/22/2019 202754932 T 4792 16888 Y 2
10 D 6/8/2019 111197161 P 4490 25489 N 2
7 D 6/7/2019 120304631 P 4968 25297 Y 2
8 D 6/7/2019 112353651 P 1622 14384 Y 3
9 D 6/7/2019 112349221 P 4721 15878 Y 3
13 C 6/8/2019 136390501 Q 1649 13626 N 2
14 C 6/9/2019 136326431 Q 3822 13599 N 2
1 C 6/18/2019 112010871 R 4582 12455 Y 2
12 A 6/8/2019 137281821 Q 1060 24085 Y 2
0 A 6/10/2019 113252981 R 1317 16237 Y 3
11 E 6/8/2019 137049981 Q 4409 10842 Y 2
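The same idea can be sketched without a merge by mapping each SET value to its position in df2 and sorting on that. This is a variation on the answer above, not the answerer's code; the small frames and the helper column name _rank are made up for illustration.

```python
import pandas as pd

# Reduced, hypothetical versions of df1 and df2 from the question
df1 = pd.DataFrame({
    "SET":   ["A", "C", "B", "B", "D"],
    "flag":  ["Y", "Y", "Y", "N", "Y"],
    "LEVEL": [3, 2, 1, 2, 2],
})
df2 = pd.DataFrame({"s_no": [1, 2, 3, 4], "SETf": ["B", "D", "C", "A"]})

# Series that maps each SET label to its rank in df2
order = df2.set_index("SETf")["s_no"]

out = (
    df1.assign(_rank=df1["SET"].map(order))      # temporary sort key
       .sort_values(["_rank", "flag", "LEVEL"])  # df2 order, then flag, then LEVEL
       .drop(columns="_rank")
)
print(out)
```

Any SET value missing from df2 would get NaN as its rank and end up last, which may or may not be what you want.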

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt, df.explode, and also a looping statement but only get error statements. I think it might need nesting but don't have the experience to do so.
index A B C D
0 X d 4 2
1 Y b 5 2
Where column D represents frequency of column C.
desired output is:
index A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
If you want to repeat rows, why not use index.repeat?
import pandas as pd
#recreate the sample dataframe
df = pd.DataFrame({"A":["X","Y"],"B":["d","b"],"C":[4,5],"D":[3,2]}, columns=list("ABCD"))
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)
print(df)
Sample output
A B C
0 X d 4
1 X d 4
2 X d 4
3 Y b 5
4 Y b 5
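Note that the sample above uses D=3 for the first row, so X appears three times. A minimal sketch with the question's own frequencies (D=2 for both rows), using .loc instead of reindex, which should behave the same here because the index is unique:

```python
import pandas as pd

# Data as given in the question: D is the number of times each row should appear
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [2, 2]})

flat = (
    df.loc[df.index.repeat(df["D"])]  # repeat each index label D times
      .drop(columns="D")              # the frequency column is no longer needed
      .reset_index(drop=True)         # renumber rows 0..n-1
)
print(flat)
```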

Easily generate edge list from specific structure using pandas

This is a question about how to make things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each with an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for the mission mid=1, people will travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find the MCVE below:
import pandas as pd
# Input:
df = pd.DataFrame(
[["A", ["C"]],
["A", ["B", "C"]],
["B", ["B"]],
["C", ["D", "E", "F"]]],
columns = ["from", "to"]
).reset_index().rename(columns={'index': 'mid'})
# Create chain:
df['chain'] = df.apply(lambda x: list(x['from']) + x['to'] + list(x['from']), axis=1)
# Explode chain:
df = df.explode('chain')
# Shift to create travel:
df['end'] = df.groupby("mid")["chain"].shift(-1)
# Remove extra row, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: is there a better/easier way to do this with pandas? By better I mean not necessarily more performant (though it can be, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn series of lists in to single series
tmp = df[['mid','to']].explode('to')
# new `from` is concatenation of `from` and the list
df1 = pd.concat((df[['mid','from']],
tmp.rename(columns={'to':'from'})
)
).sort_index()
# new `to` is concatenation of the list and `from`
df2 = pd.concat((tmp,
df[['mid','from']].rename(columns={'from':'to'})
)
).sort_index()
df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
If you don't mind re-constructing the entire DataFrame then you can clean it up a bit with np.roll to get the pairs of destinations and then assign the value of mid based on the number of trips (length of each sublist in l)
import pandas as pd
import numpy as np
from itertools import chain
l = [[fr]+to for fr,to in zip(df['from'], df['to'])]
df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
columns=['from', 'to'])
.assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3
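If readability is the main goal, another option is to build the closed loop in plain Python and hand the rows to pandas at the end. This is only a sketch, not taken from the answers above; the names rows and edges are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", ["C"]],
     ["A", ["B", "C"]],
     ["B", ["B"]],
     ["C", ["D", "E", "F"]]],
    columns=["from", "to"],
).reset_index().rename(columns={"index": "mid"})

# For each mission, pair every stop with the next one; the last stop returns to the origin
rows = [
    (mid, a, b)
    for mid, start, stops in zip(df["mid"], df["from"], df["to"])
    for a, b in zip([start, *stops], [*stops, start])
]

edges = (
    pd.DataFrame(rows, columns=["mid", "from", "to"])
      .rename_axis("tid")   # name the running row number...
      .reset_index()        # ...and expose it as the tid column
)
print(edges)
```

It loops in Python rather than vectorising, so it trades some speed on large inputs for code that mirrors the description of a mission.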

Column label of max in pandas

I am trying to extract the maximum value in each row, and the label of the column it comes from, from a pandas dataframe. For example,
A B C D
index
x 0 1 2 3
y 3 2 1 0
I expect the following output,
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
I tried the following,
df['Maxv'] = df.apply(max,axis=1)
df['Con'] = df.idxmax(axis='rows')
It returned only the max column and 'NaN' for Con column. What is the error here?
Thanks in Advance.
AP
Need axis='columns' or axis=1 in DataFrame.idxmax:
df['Con'] = df.idxmax(axis='columns')
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
Or:
df['Con'] = df.idxmax(axis=1)
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
You get NaNs because the result of idxmax(axis='rows') is indexed by the column labels, so it does not align with the row index of df:
print (df.idxmax(axis='rows'))
A y
B y
C x
D x
dtype: object
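Putting it together, a minimal sketch that rebuilds the sample frame and restricts both calls to the original value columns, so that the new Maxv column can never take part in the comparison. The list value_cols is just a helper name for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 3], "B": [1, 2], "C": [2, 1], "D": [3, 0]},
    index=pd.Index(["x", "y"], name="index"),
)

value_cols = ["A", "B", "C", "D"]                  # columns to compare
df["Maxv"] = df[value_cols].max(axis=1)            # row-wise maximum
df["Con"] = df[value_cols].idxmax(axis="columns")  # label of the column holding it
print(df)
```

With ties, idxmax returns the first column that reaches the maximum.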

python3 modifying rows in a dataframe based on a condition

I have a dataframe something like
A B C
1 4 x
2 8 y
3 7 z
4 12 y
5 10 b
I need to modify column B based on a condition, something like
if B <= 5 then B = 1
if B > 5 and B <= 10 then B = 2
if B > 10 and B < 15 then B = 3
so that my dataframe becomes
A B C
1 1 x
2 2 y
3 2 z
4 3 y
5 2 b
I am okay with adding a new column first and then dropping column B. Could anyone help, please?
One option is to use apply with a function that checks each row.
def check(row):
    if row['B'] <= 5:
        return 1
    elif (row['B'] > 5) and (row['B'] <= 10):
        return 2
    elif (row['B'] > 10) and (row['B'] < 15):
        return 3
    # values outside the ranges above fall through and return None
This applies the function to each row (axis=1) and stores the result back in B:
df['B'] = df.apply(check, axis=1)
Then the resulting DF would look like:
A B C
1 1 x
2 2 y
3 2 z
4 3 y
5 2 b
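For larger frames, a vectorised alternative is pd.cut, which bins the whole column at once. This is a sketch of a different technique from the apply answer above. It assumes right-closed bins, so B == 15 would also map to 3, which differs slightly from the strict B < 15 in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [4, 8, 7, 12, 10],
    "C": ["x", "y", "z", "y", "b"],
})

# Bins are (-inf, 5], (5, 10], (10, 15]; values above 15 would become NaN
df["B"] = pd.cut(
    df["B"],
    bins=[float("-inf"), 5, 10, 15],
    labels=[1, 2, 3],
).astype(int)  # the cast works here because no value falls outside the bins
print(df)
```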
