Order the rows of one dataframe (column with duplicates) based on a column of another dataframe in Python - python-3.x

I have two dataframes, df1 and df2. I want to order the rows of df1 by its SET column (which contains duplicates, while the other columns do not) following the order given by the SETf column of df2.
df1 :-
SET Date cust_ID TYPE amt total flag LEVEL
A 6/10/2019 113252981 R 1317 16237 Y 3
C 6/18/2019 112010871 R 4582 12455 Y 2
B 6/22/2019 204671333 S 2364 24311 Y 1
B 6/22/2019 202770598 S 4721 10582 Y 1
B 6/22/2019 202706466 S 1904 25343 N 2
B 6/22/2019 202669668 S 3713 25166 N 1
B 6/22/2019 202754932 T 4792 16888 Y 2
D 6/7/2019 120304631 P 4968 25297 Y 2
D 6/7/2019 112353651 P 1622 14384 Y 3
D 6/7/2019 112349221 P 4721 15878 Y 3
D 6/8/2019 111197161 P 4490 25489 N 2
E 6/8/2019 137049981 Q 4409 10842 Y 2
A 6/8/2019 137281821 Q 1060 24085 Y 2
C 6/8/2019 136390501 Q 1649 13626 N 2
C 6/9/2019 136326431 Q 3822 13599 N 2
df2 :-
s_no SETf
1 B
2 D
3 C
4 A
5 E
I want to sort the rows of df1 so they follow the same order as SETf in df2.
What I tried :-
df1 = df1.set_index('SET')
df1 = df1.reindex(df2['SETf'])
df1 = df1.reset_index()
This does not work because SET has duplicates in df1. In addition, I want to order the rows by LEVEL ascending within each SET and flag.

If the s_no column in your second dataframe is unique and ascending (1, 2, 3, 4, etc.), merge the two dataframes, sort by the s_no column you merged in, and then drop it:
df1 = pd.merge(df1, df2[['SETf', 's_no']].rename({'SETf': 'SET'}, axis=1),
               how='left', on='SET')
df1 = df1.sort_values(['s_no', 'flag', 'LEVEL']).drop('s_no', axis=1)
df1
Out[490]:
SET Date cust_ID TYPE amt total flag LEVEL
5 B 6/22/2019 202669668 S 3713 25166 N 1
4 B 6/22/2019 202706466 S 1904 25343 N 2
2 B 6/22/2019 204671333 S 2364 24311 Y 1
3 B 6/22/2019 202770598 S 4721 10582 Y 1
6 B 6/22/2019 202754932 T 4792 16888 Y 2
10 D 6/8/2019 111197161 P 4490 25489 N 2
7 D 6/7/2019 120304631 P 4968 25297 Y 2
8 D 6/7/2019 112353651 P 1622 14384 Y 3
9 D 6/7/2019 112349221 P 4721 15878 Y 3
13 C 6/8/2019 136390501 Q 1649 13626 N 2
14 C 6/9/2019 136326431 Q 3822 13599 N 2
1 C 6/18/2019 112010871 R 4582 12455 Y 2
12 A 6/8/2019 137281821 Q 1060 24085 Y 2
0 A 6/10/2019 113252981 R 1317 16237 Y 3
11 E 6/8/2019 137049981 Q 4409 10842 Y 2
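An alternative sketch (not from the original answer): turn SET into an ordered categorical whose category order is taken from df2['SETf'], so sort_values follows df2's order directly and no helper column has to be merged in and dropped:
df1['SET'] = pd.Categorical(df1['SET'], categories=df2['SETf'], ordered=True)  # categories must be unique, as they are in df2
df1 = df1.sort_values(['SET', 'flag', 'LEVEL']).reset_index(drop=True)
The trade-off is that SET's dtype becomes categorical.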

Related

Complete DataFrame with missing steps python

I have a pandas data frame that is missing some rows. It has the following format:
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
...
An observation should pass through every step. For example, id == 1 has steps 1 and 3 but is missing step 2 (which I don't want). id == 2 has steps 1 and 2 and no step 3, which is fine because there is no gap. id == 5 has step 2 but not step 1, so a line is missing there.
I need to add rows to complete the steps, keeping id, var1 and var2 the same as in that id's existing rows.
I would like to obtain this df :
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
1 2 a h
5 1 f m
...
It would be awesome if anyone could help with a smooth solution
You can try pivoting the table, then ffill and bfill:
(df.pivot(index='id', columns='step')
   .groupby(level=0, axis=1)
   .apply(lambda x: x.ffill().bfill())
   .stack()
   .reset_index()
)
Output:
id step var1 var2
0 1 1 a h
1 1 2 e l
2 1 3 d k
3 2 1 b i
4 2 2 e l
5 2 3 d k
6 3 1 c g
7 3 2 e l
8 3 3 d k
9 5 1 c g
10 5 2 f m
11 5 3 d k
12 6 1 g n
13 6 2 f m
14 6 3 d k
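If you want each id to keep its own var1/var2 values (as in the expected output in the question) rather than values filled in from neighbouring ids, a per-group sketch along these lines should work; complete_steps is a hypothetical helper, not part of the answer above:
import pandas as pd

def complete_steps(df):
    # For each id, build the full step range 1..max(step) and fill var1/var2
    # from that id's own rows (assumes each (id, step) appears at most once
    # and var1/var2 are constant within an id).
    out = []
    for id_, grp in df.groupby('id'):
        full = range(1, grp['step'].max() + 1)
        g = grp.set_index('step').reindex(full).ffill().bfill()
        g['id'] = id_
        out.append(g.reset_index())
    return pd.concat(out, ignore_index=True)[['id', 'step', 'var1', 'var2']]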

pandas how to convert a two-dimension dataframe to a one-dimension dataframe

Suppose I have a dataframe with multiple columns:
a b c
1
2
3
How do I convert it to a single-column dataframe like this:
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Please note that the former is a DataFrame, not a Panel.
Use melt:
df = df.reset_index().melt('index', var_name='col').set_index('index')[['col']]
print (df)
col
index
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Or use numpy.repeat and numpy.tile with the DataFrame constructor:
a = np.repeat(df.columns, len(df))
b = np.tile(df.index, len(df.columns))
df = pd.DataFrame(a, index=b, columns=['col'])
print (df)
col
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Another way is:
import itertools

pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0])
Output:
1
0
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
For the exact output, use sort_values:
print(pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0]).sort_values(by=[1]))
1
0
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
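A further sketch (assuming only the index and column labels need to be paired up, as in the example): build the pairs directly with MultiIndex.from_product, which avoids melt and itertools entirely:
pairs = pd.MultiIndex.from_product([df.columns, df.index], names=['col', 'index'])
print(pairs.to_frame(index=False).set_index('index')[['col']])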

Column label of max in pandas

I am trying to extract the maximum value in each row, together with the label of the contributing column, from a pandas dataframe. For example:
A B C D
index
x 0 1 2 3
y 3 2 1 0
I expect the following output,
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
I tried the following,
df['Maxv'] = df.apply(max,axis=1)
df['Con'] = df.idxmax(axis='rows')
It returned only the max column and 'NaN' for Con column. What is the error here?
Thanks in Advance.
AP
You need axis='columns' or axis=1 in DataFrame.idxmax:
df['Con'] = df.idxmax(axis='columns')
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
Or:
df['Con'] = df.idxmax(axis=1)
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
You get NaNs because df.idxmax(axis='rows') returns a Series indexed by the column labels, which does not align with the DataFrame's index:
print (df.idxmax(axis='rows'))
A y
B y
C x
D x
dtype: object
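As a side note (a sketch, not part of the answer above), df.max(axis=1) is the vectorised counterpart of df.apply(max, axis=1), and selecting the original columns explicitly keeps the later string column Con out of the calculation:
cols = ['A', 'B', 'C', 'D']              # the original numeric columns
df['Maxv'] = df[cols].max(axis=1)        # row-wise maximum
df['Con'] = df[cols].idxmax(axis=1)      # label of the column holding that maximum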

Pandas Pivot Table Slice Off Level 0 of Index

Given the following data frame and pivot table:
import pandas as pd

df = pd.DataFrame({'A': ['a','a','a','a','a','b','b','b','b'],
                   'B': ['x','y','z','x','y','z','x','y','z'],
                   'C': ['a','b','a','b','a','b','a','b','a'],
                   'D': [7,5,3,4,1,6,5,3,1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')
table
D
A B C
a x a 7
b 4
y a 1
b 5
z a 3
b x a 5
y b 3
z a 1
b 6
I want the pivot table exactly how it is, minus index level 0, like this:
D
B C
x a 7
b 4
y a 1
b 5
z a 3
x a 5
y b 3
z a 1
b 6
Thanks in advance!
You can selectively drop an index level using reset_index with param drop=True:
In [95]:
table.reset_index('A', drop=True)
Out[95]:
D
B C
x a 7
b 4
y a 1
b 5
z a 3
x a 5
y b 3
z a 1
b 6
You can use droplevel on index:
table.index = table.index.droplevel(0)
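In more recent pandas (0.24 and later), DataFrame.droplevel does the same in a single call; a minimal sketch:
table = table.droplevel('A')  # equivalently table.droplevel(0)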

Pandas Dynamic Stack

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': ['a','b','c','d'],
                   'bar': ['e','f','g','h'],
                   0: ['i','j','k',np.nan],
                   1: ['m',np.nan,'o','p']})
df = df[['foo','bar',0,1]]
df
foo bar 0 1
0 a e i m
1 b f j NaN
2 c g k o
3 d h NaN p
...which resulted from a previous procedure that produced columns 0 and 1 (and may produce more or fewer such columns depending on the data).
I want to somehow stack (if that's the correct term) the data so that each value of 0 and 1 (ignoring NaNs) produces a new row like this:
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
You probably noticed that the common field is foo.
It will likely occur that there are more common fields in my actual data set.
Also, I'm not sure how important it is that the index values repeat in the end result across values of foo. As long as the data is correct, that's my main concern.
Update:
What if I have 2+ common fields like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': ['a','a','b','b'],
                   'foo2': ['a2','b2','c2','d2'],
                   'bar': ['e','f','g','h'],
                   0: ['i','j','k',np.nan],
                   1: ['m',np.nan,'o','p']})
df = df[['foo','foo2','bar',0,1]]
df
foo foo2 bar 0 1
0 a a2 e i m
1 a b2 f j NaN
2 b c2 g k o
3 b d2 h NaN p
You can use set_index, stack and reset_index:
print(df.set_index('foo').stack().reset_index(level=1, drop=True).reset_index(name='bar'))
foo bar
0 a e
1 a i
2 a m
3 b f
4 b j
5 c g
6 c k
7 c o
8 d h
9 d p
If you need the original index, use melt:
print(pd.melt(df.reset_index(),
              id_vars=['index', 'foo'],
              value_vars=['bar', 0, 1],
              value_name='bar')
        .sort_values('index')
        .set_index('index', drop=True)
        .dropna()
        .drop('variable', axis=1)
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
Or use the lesser-known lreshape:
print(pd.lreshape(df.reset_index(), {'bar': ['bar', 0, 1]})
        .sort_values('index')
        .set_index('index', drop=True)
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
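For the update with two or more common fields, a sketch along the same melt lines (value_name='val' is just an illustrative name, chosen so it does not clash with the existing bar column, which newer pandas versions reject):
print(pd.melt(df.reset_index(),
              id_vars=['index', 'foo', 'foo2'],
              value_vars=['bar', 0, 1],
              value_name='val')
        .dropna(subset=['val'])
        .sort_values('index')
        .set_index('index')
        .drop('variable', axis=1)
        .rename_axis(None))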
