Find the rank value based on several columns in a Pandas dataframe - python-3.x

My objective is to find the best code in the dataframe. For each Id No and Pdt_No, I have to find the best code using the prob value. I tried the rank method, but it gives an overall rank, not one specific to each Id No.
Input
Id No Pdt_No code prob
1 pdt1 HHL 0.000000
1 pdt3 HHL 50.000000
1 pdt2 HHL 0.000000
1 pdt5 HHL 50.000000
1 pdt8 HHL 100.000000
1 pdt1 HHL 50.000000
1 pdt2 HHL 100.000000
3 pdt1 HHM 0.000000
3 pdt1 HHM 0.000000
3 pdt1 HHM 25.000000
3 pdt4 HHM 33.333333
3 pdt3 HHM 33.333333
3 pdt2 HHM 0.000000
3 pdt2 HHM 50.000000
4 pdt5 ERS 0.000000
4 pdt2 ERS 0.000000
4 pdt2 MKL 100.000000
4 pdt2 MKL 50.000000
4 pdt5 MKL 5.000000
5 pdt1 MKM 0.000000
5 pdt1 MKM 100.000000
5 pdt1 MKM 33.333333
5 pdt1 LPM 63.333333
5 pdt2 LPM 0.000000
5 pdt2 LPM 0.000000
5 pdt2 LPM 33.333333
5 pdt2 LPM 100.000000
What I have tried is:
df['rank']=df.groupby(['Id No','Pdt_No'])['prob'].rank(ascending=False)
Output:
Id No Pdt_No code prob rank
1 pdt1 HHL 0.000000 2
1 pdt3 HHL 50.000000 1
1 pdt2 HHL 0.000000 2
1 pdt5 HHL 50.000000 1
1 pdt8 HHL 100.000000 1
1 pdt1 HHL 50.000000 1
1 pdt2 HHL 100.000000 1
3 pdt1 HHM 0.000000 2
3 pdt1 HHM 0.000000 2
3 pdt1 HHM 25.000000 1
3 pdt4 HHM 33.333333 1
3 pdt3 HHM 33.333333 1
3 pdt2 HHM 0.000000 2
3 pdt2 HHM 50.000000 1
4 pdt5 ERS 0.000000 2
4 pdt2 ERS 0.000000 3
4 pdt2 MKL 100.000000 1
4 pdt2 MKL 50.000000 2
4 pdt5 MKL 5.000000 1
5 pdt1 MKM 0.000000 4
5 pdt1 MKM 100.000000 1
5 pdt1 MKM 33.333333 3
5 pdt1 LPM 63.333333 2
5 pdt2 LPM 0.000000 3
5 pdt2 LPM 0.000000 3
5 pdt2 LPM 33.333333 2
5 pdt2 LPM 100.000000 1
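The rank column shown above is consistent with a dense rank computed inside each (Id No, Pdt_No) group, where tied prob values share a rank and no ranks are skipped. A minimal sketch, assuming that is the intended result; the frame below is only the Id No 5 subset of the data, and `best` is a hypothetical name for the top row per group:

```python
import pandas as pd

# Subset of the question's data (Id No 5 only), rebuilt by hand.
df = pd.DataFrame({
    "Id No": [5, 5, 5, 5, 5, 5, 5, 5],
    "Pdt_No": ["pdt1", "pdt1", "pdt1", "pdt1", "pdt2", "pdt2", "pdt2", "pdt2"],
    "code": ["MKM", "MKM", "MKM", "LPM", "LPM", "LPM", "LPM", "LPM"],
    "prob": [0.0, 100.0, 33.333333, 63.333333, 0.0, 0.0, 33.333333, 100.0],
})

# method="dense": ties get the same rank and the next rank is not skipped.
# The default method="average" would give the two tied 0.0 rows a rank of 3.5.
df["rank"] = (
    df.groupby(["Id No", "Pdt_No"])["prob"]
      .rank(method="dense", ascending=False)
      .astype(int)
)

# Hypothetical follow-up: keep the row with the highest prob in each group.
best = df.loc[df.groupby(["Id No", "Pdt_No"])["prob"].idxmax()]
print(df)
print(best)
```

If only the best code per group is needed, the `idxmax` step alone may be enough and the rank column can be skipped.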

Related

How to get the max value from previous N rows of a record in Pandas?

I have the following Pandas DataFrame:
date value
2021-01-01 10
2021-01-02 5
2021-01-03 7
2021-01-04 1
2021-01-05 12
2021-01-06 8
2021-01-07 9
2021-01-08 8
2021-01-09 4
2021-01-10 3
I need to get the max value over a window of N rows (the current record plus the previous N-1 rows) and use it in an operation. For example:
For N=3 and operation = current_row / MAX(previous N-1 rows and current row), this should be the result:
date value Operation
2021-01-01 10 10/10
2021-01-02 5 5/10
2021-01-03 7 7/10
2021-01-04 1 1/7
2021-01-05 12 12/12
2021-01-06 8 8/12
2021-01-07 9 9/12
2021-01-08 8 8/9
2021-01-09 4 4/9
2021-01-10 3 3/8
If it's possible, in the spirit of the pythonic way.
Thanks and regards.
We can calculate the rolling max over the value column, then divide the value column by this rolling max to get the result:
df['op'] = df['value'] / df.rolling(3, min_periods=1)['value'].max()
date value op
0 2021-01-01 10 1.000000
1 2021-01-02 5 0.500000
2 2021-01-03 7 0.700000
3 2021-01-04 1 0.142857
4 2021-01-05 12 1.000000
5 2021-01-06 8 0.666667
6 2021-01-07 9 0.750000
7 2021-01-08 8 0.888889
8 2021-01-09 4 0.444444
9 2021-01-10 3 0.375000
You can use .rolling:
df["Operation"] = df.rolling(3, min_periods=1)["value"].apply(
    lambda x: x.iat[-1] / x.max()
)
print(df)
Prints:
date value Operation
0 2021-01-01 10 1.000000
1 2021-01-02 5 0.500000
2 2021-01-03 7 0.700000
3 2021-01-04 1 0.142857
4 2021-01-05 12 1.000000
5 2021-01-06 8 0.666667
6 2021-01-07 9 0.750000
7 2021-01-08 8 0.888889
8 2021-01-09 4 0.444444
9 2021-01-10 3 0.375000
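Both answers can be generalized to any window size. A minimal self-contained sketch, assuming the window always includes the current row; `ratio_to_rolling_max` is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "value": [10, 5, 7, 1, 12, 8, 9, 8, 4, 3],
})

def ratio_to_rolling_max(values: pd.Series, n: int) -> pd.Series:
    """Divide each value by the max of the last n rows, current row included."""
    # min_periods=1 lets the first n-1 rows use a shorter window instead of NaN.
    return values / values.rolling(n, min_periods=1).max()

df["op"] = ratio_to_rolling_max(df["value"], 3)
print(df)
```

The vectorized division is usually preferable to `rolling(...).apply(...)` on large frames, since `apply` calls the Python lambda once per window.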

normalizing pandas dataframe

Given this dataframe:
HOUSEID PERSONID HHSTATE TRPMILES
0 20000017 1 IN 22.000000
1 20000017 1 IN 0.222222
2 20000017 1 IN 22.000000
3 20000017 2 IN 22.000000
4 20000017 2 IN 0.222222
5 20000017 2 IN 0.222222
6 20000231 1 TX 3.000000
7 20000231 1 TX 2.000000
8 20000231 1 TX 6.000000
9 20000231 1 TX 5.000000
I want to normalize TRPMILES by the max TRPMILES value within each HHSTATE:
HOUSEID PERSONID HHSTATE TRPMILES
0 20000017 1 IN 1
1 20000017 1 IN 0.009999
2 20000017 1 IN 1
3 20000017 2 IN 1
4 20000017 2 IN 0.009999
5 20000017 2 IN 0.009999
6 20000231 1 TX 0.500000
7 20000231 1 TX 0.333333
8 20000231 1 TX 1
9 20000231 1 TX 0.833333
Here is what I have tried:
df=df.div(df['TRPMILES'].max(level=[2]),level=2).reset_index()
I have a million rows with 50 different values for HHSTATE.
Can you give any hints?
I think the following will work for you:
df["max_trpmiles"] = df.groupby("HHSTATE")["TRPMILES"].transform("max")
df["TRPMILES"] /= df["max_trpmiles"]
df = df.drop("max_trpmiles", axis=1)
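The same idea can be written without the temporary column, since `transform("max")` returns a Series aligned to the original rows. A minimal sketch on a reduced copy of the data (only the two columns involved are kept here):

```python
import pandas as pd

df = pd.DataFrame({
    "HHSTATE": ["IN", "IN", "IN", "TX", "TX", "TX", "TX"],
    "TRPMILES": [22.0, 0.222222, 22.0, 3.0, 2.0, 6.0, 5.0],
})

# transform("max") broadcasts each state's max back onto every row of that state,
# so the division is a plain element-wise operation.
df["TRPMILES"] = df["TRPMILES"] / df.groupby("HHSTATE")["TRPMILES"].transform("max")
print(df)
```

This stays vectorized, so it should scale to a million rows and 50 states without an explicit loop.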

NaN values in product id

I am concatenating my two dataframes, base_df and base_df1: base_df has the product ID, and base_df1 has sales, profit and discount.
base_df1
sales profit discount
0 0.050090 0.000000 0.262335
1 0.110793 0.000000 0.260662
2 0.309561 0.864121 0.241432
3 0.039217 0.591474 0.260687
4 0.070205 0.000000 0.263628
base_df['Product ID']
0 FUR-ADV-10000002
1 FUR-ADV-10000108
2 FUR-ADV-10000183
3 FUR-ADV-10000188
4 FUR-ADV-10000190
final_df=pd.concat([base_df1,base_df], axis=0, ignore_index=True,sort=False)
But my final_df.head() has NaN values in the product id column. What might be the issue?
sales Discount profit product id
0 0.050090 0.000000 0.262335 NaN
1 0.110793 0.000000 0.260662 NaN
2 0.309561 0.864121 0.241432 NaN
3 0.039217 0.591474 0.260687 NaN
4 0.070205 0.000000 0.263628 NaN
Try using axis=1:
final_df=pd.concat([base_df1,base_df], axis=1, sort=False)
Output:
sales profit discount ProductID
0 0.050090 0.000000 0.262335 FUR-ADV-10000002
1 0.110793 0.000000 0.260662 FUR-ADV-10000108
2 0.309561 0.864121 0.241432 FUR-ADV-10000183
3 0.039217 0.591474 0.260687 FUR-ADV-10000188
4 0.070205 0.000000 0.263628 FUR-ADV-10000190
With axis=0 you are stacking the dataframes vertically. Because pandas aligns data on labels (intrinsic data alignment), each frame gets NaN in the columns it does not have, so you generate the following dataframe:
final_df=pd.concat([base_df1,base_df], axis=0, sort=False)
sales profit discount ProductID
0 0.050090 0.000000 0.262335 NaN
1 0.110793 0.000000 0.260662 NaN
2 0.309561 0.864121 0.241432 NaN
3 0.039217 0.591474 0.260687 NaN
4 0.070205 0.000000 0.263628 NaN
0 NaN NaN NaN FUR-ADV-10000002
1 NaN NaN NaN FUR-ADV-10000108
2 NaN NaN NaN FUR-ADV-10000183
3 NaN NaN NaN FUR-ADV-10000188
4 NaN NaN NaN FUR-ADV-10000190
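Note that axis=1 aligns on the row index, so it only lines rows up when both frames share the same index. A minimal sketch of that caveat, assuming a hypothetical case where base_df kept a different index (for example after filtering); the index values 10 and 11 are invented for illustration:

```python
import pandas as pd

base_df1 = pd.DataFrame({"sales": [0.050090, 0.110793], "profit": [0.0, 0.0]})
# Hypothetical: same row order, but a different index than base_df1.
base_df = pd.DataFrame(
    {"Product ID": ["FUR-ADV-10000002", "FUR-ADV-10000108"]}, index=[10, 11]
)

# Index labels do not match, so pandas produces the union of labels with NaN.
misaligned = pd.concat([base_df1, base_df], axis=1)

# Resetting both indexes makes the concat purely positional.
aligned = pd.concat(
    [base_df1.reset_index(drop=True), base_df.reset_index(drop=True)], axis=1
)
print(misaligned)
print(aligned)
```

Resetting the index is only safe when the two frames are known to be in the same row order.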

Transposing multi index dataframe in pandas

HID gen views
1 1 20
1 2 2532
1 3 276
1 4 1684
1 5 779
1 6 200
1 7 545
2 1 20
2 2 7478
2 3 750
2 4 7742
2 5 2643
2 6 208
2 7 585
3 1 21
3 2 4012
3 3 2019
3 4 1073
3 5 3372
3 6 8
3 7 1823
3 8 22
This is a sample section of a dataframe, where HID and gen are the index levels.
How can it be transformed like this?
HID 1 2 3 4 5 6 7 8
1 20 2532 276 1684 779 200 545 nan
2 20 7478 750 7742 2643 208 585 nan
3 21 4012 2019 1073 3372 8 1823 22
It's called pivoting, i.e.:
df.reset_index().pivot(index='HID', columns='gen', values='views')
gen 1 2 3 4 5 6 7 8
HID
1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
Use unstack:
df = df['views'].unstack()
If you also need the HID column, add reset_index + rename_axis:
df = df['views'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
HID 1 2 3 4 5 6 7 8
0 1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
1 2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
2 3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
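The two answers can be checked against each other on a small frame. A minimal sketch, using a reduced hand-built sample of the data with HID and gen as index levels; `wide` and `same` are hypothetical names:

```python
import pandas as pd

df = pd.DataFrame({
    "HID": [1, 1, 2, 2, 3, 3, 3],
    "gen": [1, 2, 1, 2, 1, 2, 8],
    "views": [20, 2532, 20, 7478, 21, 4012, 22],
}).set_index(["HID", "gen"])

# unstack moves the innermost index level (gen) into the columns.
wide = df["views"].unstack()

# pivot needs plain columns, hence the reset_index first.
same = df.reset_index().pivot(index="HID", columns="gen", values="views")
print(wide)
```

Missing (HID, gen) combinations become NaN, which is why the integer views are shown as floats in the result.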

Pandas Modify Dataframe

I have a dataframe as below
0 1 2 3 4 5
0 0.428519 0.000000 0.0 0.541096 0.250099 0.345604
1 0.056650 0.000000 0.0 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.0 0.000000 0.000000 0.000000
3 0.849066 0.559117 0.0 0.374447 0.424247 0.586254
4 0.317644 0.000000 0.0 0.271171 0.586686 0.424560
I would like to modify it as below
0 0 0.428519
0 1 0.000000
0 2 0.0
0 3 0.541096
0 4 0.250099
0 5 0.345604
1 0 0.056650
1 1 0.000000
........
Use stack with reset_index:
df1 = df.stack().reset_index()
df1.columns = ['col1','col2','col3']
print (df1)
col1 col2 col3
0 0 0 0.428519
1 0 1 0.000000
2 0 2 0.000000
3 0 3 0.541096
4 0 4 0.250099
5 0 5 0.345604
6 1 0 0.056650
7 1 1 0.000000
8 1 2 0.000000
9 1 3 0.000000
10 1 4 0.000000
11 1 5 0.000000
12 2 0 0.000000
13 2 1 0.000000
14 2 2 0.000000
15 2 3 0.000000
16 2 4 0.000000
17 2 5 0.000000
18 3 0 0.849066
19 3 1 0.559117
20 3 2 0.000000
21 3 3 0.374447
22 3 4 0.424247
23 3 5 0.586254
24 4 0 0.317644
25 4 1 0.000000
26 4 2 0.000000
27 4 3 0.271171
28 4 4 0.586686
29 4 5 0.424560
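A NumPy solution uses numpy.repeat and numpy.tile to build the label columns, and numpy.ravel to flatten the values:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "col1": np.repeat(df.index, len(df.columns)),  # each row label repeated once per column
    "col2": np.tile(df.columns, len(df.index)),    # the column labels cycled once per row
    "col3": df.values.ravel()})                    # values flattened row by row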
print (df2)
col1 col2 col3
0 0 0 0.428519
1 0 1 0.000000
2 0 2 0.000000
3 0 3 0.541096
4 0 4 0.250099
5 0 5 0.345604
6 1 0 0.056650
7 1 1 0.000000
8 1 2 0.000000
9 1 3 0.000000
10 1 4 0.000000
11 1 5 0.000000
12 2 0 0.000000
13 2 1 0.000000
14 2 2 0.000000
15 2 3 0.000000
16 2 4 0.000000
17 2 5 0.000000
18 3 0 0.849066
19 3 1 0.559117
20 3 2 0.000000
21 3 3 0.374447
22 3 4 0.424247
23 3 5 0.586254
24 4 0 0.317644
25 4 1 0.000000
26 4 2 0.000000
27 4 3 0.271171
28 4 4 0.586686
29 4 5 0.424560
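The column names can also be set as part of the stack chain instead of assigning `columns` afterwards. A minimal sketch on a 2x2 slice of the data, assuming the same col1/col2/col3 names as above; `long` is a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame([[0.428519, 0.0], [0.056650, 0.0]])

long = (
    df.rename_axis(index="col1", columns="col2")  # name the row and column axes
      .stack()                                    # columns become an inner index level
      .reset_index(name="col3")                   # index levels back to columns, values named col3
)
print(long)
```

Note that `stack` can drop NaN entries depending on the pandas version and arguments, whereas the NumPy approach always keeps every cell.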
