Pandas Pivot table without aggregating - python-3.x

I have a dataframe df as:
Acct_Id  Acct_Nm  Srvc_Id  Phone_Nm  Phone_plan_value  Srvc_Num
     51  Roger        789  Pixel                   30         1
     51  Roger        800  iPhone                  25         2
     51  Roger        945  Galaxy                  40         3
     78  Anjay        100  Nokia                   50         1
     78  Anjay        120  Oppo                    30         2
     32  Rafa         456  HTC                     35         1
I want to transform the dataframe so I can have 1 row per Acct_Id and Acct_Nm as:
Acct_Id  Acct_Nm         Srvc_Num_1                         Srvc_Num_2                         Srvc_Num_3
                  Srvc_Id Phone_Nm Phone_plan_value  Srvc_Id Phone_Nm Phone_plan_value  Srvc_Id Phone_Nm Phone_plan_value
51       Roger    789     Pixel    30                800     iPhone   25                945     Galaxy   40
78       Anjay    100     Nokia    50                120     Oppo     30
32       Rafa     456     HTC      35
I am not sure how to achieve the same in pandas.

This is more of a pivot problem, but it also needs swaplevel and sort_index:
(df.set_index(['Acct_Id', 'Acct_Nm', 'Srvc_Num'])
   .unstack()
   .swaplevel(1, 0, axis=1)
   .sort_index(level=0, axis=1)
   .add_prefix('Srvc_Num_'))
Out[289]:
Srvc_Num Srvc_Num_1 \
Srvc_Num_Phone_Nm Srvc_Num_Phone_plan_value Srvc_Num_Srvc_Id
Acct_Id Acct_Nm
32 Rafa HTC 35.0 456.0
51 Roger Pixel 30.0 789.0
78 Anjay Nokia 50.0 100.0
Srvc_Num Srvc_Num_2 \
Srvc_Num_Phone_Nm Srvc_Num_Phone_plan_value Srvc_Num_Srvc_Id
Acct_Id Acct_Nm
32 Rafa None NaN NaN
51 Roger iPhone 25.0 800.0
78 Anjay Oppo 30.0 120.0
Srvc_Num Srvc_Num_3
Srvc_Num_Phone_Nm Srvc_Num_Phone_plan_value Srvc_Num_Srvc_Id
Acct_Id Acct_Nm
32 Rafa None NaN NaN
51 Roger Galaxy 40.0 945.0
78 Anjay None NaN NaN
And here is the pivot_table approach:
pd.pivot_table(df, index=['Acct_Id', 'Acct_Nm'], columns=['Srvc_Num'],
               values=['Phone_Nm', 'Phone_plan_value', 'Srvc_Id'], aggfunc='first')

How about:
df.set_index(['Acct_Id', 'Acct_Nm', 'Srvc_Num']).unstack().swaplevel(0, 1, axis = 1).sort_index(axis = 1)
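Both of these still leave a two-level column index. If you want the flat, single-header layout shown in the question, a minimal sketch (joining the two column levels by hand, on top of the unstacked frame above) could look like this:
out = (df.set_index(['Acct_Id', 'Acct_Nm', 'Srvc_Num'])
         .unstack()
         .swaplevel(0, 1, axis=1)
         .sort_index(axis=1))
# collapse the (Srvc_Num, field) tuples into flat names like 'Srvc_Num_1_Phone_Nm'
out.columns = [f'Srvc_Num_{num}_{field}' for num, field in out.columns]
out = out.reset_index()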

Fill NaN values from its Previous Value pandas

I have the below data from an Excel sheet and I want every NaN to be filled from just its previous value, even if there are one or more consecutive NaNs. I tried the ffill() method, but it didn't seem to solve the purpose because it appeared to take the very first value of the column before the NaNs and populate that to all NaNs.
Could someone help please?
My DataFrame:
import pandas as pd
df = pd.read_excel("Example-sheat.xlsx",sheet_name='Sheet1')
#df = df.fillna(method='ffill')
#df = df['AuthenticZTed domaTT controller'].ffill()
print(df)
My DataFrame output:
AuthenticZTed domaTT controller source KTvice naHR
0 ZTPGRKMIK1DC200.example.com TTv1614
1 TT1NDZ45DC202.example.com TTv1459
2 TT1NDZ45DC202.example.com TTv1495
3 NaN TTv1670
4 TT1NDZ45DC202.example.com TTv1048
5 TN1CQI02DC200.example.com TTv1001
6 DU2RDCRDC1DC204.example.com TTva082
7 NaN xxgb-gen
8 ZTPGRKMIK1DC200.example.com TTva038
9 DU2RDCRDC1DC204.example.com TTv0071
10 NaN ttv0032
11 KT1MUC02DUDC201.example.com TTv0073
12 NaN TTv0679
13 TN1SZZ67DC200.example.com TTv1180
14 TT1NDZ45DC202.example.com TTv1181
15 TT1BLR01APDC200.example.com TTv0859
16 TN1SZZ67DC200.example.com xxg2089
17 NaN TTv1846
18 ZTPGRKMIK1DC200.example.com TTvtp064
19 PR1CPQ01DC200.example.com TTv0950
20 PR1CPQ01DC200.example.com TTc7005
21 NaN TTv0678
22 KT1MUC02DUDC201.example.com TTv257032798
23 PR1CPQ01DC200.example.com xxg2016
24 NaN TTv0313
25 TT1BLR01APDC200.example.com TTc4901
26 NaN TTv0710
27 DU2RDCRDC1DC204.example.com xxg3008
28 NaN TTv1080
29 PR1CPQ01DC200.example.com xxg2022
30 NaN xxg2057
31 NaN TTv1522
32 TN1SZZ67DC200.example.com TTv258998881
33 PR1CPQ01DC200.example.com TTv259064418
34 ZTPGRKMIK1DC200.example.com TTv259129955
35 TT1BLR01APDC200.example.com xxg2034
36 NaN TTv259326564
37 TNHSZPBCD2DC200.example.com TTv259129952
38 KT1MUC02DUDC201.example.com TTv259195489
39 ZTPGRKMIK1DC200.example.com TTv0683
40 ZTPGRKMIK1DC200.example.com TTv0885
41 TT1BLR01APDC200.example.com dbexh
42 NaN TTvtp065
43 TN1PEK01APDC200.example.com TTvtp057
44 ZTPGRKMIK1DC200.example.com TTvtp007
45 NaN TTvtp063
46 TT1BLR01APDC200.example.com TTvtp032
47 KTphbgsa11dc201.example.com TTvtp046
48 NaN TTvtp062
49 PR1CPQ01DC200.example.com TTv0235
50 NaN TTv0485
51 TT1NDZ45DC202.example.com TTv0236
52 NaN TTv0486
53 PR1CPQ01DC200.example.com TTv0237
54 NaN TTv0487
55 TT1BLR01APDC200.example.com TTv0516
56 TN1CQI02DC200.example.com TTv1285
57 TN1PEK01APDC200.example.com TTv0440
58 NaN liv9007
59 HR1GDL28DC200.example.com TTv0445
60 NaN tuv006
61 FTGFTPTP34DC203.example.com TTv0477
62 NaN tuv002
63 TN1CQI02DC200.example.com TTv0534
64 TN1SZZ67DC200.example.com TTv0639
65 NaN TTv0825
66 NaN TTv1856
67 TT1BLR01APDC200.example.com TTva101
68 TN1SZZ67DC200.example.com TTv1306
69 KTphbgsa11dc201.example.com TTv1072
70 NaN webx02
71 KT1MUC02DUDC201.example.com TTv1310
72 PR1CPQ01DC200.example.com TTv1151
73 TN1CQI02DC200.example.com TTv1165
74 NaN tuv90
75 TN1SZZ67DC200.example.com TTv1065
76 KTphbgsa11dc201.example.com TTv1737
77 NaN ramn01
78 HR1GDL28DC200.example.com ramn02
79 NaN ptb001
80 HR1GDL28DC200.example.com ptn002
81 NaN ptn003
82 TN1SZZ67DC200.example.com TTs0057
83 PR1CPQ01DC200.example.com TTs0058
84 NaN TTs0058-duplicZTe
85 PR1CPQ01DC200.example.com xxg2080
86 KTphbgsa11dc204.example.com xxg2081
87 TN1PEK01APDC200.example.com xxg2082
88 NaN xxg3002
89 TN1SZZ67DC200.example.com xxg2084
90 NaN xxg3005
91 ZTPGRKMIK1DC200.example.com xxg2086
92 NaN xxg3007
93 KT1MUC02DUDC201.example.com xxg2098
94 NaN xxg3014
95 TN1PEK01APDC200.example.com xxg2026
96 NaN xxg2094
97 TN1PEK01APDC200.example.com livtp005
98 KT1MUC02DUDC201.example.com xxg2059
99 ZTPGRKMIK1DC200.example.com acc9102
100 NaN xxg2111
101 TN1CQI02DC200.example.com xxgtp009
Desired Output:
AuthenticZTed domaTT controller source KTvice naHR
0 ZTPGRKMIK1DC200.example.com TTv1614
1 TT1NDZ45DC202.example.com TTv1459
2 TT1NDZ45DC202.example.com TTv1495
3 TT1NDZ45DC202.example.com TTv1670 <---
4 TT1NDZ45DC202.example.com TTv1048
5 TN1CQI02DC200.example.com TTv1001
6 DU2RDCRDC1DC204.example.com TTva082
7 DU2RDCRDC1DC204.example.com xxgb-gen <---
1- You are already close to your solution; shift() combined with ffill() works:
df = df.fillna(df.shift()).ffill()  # fill each NaN from the row above, then ffill any remaining gaps
2- As Quang suggested in the comments, this also works:
df['AuthenticZTed domaTT controller'] = df['AuthenticZTed domaTT controller'].ffill()
3- Or you can try the following:
df = df.fillna({var: df['AuthenticZTed domaTT controller'].shift() for var in df}).ffill()
4- Alternatively, you can define a cols variable if you have multiple columns and then loop through it:
cols = ['AuthenticZTed domaTT controller', 'source KTvice naHR']
for col in cols:
    df[col] = df[col].ffill()
print(df)
OR
df.loc[:, cols] = df.loc[:, cols].ffill()
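For reference, ffill() fills each NaN with the nearest preceding non-NaN value in the same column, not with the column's first value, so even consecutive NaNs are filled from just above the gap. A minimal self-contained sketch (with made-up hostnames) that verifies this behavior:
import numpy as np
import pandas as pd

# toy series with a single NaN and a run of consecutive NaNs (hypothetical values)
s = pd.Series(['host-a', np.nan, 'host-b', np.nan, np.nan, 'host-c'])
print(s.ffill().tolist())
# ['host-a', 'host-a', 'host-b', 'host-b', 'host-b', 'host-c']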

Rolling window percentile rank over a multi-index Pandas DataFrame

I am creating a percentile rank over a rolling window of time and would like help refining my approach.
My DataFrame has a multi-index with the first level set to datetime and the second set to an identifier. Ultimately, I’d like the rolling window to evaluate the trailing n periods, including the current period, and produce the corresponding percentile ranks.
I referenced the posts shown below but found they were working with the data a bit differently than how I intend to. In those posts, the final functions group results by identifier and then by datetime, whereas I'm looking to use rolling panels of data in my function (dates and identifiers).
using rolling functions on multi-index dataframe in pandas
Panda rolling window percentile rank
This is an example of what I am after.
Create a sample DataFrame:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay

num_days = 5
max_value = 300  # not defined in the original snippet; any upper bound above the sample values works
np.random.seed(8675309)
stock_data = {
    "AAPL": np.random.randint(1, max_value, size=num_days),
    "MSFT": np.random.randint(1, max_value, size=num_days),
    "WMT": np.random.randint(1, max_value, size=num_days),
    "TSLA": np.random.randint(1, max_value, size=num_days)
}
dates = pd.date_range(
    start="2013-01-03",
    periods=num_days,
    freq=BDay()
)
sample_df = pd.DataFrame(stock_data, index=dates)
sample_df = sample_df.stack().to_frame(name='data')
sample_df.index.names = ['date', 'ticker']
Which outputs:
date ticker
2013-01-03 AAPL 2
MSFT 93
TSLA 39
WMT 21
2013-01-04 AAPL 141
MSFT 43
TSLA 205
WMT 20
2013-01-07 AAPL 256
MSFT 93
TSLA 103
WMT 25
2013-01-08 AAPL 233
MSFT 60
TSLA 13
WMT 104
2013-01-09 AAPL 19
MSFT 120
TSLA 282
WMT 293
The code below breaks sample_df into 2-day increments and produces a rank within each group, rather than a ranking over a rolling window of time. So it's close, but not what I'm after.
sample_df.reset_index(level=1, drop=True)[['data']] \
    .apply(
        lambda x: x.groupby(pd.Grouper(level=0, freq='2d')).rank()
    )
I then tried what's shown below without much luck either.
from scipy.stats import rankdata

def rank(x):
    return rankdata(x, method='ordinal')[-1]

sample_df.reset_index(level=1, drop=True) \
    .rolling(window="2d", min_periods=1) \
    .apply(lambda x: rank(x))
I finally arrived at the output I'm looking for but the formula seems a bit contrived, so I'm hoping to identify a more elegant approach if one exists.
import numpy as np
import pandas as pd
from pandas.tseries.offsets import BDay
from scipy import stats  # needed for stats.percentileofscore below

window_length = 1
target_column = "data"

def rank(df, target_column, ids, window_length):
    percentile_ranking = []
    list_of_ids = []
    date_index = df.index.get_level_values(0).unique()
    for date in date_index:
        rolling_start_date = date - BDay(window_length)
        first_date = date_index[0] + BDay(window_length)
        trailing_values = df.loc[rolling_start_date:date, target_column]
        # Only calc the rolling percentile after the rolling window has elapsed
        if date < first_date:
            pass
        else:
            percentile_ranking.append(
                df.loc[date, target_column].apply(
                    lambda x: stats.percentileofscore(trailing_values, x, kind="rank")
                )
            )
            list_of_ids.append(df.loc[date, ids])
    ranks, output_ids = pd.concat(percentile_ranking), pd.concat(list_of_ids)
    df = pd.DataFrame(
        ranks.values, index=[ranks.index, output_ids], columns=["percentile_rank"]
    )
    return df
ranks = rank(
    sample_df.reset_index(level=1),
    window_length=1,
    ids='ticker',
    target_column="data"
)
sample_df.join(ranks)
I get the feeling that my rank function is more than what's needed here. I appreciate any ideas/feedback to help in simplifying this code to arrive at the output below. Thank you!
data percentile_rank
date ticker
2013-01-03 AAPL 2 NaN
MSFT 93 NaN
TSLA 39 NaN
WMT 21 NaN
2013-01-04 AAPL 141 87.5
MSFT 43 62.5
TSLA 205 100.0
WMT 20 25.0
2013-01-07 AAPL 256 100.0
MSFT 93 50.0
TSLA 103 62.5
WMT 25 25.0
2013-01-08 AAPL 233 87.5
MSFT 60 37.5
TSLA 13 12.5
WMT 104 75.0
2013-01-09 AAPL 19 25.0
MSFT 120 62.5
TSLA 282 87.5
WMT 293 100.0
Edited: the original answer took 2-day groups without the rolling effect, simply grouping the first two days that appeared. If you want a true rolling window over every 2 days:
First, pivot the DataFrame to keep the dates as the index and the tickers as columns:
pivoted = sample_df.reset_index().pivot('date','ticker','data')
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 2 93 39 21
2013-01-04 141 43 205 20
2013-01-07 256 93 103 25
2013-01-08 233 60 13 104
2013-01-09 19 120 282 293
Now we can apply a rolling function that considers all stocks within the same window:
import numpy as np
from scipy.stats import rankdata

def pctile(s):
    wdw = sample_df.loc[s.index, :].values.flatten()  # all stock values in the period
    ranked = rankdata(wdw) / len(wdw) * 100           # their percentiles
    return ranked[np.where(wdw == s.iloc[-1])][0]     # this value's percentile

pivoted_pctile = pivoted.rolling('2D').apply(pctile, raw=False)
Output
ticker AAPL MSFT TSLA WMT
date
2013-01-03 25.0 100.0 75.0 50.0
2013-01-04 87.5 62.5 100.0 25.0
2013-01-07 100.0 50.0 75.0 25.0
2013-01-08 87.5 37.5 12.5 75.0
2013-01-09 25.0 62.5 87.5 100.0
To get the original format back, we just melt the results:
pd.melt(pivoted_pctile.reset_index(), 'date') \
    .sort_values(['date', 'ticker']).reset_index()
Output
value
date ticker
2013-01-03 AAPL 25.0
MSFT 100.0
TSLA 75.0
WMT 50.0
2013-01-04 AAPL 87.5
MSFT 62.5
TSLA 100.0
WMT 25.0
2013-01-07 AAPL 100.0
MSFT 50.0
TSLA 75.0
WMT 25.0
2013-01-08 AAPL 87.5
MSFT 37.5
TSLA 12.5
WMT 75.0
2013-01-09 AAPL 25.0
MSFT 62.5
TSLA 87.5
WMT 100.0
If you prefer it all in one execution:
pd.melt(
    sample_df
        .reset_index()
        .pivot('date', 'ticker', 'data')
        .rolling('2D').apply(pctile, raw=False)
        .reset_index(),
    'date'
).sort_values(['date', 'ticker']).set_index(['date', 'ticker'])
Note that day 7 differs from what you displayed. This is truly rolling: since there is no day 6 (a weekend gap), day 7's window contains only that day's 4 values, and windows don't look forward.
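If you instead want day 7 to rank against the previous trading day, as in the expected output above, one option (sketched here as an assumption about the intent) is a count-based window of 2 rows rather than the '2D' time offset, so calendar gaps don't matter:
# always the current row plus the previous trading row, regardless of weekends/holidays
pivoted_pctile_2rows = pivoted.rolling(2, min_periods=1).apply(pctile, raw=False)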
Original
Is this something you might be looking for? I combined the groupby on the date (2 days) with transform so that the number of observations matches the series provided. As you can see, I kept the first observation of each window group.
df = sample_df.reset_index()
df['percentile_rank'] = df.groupby(pd.Grouper(key='date', freq='2D'))['data'] \
    .transform(lambda x: x.rank(ascending=True) / len(x) * 100)
Output
Out[19]:
date ticker data percentile_rank
0 2013-01-03 AAPL 2 12.5
1 2013-01-03 MSFT 93 75.0
2 2013-01-03 WMT 39 50.0
3 2013-01-03 TSLA 21 37.5
4 2013-01-04 AAPL 141 87.5
5 2013-01-04 MSFT 43 62.5
6 2013-01-04 WMT 205 100.0
7 2013-01-04 TSLA 20 25.0
8 2013-01-07 AAPL 256 100.0
9 2013-01-07 MSFT 93 50.0
10 2013-01-07 WMT 103 62.5
11 2013-01-07 TSLA 25 25.0
12 2013-01-08 AAPL 233 87.5
13 2013-01-08 MSFT 60 37.5
14 2013-01-08 WMT 13 12.5
15 2013-01-08 TSLA 104 75.0
16 2013-01-09 AAPL 19 25.0
17 2013-01-09 MSFT 120 50.0
18 2013-01-09 WMT 282 75.0
19 2013-01-09 TSLA 293 100.0

Data Cleaning Python: Replacing the values of a column not within a range with NaN and then dropping the rows which contain NaN

I am doing some research and need to delete the rows containing values which are not in a specific range, using Python.
My dataset in Excel:
I want to replace the values of column A not within the range 1-20 with NaN, replace the values of column B not within the range 21-40, and so on.
Then I want to drop/delete the rows that contain NaN values.
Expected output should be like:
You can try this to solve your problem. Here, I simulated your data and solved it with the code below:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in range(1,10,1) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in range(10,20,1) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in range(20,30,1) else x)
print(data)
data = data.dropna()
print(data)
Original data:
A B C
0 1 10 20
1 2 11 22
2 4 15 25
3 8 20 30
4 12 25 35
5 18 40 55
6 20 45 60
Output with NaN:
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.0 30.0
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Final Output:
A B C
4 12.0 25.0 35.0
5 18.0 40.0 55.0
6 20.0 45.0 60.0
Try this for non-integer numbers:
import numpy as np
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
data['A'] = data['A'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(1.00,10.00,0.01)) else x)
data['B'] = data['B'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(10.00,20.00,0.01)) else x)
data['C'] = data['C'].apply(lambda x: np.nan if x in (round(y,2) for y in np.arange(20.00,30.00,0.01)) else x)
print(data)
data = data.dropna()
print(data)
Output:
A B C
0 1.25 10.56 20.11
1 2.39 11.19 22.92
2 4.00 15.65 25.27
3 8.89 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN 20.31 30.15
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
A B C
4 12.15 25.91 35.64
5 18.29 40.15 55.98
6 20.46 45.00 60.48
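Testing float membership against an enumerated np.arange grid is fragile (values that don't land exactly on the grid slip through). A more robust sketch for the question's stated goal (NaN out values not within a per-column range) uses a boolean mask via Series.between; the bounds for A and B are from the question, the bound for C is hypothetical:
import pandas as pd

df = pd.read_csv('c.csv')
# per-column allowed ranges; values outside a column's range become NaN
bounds = {'A': (1, 20), 'B': (21, 40), 'C': (41, 60)}
for col, (lo, hi) in bounds.items():
    df[col] = df[col].where(df[col].between(lo, hi))
df = df.dropna()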
Try this:
df = df.drop(df.index[df.idxmax()])
O/P:
A B C D
0 1 21 41 61
1 2 22 42 62
2 3 23 43 63
3 4 24 44 64
4 5 25 45 65
5 6 26 46 66
6 7 27 47 67
7 8 28 48 68
8 9 29 49 69
13 14 34 54 74
14 15 35 55 75
15 16 36 56 76
16 17 37 57 77
17 18 38 58 78
18 19 39 59 79
19 20 40 60 80
Use idxmax and drop the returned index.

why am I getting a too many indexers error?

cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and making the slice into a list and a tuple. No dice. Any thoughts? I am trying to make a scatter plot from a lynda.com video, but the host uses .ix, which is deprecated, so I am using .iloc[].
car = a dataframe
A few lines of data:
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want to select multiple columns by iloc:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
Not 100% sure about the other code; it seems you need:
# also select the 9th column (am)
cars_df = car.iloc[:, [1,3,4,6,9]]
# rename the 9th column
cars_df = cars_df.rename(columns={'am':'group'})
# convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
# add the parameter hue for different levels of the categorical variable
sb.pairplot(cars_df, hue='group')

Lookup Pandas Dataframe comparing different size data frames

I have two pandas DataFrames that look like this:
df1
Amount Price
0 5 50
1 10 53
2 15 55
3 30 50
4 45 61
df2
Used amount
0 4.5
1 1.2
2 6.2
3 4.1
4 25.6
5 31
6 19
7 15
I am trying to insert a new column into df2 that will provide the price from df1. df1 and df2 have different sizes; df1 is smaller.
I am expecting something like this
df3
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31 61
6 19 50
7 15 55
I am thinking of solving this with something like this function:
def price_function(key, table):
    used_amount_df2 = (row[0] for row in df1)
    price = filter(lambda x: x < key, used_amount_df1)
Here is my own solution
1st approach:
from itertools import product
import pandas as pd

df2 = df2.reset_index()
DF = pd.DataFrame(list(product(df2.Usedamount, df1.Amount)), columns=['l1', 'l2'])
DF['DIFF'] = DF.l1 - DF.l2
DF = DF.loc[DF.DIFF <= 0]
DF = DF.sort_values(['l1', 'DIFF'], ascending=[True, False]).drop_duplicates(['l1'], keep='first')
df1.merge(DF, left_on='Amount', right_on='l2', how='left') \
   .merge(df2, left_on='l1', right_on='Usedamount', how='right') \
   .loc[:, ['index', 'Usedamount', 'Price']].set_index('index').sort_index()
Out[185]:
Usedamount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
2nd, using pd.merge_asof (I recommend this):
df2 = df2.rename(columns={'Used amount': 'Amount'}).sort_values('Amount')
df2 = df2.reset_index()
pd.merge_asof(df2, df1, on='Amount', allow_exact_matches=True, direction='forward') \
    .set_index('index').sort_index()
Out[206]:
Amount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Using pd.IntervalIndex you can do:
In [468]: df1.index = pd.IntervalIndex.from_arrays(df1.Amount.shift().fillna(0),df1.Amount)
In [469]: df1
Out[469]:
Amount Price
(0.0, 5.0] 5 50
(5.0, 10.0] 10 53
(10.0, 15.0] 15 55
(15.0, 30.0] 30 50
(30.0, 45.0] 45 61
In [470]: df2['price'] = df2['Used amount'].map(df1.Price)
In [471]: df2
Out[471]:
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
You can use cut or searchsorted to create bins.
Note: the index of df1 has to be the default one - 0, 1, 2, ...
#create default index if necessary
df1 = df1.reset_index(drop=True)
#create bins
bins = [0] + df1['Amount'].tolist()
#get index values of df1 by values of Used amount
a = pd.cut(df2['Used amount'], bins=bins, labels=df1.index)
#assign output
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
a = df1['Amount'].searchsorted(df2['Used amount'])
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
You can use pd.DataFrame.reindex with method='bfill':
df1.set_index('Amount').reindex(df2['Used amount'], method='bfill')
Price
Used amount
4.5 50
1.2 50
6.2 53
4.1 50
25.6 50
31.0 61
19.0 50
15.0 55
To add that to a new column we can use join:
df2.join(
    df1.set_index('Amount').reindex(df2['Used amount'], method='bfill'),
    on='Used amount'
)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Or assign
df2.assign(
    Price=df1.set_index('Amount').reindex(df2['Used amount'], method='bfill').values)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
