Subsample irregular data into nearest time intervals / subset with buffers - subset

I have data which was collected under different sampling regimes, from every three seconds up to every hour.
I want to subsample this data at nearest-to-regular intervals of 10 minutes (also at other time intervals, but I assume it will be easy to change the interval once I have the code right).
I have seen similar posts here (e.g. How to subsample a data frame based on a datetime column in R) but no answers which allow for buffers or 'nearest to' ideas.
The problems are:
The data is not sampled regularly, so I can't ask R to give me data at exact ten-minute intervals; I need 'nearest possible to every ten minutes'.
I also want to buffer this so that it doesn't give me intervals of less than 9 minutes.
Currently my date_time column is being read as a 'factor':
class(sample$date_time)
[1] "factor"
Sample data:
device_no date_time latitude longitude temperature
1 23/04/2012 15:57:22 -33.2415715 19.4810864 27.7
1 23/04/2012 15:58:08 -33.2415396 19.4810666 27.7
1 23/04/2012 15:58:58 -33.2415963 19.48109 27.2
1 23/04/2012 15:59:46 -33.2415137 19.4810624 27.2
1 23/04/2012 16:00:33 -33.2415019 19.4810885 27
1 23/04/2012 16:01:21 -33.241561 19.4810867 26.3
1 23/04/2012 16:02:10 -33.2415579 19.4810926 26.4
1 23/04/2012 16:02:57 -33.2414687 19.4810465 25.6
1 23/04/2012 16:03:45 -33.2415096 19.4810736 24.6
1 23/04/2012 16:05:20 -33.2415707 19.4810614 24.8
1 23/04/2012 16:06:22 -33.2415188 19.4810708 24.6
1 23/04/2012 16:07:12 -33.2415754 19.4810538 24.6
1 23/04/2012 16:08:00 -33.2415054 19.4810874 24.2
1 23/04/2012 16:08:48 -33.2414794 19.4810908 24.3
1 23/04/2012 16:09:36 -33.2415538 19.4810802 24
1 23/04/2012 16:10:25 -33.2413946 19.4811353 23.4
1 23/04/2012 16:11:14 -33.2414529 19.4811084 23.7
1 23/04/2012 16:12:01 -33.2413949 19.4810978 23.5
1 24/04/2012 05:26:39 -33.2415605 19.4810589 23.1
1 24/04/2012 05:56:51 -33.2414826 19.4811049 23
1 24/04/2012 06:01:22 -33.2415975 19.4810535 23
1 24/04/2012 06:02:09 -33.2384224 19.482825 21.7
1 24/04/2012 06:02:58 -33.2380158 19.4833283 20.6
1 24/04/2012 06:03:45 -33.241538 19.4810951 19.9
1 24/04/2012 06:04:34 -33.2416337 19.4810291 19.3
1 24/04/2012 06:05:22 -33.2410841 19.4819002 19.9
1 24/04/2012 06:06:11 -33.2401989 19.4817907 19.9
1 24/04/2012 06:06:57 -33.241593 19.4810426 19.4
1 24/04/2012 06:07:46 -33.241575 19.4810011 18.6
1 24/04/2012 06:08:34 -33.2415497 19.4810493 18.8
1 24/04/2012 06:09:22 -33.2415104 19.4810216 18.4
1 24/04/2012 06:10:11 -33.2416627 19.4810065 18
1 24/04/2012 06:10:59 -33.2414893 19.4811548 18.2
1 24/04/2012 06:11:44 -33.2420604 19.4810295 18.7
1 24/04/2012 06:12:33 -33.2408584 19.4803685 35.5
1 24/04/2012 06:13:20 -33.2407331 19.4805824 38.3
1 24/04/2012 06:25:58 -33.2411718 19.4810405 39.9
1 24/04/2012 06:26:49 -33.2415396 19.4810794 41.4
1 24/04/2012 06:27:56 -33.2415588 19.481089 40
1 24/04/2012 06:28:54 -33.2415257 19.4810381 41
1 24/04/2012 06:29:42 -33.239857 19.4807259 37.6
1 24/04/2012 06:30:29 -33.2409401 19.480927 36.4
Sample outcome:
device_no date_time latitude longitude temperature
1 23/04/2012 15:57:22 -33.2415715 19.4810864 27.7
1 23/04/2012 16:07:12 -33.2415754 19.4810538 24.6
1 24/04/2012 05:26:39 -33.2415605 19.4810589 23.1
1 24/04/2012 05:56:51 -33.2414826 19.4811049 23
1 24/04/2012 06:06:57 -33.241593 19.4810426 19.4
1 24/04/2012 06:25:58 -33.2411718 19.4810405 39.9
Sorry, my screen shot won't post and this data doesn't appear as a table...

# If your date_time column is a factor, convert it first, e.g.:
# sample$date_time <- as.POSIXct(as.character(sample$date_time),
#                                format = "%d/%m/%Y %H:%M:%S")

# Create some example data
# start at Jan 1, 2014
start <- as.POSIXct("1-Jan-2014 00:00", format = "%d-%b-%Y %H:%M")
# end at Jan 31, 2014
end <- as.POSIXct("31-Jan-2014 23:00", format = "%d-%b-%Y %H:%M")
# build an irregular time series: gaps averaging 5 minutes with an sd of 1 minute
set.seed(1)
gaps <- abs(rnorm(10000, mean = 5 * 60, sd = 60))
time <- start + cumsum(gaps)
time <- time[time <= end]
# you want a buffer so kept observations are never closer than 9 minutes
time.diff <- 9
new.time <- time[1]
# regular 10-minute grid (this is the data we are going to try and match)
time.10min <- seq.POSIXt(start, end, by = 10 * 60)
# skip the first grid point
for (i in 2:length(time.10min)) {
  # index of the observation nearest to the current grid point
  idx <- which.min(abs(time.10min[i] - time))
  obs <- time[idx]
  # see if it meets your buffer requirements; if the candidate is within
  # 9 minutes of the last kept observation, take the next observation instead
  if (difftime(obs, new.time[length(new.time)], units = "mins") < time.diff) {
    obs <- time[min(idx + 1, length(time))]
  }
  new.time[length(new.time) + 1] <- obs
}
head(new.time)
# Good luck and hopefully this helps.

Related

Merge 'column attributes' of a single column into separate columns, to lower the number of dummy variables of that single column

If a column has, for example, 14 different unique values, and some of them possess something in common —
in our example, when we group by 'Loan.Purpose' and compute the mean of the 'Interest.Rate' column for each value, certain values share a common average rate, e.g. ('car', 'educational', 'major_purchase') all have mean = 11.0 — I want to merge those values under the name "LP_cem" because they have the same mean, and likewise for the other values,
so that I can reduce the number of dummy variables from 14 to 4.
Basically, I want to merge the 14 different values into 3-4 groups based on their mean() and then create dummies out of those groups,
like this, given below:
LP_cem LP_chos LP_dm LP_hmvw LP_renewable_energy
0 0 0 1 0 0
1 0 0 1 0 0
2 0 0 1 0 0
3 0 0 1 0 0
4 0 1 0 0 0
raw_data['Loan.Purpose'].value_counts()
debt_consolidation 1306
credit_card 443
other 200
home_improvement 151
major_purchase 101
small_business 86
car 50
wedding 39
medical 30
moving 28
vacation 21
house 20
educational 15
renewable_energy 4
Name: Loan.Purpose, dtype: int64
I have clubbed the values of Loan.Purpose based on the mean of Interest.Rate:
raw_data_8 = round(raw_data_5.groupby('Loan.Purpose')['Interest.Rate'].mean())
raw_data_8
Loan.Purpose
CHOS 15.0
DM 12.0
car 11.0
credit_card 13.0
debt_consolidation 14.0
educational 11.0
home_improvement 12.0
house 13.0
major_purchase 11.0
medical 12.0
moving 14.0
other 13.0
renewable_energy 10.0
small_business 13.0
vacation 12.0
wedding 12.0
Name: Interest.Rate, dtype: float64
Now I want to club the values with the same means together. I tried the code below, but it gives an error:
for i in range(len(raw_data_5.index)):
    if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
    if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
    if raw_data_5['Loan.Purpose'][i] in ['credit_care','house','other','small_business']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'chos'
    if raw_data_5['Loan.Purpose'][i] in ['debt_consolidation','moving']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'dcm'
The error:
TypeError                                 Traceback (most recent call last)
<ipython-input-51-cf7ef2ae1efd> in <module>
----> 1 for i in range(raw_data_5.index):
      2     if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
      3         raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
      4     if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
      5         raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
TypeError: 'Int64Index' object cannot be interpreted as an integer
Interest.Rate Loan.Length Loan.Purpose
0 8.90 36.0 debt_consolidation
1 12.12 36.0 debt_consolidation
2 21.98 60.0 debt_consolidation
3 9.99 36.0 debt_consolidation
4 11.71 36.0 credit_card
5 15.31 36.0 other
6 7.90 36.0 debt_consolidation
7 17.14 60.0 credit_card
8 14.33 36.0 credit_card
10 19.72 36.0 moving
11 14.27 36.0 debt_consolidation
12 21.67 60.0 debt_consolidation
13 8.90 36.0 debt_consolidation
14 7.62 36.0 debt_consolidation
15 15.65 60.0 debt_consolidation
16 12.12 36.0 debt_consolidation
17 10.37 60.0 debt_consolidation
18 9.76 36.0 credit_card
19 9.99 60.0 debt_consolidation
20 21.98 36.0 debt_consolidation
21 19.05 60.0 credit_card
22 17.99 60.0 car
23 11.99 36.0 credit_card
24 16.82 60.0 vacation
25 7.90 36.0 debt_consolidation
26 14.42 36.0 debt_consolidation
27 15.31 36.0 debt_consolidation
28 8.59 36.0 other
29 7.90 36.0 debt_consolidation
30 21.00 60.0 debt_consolidation
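One possible fix (a sketch, assuming raw_data_5 as above): the traceback comes from range(raw_data_5.index) — range needs an integer, e.g. range(len(raw_data_5.index)) — and .iloc[i, 'Loan.Purpose'] mixes a positional row indexer with a column label, which would need .loc. Note also the 'credit_care' typo for 'credit_card'. The loop can be avoided entirely with Series.replace and a dict:
import pandas as pd

# grouping derived from the shared means in the question
mapping = {
    'car': 'cem', 'educational': 'cem', 'major_purchase': 'cem',
    'home_improvement': 'hmvw', 'medical': 'hmvw', 'vacation': 'hmvw', 'wedding': 'hmvw',
    'credit_card': 'chos', 'house': 'chos', 'other': 'chos', 'small_business': 'chos',
    'debt_consolidation': 'dcm', 'moving': 'dcm',
}
# values not in the mapping (e.g. renewable_energy) pass through unchanged
raw_data_5['Loan.Purpose'] = raw_data_5['Loan.Purpose'].replace(mapping)
# dummies for the reduced set of groups
dummies = pd.get_dummies(raw_data_5['Loan.Purpose'], prefix='LP')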

Interpolate above and below a range of values in a column - Pandas

I am looking for a way to extend the range of values in a Pandas column by interpolation, but I still don't know how to set the 'limits' of the interpolation. I mean, it's something like:
[Distance] [Radiation]
12 120
13 130
14 140
15 150
16 160
17 170
So, what I'm trying to get is the full range of column [Radiation] according to the complete sequence of column [Distance], by interpolation.
[Distance] [Radiation]
1 10
2 20
. .
. .
12 120
13 130
14 140
15 150
16 160
. .
. .
20 200
I have looked through the documentation of pandas and scipy methods but haven't found it yet.
Thanks for your insights.
One idea is to use DataFrame.reindex to add all the missing values of Distance and then use DataFrame.interpolate with the barycentric method:
df = (df.set_index('Distance')
        .reindex(range(1, 21))
        .interpolate(method='barycentric', limit_direction='both')
        .reset_index())
print (df)
Distance Radiation
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
10 11 110.0
11 12 120.0
12 13 130.0
13 14 140.0
14 15 150.0
15 16 160.0
16 17 170.0
17 18 180.0
18 19 190.0
19 20 200.0
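An alternative sketch (assuming the original six-row df): scipy's interp1d with fill_value='extrapolate' extrapolates linearly below and above the observed Distance range, which matches the pattern the question expects:
import pandas as pd
from scipy.interpolate import interp1d

# linear interpolator that also extrapolates outside the observed [12, 17] range
f = interp1d(df['Distance'], df['Radiation'], fill_value='extrapolate')

full = pd.DataFrame({'Distance': range(1, 21)})
full['Radiation'] = f(full['Distance'])
print(full)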

Pandas Computing On Multidimensional Data

I have two data frames storing tracking data of offensive and defensive players during an NFL game. My goal is to calculate the maximum distance between an offensive player and the nearest defender during the course of the play.
As a simple example, I've made up some data where there are only three offensive players and two defensive players. Here is the data:
Defense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 117 20.2 20.0
1 2 1 117 21.0 19.1
2 3 1 117 21.3 18.3
3 4 1 117 22.0 17.5
4 5 1 117 22.5 17.2
5 6 1 117 23.0 16.9
6 7 1 117 23.6 16.7
7 8 2 117 25.1 34.1
8 9 2 117 25.9 34.2
9 10 2 117 24.1 34.5
10 11 2 117 22.7 34.2
11 12 2 117 21.5 34.5
12 13 2 117 21.1 37.3
13 14 3 117 21.2 44.3
14 15 3 117 20.4 44.6
15 16 3 117 21.9 42.7
16 17 3 117 21.1 41.9
17 18 3 117 20.1 41.7
18 19 3 117 20.1 41.3
19 1 1 555 40.1 17.0
20 2 1 555 40.7 18.3
21 3 1 555 41.0 19.6
22 4 1 555 41.5 18.4
23 5 1 555 42.6 18.4
24 6 1 555 43.8 18.0
25 7 1 555 44.2 15.8
26 8 2 555 41.2 37.1
27 9 2 555 42.3 36.5
28 10 2 555 45.6 36.3
29 11 2 555 47.9 35.6
30 12 2 555 47.4 31.3
31 13 2 555 46.8 31.5
32 14 3 555 47.3 40.3
33 15 3 555 47.2 40.6
34 16 3 555 44.5 40.8
35 17 3 555 46.5 41.0
36 18 3 555 47.6 41.4
37 19 3 555 47.6 41.5
Offense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 751 30.2 15.0
1 2 1 751 31.0 15.1
2 3 1 751 31.3 15.3
3 4 1 751 32.0 15.5
4 5 1 751 31.5 15.7
5 6 1 751 33.0 15.9
6 7 1 751 32.6 15.7
7 8 2 751 51.1 30.1
8 9 2 751 51.9 30.2
9 10 2 751 51.1 30.5
10 11 2 751 49.7 30.6
11 12 2 751 49.5 30.9
12 13 2 751 49.1 31.3
13 14 3 751 12.2 40.3
14 15 3 751 12.4 40.5
15 16 3 751 12.9 40.7
16 17 3 751 13.1 40.9
17 18 3 751 13.1 41.1
18 19 3 751 13.1 41.3
19 1 1 419 41.3 15.0
20 2 1 419 41.7 15.3
21 3 1 419 41.8 15.4
22 4 1 419 42.9 15.6
23 5 1 419 42.6 15.6
24 6 1 419 44.8 16.0
25 7 1 419 45.2 15.8
26 8 2 419 62.2 30.1
27 9 2 419 63.3 30.5
28 10 2 419 62.6 31.0
29 11 2 419 63.9 30.6
30 12 2 419 67.4 31.3
31 13 2 419 66.8 31.5
32 14 3 419 30.3 40.3
33 15 3 419 30.2 40.6
34 16 3 419 30.5 40.8
35 17 3 419 30.5 41.0
36 18 3 419 31.6 41.4
37 19 3 419 31.6 41.5
38 1 1 989 10.1 15.0
39 2 1 989 10.2 15.5
40 3 1 989 10.4 15.4
41 4 1 989 10.5 15.8
42 5 1 989 10.6 15.9
43 6 1 989 10.1 15.5
44 7 1 989 10.9 15.3
45 8 2 989 25.8 30.1
46 9 2 989 25.2 30.1
47 10 2 989 21.8 30.2
48 11 2 989 25.8 30.2
49 12 2 989 25.6 30.5
50 13 2 989 25.5 31.0
51 14 3 989 50.3 40.3
52 15 3 989 50.3 40.2
53 16 3 989 50.2 40.4
54 17 3 989 50.1 40.8
55 18 3 989 50.6 41.2
56 19 3 989 51.4 41.6
The data is essentially multidimensional with GameTime, PlayId, and PlayerId as independent variables and x-coord and y-coord as dependent variables. How can I go about calculating the maximum distance from the nearest defender during the course of a play?
My guess is I would have to create columns containing the distance from each defender for each offensive player, but I don't know how to name those and account for an unknown number of defensive/offensive players (the full data set contains thousands of players).
Here is a possible solution; I think there is a way of making it more efficient:
Assuming you have a dataframe called offense_df and a dataframe called defense_df:
import pandas as pd
from scipy.spatial import distance

merged_dataframe = pd.merge(offense_df, defense_df, on=['GameTime','PlayId'], suffixes=('_off','_def'))
In the merged dataframe you'll get the answer to your question; basically it will look like this:
GameTime PlayId PlayerId_off x-coord_off y-coord_off PlayerId_def x-coord_def y-coord_def
0 1 1 751 30.2 15.0 117 20.2 20.0
1 1 1 751 30.2 15.0 555 40.1 17.0
2 1 1 419 41.3 15.0 117 20.2 20.0
3 1 1 419 41.3 15.0 555 40.1 17.0
4 1 1 989 10.1 15.0 117 20.2 20.0
The next two lines create a single coordinate column for the offensive player (coord_off) and for the defender (coord_def), each containing an (x, y) tuple; this simplifies the distance computation.
merged_dataframe['coord_off'] = merged_dataframe.apply(lambda x: (x['x-coord_off'], x['y-coord_off']),axis=1)
merged_dataframe['coord_def'] = merged_dataframe.apply(lambda x: (x['x-coord_def'], x['y-coord_def']),axis=1)
We compute the distance to every defender at a given GameTime and PlayId:
merged_dataframe['distance_to_def'] = merged_dataframe.apply(lambda x: distance.euclidean(x['coord_off'],x['coord_def']),axis=1)
For each GameTime, PlayId, PlayerId we take the distance to the nearest defender:
smallest_dist = merged_dataframe.groupby(['GameTime','PlayId','PlayerId_off'])['distance_to_def'].min()
Finally we take the maximum distance (of these minimum distances) for each PlayerId.
smallest_dist.groupby('PlayerId_off').max()
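As noted above, this can probably be made more efficient. A sketch of a vectorized version of the distance step (same merged_dataframe; no row-wise apply or tuple columns needed):
import numpy as np

# Euclidean distance straight from the coordinate columns
merged_dataframe['distance_to_def'] = np.hypot(
    merged_dataframe['x-coord_off'] - merged_dataframe['x-coord_def'],
    merged_dataframe['y-coord_off'] - merged_dataframe['y-coord_def'],
)

# nearest defender per offensive player and time step, then the max over the play
smallest_dist = merged_dataframe.groupby(['GameTime', 'PlayId', 'PlayerId_off'])['distance_to_def'].min()
result = smallest_dist.groupby('PlayerId_off').max()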

why am I getting a too many indexers error?

cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and making the slice into a list and a tuple. No dice. Any thoughts? I am trying to make a scatter plot from a lynda.com video, but in the video the host uses .ix, which is deprecated, so I am using .iloc[].
car = a dataframe
A few lines of data:
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want to select multiple columns by iloc:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
Not 100% sure about the rest of the code, but it seems you need:
# select the 9th column ('am') as well
cars_df = car.iloc[:, [1,3,4,6,9]]
# rename the 9th column
cars_df = cars_df.rename(columns={'am':'group'})
# convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
# add parameter hue for different levels of a categorical variable
sb.pairplot(cars_df, hue='group')
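For reference, a sketch of the same selection by column label, which avoids counting positions (the fix above is essentially the comma in .iloc[:, ...] that separates the row indexer from the column indexer, which .iloc[:[1,3,4,6]] was missing):
# label-based equivalent of car.iloc[:, [1,3,4,6]]
cars_df = car.loc[:, ['mpg', 'disp', 'hp', 'wt']]
sb.pairplot(cars_df)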

Lookup Pandas Dataframe comparing different size data frames

I have two pandas df that look like this
df1
Amount Price
0 5 50
1 10 53
2 15 55
3 30 50
4 45 61
df2
Used amount
0 4.5
1 1.2
2 6.2
3 4.1
4 25.6
5 31
6 19
7 15
I am trying to insert a new column in df2 that provides the price from df1. df1 and df2 have different sizes; df1 is smaller.
I am expecting something like this
df3
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31 61
6 19 50
7 15 55
I am thinking of solving this with something like this function:
def price_function(key, table):
    used_amount_df2 = (row[0] for row in df1)
    price = filter(lambda x: x < key, used_amount_df1)
Here is my own solution
1st approach:
from itertools import product
import pandas as pd
df2 = df2.reset_index()
DF = pd.DataFrame(list(product(df2['Used amount'], df1.Amount)), columns=['l1', 'l2'])
DF['DIFF'] = DF.l1 - DF.l2
DF = DF.loc[DF.DIFF <= 0]
DF = DF.sort_values(['l1', 'DIFF'], ascending=[True, False]).drop_duplicates(['l1'], keep='first')
df1.merge(DF, left_on='Amount', right_on='l2', how='left')\
   .merge(df2, left_on='l1', right_on='Used amount', how='right')\
   .loc[:, ['index', 'Used amount', 'Price']]\
   .set_index('index').sort_index()
Out[185]:
       Used amount  Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
2nd approach, using pd.merge_asof (I recommend this):
df2 = df2.rename(columns={'Used amount': 'Amount'}).sort_values('Amount')
df2 = df2.reset_index()
# note: merge_asof needs matching key dtypes, so cast df1.Amount to float if needed
pd.merge_asof(df2, df1, on='Amount', allow_exact_matches=True, direction='forward')\
  .set_index('index').sort_index()
Out[206]:
Amount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
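For convenience, a self-contained sketch of the merge_asof approach with the question's data (Amount kept as float so both key dtypes match):
import pandas as pd

df1 = pd.DataFrame({'Amount': [5.0, 10.0, 15.0, 30.0, 45.0],
                    'Price': [50, 53, 55, 50, 61]})
df2 = pd.DataFrame({'Used amount': [4.5, 1.2, 6.2, 4.1, 25.6, 31, 19, 15]})

# sort by the key, keeping the original row order in the 'index' column
tmp = df2.rename(columns={'Used amount': 'Amount'}).sort_values('Amount').reset_index()
# match each used amount to the next Amount tier at or above it
out = pd.merge_asof(tmp, df1, on='Amount', direction='forward').set_index('index').sort_index()
print(out)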
Using pd.IntervalIndex you can map each used amount to its interval:
In [468]: df1.index = pd.IntervalIndex.from_arrays(df1.Amount.shift().fillna(0),df1.Amount)
In [469]: df1
Out[469]:
Amount Price
(0.0, 5.0] 5 50
(5.0, 10.0] 10 53
(10.0, 15.0] 15 55
(15.0, 30.0] 30 50
(30.0, 45.0] 45 61
In [470]: df2['price'] = df2['Used amount'].map(df1.Price)
In [471]: df2
Out[471]:
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
You can use cut or searchsorted to create bins.
Note: the index of df1 has to be the default 0,1,2,...
# create default index if necessary
df1 = df1.reset_index(drop=True)
# create bins
bins = [0] + df1['Amount'].tolist()
# get index values of df1 by values of Used amount
a = pd.cut(df2['Used amount'], bins=bins, labels=df1.index)
# assign output
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Or with searchsorted (this assumes df1['Amount'] is sorted ascending):
a = df1['Amount'].searchsorted(df2['Used amount'])
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
You can use pd.DataFrame.reindex with method='bfill':
df1.set_index('Amount').reindex(df2['Used amount'], method='bfill')
Price
Used amount
4.5 50
1.2 50
6.2 53
4.1 50
25.6 50
31.0 61
19.0 50
15.0 55
To add that as a new column we can use join:
df2.join(
    df1.set_index('Amount').reindex(df2['Used amount'], method='bfill'),
    on='Used amount'
)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Or assign:
df2.assign(
    Price=df1.set_index('Amount').reindex(df2['Used amount'], method='bfill').values
)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
