if a column has for example 14 different [Unique Values]value_counts(), and they possess something in common,
in our example [when we groupby 'Loan.Purpose' with 'Interest.Rate' column, and compute mean of each [Unique Values]value_counts() based on Loan.Purpose mean() values], we get a certain common average rates for certain value_counts, for e.g :-('car','educational','major_purchase') attributes has the mean = 11.0, now i want to merge the above mentioned ('car','educational','major_purchase') [Unique Values]value_counts(), under column_name "LP_cem" because they have same mean, likewise i want to do the same with other value_counts(),
So that i can reduce the amount of dummy variables from 14 to 4.
basically, i want to merge the 14 different value_counts() under 3/4 columns based on their mean() and then create dummies out of those 3/4 columns
like this given below
LP_cem LP_chos LP_dm LP_hmvw LP_renewable_energy
0 0 0 1 0 0
1 0 0 1 0 0
2 0 0 1 0 0
3 0 0 1 0 0
4 0 1 0 0 0
raw_data['Loan.Purpose'].value_counts()
debt_consolidation 1306
credit_card 443
other 200
home_improvement 151
major_purchase 101
small_business 86
car 50
wedding 39
medical 30
moving 28
vacation 21
house 20
educational 15
renewable_energy 4
Name: Loan.Purpose, dtype: int64
i have clubbed the data from Loan.Purpose based on mean of the Interest.Rate
raw_data_8 = round(raw_data_5.groupby('Loan.Purpose')['Interest.Rate'].mean())
raw_data_8
Loan.Purpose
CHOS 15.0
DM 12.0
car 11.0
credit_card 13.0
debt_consolidation 14.0
educational 11.0
home_improvement 12.0
house 13.0
major_purchase 11.0
medical 12.0
moving 14.0
other 13.0
renewable_energy 10.0
small_business 13.0
vacation 12.0
wedding 12.0
Name: Interest.Rate, dtype: float64
now i want to club the values with same mean's together, i even tried the code but it is giving an error
for i in range(len(raw_data_5.index)):
if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
if raw_data_5['Loan.Purpose'][i] in ['credit_care','house','other','small_business']:
raw_data_5.iloc[i, 'Loan.Purpose'] = 'chos'
if raw_data_5['Loan.Purpose'][i] in ['debt_consolidation','moving']:
raw_data_5.iloc[i, 'Loan.Purpose'] = 'dcm'
error = TypeError Traceback (most recent
call last)
<ipython-input-51-cf7ef2ae1efd> in <module>
----> 1 for i in range(raw_data_5.index):
2 if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
3 raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
4 if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
5 raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
TypeError: 'Int64Index' object cannot be interpreted as an integer
Interest.Rate Loan.Length Loan.Purpose
0 8.90 36.0 debt_consolidation
1 12.12 36.0 debt_consolidation
2 21.98 60.0 debt_consolidation
3 9.99 36.0 debt_consolidation
4 11.71 36.0 credit_card
5 15.31 36.0 other
6 7.90 36.0 debt_consolidation
7 17.14 60.0 credit_card
8 14.33 36.0 credit_card
10 19.72 36.0 moving
11 14.27 36.0 debt_consolidation
12 21.67 60.0 debt_consolidation
13 8.90 36.0 debt_consolidation
14 7.62 36.0 debt_consolidation
15 15.65 60.0 debt_consolidation
16 12.12 36.0 debt_consolidation
17 10.37 60.0 debt_consolidation
18 9.76 36.0 credit_card
19 9.99 60.0 debt_consolidation
20 21.98 36.0 debt_consolidation
21 19.05 60.0 credit_card
22 17.99 60.0 car
23 11.99 36.0 credit_card
24 16.82 60.0 vacation
25 7.90 36.0 debt_consolidation
26 14.42 36.0 debt_consolidation
27 15.31 36.0 debt_consolidation
28 8.59 36.0 other
29 7.90 36.0 debt_consolidation
30 21.00 60.0 debt_consolidation
I was looking for the way to extend the range values inside a Pandas column by interpolation, but I still don't know how to set the 'limits' of the interpolation, I mean, it's something like:
[Distance] [Radiation]
12 120
13 130
14 140
15 150
16 160
17 170
So, what I'm trying to get is the full range of column [Radiation] according to the complete secuence of column [Distance] by interpolation.
[Distance] [Radiation]
1 10
2 20
. .
. .
12 120
13 130
14 140
15 150
16 160
. .
. .
20 200
I was looking in the documentation of pandas and scipy methods but I think I couldn't find it yet.
Thanks for your insights.
One idea is use DataFrame.reindex for add all not existing values of distance and then use DataFrame.interpolate with barycentric method:
df = (df.set_index('Distance')
.reindex(range(1, 21))
.interpolate(method='barycentric', limit_direction='both')
.reset_index())
print (df)
Distance Radiation
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
10 11 110.0
11 12 120.0
12 13 130.0
13 14 140.0
14 15 150.0
15 16 160.0
16 17 170.0
17 18 180.0
18 19 190.0
19 20 200.0
I have two data frames storing tracking data of offensive and defensive players during an nfl game. My goal is to calculate the maximum distance between an offensive player and the nearest defender during the course of the play.
As a simple example, I've made up some data where there are only three offensive players and two defensive players. Here is the data:
Defense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 117 20.2 20.0
1 2 1 117 21.0 19.1
2 3 1 117 21.3 18.3
3 4 1 117 22.0 17.5
4 5 1 117 22.5 17.2
5 6 1 117 23.0 16.9
6 7 1 117 23.6 16.7
7 8 2 117 25.1 34.1
8 9 2 117 25.9 34.2
9 10 2 117 24.1 34.5
10 11 2 117 22.7 34.2
11 12 2 117 21.5 34.5
12 13 2 117 21.1 37.3
13 14 3 117 21.2 44.3
14 15 3 117 20.4 44.6
15 16 3 117 21.9 42.7
16 17 3 117 21.1 41.9
17 18 3 117 20.1 41.7
18 19 3 117 20.1 41.3
19 1 1 555 40.1 17.0
20 2 1 555 40.7 18.3
21 3 1 555 41.0 19.6
22 4 1 555 41.5 18.4
23 5 1 555 42.6 18.4
24 6 1 555 43.8 18.0
25 7 1 555 44.2 15.8
26 8 2 555 41.2 37.1
27 9 2 555 42.3 36.5
28 10 2 555 45.6 36.3
29 11 2 555 47.9 35.6
30 12 2 555 47.4 31.3
31 13 2 555 46.8 31.5
32 14 3 555 47.3 40.3
33 15 3 555 47.2 40.6
34 16 3 555 44.5 40.8
35 17 3 555 46.5 41.0
36 18 3 555 47.6 41.4
37 19 3 555 47.6 41.5
Offense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 751 30.2 15.0
1 2 1 751 31.0 15.1
2 3 1 751 31.3 15.3
3 4 1 751 32.0 15.5
4 5 1 751 31.5 15.7
5 6 1 751 33.0 15.9
6 7 1 751 32.6 15.7
7 8 2 751 51.1 30.1
8 9 2 751 51.9 30.2
9 10 2 751 51.1 30.5
10 11 2 751 49.7 30.6
11 12 2 751 49.5 30.9
12 13 2 751 49.1 31.3
13 14 3 751 12.2 40.3
14 15 3 751 12.4 40.5
15 16 3 751 12.9 40.7
16 17 3 751 13.1 40.9
17 18 3 751 13.1 41.1
18 19 3 751 13.1 41.3
19 1 1 419 41.3 15.0
20 2 1 419 41.7 15.3
21 3 1 419 41.8 15.4
22 4 1 419 42.9 15.6
23 5 1 419 42.6 15.6
24 6 1 419 44.8 16.0
25 7 1 419 45.2 15.8
26 8 2 419 62.2 30.1
27 9 2 419 63.3 30.5
28 10 2 419 62.6 31.0
29 11 2 419 63.9 30.6
30 12 2 419 67.4 31.3
31 13 2 419 66.8 31.5
32 14 3 419 30.3 40.3
33 15 3 419 30.2 40.6
34 16 3 419 30.5 40.8
35 17 3 419 30.5 41.0
36 18 3 419 31.6 41.4
37 19 3 419 31.6 41.5
38 1 1 989 10.1 15.0
39 2 1 989 10.2 15.5
40 3 1 989 10.4 15.4
41 4 1 989 10.5 15.8
42 5 1 989 10.6 15.9
43 6 1 989 10.1 15.5
44 7 1 989 10.9 15.3
45 8 2 989 25.8 30.1
46 9 2 989 25.2 30.1
47 10 2 989 21.8 30.2
48 11 2 989 25.8 30.2
49 12 2 989 25.6 30.5
50 13 2 989 25.5 31.0
51 14 3 989 50.3 40.3
52 15 3 989 50.3 40.2
53 16 3 989 50.2 40.4
54 17 3 989 50.1 40.8
55 18 3 989 50.6 41.2
56 19 3 989 51.4 41.6
The data is essentially multidimensional with GameTime, PlayId, and PlayerId as independent variables and x-coord and y-coord as dependent variables. How can I go about calculating the maximum distance from the nearest defender during the course of a play?
My guess is I would have to create columns containing the distance from each defender for each offensive player, but I don't know how to name those and be able to account for an unknown amount of defensive/offensive players (the full data set contains thousands of players).
Here is a possible solution , I think there is a way to making it more efficient :
Assuming you have a dataframe called offense_df and a dataframe called defense_df:
In the merged dataframe you'll get the answer to your question, basically it will create the following dataframe:
from scipy.spatial import distance
merged_dataframe = pd.merge(offense_df,defense_df,on=['GameTime','PlayId'],suffixes=('_off','_def'))
GameTime PlayId PlayerId_off x-coord_off y-coord_off PlayerId_def x-coord_def y-coord_def
0 1 1 751 30.2 15.0 117 20.2 20.0
1 1 1 751 30.2 15.0 555 40.1 17.0
2 1 1 419 41.3 15.0 117 20.2 20.0
3 1 1 419 41.3 15.0 555 40.1 17.0
4 1 1 989 10.1 15.0 117 20.2 20.0
The next two lines are here to create a unique column for the coordinates , basically it will create for the offender (coord_off) and the defender a column (coord_def) that contains a tuple (x,y) this will simplify the computation of the distance.
merged_dataframe['coord_off'] = merged_dataframe.apply(lambda x: (x['x-coord_off'], x['y-coord_off']),axis=1)
merged_dataframe['coord_def'] = merged_dataframe.apply(lambda x: (x['x-coord_def'], x['y-coord_def']),axis=1)
We compute the distance to all the defender at a given GameTime,PlayId.
merged_dataframe['distance_to_def'] = merged_dataframe.apply(lambda x: distance.euclidean(x['coord_off'],x['coord_def']),axis=1)
For each PlayerId,GameTime,PlayId we take the distance to the nearest defender.
smallest_dist = merged_dataframe.groupby(['GameTime','PlayId','PlayerId_off'])['distance_to_def'].min()
Finally we take the maximum distance (of these minimum distances) for each PlayerId.
smallest_dist.groupby('PlayerId_off').max()
cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and making a slice into a list and a tuple. no dice. Any thoughts? I am trying to make a scatter plot from a lynda.com video but in the video, the host is using .ix which is deprecated. So I am using .iloc[]
car = a dataframe
a few lines of data
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want select multiple columns by iloc:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
Not 100% sure with another code, it seems need:
#select also 9. column
cars_df = car.iloc[:, [1,3,4,6,9]]
#rename 9. column
cars_df = cars_df.rename(columns={'am':'group'})
#convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
#add parameetr hue for different levels of a categorical variable
sb.pairplot(cars_df, hue='group')
I have two pandas df that look like this
df1
Amount Price
0 5 50
1 10 53
2 15 55
3 30 50
4 45 61
df2
Used amount
0 4.5
1 1.2
2 6.2
3 4.1
4 25.6
5 31
6 19
7 15
I am trying to insert a new column on df2 that will give provide the price from the df1, df1 and df2 have different size, df1 is smaller
I am expecting something like this
df3
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31 61
6 19 50
7 15 55
I am thinking to solve this, with something like this function
def price_function(key, table):
used_amount_df2 = (row[0] for row in df1)
price = filter(lambda x: x < key, used_amount_df1)
Here is my own solution
1st approach:
from itertools import product
import pandas as pd
df2=df2.reset_index()
DF=pd.DataFrame(list(product(df2.Usedamount, df1.Amount)), columns=['l1', 'l2'])
DF['DIFF']=(DF.l1-DF.l2)
DF=DF.loc[DF.DIFF<=0,]
DF=DF.sort_values(['l1','DIFF'],ascending=[True,False]).drop_duplicates(['l1'],keep='first')
df1.merge(DF,left_on='Amount',right_on='l2',how='left').merge(df2,left_on='l1',right_on='Usedamount',how='right').loc[:,['index','Usedamount','Price']].set_index('index').sort_index()
Out[185]:
Usedamount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
2nd using pd.merge_asof I recommend this
df2=df2.rename({'Used amount':Amount}).sort_values('Amount')
df2=df2.reset_index()
pd.merge_asof(df2,df1,on='Amount',allow_exact_matches=True,direction='forward')\
.set_index('index').sort_index()
Out[206]:
Amount Price
index
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Using pd.IntervalIndex you can
In [468]: df1.index = pd.IntervalIndex.from_arrays(df1.Amount.shift().fillna(0),df1.Amount)
In [469]: df1
Out[469]:
Amount Price
(0.0, 5.0] 5 50
(5.0, 10.0] 10 53
(10.0, 15.0] 15 55
(15.0, 30.0] 30 50
(30.0, 45.0] 45 61
In [470]: df2['price'] = df2['Used amount'].map(df1.Price)
In [471]: df2
Out[471]:
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
You can use cut or searchsorted for create bins.
Notice: Index in df1 has to be default - 0,1,2....
#create default index if necessary
df1 = df1.reset_index(drop=True)
#create bins
bins = [0] + df1['Amount'].tolist()
#get index values of df1 by values of Used amount
a = pd.cut(df2['Used amount'], bins=bins, labels=df1.index)
#assign output
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
a = df1['Amount'].searchsorted(df2['Used amount'])
df2['price'] = df1['Price'].values[a]
print (df2)
Used amount price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
You can use pd.DataFrame.reindex with method=bfill
df1.set_index('Amount').reindex(df2['Used amount'], method='bfill')
Price
Used amount
4.5 50
1.2 50
6.2 53
4.1 50
25.6 50
31.0 61
19.0 50
15.0 55
To add that to a new column we can use
join
df2.join(
df1.set_index('Amount').reindex(df2['Used amount'], method='bfill'),
on='Used amount'
)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55
Or assign
df2.assign(
Price=df1.set_index('Amount').reindex(df2['Used amount'], method='bfill').values)
Used amount Price
0 4.5 50
1 1.2 50
2 6.2 53
3 4.1 50
4 25.6 50
5 31.0 61
6 19.0 50
7 15.0 55