I have a column with 14 unique values (as reported by value_counts()), and some of those values have something in common.
In my example, when I group by 'Loan.Purpose' and compute the mean of 'Interest.Rate' for each unique value, several values share the same rounded average rate. For example, 'car', 'educational' and 'major_purchase' all have a mean of 11.0, so I want to merge those three values under a single column named "LP_cem". I want to do the same for the other values that share a mean,
so that I can reduce the number of dummy variables from 14 to 4.
Basically, I want to merge the 14 unique values into 3 or 4 groups based on their mean() and then create dummies from those groups,
like this:
LP_cem LP_chos LP_dm LP_hmvw LP_renewable_energy
0 0 0 1 0 0
1 0 0 1 0 0
2 0 0 1 0 0
3 0 0 1 0 0
4 0 1 0 0 0
raw_data['Loan.Purpose'].value_counts()
debt_consolidation 1306
credit_card 443
other 200
home_improvement 151
major_purchase 101
small_business 86
car 50
wedding 39
medical 30
moving 28
vacation 21
house 20
educational 15
renewable_energy 4
Name: Loan.Purpose, dtype: int64
I have grouped the Loan.Purpose values by the rounded mean of Interest.Rate:
raw_data_8 = round(raw_data_5.groupby('Loan.Purpose')['Interest.Rate'].mean())
raw_data_8
Loan.Purpose
CHOS 15.0
DM 12.0
car 11.0
credit_card 13.0
debt_consolidation 14.0
educational 11.0
home_improvement 12.0
house 13.0
major_purchase 11.0
medical 12.0
moving 14.0
other 13.0
renewable_energy 10.0
small_business 13.0
vacation 12.0
wedding 12.0
Name: Interest.Rate, dtype: float64
Now I want to merge the values that share the same mean. I tried the code below, but it raises an error:
for i in range(len(raw_data_5.index)):
    if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
    if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
    if raw_data_5['Loan.Purpose'][i] in ['credit_care','house','other','small_business']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'chos'
    if raw_data_5['Loan.Purpose'][i] in ['debt_consolidation','moving']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'dcm'
Error:
TypeError Traceback (most recent call last)
<ipython-input-51-cf7ef2ae1efd> in <module>
----> 1 for i in range(raw_data_5.index):
2 if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
3 raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
4 if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
5 raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
TypeError: 'Int64Index' object cannot be interpreted as an integer
Interest.Rate Loan.Length Loan.Purpose
0 8.90 36.0 debt_consolidation
1 12.12 36.0 debt_consolidation
2 21.98 60.0 debt_consolidation
3 9.99 36.0 debt_consolidation
4 11.71 36.0 credit_card
5 15.31 36.0 other
6 7.90 36.0 debt_consolidation
7 17.14 60.0 credit_card
8 14.33 36.0 credit_card
10 19.72 36.0 moving
11 14.27 36.0 debt_consolidation
12 21.67 60.0 debt_consolidation
13 8.90 36.0 debt_consolidation
14 7.62 36.0 debt_consolidation
15 15.65 60.0 debt_consolidation
16 12.12 36.0 debt_consolidation
17 10.37 60.0 debt_consolidation
18 9.76 36.0 credit_card
19 9.99 60.0 debt_consolidation
20 21.98 36.0 debt_consolidation
21 19.05 60.0 credit_card
22 17.99 60.0 car
23 11.99 36.0 credit_card
24 16.82 60.0 vacation
25 7.90 36.0 debt_consolidation
26 14.42 36.0 debt_consolidation
27 15.31 36.0 debt_consolidation
28 8.59 36.0 other
29 7.90 36.0 debt_consolidation
30 21.00 60.0 debt_consolidation
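One possible way to do this without a row loop, sketched here under assumptions: build a lookup from each purpose to a group label, apply it with Series.map, and pass the result to pd.get_dummies. The small frame below is illustrative only, and the group labels ('cem', 'hmvw', 'chos', 'dcm') are taken from the loop in the question, with 'credit_care' assumed to be a typo for 'credit_card'. For reference, the TypeError arises because range() needs an integer rather than an index object, and .iloc accepts only integer positions, not the column label 'Loan.Purpose'.

```python
import pandas as pd

# Illustrative data only; the real raw_data_5 has many more rows.
raw_data_5 = pd.DataFrame({
    "Interest.Rate": [8.90, 11.71, 15.31, 19.72, 17.99, 16.82],
    "Loan.Purpose": ["debt_consolidation", "credit_card", "other",
                     "moving", "car", "vacation"],
})

# Group labels mirror the loop in the question (assumed grouping).
groups = {
    "cem": ["car", "educational", "major_purchase"],
    "hmvw": ["home_improvement", "medical", "vacation", "wedding"],
    "chos": ["credit_card", "house", "other", "small_business"],
    "dcm": ["debt_consolidation", "moving"],
}

# Invert to a purpose -> group label lookup.
purpose_to_group = {p: g for g, ps in groups.items() for p in ps}

# Purposes missing from the lookup (e.g. renewable_energy) keep their name.
merged = (raw_data_5["Loan.Purpose"]
          .map(purpose_to_group)
          .fillna(raw_data_5["Loan.Purpose"]))

# One dummy column per merged group, prefixed with "LP".
dummies = pd.get_dummies(merged, prefix="LP")
result = pd.concat([raw_data_5.drop(columns="Loan.Purpose"), dummies], axis=1)
print(result)
```

Because the mapping is applied to the whole column at once, gaps in the row index (such as the missing label 9 in the sample above) do not matter.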
I am looking for a way to extend the range of values in a pandas column by interpolation, but I don't know how to set the 'limits' of the interpolation. My data looks something like this:
[Distance] [Radiation]
12 120
13 130
14 140
15 150
16 160
17 170
So, what I'm trying to get is the full range of the [Radiation] column for the complete sequence of the [Distance] column, by interpolation:
[Distance] [Radiation]
1 10
2 20
. .
. .
12 120
13 130
14 140
15 150
16 160
. .
. .
20 200
I have looked through the documentation of the pandas and SciPy methods, but I have not found a suitable one yet.
Thanks for your insights.
One idea is to use DataFrame.reindex to add all the missing Distance values, and then DataFrame.interpolate with the barycentric method (which requires SciPy):
df = (df.set_index('Distance')
        .reindex(range(1, 21))
        .interpolate(method='barycentric', limit_direction='both')
        .reset_index())
print(df)
Distance Radiation
0 1 10.0
1 2 20.0
2 3 30.0
3 4 40.0
4 5 50.0
5 6 60.0
6 7 70.0
7 8 80.0
8 9 90.0
9 10 100.0
10 11 110.0
11 12 120.0
12 13 130.0
13 14 140.0
14 15 150.0
15 16 160.0
16 17 170.0
17 18 180.0
18 19 190.0
19 20 200.0
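If SciPy is not available, or the relationship is assumed to be a straight line, a rough alternative is to fit a line to the known points and use it to fill only the missing rows. This is a sketch under that linearity assumption, not the method from the answer above; the names `full` and `fitted` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Distance": [12, 13, 14, 15, 16, 17],
                   "Radiation": [120, 130, 140, 150, 160, 170]})

# Fit a straight line to the known points (assumes a linear relationship).
slope, intercept = np.polyfit(df["Distance"], df["Radiation"], deg=1)

# Build the complete Distance range and attach the observed values.
full = (pd.DataFrame({"Distance": np.arange(1, 21)})
          .merge(df, on="Distance", how="left"))

# Keep observed values; fill the gaps from the fitted line.
fitted = slope * full["Distance"] + intercept
full["Radiation"] = full["Radiation"].fillna(fitted)
print(full.round(1))
```

Unlike a polynomial through every point, a degree-1 fit will not oscillate outside the observed range, but it is only appropriate when the data really are close to linear.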
I have 4 DataFrames, each with value counts of the number of occurrences per month.
I want to compare all 4 sets of value counts in one graph, so I can see the difference between months across these four years.
I would like output like this image, with years and months:
newdf2018.Month.value_counts()
output
1 3451
2 3895
3 3408
4 3365
5 3833
6 3543
7 3333
8 3219
9 3447
10 2943
11 3296
12 2909
newdf2017.Month.value_counts()
1 2801
2 3048
3 3620
4 3014
5 3226
6 3962
7 3500
8 3707
9 3601
10 3349
11 3743
12 2002
newdf2016.Month.value_counts()
1 3201
2 2034
3 2405
4 3805
5 3308
6 3212
7 3049
8 3777
9 3275
10 3099
11 3775
12 2115
newdf2015.Month.value_counts()
1 2817
2 2604
3 2711
4 2817
5 2670
6 2507
7 3256
8 2195
9 3304
10 3238
11 2005
12 2008
Create a dictionary of the DataFrames, concat their value counts together, then plot:
dfs = {2015:newdf2015, 2016:newdf2016, 2017:newdf2017, 2018:newdf2018}
df = pd.concat({k:v['Month'].value_counts() for k, v in dfs.items()}, axis=1)
df.plot.bar()
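Since value_counts orders its result by frequency rather than by month, sorting the combined index keeps the bars in calendar order. A minimal sketch with tiny stand-in frames (the data here is hypothetical, not the counts from the question):

```python
import pandas as pd

# Tiny stand-ins for the yearly frames (hypothetical data).
newdf2017 = pd.DataFrame({"Month": [1, 1, 2, 3, 3, 3]})
newdf2018 = pd.DataFrame({"Month": [1, 2, 2, 2, 3]})

dfs = {2017: newdf2017, 2018: newdf2018}

# One column per year, one row per month; sort_index orders the months.
counts = pd.concat(
    {year: frame["Month"].value_counts() for year, frame in dfs.items()},
    axis=1,
).sort_index()
print(counts)

# With matplotlib installed, grouped bars per month can be drawn with:
# ax = counts.plot.bar()
# ax.set_xlabel("Month")
# ax.set_ylabel("Count")
```

Each dictionary key becomes a column name, so the legend of the bar chart shows the years.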
cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and turning the slice into a list and a tuple, with no luck. Any thoughts? I am trying to make a scatter plot from a lynda.com video, but in the video the host uses .ix, which is deprecated, so I am using .iloc[] instead.
car is a DataFrame.
Here are a few lines of the data:
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want to select multiple columns with iloc:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
I am not 100% sure about the rest of the code, but it seems you need:
# also select the column at position 9
cars_df = car.iloc[:, [1,3,4,6,9]]
# rename that column
cars_df = cars_df.rename(columns={'am':'group'})
# convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
# add the hue parameter to color points by the levels of a categorical variable
sb.pairplot(cars_df, hue='group')
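As a side note, .ix was removed in pandas 1.0; .iloc selects by integer position and .loc by label, and both need a comma between the row selector and the column selector (the original `car.iloc[:[1,3,4,6]]` is missing it). A small sketch, using three rows from the data above, suggests that selecting by column name gives the same result and may be easier to read:

```python
import pandas as pd

# Three rows copied from the sample data in the question.
car = pd.DataFrame({
    "Car_name": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive"],
    "mpg": [21.0, 22.8, 21.4],
    "cyl": [6, 4, 6],
    "disp": [160.0, 108.0, 258.0],
    "hp": [110, 93, 110],
    "drat": [3.9, 3.85, 3.08],
    "wt": [2.62, 2.32, 3.215],
})

# All rows (":"), then the columns at positions 1, 3, 4 and 6.
by_position = car.iloc[:, [1, 3, 4, 6]]

# The same columns selected by label.
by_label = car.loc[:, ["mpg", "disp", "hp", "wt"]]

print(by_position.equals(by_label))
```

Label-based selection keeps working if columns are later reordered, which positional selection does not.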
HID gen views
1 1 20
1 2 2532
1 3 276
1 4 1684
1 5 779
1 6 200
1 7 545
2 1 20
2 2 7478
2 3 750
2 4 7742
2 5 2643
2 6 208
2 7 585
3 1 21
3 2 4012
3 3 2019
3 4 1073
3 5 3372
3 6 8
3 7 1823
3 8 22
This is a sample section of a DataFrame, where HID and gen are index levels.
How can it be transformed like this?
HID 1 2 3 4 5 6 7 8
1 20 2532 276 1684 779 200 545 nan
2 20 7478 750 7742 2643 208 585 nan
3 21 4012 2019 1073 3372 8 1823 22
This is called pivoting, i.e.:
df.reset_index().pivot(index='HID', columns='gen', values='views')
gen 1 2 3 4 5 6 7 8
HID
1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
Use unstack:
df = df['views'].unstack()
If you also need HID as a column, add reset_index and rename_axis:
df = df['views'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
HID 1 2 3 4 5 6 7 8
0 1 20.0 2532.0 276.0 1684.0 779.0 200.0 545.0 NaN
1 2 20.0 7478.0 750.0 7742.0 2643.0 208.0 585.0 NaN
2 3 21.0 4012.0 2019.0 1073.0 3372.0 8.0 1823.0 22.0
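To see that the two answers lead to the same table, here is a small self-contained sketch on a subset of the sample rows. It uses keyword arguments for pivot, which recent pandas versions require, and also shows unstack's fill_value parameter as one way to avoid NaN (and the resulting float conversion) when a default of 0 is acceptable; that choice of default is an assumption.

```python
import pandas as pd

# A subset of the sample rows, with HID and gen as index levels.
df = pd.DataFrame({
    "HID": [1, 1, 2, 2, 3, 3, 3],
    "gen": [1, 2, 1, 2, 1, 2, 8],
    "views": [20, 2532, 20, 7478, 21, 4012, 22],
}).set_index(["HID", "gen"])

# Move the inner index level (gen) into the columns.
wide_unstack = df["views"].unstack()

# The same reshape via pivot on the flattened frame.
wide_pivot = df.reset_index().pivot(index="HID", columns="gen", values="views")

# Fill missing combinations with 0 instead of NaN (assumed default).
wide_filled = df["views"].unstack(fill_value=0)

print(wide_unstack)
print(wide_filled)
```

unstack is the shorter option when the keys are already in the index, while pivot is convenient when they are ordinary columns.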