Cumulative sum of rows in Python Pandas - python-3.x

I'm working with a dataframe in which I have one value per year and state:
0    State  1965  1966  1967  1968
1  Alabama  20.2    40  60.3    80
2   Alaska    10    15    18    20
3  Arizona     5     5    10    12
I need each value to be the sum of the current value and all the preceding ones:
0    State  1965  1966   1967   1968
1  Alabama  20.2  60.2  120.5  200.5
2   Alaska    10    25     43     63
3  Arizona     5    10     20     32
I tried df['sum'] = df.sum(axis=1) and .cumsum, but I don't know how to apply them to my problem, since I don't want a new column with the total sum.

Use DataFrame.cumsum with axis=1, converting the non-numeric column State to the index first:
df = df.set_index('State').cumsum(axis=1)
print (df)
         1965  1966   1967   1968
State
Alabama  20.2  60.2  120.5  200.5
Alaska   10.0  25.0   43.0   63.0
Arizona   5.0  10.0   20.0   32.0
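If you want State back as a regular column afterwards, chaining reset_index restores it (a small sketch of the same approach):
# cumulative sum across the year columns, then move State out of the index again
df = df.set_index('State').cumsum(axis=1).reset_index()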
Or select all columns except the first and assign back:
df.iloc[:, 1:] = df.iloc[:, 1:].cumsum(axis=1)
print (df)
     State  1965  1966   1967   1968
0
1  Alabama  20.2  60.2  120.5  200.5
2   Alaska  10.0  25.0   43.0   63.0
3  Arizona   5.0  10.0   20.0   32.0
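A variation on the same idea (a sketch, assuming the year columns are the only numeric ones) selects the columns to accumulate by dtype instead of by position:
# pick the numeric (year) columns, leaving State untouched
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].cumsum(axis=1)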

Related

Merge 'column attributes' of a single column into separate columns, to lower the number of dummy variables of that single column

If a column has, for example, 14 different unique values (per value_counts()) and some of them have something in common, I want to merge them. In my example, when I group by 'Loan.Purpose' and compute the mean of the 'Interest.Rate' column for each group, certain values end up with the same average rate: for instance, 'car', 'educational' and 'major_purchase' all have a mean of 11.0. I want to merge those values under the column name "LP_cem" because they share the same mean, and do likewise with the other values,
so that I can reduce the number of dummy variables from 14 to 4.
Basically, I want to merge the 14 different values into 3-4 groups based on their mean() and then create dummies out of those groups, like this:
   LP_cem  LP_chos  LP_dm  LP_hmvw  LP_renewable_energy
0       0        0      1        0                    0
1       0        0      1        0                    0
2       0        0      1        0                    0
3       0        0      1        0                    0
4       0        1      0        0                    0
raw_data['Loan.Purpose'].value_counts()
debt_consolidation 1306
credit_card 443
other 200
home_improvement 151
major_purchase 101
small_business 86
car 50
wedding 39
medical 30
moving 28
vacation 21
house 20
educational 15
renewable_energy 4
Name: Loan.Purpose, dtype: int64
I have clubbed the values in Loan.Purpose based on the mean of Interest.Rate:
raw_data_8 = round(raw_data_5.groupby('Loan.Purpose')['Interest.Rate'].mean())
raw_data_8
Loan.Purpose
CHOS 15.0
DM 12.0
car 11.0
credit_card 13.0
debt_consolidation 14.0
educational 11.0
home_improvement 12.0
house 13.0
major_purchase 11.0
medical 12.0
moving 14.0
other 13.0
renewable_energy 10.0
small_business 13.0
vacation 12.0
wedding 12.0
Name: Interest.Rate, dtype: float64
Now I want to club the values with the same means together. I tried the code below, but it gives an error:
for i in range(len(raw_data_5.index)):
    if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
    if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
    if raw_data_5['Loan.Purpose'][i] in ['credit_care','house','other','small_business']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'chos'
    if raw_data_5['Loan.Purpose'][i] in ['debt_consolidation','moving']:
        raw_data_5.iloc[i, 'Loan.Purpose'] = 'dcm'
The error:
TypeError                                 Traceback (most recent call last)
<ipython-input-51-cf7ef2ae1efd> in <module>
----> 1 for i in range(raw_data_5.index):
      2     if raw_data_5['Loan.Purpose'][i] in ['car','educational','major_purchase']:
      3         raw_data_5.iloc[i, 'Loan.Purpose'] = 'cem'
      4     if raw_data_5['Loan.Purpose'][i] in ['home_improvement','medical','vacation','wedding']:
      5         raw_data_5.iloc[i, 'Loan.Purpose'] = 'hmvw'
TypeError: 'Int64Index' object cannot be interpreted as an integer
Interest.Rate Loan.Length Loan.Purpose
0 8.90 36.0 debt_consolidation
1 12.12 36.0 debt_consolidation
2 21.98 60.0 debt_consolidation
3 9.99 36.0 debt_consolidation
4 11.71 36.0 credit_card
5 15.31 36.0 other
6 7.90 36.0 debt_consolidation
7 17.14 60.0 credit_card
8 14.33 36.0 credit_card
10 19.72 36.0 moving
11 14.27 36.0 debt_consolidation
12 21.67 60.0 debt_consolidation
13 8.90 36.0 debt_consolidation
14 7.62 36.0 debt_consolidation
15 15.65 60.0 debt_consolidation
16 12.12 36.0 debt_consolidation
17 10.37 60.0 debt_consolidation
18 9.76 36.0 credit_card
19 9.99 60.0 debt_consolidation
20 21.98 36.0 debt_consolidation
21 19.05 60.0 credit_card
22 17.99 60.0 car
23 11.99 36.0 credit_card
24 16.82 60.0 vacation
25 7.90 36.0 debt_consolidation
26 14.42 36.0 debt_consolidation
27 15.31 36.0 debt_consolidation
28 8.59 36.0 other
29 7.90 36.0 debt_consolidation
30 21.00 60.0 debt_consolidation
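The TypeError in the traceback comes from passing the Int64Index itself to range() (range() needs an integer, e.g. len(raw_data_5.index)); the .iloc assignments would also fail, because .iloc only accepts positional indexers, not the label 'Loan.Purpose'. As a sketch (using the group labels from the loop above and assuming 'credit_care' is a typo for 'credit_card'), a vectorized Series.replace avoids row iteration entirely:
# map each group of values with the same mean to one combined label
groups = {
    'car': 'cem', 'educational': 'cem', 'major_purchase': 'cem',
    'home_improvement': 'hmvw', 'medical': 'hmvw', 'vacation': 'hmvw', 'wedding': 'hmvw',
    'credit_card': 'chos', 'house': 'chos', 'other': 'chos', 'small_business': 'chos',
    'debt_consolidation': 'dcm', 'moving': 'dcm',
}
raw_data_5['Loan.Purpose'] = raw_data_5['Loan.Purpose'].replace(groups)
# then build the reduced set of dummy columns (LP_cem, LP_chos, ...)
dummies = pd.get_dummies(raw_data_5['Loan.Purpose'], prefix='LP')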

Optimized way of modifying a column based on another column of a dataframe

Let's say I have a dataframe like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44 1 96 1 40 1 88 0 81
1 2017-05-01 State NY 0 42 0 55 1 92 1 82 0 38
2 2017-06-01 State NY 1 11 0 7 1 35 0 70 1 61
3 2017-07-01 State NY 1 12 1 80 1 83 1 47 1 44
4 2017-08-01 State NY 1 63 1 48 0 61 0 5 0 20
5 2017-09-01 State NY 1 56 1 92 0 55 0 45 1 17
I'd like to replace the values in the _rank columns with NaN wherever the corresponding _flag is zero, to get something like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
Which is fairly simple. This is my approach for the same:
for k in variables:
    dt[k+'_rank'] = np.where(dt[k+'_flag']==0, np.nan, dt[k+'_rank'])
Although this works fine for a smaller dataset, it takes a significant amount of time to process a dataframe with a very high number of columns and entries. So is there an optimized way of achieving the same without iteration?
P.S. There are other columns apart from the _rank and _flag ones in the data.
Thanks in advance
Use .str.endswith to select the columns that end with _flag, then use str.rstrip to strip the flag label and append the rank label, which gives the corresponding _rank column names; then use np.where to set the values in the _rank columns to NaN wherever the corresponding value in the _flag columns is 0:
flags = df.columns[df.columns.str.endswith('_flag')]
ranks = flags.str.rstrip('flag') + 'rank'
df[ranks] = np.where(df[flags].eq(0), np.nan, df[ranks])
Or, it is also possible to use DataFrame.mask:
df[ranks] = df[ranks].mask(df[flags].eq(0).to_numpy())
Result:
# print(df)
Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
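One aside (not part of the original answer): str.rstrip strips a set of characters rather than a literal suffix, so flags.str.rstrip('flag') only works here because the '_' in each name stops the stripping. Replacing the suffix is less fragile:
# replace the literal '_flag' suffix instead of stripping trailing characters
ranks = flags.str.replace('_flag', '_rank')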

How to filter out values from a pandas data frame for which only one occurrence exists

I have a Pandas data frame with the following columns and values
Temp Time grain_size
0 335.0 25.0 14.8
1 335.0 30.0 18.7
2 335.0 35.0 22.1
3 187.6 25.0 9.8
4 227.0 25.0 14.2
5 227.0 30.0 16.2
6 118.5 25.0 8.7
The data frame, given the variable name df, has four distinct Temp values: 335.0, 187.6, 227.0, and 118.5; however, the values 187.6 and 118.5 each occur only once. I would like to filter the data frame so that it gets rid of values that occur only once, so the final data frame looks like:
Temp Time grain_size
0 335.0 25.0 14.8
1 335.0 30.0 18.7
2 335.0 35.0 22.1
4 227.0 25.0 14.2
5 227.0 30.0 16.2
Obviously in this simple case I know the values that occur only once, and I could simply use a filtering function to weed them out. However, I would like to automate the process so that Python determines which values occur only once and filters them autonomously. How can I achieve this?
Using duplicated
df[df.Temp.duplicated(keep=False)]
Out[630]:
Temp Time grain_size
0 335.0 25.0 14.8
1 335.0 30.0 18.7
2 335.0 35.0 22.1
4 227.0 25.0 14.2
5 227.0 30.0 16.2
Try this
df['count']=df.groupby(['Temp']).transform(pd.Series.count)
df = df[df['count']>1]
df.drop(['count'],axis=1,inplace=True)
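A shorter variant of the same count-based idea (a sketch) keeps the groups with more than one row directly, without the helper column:
# keep only the Temp groups that occur more than once
df = df.groupby('Temp').filter(lambda g: len(g) > 1)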
dict
This is a dict approach to the same thing done by WeNYoBen
seen = {}
for t in df.Temp:
    seen[t] = t in seen
df[df.Temp.map(seen)]
Temp Time grain_size
0 335.0 25.0 14.8
1 335.0 30.0 18.7
2 335.0 35.0 22.1
4 227.0 25.0 14.2
5 227.0 30.0 16.2
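Along the same lines (another sketch), a value_counts lookup can serve directly as the mask:
# map each Temp value to its frequency and keep rows whose value appears more than once
df[df.Temp.map(df.Temp.value_counts()) > 1]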

Pandas Computing On Multidimensional Data

I have two data frames storing tracking data of offensive and defensive players during an NFL game. My goal is to calculate the maximum distance between an offensive player and the nearest defender during the course of the play.
As a simple example, I've made up some data where there are only three offensive players and two defensive players. Here is the data:
Defense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 117 20.2 20.0
1 2 1 117 21.0 19.1
2 3 1 117 21.3 18.3
3 4 1 117 22.0 17.5
4 5 1 117 22.5 17.2
5 6 1 117 23.0 16.9
6 7 1 117 23.6 16.7
7 8 2 117 25.1 34.1
8 9 2 117 25.9 34.2
9 10 2 117 24.1 34.5
10 11 2 117 22.7 34.2
11 12 2 117 21.5 34.5
12 13 2 117 21.1 37.3
13 14 3 117 21.2 44.3
14 15 3 117 20.4 44.6
15 16 3 117 21.9 42.7
16 17 3 117 21.1 41.9
17 18 3 117 20.1 41.7
18 19 3 117 20.1 41.3
19 1 1 555 40.1 17.0
20 2 1 555 40.7 18.3
21 3 1 555 41.0 19.6
22 4 1 555 41.5 18.4
23 5 1 555 42.6 18.4
24 6 1 555 43.8 18.0
25 7 1 555 44.2 15.8
26 8 2 555 41.2 37.1
27 9 2 555 42.3 36.5
28 10 2 555 45.6 36.3
29 11 2 555 47.9 35.6
30 12 2 555 47.4 31.3
31 13 2 555 46.8 31.5
32 14 3 555 47.3 40.3
33 15 3 555 47.2 40.6
34 16 3 555 44.5 40.8
35 17 3 555 46.5 41.0
36 18 3 555 47.6 41.4
37 19 3 555 47.6 41.5
Offense
GameTime PlayId PlayerId x-coord y-coord
0 1 1 751 30.2 15.0
1 2 1 751 31.0 15.1
2 3 1 751 31.3 15.3
3 4 1 751 32.0 15.5
4 5 1 751 31.5 15.7
5 6 1 751 33.0 15.9
6 7 1 751 32.6 15.7
7 8 2 751 51.1 30.1
8 9 2 751 51.9 30.2
9 10 2 751 51.1 30.5
10 11 2 751 49.7 30.6
11 12 2 751 49.5 30.9
12 13 2 751 49.1 31.3
13 14 3 751 12.2 40.3
14 15 3 751 12.4 40.5
15 16 3 751 12.9 40.7
16 17 3 751 13.1 40.9
17 18 3 751 13.1 41.1
18 19 3 751 13.1 41.3
19 1 1 419 41.3 15.0
20 2 1 419 41.7 15.3
21 3 1 419 41.8 15.4
22 4 1 419 42.9 15.6
23 5 1 419 42.6 15.6
24 6 1 419 44.8 16.0
25 7 1 419 45.2 15.8
26 8 2 419 62.2 30.1
27 9 2 419 63.3 30.5
28 10 2 419 62.6 31.0
29 11 2 419 63.9 30.6
30 12 2 419 67.4 31.3
31 13 2 419 66.8 31.5
32 14 3 419 30.3 40.3
33 15 3 419 30.2 40.6
34 16 3 419 30.5 40.8
35 17 3 419 30.5 41.0
36 18 3 419 31.6 41.4
37 19 3 419 31.6 41.5
38 1 1 989 10.1 15.0
39 2 1 989 10.2 15.5
40 3 1 989 10.4 15.4
41 4 1 989 10.5 15.8
42 5 1 989 10.6 15.9
43 6 1 989 10.1 15.5
44 7 1 989 10.9 15.3
45 8 2 989 25.8 30.1
46 9 2 989 25.2 30.1
47 10 2 989 21.8 30.2
48 11 2 989 25.8 30.2
49 12 2 989 25.6 30.5
50 13 2 989 25.5 31.0
51 14 3 989 50.3 40.3
52 15 3 989 50.3 40.2
53 16 3 989 50.2 40.4
54 17 3 989 50.1 40.8
55 18 3 989 50.6 41.2
56 19 3 989 51.4 41.6
The data is essentially multidimensional with GameTime, PlayId, and PlayerId as independent variables and x-coord and y-coord as dependent variables. How can I go about calculating the maximum distance from the nearest defender during the course of a play?
My guess is I would have to create columns containing the distance from each defender for each offensive player, but I don't know how to name those, or how to account for an unknown number of defensive/offensive players (the full data set contains thousands of players).
Here is a possible solution; I think there is a way of making it more efficient:
Assuming you have a dataframe called offense_df and a dataframe called defense_df, first merge them on GameTime and PlayId so that every offensive player is paired with every defender at each time step:
from scipy.spatial import distance
merged_dataframe = pd.merge(offense_df, defense_df, on=['GameTime','PlayId'], suffixes=('_off','_def'))
The merged dataframe holds the answer to your question; basically it looks like this:
GameTime PlayId PlayerId_off x-coord_off y-coord_off PlayerId_def x-coord_def y-coord_def
0 1 1 751 30.2 15.0 117 20.2 20.0
1 1 1 751 30.2 15.0 555 40.1 17.0
2 1 1 419 41.3 15.0 117 20.2 20.0
3 1 1 419 41.3 15.0 555 40.1 17.0
4 1 1 989 10.1 15.0 117 20.2 20.0
The next two lines create a single coordinate column for the offensive player (coord_off) and for the defender (coord_def), each holding an (x, y) tuple; this simplifies the distance computation.
merged_dataframe['coord_off'] = merged_dataframe.apply(lambda x: (x['x-coord_off'], x['y-coord_off']),axis=1)
merged_dataframe['coord_def'] = merged_dataframe.apply(lambda x: (x['x-coord_def'], x['y-coord_def']),axis=1)
We then compute the distance to every defender at a given GameTime, PlayId:
merged_dataframe['distance_to_def'] = merged_dataframe.apply(lambda x: distance.euclidean(x['coord_off'],x['coord_def']),axis=1)
For each PlayerId, GameTime, PlayId we take the distance to the nearest defender:
smallest_dist = merged_dataframe.groupby(['GameTime','PlayId','PlayerId_off'])['distance_to_def'].min()
Finally, we take the maximum of these minimum distances for each PlayerId:
smallest_dist.groupby('PlayerId_off').max()
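If the row-wise apply calls become a bottleneck on the full data set, the euclidean distance can also be computed column-wise with numpy (a sketch under the same merged_dataframe assumptions as above; the groupby steps stay unchanged):
import numpy as np
# vectorized euclidean distance over the whole merged frame at once,
# instead of building tuples and applying scipy row by row
merged_dataframe['distance_to_def'] = np.hypot(
    merged_dataframe['x-coord_off'] - merged_dataframe['x-coord_def'],
    merged_dataframe['y-coord_off'] - merged_dataframe['y-coord_def'])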

why am I getting a too many indexers error?

cars_df = pd.DataFrame((car.iloc[:[1,3,4,6]].values), columns = ['mpg', 'dip', 'hp', 'wt'])
car_t = car.iloc[:9].values
target_names = [0,1]
car_df['group'] = pd.series(car_t, dtypre='category')
sb.pairplot(cars_df)
I have tried using .iloc(axis=0)[xxxx] and making the slice into a list and a tuple. No dice. Any thoughts? I am trying to make a scatter plot from a lynda.com video, but in the video the host uses .ix, which is deprecated, so I am using .iloc[].
car is a dataframe; a few lines of the data:
"Car_name","mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
"Mazda RX4",21,6,160,110,3.9,2.62,16.46,0,1,4,4
"Mazda RX4 Wag",21,6,160,110,3.9,2.875,17.02,0,1,4,4
"Datsun 710",22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
"Hornet 4 Drive",21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
"Hornet Sportabout",18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
"Valiant",18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
"Duster 360",14.3,8,360,245,3.21,3.57,15.84,0,0,3,4
"Merc 240D",24.4,4,146.7,62,3.69,3.19,20,1,0,4,2
"Merc 230",22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
"Merc 280",19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
"Merc 280C",17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
"Merc 450SE",16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
I think you want to select multiple columns with iloc; the row indexer and the column indexer must be separated by a comma:
cars_df = car.iloc[:, [1,3,4,6]]
print (cars_df)
mpg disp hp wt
0 21.0 160.0 110 2.620
1 21.0 160.0 110 2.875
2 22.8 108.0 93 2.320
3 21.4 258.0 110 3.215
4 18.7 360.0 175 3.440
5 18.1 225.0 105 3.460
6 14.3 360.0 245 3.570
7 24.4 146.7 62 3.190
8 22.8 140.8 95 3.150
9 19.2 167.6 123 3.440
10 17.8 167.6 123 3.440
11 16.4 275.8 180 4.070
sb.pairplot(cars_df)
Not 100% sure about the rest of the code (note that pd.series should be pd.Series and dtypre is a typo for dtype), but it seems you need:
# select the 9th column (am) as well
cars_df = car.iloc[:, [1,3,4,6,9]]
# rename the am column to group
cars_df = cars_df.rename(columns={'am':'group'})
# convert it to categorical
cars_df['group'] = pd.Categorical(cars_df['group'])
print (cars_df)
mpg disp hp wt group
0 21.0 160.0 110 2.620 1
1 21.0 160.0 110 2.875 1
2 22.8 108.0 93 2.320 1
3 21.4 258.0 110 3.215 0
4 18.7 360.0 175 3.440 0
5 18.1 225.0 105 3.460 0
6 14.3 360.0 245 3.570 0
7 24.4 146.7 62 3.190 0
8 22.8 140.8 95 3.150 0
9 19.2 167.6 123 3.440 0
10 17.8 167.6 123 3.440 0
11 16.4 275.8 180 4.070 0
# add parameter hue for different levels of a categorical variable
sb.pairplot(cars_df, hue='group')
