I have a dataframe like this
import pandas as pd
raw_data = {'ID':['101','101','101','101','101','102','102','103'],
'Week':['W01','W02','W03','W07','W08','W01','W02','W01'],
'Orders':[15,15,10,15,15,5,10,10]}
df2 = pd.DataFrame(raw_data, columns = ['ID','Week','Orders'])
I want row-by-row percentage changes within each group.
How can I achieve this?
Using pct_change:
df2.groupby('ID').Orders.pct_change().add(1).fillna(0)
I find it weird that in my pandas version pct_change cannot be used on a groupby object, so we need to do it like this:
# build the per-group pct_change results as a list of lists, then flatten with sum(l, [])
l = [df2.loc[df2.ID == i, 'Orders'].pct_change().tolist() for i in df2.ID.unique()]
df2['New'] = sum(l, [])
df2.New = (df2.New + 1).fillna(0)
df2
Out[606]:
ID Week Orders New
0 101 W01 15 0.000000
1 101 W02 15 1.000000
2 101 W03 10 0.666667
3 101 W07 15 1.500000
4 101 W08 15 1.000000
5 102 W01 5 0.000000
6 102 W02 10 2.000000
7 103 W01 10 0.000000
Carry out a window operation that shifts the value by one position within each group:
df2['prev'] = df2.groupby(by='ID').Orders.shift(1).fillna(0)
Then calculate the % change row by row using apply():
df2['pct'] = df2.apply(lambda x: ((x['Orders'] - x['prev']) / x['prev']) if x['prev'] != 0 else 0, axis=1)
I am not sure whether there is a built-in pct_change that works within a window.
ID Week Orders prev pct
0 101 W01 15 0.0 0.000000
1 101 W02 15 15.0 0.000000
2 101 W03 10 15.0 -0.333333
3 101 W07 15 10.0 0.500000
4 101 W08 15 15.0 0.000000
5 102 W01 5 0.0 0.000000
6 102 W02 10 5.0 1.000000
7 103 W01 10 0.0 0.000000
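As a side note, the row-wise apply can usually be avoided by doing the arithmetic on whole columns. A minimal sketch, assuming the df2 defined above (the division is only kept where prev is non-zero, mirroring the apply version):
df2['prev'] = df2.groupby('ID')['Orders'].shift(1).fillna(0)
df2['pct'] = ((df2['Orders'] - df2['prev']) / df2['prev']).where(df2['prev'] != 0, 0)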
Let's say I have a dataframe like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44 1 96 1 40 1 88 0 81
1 2017-05-01 State NY 0 42 0 55 1 92 1 82 0 38
2 2017-06-01 State NY 1 11 0 7 1 35 0 70 1 61
3 2017-07-01 State NY 1 12 1 80 1 83 1 47 1 44
4 2017-08-01 State NY 1 63 1 48 0 61 0 5 0 20
5 2017-09-01 State NY 1 56 1 92 0 55 0 45 1 17
I'd like to replace the values in every column ending with _rank with NaN wherever its corresponding _flag column is zero, to get something like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
Which is fairly simple. This is my approach for the same:
for k in variables:
    dt[k+'_rank'] = np.where(dt[k+'_flag']==0, np.nan, dt[k+'_rank'])
Although this works fine for a smaller dataset, it takes a significant amount of time on a dataframe with a very high number of columns and rows. So is there an optimized way of achieving the same without iteration?
P.S. There are other columns apart from _rank and _flag in the data.
Thanks in advance
Use .str.endswith to select the columns that end with _flag, then use rstrip to strip the flag label and append rank to get the corresponding _rank column names. Finally, use np.where to set the _rank columns to NaN wherever the corresponding _flag value is 0:
flags = df.columns[df.columns.str.endswith('_flag')]
# rstrip removes trailing characters from the set {'f','l','a','g'}, which works here
# because stripping stops at the underscore: 'a_flag' -> 'a_' -> 'a_rank'
ranks = flags.str.rstrip('flag') + 'rank'
df[ranks] = np.where(df[flags].eq(0), np.nan, df[ranks])
Or it is also possible to use DataFrame.mask:
df[ranks] = df[ranks].mask(df[flags].eq(0).to_numpy())
Result:
# print(df)
Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
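A slight variation (a sketch, assuming the same _flag/_rank naming scheme) builds the rank names with an explicit suffix replacement, which avoids relying on rstrip's character-set behaviour:
flags = df.columns[df.columns.str.endswith('_flag')]
ranks = flags.str.replace('_flag$', '_rank', regex=True)  # swap the suffix explicitly
df[ranks] = df[ranks].mask(df[flags].to_numpy() == 0)     # NaN where the paired flag is 0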
I've combined 10 Excel files, each with one year of NFL passing stats. There are certain columns (games played, completions, attempts, etc.) that I have summed, but for others (passer rating and QBR) I'd like to see the average instead.
df3 = df3.groupby(['Player'], as_index=False).agg(
    {'GS': 'sum', 'Cmp': 'sum', 'Att': 'sum', 'Cmp%': 'sum', 'Yds': 'sum', 'TD': 'sum',
     'TD%': 'sum', 'Int': 'sum', 'Int%': 'sum', 'Y/A': 'sum', 'AY/A': 'sum', 'Y/C': 'sum',
     'Y/G': 'sum', 'Rate': 'sum', 'QBR': 'sum', 'Sk': 'sum', 'Yds.1': 'sum', 'NY/A': 'sum',
     'ANY/A': 'sum', 'Sk%': 'sum', '4QC': 'sum', 'GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code, the actual dataset (or a sample of it), etc., so that users can reproduce the issue. No one is really going to take the time to manufacture your dataset from a photo (or I should say rarely, as I did do that...because I love working with sports data, and I could grab it relatively quickly).
But to get averages instead of the sum, you would use 'mean'. Also, in your code, why are you summing percentages?
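For example, a minimal aggregation map along those lines might look like this (a sketch only; the column names come from the question, and which rate-style columns to average rather than sum is a judgment call):
agg_map = {'GS': 'sum', 'Cmp': 'sum', 'Att': 'sum', 'Yds': 'sum', 'TD': 'sum', 'Int': 'sum',
           'Rate': 'mean', 'QBR': 'mean', 'Cmp%': 'mean', 'TD%': 'mean', 'Int%': 'mean'}
df3 = df3.groupby('Player', as_index=False).agg(agg_map)
The complete example below pulls the data and applies the same idea with a smaller aggregation map.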
import pandas as pd

df = pd.DataFrame()
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    # note: DataFrame.append was removed in pandas 2.x; collect the frames and use pd.concat there
    df = df.append(pd.read_html(url)[0], sort=False)

df = df[df['Rk'] != 'Rk']              # drop the repeated header rows
df = df.reset_index(drop=True)
df['Player'] = df.Player.str.replace('[^a-zA-Z .]', '', regex=True)
df['Player'] = df['Player'].str.strip()

strCols = ['Player', 'Tm', 'Pos', 'QBrec']
numCols = [x for x in df.columns if x not in strCols]
df[['QB_wins', 'QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)

df3 = df.groupby(['Player'], as_index=False).agg({'GS': 'sum', 'TD': 'sum', 'QBR': 'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000
I have a dataset of 6.5 million rows in which each user ID's interaction with a token supplier are recorded.
The data is sorted by 'id' and 'Days'
The 'Days' column is the count of days since they joined the supplier.
On the day a user is given tokens, the amount is recorded in the token_SUPPLY column.
Each day one token is used.
I want to create a column in which the number of available tokens for every row is mentioned.
The logic I've used is:
For each row, check whether we are still looking at the same user 'id'. If yes, check whether any tokens have been supplied; if so, save the day number.
For each subsequent row of the same user, calculate the available tokens as the number of tokens supplied minus the number of days passed since the tokens were supplied.
currID = 0
tokenSupply = 0
giveDay = 0

for row in df11.itertuples():
    if row.id != currID:
        tokenSupply = 0
        currID = row.id
    if row.token_SUPPLY > 0:
        giveDay = row.Days
        tokenSupply = row.token_SUPPLY
        df11.loc[row.Index, "token_onhand"] = tokenSupply
    else:
        if tokenSupply == 0:
            df11.loc[row.Index, "token_onhand"] = 0
        else:
            df11.loc[row.Index, "token_onhand"] = tokenSupply - (row.Days - giveDay)

# the loop doesn't finish even after more than 50 minutes
I've been reading a lot since last night and it seems that people have suggested using numpy, but I don't know how to do that as I'm just learning to use these things. The other suggestion was to use @jit, but I guess that works only if I define a function.
Another suggestion was to vectorise, but how would I then access rows conditionally and remember the supplied quantity to use in every subsequent row?
I did try using np.where but it seemed to get too convoluted to wrap my head around it.
I also read somewhere about Cython, but again, I have no idea how to do that properly.
What would be the best approach to achieve my objective ?
EDIT: Added sample data and required output column
Sample output data:
id Days token_SUPPLY give_event token_onhand
190 ID1001 -12 NaN 0 0.0
191 ID1001 -12 NaN 0 0.0
192 ID1001 -3 NaN 0 0.0
193 ID1001 0 5.0 0 5.0
194 ID1001 0 5.0 1 5.0
195 ID1001 6 NaN 0 -1.0
196 ID1001 12 NaN 0 -7.0
197 ID1001 12 NaN 0 -7.0
198 ID1001 13 NaN 0 -8.0
199 ID1001 13 NaN 0 -8.0
The last column token_onhand is not in the dataset, and is what actually needs to be generated.
If I understand correctly:
Sample Data:
id Days token_SUPPLY give_event
0 ID1001 -12 NaN 0
1 ID1001 -12 NaN 0
2 ID1001 -3 NaN 0
3 ID1001 0 5.0 0
4 ID1001 0 5.0 1
5 ID1001 6 NaN 0
6 ID1001 12 NaN 0
7 ID1001 12 NaN 0
8 ID1001 13 NaN 0
9 ID1001 13 NaN 0
10 ID1002 -12 NaN 0
11 ID1002 -12 NaN 0
12 ID1002 -3 NaN 0
13 ID1002 0 5.0 0
14 ID1002 0 5.0 1
15 ID1002 6 NaN 0
16 ID1002 12 NaN 0
17 ID1002 12 NaN 0
18 ID1002 13 NaN 0
19 ID1002 13 NaN 0
You can use ffill on token_SUPPLY and subtract Days. For more than one id, use groupby.
df = pd.read_clipboard()
df['token_onhand'] = df.groupby('id').apply(lambda x: (x['token_SUPPLY'].ffill() - x['Days']).fillna(0)).reset_index(drop=True)
df
Result:
id Days token_SUPPLY give_event token_onhand
0 ID1001 -12 NaN 0 0.0
1 ID1001 -12 NaN 0 0.0
2 ID1001 -3 NaN 0 0.0
3 ID1001 0 5.0 0 5.0
4 ID1001 0 5.0 1 5.0
5 ID1001 6 NaN 0 -1.0
6 ID1001 12 NaN 0 -7.0
7 ID1001 12 NaN 0 -7.0
8 ID1001 13 NaN 0 -8.0
9 ID1001 13 NaN 0 -8.0
10 ID1002 -12 NaN 0 0.0
11 ID1002 -12 NaN 0 0.0
12 ID1002 -3 NaN 0 0.0
13 ID1002 0 5.0 0 5.0
14 ID1002 0 5.0 1 5.0
15 ID1002 6 NaN 0 -1.0
16 ID1002 12 NaN 0 -7.0
17 ID1002 12 NaN 0 -7.0
18 ID1002 13 NaN 0 -8.0
19 ID1002 13 NaN 0 -8.0
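The same idea can be written without the groupby.apply, since groupby(...).ffill() already returns a column aligned with the original frame. A small sketch, assuming the sample frame above:
filled = df.groupby('id')['token_SUPPLY'].ffill()      # carry the last supplied amount forward per id
df['token_onhand'] = (filled - df['Days']).fillna(0)   # subtract elapsed days; 0 before any supply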
I have a large DataFrame which is indexed by datetime, in particular, by days. I am looking for an efficient function which, for each column, checks the most common non-null value in each week, and outputs a dataframe which is indexed by weeks consisting of these within-week most common values.
Here is an example. The following DataFrame consists of two weeks of daily data:
0 1
2015-11-12 00:00:00 8 nan
2015-11-13 00:00:00 7 nan
2015-11-14 00:00:00 nan 5
2015-11-15 00:00:00 7 nan
2015-11-16 00:00:00 8 nan
2015-11-17 00:00:00 7 nan
2015-11-18 00:00:00 5 nan
2015-11-19 00:00:00 9 nan
2015-11-20 00:00:00 8 nan
2015-11-21 00:00:00 6 nan
2015-11-22 00:00:00 6 nan
2015-11-23 00:00:00 6 nan
2015-11-24 00:00:00 6 nan
2015-11-25 00:00:00 2 nan
and should be transformed into:
0 1
2015-11-12 00:00:00 7 5
2015-11-19 00:00:00 6 nan
My DataFrame is very large so efficiency is important. Thanks.
EDIT: If possible, can someone suggest a method that would be applicable if the entries are tuples (instead of floats as in my example)?
You can use resample to group your data by weekly intervals. Then count the number of occurrences via pd.value_counts and select the most common one with idxmax:
df.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
Edit
Here is another numpy version which is faster than the above solution:
def get_mode(series):
    values = series.values
    dropped = values[~np.isnan(values)]
    # check for empty array and return NaN
    if not dropped.size:
        return np.nan
    uniques, counts = np.unique(series.dropna(), return_counts=True)
    return uniques[np.argmax(counts)]
df2.resample("7D").apply(lambda x: x.apply(get_mode))
0 1
2015-11-12 00:00:00 7.0 5.0
2015-11-19 00:00:00 6.0 NaN
And here are the timings based on the dummy data (for further improvements, have a look here):
%%timeit
df2.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 100 loops, best of 3: 18.6 ms per loop
%%timeit
df2.resample("7D").apply(lambda x: x.apply(get_mode))
>>> 100 loops, best of 3: 3.72 ms per loop
I also tried scipy.stats.mode; however, it was also slower than the numpy solution:
size = 1000
index = pd.date_range(start="2012-12-12", periods=size, freq="D")
dummy = pd.DataFrame(np.random.randint(0, 20, size=(size, 50)), index=index)
print(dummy.head())
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 48 49
2012-12-12 18 2 7 1 7 9 16 2 19 19 ... 10 2 18 16 15 10 7 19 9 6
2012-12-13 7 4 11 19 17 10 18 0 10 7 ... 19 11 5 5 11 4 0 16 12 19
2012-12-14 14 0 14 5 1 11 2 19 5 9 ... 2 9 4 2 9 5 19 2 16 2
2012-12-15 12 2 7 2 12 12 11 11 19 5 ... 16 0 4 9 13 5 10 2 14 4
2012-12-16 8 15 2 18 3 16 15 0 14 14 ... 18 2 6 13 19 10 3 16 11 4
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(get_mode))
>>> 1 loop, best of 3: 926 ms per loop
%%timeit
dummy.resample("7D").apply(lambda x: x.apply(pd.value_counts).idxmax())
>>> 1 loop, best of 3: 5.84 s per loop
%%timeit
dummy.resample("7D").apply(lambda x: stats.mode(x).mode)
>>> 1 loop, best of 3: 1.32 s per loop
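Regarding the EDIT about tuple entries: np.isnan only works on numeric values, but value_counts counts any hashable object, so a variant along these lines might work for tuples as well (a rough sketch, not benchmarked; df is assumed to be the original daily frame):
import numpy as np
import pandas as pd

def common_value(col):
    # value_counts handles floats and tuples alike; NaNs are dropped first
    counts = col.dropna().value_counts()
    return counts.index[0] if len(counts) else np.nan  # most frequent value, or NaN if the week is empty

weekly = df.groupby(pd.Grouper(freq="7D")).apply(lambda g: g.apply(common_value))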
I have a dataframe as below
0 1 2 3 4 5
0 0.428519 0.000000 0.0 0.541096 0.250099 0.345604
1 0.056650 0.000000 0.0 0.000000 0.000000 0.000000
2 0.000000 0.000000 0.0 0.000000 0.000000 0.000000
3 0.849066 0.559117 0.0 0.374447 0.424247 0.586254
4 0.317644 0.000000 0.0 0.271171 0.586686 0.424560
I would like to modify it as below
0 0 0.428519
0 1 0.000000
0 2 0.0
0 3 0.541096
0 4 0.250099
0 5 0.345604
1 0 0.056650
1 1 0.000000
........
Use stack with reset_index:
df1 = df.stack().reset_index()
df1.columns = ['col1','col2','col3']
print (df1)
col1 col2 col3
0 0 0 0.428519
1 0 1 0.000000
2 0 2 0.000000
3 0 3 0.541096
4 0 4 0.250099
5 0 5 0.345604
6 1 0 0.056650
7 1 1 0.000000
8 1 2 0.000000
9 1 3 0.000000
10 1 4 0.000000
11 1 5 0.000000
12 2 0 0.000000
13 2 1 0.000000
14 2 2 0.000000
15 2 3 0.000000
16 2 4 0.000000
17 2 5 0.000000
18 3 0 0.849066
19 3 1 0.559117
20 3 2 0.000000
21 3 3 0.374447
22 3 4 0.424247
23 3 5 0.586254
24 4 0 0.317644
25 4 1 0.000000
26 4 2 0.000000
27 4 3 0.271171
28 4 4 0.586686
29 4 5 0.424560
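As a small follow-up, the column names can also be set in the same chain by naming the axes before stacking (the names here just match the output above):
df1 = (df.rename_axis(index='col1', columns='col2')
         .stack()
         .reset_index(name='col3'))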
A NumPy solution uses numpy.repeat and numpy.tile, with the values flattened by numpy.ravel:
df2 = pd.DataFrame({
    "col1": np.repeat(df.index, len(df.columns)),  # each row label repeated once per column
    "col2": np.tile(df.columns, len(df.index)),    # the column labels cycled for every row
    "col3": df.values.ravel()})                    # the values flattened row by row
print (df2)
col1 col2 col3
0 0 0 0.428519
1 0 1 0.000000
2 0 2 0.000000
3 0 3 0.541096
4 0 4 0.250099
5 0 5 0.345604
6 1 0 0.056650
7 1 1 0.000000
8 1 2 0.000000
9 1 3 0.000000
10 1 4 0.000000
11 1 5 0.000000
12 2 0 0.000000
13 2 1 0.000000
14 2 2 0.000000
15 2 3 0.000000
16 2 4 0.000000
17 2 5 0.000000
18 3 0 0.849066
19 3 1 0.559117
20 3 2 0.000000
21 3 3 0.374447
22 3 4 0.424247
23 3 5 0.586254
24 4 0 0.317644
25 4 1 0.000000
26 4 2 0.000000
27 4 3 0.271171
28 4 4 0.586686
29 4 5 0.424560