I have a time series dataset and I need to extract lag features. I am using the code below but I get all NaNs:
df.groupby(['week','id1','id2','id3'],as_index=False)['value'].shift(1)
Input:
week,id1,id2,id3,value
1,101,123,001,45
1,102,231,004,89
1,203,435,099,65
2,101,123,001,48
2,102,231,004,75
2,203,435,099,90
Desired output:
week,id1,id2,id3,value,t-1
1,101,123,001,45,NAN
1,102,231,004,89,NAN
1,203,435,099,65,NAN
2,101,123,001,48,45
2,102,231,004,75,89
2,203,435,099,90,65
You want to shift to the next week, so remove 'week' from the grouping:
df['t-1'] = df.groupby(['id1', 'id2', 'id3'])['value'].shift()
# week id1 id2 id3 value t-1
#0 1 101 123 1 45 NaN
#1 1 102 231 4 89 NaN
#2 1 203 435 99 65 NaN
#3 2 101 123 1 48 45.0
#4 2 102 231 4 75 89.0
#5 2 203 435 99 90 65.0
That approach is error-prone when weeks are missing. In this case we can merge after incrementing the week, which ensures the lagged value comes from the prior week regardless of missing weeks.
df2 = df.assign(week=df.week+1).rename(columns={'value': 't-1'})
df = df.merge(df2, on=['week', 'id1', 'id2', 'id3'], how='left')
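For instance, here is a minimal sketch with made-up data (a single id combination whose week 3 is missing) showing how the plain groupby.shift pulls the wrong week while the merge correctly leaves a gap:
import pandas as pd

df = pd.DataFrame({'week': [1, 2, 4],
                   'id1': [101, 101, 101],
                   'id2': [123, 123, 123],
                   'id3': [1, 1, 1],
                   'value': [45, 48, 50]})

# groupby.shift silently treats week 2 as the "previous" week of week 4
df['t-1_shift'] = df.groupby(['id1', 'id2', 'id3'])['value'].shift()

# the merge leaves week 4 as NaN because there is no week 3 row to lag from
df2 = (df[['week', 'id1', 'id2', 'id3', 'value']]
       .assign(week=lambda d: d.week + 1)
       .rename(columns={'value': 't-1_merge'}))
df = df.merge(df2, on=['week', 'id1', 'id2', 'id3'], how='left')
print(df[['week', 'value', 't-1_shift', 't-1_merge']])
# week 4 gets t-1_shift=48 (week 2's value) but t-1_merge=NaN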
Another way to bring in and rename many columns would be to use the suffixes argument in the merge. This renames all overlapping columns (that are not keys) in the right DataFrame.
df.merge(df.assign(week=df.week+1), # Manually lag the week
on=['week', 'id1', 'id2', 'id3'],
how='left',
suffixes=['', '_lagged'] # Right df columns -> _lagged
)
# week id1 id2 id3 value value_lagged
#0 1 101 123 1 45 NaN
#1 1 102 231 4 89 NaN
#2 1 203 435 99 65 NaN
#3 2 101 123 1 48 45.0
#4 2 102 231 4 75 89.0
#5 2 203 435 99 90 65.0
I have this store_df DataFrame:
store_id date sales
0 1 2023-1-2 11
1 2 2023-1-3 22
2 3 2023-1-4 33
3 1 2023-1-5 44
4 2 2023-1-6 55
5 3 2023-1-7 66
6 1 2023-1-8 77
7 2 2023-1-9 88
8 3 2023-1-10 99
I was not able to solve this in the interview.
This was the exact question asked:
- Create a dataset with 3 columns: store_id, date, sales
- Create 3 store_id values
- Each store_id has 3 consecutive dates
- Sales are recorded for 9 rows
- We are considering the same 9 dates across all stores
- Sales can be any random number
Write a function that fetches the previous day’s sales as output once we give store_id & date as input
The question can be handled in multiple ways.
If you want to just get the previous row per group, assuming that the values are consecutive and sorted by increasing dates, use a groupby.shift:
store_df['prev_day_sales'] = store_df.groupby('store_id')['sales'].shift()
Output:
store_id date sales prev_day_sales
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 55.0
8 3 2023-01-04 99 66.0
If you really want to get the previous day's value (not the previous available day), use a merge:
store_df['date'] = pd.to_datetime(store_df['date'])
store_df.merge(store_df.assign(date=lambda d: d['date'].add(pd.Timedelta('1D'))),
on=['store_id', 'date'], suffixes=(None, '_prev_day'), how='left'
)
Note: this makes it easy to handle other deltas, like business days (replace pd.Timedelta('1D') with pd.offsets.BusinessDay(1)).
Example (with a different input):
store_id date sales sales_prev_day
0 1 2023-01-02 11 NaN
1 2 2023-01-02 22 NaN
2 3 2023-01-02 33 NaN
3 1 2023-01-03 44 11.0
4 2 2023-01-03 55 22.0
5 3 2023-01-03 66 33.0
6 1 2023-01-04 77 44.0
7 2 2023-01-05 88 NaN # there is no data for 2023-01-04
8 3 2023-01-04 99 66.0
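To answer the literal interview wording (a function that fetches the previous day's sales given store_id and date), a minimal sketch could be a direct lookup on store_df; the function name prev_day_sales is just illustrative:
import pandas as pd

def prev_day_sales(store_df, store_id, date):
    # Hypothetical helper: return the sales recorded for store_id on the day
    # before date, or None if there is no such row.
    prev_day = pd.to_datetime(date) - pd.Timedelta('1D')
    mask = ((store_df['store_id'] == store_id)
            & (pd.to_datetime(store_df['date']) == prev_day))
    match = store_df.loc[mask, 'sales']
    return match.iloc[0] if not match.empty else None

prev_day_sales(store_df, 1, '2023-01-03')  # -> 11 with the question's sample data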
I've combined 10 Excel files, each with one year of NFL passing stats. There are certain columns (Games played, Completions, Attempts, etc.) that I have summed, but there are others (Passer rating and QBR) that I'd like to see the average for.
df3 = df3.groupby(['Player'], as_index=False).agg(
    {'GS': 'sum', 'Cmp': 'sum', 'Att': 'sum', 'Cmp%': 'sum', 'Yds': 'sum',
     'TD': 'sum', 'TD%': 'sum', 'Int': 'sum', 'Int%': 'sum', 'Y/A': 'sum',
     'AY/A': 'sum', 'Y/C': 'sum', 'Y/G': 'sum', 'Rate': 'sum', 'QBR': 'sum',
     'Sk': 'sum', 'Yds.1': 'sum', 'NY/A': 'sum', 'ANY/A': 'sum', 'Sk%': 'sum',
     '4QC': 'sum', 'GWD': 'sum'})
Quick note: don't attach photos of your code, dataset, errors, etc. Provide the actual code and the actual dataset (or a sample of it) so that users can reproduce the error or issue. No one is really going to take the time to reconstruct your dataset from a photo (or rarely, I should say, as I did do it here... because I love working with sports data and could grab it relatively quickly).
But to get averages instead of the sum, you would use 'mean'. Also, in your code, why are you summing percentages?
import pandas as pd

frames = []
for season in range(2010, 2020):
    url = 'https://www.pro-football-reference.com/years/{season}/passing.htm'.format(season=season)
    frames.append(pd.read_html(url)[0])
df = pd.concat(frames, sort=False)  # pd.concat instead of the removed DataFrame.append

df = df[df['Rk'] != 'Rk']
df = df.reset_index(drop=True)
df['Player'] = df['Player'].str.replace('[^a-zA-Z .]', '', regex=True)
df['Player'] = df['Player'].str.strip()
strCols = ['Player', 'Tm', 'Pos', 'QBrec']
numCols = [x for x in df.columns if x not in strCols]
df[['QB_wins', 'QB_loss', 'QB_ties']] = df['QBrec'].str.split('-', expand=True)
df[numCols] = df[numCols].apply(pd.to_numeric)

df3 = df.groupby(['Player'], as_index=False).agg({'GS': 'sum', 'TD': 'sum', 'QBR': 'mean'})
Output:
print (df3)
Player GS TD QBR
0 A.J. Feeley 3 1 27.300000
1 A.J. McCarron 4 6 NaN
2 Aaron Rodgers 142 305 68.522222
3 Ace Sanders 4 1 100.000000
4 Adam Podlesh 0 0 0.000000
5 Albert Wilson 7 1 99.700000
6 Alex Erickson 6 0 NaN
7 Alex Smith 121 156 55.122222
8 Alex Tanney 0 1 42.900000
9 Alvin Kamara 9 0 NaN
10 Andrew Beck 6 0 NaN
11 Andrew Luck 86 171 62.766667
12 Andy Dalton 133 204 53.375000
13 Andy Lee 0 0 0.000000
14 Anquan Boldin 32 0 11.600000
15 Anthony Miller 4 0 81.200000
16 Antonio Andrews 10 1 100.000000
17 Antonio Brown 55 1 29.300000
18 Antonio Morrison 4 0 NaN
19 Antwaan Randle El 0 2 100.000000
20 Arian Foster 13 1 100.000000
21 Armanti Edwards 0 0 41.466667
22 Austin Davis 10 13 38.150000
23 B.J. Daniels 0 0 NaN
24 Baker Mayfield 29 49 53.200000
25 Ben Roethlisberger 130 236 66.833333
26 Bernard Scott 1 0 5.600000
27 Bilal Powell 12 0 17.700000
28 Billy Volek 0 0 89.400000
29 Blaine Gabbert 48 48 37.687500
.. ... ... ... ...
329 Tim Boyle 0 0 NaN
330 Tim Masthay 0 1 5.700000
331 Tim Tebow 16 17 42.733333
332 Todd Bouman 1 2 57.400000
333 Todd Collins 1 0 0.800000
334 Tom Brady 156 316 72.755556
335 Tom Brandstater 0 0 0.000000
336 Tom Savage 9 5 38.733333
337 Tony Pike 0 0 2.500000
338 Tony Romo 72 141 71.185714
339 Travaris Cadet 1 0 NaN
340 Travis Benjamin 8 0 1.700000
341 Travis Kelce 15 0 1.800000
342 Trent Edwards 3 2 98.100000
343 Tress Way 0 0 NaN
344 Trevone Boykin 0 1 66.400000
345 Trevor Siemian 25 30 40.750000
346 Troy Smith 6 5 38.500000
347 Tyler Boyd 14 0 2.800000
348 Tyler Bray 0 0 0.000000
349 Tyler Palko 4 2 56.600000
350 Tyler Thigpen 1 2 30.233333
351 Tyreek Hill 13 0 0.000000
352 Tyrod Taylor 46 54 51.242857
353 Vince Young 11 14 50.850000
354 Will Grier 2 0 NaN
355 Willie Snead 4 1 100.000000
356 Zach Mettenberger 10 12 24.600000
357 Zach Pascal 13 0 NaN
358 Zay Jones 15 0 0.000000
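On the point about summing (or averaging) percentages: a plain mean weighs every season equally, regardless of playing time. If you want a volume-adjusted figure for rate stats such as QBR, one possible sketch (not part of the original answer) is an attempt-weighted average:
import numpy as np

def weighted_mean(group, value_col, weight_col):
    # Hypothetical helper: average value_col weighted by weight_col,
    # ignoring seasons where either is missing.
    d = group.dropna(subset=[value_col, weight_col])
    if d[weight_col].sum() == 0:
        return np.nan
    return np.average(d[value_col], weights=d[weight_col])

# Attempt-weighted QBR per player, computed on the per-season df built above.
qbr_weighted = df.groupby('Player').apply(lambda g: weighted_mean(g, 'QBR', 'Att'))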
I have a df that looks something like the below
Index Col1 Col2 Col3 Col4 Col5
0 12 121 346 abc 747
1 156 121 146 68 75967
2 234 121 346 567
3 gj 161 646
4 214 171
5 fhg
....
.....
I want to transform the dataframe so that, in the columns containing null values, the data is shifted to the bottom of the dataframe (with the nulls moving to the top).
E.g. it should look like:
Index Col1 Col2 Col3 Col4 Col5
0 12
1 156 121
2 234 121 346
3 gj 121 146 abc
4 214 161 346 68 747
5 fhg 171 646 567 75967
I have thought along the lines of shift and/or justify.
However, I am not sure how this can be accomplished in the most efficient way for a large dataframe.
You can use a slightly modified justify function that also works with non-numeric values:
import numpy as np
import pandas as pd

def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'.
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notnull(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
arr = justify(df.values, invalid_val=np.nan, side='down', axis=0)
df = pd.DataFrame(arr, columns=df.columns, index=df.index).astype(df.dtypes)
print (df)
Col1 Col2 Col3 Col4 Col5
0 12 NaN NaN NaN NaN
1 156 121 NaN NaN NaN
2 234 121 346 NaN NaN
3 gj 121 346 567 NaN
4 214 121 346 567 75967
5 fhg 121 346 567 75967
I tried this,
t = df.isnull().sum()
for val in zip(t.index.values, t.values):
    df[val[0]] = df[val[0]].shift(val[1])
print(df)
Output:
Index Col1 Col2 Col3 Col4 Col5
0 0 12 NaN NaN NaN NaN
1 1 156 121.0 NaN NaN NaN
2 2 234 121.0 346.0 NaN NaN
3 3 gj 121.0 146.0 abc NaN
4 4 214 161.0 346.0 68 747.0
5 5 fhg 171.0 646.0 567 75967.0
Note: I have used a loop here, which may not be the best solution, but it should give you an idea of how to solve this.
Using Python/Pandas I am trying to transform a dataframe by creating two new columns (A and B) conditional on values from different lines (from column ID3), but from within the same group (as determined by ID1).
For each ID1 group, I want to take the ID2 value where ID3 is equal to 31 and put this value in a new column called A conditional on ID3 being a 1 or a 2. Similarly, I want to take the ID2 value where ID3 is equal to 41 and put this value in a new column called B, again conditional on ID3 being a 1 or a 2.
Assuming I have a dataframe in the following format:
import pandas as pd
df = pd.DataFrame({'ID1': (1, 1, 1, 1, 2, 2, 2), 'ID2': (151, 152, 153, 154, 261, 262, 263), 'ID3': (1, 2, 31, 41, 1, 2, 41), 'ID4': (2, 2, 1, 2, 1, 1, 2)})
print(df)
ID1 ID2 ID3 ID4
0 1 151 1 2
1 1 152 2 2
2 1 153 31 1
3 1 154 41 2
4 2 261 1 1
5 2 262 2 1
6 2 263 41 2
Post-transformation format should look like what is shown below. Where columns A and B are populated with values from ID2, conditional on values within ID3.
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153 154
1 1 152 2 2 153 154
2 1 153 31 1
3 1 154 41 2
4 2 261 1 1
5 2 262 2 1 263
6 2 263 41 2 263
I have attempted using what is shown below, but transform retains the same number of values as the original dataset. This poses a problem for the lines in which ID3 = 31 or 41. Also, it returns the original ID2 value by default if there is no row with ID3 equal to 31 within the group.
df['A'] = df.groupby('ID1')['ID2'].transform(lambda x: x.loc[df['ID3'] == 31])
df['B'] = df.groupby('ID1')['ID2'].transform(lambda x: x.loc[df['ID3'] == 41])
Result:
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153 154
1 1 152 2 2 153 154
2 1 153 31 1 153 154
3 1 154 41 2 153 154
4 2 261 1 1 261 263
5 2 262 2 1 262 263
6 2 263 41 2 263 263
Any suggestions? Thank you in advance!
In no way do I think this is the best solution, but it is a solution.
You can replace .loc with .where, which returns NaN wherever the condition is not true. Then backfill the NaNs, and filter again with .where on ID3 being 1 or 2:
df['A'] = df.groupby('ID1')['ID2'].transform(
    lambda x: x.where(df.ID3 == 31).bfill().where(df.ID3.isin([1, 2])))
df['B'] = df.groupby('ID1')['ID2'].transform(
    lambda x: x.where(df.ID3 == 41).bfill().where(df.ID3.isin([1, 2])))
ID1 ID2 ID3 ID4 A B
0 1 151 1 2 153.0 154.0
1 1 152 2 2 153.0 154.0
2 1 153 31 1 NaN NaN
3 1 154 41 2 NaN NaN
4 2 261 1 1 NaN 263.0
5 2 262 2 1 NaN 263.0
6 2 263 41 2 NaN NaN
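Another possible approach (not from the original answer) is to build a per-group lookup of the ID2 value sitting on the ID3 == 31 (or 41) row and map it back onto the group; this assumes there is at most one such row per ID1 value, and the helper name id2_lookup is just illustrative:
def id2_lookup(df, id3_value):
    # Series indexed by ID1: the ID2 value of the row where ID3 == id3_value.
    s = df.loc[df['ID3'] == id3_value].set_index('ID1')['ID2']
    # Map it onto every row, then keep it only where ID3 is 1 or 2.
    return df['ID1'].map(s).where(df['ID3'].isin([1, 2]))

df['A'] = id2_lookup(df, 31)
df['B'] = id2_lookup(df, 41)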
I have two pandas dataframes: matches with columns (match_id, team_id, date, ...) and teams_att with columns (id, team_id, date, overall_rating, ...).
I want to join the two dataframes on matches.team_id = teams_att.team_id, with teams_att.date closest to matches.date.
Example
matches
match_id team_id date
1 101 2012-05-17
2 101 2014-07-11
3 102 2010-05-21
4 102 2017-10-24
teams_att
id team_id date overall_rating
1 101 2010-02-22 67
2 101 2011-02-22 69
3 101 2012-02-20 73
4 101 2013-09-17 79
5 101 2014-09-10 74
6 101 2015-08-30 82
7 102 2015-03-21 42
8 102 2016-03-22 44
Desired results
match_id team_id matches.date teams_att.date overall_rating
1 101 2012-05-17 2012-02-20 73
2 101 2014-07-11 2014-09-10 74
3 102 2010-05-21 2015-03-21 42
4 102 2017-10-24 2016-03-22 44
You can use merge_asof with by and direction parameters:
pd.merge_asof(matches.sort_values('date'),
teams_att.sort_values('date'),
on='date', by='team_id',
direction='nearest')
Output:
match_id team_id date id overall_rating
0 3 102 2010-05-21 7 42
1 1 101 2012-05-17 3 73
2 2 101 2014-07-11 5 74
3 4 102 2017-10-24 8 44
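If you also want to keep teams_att's own date in the result (as in the desired output above), one possible tweak is to copy it into a separate column before merging; the column name rating_date is just illustrative, and both date columns are assumed to already be datetime (e.g. via pd.to_datetime), which merge_asof requires:
pd.merge_asof(matches.sort_values('date'),
              teams_att.sort_values('date').assign(rating_date=lambda d: d['date']),
              on='date', by='team_id',
              direction='nearest')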
We are using merge_asof per group here (please check Scott's answer, that is the right way to solve this type of problem :-) cheers).
# Here df is the matches frame and df1 is the teams_att frame.
g1 = df1.groupby('team_id')
g = df.groupby('team_id')
l = []
for x in [101, 102]:
    l.append(pd.merge_asof(g.get_group(x), g1.get_group(x), on='date', direction='nearest'))
pd.concat(l)
Out[405]:
match_id team_id_x date id team_id_y overall_rating
0 1 101 2012-05-17 3 101 73
1 2 101 2014-07-11 5 101 74
0 3 102 2010-05-21 7 102 42
1 4 102 2017-10-24 8 102 44