Python pandas: pad rows with missing/skipped date - python-3.x

I have the following data frame:
date my_count
--------------------------
2017-01-01 6
2017-01-04 5
2017-01-05 3
2017-01-08 8
I would like to pad the skipped dates with my_count = 0, so the padded data frame will look like this:
date my_count
--------------------------
2017-01-01 6
2017-01-02 0
2017-01-03 0
2017-01-04 5
2017-01-05 3
2017-01-06 0
2017-01-07 0
2017-01-08 8
Other than checking the data frame line by line, is there a more elegant way to do this? Thanks!

1st option: resample
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.resample('D').sum().fillna(0).reset_index())
date my_count
0 2017-01-01 6.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0
2nd option: reindex by date_range
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.reindex(pd.date_range('2017-01-01', '2017-01-08')).fillna(0))
my_count
2017-01-01 6.0
2017-01-02 0.0
2017-01-03 0.0
2017-01-04 5.0
2017-01-05 3.0
2017-01-06 0.0
2017-01-07 0.0
2017-01-08 8.0
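For reference, here is a self-contained sketch of both options (the snippets above assume df already holds the sample data):
import pandas as pd

df = pd.DataFrame({'date': ['2017-01-01', '2017-01-04', '2017-01-05', '2017-01-08'],
                   'my_count': [6, 5, 3, 8]})
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# option 1: daily resample; missing days become empty bins, zero-filled
padded1 = df.resample('D').sum().fillna(0).reset_index()

# option 2: reindex against the full daily range spanned by the data
padded2 = df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)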

If the values of the DatetimeIndex are unique, you can use asfreq, or reindex by the min and max values of the index (or by its first and last values, if the DatetimeIndex is sorted):
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.asfreq('D', fill_value=0).reset_index())
date my_count
0 2017-01-01 6
1 2017-01-02 0
2 2017-01-03 0
3 2017-01-04 5
4 2017-01-05 3
5 2017-01-06 0
6 2017-01-07 0
7 2017-01-08 8
rng = pd.date_range(df.index.min(), df.index.max())
#alternative
#rng = pd.date_range(df.index[0], df.index[-1])
print(df.reindex(rng, fill_value=0).rename_axis('date').reset_index())
date my_count
0 2017-01-01 6
1 2017-01-02 0
2 2017-01-03 0
3 2017-01-04 5
4 2017-01-05 3
5 2017-01-06 0
6 2017-01-07 0
7 2017-01-08 8
If the DatetimeIndex is not unique, reindexing fails with:
ValueError: cannot reindex from a duplicate axis
Then you need resample with some aggregate function like mean, or groupby with Grouper, and finally replace the NaNs with fillna:
print (df)
date my_count
0 2017-01-01 4 <-duplicate date
1 2017-01-01 6 <-duplicate date
2 2017-01-04 5
3 2017-01-05 3
4 2017-01-08 8
df['date'] = pd.to_datetime(df['date'])
print(df.resample('D', on='date')['my_count'].mean().fillna(0).reset_index())
date my_count
0 2017-01-01 5.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0
df = df.set_index('date')
print(df.groupby(pd.Grouper(freq='D'))['my_count'].mean().fillna(0).reset_index())
date my_count
0 2017-01-01 5.0
1 2017-01-02 0.0
2 2017-01-03 0.0
3 2017-01-04 5.0
4 2017-01-05 3.0
5 2017-01-06 0.0
6 2017-01-07 0.0
7 2017-01-08 8.0

Related

How to bucket pandas timeseries and apply complex groupby

I have a timeseries dataframe like below
ts_ms           a    b    c   x  y  z
1614772770705.  10.  10.  4.  1  2  3
1614772770800.  10.  10.  2.  1  2  4
1614772770750.  10.  5.   4.  1  2  3
I need to create 5-minute buckets and then apply the dataframe equivalent of the SQL below:
select sum(x), sum(y), sum(z)
group by a, b, c
What I have so far is
#convert to datetimes
df['ts_date'] = pd.to_datetime(df['ts_ms'])
# create bucket
df.set_index('ts_date').groupby(pd.Grouper(freq='5Min'))
But I am not sure how to apply the SQL equivalent to this dataframe after this point.
Please suggest.
If you need to group by 5Min buckets together with the a, b, c columns, use a single DataFrame.groupby:
df['ts_date'] = pd.to_datetime(df['ts_ms'])
df1 = df.groupby(['a','b','c',pd.Grouper(freq='5Min',key='ts_date')])[["x", "y", "z"]].sum()
print (df1)
                                    x   y   z
a    b    c   ts_date
10.0 5.0  4.0 1970-01-01 00:25:00   7   8   9
              1970-01-01 00:45:00   7   8   9
     10.0 2.0 1970-01-01 00:25:00   8  10  12
          4.0 1970-01-01 00:25:00   2   4   6
Or it is possible to use DataFrame.groupby with DataFrame.resample by 5Min:
df['ts_date'] = pd.to_datetime(df['ts_ms'])
df2 = df.set_index('ts_date').groupby(['a','b','c'])[["x", "y", "z"]].resample('5Min').sum()
print (df2)
                                    x   y   z
a    b    c   ts_date
10.0 5.0  4.0 1970-01-01 00:25:00   7   8   9
              1970-01-01 00:30:00   0   0   0
              1970-01-01 00:35:00   0   0   0
              1970-01-01 00:40:00   0   0   0
              1970-01-01 00:45:00   7   8   9
     10.0 2.0 1970-01-01 00:25:00   8  10  12
          4.0 1970-01-01 00:25:00   2   4   6
Setup:
# data.csv
ts_ms,a,b,c,x,y,z
1614772770705.,10.,10.,4.,1,2,3
1614772770800.,10.,10.,2.,4,5,6
1614772770750.,10.,5.,4.,7,8,9
1614772770805.,10.,10.,4.,1,2,3
1614772770900.,10.,10.,2.,4,5,6
2714772770850.,10.,5.,4.,7,8,9
Code:
import pandas as pd

def func(grp):
    return grp.groupby(pd.Grouper(freq='5Min'))[["x", "y", "z"]].sum()

df = pd.read_csv("data.csv")
df['ts_date'] = pd.to_datetime(df['ts_ms'])
df.set_index('ts_date', inplace=True)
df.groupby(["a", "b", "c"]).apply(func)
Outputs:
                                    x   y   z
a    b    c   ts_date
10.0 5.0  4.0 1970-01-01 00:25:00   7   8   9
              1970-01-01 00:30:00   0   0   0
              1970-01-01 00:35:00   0   0   0
              1970-01-01 00:40:00   0   0   0
              1970-01-01 00:45:00   7   8   9
     10.0 2.0 1970-01-01 00:25:00   8  10  12
          4.0 1970-01-01 00:25:00   2   4   6
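One caveat: pd.to_datetime(df['ts_ms']) interprets the bare numbers as nanoseconds since the epoch, which is why every bucket above lands in 1970. If ts_ms really holds epoch milliseconds (as the name suggests), pass the unit explicitly; the grouping logic stays the same, only the bucket labels become real 2021 timestamps:
df['ts_date'] = pd.to_datetime(df['ts_ms'], unit='ms')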

python 3.6 pandas conditionally filling missing values

If there is a dataframe:
import pandas as pd
import numpy as np
users = pd.DataFrame(
    [
        {'id': 1, 'date': '01/01/2019', 'transaction_total': -1, 'balance_total': 102},
        {'id': 1, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 1, 'date': '01/03/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 1, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 1, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan},
        {'id': 2, 'date': '01/01/2019', 'transaction_total': -2, 'balance_total': 200},
        {'id': 2, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 2, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': np.nan},
        {'id': 2, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan}
    ]
)
print(users[['id','date','balance_total','transaction_total']])
Dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 NaN NaN
3 1 01/04/2019 NaN NaN
4 1 01/05/2019 NaN -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 NaN NaN
8 2 01/05/2019 NaN -4.0
How can I do the following?
If both transaction_total and balance_total are NaN, fill in the previous date's balance_total (e.g. in row 3 where id=1, since user 1's transaction_total and balance_total are NaN, fill in 100 from 01/02/2019; the same goes for row 4, which gets 100 from 01/03/2019).
If transaction_total is NOT NaN but balance_total is NaN, add the current row's transaction_total to the previous date's balance_total.
Taking user 1 on 01/05/2019 as an example: the balance total will be 100 + (-4), where 100 is 01/04/2019's balance total and (-4) is 01/05/2019's transaction total.
Desired output:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
Here is my code, but it doesn't work. I think I couldn't figure out how to express "when a row is null, do something" in pandas.
for i, row in df.iterrows():
    if(pd.isnull(row['transaction_total'] is True)):
        if(pd.isnull(row['balance_total'] is True)):
            df.loc[i, 'transaction_total'] = df.loc[i-1, 'transaction_total']
Could someone enlighten me?
IIUC, first create a dummy series with ffill, and then use np.where:
s = df["balance_total"].ffill()
df["balance_total"] = np.where(df["balance_total"].isnull()&df["transaction_total"].notnull(),
s.add(df["transaction_total"]), s)
print (df)
id date transaction_total balance_total
0 1 01/01/2019 -1.0 102.0
1 1 01/02/2019 -2.0 100.0
2 1 01/03/2019 NaN 100.0
3 1 01/04/2019 NaN 100.0
4 1 01/05/2019 -4.0 96.0
5 2 01/01/2019 -2.0 200.0
6 2 01/02/2019 -2.0 100.0
7 2 01/04/2019 NaN 100.0
8 2 01/05/2019 -4.0 96.0
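One caveat: the plain ffill works here because each id's first balance_total is present, so nothing is carried across ids in this sample. If that is not guaranteed, a grouped fill is safer; a sketch of the same idea:
s = df.groupby("id")["balance_total"].ffill()
df["balance_total"] = np.where(df["balance_total"].isnull() & df["transaction_total"].notnull(),
                               s.add(df["transaction_total"]), s)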

KeyError: 'Dayofweek'

It seems the fastai library is not working on my Python setup. However, I have tried to add a feature using the following lines of code, with the objective of identifying whether a given day is Monday/Friday or Tuesday/Wednesday/Thursday.
The code is as follows:
data['mon_fri'] = 0
for i in range(0, len(data)):
    if (data['Dayofweek'][i] == 0 or data['Dayofweek'][i] == 4):
        data['mon_fri'][i] = 1
    else:
        data['mon_fri'][i] = 0
When I run it, I get the following error:
KeyError: 'Dayofweek'
Can anyone help me on this?
New answer showing the entire dataframe.
In [51]: df = pd.DataFrame({'col1':range(9)})
In [52]: df['d'] = pd.date_range('2016-12-31','2017-01-08',freq='D')
In [53]: df
Out[53]:
col1 d
0 0 2016-12-31
1 1 2017-01-01
2 2 2017-01-02
3 3 2017-01-03
4 4 2017-01-04
5 5 2017-01-05
6 6 2017-01-06
7 7 2017-01-07
8 8 2017-01-08
Now add a column for the day of the week:
In [54]: df['dow'] = df['d'].dt.dayofweek
In [55]: df
Out[55]:
col1 d dow
0 0 2016-12-31 5
1 1 2017-01-01 6
2 2 2017-01-02 0
3 3 2017-01-03 1
4 4 2017-01-04 2
5 5 2017-01-05 3
6 6 2017-01-06 4
7 7 2017-01-07 5
8 8 2017-01-08 6
Finally do the calculation: 1 for Tuesday/Friday (dayofweek 1 or 4), 0 for other days.
In [56]: df['feature'] = df['dow'].apply(lambda x: int((x==1) or (x==4)))
In [57]: df
Out[57]:
col1 d dow feature
0 0 2016-12-31 5 0
1 1 2017-01-01 6 0
2 2 2017-01-02 0 0
3 3 2017-01-03 1 1
4 4 2017-01-04 2 0
5 5 2017-01-05 3 0
6 6 2017-01-06 4 1
7 7 2017-01-07 5 0
8 8 2017-01-08 6 0
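The KeyError itself simply means the frame has no column named 'Dayofweek', so it must be created before the loop runs. A minimal sketch of the original Monday/Friday check (dayofweek 0 or 4), vectorized; the 'Date' column name here is an assumption, swap in whatever datetime column your data has:
data['Dayofweek'] = data['Date'].dt.dayofweek  # 'Date' is a hypothetical datetime column
data['mon_fri'] = ((data['Dayofweek'] == 0) | (data['Dayofweek'] == 4)).astype(int)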
Assuming you are using pandas, you can just use the built-in dayofweek attribute:
In [32]: d = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
In [33]: d
Out[33]:
2016-12-31 2016-12-31
2017-01-01 2017-01-01
2017-01-02 2017-01-02
2017-01-03 2017-01-03
2017-01-04 2017-01-04
2017-01-05 2017-01-05
2017-01-06 2017-01-06
2017-01-07 2017-01-07
2017-01-08 2017-01-08
Freq: D, dtype: datetime64[ns]
In [34]: s = (d.dt.dayofweek==1) |(d.dt.dayofweek==4)
In [35]: s
Out[35]:
2016-12-31 False
2017-01-01 False
2017-01-02 False
2017-01-03 True
2017-01-04 False
2017-01-05 False
2017-01-06 True
2017-01-07 False
2017-01-08 False
Freq: D, dtype: bool
Then convert to 1/0 simply with:
In [39]: t = s.apply(lambda x: int(x==True))
In [40]: t
Out[40]:
2016-12-31 0
2017-01-01 0
2017-01-02 0
2017-01-03 1
2017-01-04 0
2017-01-05 0
2017-01-06 1
2017-01-07 0
2017-01-08 0
Freq: D, dtype: int64
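Side note: the apply is only casting booleans to integers, so an equivalent and shorter spelling is:
t = s.astype(int)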

diagonally subtract different columns in python

I want to create another column in the dataframe holding a difference value, calculated by subtracting values from different rows of different columns within each unique date.
I tried looking through various Stack Overflow links but didn't find the answer.
The difference should be the ATA of the 2nd row minus the ATD of the 1st row, and so on, but only within the same date; the ATA of 1st January cannot be subtracted from the ATD of 2nd January.
For example:
The difference column's first value should be NaN.
The second value should be 50 minutes (17:13:00 - 16:23:00).
But the ATD of 02-01-2019 should not be combined with the ATA of 01-01-2019.
You want to apply a shift grouped by Date and then subtract the shifted column from ATD:
>>> df = pd.DataFrame({'ATA':range(0,365),'ATD':range(10,375),'Date':pd.date_range(start="2018-01-01",end="2018-12-31")})
>>> df['ATD'] = df['ATD']/6.0
>>> df = pd.concat([df,df,df,df])
>>> df['shifted_ATA'] = df.groupby('Date')['ATA'].transform('shift')
>>> df['result'] = df['ATD'] - df['shifted_ATA']
>>> df = df.sort_values(by='Date', ascending=[1])
>>> df.head(20)
ATA ATD Date shifted_ATA result
0 0 1.666667 2018-01-01 NaN NaN
0 0 1.666667 2018-01-01 0.0 1.666667
0 0 1.666667 2018-01-01 0.0 1.666667
0 0 1.666667 2018-01-01 0.0 1.666667
1 1 1.833333 2018-01-02 NaN NaN
1 1 1.833333 2018-01-02 1.0 0.833333
1 1 1.833333 2018-01-02 1.0 0.833333
1 1 1.833333 2018-01-02 1.0 0.833333
2 2 2.000000 2018-01-03 2.0 0.000000
2 2 2.000000 2018-01-03 NaN NaN
2 2 2.000000 2018-01-03 2.0 0.000000
2 2 2.000000 2018-01-03 2.0 0.000000
3 3 2.166667 2018-01-04 3.0 -0.833333
3 3 2.166667 2018-01-04 3.0 -0.833333
3 3 2.166667 2018-01-04 NaN NaN
3 3 2.166667 2018-01-04 3.0 -0.833333
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 4.0 -1.666667
4 4 2.333333 2018-01-05 NaN NaN
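Mapping this back to the question's columns (a sketch with made-up timestamps, since the original frame is not shown): parse ATA and ATD as datetimes, shift ATD within each Date group, and subtract. This reproduces the 50-minute example and yields NaT for each date's first row:
import pandas as pd

df = pd.DataFrame({'Date': ['01-01-2019', '01-01-2019', '02-01-2019', '02-01-2019'],
                   'ATA':  ['16:00:00', '17:13:00', '09:00:00', '10:05:00'],
                   'ATD':  ['16:23:00', '17:45:00', '09:20:00', '10:30:00']})
df['ATA'] = pd.to_datetime(df['Date'] + ' ' + df['ATA'], dayfirst=True)
df['ATD'] = pd.to_datetime(df['Date'] + ' ' + df['ATD'], dayfirst=True)

# ATA of each row minus ATD of the previous row, restarting at every new date
df['difference'] = df['ATA'] - df.groupby('Date')['ATD'].shift()
print(df['difference'])  # NaT, 0 days 00:50:00, NaT, 0 days 00:45:00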

Calculating rolling sum in a pandas dataframe on the basis of 2 variable constraints

I want to create a variable, SumOfPrevious5OccurencesAtIDLevel, which is the sum of the previous 5 values of Var1 (ordered by the Date variable) at an ID level (column 1); where there are fewer than 5 previous values it should take the value NA.
Sample Data and Output:
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel
1 1/1/2018 0 NA
1 1/2/2018 1 NA
1 1/3/2018 2 NA
1 1/4/2018 3 NA
2 1/1/2018 4 NA
2 1/2/2018 5 NA
2 1/3/2018 6 NA
2 1/4/2018 7 NA
2 1/5/2018 8 NA
2 1/6/2018 9 30
2 1/7/2018 10 35
2 1/8/2018 11 40
Use groupby with transform and the rolling and shift functions:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
# if not already sorted by ID and datetimes, sort first
df = df.sort_values(['ID','Date'])
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.rolling(5).sum().shift())
print (df)
ID Date Var1 SumOfPrevious5OccurencesAtIDLevel new
0 1 2018-01-01 0 NaN NaN
1 1 2018-01-02 1 NaN NaN
2 1 2018-01-03 2 NaN NaN
3 1 2018-01-04 3 NaN NaN
4 2 2018-01-01 4 NaN NaN
5 2 2018-01-02 5 NaN NaN
6 2 2018-01-03 6 NaN NaN
7 2 2018-01-04 7 NaN NaN
8 2 2018-01-05 8 NaN NaN
9 2 2018-01-06 9 30.0 30.0
10 2 2018-01-07 10 35.0 35.0
11 2 2018-01-08 11 40.0 40.0
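An equivalent formulation shifts first and then takes the rolling sum, which reads closer to "sum of the previous 5 values"; both produce the same column:
df['new'] = df.groupby('ID')['Var1'].transform(lambda x: x.shift().rolling(5).sum())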
