Python: Subtracting two date columns from a CSV to get the number of weeks or months - python-3.x

I have a CSV with two columns, a start date `st_dt` and an end date `end_dt`. I have to subtract these columns to get the number of weeks. I tried iterating through the columns using pandas, but my output seems wrong.
st_dt end_dt
---------------------------------------
20100315 20100431

Use read_csv with parse_dates to parse the datetimes, then subtract the columns and take the day difference:
df = pd.read_csv(file, parse_dates=[0,1])
print (df)
st_dt end_dt
0 2010-03-15 2010-04-30
df['diff'] = (df['end_dt'] - df['st_dt']).dt.days
print (df)
st_dt end_dt diff
0 2010-03-15 2010-04-30 46
If some dates are invalid, like 20100431, use to_datetime with errors='coerce' to convert them to NaT:
df = pd.read_csv(file)
print (df)
st_dt end_dt
0 20100315 20100431
1 20100315 20100430
df['st_dt'] = pd.to_datetime(df['st_dt'], errors='coerce', format='%Y%m%d')
df['end_dt'] = pd.to_datetime(df['end_dt'], errors='coerce', format='%Y%m%d')
df['diff'] = (df['end_dt'] - df['st_dt']).dt.days
print (df)
st_dt end_dt diff
0 2010-03-15 NaT NaN
1 2010-03-15 2010-04-30 46.0
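Since the question asks for weeks, the day difference can simply be divided by 7 (a minimal sketch reusing the same column names; use integer division `// 7` if only whole weeks are wanted):

```python
import pandas as pd

df = pd.DataFrame({'st_dt': ['20100315'], 'end_dt': ['20100430']})
df['st_dt'] = pd.to_datetime(df['st_dt'], format='%Y%m%d')
df['end_dt'] = pd.to_datetime(df['end_dt'], format='%Y%m%d')

# day difference, then weeks as a float
df['days'] = (df['end_dt'] - df['st_dt']).dt.days
df['weeks'] = df['days'] / 7
```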

Related

Formatting date from a dataframe of strings

I have data that looks like this:
0 2/18/2020
1 9/30/2019
Each line is a str
I want it to keep its type (string, not datetime) but to look like this:
0 2020-02-18
1 2019-09-30
How can I achieve this format for all rows in the dataframe?
Thank you!
Use split to separate the MM, DD, and YYYY and then change the order of the lists as preferred. Make sure single digit months and days receive a leading zero, before finally joining the lists back into one string using - instead of /:
df = pd.DataFrame({'col': ['2/18/2020', '9/30/2019']})
df['col'] = (df.col
    .apply(lambda x: x.split('/')[2:] + x.split('/')[:2])
    .apply(lambda x: x[0:1]
           + (x[1:2] if len(x[1]) > 1 else ['0' + x[1]])
           + (x[2:3] if len(x[2]) > 1 else ['0' + x[2]]))
    .apply(lambda x: '-'.join(x)))
df
col
0 2020-02-18
1 2019-09-30
You can use:
import pandas as pd
df = pd.DataFrame({'date': ['2/18/2020', '9/30/2019']})
df['date'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')
print(df)
Result:
date
0 2020-02-18
1 2019-09-30
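To avoid any month/day ambiguity when parsing, the format can also be given explicitly (a sketch assuming the strings are always M/D/YYYY):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2/18/2020', '9/30/2019']})
# parse with an explicit month/day/year format, then format back to a string
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y').dt.strftime('%Y-%m-%d')
print(df['date'].tolist())  # → ['2020-02-18', '2019-09-30']
```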

Handling dates with mix of two and four digit years in python

I have a DataFrame df:
A B
5/4/2018 8/4/2018
24/5/15 26/5/15
21/7/16 22/7/16
3/7/2015 5/7/2015
1/7/2016 1/7/2016
I want to calculate the difference of days for each row.
for example:
A B C
5/4/2018 8/4/2018 3
24/5/15 26/5/15 2
I tried to convert the DataFrame to datetime using pd.to_datetime, but I get the error "ValueError: unconverted data remains: 18".
I tried the following code:
import datetime as dt
df['A'] = pd.to_datetime(df['A'], format = "%d/%m/%y").datetime.datetime.strftime("%Y-%m-%d")
df['B'] = pd.to_datetime(df['B'], format = "%d/%m/%y").datetime.datetime.strftime("%Y-%m-%d")
df['C'] = (df['B'] - df['A']).dt.days
Note: using Python 3.7.
Try:
df['A'] = pd.to_datetime(df['A'], dayfirst=True)
df['B'] = pd.to_datetime(df['B'], dayfirst=True)
df['C'] = (df['B'] - df['A']).dt.days
Output:
A B C
0 2018-04-05 2018-04-08 3
1 2015-05-24 2015-05-26 2
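Newer pandas versions (2.x) are strict about a single format per column and can raise on the mix of 2- and 4-digit years; parsing each value on its own sidesteps that. A sketch assuming all dates are day-first:

```python
import pandas as pd

df = pd.DataFrame({'A': ['5/4/2018', '24/5/15'], 'B': ['8/4/2018', '26/5/15']})
for col in ['A', 'B']:
    # parse each cell separately so 2- and 4-digit years can mix freely
    df[col] = df[col].apply(lambda s: pd.to_datetime(s, dayfirst=True))
df['C'] = (df['B'] - df['A']).dt.days
print(df['C'].tolist())  # → [3, 2]
```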

Pandas timestamp

I'd like to group my data per day and calculate the daily mean of the sentiment.
My problem with the pandas DataFrame is that I am not able to transform my date column into a timestamp so that I can use the groupby() function. Here is my data sample:
sentiment date
0 1 2018-01-01 07:37:07+00:00
1 0 2018-02-12 06:57:27+00:00
2 -1 2018-09-18 06:23:07+00:00
3 1 2018-09-18 07:23:10+00:00
4 0 2018-02-12 06:21:08+00:00
I think you need resample - it creates a full DatetimeIndex:
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('D',on='date')['sentiment'].mean()
#if want remove NaNs rows
df1 = df.resample('D',on='date')['sentiment'].mean().dropna()
Or use groupby and aggregate the mean, grouping by the dates or by floor to remove the times:
df2 = df.groupby(df['date'].dt.date)['sentiment'].mean()
#DatetimeIndex in output
df2 = df.groupby(df['date'].dt.floor('d'))['sentiment'].mean()
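A pd.Grouper gives the same daily grouping without touching the index (a minimal sketch on the sample data; dropna removes the empty days the grouper bins in between):

```python
import pandas as pd

df = pd.DataFrame({
    'sentiment': [1, 0, -1, 1, 0],
    'date': ['2018-01-01 07:37:07+00:00', '2018-02-12 06:57:27+00:00',
             '2018-09-18 06:23:07+00:00', '2018-09-18 07:23:10+00:00',
             '2018-02-12 06:21:08+00:00'],
})
df['date'] = pd.to_datetime(df['date'])
# group by calendar day and drop days that have no rows
daily = df.groupby(pd.Grouper(key='date', freq='D'))['sentiment'].mean().dropna()
```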

Month difference YYYYMM Pandas

I had two date columns in the data frame that were of float type, so I converted them into the date format YYYYMM. Now I have to find the difference in months between
them. I tried the below, but it gives an error.
df['Date_1'] = pd.to_datetime(df['Date_1'], format = '%Y%m%d').dt.strftime('%Y%m') #Convert float to YYYYMM Format
df['Date_2'] = pd.to_datetime(df['Date_2'], format='%Y%m.0').dt.strftime('%Y%m') #Convert float to YYYYMM Format
df['diff'] = df['Date_1'] - df['Date_2'] #Gives error
I think you need to subtract periods created by to_period:
df = pd.DataFrame({'Date_1': [20150810, 20160804],
                   'Date_2': [201505.0, 201602.0]})
print (df)
Date_1 Date_2
0 20150810 201505.0
1 20160804 201602.0
df['Date_1'] = pd.to_datetime(df['Date_1'], format = '%Y%m%d').dt.to_period('m')
df['Date_2'] = pd.to_datetime(df['Date_2'], format='%Y%m.0').dt.to_period('m')
df['diff'] = df['Date_1'] - df['Date_2']
print (df)
Date_1 Date_2 diff
0 2015-08 2015-05 3
1 2016-08 2016-02 6
Another solution is to convert Date_1 to the first day of the month:
df['Date_1'] = pd.to_datetime(df['Date_1'], format = '%Y%m%d') - pd.offsets.MonthBegin()
df['Date_2'] = pd.to_datetime(df['Date_2'], format='%Y%m.0')
df['diff'] = df['Date_1'] - df['Date_2']
print (df)
Date_1 Date_2 diff
0 2015-08-01 2015-05-01 92 days
1 2016-08-01 2016-02-01 182 days
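Note that in newer pandas versions subtracting two Period columns returns offset objects rather than plain integers; plain year/month arithmetic sidesteps that. A sketch, converting the float column through int to drop the '.0' suffix:

```python
import pandas as pd

df = pd.DataFrame({'Date_1': [20150810, 20160804],
                   'Date_2': [201505.0, 201602.0]})
d1 = pd.to_datetime(df['Date_1'].astype(str), format='%Y%m%d')
d2 = pd.to_datetime(df['Date_2'].astype(int).astype(str), format='%Y%m')
# whole-month difference without Period arithmetic
df['diff'] = (d1.dt.year - d2.dt.year) * 12 + (d1.dt.month - d2.dt.month)
print(df['diff'].tolist())  # → [3, 6]
```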

roll off profile stacking data frames

I have a dataframe that looks like:
import pandas as pd
import datetime as dt
df= pd.DataFrame({'date':['2017-12-31','2017-12-31'],'type':['Asset','Liab'],'Amount':[100,-100],'Maturity Date':['2019-01-02','2018-01-01']})
df
I am trying to build a roll-off profile by checking if the 'Maturity Date' is greater than a 'date' in the future. I am trying to achieve something like:
#First Month
df1=df[df['Maturity Date']>'2018-01-31']
df1['date']='2018-01-31'
#Second Month
df2=df[df['Maturity Date']>'2018-02-28']
df2['date']='2018-02-28'
#third Month
df3=df[df['Maturity Date']>'2018-03-31']
df3['date']='2018-03-31'
#first quarter
qf1=df[df['Maturity Date']>'2018-06-30']
qf1['date']='2018-06-30'
#concatenate
df=pd.concat([df,df1,df2,df3,qf1])
df
I was wondering if there is a way to allow an arbitrary number of dates without repeating code.
I think you need numpy.tile to repeat the index and assign it to a new column, then filter by boolean indexing and sort with sort_values:
import numpy as np

d = '2017-12-31'
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])
#generate first month and next quarters
c1 = pd.date_range(d, periods=4, freq='M')
c2 = pd.date_range(c1[-1], periods=2, freq='Q')
#join together
c = c1.union(c2[1:])
#repeat rows be indexing repeated index
df1 = df.loc[np.tile(df.index, len(c))].copy()
#assign column by datetimes
df1['date'] = np.repeat(c, len(df))
#filter by boolean indexing
df1 = df1[df1['Maturity Date'] > df1['date']]
print (df1)
Amount Maturity Date date type
0 100 2019-01-02 2017-12-31 Asset
1 -100 2018-01-01 2017-12-31 Liab
0 100 2019-01-02 2018-01-31 Asset
0 100 2019-01-02 2018-02-28 Asset
0 100 2019-01-02 2018-03-31 Asset
0 100 2019-01-02 2018-06-30 Asset
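The same roll-off can also be written as a plain loop over the dates, which answers the "arbitrary number of dates" part directly (a sketch using the question's sample frame):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2017-12-31', '2017-12-31'],
                   'type': ['Asset', 'Liab'],
                   'Amount': [100, -100],
                   'Maturity Date': ['2019-01-02', '2018-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])

dates = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31', '2018-06-30'])
frames = [df]
for d in dates:
    # keep only positions that have not matured by d, stamped with that date
    step = df[df['Maturity Date'] > d].copy()
    step['date'] = d
    frames.append(step)
out = pd.concat(frames)
```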
You could use a nifty tool in the Pandas arsenal called pd.merge_asof. It works similarly to pd.merge, except that it matches on "nearest" keys rather than equal keys. Furthermore, you can tell pd.merge_asof to look for nearest keys in only the backward or forward direction.
To make things interesting (and help check that things are working properly), let's add another row to df:
df = pd.DataFrame({'date':['2017-12-31', '2017-12-31'],'type':['Asset', 'Asset'],'Amount':[100,200],'Maturity Date':['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
    df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
print(df)
# Amount Maturity Date date type
# 1 200 2018-03-15 2017-12-31 Asset
# 0 100 2019-01-02 2017-12-31 Asset
Now define some new dates:
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
         .union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
# date
# 0 2018-01-31
# 1 2018-02-28
# 2 2018-03-31
# 3 2018-06-30
Now we can merge rows, matching nearest dates from result with Maturity Dates from df:
result = pd.merge_asof(result, df.drop('date', axis=1),
                       left_on='date', right_on='Maturity Date',
                       direction='forward')
In this case we want to "match" dates with Maturity Dates which are greater, so we use direction='forward'.
Putting it all together:
import pandas as pd
df = pd.DataFrame({'date':['2017-12-31', '2017-12-31'],'type':['Asset', 'Asset'],'Amount':[100,200],'Maturity Date':['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
    df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
         .union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
result = pd.merge_asof(result, df.drop('date', axis=1),
                       left_on='date', right_on='Maturity Date',
                       direction='forward')
result = pd.concat([df, result], axis=0)
result = result.sort_values(by=['Maturity Date', 'date'])
print(result)
yields
Amount Maturity Date date type
1 200 2018-03-15 2017-12-31 Asset
0 200 2018-03-15 2018-01-31 Asset
1 200 2018-03-15 2018-02-28 Asset
0 100 2019-01-02 2017-12-31 Asset
2 100 2019-01-02 2018-03-31 Asset
3 100 2019-01-02 2018-06-30 Asset