Round Pandas date to nearest year/month

I am trying to round a pandas datetime column to its nearest year or month but I cannot figure out how to do it. For instance, this minimal example rounds to the closest hour:
pd.Timestamp.now().round('60min')
What I'd like is a way to replace the '60min' in order to round pd.Timestamp.now() to obtain either 2020-01-01 (for the year case) or 2019-08-01 (for the month case) (note that now() is exactly 2019-07-30 16:41:23.612004 at the time of asking!).
The pandas.Series.dt.round docs suggest a freq argument linking to this page, but trying the month/year options listed there returns this error:
ValueError: is a non-fixed frequency
Any idea what I am missing?

If the column really is a datetime column (check with df.dtypes), you can get the year, month and day with the code below.
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['round_Year'] = df['Date'] + pd.offsets.YearBegin(-1)
This rolls each date back to the start of its year, keeping the time of day (a date that already falls on January 1st rolls back to the previous year). Changing -1 to 0 rolls forward to the start of the next year instead. Neither variant picks the nearest boundary.
df['round_Month'] = df['Date'] + pd.offsets.MonthBegin(-1)
This rolls each date back to the start of its month. Changing -1 to 0 rolls forward to the start of the next month.
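Adding an offset only moves dates in one direction, so it does not by itself give the nearest boundary. Below is a minimal sketch of nearest rounding, assuming a tz-naive datetime64 column. The helper name round_to_nearest is made up for this example and is not part of pandas; it compares the distance to the start of the current period with the distance to the start of the next one.

```python
import pandas as pd

def round_to_nearest(dates: pd.Series, unit: str) -> pd.Series:
    """Round each timestamp to the nearest month ("M") or year ("Y") start."""
    # Hypothetical helper: pick the offset that steps to the next period start.
    step = pd.offsets.MonthBegin(1) if unit == "M" else pd.offsets.YearBegin(1)
    lower = dates.dt.to_period(unit).dt.start_time  # start of the current period
    upper = lower + step                            # start of the next period
    # Keep the lower bound only when it is strictly closer; ties go up.
    return lower.where(dates - lower < upper - dates, upper)

s = pd.Series(pd.to_datetime(["2019-07-30 16:41:23.612004"]))
print(round_to_nearest(s, "M")[0])  # 2019-08-01 00:00:00
print(round_to_nearest(s, "Y")[0])  # 2020-01-01 00:00:00
```

For the timestamp from the question this gives the two values the asker expected.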

Example of rounding a pandas Timestamp to the nearest half year:
from pandas import Timestamp
def round_date_to_nearest_half_year(ts: Timestamp) -> Timestamp:
    if 4 <= ts.month <= 8:
        return Timestamp(ts.year, 7, 1)
    elif ts.month >= 9:
        return Timestamp(ts.year + 1, 1, 1)
    elif ts.month <= 3:
        return Timestamp(ts.year, 1, 1)
    else:
        raise Exception("Logic error.")
Test:
print(round_date_to_nearest_half_year(Timestamp("2022-6-5")))
print(round_date_to_nearest_half_year(Timestamp("2022-7-3")))
print(round_date_to_nearest_half_year(Timestamp("2022-12-15")))
print(round_date_to_nearest_half_year(Timestamp("2023-1-5")))
Out:
2022-07-01 00:00:00
2022-07-01 00:00:00
2023-01-01 00:00:00
2023-01-01 00:00:00

Related

Get difference between two week days that are in string

Problem Statement:
I am developing a custom job scheduler that needs to run on given days. It takes a start date and an end date as strings, and the third parameter is a list of weekdays on which the job should run.
The start date can fall on a day that is not in the given list, but the first job should run on the next valid day.
Suppose the start date is 2022-09-07 (a Wednesday) but the given frequency days are ["Monday", "Friday", "Saturday"]. I then need to run my first job on the coming Friday, and for this I need to calculate the difference between the start date and the first valid day (in this case Friday).
How can I do this in Python, so that my first job runs on a valid day (which can be at any position in the given list of frequency days), and so that after one job completes I can also get the next valid day? I did some work, but unfortunately it's not working. Here is what I did:
import datetime
sorted_week_days_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
start_date = "2022-09-07"
valid_frequency_days = ["Monday", "Tuesday", "Friday"] # It can be any days in sorted order
start_date_object = datetime.datetime.strptime(start_date, "%Y-%m-%d")
given_start_day = start_date_object.strftime("%A")
if given_start_day not in valid_frequency_days:
    # Need help to implement logic to get date for valid day
    pass
You can use the datetime.weekday() method to get the day of the week for the dates of interest. Assuming your dates are in the format shown above, they are easy to convert, and you can use the day index for your "allowable start days" (Monday=0).
Then you can write a small function that looks for the next start day in your sorted list and works out how many days you need to wait.
The example below does that and also "rolls over" the weekend as needed.
Code:
from datetime import datetime, timedelta
from bisect import bisect_left
start_date = "2022-09-09"
valid_start_dates = [1, 4] # It can be any days in sorted order
start_date_object = datetime.strptime(start_date, "%Y-%m-%d")
d = start_date_object.weekday()
print(f'the numbered day of the week is: {d}')
def days_till_start(day, valid_start_days):
    idx = bisect_left(valid_start_days, day)
    if idx >= len(valid_start_days): # wrap around to next start
        return valid_start_days[0] + 7 - day
    elif day == valid_start_days[idx]:
        return 0
    else:
        return valid_start_days[idx] - day
print(days_till_start(d, valid_start_dates))
start_dates = ['2022-09-05', '2022-09-06', '2022-09-07', '2022-09-08', '2022-09-09', '2022-09-10']
start_wkdys = [datetime.strptime(d, "%Y-%m-%d").weekday() for d in start_dates]
for d in start_wkdys:
    print(f'day index is: {d}')
    print(f'next start date is {days_till_start(d, valid_start_dates)} away')
    print()
Output:
the numbered day of the week is: 4
0
day index is: 0
next start date is 1 away
day index is: 1
next start date is 0 away
day index is: 2
next start date is 2 away
day index is: 3
next start date is 1 away
day index is: 4
next start date is 0 away
day index is: 5
next start date is 3 away
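The same idea can also start from the day names used in the question and return an actual date. This is only a sketch using the standard library; next_valid_date and WEEK_DAYS are names invented for the example. It uses modular arithmetic instead of bisect, so the list of allowed days does not need to be sorted.

```python
from datetime import date, datetime, timedelta
from typing import Sequence

WEEK_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def next_valid_date(start: date, valid_day_names: Sequence[str]) -> date:
    """Return the first date on or after `start` whose weekday is allowed."""
    valid = {WEEK_DAYS.index(name) for name in valid_day_names}
    # Smallest forward distance (0..6) from start's weekday to an allowed weekday.
    wait = min((day - start.weekday()) % 7 for day in valid)
    return start + timedelta(days=wait)

frequency_days = ["Monday", "Friday", "Saturday"]
start = datetime.strptime("2022-09-07", "%Y-%m-%d").date()  # a Wednesday
first_run = next_valid_date(start, frequency_days)
print(first_run)  # 2022-09-09 (Friday)

# After a job completes, search again starting from the following day.
second_run = next_valid_date(first_run + timedelta(days=1), frequency_days)
print(second_run)  # 2022-09-10 (Saturday)
```

An empty list of allowed days raises ValueError from min(), which a real scheduler would probably want to check for up front.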

Python3 Get epoch time first and last day of month

Given a month and a year, such as 03 2022, I need to get the epoch time of the first day of the month and the last day of the month. I am not sure how to do that. Any help would be appreciated. Thank you.
You can get the beginning of the month easily by setting the day to 1.
To get the end of the month conveniently, you can calculate the first day of the next month, then go back one day.
Then set the time zone (tzinfo) to UTC to prevent Python from using local time.
Finally, call .timestamp() to get the epoch time:
import datetime
def date_to_endofmonth(
    dt: datetime.datetime, offset: datetime.timedelta = datetime.timedelta(days=1)
) -> datetime.datetime:
    """
    Roll a datetime to the end of the month.
    Parameters
    ----------
    dt : datetime.datetime
        datetime object to use.
    offset : datetime.timedelta, optional.
        Offset to the next month. The default is 1 day; datetime.timedelta(days=1).
    Returns
    -------
    datetime.datetime
        End of month datetime.
    """
    # reset day to first day of month, add one month and subtract offset duration
    return (
        datetime.datetime(
            dt.year + ((dt.month + 1) // 12), ((dt.month + 1) % 12) or 12, 1
        )
        - offset
    )
year, month = 2022, 11
# make datetime objects; make sure to set UTC
dt_month_begin = datetime.datetime(year, month, 1, tzinfo=datetime.timezone.utc)
dt_month_end = date_to_endofmonth(dt_month_begin).replace(tzinfo=datetime.timezone.utc)
ts_month_begin = dt_month_begin.timestamp()
ts_month_end = dt_month_end.timestamp()
print(ts_month_begin, ts_month_end)
# 1667260800.0 1701302400.0
I can't comment due to reputation, but @FObersteiner's answer is excellent; I would just recommend a small change.
For example, running the current code produces this for Nov 2022:
print(dt_month_begin.timestamp())
print(dt_month_begin)
print(dt_month_end.timestamp())
print(dt_month_end)
--->
1667260800.0
2022-11-01 00:00:00+00:00
1701302400.0
2023-11-30 00:00:00+00:00
Note the year field: the end of the month lands in 2023 instead of 2022.
I'd suggest the following:
import datetime
def date_to_endofmonth(
    dt: datetime.datetime, offset: datetime.timedelta = datetime.timedelta(seconds=1)
) -> datetime.datetime:
    """
    Roll a datetime to the end of the month.
    Parameters
    ----------
    dt : datetime.datetime
        datetime object to use.
    offset : datetime.timedelta, optional.
        Offset to the next month. The default is 1 second; datetime.timedelta(seconds=1).
    Returns
    -------
    datetime.datetime
        End of month datetime.
    """
    # reset day to first day of month, add one month and subtract offset duration
    return (
        datetime.datetime(
            dt.year + ((dt.month + 1) // 13), ((dt.month + 1) % 12) or 12, 1
        )
        - offset
    )
year, month = 2022, 11
# make datetime objects; make sure to set UTC
dt_month_begin = datetime.datetime(year, month, 1, tzinfo=datetime.timezone.utc)
dt_month_end = date_to_endofmonth(dt_month_begin).replace(tzinfo=datetime.timezone.utc)
The differences: the floor division uses 13 instead of 12, so that November no longer rolls over into the next year.
The offset is changed to a one-second delta because I felt the user (and I, who came looking for the answer) wanted the starting and ending epoch times of the month, so
Nov 1st 00:00:00 --> Nov 30th 23:59:59 would be better than
Nov 1st 00:00:00 --> Nov 30th 00:00:00 (losing a day's worth of seconds).
The output of the above would be:
1667260800.0
2022-11-01 00:00:00+00:00
1669852799.0
2022-11-30 23:59:59+00:00
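As an alternative sketch that avoids the month arithmetic altogether, the standard library's calendar.monthrange returns the number of days in a month. The helper name month_epoch_bounds is made up for this example, and it assumes that "last day" means 23:59:59 UTC on that day.

```python
import calendar
from datetime import datetime, timezone

def month_epoch_bounds(year: int, month: int) -> tuple:
    """Return (first_second, last_second) of the month as UTC epoch seconds."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in the month
    begin = datetime(year, month, 1, tzinfo=timezone.utc)
    end = datetime(year, month, last_day, 23, 59, 59, tzinfo=timezone.utc)
    return int(begin.timestamp()), int(end.timestamp())

print(month_epoch_bounds(2022, 11))  # (1667260800, 1669852799)
print(month_epoch_bounds(2022, 3))   # (1646092800, 1648771199)
```

Because the year never changes inside the helper, December needs no special case.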

Comparison between dates starts with -1

I have the following code:
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame ({
'Date':['4/22/2020 14:32:10','4/21/2020 4:32:10','4/20/2020 1:32:10']
})
date ='04/22/2020'
datetime_object = datetime.strptime(date, '%m/%d/%Y')
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y %H:%M:%S')
days_diff = (datetime_object - df['Date']).dt.days
print(days_diff)
0 -1
1 0
2 1
Why doesn't the result look like the one below? Why does the number of days start with -1 and not with 0?
0 0
1 1
2 2
This is because the result is floored to whole days.
For the first case, '4/22/2020 14:32:10', the difference is about -14.5/24 = ~ -0.6 days, so the output is -1.
For the second case, '4/21/2020 4:32:10', the difference is about 19.5/24 = ~ 0.8 days, so the output is 0.
For the third case, '4/20/2020 1:32:10', the difference is about 46.5/24 = ~ 1.9 days, so the output is 1.
I hope it helps.
One solution is to convert all the datetimes to dates first, as done in the following line for the 'Date' column:
days_diff = (datetime_object.date() - df['Date'].dt.date ).dt.days
In [32]: days_diff
Out[32]:
0 0
1 1
2 2
Name: Date, dtype: int64
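A variant of the same fix that stays within datetime64 is to drop the time of day with .dt.normalize() before subtracting. This is a small sketch using the data from the question; the variable name target is chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["4/22/2020 14:32:10", "4/21/2020 4:32:10", "4/20/2020 1:32:10"],
        format="%m/%d/%Y %H:%M:%S",
    )
})
target = pd.Timestamp("2020-04-22")

# normalize() sets every timestamp to midnight, so only whole days remain.
days_diff = (target - df["Date"].dt.normalize()).dt.days
print(days_diff.tolist())  # [0, 1, 2]
```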
The issue is that you are subtracting a later datetime from an earlier one, which leaves you with a negative result. In the datetime module, subtracting one datetime object from another creates a timedelta object like so:
days1 = self.toordinal()
days2 = other.toordinal()
secs1 = self._second + self._minute * 60 + self._hour * 3600
secs2 = other._second + other._minute * 60 + other._hour * 3600
base = timedelta(days1 - days2,
secs1 - secs2,
self._microsecond - other._microsecond)
If we mimic that with your dates, we see the following days and seconds for each datetime object:
737537 0
737537 52330
Subtracting days2 from days1 and secs2 from secs1 means we pass the following to the timedelta object:
0 -52330
So we are asking for a timedelta of 0 days and negative 52,330 seconds, which is quite correct. However, timedelta accepts fractional values and several units (weeks, minutes and so on), and it does not limit the size of each argument: in the seconds part you can pass 10 seconds or 100,000 seconds. 100,000 seconds is more than the 86,400 seconds in a day, so the constructor uses divmod on the seconds to carry any whole days over into the days field:
days, seconds = divmod(seconds, 24*3600)
d += days
s += int(seconds) # can't overflow
The issue lies in understanding what divmod does: it returns the floor division and the remainder of the calculation. In the positive case that's fine.
print(divmod(52330, 24*3600))
print(divmod(-52330, 24*3600))
(0, 52330)
(-1, 34070)
In the positive case, floor division rounds down to 0 days and returns the remaining seconds. In the negative case, -52330 / 86400 is -0.6056..., so floor division rounds down to -1, and the remainder is 86400 - 52330 = 34070 seconds.
So you wouldn't face this issue if you always subtract the older date from the newer one, because you never end up with a negative difference. In fact, it doesn't make sense to subtract a newer date from an older date.
For the other cases you listed, the difference between 4/21/2020 4:32:10 and 4/22/2020 00:00:00 is indeed 0 days, since it is only about 19.5 hours. This behavior is correct: the difference is not 1 day, it's about 19.5 hours.
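The normalization described above can be checked directly with the standard library. This short sketch uses the first pair of values from the question; the variable names are chosen for the example.

```python
from datetime import datetime, timedelta

target = datetime(2020, 4, 22)             # midnight
stamp = datetime(2020, 4, 22, 14, 32, 10)  # later the same day

delta = target - stamp
# The negative duration is stored as -1 day plus a positive number of seconds.
print(delta.days, delta.seconds)  # -1 34070
print(delta.total_seconds())      # -52330.0

# Comparing calendar dates only avoids the negative partial day.
print((target.date() - stamp.date()).days)  # 0
```

If a signed fractional number of days is wanted rather than a floored one, delta.total_seconds() / 86400 is one option.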

How do I know if today is a day due to change civil local time e.g. daylight saving time in standard python and pandas timestamps?

According to the rules of British Summer Time / daylight saving time (https://www.gov.uk/when-do-the-clocks-change) the clocks:
go forward 1 hour at 1am on the last Sunday in March,
go back 1 hour at 2am on the last Sunday in October.
In 2019 this civil local time change happens on March 31st and October 27th, but the days slightly change every year. Is there a clean way to know these dates for each input year?
I need to check these "changing time" dates in an automatic way, is there a way to avoid a for loop to check the details of each date to see if it is a "changing time" date?
At the moment I am exploring these dates for 2019 just to try to figure out a reproducible/automatic procedure and I found this:
import datetime
import pandas as pd
# using datetime from the standard library
march_utc_30 = datetime.datetime(2019, 3, 30, 0, 0, 0, 0, tzinfo=datetime.timezone.utc)
march_utc_31 = datetime.datetime(2019, 3, 31, 0, 0, 0, 0, tzinfo=datetime.timezone.utc)
april_utc_1 = datetime.datetime(2019, 4, 1, 0, 0, 0, 0, tzinfo=datetime.timezone.utc)
# using pandas timestamps
pd_march_utc_30 = pd.Timestamp(march_utc_30) #, tz='UTC')
pd_march_utc_31 = pd.Timestamp(march_utc_31) #, tz='UTC')
pd_april_utc_1 = pd.Timestamp(april_utc_1) #, tz='UTC')
# using pandas wrappers
pd_local_march_utc_30 = pd_march_utc_30.tz_convert('Europe/London')
pd_local_march_utc_31 = pd_march_utc_31.tz_convert('Europe/London')
pd_local_april_utc_1 = pd_april_utc_1.tz_convert('Europe/London')
# then printing all these dates
print("march_utc_30 {} pd_march_utc_30 {} pd_local_march_utc_30 {}".format(march_utc_30, pd_march_utc_30, pd_local_march_utc_30))
print("march_utc_31 {} pd_march_utc_31 {} pd_local_march_utc_31 {}".format(march_utc_31, pd_march_utc_31, pd_local_march_utc_31))
print("april_utc_1 {} pd_april_utc_1 {} pd_local_april_utc_1 {}".format(april_utc_1, pd_april_utc_1, pd_local_april_utc_1))
The output of those print statements is:
march_utc_30 2019-03-30 00:00:00+00:00 pd_march_utc_30 2019-03-30 00:00:00+00:00 pd_local_march_utc_30 2019-03-30 00:00:00+00:00
march_utc_31 2019-03-31 00:00:00+00:00 pd_march_utc_31 2019-03-31 00:00:00+00:00 pd_local_march_utc_31 2019-03-31 00:00:00+00:00
april_utc_1 2019-04-01 00:00:00+00:00 pd_april_utc_1 2019-04-01 00:00:00+00:00 pd_local_april_utc_1 2019-04-01 01:00:00+01:00
I could use a for loop to find out if the current date is the last Sunday of the month, or compare the "hour delta" between the current date and the date of the day after to see if there is a +1, but I am wondering if there is a cleaner way to do this?
Is there something attached to the year e.g. knowing the input year is 2019 then we know for sure the "change date" in March will be day 31st?
Using dateutil.rrule can help (install it with pip install python-dateutil).
Because rrule can generate dates week by week, we don't need to write any loops ourselves:
from dateutil.rrule import rrule, WEEKLY
from dateutil.rrule import SU as Sunday
import datetime
def get_last_sunday(year, month):
    first_day = datetime.datetime(year=year, month=month, day=1)
    # a month can contain at most 5 Sundays
    days = rrule(freq=WEEKLY, dtstart=first_day, byweekday=Sunday, count=5)
    # Check whether the fifth Sunday is still in the same month;
    # if not, this year/month only has 4 Sundays
    if days[-1].month == month:
        return days[-1]
    else:
        return days[-2]
def get_march_switch(year):
    # last Sunday of March; the clocks go forward at 1am
    day = get_last_sunday(year, 3)
    return day.replace(hour=1, minute=0, second=0, microsecond=0)
def get_october_switch(year):
    # last Sunday of October; the clocks go back at 2am
    day = get_last_sunday(year, 10)
    return day.replace(hour=2, minute=0, second=0, microsecond=0)
print('2019:')
print(' {}'.format(get_march_switch(2019)))
print(' {}'.format(get_october_switch(2019)))
print('2021:')
print(' {}'.format(get_march_switch(2021)))
print(' {}'.format(get_october_switch(2021)))
get_last_sunday() asks rrule for the next 5 Sundays from the first day of the given month, because a month can have at most 5 Sundays.
It then checks whether the fifth Sunday still falls in the expected month. If not, the month only has 4 Sundays, so it returns the fourth one.
Finally, get_march_switch() and get_october_switch() set the hour, minutes, seconds and microseconds.
Output:
2019:
2019-03-31 01:00:00
2019-10-27 02:00:00
2021:
2021-03-28 01:00:00
2021-10-31 02:00:00
I know the topic is quite old now. However, I had the same question today, and in the end I found a solution that seems quite simple to me, using only the standard datetime module.
I want to check whether my date refdate is the October DST day, and I did it in the following way.
refdate is a standard datetime object, and the datetime module is imported as dt.
If you have a pandas Timestamp, you can convert it to a native datetime using .to_pydatetime().
if refdate.month == 10 and refdate.weekday() == 6 and (refdate + dt.timedelta(weeks=1)).month == 11:
    oct_dst = 1
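The "last Sunday of the month" rule can also be computed directly for any year with the standard library. This is a sketch only: last_sunday and is_uk_clock_change_day are names made up for the example, and the check assumes a plain date object and the UK rule quoted in the question.

```python
import calendar
from datetime import date

def last_sunday(year: int, month: int) -> date:
    """Return the date of the last Sunday in the given month."""
    last_day = calendar.monthrange(year, month)[1]
    last = date(year, month, last_day)
    # weekday(): Monday=0 ... Sunday=6; step back to the most recent Sunday.
    return last.replace(day=last_day - (last.weekday() + 1) % 7)

def is_uk_clock_change_day(day: date) -> bool:
    """True if `day` is the last Sunday of March or October."""
    return day.month in (3, 10) and day == last_sunday(day.year, day.month)

print(last_sunday(2019, 3), last_sunday(2019, 10))  # 2019-03-31 2019-10-27
print(last_sunday(2021, 3), last_sunday(2021, 10))  # 2021-03-28 2021-10-31
print(is_uk_clock_change_day(date(2019, 3, 31)))    # True
```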

Creating a daily account log from a Pandas expense file in data frame format

I have an expense file that I am trying to read in and from this file create a daily log. A small subset of the file that extends over years is shown below, for a few days in January 2015.
Date,Checking_Debit,Checking_Addition,Savings_Debit,Savings_Addition
2015-01-07,342.1,0.0,0.0,0.0
2015-01-07,981.0,0.0,0.0,0.0
2015-01-07,3185.0,0.0,0.0,0.0
2015-01-05,55.0,0.0,0.0,0.0
2015-01-05,75.0,0.0,0.0,0.0
2015-01-03,287.0,0.0,0.0,0.0
2015-01-02,64.8,0.0,0.0,0.0
2015-01-02,75.0,0.0,0.0,75.0
2015-01-02,1280.0,0.0,0.0,0.0
2015-01-02,245.0,0.0,0.0,0.0
2015-01-01,45.0,0.0,0.0,0.0
In my code I start with the variables checking_start and savings_start that contain the start values of the checking and savings account. I would like to give the code a start date and an end date and have the code iterate through each day, see if there was an expense on that day and subtract the checking and savings debits and add the checking and savings additions. If there were no expenses on that day it should keep the accounts at the same value as the previous day. In addition, I am trying to constrain myself to Pandas data frames in the implementation. So far my code looks like this.
import pandas as pd
from datetime import date
check_start = 8500.0
savings_start = 4000.0
start_date = date(2017, 1, 1)
end_date = date(2017, 1, 8)
df = pd.read_csv('file_name.csv', dtype={'Date': str, 'Checking_Debit': float,
                                         'Checking_Addition': float,
                                         'Savings_Debit': float,
                                         'Savings_Addition': float})
In a Pythonic way with the Pandas module, how do I walk from the start date to the end date, one day at a time, check whether there are any expenses on each date, and subtract them from the checking and savings accounts? At the end I should have an array with the value of the checking account on each date, and the same for the savings account.
The resulting arrays should be written to another .csv file with the following format.
Date,Checking,Savings
2017-01-07,1865.1,3925.0
2017-01-06,6373.2,3925.0
2017-01-05,6373.2,3925.0
2017-01-04,6503.2,3925.0
2017-01-03,6503.2,3925.0
2017-01-02,6790.2,3925.0
2017-01-01,8455.0,4000.0
Start by reading the data you provided, parsing the first column (Date) as dates:
import pandas as pd
df = pd.read_csv(r"dat.csv", parse_dates=[0], dtype={'Checking_Debit': float,
                                                     'Checking_Addition': float,
                                                     'Savings_Debit': float,
                                                     'Savings_Addition': float})
Set Date as index for better data manipulation.
df = df.set_index("Date")
Initialize all the variables for the loop
check_start = 8500.0
savings_start = 4000.0
start_date = pd.to_datetime('2015/1/1')
end_date = pd.to_datetime('2015/1/8')
delta = pd.Timedelta('1 days') # time that needs to be added to start date
Now group the expense data by date, summing all transactions that fall on the same day:
grp_df = df.groupby('Date').sum()
Then use a while loop to build the expense report for each day:
expense_report = []
while start_date <= end_date:
    if start_date in grp_df.index:
        savings_start += (grp_df.loc[start_date, "Savings_Addition"] - grp_df.loc[start_date, "Savings_Debit"])
        check_start += (grp_df.loc[start_date, "Checking_Addition"] - grp_df.loc[start_date, "Checking_Debit"])
    # record the balances for every day, whether or not there were transactions
    expense_report.append([start_date, check_start, savings_start])
    start_date += delta
Convert the expense_report list to a pandas DataFrame:
df_exp_rpt = pd.DataFrame(expense_report,columns=["Date","Checking","Savings"])
print(df_exp_rpt)
Date Checking Savings
0 2015-01-01 8455.0 4000.0
1 2015-01-02 6790.2 4075.0
2 2015-01-03 6503.2 4075.0
3 2015-01-04 6503.2 4075.0
4 2015-01-05 6373.2 4075.0
5 2015-01-06 6373.2 4075.0
6 2015-01-07 1865.1 4075.0
7 2015-01-08 1865.1 4075.0
You can save the result to CSV with:
df_exp_rpt.to_csv("filename.csv")
Note: the Savings values are 4075.0 instead of 3925.0 because your original data has a value of 75.0 in the Savings_Addition column.
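The day-by-day loop can also be expressed without explicit iteration, by reindexing the daily totals onto a full date range and taking a cumulative sum. This is a sketch under a few assumptions: it uses a shortened inline copy of the sample data instead of the CSV file, the names daily and log are chosen for the example, and there are no transactions before the start date.

```python
import io
import pandas as pd

# Shortened inline sample (an assumption for the example, not the full file).
csv_text = """Date,Checking_Debit,Checking_Addition,Savings_Debit,Savings_Addition
2015-01-07,342.1,0.0,0.0,0.0
2015-01-05,55.0,0.0,0.0,0.0
2015-01-02,75.0,0.0,0.0,75.0
2015-01-01,45.0,0.0,0.0,0.0
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Total per calendar day, with 0 for days that have no transactions.
daily = df.groupby("Date").sum()
days = pd.date_range("2015-01-01", "2015-01-08", freq="D")
daily = daily.reindex(days, fill_value=0.0)

# Running balance = starting balance + cumulative net change.
log = pd.DataFrame({
    "Checking": 8500.0 + (daily["Checking_Addition"] - daily["Checking_Debit"]).cumsum(),
    "Savings": 4000.0 + (daily["Savings_Addition"] - daily["Savings_Debit"]).cumsum(),
})
log.index.name = "Date"
print(log)
```

Calling log.to_csv(...) on the result writes the Date index as the first column.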
