How to create a Series of datetime values for the Nth calendar day of the month for this year? - python-3.x

I would like to create a function that returns a pandas Series of datetime values for the Nth calendar day of each month for the current year. An added wrinkle: I would also need it to be the previous business day if it happens to fall on a weekend. A bonus would be to check against known holidays as well.
For example, I'd like the output to look like this for the business day prior to or equal to the 14th day of each month:
0 2021-01-14
1 2021-02-12
2 2021-03-12
3 2021-04-14
4 2021-05-14
5 2021-06-14
6 2021-07-14
7 2021-08-13
8 2021-09-14
9 2021-10-14
10 2021-11-12
11 2021-12-14
I've tried using pd.date_range() and pd.bdate_range() and did not get the desired results. Example:
pd.date_range("2021-01-14","2021-12-14", periods=12)
DatetimeIndex(['2021-01-14 00:00:00',
               '2021-02-13 08:43:38.181818182',
               '2021-03-15 17:27:16.363636364',
               '2021-04-15 02:10:54.545454546',
               '2021-05-15 10:54:32.727272728',
               '2021-06-14 19:38:10.909090910',
               '2021-07-15 04:21:49.090909092',
               '2021-08-14 13:05:27.272727272',
               '2021-09-13 21:49:05.454545456',
               '2021-10-14 06:32:43.636363640',
               '2021-11-13 15:16:21.818181820',
               '2021-12-14 00:00:00'],
              dtype='datetime64[ns]', freq=None)
Additionally, this requires knowing the first and last days up front to use as the start and end of the range. Analogous tests with pd.bdate_range() mostly resulted in errors.

A similar approach to Pandas Date Range Monthly on Specific Day of Month, but subtract a BDay to get the previous business day. Also start at 12/31 of the previous year to get all values for the current year:
import pandas as pd

def get_date_range(day_of_month, year=pd.Timestamp.now().year):
    return (
        pd.date_range(start=pd.Timestamp(year=year - 1, month=12, day=31),
                      periods=12, freq='MS') +
        pd.Timedelta(days=day_of_month) -
        pd.tseries.offsets.BDay()
    )
Usage for the current year:
get_date_range(14)
DatetimeIndex(['2021-01-14', '2021-02-12', '2021-03-12', '2021-04-14',
               '2021-05-14', '2021-06-14', '2021-07-14', '2021-08-13',
               '2021-09-14', '2021-10-14', '2021-11-12', '2021-12-14'],
              dtype='datetime64[ns]', freq=None)
Or for another year:
get_date_range(14, 2020)
DatetimeIndex(['2020-01-14', '2020-02-14', '2020-03-13', '2020-04-14',
               '2020-05-14', '2020-06-12', '2020-07-14', '2020-08-14',
               '2020-09-14', '2020-10-14', '2020-11-13', '2020-12-14'],
              dtype='datetime64[ns]', freq=None)
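To see why the offsets line up: freq='MS' snaps the 12/31 start forward to the first of each month, adding day_of_month days then lands on day N+1 (the month start is itself day 1), and subtracting one BDay rolls back to the last business day on or before day N. A minimal sketch checking the first three months of 2021:
import pandas as pd

starts = pd.date_range(start=pd.Timestamp(2020, 12, 31), periods=3, freq='MS')
print(starts)  # 2021-01-01, 2021-02-01, 2021-03-01
print(starts + pd.Timedelta(days=14))  # the 15th of each month
print(starts + pd.Timedelta(days=14) - pd.tseries.offsets.BDay())
# 2021-01-14, 2021-02-12, 2021-03-12 -- the last business day on or before the 14th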
With Holidays (this is non-vectorized so it will raise a PerformanceWarning):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())

def get_date_range(day_of_month, year=pd.Timestamp.now().year):
    return (
        pd.date_range(start=pd.Timestamp(year=year - 1, month=12, day=31),
                      periods=12, freq='MS') +
        pd.Timedelta(days=day_of_month) -
        bday_us
    )
get_date_range(25)
DatetimeIndex(['2021-01-25', '2021-02-25', '2021-03-25', '2021-04-23',
               '2021-05-25', '2021-06-25', '2021-07-23', '2021-08-25',
               '2021-09-24', '2021-10-25', '2021-11-24', '2021-12-23'],
              dtype='datetime64[ns]', freq=None)
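If the PerformanceWarning gets noisy, one option (a sketch continuing from the code above, not required) is to silence it just around the call:
import warnings

with warnings.catch_warnings():
    # pd is already imported above; suppress only pandas' PerformanceWarning
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    dates = get_date_range(25)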

You can use the month start and then add a Timedelta to get to the day you want. So for your example it would be:
pd.date_range(start=pd.Timestamp("2020-12-14"), periods=12, freq='MS') + pd.Timedelta(days=13)
Output:
DatetimeIndex(['2021-01-14', '2021-02-14', '2021-03-14', '2021-04-14',
               '2021-05-14', '2021-06-14', '2021-07-14', '2021-08-14',
               '2021-09-14', '2021-10-14', '2021-11-14', '2021-12-14'],
              dtype='datetime64[ns]', freq=None)
To move to the previous business day, use (see: Pandas offset DatetimeIndex to next business if date is not a business day and Most recent previous business day in Python):
(pd.date_range(start=pd.Timestamp("2021-06-04"), periods=12, freq='MS') + pd.Timedelta(days=4)).map(lambda x: x - pd.tseries.offsets.BDay())
Output:
DatetimeIndex(['2021-07-02', '2021-08-05', '2021-09-03', '2021-10-04',
               '2021-11-04', '2021-12-03', '2022-01-06', '2022-02-04',
               '2022-03-04', '2022-04-04', '2022-05-05', '2022-06-03'],
              dtype='datetime64[ns]', freq=None)
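Putting the two steps together for the original 14th-of-month example (note days=14 here: adding 14 days to the month start lands on the 15th, and the BDay subtraction rolls back to the last business day on or before the 14th). A sketch that reproduces the desired output from the question:
import pandas as pd

(pd.date_range(start=pd.Timestamp("2020-12-31"), periods=12, freq='MS')
 + pd.Timedelta(days=14)).map(lambda x: x - pd.tseries.offsets.BDay())
# DatetimeIndex(['2021-01-14', '2021-02-12', '2021-03-12', ..., '2021-12-14'], ...)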

Related

Wrong sequence of months in PySpark sequence interval month

I am trying to create an array of dates containing all months from a minimum date to a maximum date!
Example:
min_date = "2021-05-31"
max_date = "2021-11-30"
.withColumn('array_date', F.expr('sequence(to_date(min_date), to_date(max_date), interval 1 month)'))
But it gives me the following output:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31']
Why doesn't the upper limit, 2021-11-30, appear? The documentation says that the bounds are included.
My desired output is:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31', '2021-11-30']
Thank you!
I think this is related to the timezone. I can reproduce the same behavior in my timezone, Europe/Paris, but when setting the timezone to UTC it gives the expected result:
from pyspark.sql import functions as F

spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.createDataFrame([("2021-05-31", "2021-11-30")], ["min_date", "max_date"])
df.withColumn(
    "array_date",
    F.expr("sequence(to_date(min_date), to_date(max_date), interval 1 month)")
).show(truncate=False)
#+----------+----------+------------------------------------------------------------------------------------+
#|min_date |max_date |array_date |
#+----------+----------+------------------------------------------------------------------------------------+
#|2021-05-31|2021-11-30|[2021-05-31, 2021-06-30, 2021-07-31, 2021-08-31, 2021-09-30, 2021-10-31, 2021-11-30]|
#+----------+----------+------------------------------------------------------------------------------------+
Alternatively, you can use TimestampType for start and end parameters of the sequence instead of DateType:
df.withColumn(
    "array_date",
    F.expr("sequence(to_timestamp(min_date), to_timestamp(max_date), interval 1 month)").cast("array<date>")
).show(truncate=False)
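If you would rather not leave the session timezone globally set to UTC, here is a small sketch (continuing from the snippet above, with the usual spark session handle) that saves and restores it around the computation:
# Save, override, and restore the session timezone
prev_tz = spark.conf.get("spark.sql.session.timeZone")
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.withColumn(
    "array_date",
    F.expr("sequence(to_date(min_date), to_date(max_date), interval 1 month)")
).show(truncate=False)  # show() runs while the UTC setting is active
spark.conf.set("spark.sql.session.timeZone", prev_tz)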

efficient cumulative pivot in pyspark

Is there a more efficient/idiomatic way of rewriting this query:
spark.table('registry_data')
    .withColumn('age_days', datediff(lit(today), col('date')))
    .withColumn('timeframe',
                when(col('age_days') < 7, '1w')
                .when(col('age_days') < 30, '1m')
                .when(col('age_days') < 92, '3m')
                .when(col('age_days') < 183, '6m')
                .when(col('age_days') < 365, '1y')
                .otherwise('1y+'))
    .groupby('make', 'model')
    .pivot('timeframe')
    .agg(countDistinct('id').alias('count'))
    .fillna(0)
    .withColumn('1y+', col('1y+') + col('1y') + col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('1y', col('1y') + col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('6m', col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('3m', col('3m') + col('1m') + col('1w'))
    .withColumn('1m', col('1m') + col('1w'))
The gist of the query is, for every make/model combination, to return the number of entries seen within a set of time periods from today. The period counts are cumulative, i.e. an entry registered within the last 7 days would be counted for 1 week, 1 month, 3 months, etc.
If you want to use a cumulative sum instead of summing each column, you can replace the code from .groupby onwards and use window functions:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

spark.table('registry_data')
    .withColumn('age_days', datediff(lit(today), col('date')))
    .withColumn('timeframe',
                when(col('age_days') < 7, '1w')
                .when(col('age_days') < 30, '1m')
                .when(col('age_days') < 92, '3m')
                .when(col('age_days') < 183, '6m')
                .when(col('age_days') < 365, '1y')
                .otherwise('1y+'))
    .groupBy('make', 'model', 'timeframe')
    .agg(F.countDistinct('id').alias('count'),
         F.max('age_days').alias('max_days'))  # for the orderBy clause
    .withColumn('cumsum',
                F.sum('count').over(Window.partitionBy('make', 'model')
                                          .orderBy('max_days')
                                          .rowsBetween(Window.unboundedPreceding, 0)))
    .groupBy('make', 'model').pivot('timeframe').agg(F.first('cumsum'))
    .fillna(0)
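A toy check of the window-then-pivot idea (the rows below are invented for illustration): each bucket's cumsum already includes all younger buckets, and since every make/model/timeframe group collapses to a single row after the first groupBy, F.first('cumsum') simply picks that value:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame(
    [("ford", "f150", "1w", 2, 6),
     ("ford", "f150", "1m", 3, 25),
     ("ford", "f150", "3m", 1, 80)],
    ["make", "model", "timeframe", "count", "max_days"])
w = (Window.partitionBy("make", "model")
     .orderBy("max_days")
     .rowsBetween(Window.unboundedPreceding, 0))
(toy.withColumn("cumsum", F.sum("count").over(w))
 .groupBy("make", "model").pivot("timeframe").agg(F.first("cumsum"))
 .show())
# 1w=2, 1m=5, 3m=6 -- each bucket includes all younger buckets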

Set days since first occurrence based on multiple columns

I have a pandas dataset with this structure:
Date datetime64[ns]
Events int64
Location object
Day float64
I've used the following code to get the date of the first occurrence for location "A":
start_date = df[df['Location'] == 'A'][df.Events != 0].iat[0,0]
I now want to update all of the records after the start_date with the number of days since the start_date, where Day = df.Date - start_date.
I tried this code:
df.loc[df.Location == country, 'Day'] = (df.Date - start_date).days
However, that code returns an error:
AttributeError: 'Series' object has no attribute 'days'
The problem seems to be that the code recognizes df.Date as an object instead of a datetime. Anyone have any ideas on what is causing this problem?
Try adding the .dt accessor:
df.loc[df.Location == country, 'Day'] = (df.Date - start_date).dt.days
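The reason: subtracting a scalar Timestamp from a datetime Series yields a timedelta64 Series, and element-wise attributes such as days are only reachable through the .dt accessor. A quick illustration:
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-01-03", "2021-01-06"]))
delta = s - pd.Timestamp("2021-01-01")
print(delta.dtype)    # timedelta64[ns]
print(delta.dt.days)  # 0    2
                      # 1    5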

Pandas Data Frame, find max value and return adjacent column value, not the entire row

New to Pandas so I'm sorry if there is an obvious solution...
I imported a CSV that only had 2 columns and I created a 3rd column.
(A screenshot of the top 10 rows and header was included in the original post; a sample DataFrame is provided below.)
I've figured out how to find the min and max values in the ['Amount Changed'] column, but I also need to pull the date associated with the min and max (not the index or ['Profit/Losses']). I've tried iloc and loc, and read about groupby, but I can't get any of them to return a single value (in this case a date) that I can use again.
My goal is to create a new variable 'Gi_Date' that is in the same row as the max value in ['Amount Changed'] but tied to the date in the ['Date'] column.
I'm trying to keep the variables separate so I can use them in print statements, write them to txt files, etc.
import os
import csv
import pandas as pd
import numpy as np
#path for CSV file
csvpath = ("budget_data.csv")
#Read CSV into Panadas and give it a variable name Bank_pd
Bank_pd = pd.read_csv(csvpath, parse_dates=True)
#Number of month records in the CSV
Months = Bank_pd["Date"].count()
#Total amount of money captured in the data converted to currency
Total_Funds = '${:.0f}'.format(Bank_pd["Profit/Losses"].sum())
#Determine the amount of increase or decrease from the previous month
AmtChange = Bank_pd["Profit/Losses"].diff()
Bank_pd["Amount Changed"] = AmtChange
#Identify the greatest positive change
GreatestIncrease = '${:.0f}'.format(Bank_pd["Amount Changed"].max())
Gi_Date = Bank_pd[Bank_pd["Date"] == GreatestIncrease]
#Identify the greatest negative change
GreatestDecrease = '${:.0f}'.format(Bank_pd["Amount Changed"].min())
Gd_Date = Bank_pd[Bank_pd['Date'] == GreatestDecrease]
print(f"Total Months: {Months}")
print(f"Total: {Total_Funds}")
print(f"Greatest Increase in Profits: {Gi_Date} ({GreatestIncrease})")
print(f"Greatest Decrease in Profits: {Gd_Date} ({GreatestDecrease})")
When I run the script in git bash I don't get an error anymore, so I think I'm getting close, but rather than showing the date it says:
$ python PyBank.py
Total Months: 86
Total: $38382578
Greatest Increase in Profits: Empty DataFrame
Columns: [Date, Profit/Losses, Amount Changed]
Index: [] ($1926159)
Greatest Decrease in Profits: Empty DataFrame
Columns: [Date, Profit/Losses, Amount Changed]
Index: [] ($-2196167)
I want it to print out like this:
$ python PyBank.py
Total Months: 86
Total: $38382578
Greatest Increase in Profits: Feb-2012 ($1926159)
Greatest Decrease in Profits: Sept-2013 ($-2196167)
Here is one year's worth of the original DataFrame:
bank_pd = pd.DataFrame({'Date': ['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
                                 'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10'],
                        'Profit/Losses': [867884, 984655, 322013, -69417, 310503, 522857,
                                          1033096, 604885, -216386, 477532, 893810, -80353]})
The expected output with the sample df would be:
Total Months: 12
Total Funds: $5651079
Greatest Increase in Profits: Oct-10 ($693918)
Greatest Decrease in Profits: Dec-10 ($-974163)
I also had an error in the sample dataframe from above, I was missing a month when I typed it out quickly - it's fixed now.
Thanks!
I'm seeing a few glitches in the variables used.
Bank_pd["Amount Changed"] = AmtChange
The statement above adds the "Amount Changed" column to the DataFrame; after it runs you can use that column for any manipulation.
Below is the updated code with the newly added lines; you could add further formatting:
import pandas as pd

csvpath = "budget_data.csv"
Bank_pd = pd.read_csv(csvpath, parse_dates=True)
inp_bank_pd = pd.DataFrame(Bank_pd)
Months = Bank_pd["Date"].count()
Total_Funds = '${:.0f}'.format(Bank_pd["Profit/Losses"].sum())
AmtChange = Bank_pd["Profit/Losses"].diff()
Bank_pd["Amount Changed"] = AmtChange  # assign the diff as a column before using it
GreatestIncrease = Bank_pd["Amount Changed"].max()
Gi_Date = inp_bank_pd.loc[Bank_pd["Amount Changed"] == GreatestIncrease]
print(Months)
print(Total_Funds)
print(Gi_Date['Date'].values[0])
print(GreatestIncrease)
In your example code, Gi_Date and Gd_Date are initializing new DataFrames instead of pulling out values. Also, to match your expected output you need to sort on 'Amount Changed' (dropping the NaN that diff() leaves in the first row), not on 'Profit/Losses'. Change Gi_Date and Gd_Date:
Gi_Date = Bank_pd.dropna(subset=['Amount Changed']).sort_values('Amount Changed').tail(1).Date
Gd_Date = Bank_pd.dropna(subset=['Amount Changed']).sort_values('Amount Changed').head(1).Date
Check outputs:
Gi_Date
Oct-10
Gd_Date
Dec-10
To print it the way you want using string formatting:
print("Total Months: %s" %(Months))
print("Total: %s" %(Total_Funds))
print("Greatest Increase in Profits: %s %s" %(Gi_Date.to_string(index=False), GreatestIncrease))
print("Greatest Decrease in Profits: %s %s" %(Gd_Date.to_string(index=False), GreatestDecrease))
Note that if you don't use Gd_Date.to_string(index=False), the pandas object information will be included in the print output, as in your example where the DataFrame info is shown.
Output for 12 month sample DF:
Total Months: 12
Total: $5651079
Greatest Increase in Profits: Oct-10 $693918
Greatest Decrease in Profits: Dec-10 $-974163
Use Series.idxmin and Series.idxmax with loc:
df.loc[df['Amount Changed'].idxmin(), 'Date']
df.loc[df['Amount Changed'].idxmax(), 'Date']
Full example based on your sample DataFrame:
df = pd.DataFrame({'Date': ['Jan-2010', 'Feb-2010', 'Mar-2010', 'Apr-2010', 'May-2010',
                            'Jun-2010', 'Jul-2010', 'Aug-2010', 'Sep-2010', 'Oct-2010'],
                   'Profit/Losses': [867884, 984655, 322013, -69417, 310503, 522857,
                                     1033096, 604885, -216386, 477532]})
df['Amount Changed'] = df['Profit/Losses'].diff()
print(df)
       Date  Profit/Losses  Amount Changed
0  Jan-2010         867884             NaN
1  Feb-2010         984655        116771.0
2  Mar-2010         322013       -662642.0
3  Apr-2010         -69417       -391430.0
4  May-2010         310503        379920.0
5  Jun-2010         522857        212354.0
6  Jul-2010        1033096        510239.0
7  Aug-2010         604885       -428211.0
8  Sep-2010        -216386       -821271.0
9  Oct-2010         477532        693918.0
print(df.loc[df['Amount Changed'].idxmin(), 'Date'])
print(df.loc[df['Amount Changed'].idxmax(), 'Date'])
Sep-2010
Oct-2010
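One design note: idxmax and idxmin skip the NaN that diff() leaves in the first row, so no dropna is needed before looking up the dates:
print(df['Amount Changed'].idxmax())  # 9 -> the Oct-2010 row
print(df['Amount Changed'].idxmin())  # 8 -> the Sep-2010 row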

Filter on month and date irrespective of year in Python

I have a dataset with several columns, one of them being a date, and I need to drop the rows that have leap dates (February 29). Since the data spans a range of years, I was hoping to drop any rows matching an '02-29' filter.
The one way I used is to add additional columns, extracting the month and date separately, and then filter on the data as shown below. It serves the purpose but is obviously not good from an efficiency perspective.
df['Yr'], df['Mth-Dte'] = zip(*df['Date'].apply(lambda x: (x[:4], x[5:])))
df = df[df['Mth-Dte'] != '02-29']
Is there a better way to implement this by directly applying the filter on the column in the dataframe?
Adding the data:
                ID      Date
22398  IDM00096087  1/1/2005
22586  IDM00096087  1/1/2005
21790  IDM00096087  1/2/2005
21791  IDM00096087  1/2/2005
14727  IDM00096087  1/3/2005
Thanks in advance
Convert to datetime and use a boolean mask.
import pandas as pd

data = {'Date': {14727: '1/3/2005',
                 21790: '1/2/2005',
                 21791: '1/2/2005',
                 22398: '1/1/2005',
                 22586: '29/2/2008'},
        'ID': {14727: 'IDM00096087',
               21790: 'IDM00096087',
               21791: 'IDM00096087',
               22398: 'IDM00096087',
               22586: 'IDM00096087'}}
df = pd.DataFrame(data)
Option 1, convert + dt:
df.Date = pd.to_datetime(df.Date)
# Filter away February 29; ~ negates the combined mask
df[~((df.Date.dt.month == 2) & (df.Date.dt.day == 29))]
Option 2, convert + strftime:
df.Date = pd.to_datetime(df.Date)
# Filter away February 29
df[df.Date.dt.strftime('%m%d') != '0229']
Option 3, without conversion:
mask = pd.to_datetime(df.Date).dt.strftime('%m%d') != '0229'
df[mask]
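One caveat worth flagging: the sample mixes formats ('1/3/2005' vs '29/2/2008'), so if your dates are day-first, pass dayfirst=True when converting; otherwise an ambiguous date like '1/3/2005' may parse as January 3 instead of March 1:
# With dayfirst=True, '1/3/2005' parses as March 1 (day/month/year)
df.Date = pd.to_datetime(df.Date, dayfirst=True)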
