Python 3: How to convert dates into monthly periods where the first period is September

I'm working with a group that has a fiscal year starting in September. I have a dataframe with a bunch of dates, and I want to calculate a monthly period that equals 1 in September.
What works:
# Convert date column to datetime format
df['Hours_Date'] = pd.to_datetime(df['Hours_Date'])
# First quarter starts in September - Yes!
df['Quarter'] = pd.PeriodIndex(df['Hours_Date'], freq='Q-Aug').strftime('Q%q')
What doesn't work:
# Gives me monthly periods starting in January. Don't want.
df['Period'] = pd.PeriodIndex(df['Hours_Date'], freq='M').strftime('%m')
# Gives me an error
df['Period'] = pd.PeriodIndex(df['Hours_Date'], freq='M-Aug').strftime('%m')
Is there a way to adjust the monthly frequency?

I think it is not implemented; check anchored offsets in the pandas documentation.
A possible solution is to subtract 8, or use Index.shift, to shift by 8 months:
rng = pd.date_range('2017-04-03', periods=10, freq='m')
df = pd.DataFrame({'Hours_Date': rng})
df['Period'] = (pd.PeriodIndex(df['Hours_Date'], freq='M') - 8).strftime('%m')
Or:
df['Period'] = pd.PeriodIndex(df['Hours_Date'], freq='M').shift(-8).strftime('%m')
print (df)
Hours_Date Period
0 2017-04-30 08
1 2017-05-31 09
2 2017-06-30 10
3 2017-07-31 11
4 2017-08-31 12
5 2017-09-30 01
6 2017-10-31 02
7 2017-11-30 03
8 2017-12-31 04
9 2018-01-31 05

I think 'M-Aug' is not applicable for a monthly frequency, so you can adjust a little by using np.where (data from Jez's answer):
np.where(df['Hours_Date'].dt.month-8<=0,df['Hours_Date'].dt.month+4,df['Hours_Date'].dt.month-8)
Out[271]: array([ 8, 9, 10, 11, 12, 1, 2, 3, 4, 5], dtype=int64)
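A more general alternative (my own sketch, not from the answers above) is to compute the fiscal month directly with modular arithmetic, so any fiscal-year start month can be plugged in:
# fiscal month where the fiscal year starts in September (month 9)
fy_start = 9
df['Period'] = (df['Hours_Date'].dt.month - fy_start) % 12 + 1
# optional: zero-padded string to match the '%m' style output
df['Period'] = df['Period'].astype(str).str.zfill(2)
This gives 1 for September, 2 for October, and so on, without shifting a PeriodIndex.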

Related

Handle ValueError while creating date in pd

I'm reading a csv file with columns p, day and month, and putting it in a df. The goal is to create a date from day, month and the current year, and I run into this error for the 29th of February:
ValueError: cannot assemble the datetimes: day is out of range for month
When this error occurs, I would like to replace the day with the day before. How can I do that? Below are a few lines of my df; the datex column at the end is what I would like to get:
p day month year datex
0 p1 29 02 2021 28Feb-2021
1 p2 18 07 2021 18Jul-2021
2 p3 12 09 2021 12Sep-2021
Right now, my code for the date is only the line below, so I get NaT where the date doesn't exist.
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
You could try something like this:
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
Indeed, you get NaT:
p day year month datex
0 p1 29 2021 2 NaT
1 p2 18 2021 7 2021-07-18
2 p3 12 2021 9 2021-09-12
You could then handle these NaT rows as a particular case:
df.loc[df.datex.isnull(), 'previous_day'] = df.day -1
p day year month datex previous_day
0 p1 29 2021 2 NaT 28.0
1 p2 18 2021 7 2021-07-18 NaN
2 p3 12 2021 9 2021-09-12 NaN
df.loc[df.datex.isnull(), 'datex'] = pd.to_datetime(df[['previous_day', 'year', 'month']].rename(columns={'previous_day': 'day'}))
p day year month datex previous_day
0 p1 29 2021 2 2021-02-28 28.0
1 p2 18 2021 7 2021-07-18 NaN
2 p3 12 2021 9 2021-09-12 NaN
You have to create a new day column if you want to keep day = 29 in the day column.
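An alternative sketch (assuming the year, month and day columns can be cast to integers) is to clamp the day to the last valid day of its month before assembling the date, which avoids the NaT intermediate step:
import calendar
import numpy as np
import pandas as pd

year = df['year'].astype(int)
month = df['month'].astype(int)
day = df['day'].astype(int)

# last valid day of each row's month (e.g. 28 for February 2021)
last_day = [calendar.monthrange(y, m)[1] for y, m in zip(year, month)]

df['datex'] = pd.to_datetime(pd.DataFrame({'year': year,
                                           'month': month,
                                           'day': np.minimum(day, last_day)}))
For 29 February 2021 this produces 2021-02-28, which matches the "day before" replacement you describe.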

Insert column in multi-header dataframe using loc

I am trying to insert a column in a multi-header dataframe which is the output of a pandas pivot. I have used the pandas .loc option for this, but I am not able to insert the column at a specific location.
Here is my code:
import numpy as np
import pandas as pd

data = {'Commander': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'Date': ['2012, 02, 08', '2012, 02, 08', '2012, 02, 08',
                 '2012, 02, 08', '2012, 02, 08'],
        'Hour': ['00', '01', '02', '03', '04'],
        'Subject': ['Maths', 'Science', 'Biology', 'Chemistry', 'Physics'],
        'Score': [4, 24, 31, 3, 1],
        'Grade': [1, 2, 1, 4, 5],
        'credit': [20, 50, 40, 20, 10]}
df = pd.DataFrame(data)
df1 = pd.pivot_table(df, index=['Commander', 'Hour'], columns=['Date'],
                     values=['Score', 'Grade', 'credit'], aggfunc=np.max)
I am trying to insert another subcolumn under Grade, for which I tried the code below. It lets me insert a column, but the column ends up at the end, not under Grade. Can anyone please guide me on how to achieve this?
df1.loc[:,('Grade','subcredit')]=df1.loc[:,('Grade','2012, 02, 08')]*5
You just need to add one more line of code, to sort the index (the parameter axis=1 sorts the column index):
df1.sort_index(axis=1)
Grade Score credit
Date 2012, 02, 08 subcredit 2012, 02, 08 2012, 02, 08
Commander Hour
Amy 04 5 25 1 10
Jake 03 4 20 3 20
Jason 00 1 5 4 20
Molly 01 2 10 24 50
Tina 02 1 5 31 40
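Note that sort_index returns a new DataFrame rather than sorting in place, so assign the result back if you want to keep the order:
df1 = df1.sort_index(axis=1)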
Alternatively, insert the column exactly where you want it. This is useful if sorting has the possibility of re-ordering other columns you don't want to change.
Find the last column that belongs to 'Grade' by checking argmax on the reversed boolean array; we'll insert right after this.
i = int(len(df1.columns) - (df1.columns.get_level_values(0) == 'Grade')[::-1].argmax())
df1.insert(i, ('Grade', 'subcredit'), df1.loc[:,('Grade','2012, 02, 08')]*5)
Grade Score credit
Date 2012, 02, 08 subcredit 2012, 02, 08 2012, 02, 08
Commander Hour
Amy 04 5 25 1 10
Jake 03 4 20 3 20
Jason 00 1 5 4 20
Molly 01 2 10 24 50
Tina 02 1 5 31 40
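If you would rather spell the order out explicitly, another option (a sketch with a hand-written ordering, assuming the ('Grade', 'subcredit') column has already been added with the .loc assignment from the question) is to reindex the columns with a MultiIndex:
import pandas as pd

new_order = pd.MultiIndex.from_tuples([('Grade', '2012, 02, 08'), ('Grade', 'subcredit'),
                                       ('Score', '2012, 02, 08'), ('credit', '2012, 02, 08')])
df1 = df1.reindex(columns=new_order)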

How to get the week number from a specified start date in Python?

I have time-series data and I want to get the week number counted from the initial date:
date
20180401
20180402
20180902
20190130
20190401
Things Tried
Code
df["date"]= pd.to_datetime(df.date,format='%Y%m%d')
df["week_no"]= df.date.dt.week
But the week number resets in 2019, so dates in 2019 end up with the same week numbers as dates in 2018.
Is there anything we can do about it?
You can use this function, which calculates the difference between two dates in weeks:
import numpy as np
import pandas as pd

def Wdiff(fromdate, todate):
    d = pd.to_datetime(todate) - pd.to_datetime(fromdate)
    return int(d / np.timedelta64(1, 'W'))
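Applied to the question's data, a usage sketch could look like this (counting 7-day blocks from the earliest date and, as an assumption, starting the numbering at 1):
df = pd.DataFrame({'date': ['20180401', '20180402', '20180902', '20190130', '20190401']})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

start = df['date'].min()  # 2018-04-01 here
df['week_no'] = df['date'].apply(lambda d: Wdiff(start, d)) + 1
Unlike dt.week, this number keeps increasing across year boundaries.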
You can create a datetime object with the specified date, then retrieve the week number using the isocalendar method:
import datetime
myDate = datetime.date(2018, 4, 1)
week = myDate.isocalendar()[1]
print(week)
You could then calculate the total number of remaining weeks in 2018, then add the total number of weeks in each year in between, and finally add the week number of the current date.
For example, this code would print the number of weeks from the 1st of April 2018 to the 6th May 2020:
import datetime

myDate = datetime.date(2018, 4, 1)
currentDate = datetime.date(2020, 5, 6)

# remaining weeks in the starting year
weeks = (datetime.date(myDate.year, 12, 28).isocalendar()[1]
         - myDate.isocalendar()[1])
# full weeks of each year in between (start at myDate.year + 1 so the
# starting year is not counted twice)
for i in range(myDate.year + 1, currentDate.year):
    weeks += datetime.date(i, 12, 28).isocalendar()[1]
# weeks elapsed in the current year
weeks += currentDate.isocalendar()[1]
print(weeks)
Note that because of the way isocalendar works, the 28th of December will always be in the last week of the given year.
The ISO year consists of 52 or 53 full weeks, and where a week starts on a Monday and ends on a Sunday. The first week of an ISO year is the first (Gregorian) calendar week of a year containing a Thursday. This is called week number 1, and the ISO year of that Thursday is the same as its Gregorian year.
You can get more information about isocalendar here: https://docs.python.org/3/library/datetime.html
To get the week number, but as a 2-digit string (with leading zero), you can run:
df['week_no'] = df.date.dt.strftime('%W')
The result, for slightly extended source data is:
date week_no
0 2018-04-01 13
1 2018-04-02 14
2 2018-09-02 35
3 2018-12-30 52
4 2018-12-31 53
5 2019-01-01 00
6 2019-01-02 00
7 2019-01-03 00
8 2019-01-04 00
9 2019-01-05 00
10 2019-01-06 00
11 2019-01-07 01
12 2019-01-30 04
13 2019-04-01 13
Note that the last day of 2018 (a Monday) has week_no == 53 and the "initial" days of 2019 (up to 2019-01-06, a Sunday) have week_no == 00.
If you want this column as int, append .astype(int) to the above code.

Alternative to looping? Vectorisation, cython?

I have a pandas dataframe something like the below:
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 0 0 0 0
ref2 20 2028 2 5 0 0 0 0 0
ref3 30 2021 7 16 0 0 0 0 0
ref4 40 2025 9 18 0 0 0 0 0
ref5 10 2022 4 30 0 0 0 0 0
The 'Total' column shows how many of a product needs to be delivered.
'First_Year_Del' tells you how many will be delivered in the first year. After this the delivery rate reverts to 'Del_rate', a flat rate that is applied each year until all products are delivered.
The 'Yr_to_Use' column tells you the first year column to begin delivery from.
EXAMPLE: Ref1 has 100 to deliver. It will start delivering in 2020 and will deliver 5 in the first year, and 10 each year after that until all 100 are accounted for.
Any ideas how to go about this?
I thought I might use something like the below to reference which columns to use in turn, but I'm not even sure if that's helpful, as it will depend on the solution (in the proper version, base_date.year is defined as the first column in the table, i.e. 2019):
start_index_for_slice = df.columns.get_loc(base_date.year)
end_index_for_slice = start_index_for_slice+no_yrs_to_project
df.columns[start_index_for_slice:end_index_for_slice]
I'm pretty new to Python and am not sure if I'm getting ahead of myself a bit...
The way I would think to go about it would be to use a for loop, or something using iterrows, but other posts seem to say this is a bad idea and I should be using vectorisation, Cython or lambdas. Of those three I've only managed a very simple lambda so far. The others are a bit of a mystery to me, since the solution seems to suggest doing one action after another until complete.
Any and all help appreciated!
Thanks
EDIT: Example expected output below (I edited some of the dates so you can better see the logic):
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 5 10 10 10
ref2 20 2021 2 5 0 0 2 5 5
ref3 30 2021 7 16 0 0 7 16 7
ref4 40 2019 9 18 9 18 13 0 0
ref5 10 2020 4 30 0 4 6 0 0
Here's another option, which separates the calculation of the rates/years matrix and appends it to the input df later on. Still does looping in the script itself (not "externalized" to some numpy / pandas function). Should be fine for 5k rows I'd guesstimate.
import pandas as pd
import numpy as np

# create the initial df without years/rates
df = pd.DataFrame({'Total': [100, 20, 30, 40, 10],
                   'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
                   'First_Year_Del': [5, 2, 7, 9, 10],
                   'Del_rate': [10, 5, 16, 18, 30]})

# get the number of full rates + the remainder
n, r = np.divmod((df['Total'] - df['First_Year_Del']), df['Del_rate'])

# get the year of the last rate considering all rows
max_year = np.max(n + r.astype(bool) + df['Yr_to_Use'])

# get the offsets for the start of delivery; year zero is 2019,
# and subtracting the year zero lets you use this as an index
offset = df['Yr_to_Use'] - 2019

# get a year index; this determines the columns that will be created
yrs = np.arange(2019, max_year + 1)

# prepare an n*m array to hold the rates for all years, initialized with zeros
# (n: number of rows of the df, m: number of years where rates will have to be paid)
out = np.zeros((df['Total'].shape[0], yrs.shape[0]))

# calculate the rates for each year and insert them into the output array
for i in range(df['Total'].shape[0]):
    # concatenate: the first-year delivery, all yearly rates, and a final rate if there is a remainder
    if r[i]:  # if the remainder is not zero, append it as well
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]], [r[i]]])
    else:     # remainder is zero, skip it
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]]])
    # insert the rates at the appropriate location of the output array
    out[i, offset[i]:offset[i] + rates.shape[0]] = rates

# add the years/rates matrix to the original df
df = pd.concat([df, pd.DataFrame(out, columns=yrs.astype(str))], axis=1, sort=False)
You can accomplish this using a few user-defined functions and the apply method:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'id': ['ref1', 'ref2', 'ref3', 'ref4', 'ref5'],
                        'Total': [100, 20, 30, 40, 10],
                        'Yr_to_Use': [2020, 2028, 2021, 2025, 2022],
                        'First_Year_Del': [5, 2, 7, 9, 4],
                        'Del_rate': [10, 5, 16, 18, 30]})

def f(r):
    '''
    Computes values per year and respective year
    '''
    n = (r['Total'] - r['First_Year_Del']) // r['Del_rate']
    leftover = (r['Total'] - r['First_Year_Del']) % r['Del_rate']
    r['values'] = [r['First_Year_Del']] + [r['Del_rate'] for _ in range(n)] + [leftover]
    r['years'] = np.arange(r['Yr_to_Use'], r['Yr_to_Use'] + len(r['values']))
    return r
df = df.apply(f, axis=1)
def get_year_range(r):
    '''
    Computes min and max year for each row
    '''
    r['y_min'] = min(r['years'])
    r['y_max'] = max(r['years'])
    return r
df = df.apply(get_year_range, axis=1)
y_min = df['y_min'].min()
y_max = df['y_max'].max()
# Initialize each year's value to zero
for year in range(y_min, y_max + 1):
    df[year] = 0
def expand(r):
    '''
    Update value for each year
    '''
    for v, y in zip(r['values'], r['years']):
        r[y] = v
    return r
# Apply and drop temporary columns
df = df.apply(expand, axis=1).drop(['values', 'years', 'y_min', 'y_max'], axis=1)
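If you want to avoid the Python-level loop entirely, a fully vectorised sketch is possible with NumPy broadcasting (using the data from the edited example, and assuming a fixed year grid of 2019-2030 is wide enough for every row): the cumulative amount delivered by year offset k is min(Total, First_Year_Del + k * Del_rate), and the per-year delivery is the difference between consecutive cumulative amounts.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Total': [100, 20, 30, 40, 10],
                   'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
                   'First_Year_Del': [5, 2, 7, 9, 4],
                   'Del_rate': [10, 5, 16, 18, 30]})

years = np.arange(2019, 2031)                             # assumed year grid
k = years[None, :] - df['Yr_to_Use'].to_numpy()[:, None]  # per-row year offset, negative before delivery starts

# cumulative amount delivered by the end of each year offset
cum = np.where(k < 0, 0,
               np.minimum(df['Total'].to_numpy()[:, None],
                          df['First_Year_Del'].to_numpy()[:, None]
                          + k * df['Del_rate'].to_numpy()[:, None]))

# per-year delivery = difference of consecutive cumulative totals
per_year = np.diff(cum, axis=1, prepend=0)

df = pd.concat([df, pd.DataFrame(per_year, columns=years.astype(str), index=df.index)], axis=1)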

Extract Day of Week More Pythonically

I have a df with fields year, month, day, formatted as integers. I have used the following to extract the day of the week.
How can I do this more pythonically?
### First Attempt - Succeeds
lst = []
for i in zip(df['day'], df['month'], df['year']):
    lst.append(calendar.weekday(i[2], i[1], i[0]))
df['weekday'] = lst
### Second Attempt -- Fails
df['weekday'] = df.apply(lambda x: calendar.weekday(x.year, x.month, x.day))
AttributeError: ("'Series' object has no attribute 'year'", 'occurred at index cons_conf')
Try .to_datetime and the dt accessor:
import pandas as pd
data = pd.DataFrame({'year': [2018, 2018, 2018], 'month': [12, 12, 12], 'day': [1, 2, 3]})
data['weekday'] = pd.to_datetime(data[['year', 'month', 'day']]).dt.weekday
print(data)
Giving:
year month day weekday
0 2018 12 1 5
1 2018 12 2 6
2 2018 12 3 0
Note that weekday is zero-indexed.
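As a side note on the question's second attempt: it only fails because apply defaults to axis=0 (column-wise), so each x is a whole column rather than a row. Passing axis=1 makes it work:
import calendar
df['weekday'] = df.apply(lambda x: calendar.weekday(x.year, x.month, x.day), axis=1)
The to_datetime approach above is still preferable, since it is vectorised rather than row-by-row.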
