Is there a way to convert numerical month/day/year to letter form month/day/year in pandas? - python-3.x

Currently, I have a column in the form month/day/year, like 2/11/2020. I am trying to get it into the form February Eleven 2020.
So far I've tried looking into dt.time to split out the day and month, but it looks like I need the input in yy-mm-dd. I was thinking I could maybe split it into three columns and then use .replace on each column with a dictionary.
Does anyone know how to get this method to work, or have a better solution?

The following are three strftime formats I have come across (each line is an alternative, not meant to be run in sequence):
import pandas as pd

df = pd.DataFrame({'Date': ['2/11/2020']})
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%d %B %Y')     # day Month Year
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%A %B %Y')     # day of the week, Month, Year
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%A %d %B %Y')  # day of the week, day, Month, Year
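For the sample date above, these produce the following (a quick check; note that pandas parses 2/11/2020 month-first by default):
import pandas as pd

s = pd.to_datetime(pd.Series(['2/11/2020']))
print(s.dt.strftime('%d %B %Y').iloc[0])     # 11 February 2020
print(s.dt.strftime('%A %B %Y').iloc[0])     # Tuesday February 2020
print(s.dt.strftime('%A %d %B %Y').iloc[0])  # Tuesday 11 February 2020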

I don't believe the standard datetime library has functionality to spell out the days of the month, but you can easily create your own dict to handle that.
from datetime import datetime

now = datetime.now()
date_dict = {'19': 'Nineteenth',
             '20': 'Twentieth',
             '21': 'Twenty First'}
print(now.strftime('%B ') + date_dict[now.strftime('%d')] + now.strftime(' %Y'))
'June Twenty First 2020'
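The dict above only covers a few days; if you need all thirty-one, a small builder saves hand-typing. A minimal sketch (the word lists are written out by hand, so worth double-checking):
ones = ['', 'First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth',
        'Seventh', 'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth',
        'Thirteenth', 'Fourteenth', 'Fifteenth', 'Sixteenth',
        'Seventeenth', 'Eighteenth', 'Nineteenth']
# Keys are zero-padded to match strftime('%d') output.
date_dict = {'{:02d}'.format(d): ones[d] for d in range(1, 20)}
date_dict.update({str(20 + d): 'Twenty ' + ones[d] for d in range(1, 10)})
date_dict.update({'20': 'Twentieth', '30': 'Thirtieth', '31': 'Thirty First'})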
Here is a larger toy example if you want to use a lambda. The extra C column is added because I don't think apply works with an index.
import pandas as pd
dr = pd.date_range(start='6/19/2020', end='6/21/2020')
df = pd.DataFrame({'A' :['6/19/2020', '6/20/2020', '6/21/2020'], 'B':[3,4,5]}, index=dr)
df['C'] = df.index
df['D'] = df['C'].apply(lambda x: x.strftime('%B ') + date_dict[x.strftime('%d')] + x.strftime(' %Y'))
print(df)
                    A  B          C                       D
2020-06-19  6/19/2020  3 2020-06-19    June Nineteenth 2020
2020-06-20  6/20/2020  4 2020-06-20     June Twentieth 2020
2020-06-21  6/21/2020  5 2020-06-21  June Twenty First 2020
If you just want to use an already existing string column (like column A) in m/d/y format, you can use this method.
df['A'] = pd.to_datetime(df['A'], dayfirst=False)
df['D'] = df['A'].apply(lambda x: x.strftime('%B ') + date_dict[x.strftime('%d')] + x.strftime(' %Y'))
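If you would rather not maintain the word list by hand, the third-party inflect package can generate the ordinal words for you (an assumption: it is not part of the standard library and has to be pip-installed):
import inflect  # third-party: pip install inflect

p = inflect.engine()
# p.ordinal(11) -> '11th'; number_to_words('11th') -> 'eleventh'
day_word = p.number_to_words(p.ordinal(11)).title()  # 'Eleventh'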

Related

Find the cumulative number of missing days for a datetime column in pandas

I have a sample dataframe as given below.
import pandas as pd
data = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
        'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-22 04:38:26',
                 '2021-09-23 00:12:29', '2021-09-22 11:20:58', '2021-09-25 09:20:58',
                 '2021-03-11 21:20:00', '2021-03-11 21:25:00', '2021-03-12 21:25:00',
                 '2021-03-13 21:25:00', '2021-03-15 21:25:00']}
df1 = pd.DataFrame(data)
df1
The 'Date' column is in datetime format.
Now, I want to find the total number of missing dates in between for each participant and print them (or create a new dataframe), e.g.:
ID  Missing days
A   2 (21st and 24th September missing)
B   1 (14th March missing)
Any help is greatly appreciated. Thanks.
The old answer below will fail with multiple consecutive missing days (thanks Ben T). We can solve this by using resample per group, then counting the NaT values:
df1["Date"] = pd.to_datetime(df1["Date"])  # skip if the column is already datetime
dfg = df1.groupby("ID").apply(lambda x: x.resample(rule="D", on="Date").first())
dfg["Date"].isna().groupby(level=0).sum().reset_index(name="Missing days")  # .sum(level=0) in older pandas
ID Missing days
0 A 2
1 B 1
** OLD ANSWER **
We can use GroupBy.diff and check how many diffs are greater than 1 day:
df1["Date"] = pd.to_datetime(df1["Date"])
(
df1.groupby("ID")["Date"]
.apply(lambda x: x.diff().gt(pd.Timedelta(1, "D")).sum())
.reset_index(name="Missing days")
)
ID Missing days
0 A 2
1 B 1
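For completeness, a third sketch that avoids resampling entirely: each gap of n days between consecutive distinct dates implies n - 1 missing calendar days, so summing (gap - 1) also handles consecutive gaps (this assumes Date is already datetime, as above):
# Count missing days as the sum of (gap - 1) over consecutive distinct dates.
days = df1.assign(Day=df1["Date"].dt.normalize())
missing = (
    days.sort_values("Day")
        .drop_duplicates(["ID", "Day"])
        .groupby("ID")["Day"]
        .apply(lambda s: int((s.diff().dt.days - 1).clip(lower=0).sum()))
        .reset_index(name="Missing days")
)
print(missing)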

Group by the dates to weeks

I have time in epoch format, in milliseconds; I need to translate this into a date and group it by week number.
I tried the following procedure:
df.loc[0, 'seconds'] = df['seconds'].iloc[0]
for _, grp in df.groupby(pd.TimeGrouper(key='seconds', freq='7D')):
    print(grp)
df["week"].to_period(freq='w')
For example, if my 'seconds' column is presented like 1557499095332, then I want the 'dates' column to be 10-05-2019 20:08:15 and the 'Week' column to present W19 or 19.
How do I go about this?
Try using the strftime method:
from datetime import datetime as dt
x = 1557499095332
dt.fromtimestamp(x/1000).strftime("%A, %B %d, %Y %I:%M:%S")
dt.fromtimestamp(x/1000).strftime("%W")
The 3rd line will return 'Friday, May 10, 2019 03:38:15' (the exact clock time depends on your local timezone, since fromtimestamp converts to local time).
The 4th line will return '18', because %W counts weeks from the first Monday of the year, so days before it, such as 1 January 2019, fall in week '00'.
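At the pandas level, the same idea can be vectorised. A minimal sketch (assuming a 'seconds' column holding epoch milliseconds, and using ISO week numbering, which yields W19 for this timestamp):
import pandas as pd

df = pd.DataFrame({'seconds': [1557499095332]})
df['dates'] = pd.to_datetime(df['seconds'], unit='ms')            # 2019-05-10 14:38:15.332 (UTC)
df['week'] = 'W' + df['dates'].dt.isocalendar().week.astype(str)  # 'W19'
print(df)
Grouping by week then becomes df.groupby(df['dates'].dt.isocalendar().week); note that pd.TimeGrouper from the question was deprecated in favour of pd.Grouper(key='dates', freq='7D').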

PANDAS date summarisation

I have a pandas dataframe that looks like:
import pandas as pd
df = pd.DataFrame({'type': ['Asset', 'Liability', 'Asset', 'Liability', 'Asset'],
                   'Amount': [10, -10, 20, -20, 5],
                   'Maturity Date': ['2018-01-22', '2018-02-22', '2018-06-22', '2019-06-22', '2020-01-22']})
df
I want to aggregate the dates so it shows the first four quarters and then the year end. For the dataset above, I would expect:
df1 = pd.DataFrame({'type': ['Asset', 'Liability', 'Asset', 'Liability', 'Asset'],
                    'Amount': [10, -10, 20, -20, 5],
                    'Maturity Date': ['2018-01-22', '2018-02-22', '2018-06-22', '2019-06-22', '2020-01-22'],
                    'Mat Group': ['1Q18', '1Q18', '2Q18', 'FY19', 'FY20']})
df1
Right now I achieve this using a set of loc statements, such as:
df.loc[(df['Maturity Date'] >'2018-01-01') & (df['Maturity Date'] <='2018-03-31'),'Mat Group']="1Q18"
df.loc[(df['Maturity Date'] >'2018-04-01') & (df['Maturity Date'] <='2018-06-30'),'Mat Group']="2Q18"
I was wondering if there is a more elegant way to achieve the same result? Perhaps have the buckets in a list and parse through the list so that the bucketing can be made more flexible ?
A bit specific. I would use:
the strftime format %y to get the short year,
the pandas built-in quarter attribute to get the quarter,
the Python format function to construct the strings,
and a lambda to apply it to the column.
Here is the result. Maybe there is a better answer, but this one is pretty concise.
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])  # ensure datetime dtype first
df['Mat Group'] = df['Maturity Date'].apply(
    lambda x: '{}Q{:%y}'.format(x.quarter, x) if x.year < 2019
    else 'FY{:%y}'.format(x))
df
# Amount Maturity Date type Mat Group
# 0 10 2018-01-22 Asset 1Q18
# 1 -10 2018-02-22 Liability 1Q18
# 2 20 2018-06-22 Asset 2Q18
# 3 -20 2019-06-22 Liability FY19
# 4 5 2020-01-22 Asset FY20
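If you want the buckets in a list, as the question suggests, a sketch with pd.cut also works; the bin edges and labels below are illustrative assumptions chosen to match the expected output:
import pandas as pd

df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])
edges = pd.to_datetime(['2018-01-01', '2018-04-01', '2018-07-01', '2018-10-01',
                        '2019-01-01', '2020-01-01', '2021-01-01'])
labels = ['1Q18', '2Q18', '3Q18', '4Q18', 'FY19', 'FY20']
df['Mat Group'] = pd.cut(df['Maturity Date'], bins=edges, labels=labels, right=False)
Extending the buckets is then just a matter of editing the edges and labels lists.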

Pandas Equivalent to SQL YEAR(GETDATE())

I'm a Pandas newbie but decent at SQL. A function I often leverage in SQL is this:
YEAR(date_format_data) = (YEAR(GETDATE())-1)
This will get me all the data from last year. Can someone please help me understand how to do the equivalent in Pandas?
Here's some example data:
Date Number
01/01/15 1
01/02/15 2
01/01/15 3
01/01/16 2
01/01/16 1
And here's my best guess at the code:
df = df[YEAR('Date') == (YEAR(GETDATE()) -1)].agg(['sum'])
And this code would return a value of '3'.
Thank you in advance for your help, I'm having a really hard time figuring out what I'm sure is simple.
I think you can do it this way:
prev_year = pd.Timestamp.today().year - 1  # pd.datetime was removed from modern pandas
df.loc[df['Date'].dt.year == prev_year]
PS the .dt.year accessor will only work if the Date column is of datetime dtype. If that's not the case, you may want to convert the column to datetime dtype first:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
For pandas, first convert your date column to a timestamp with pd.to_datetime (it has a format parameter to specify your input date format):
df['Date2'] = pd.to_datetime(df['Date'])
Then you can access the year through the .dt accessor:
df['Date2'].dt.year
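Putting both answers together on the sample data, a minimal sketch (the fixed year 2016 stands in for "last year" relative to when the question was asked, which reproduces the expected value of 3):
import pandas as pd

df = pd.DataFrame({'Date': ['01/01/15', '01/02/15', '01/01/15', '01/01/16', '01/01/16'],
                   'Number': [1, 2, 3, 2, 1]})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')
total = df.loc[df['Date'].dt.year == 2016, 'Number'].sum()
print(total)  # 3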

How can I count categorical columns by month in Pandas?

I have time series data with a column which can take a value A, B, or C.
An example of my data looks like this:
date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
I want to group my data by month and store the combined count of A and B in a column a_or_b_count and the count of C in c_count.
I've tried several things, but the closest I've been able to do is to preprocess the data with the following function:
def preprocess(df):
    # Remove everything more granular than day by splitting the stringified version of the date.
    df['date'] = pd.to_datetime(df['date'].apply(lambda t: t.replace('\ufeff', '')), format="%Y-%m-%d")
    # Set the time column as the index and drop redundant time column now that time is indexed. Do this op in-place.
    df = df.set_index(df.date)
    df.drop('date', inplace=True, axis=1)
    # Group all events by (year, month) and count category by values.
    counted_events = df.groupby([(df.index.year), (df.index.month)], as_index=True).category.value_counts()
    counted_events.index.names = ["year", "month", "category"]
    return counted_events
which gives me the following:
year  month  category
2017  1      A           2
             B           1
      2      C           3
             A           1
The process to sum up all A's and B's would be quite manual since category becomes a part of the index in this case.
I'm an absolute pandas menace, so I'm likely making this much harder than it actually is. Can anyone give tips for how to achieve this grouping in pandas?
I tried this, so I am posting it, though I like @Scott Boston's solution better; here I combined the A and B values earlier.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
            a_or_b_count  c_count
Year Month
2017 1               3.0      0.0
     2               1.0      3.0
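A third variant, for comparison: map the categories first and cross-tabulate against a monthly period (a sketch assuming the same date/category columns as in the question):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
group = df['category'].map({'A': 'a_or_b_count', 'B': 'a_or_b_count', 'C': 'c_count'})
counts = pd.crosstab(df['date'].dt.to_period('M'), group)
print(counts)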
