When converting Date entry in pandas, how to ignore empty values? - python-3.x

I have an Excel file with some date entries, as below.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
1 Closed 2/1/2017 7:15 AM 2/1/2017 10:44 AM 2/21/2017 11:50 AM
2 Assigned 2/2/2017 2:09 PM
3 Resolved 2/8/2017 10:32 AM 9/11/2017 8:49 PM
4 Closed 8/27/2018 6:00 AM 10/15/2018 9:10 AM 10/15/2018 9:10 AM
5 Resolved 12/26/2018 3:25 PM 2/11/2019 9:08 AM
Initially I'm converting them from the above pattern to $year-$mm-$dd.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
1 Closed 2017-02-01 2017-02-01 2017-02-21
2 Assigned 2017-02-02 NaN NaN
3 Resolved 2017-02-08 2017-09-11 NaN
4 Closed 2018-08-27 2018-10-15 2018-10-15
5 Resolved 2018-12-26 2019-02-11 NaN
With these converted dates, I'm trying to extract the month and year in the format $mon $year.
I'm using the following code to extract the month and year.
df['Month Opened'] = pd.to_datetime(df["Date/Time Opened"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
When I have applied this formula with 'Date/Time Opened', I can see that it works as below.
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed Month Opened
1 Closed 2017-02-01 2017-02-01 2017-02-21 Feb 2017
2 Assigned 2017-02-02 NaN NaN Feb 2017
3 Resolved 2017-02-08 2017-09-11 NaN Feb 2017
4 Closed 2018-08-27 2018-10-15 2018-10-15 Aug 2018
5 Resolved 2018-12-26 2019-02-11 NaN Dec 2018
Here's my complete code - http://tpcg.io/X5S8Pe
import pandas as pd
import calendar
CaseDetails = {
'Case Number': [1, 2, 3, 4, 5],
'Status': ['Closed', 'Assigned', 'Resolved', 'Closed', 'Resolved'],
'Date/Time Opened': ['2/1/2017 7:15 AM', '2/2/2017 2:09 PM', '2/8/2017 10:32 AM', '8/27/2018 6:00 AM', '12/26/2018 3:25 PM'],
'Date/Time Resolved': ['2/1/2017 10:44 AM', '', '9/11/2017 8:49 PM', '10/15/2018 9:10 AM', '2/11/2019 9:08 AM'],
'Date/Time Closed': ['2/21/2017 11:50 AM', '', '', '10/15/2018 9:10 AM', '']
}
df = pd.DataFrame(CaseDetails,columns= ['Case Number', 'Status', 'Date/Time Opened', 'Date/Time Resolved', 'Date/Time Closed'])
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).dt.date
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).dt.date
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).dt.date
print (df)
df['Month Opened'] = pd.to_datetime(df["Date/Time Opened"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
print (df)
As expected, my code converted the entries under 'Date/Time Opened' to the desired format. When I tried converting the other two date columns, I got the following error.
Traceback (most recent call last):
File "main.py", line 21, in <module>
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
File "/usr/lib64/python2.7/site-packages/pandas/core/series.py", line 2158, in map
new_values = map_f(values, arg)
File "pandas/_libs/src/inference.pyx", line 1569, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)
File "main.py", line 21, in <lambda>
df['Month Closed'] = pd.to_datetime(df["Date/Time Closed"]).map(lambda x: calendar.month_abbr[x.month] + " " + str(x.year))
File "/usr/lib64/python2.7/calendar.py", line 56, in __getitem__
funcs = self._months[i]
TypeError: list indices must be integers, not float
Is there a way to convert the columns that contain empty values?

You can use Series.dt.strftime here; it works well with missing values:
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).dt.strftime('%b %Y')
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).dt.strftime('%b %Y')
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).dt.strftime('%b %Y')
An alternative is to use apply with a list of columns:
cols = ['Date/Time Opened','Date/Time Resolved','Date/Time Closed']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x).dt.strftime('%b %Y'))
print (df)
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
0 1 Closed Feb 2017 Feb 2017 Feb 2017
1 2 Assigned Feb 2017 NaT NaT
2 3 Resolved Feb 2017 Sep 2017 NaT
3 4 Closed Aug 2018 Oct 2018 Oct 2018
4 5 Resolved Dec 2018 Feb 2019 NaT
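To see why strftime copes with the empty cells, here is a minimal, self-contained sketch. It uses the 'Date/Time Closed' values from the question; the explicit format string is my assumption based on that sample data, not part of the answer. pd.to_datetime turns the empty strings into NaT, and dt.strftime leaves those entries missing instead of raising.

```python
import pandas as pd

closed = pd.Series(['2/21/2017 11:50 AM', '', '', '10/15/2018 9:10 AM', ''])

# Empty strings are parsed as NaT (missing) rather than raising an error.
# The explicit format is an assumption based on the sample data.
parsed = pd.to_datetime(closed, format='%m/%d/%Y %I:%M %p')

# strftime formats the valid timestamps and keeps missing entries missing.
month_closed = parsed.dt.strftime('%b %Y')
print(month_closed)
```

How the missing entries are displayed (NaT or NaN) can differ between pandas versions, so check for them with isna() rather than by comparing values.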
In your solution you can use the trick that a missing value is never equal to itself (np.nan != np.nan, and the same holds for NaT), so an if-else is added to your function (this needs import numpy as np):
f = lambda x: calendar.month_abbr[x.month] + " " + str(x.year) if x == x else np.nan
df['Date/Time Opened'] = pd.to_datetime(df['Date/Time Opened']).map(f)
df['Date/Time Resolved'] = pd.to_datetime(df['Date/Time Resolved']).map(f)
df['Date/Time Closed'] = pd.to_datetime(df['Date/Time Closed']).map(f)
print (df)
Case Number Status Date/Time Opened Date/Time Resolved Date/Time Closed
0 1 Closed Feb 2017 Feb 2017 Feb 2017
1 2 Assigned Feb 2017 NaN NaN
2 3 Resolved Feb 2017 Sep 2017 NaN
3 4 Closed Aug 2018 Oct 2018 Oct 2018
4 5 Resolved Dec 2018 Feb 2019 NaN
Or, alternatively:
f = lambda x: calendar.month_abbr[x.month] + " " + str(x.year) if x == x else np.nan
cols = ['Date/Time Opened','Date/Time Resolved','Date/Time Closed']
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x).map(f))
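The x == x comparison works because NaT, like NaN, is not equal to itself. If that reads as too cryptic, the same check can be spelled out with pd.isna. The sketch below is one possible variant, not the answer's code; the helper name month_year and the format string are assumptions made for illustration.

```python
import calendar

import numpy as np
import pandas as pd


def month_year(ts):
    """Return e.g. 'Feb 2017', or NaN when the timestamp is missing."""
    if pd.isna(ts):  # True for NaT, NaN and None
        return np.nan
    return calendar.month_abbr[ts.month] + ' ' + str(ts.year)


raw = pd.Series(['2/21/2017 11:50 AM', '', '10/15/2018 9:10 AM'])
# The format string is an assumption based on the question's sample data.
closed = pd.to_datetime(raw, format='%m/%d/%Y %I:%M %p')

labels = closed.map(month_year)
print(labels)
```

The original error came from NaT.month being a float NaN, which cannot index calendar.month_abbr; returning early for missing values avoids that lookup entirely.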

Related

Handle ValueError while creating date in pd

I'm reading a csv file with p, day, month, and putting it in a df. The goal is to create a date from day, month, current year, and I run into this error for the 29th of Feb:
ValueError: cannot assemble the datetimes: day is out of range for month
When this error occurs, I would like to replace the day with the day before. How can we do that? Below are a few lines of my df; the datex column at the end is what I would like to get.
p day month year datex
0 p1 29 02 2021 28Feb-2021
1 p2 18 07 2021 18Jul-2021
2 p3 12 09 2021 12Sep-2021
Right now, my code for the date is only the line below, so I get NaN where the date doesn't exist.
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
You could try something like this. Start from your existing conversion:
df['datex'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')
As you say, the invalid date becomes NaT:
p day year month datex
0 p1 29 2021 2 NaT
1 p2 18 2021 7 2021-07-18
2 p3 12 2021 9 2021-09-12
You could then handle these missing values as a special case:
df.loc[df.datex.isnull(), 'previous_day'] = df.day -1
p day year month datex previous_day
0 p1 29 2021 2 NaT 28.0
1 p2 18 2021 7 2021-07-18 NaN
2 p3 12 2021 9 2021-09-12 NaN
df.loc[df.datex.isnull(), 'datex'] = pd.to_datetime(df[['previous_day', 'year', 'month']].rename(columns={'previous_day': 'day'}))
p day year month datex previous_day
0 p1 29 2021 2 2021-02-28 28.0
1 p2 18 2021 7 2021-07-18 NaN
2 p3 12 2021 9 2021-09-12 NaN
You have to create a new day column if you want to keep day = 29 in the day column.
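A different approach, sketched here as an assumption rather than as the answer's method, is to cap each day at the number of days in its month before building the date. Note the difference in behaviour: this snaps any out-of-range day to the last valid day of the month (so 31 Feb also becomes 28 Feb), whereas subtracting one day only repairs dates that are off by exactly one.

```python
import pandas as pd

df = pd.DataFrame({
    'p': ['p1', 'p2', 'p3'],
    'day': [29, 18, 12],
    'month': [2, 7, 9],
    'year': [2021, 2021, 2021],
})

# The first day of each month always exists, so this conversion cannot fail.
first_of_month = pd.to_datetime(df[['year', 'month']].assign(day=1))

# Cap each day at the length of its month (28 for February 2021).
capped_day = df['day'].clip(upper=first_of_month.dt.days_in_month)

# Rebuild the date from the first of the month plus the capped day offset.
df['datex'] = first_of_month + pd.to_timedelta(capped_day - 1, unit='D')
print(df)
```

The original day column is left untouched, so no extra previous_day column is needed.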

Pandas: How to create DateTime index

There is a pandas DataFrame like this:
year month count
0 2014 Jan 12
1 2014 Feb 10
2 2015 Jan 12
3 2015 Feb 10
How can I create a DateTime index from 'year' and 'month', so that the result would be:
count
2014.01.31 12
2014.02.28 10
2015.01.31 12
2015.02.28 10
Use to_datetime with DataFrame.pop, which both reads and removes the columns, then add offsets.MonthEnd:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.offsets.MonthEnd()
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
Or:
dates = pd.to_datetime(df.pop('year').astype(str) + df.pop('month'), format='%Y%b')
df.index = dates + pd.to_timedelta(dates.dt.daysinmonth - 1, unit='d')
print (df)
count
2014-01-31 12
2014-02-28 10
2015-01-31 12
2015-02-28 10
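One detail worth knowing about the offset: MonthEnd() with its default n=1 works here because the parsed dates fall on the first of the month. On a date that is already a month end it would jump to the next month's end, while MonthEnd(0) leaves such a date where it is. The sketch below is a variant under that assumption, using drop instead of pop so the intermediate steps stay visible.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2014, 2014, 2015, 2015],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'count': [12, 10, 12, 10],
})

# Parse e.g. '2014Jan' to 2014-01-01.
dates = pd.to_datetime(df['year'].astype(str) + df['month'], format='%Y%b')

# MonthEnd(0) rolls forward to the month end, but leaves dates that are
# already on a month end unchanged.
df.index = dates + pd.offsets.MonthEnd(0)
df = df.drop(columns=['year', 'month'])
print(df)

# The difference between n=1 and n=0 on a date that is already a month end:
already_end = pd.Timestamp('2014-01-31')
print(already_end + pd.offsets.MonthEnd(1))  # moves to the next month end
print(already_end + pd.offsets.MonthEnd(0))  # stays on the same date
```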

How to split rows into columns in a dataframe

I have this dataframe (with dimensions 840 rows x 1 column):
0 151284 Apr 19 11:37 0-01-20200419063614
1 48054 Apr 21 12:50 0-01-20200421074934
2 187588 Apr 21 13:55 0-01-20200421085439
3 51584 Apr 21 14:37 0-01-20200421143636
4 63522 Apr 22 08:40 0-01-20200422083937
I want to convert this dataframe into a format like this:
id datetime size
151284 2020-04-19 11:37:00 0-01-20200419063614
. . .
datetime being in the format: (yyyy-mm-dd)(hr-min-sec). So basically splitting a single column into three columns and also combining date and time into a single datetime column in a standard format.
Any help is appreciated.
EDIT: output of df.columns: Index(['col'], dtype='object')
Like this (note that dateutil's parser fills in the current year when the string contains none):
In [70]: df = pd.DataFrame({'col':['151284 Apr 19 11:37 0-01-20200419063614', '48054 Apr 21 12:50 0-01-20200421074934', '187588 Apr 21 13:55 0-01-20200421085439', '51584 Apr 21 14:37 0-01-20200421143636',
    ...:                            '63522 Apr 22 08:40 0-01-20200422083937']})
In [54]: df['id'] = df.col.str.split(' ').str[0]
In [55]: df['Datetime'] = df.col.str.split(' ').str[1] + ' ' + df.col.str.split(' ').str[2] + ' ' + df.col.str.split(' ').str[3]
In [57]: df['Size'] = df.col.str.split(' ').str[-1]
In [63]: from dateutil import parser
In [65]: def format_datetime(x):
    ...:     return parser.parse(x)
    ...:
In [67]: df['Datetime'] = df.Datetime.apply(format_datetime)
In [79]: df
Out[79]:
id Datetime Size
0 151284 2020-04-19 11:37:00 0-01-20200419063614
1 48054 2020-04-21 12:50:00 0-01-20200421074934
2 187588 2020-04-21 13:55:00 0-01-20200421085439
3 51584 2020-04-21 14:37:00 0-01-20200421143636
4 63522 2020-04-22 08:40:00 0-01-20200422083937
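The repeated str.split calls can be collapsed into a single split with expand=True, and the year can be stated explicitly instead of relying on the parser's default. This is a sketch under assumptions: ASSUMED_YEAR is a hypothetical constant (the source rows carry no year in the date part), and the column names follow the question's requested layout.

```python
import pandas as pd

df = pd.DataFrame({'col': [
    '151284 Apr 19 11:37 0-01-20200419063614',
    '48054 Apr 21 12:50 0-01-20200421074934',
]})

# Split once on whitespace; each piece becomes its own column.
parts = df['col'].str.split(expand=True)
parts.columns = ['id', 'month', 'day', 'time', 'size']

# Hypothetical: the rows have no year in their date part, so one is assumed.
ASSUMED_YEAR = 2020
stamp = str(ASSUMED_YEAR) + ' ' + parts['month'] + ' ' + parts['day'] + ' ' + parts['time']

out = pd.DataFrame({
    'id': parts['id'],
    'datetime': pd.to_datetime(stamp, format='%Y %b %d %H:%M'),
    'size': parts['size'],
})
print(out)
```

Passing an explicit format also makes the conversion fail loudly on rows that do not match, which helps when checking all 840 rows.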

How to run a script on x axis of plots in matplotlib [duplicate]

I want to transform an integer between 1 and 12 into an abbreviated month name.
I have a df which looks like:
client Month
1 sss 02
2 yyy 12
3 www 06
I want the df to look like this:
client Month
1 sss Feb
2 yyy Dec
3 www Jun
Most of the info I found was not in python>pandas>dataframe hence the question.
You can do this by combining calendar.month_abbr with df[col].apply(). The int() call is needed when the column holds zero-padded strings such as '02', because calendar.month_abbr only accepts integer indices:
import calendar
df['Month'] = df['Month'].apply(lambda x: calendar.month_abbr[int(x)])
Since the abbreviated month names are the first three letters of their full names, we could first convert the Month column to datetime, then use dt.month_name() to get the full month name, and finally use the str.slice() method to get the first three letters, all using pandas and in only one line of code:
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.month_name().str.slice(stop=3)
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
The calendar module is useful, but calendar.month_abbr is array-like: it cannot be used directly in a vectorised fashion. For an efficient mapping, you can construct a dictionary and then use pd.Series.map:
import calendar
d = dict(enumerate(calendar.month_abbr))
df['Month'] = df['Month'].map(d)
Performance benchmarking shows a ~130x performance differential:
import calendar
import numpy as np
import pandas as pd
d = dict(enumerate(calendar.month_abbr))
mapper = calendar.month_abbr.__getitem__
np.random.seed(0)
n = 10**5
df = pd.DataFrame({'A': np.random.randint(1, 13, n)})
%timeit df['A'].map(d) # 7.29 ms per loop
%timeit df['A'].map(mapper) # 946 ms per loop
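The dictionary above is keyed by integers, while the question's Month column holds zero-padded strings such as '02'. A minimal sketch of bridging the two, assuming the column contains only valid month numbers (the name month_lookup is my own):

```python
import calendar

import pandas as pd

df = pd.DataFrame({'client': ['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})

# Integer-keyed lookup: {1: 'Jan', ..., 12: 'Dec'}.
month_lookup = {i: calendar.month_abbr[i] for i in range(1, 13)}

# Convert '02' -> 2 first; mapping the raw strings would find no matching
# keys and silently produce NaN for every row.
df['Month'] = df['Month'].astype(int).map(month_lookup)
print(df)
```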
Solution 1: One liner
df['Month'] = pd.to_datetime(df['Month'], format='%m').dt.strftime('%b')
Solution 2: Using apply() (the values must be datetimes before strftime can be called)
def mapper(month):
    return month.strftime('%b')
df['Month'] = pd.to_datetime(df['Month'], format='%m').apply(mapper)
Reference:
http://strftime.org/
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
Using datetime object methods
I'm surprised this question doesn't have an answer using strftime.
Note: you'll need a valid datetime column before using the strftime method; use pd.to_datetime(df['date_column']) to cast your target column to datetime.
import pandas as pd
dates = pd.date_range('01-Jan 2020','01-Jan 2021',freq='M')
df = pd.DataFrame({'dates' : dates})
df['month_name'] = df['dates'].dt.strftime('%b')
dates month_name
0 2020-01-31 Jan
1 2020-02-29 Feb
2 2020-03-31 Mar
3 2020-04-30 Apr
4 2020-05-31 May
5 2020-06-30 Jun
6 2020-07-31 Jul
7 2020-08-31 Aug
8 2020-09-30 Sep
9 2020-10-31 Oct
10 2020-11-30 Nov
11 2020-12-31 Dec
Another method would be to slice the name returned by dt.month_name():
df['month_name_str_slice'] = df['dates'].dt.month_name().str[:3]
dates month_name month_name_str_slice
0 2020-01-31 Jan Jan
1 2020-02-29 Feb Feb
2 2020-03-31 Mar Mar
3 2020-04-30 Apr Apr
4 2020-05-31 May May
5 2020-06-30 Jun Jun
6 2020-07-31 Jul Jul
7 2020-08-31 Aug Aug
8 2020-09-30 Sep Sep
9 2020-10-31 Oct Oct
10 2020-11-30 Nov Nov
11 2020-12-31 Dec Dec
You can do this easily with a column apply.
import pandas as pd
df = pd.DataFrame({'client':['sss', 'yyy', 'www'], 'Month': ['02', '12', '06']})
look_up = {'01': 'Jan', '02': 'Feb', '03': 'Mar', '04': 'Apr', '05': 'May',
'06': 'Jun', '07': 'Jul', '08': 'Aug', '09': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
df['Month'] = df['Month'].apply(lambda x: look_up[x])
df
Month client
0 Feb sss
1 Dec yyy
2 Jun www
One way of doing that is with the apply method of the dataframe but, to do that, you need a map to convert the months. You could do that either with a function / dictionary or with Python's own datetime.
With datetime it would be something like:
import datetime
def mapper(month):
    date = datetime.datetime(2000, int(month), 1)  # You need a date object with the proper month
    return date.strftime('%b')  # %b returns the month's abbreviation
df['Month'].apply(mapper)
In a similar way, you could build your own map for custom names. It would look like this:
months_map = {1: 'Jan', 2: 'Feb'}  # integer keys; a literal like 01 is a syntax error in Python 3
def mapper(month):
    return months_map[month]
Obviously, you don't need to define these functions explicitly and could use a lambda directly in the apply method.
For the reverse direction (abbreviated month name to month number), use strptime and a lambda function:
from time import strptime
df['Month'] = df['Month'].apply(lambda x: strptime(x,'%b').tm_mon)
Suppose we have a DataFrame like this, where the date index is already in datetime format:
df.head(3)
value
date
2016-05-19 19736
2016-05-26 18060
2016-05-27 19997
Then we can extract month number and month name easily like this :
df['month_num'] = df.index.month
df['month'] = df.index.month_name()
value year month_num month
date
2017-01-06 37353 2017 1 January
2019-01-06 94108 2019 1 January
2019-01-05 77897 2019 1 January
2019-01-04 94514 2019 1 January
Having tested all of these on a large dataset, I have found the following to be fastest:
import calendar
def month_mapping():
    # I'm lazy so I have a stash of functions already written so
    # I don't have to write them out every time. This returns the
    # {1:'Jan'....12:'Dec'} dict in the laziest way...
    abbrevs = {}
    for month in range(1, 13):
        abbrevs[month] = calendar.month_abbr[month]
    return abbrevs
abbrevs = month_mapping()
df['Month Abbrev'] = df['Date Col'].dt.month.map(abbrevs)
You can use the pandas month_name() function. Example:
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
dtype='datetime64[ns]', freq='M')
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')
The best way would be to use month_name(), as suggested by Nurul Akter Towhid (the column must already be of datetime dtype):
df['Month'] = df.Month.dt.month_name()
First you need to strip the leading "0" (otherwise you might get the exception: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers)
Step 1:
def func(i):
    if i[0] == '0':
        i = i[1]
    return i
df["Month"] = df["Month"].apply(func)
Step 2:
import calendar
# calendar.month_name is indexed with an integer, not called like a function
df["Month"] = df["Month"].apply(lambda x: calendar.month_name[int(x)])

How to combine multiple columns in a Data Frame to Pandas datetime format

I have a pandas data frame with values as below
ProcessID1 UserID Date Month Year Time
248 Tony 29 4 2017 23:30:56
436 Jeff 28 4 2017 20:02:19
500 Greg 4 5 2017 11:48:29
Is there any way I can combine the Date, Month, Year and Time columns into a pandas datetime?
Use to_datetime, which automatically converts the columns Day, Month, Year, and add the times converted with to_timedelta:
df['Datetime'] = pd.to_datetime(df.rename(columns={'Date':'Day'})[['Day','Month','Year']]) + \
pd.to_timedelta(df['Time'])
Other solutions join all the columns, converted to strings, first:
df['Datetime'] = pd.to_datetime(df[['Date','Month','Year', 'Time']]
.astype(str).apply(' '.join, 1), format='%d %m %Y %H:%M:%S')
df['Datetime'] = (pd.to_datetime(df['Year'].astype(str) + '-' +
df['Month'].astype(str) + '-' +
df['Date'].astype(str) + ' ' +
df['Time']))
print (df)
ProcessID1 UserID Date Month Year Time Datetime
0 248 Tony 29 4 2017 23:30:56 2017-04-29 23:30:56
1 436 Jeff 28 4 2017 20:02:19 2017-04-28 20:02:19
2 500 Greg 4 5 2017 11:48:29 2017-05-04 11:48:29
Finally, if you need to remove these columns:
df = df.drop(['Date','Month','Year', 'Time'], axis=1)
print (df)
ProcessID1 UserID Datetime
0 248 Tony 2017-04-29 23:30:56
1 436 Jeff 2017-04-28 20:02:19
2 500 Greg 2017-05-04 11:48:29
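For reference, the first approach can be run end to end as below. This is a self-contained sketch that rebuilds the question's sample frame by hand; the intermediate names date_part and time_part are my own. It relies on to_datetime accepting a frame whose columns are named year, month and day (matched case-insensitively), which is why Date is renamed to Day.

```python
import pandas as pd

df = pd.DataFrame({
    'ProcessID1': [248, 436, 500],
    'UserID': ['Tony', 'Jeff', 'Greg'],
    'Date': [29, 28, 4],
    'Month': [4, 4, 5],
    'Year': [2017, 2017, 2017],
    'Time': ['23:30:56', '20:02:19', '11:48:29'],
})

# to_datetime assembles dates from columns named year/month/day.
date_part = pd.to_datetime(
    df[['Year', 'Month', 'Date']].rename(columns={'Date': 'Day'})
)

# 'HH:MM:SS' strings become time offsets from midnight.
time_part = pd.to_timedelta(df['Time'])

df['Datetime'] = date_part + time_part
print(df[['ProcessID1', 'UserID', 'Datetime']])
```

Working with numeric columns plus a timedelta avoids building and re-parsing strings for every row, which is the main design difference from the string-joining alternatives.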
Concatenate the columns together to a string format and use pd.to_datetime to convert to datetime.
import pandas as pd
import io
txt = """
ProcessID1 UserID Date Month Year Time
248 Tony 29 4 2017 23:30:56
436 Jeff 28 4 2017 20:02:19
500 Greg 4 5 2017 11:48:29
"""
df = pd.read_csv(io.StringIO(txt), sep="[\t ,]+")
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) \
+ '-' + df['Month'].astype(str) \
+ '-' + df['Year'].astype(str) \
+ ' ' + df['Time'],
format='%d-%m-%Y %H:%M:%S')
df
import pandas as pd
You can also do this by using apply() method:-
df['Datetime']=df[['Year','Month','Date']].astype(str).apply('-'.join,1)+' '+df['Time']
Finally convert 'Datetime' to datetime dtype by using pandas to_datetime() method:-
df['Datetime']=pd.to_datetime(df['Datetime'])
Output of df:
ProcessID1 UserID Date Month Year Time Datetime
0 248 Tony 29 4 2017 23:30:56 2017-04-29 23:30:56
1 436 Jeff 28 4 2017 20:02:19 2017-04-28 20:02:19
2 500 Greg 4 5 2017 11:48:29 2017-05-04 11:48:29
Now if you want to remove 'Date','Month','Year' and 'Time' column then use:-
df=df.drop(columns=['Date','Month','Year', 'Time'])

Resources