Mixed date formats dd/mm/yyyy and d/m/y in pandas - python-3.x

I have a dataframe with mixed date formats in a column. Some of it is in the format dd/mm/yyyy and some of it is in the format d/m/y. How can I set the column as datetime by applying the appropriate format depending on the value of the cell?
I am reading from a csv file:
DayofWeek,Date
Friday,22/05/2015
Friday,10/2/12
Friday,10/10/14
Friday,21/10/2011
Friday,8/7/11

df = pd.read_csv('dates.txt', parse_dates=['Date'], dayfirst=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
DayofWeek 5 non-null object
Date 5 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 100.0+ bytes
print(df)
DayofWeek Date
0 Friday 2015-05-22
1 Friday 2012-02-10
2 Friday 2014-10-10
3 Friday 2011-10-21
4 Friday 2011-07-08
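If the column is already in a DataFrame (or you prefer not to re-read the file), the same result can be had with pd.to_datetime. A minimal sketch: parsing each cell individually via apply lets rows in dd/mm/yyyy and d/m/y coexist, which also sidesteps the stricter whole-column format inference of newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["22/05/2015", "10/2/12", "10/10/14", "21/10/2011", "8/7/11"]})

# Parse each cell on its own so dd/mm/yyyy and d/m/y rows can use
# different formats; dayfirst=True keeps day before month throughout.
df["Date"] = df["Date"].apply(lambda s: pd.to_datetime(s, dayfirst=True))
```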

Related

How to parse correctly to date in python from specific ISO format

I connected to a table on a db where there are two columns with dates.
I had no problem to parse the column with values formatted as this: 2017-11-03
But I don't find a way to parse the other column with dates formatted as this: 2017-10-03 05:06:52.840 +02:00
My attempts
If I parse a single value through the strptime method
dt.datetime.strptime("2017-12-14 22:16:24.037 +02:00", "%Y-%m-%d %H:%M:%S.%f %z")
I get the correct output
datetime.datetime(2017, 12, 14, 22, 16, 24, 37000, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200)))
but if I try to use the same code format while parsing the table to the dataframe, the column dtype is an object:
Licenze_FromGY = pd.read_sql(query, cnxn, parse_dates={"EndDate":"%Y-%m-%d", "LastUpd":"%Y-%m-%d %H:%M:%S.%f %z"})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tenant 1000 non-null int64
1 IdService 1000 non-null object
2 Code 1000 non-null object
3 Aggregate 1000 non-null object
4 Bundle 991 non-null object
5 Status 1000 non-null object
6 Value 1000 non-null int64
7 EndDate 258 non-null datetime64[ns]
8 Trial 1000 non-null bool
9 LastUpd 1000 non-null object
I also tried to change the code format either in the read_sql method or in the pd.to_datetime() method, but then all the values become NaT:
Licenze_FromGY["LastUpd"] = pd.to_datetime(Licenze_FromGY["LastUpd"], format="%Y-%m-%d %H:%M:%S.%fZ", errors="coerce")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tenant 1000 non-null int64
1 IdService 1000 non-null object
2 Code 1000 non-null object
3 Aggregate 1000 non-null object
4 Bundle 991 non-null object
5 Status 1000 non-null object
6 Value 1000 non-null int64
7 EndDate 258 non-null datetime64[ns]
8 Trial 1000 non-null bool
9 LastUpd 0 non-null datetime64[ns]
dtypes: bool(1), datetime64[ns](2), int64(2), object(5)
memory usage: 71.4+ KB
None
Can anyone help?
pandas cannot handle mixed UTC offsets in one Series (column). Assuming you have data like this
import pandas as pd
df = pd.DataFrame({"datetime": ["2017-12-14 22:16:24.037 +02:00",
"2018-08-14 22:16:24.037 +03:00"]})
if you just parse to datetime,
df["datetime"] = pd.to_datetime(df["datetime"])
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: object
you get dtype object. The elements of the Series are Python datetime.datetime objects, which limits the datetime functionality compared to the pandas datetime64 dtypes.
You can get a proper datetime64 dtype, e.g., by parsing to UTC:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True)
df["datetime"]
0 2017-12-14 20:16:24.037000+00:00
1 2018-08-14 19:16:24.037000+00:00
Name: datetime, dtype: datetime64[ns, UTC]
You might set an appropriate time zone to re-create the UTC offset:
df["datetime"] = pd.to_datetime(df["datetime"], utc=True).dt.tz_convert("Europe/Athens")
df["datetime"]
0 2017-12-14 22:16:24.037000+02:00
1 2018-08-14 22:16:24.037000+03:00
Name: datetime, dtype: datetime64[ns, Europe/Athens]

How to read in unusual date/time format

I have a small df with a date/time column using a format I have never seen.
Pandas reads it in as an object even if I use parse_dates, and to_datetime() chokes on it.
The dates in the column are formatted as such:
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
I think the best approach is using a date parsing pattern. Something like this:
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
But I simply do not know how to approach this format.
The datetime format code for a UTC offset (%z) is very specific.
strftime() and strptime() Format Codes
The offset must be a + or - sign followed by a zero-padded HH:MM value, e.g. +08:00, -08:00, +10:00, or -10:00.
Use str.zfill to backfill the 0 between the sign and a single-digit hour.
import pandas as pd
# sample data
df = pd.DataFrame({'datetime': ['2019/12/29 GMT+8 18:00', '2019/12/15 GMT+8 05:00', '2019/12/15 GMT+10 05:00', '2019/12/15 GMT-10 05:00']})
# display(df)
datetime
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
2019/12/15 GMT+10 05:00
2019/12/15 GMT-10 05:00
# fix the format
df.datetime = df.datetime.str.split(' ').apply(lambda x: x[0] + x[2] + x[1][3:].zfill(3) + ':00')
# convert to a utc datetime
df.datetime = pd.to_datetime(df.datetime, format='%Y/%m/%d%H:%M%z', utc=True)
# display(df)
datetime
2019-12-29 10:00:00+00:00
2019-12-14 21:00:00+00:00
2019-12-14 19:00:00+00:00
2019-12-15 15:00:00+00:00
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 4 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 160.0 bytes
You could pass the custom format with GMT+8 in the middle and then subtract eight hours with timedelta(hours=8):
import pandas as pd
from datetime import datetime, timedelta
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d GMT+8 %H:%M') - timedelta(hours=8)
df
Date
0 2019-12-29 10:00:00
1 2019-12-14 21:00:00
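An alternative to splitting by hand (a sketch on the same sample values): rewrite 'GMT+8' into a %z-parsable offset with a regex substitution, then parse with an explicit format.

```python
import pandas as pd

df = pd.DataFrame({"datetime": ["2019/12/29 GMT+8 18:00", "2019/12/15 GMT-10 05:00"]})

# Move the offset to the end and zero-pad it: 'GMT+8' -> '+08:00'
fixed = df["datetime"].str.replace(
    r"^(\S+) GMT([+-]\d+) (\S+)$",
    lambda m: f"{m.group(1)} {m.group(3)} {m.group(2).zfill(3)}:00",
    regex=True,
)
df["datetime"] = pd.to_datetime(fixed, format="%Y/%m/%d %H:%M %z", utc=True)
```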

how to merge month day year columns in date column?

The date is in separate columns
Month Day Year
8 12 1993
8 12 1993
8 12 1993
I want to merge it in one column
Date
8/12/1993
8/12/1993
8/12/1993
I tried
df_date = df.Timestamp((df_filtered.Year*10000+df_filtered.Month*100+df_filtered.Day).apply(str),format='%Y%m%d')
I get this error
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
Using pd.to_datetime with astype(str)
Zero-pad the month and day with str.zfill(2) so the concatenated string is unambiguous, and keep the concatenation order in sync with the format string (%m%d%Y here):
1. as string type:
df['Date'] = pd.to_datetime(df['Month'].astype(str).str.zfill(2) + df['Day'].astype(str).str.zfill(2) + df['Year'].astype(str), format='%m%d%Y').dt.strftime('%m/%d/%Y')
Month Day Year Date
0 8 12 1993 08/12/1993
1 8 12 1993 08/12/1993
2 8 12 1993 08/12/1993
2. as datetime type:
df['Date'] = pd.to_datetime(df['Month'].astype(str).str.zfill(2) + df['Day'].astype(str).str.zfill(2) + df['Year'].astype(str), format='%m%d%Y')
Month Day Year Date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
Here is the solution:
df = pd.DataFrame({'Month': [8, 8, 8], 'Day': [12, 12, 12], 'Year': [1993, 1993, 1993]})
# This way dates will be a DataFrame
dates = df.apply(lambda row:
                 # pd.Timestamp takes (year, month, day) positionally
                 pd.Series(pd.Timestamp(row[2], row[0], row[1]), index=['Date']),
                 axis=1)
# And this way dates will be a Series:
# dates = df.apply(lambda row:
#                  pd.Timestamp(row[2], row[0], row[1]),
#                  axis=1)
The apply method generates a new Series or DataFrame by iteratively applying the provided function (a lambda in this case) and joining the results.
You can read about apply method in official documentation.
And here is the explanation of lambda expressions.
EDIT:
@JohnClements suggested a better solution, using the pd.to_datetime method:
dates = pd.to_datetime(df).to_frame('Date')
Also, if you want your output to be a string, you can use
dates = df.apply(lambda row: f"{row[2]}/{row[0]}/{row[1]}",
axis=1)
You can try:
df = pd.DataFrame({'Month': [8,8,8], 'Day': [12,12,12], 'Year': [1993, 1993, 1993]})
df['date'] = pd.to_datetime(df)
Result:
Month Day Year date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
Month 3 non-null int64
Day 3 non-null int64
Year 3 non-null int64
date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 176.0 bytes
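A note on pd.to_datetime(df): it recognizes year/month/day column names case-insensitively, so when the columns are named differently you can rename them first. A small sketch with hypothetical column names M/D/Y:

```python
import pandas as pd

df = pd.DataFrame({"M": [8, 8], "D": [12, 13], "Y": [1993, 1993]})

# pd.to_datetime on a DataFrame looks for year/month/day columns,
# so map the hypothetical M/D/Y names onto them first
df["date"] = pd.to_datetime(df.rename(columns={"Y": "year", "M": "month", "D": "day"}))
```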

Convert datetime object to date and datetime2 to time then combine to single column

I have a dataset where the transaction date is stored as YYYY-MM-DD 00:00:00 and the transaction time is stored as 1900-01-01 HH:MM:SS
I need to truncate these timestamps and then either leave as is or convert to a singular timestamp. I've tried several methods and all continue to return the full timestamp. Thoughts?
Use split and pd.to_datetime:
df = pd.DataFrame({'TransDate':['2015-01-01 00:00:00','2015-01-02 00:00:00','2015-01-03 00:00:00'],
'TransTime':['1900-01-01 07:00:00','1900-01-01 08:30:00','1900-01-01 09:45:15']})
df['Date'] = (pd.to_datetime(df['TransDate'].str.split().str[0] +
' ' +
df['TransTime'].str.split().str[1]))
Output:
TransDate TransTime Date
0 2015-01-01 00:00:00 1900-01-01 07:00:00 2015-01-01 07:00:00
1 2015-01-02 00:00:00 1900-01-01 08:30:00 2015-01-02 08:30:00
2 2015-01-03 00:00:00 1900-01-01 09:45:15 2015-01-03 09:45:15
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
TransDate 3 non-null object
TransTime 3 non-null object
Date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 152.0+ bytes
None
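An alternative that avoids string splitting (a sketch on the same sample data): parse both columns as datetimes, then keep the date part of TransDate and add TransTime's offset from its own midnight.

```python
import pandas as pd

df = pd.DataFrame({'TransDate': ['2015-01-01 00:00:00', '2015-01-02 00:00:00'],
                   'TransTime': ['1900-01-01 07:00:00', '1900-01-01 08:30:00']})

date = pd.to_datetime(df['TransDate'])
time = pd.to_datetime(df['TransTime'])
# normalize() truncates to midnight, so time - time.dt.normalize()
# is the time of day as a Timedelta
df['Date'] = date.dt.normalize() + (time - time.dt.normalize())
```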

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calorie intake. It has 4 columns: cal, day, month, year. It looks like this:
cal month year day
3668.4333 1 2002 10
3652.2498 1 2002 11
3647.8662 1 2002 12
3646.6843 1 2002 13
...
3661.9414 2 2003 14
# data types
cal float64
month int64
year int64
day int64
I am trying to do some simple time series analysis. I hence would like to parse month, year, and day to a single column. I tried the following using pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=['day', 'month', 'year'], date_parser=True, infer_datetime_format=True)
My questions are: (1) how do I parse the three columns into a single date column, and (2) how do I define the data type of the new column? I know there are quite a few other similar questions and answers (see e.g. here, here and here), but I can't make it work so far.
You can use the parse_dates parameter of read_csv, where you define the column names to combine as a nested list:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
year_month_day cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
print (df.dtypes)
year_month_day datetime64[ns]
cal float64
dtype: object
Then you can rename the column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
date cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
Or, better, pass a dictionary with the new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
dates cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
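Note that newer pandas releases deprecate the nested-list and dict forms of parse_dates, so on a recent version the same result can be built after reading. A sketch:

```python
import io
import pandas as pd

temp = u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3661.9414,2,2003,14"""

df = pd.read_csv(io.StringIO(temp))
# pd.to_datetime accepts a DataFrame of year/month/day columns
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df = df.drop(columns=['year', 'month', 'day'])
```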
