Give default datetime object value to pandas.to_datetime() - python-3.x

I have some dates as strings in different formats that I convert to datetime objects using to_datetime(). However, the list of strings also contains some garbage values that I want to convert to a default date.
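import pandas as pd
import datetime as dt
# sample frame matching the output shown below
df = pd.DataFrame({'dates': ['2018-02-12', '2018-03-19', '12-24-2018', 'garbage']})
print(df)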
dates
0 2018-02-12
1 2018-03-19
2 12-24-2018
3 garbage
I use errors='coerce' to avoid raising an exception. It produces NaT, which I want to convert to a default date (2018-12-31 in my case).
df['dates'] = pd.to_datetime(df['dates'], errors='coerce')
The result is below.
dates
0 2018-02-12
1 2018-03-19
2 2018-12-24
3 NaT
Approach:
I check whether the given value is a valid datetime. If it is not, I substitute the default datetime object. But for some reason, this produces all default values.
df['dates'].apply(lambda x: dt.datetime(2018,12,31) if x is not dt.datetime else x)
Current Output
dates
0 2018-12-31
1 2018-12-31
2 2018-12-31
3 2018-12-31
Expected Output:
dates
0 2018-02-12
1 2018-03-19
2 2018-12-24
3 2018-12-31
Is there a way to give a default date to the to_datetime() function so that it won't produce NaT? If not, how do I fill in default dates afterwards?

You just need to add fillna after the pd.to_datetime call:
pd.to_datetime(df['dates'], errors='coerce').fillna(pd.to_datetime('2018-12-31'))
Out[217]:
0 2018-02-12
1 2018-03-19
2 2018-12-24
3 2018-12-31
Name: dates, dtype: datetime64[ns]
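The fillna approach above can be sketched end to end as follows. This is only a minimal sketch: the sample Series `raw` and the name `default` are made up for illustration, and the strings deliberately share one format, since newer pandas versions infer a single format from the first element and, as far as I know, need format='mixed' to handle mixed-format input.

```python
import pandas as pd

# Hypothetical sample data: two valid dates, one garbage string, one missing value.
raw = pd.Series(["2018-02-12", "2018-03-19", "garbage", None], name="dates")

# Default to substitute wherever parsing fails.
default = pd.Timestamp("2018-12-31")

# errors="coerce" turns unparseable entries into NaT instead of raising;
# fillna then replaces every NaT with the default.
parsed = pd.to_datetime(raw, errors="coerce").fillna(default)

print(parsed)
```

The lambda in the question returns the default for every row because `x is not dt.datetime` compares each value with the datetime class itself, which is always true. A missing-value check such as pd.isna(x) is what that condition needs.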

Related

Determining correlation for datetime between two time series.ValueError: could not convert string to float: [duplicate]

I have a column I_DATE of type string (object) in a dataframe called train, as shown below.
I_DATE
28-03-2012 2:15:00 PM
28-03-2012 2:17:28 PM
28-03-2012 2:50:50 PM
How do I convert I_DATE from string to datetime format and specify the format of the input string?
Also, how to filter rows based on a range of dates in pandas?
Use to_datetime. There is no need for a format string since the parser is able to handle it:
In [51]:
pd.to_datetime(df['I_DATE'])
Out[51]:
0 2012-03-28 14:15:00
1 2012-03-28 14:17:28
2 2012-03-28 14:50:50
Name: I_DATE, dtype: datetime64[ns]
To access the date/day/time component use the dt accessor:
In [54]:
df['I_DATE'].dt.date
Out[54]:
0 2012-03-28
1 2012-03-28
2 2012-03-28
dtype: object
In [56]:
df['I_DATE'].dt.time
Out[56]:
0 14:15:00
1 14:17:28
2 14:50:50
dtype: object
You can use strings to filter, for example:
In [59]:
df = pd.DataFrame({'date':pd.date_range(start = dt.datetime(2015,1,1), end = dt.datetime.now())})
df[(df['date'] > '2015-02-04') & (df['date'] < '2015-02-10')]
Out[59]:
date
35 2015-02-05
36 2015-02-06
37 2015-02-07
38 2015-02-08
39 2015-02-09
Approach: 1
Given original string format: 2019/03/04 00:08:48
you can use
updated_df = df['timestamp'].astype('datetime64[ns]')
The result will be in this datetime format: 2019-03-04 00:08:48
Approach: 2
updated_df = df.astype({'timestamp':'datetime64[ns]'})
For a datetime in AM/PM format, the time format is '%I:%M:%S %p'. See all possible format combinations at https://strftime.org/. N.B. If the strings have a time component, as in the OP, the conversion will be much faster if you pass format=.
df['I_DATE'] = pd.to_datetime(df['I_DATE'], format='%d-%m-%Y %I:%M:%S %p')
To filter a datetime using a range, you can use query:
df = pd.DataFrame({'date': pd.date_range('2015-01-01', '2015-04-01')})
df.query("'2015-02-04' < date < '2015-02-10'")
or use between to create a mask and filter.
df[df['date'].between('2015-02-04', '2015-02-10')]
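Putting the explicit format and the range filter together, a small self-contained sketch might look like this. The frame reuses the three sample strings from the question; the names `mask` and `subset` and the range bounds are chosen only for illustration.

```python
import pandas as pd

# The three sample strings from the question.
train = pd.DataFrame({
    "I_DATE": [
        "28-03-2012 2:15:00 PM",
        "28-03-2012 2:17:28 PM",
        "28-03-2012 2:50:50 PM",
    ]
})

# Parse with an explicit day-first, 12-hour-clock format.
train["I_DATE"] = pd.to_datetime(train["I_DATE"], format="%d-%m-%Y %I:%M:%S %p")

# between() is inclusive on both ends by default; the bounds here are arbitrary.
mask = train["I_DATE"].between("2012-03-28 14:16:00", "2012-03-28 14:30:00")
subset = train[mask]

print(subset)
```

Passing the format explicitly also removes any ambiguity between day-first and month-first dates.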

How do I use the .dt.hour accessor to get hours from a datetime object?

I have a dataframe that I'm trying to separate into hour and day, so I can use the "hour of day" (1,2,3,...,22,23,24) as ID variables for a project.
I'm having trouble applying .dt.hour to my date column, and it spits out:
AttributeError: Can only use .dt accessor with datetimelike values
Currently, my dateformat is:
YYYY-MM-DD HH:MM:SS+00:00, and I'm assuming the error is in the 00:00
Here is a sample of the dataframe:
date btc_open btc_close
0 2021-01-01 00:00:00+00:00 28905.984003808422 29013.059128535537
1 2021-01-01 01:00:00+00:00 29016.129189426065 29432.828723553906
2 2021-01-01 02:00:00+00:00 29436.647295100185 29212.8610969002
For reproducible code (with error message), look below.
data = pd.DataFrame({'date': ['2021-01-01 00:00:00+00:00','2021-01-01 01:00:00+00:00','2021-01-01 02:00:00+00:00'],
'btc_open': [28905.98, 29016.12, 29436.64],
'btc_close': [29013.05, 29432.82, 29212.86]})
data['date'] = pd.to_datetime(data['date'], format = '%Y-%m-%d %H:%M:%S')
df_subset_1 = data[['date','btc_open','btc_close']]
# Converting datehour to date and hour columns
df_subset_1['date'] = df_subset_1['date'].dt.date
df_subset_1['hour'] = df_subset_1['date'].dt.hour
Does anyone know how to make this work?
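The error occurs because the 'date' column is overwritten with .dt.date, which yields plain Python date objects (object dtype), so the .dt accessor no longer works on it. Keep a column of pandas datetime dtype and extract the date and hour from that (see also the pandas "Time series / date functionality" documentation). Example: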
import pandas as pd
data = pd.DataFrame({'datetime': ['2021-01-01 00:00:00+00:00','2021-01-01 01:00:00+00:00','2021-01-01 02:00:00+00:00'],
'btc_open': [28905.98, 29016.12, 29436.64],
'btc_close': [29013.05, 29432.82, 29212.86]})
data['datetime'] = pd.to_datetime(data['datetime'])
# take an explicit copy so that adding columns does not trigger SettingWithCopyWarning
df_subset_1 = data[['datetime','btc_open','btc_close']].copy()
# extract date and hour from datetime column
df_subset_1['date'] = df_subset_1['datetime'].dt.date
df_subset_1['hour'] = df_subset_1['datetime'].dt.hour
df_subset_1
datetime btc_open btc_close date hour
0 2021-01-01 00:00:00+00:00 28905.98 29013.05 2021-01-01 0
1 2021-01-01 01:00:00+00:00 29016.12 29432.82 2021-01-01 1
2 2021-01-01 02:00:00+00:00 29436.64 29212.86 2021-01-01 2
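To make the cause of the AttributeError concrete, here is a minimal sketch using the same sample timestamps. The names `subset`, `dates_only` and `failed` are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "datetime": [
        "2021-01-01 00:00:00+00:00",
        "2021-01-01 01:00:00+00:00",
        "2021-01-01 02:00:00+00:00",
    ]
})

# Let pandas parse the strings, including the +00:00 offset.
data["datetime"] = pd.to_datetime(data["datetime"])

subset = data[["datetime"]].copy()

# Extract the hour while the column still has a datetime dtype.
subset["hour"] = subset["datetime"].dt.hour

# .dt.date returns plain Python date objects, so the result has object dtype.
dates_only = subset["datetime"].dt.date

# Using .dt on that object-dtype Series raises the error from the question.
try:
    dates_only.dt.hour
    failed = False
except AttributeError:
    failed = True

print(subset)
print(dates_only.dtype, failed)
```

So the order of operations matters: anything that needs .dt has to be taken from the datetime column, not from the column of date objects.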

Pandas changing dates near each other

I have a pandas dataframe with dates and users which looks like this-
date = ['1/2/2020','1/9/2020','1/10/2020','1/17/2020','1/18/2020','1/24/2020','1/25/2020','5/17/2019','5/18/2019','5/24/2019','5/29/2019']
user =['A','B','C','B','A','A','B','C','A','A','B']
df = pd.DataFrame(data={"Date":date, "User":user})
I am trying to find all dates that are next to each other (e.g. Jan-1 and Jan-2) and convert them to a single date, so that both entries become the earlier of the two. There are over a million entries. The data comes from scan results that trigger nightly and sometimes run over into the next day.
Update-
I want to consolidate the date of the scan so that I can show the visualization properly. Right now the results have more entries on the day the scan starts but very few entries on the day the scan overflowed into. There is a primary date and time stored, so I am not losing any data. The user column is present because the scan reads a file with all the usernames, and the date column stores the date when each was scanned.
So far I was able to read the dataframe and then sort it based on the date to have the entries one after the other.
The output should look like the following -
Is there a Pythonic way of doing this?
One issue to consider is the case of multiple consecutive days and how you want to handle these. The following code sets the day to the first of the consecutive days in each block:
import pandas as pd
from datetime import timedelta
# prepend two dates to show multiple consecutive days "use-case"
date = ['12/31/2019','1/1/2020','1/2/2020','1/9/2020','1/10/2020','1/17/2020','1/18/2020','1/24/2020','1/25/2020','5/17/2019','5/18/2019','5/24/2019','5/29/2019']
user = ['Z','Z','A','B','C','B','A','A','B','C','A','A','B']
df = pd.DataFrame(data={"Date":date, "User":user})
# first convert to datetime to allow date operations
df.Date = pd.to_datetime(df.Date)
# check if the date is one day after the row before (by shifting the Date column)
df['isConsecutive'] = (df.Date == df.Date.shift()+pd.DateOffset(1))
# get number of consecutive days in each block
df['numConsecutive'] = df.isConsecutive.groupby((~df.isConsecutive).cumsum()).cumsum()
# convert to timedelta
df.numConsecutive = df.numConsecutive.apply(lambda x: timedelta(days=x))
# subtract this from Date to get the first day of each block
df['NewDate'] = df.Date - df.numConsecutive
print(df)
This returns:
Date User isConsecutive numConsecutive NewDate
0 2019-12-31 Z False 0 days 2019-12-31
1 2020-01-01 Z True 1 days 2019-12-31
2 2020-01-02 A True 2 days 2019-12-31
3 2020-01-09 B False 0 days 2020-01-09
4 2020-01-10 C True 1 days 2020-01-09
5 2020-01-17 B False 0 days 2020-01-17
6 2020-01-18 A True 1 days 2020-01-17
7 2020-01-24 A False 0 days 2020-01-24
8 2020-01-25 B True 1 days 2020-01-24
9 2019-05-17 C False 0 days 2019-05-17
10 2019-05-18 A True 1 days 2019-05-17
11 2019-05-24 A False 0 days 2019-05-24
12 2019-05-29 B False 0 days 2019-05-29
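Note that the code above relies on the row order. A variant that sorts first and then labels each run of consecutive days could be sketched like this. It is only an illustration, not the answerer's method: the shortened sample data and the names `new_run` and `run_id` are assumptions, and it treats any gap of more than one day as the start of a new block.

```python
import pandas as pd

# Shortened, hypothetical sample in the same shape as the question's data.
df = pd.DataFrame({
    "Date": ["1/2/2020", "1/9/2020", "1/10/2020", "1/17/2020", "1/18/2020"],
    "User": ["A", "B", "C", "B", "A"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Sort so that consecutive days are adjacent rows.
df = df.sort_values("Date").reset_index(drop=True)

# A new run starts whenever the gap to the previous row is more than one day.
new_run = df["Date"].diff() > pd.Timedelta(days=1)

# The cumulative sum of the flags gives every run its own integer label.
run_id = new_run.cumsum().rename("run_id")

# Every row in a run gets the earliest date of that run.
df["NewDate"] = df.groupby(run_id)["Date"].transform("min")

print(df)
```

Because everything is vectorized, this should scale to the million-plus rows mentioned in the question, though that has not been measured here.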

Parse dates and create time series from .csv

I am using a simple csv file which contains data on calorie intake. It has 4 columns: cal, day, month, year. It looks like this:
cal month year day
3668.4333 1 2002 10
3652.2498 1 2002 11
3647.8662 1 2002 12
3646.6843 1 2002 13
...
3661.9414 2 2003 14
# data types
cal float64
month int64
year int64
day int64
I am trying to do some simple time series analysis. I hence would like to parse month, year, and day to a single column. I tried the following using pandas:
import pandas as pd
from pandas import Series, DataFrame, Panel
data = pd.read_csv('time_series_calories.csv', header=0, pars_dates=['day', 'month', 'year']], date_parser=True, infer_datetime_format=True)
My questions are: (1) How do I parse the data, and (2) how do I define the data type of the new column? I know there are quite a few similar questions and answers, but I can't make it work so far.
You can use the parse_dates parameter of read_csv and pass the column names in a nested list:
import pandas as pd
import numpy as np
import io
temp=u"""cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), parse_dates=[['year','month','day']])
print (df)
year_month_day cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
print (df.dtypes)
year_month_day datetime64[ns]
cal float64
dtype: object
Then you can rename column:
df.rename(columns={'year_month_day':'date'}, inplace=True)
print (df)
date cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
Or, better, pass a dictionary with the new column name to parse_dates:
df = pd.read_csv(io.StringIO(temp), parse_dates={'dates': ['year','month','day']})
print (df)
dates cal
0 2002-01-10 3668.4333
1 2002-01-11 3652.2498
2 2002-01-12 3647.8662
3 2002-01-13 3646.6843
4 2003-02-14 3661.9414
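As far as I know, recent pandas releases have deprecated combining columns through parse_dates, so an alternative is to read the file normally and assemble the date afterwards. The sketch below assumes the same column names as the sample; `csv_text` and `calories` are illustrative names.

```python
import io

import pandas as pd

# Same sample rows as above, inlined so the example runs without a file.
csv_text = """cal,month,year,day
3668.4333,1,2002,10
3652.2498,1,2002,11
3647.8662,1,2002,12
3646.6843,1,2002,13
3661.9414,2,2003,14"""

df = pd.read_csv(io.StringIO(csv_text))

# to_datetime can assemble dates from columns named year, month and day.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Use the dates as the index to get a time series of calorie values.
calories = df.set_index("date")["cal"]

print(calories)
print(calories.index.dtype)
```

With a datetime index in place, time-based selection and resampling work directly on `calories`.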
