read date time column into pandas dataframe. retain seconds information in the dataframe - python-3.x

My csv file.
Timestamp
---------------------
1/4/2019 2:00:09 PM
1/4/2019 2:00:18 PM
I have a date-time column in a csv file. I want to read it as a timestamp column into a pandas dataframe and retain the seconds information.
Effort 1:
I tried
def dateparse(timestamp):
    return pd.datetime.strptime(timestamp, '%m/%d/%Y %H:%M:%S')
df = pd.read_csv('file_name.csv', parse_dates=['Timestamp'], date_parser=dateparse)
Above rounds off the seconds to something like
1/4/2019 2:00:00
Effort 2:
I thought of reading the entire file line by line and later converting it into a dataframe.
with open('file name.csv') as f:
    for line in f:
        print(line)
But again here seconds information is rounded off.
edit 1:
The seconds info is truncated when I open this csv file in editors like sublime.

For me it works to omit date_parser=dateparse:
import pandas as pd
temp=u"""Timestamp1
1/4/2019 2:00:09 PM
1/4/2019 2:00:18 PM"""
#after testing replace 'pd.compat.StringIO(temp)' with 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), parse_dates=['Timestamp1'])
print (df)
Timestamp1
0 2019-01-04 14:00:09
1 2019-01-04 14:00:18
print (df.dtypes)
Timestamp1 datetime64[ns]
dtype: object
EDIT1:
The datetimes use a 12-hour clock with an AM/PM marker, so the format string should be changed to use %I and %p:
import pandas as pd
def dateparse(timestamp):
    return pd.datetime.strptime(timestamp, '%m/%d/%Y %I:%M:%S %p')
temp=u"""Timestamp1
1/4/2019 2:00:09 AM
1/4/2019 2:00:09 PM
1/4/2019 2:00:18 PM"""
#after testing replace 'pd.compat.StringIO(temp)' with 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), parse_dates=['Timestamp1'],date_parser=dateparse)
print (df)
Timestamp1
0 2019-01-04 02:00:09
1 2019-01-04 14:00:09
2 2019-01-04 14:00:18
print (df.dtypes)
Timestamp1 datetime64[ns]
dtype: object
EDIT2:
df = pd.read_csv('send1.csv', parse_dates=['Timestamp'])
print (df)
Timestamp
0 2019-01-04 14:00:00
1 2019-01-04 14:00:00
2 2019-01-04 14:00:00
3 2019-01-04 14:00:00
4 2019-01-04 14:00:00
5 2019-01-04 14:00:00

Related

How to set datetime format for pandas dataframe column labels?

IPNI_RNC PATHID 2020-11-11 00:00:00 2020-11-12 00:00:00 2020-11-13 00:00:00 2020-11-14 00:00:00 2020-11-15 00:00:00 2020-11-16 00:00:00 2020-11-17 00:00:00 Last Day Violation Count
Above are the column labels after reading the excel file. There are 10 columns in the df variable after reading the excel file, and 7 of the column labels are dates.
My input data set is an excel file which changes every day, and I want to update it automatically. In excel, some column labels are dates like 11-Nov-2020, 12-Nov-2020, but after reading the excel file they become 2020-11-11 00:00:00, 2020-11-12 00:00:00. I want to keep the column labels as 11-Nov-2020, 12-Nov-2020 while reading the excel file with pd.read_excel if possible, or I need to convert them later.
I am very new to python. Looking forward to your support.
Thanks to those who have already come forward to help me.
You can of course use the standard python methods to parse the date values, but I would not recommend it, because this way you end up with python datetime objects and not with the pandas representation of dates. That means it consumes more space, is probably not as efficient, and you can't use the pandas methods to access e.g. the year. I'll show you what I mean below.
In case you want to avoid the naming issue of your column names, you might want to prevent pandas from automatically assigning the names, read the first line as data, and fix it yourself (see the section below about how you can do it).
The type conversion part:
# create a test setup with a small dataframe
import pandas as pd
from datetime import date, datetime, timedelta
df= pd.DataFrame(dict(id=range(10), date_string=[str(datetime.now()+ timedelta(days=d)) for d in range(10)]))
# test the python way:
df['date_val_python']= df['date_string'].map(lambda dt: datetime.fromisoformat(dt))
# use the pandas way: (btw. if you want to explicitly
# specify the format, you can use the format= keyword)
df['date_val_pandas']= pd.to_datetime(df['date_string'])
df.dtypes
The output is:
id int64
date_string object
date_val_python object
date_val_pandas datetime64[ns]
dtype: object
As you can see, date_val_python has type object; this is because it contains python objects of class datetime, while date_val_pandas uses the internal datetime representation of pandas. You can now try:
df['date_val_pandas'].dt.year
# this will return a series with the year part of the date
df['date_val_python'].dt.year
# this will result in the following error:
AttributeError: Can only use .dt accessor with datetimelike values
See the pandas doc for to_datetime for more details.
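For completeness, a minimal sketch of the format= keyword mentioned in the comment above (the sample values are made up):

```python
import pandas as pd

s = pd.Series(["04.05.2018", "16.05.2018"])
# an explicit format avoids guessing and is faster on large columns
dt = pd.to_datetime(s, format="%d.%m.%Y")
print(dt.dt.year.tolist())  # [2018, 2018]
```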
The column naming part:
# read your dataframe as usual
df= pd.read_excel('c:/scratch/tmp/dates.xlsx')
rename_dict= dict()
for old_name in df.columns:
    if hasattr(old_name, 'strftime'):
        new_name= old_name.strftime('%d-%b-%Y')
        rename_dict[old_name]= new_name
if len(rename_dict) > 0:
    df.rename(columns=rename_dict, inplace=True)
This works in case your column titles are stored as actual dates, which I suppose is true because you get a time part after importing them.
strftime of the datetime module is the function you need:
If datetime is a datetime object, you can do
datetime.strftime("%d-%b-%Y")
Example:
>>> from datetime import datetime
>>> timestamp = 1528797322
>>> date_time = datetime.fromtimestamp(timestamp)
>>> print(date_time)
2018-06-12 11:55:22
>>> print(date_time.strftime("%d-%b-%Y"))
12-Jun-2018
In order to apply a function to certain dataframe columns, use:
datetime_cols_list = ['datetime_col1', 'datetime_col2', ...]
for col in dataframe.columns:
    if col in datetime_cols_list:
        dataframe[col] = dataframe[col].apply(lambda x: x.strftime("%d-%b-%Y"))
I am sure this can be done in multiple ways in pandas, this is just what came out the top of my head.
Example:
import pandas as pd
import numpy as np
np.random.seed(0)
# generate some random datetime values
rng = pd.date_range('2015-02-24', periods=5, freq='T')
other_dt_col = pd.date_range('2016-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Date': rng, 'Date2': other_dt_col, 'Val': np.random.randn(len(rng)) })
print (df)
# Output:
#                  Date               Date2       Val
# 0 2015-02-24 00:00:00 2016-02-24 00:00:00  1.764052
# 1 2015-02-24 00:01:00 2016-02-24 00:01:00  0.400157
# 2 2015-02-24 00:02:00 2016-02-24 00:02:00  0.978738
# 3 2015-02-24 00:03:00 2016-02-24 00:03:00  2.240893
# 4 2015-02-24 00:04:00 2016-02-24 00:04:00  1.867558
datetime_cols_list = ['Date', 'Date2']
for col in df.columns:
    if col in datetime_cols_list:
        df[col] = df[col].apply(lambda x: x.strftime("%d-%b-%Y"))
print (df)
# Output:
#           Date        Date2       Val
# 0  24-Feb-2015  24-Feb-2016  1.764052
# 1  24-Feb-2015  24-Feb-2016  0.400157
# 2  24-Feb-2015  24-Feb-2016  0.978738
# 3  24-Feb-2015  24-Feb-2016  2.240893
# 4  24-Feb-2015  24-Feb-2016  1.867558
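A vectorized alternative (a sketch, assuming the columns already hold datetime64 values) uses the .dt accessor instead of a row-wise apply:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.date_range("2016-02-24", periods=3, freq="T")})
# .dt.strftime formats the whole column at once and returns strings
df["Date"] = df["Date"].dt.strftime("%d-%b-%Y")
print(df["Date"].tolist())  # ['24-Feb-2016', '24-Feb-2016', '24-Feb-2016']
```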

How to read in an unusual date/time format

I have a small df with a date/time column using a format I have never seen.
Pandas reads it in as an object even if I use parse_dates, and to_datetime() chokes on it.
The dates in the column are formatted as such:
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
I think the best approach is using a date parsing pattern. Something like this:
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
But I simply do not know how to approach this format.
The datetime format code for a UTC offset, %z, is very specific.
strftime() and strptime() Format Codes
The offset must be + or - followed by a zero-padded value like 00:00.
Use str.zfill to backfill the 0s between the sign and the integer:
+08:00 or -08:00 or +10:00 or -10:00
import pandas as pd
# sample data
df = pd.DataFrame({'datetime': ['2019/12/29 GMT+8 18:00', '2019/12/15 GMT+8 05:00', '2019/12/15 GMT+10 05:00', '2019/12/15 GMT-10 05:00']})
# display(df)
datetime
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
2019/12/15 GMT+10 05:00
2019/12/15 GMT-10 05:00
# fix the format
df.datetime = df.datetime.str.split(' ').apply(lambda x: x[0] + x[2] + x[1][3:].zfill(3) + ':00')
# convert to a utc datetime
df.datetime = pd.to_datetime(df.datetime, format='%Y/%m/%d%H:%M%z', utc=True)
# display(df)
datetime
2019-12-29 10:00:00+00:00
2019-12-14 21:00:00+00:00
2019-12-14 19:00:00+00:00
2019-12-15 15:00:00+00:00
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   datetime  4 non-null      datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 160.0 bytes
You could pass the custom format with GMT+8 in the middle and then subtract eight hours with timedelta(hours=8):
import pandas as pd
from datetime import datetime, timedelta
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d GMT+8 %H:%M') - timedelta(hours=8)
df
Date
0 2019-12-29 10:00:00
1 2019-12-14 21:00:00
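If the offsets can vary row by row, a stdlib sketch that rewrites the GMT token into a %z-compatible offset also works (parse_gmt is a made-up helper name):

```python
import re
from datetime import datetime, timezone

def parse_gmt(s):
    # ' GMT+8 ' -> '+0800 ', so strptime's %z can consume the offset
    fixed = re.sub(r"\sGMT([+-])(\d{1,2})\s",
                   lambda m: f"{m.group(1)}{int(m.group(2)):02d}00 ", s)
    return datetime.strptime(fixed, "%Y/%m/%d%z %H:%M")

utc = parse_gmt("2019/12/29 GMT+8 18:00").astimezone(timezone.utc)
print(utc)  # 2019-12-29 10:00:00+00:00
```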

Check whether a certain datetime value is missing in a given period

I have a df with DateTime index as follows:
DateTime
2017-01-02 15:00:00
2017-01-02 16:00:00
2017-01-02 18:00:00
....
....
2019-12-07 22:00:00
2019-12-07 23:00:00
Now, I want to know whether any time is missing in the 1-hour interval. For instance, the 3rd reading skips an hour, since we went from 16:00 to 18:00. Is it possible to detect this?
Create a date_range from the minimal to the maximal datetime and filter values by Index.isin, using boolean indexing with ~ to invert the mask:
print (df)
DateTime
0 2017-01-02 15:00:00
1 2017-01-02 16:00:00
2 2017-01-02 18:00:00
r = pd.date_range(df['DateTime'].min(), df['DateTime'].max(), freq='H')
print (r)
DatetimeIndex(['2017-01-02 15:00:00', '2017-01-02 16:00:00',
'2017-01-02 17:00:00', '2017-01-02 18:00:00'],
dtype='datetime64[ns]', freq='H')
out = r[~r.isin(df['DateTime'])]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', freq='H')
Another idea is to create a DatetimeIndex with a helper column, change the frequency with Series.asfreq, and filter the index values with missing values:
s = df[['DateTime']].assign(val=1).set_index('DateTime')['val'].asfreq('H')
print (s)
DateTime
2017-01-02 15:00:00 1.0
2017-01-02 16:00:00 1.0
2017-01-02 17:00:00 NaN
2017-01-02 18:00:00 1.0
Freq: H, Name: val, dtype: float64
out = s.index[s.isna()]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', name='DateTime', freq='H')
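The same gap can also be found with DatetimeIndex.difference, a minimal sketch:

```python
import pandas as pd

idx = pd.to_datetime(["2017-01-02 15:00", "2017-01-02 16:00", "2017-01-02 18:00"])
full = pd.date_range(idx.min(), idx.max(), freq="H")
# difference returns expected stamps that are absent from the data
missing = full.difference(idx)
print(missing)
```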
Is it safe to assume that the datetime format will always be the same? If yes, why don't you extract the "hour" values from your respective timestamps and compare them to the interval you desire, e.g.:
import re
#store some datetime values for show
datetimes=[
    "2017-01-02 15:00:00",
    "2017-01-02 16:00:00",
    "2017-01-02 18:00:00",
    "2019-12-07 22:00:00",
    "2019-12-07 23:00:00"
]
#extract hour value via regex (first match always is the hours in this format)
findHour = re.compile(r"\d{2}(?=:)")
prevx = findHour.findall(datetimes[0])[0]
#simple comparison: compare to previous value, calculate difference, set previous value to current value
for x in datetimes[1:]:
    cmp = findHour.findall(x)[0]
    diff = int(cmp) - int(prevx)
    if diff > 1:
        print("Missing Timestamp(s) between {} and {} hours!".format(prevx, cmp))
    prevx = cmp
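Comparing only the hour digits breaks across midnight (23:00 to 01:00 looks like a negative difference); a sketch that parses the full timestamps with the stdlib handles day boundaries too:

```python
from datetime import datetime, timedelta

stamps = ["2017-01-02 15:00:00", "2017-01-02 16:00:00",
          "2017-01-02 18:00:00", "2017-01-03 00:00:00"]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
# a gap exists wherever consecutive readings are more than one hour apart
gaps = [(a, b) for a, b in zip(parsed, parsed[1:]) if b - a > timedelta(hours=1)]
for a, b in gaps:
    print(f"missing reading(s) between {a} and {b}")
```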

How to convert date stored in YYYYMMDD to datetime format in pandas

I have a column in a dataframe which is in YYYYMMDD format and I want to convert it into datetime format. How can I do this in pandas?
Input
20180504
20180516
20180516
20180517
Expected Output
Date datetime
20180504 04/5/2018 00:00:00
20180516 16/5/2018 00:00:00
20180516 16/5/2018 00:00:00
20180517 17/5/2018 00:00:00
Use to_datetime:
df['datetime'] = pd.to_datetime(df['Input'], format='%Y%m%d')
print (df)
Input datetime
0 20180504 2018-05-04
1 20180516 2018-05-16
2 20180516 2018-05-16
3 20180517 2018-05-17
If the times are all zero they are not displayed in the column, but if you convert to a list you can see them:
print (df['datetime'].tolist())
[Timestamp('2018-05-04 00:00:00'),
Timestamp('2018-05-16 00:00:00'),
Timestamp('2018-05-16 00:00:00'),
Timestamp('2018-05-17 00:00:00')]
If input is csv file:
df = pd.read_csv(file, parse_dates=['Input'])
If you want the same format as in the question it is possible, but the output is strings, not datetimes:
df['datetime'] = pd.to_datetime(df['Input'], format='%Y%m%d').dt.strftime('%d/%m/%Y %H:%M:%S')
print (df)
      Input             datetime
0  20180504  04/05/2018 00:00:00
1  20180516  16/05/2018 00:00:00
2  20180516  16/05/2018 00:00:00
3  20180517  17/05/2018 00:00:00
print (df['datetime'].tolist())
['04/05/2018 00:00:00', '16/05/2018 00:00:00', '16/05/2018 00:00:00', '17/05/2018 00:00:00']

Select the data from between two timestamp in python

My query is regarding getting the data between two timestamps in python.
I need an input field where I can enter the two timestamps; then, from the CSV read, I need to retrieve the data for that particular input.
Actual Data (CSV)
Daily_KWH_System PowerScout Temperature Timestamp Visibility Daily_electric_cost kW_System
0 4136.900384 P371602077 0 07/09/2016 23:58 0 180.657705 162.224216
1 3061.657187 P371602077 66 08/09/2016 23:59 10 133.693074 174.193804
2 4099.614033 P371602077 63 09/09/2016 05:58 10 179.029562 162.774013
3 3922.490275 P371602077 63 10/09/2016 11:58 10 171.297701 169.230047
4 3957.128982 P371602077 88 11/09/2016 17:58 10 172.806125 164.099307
Example:
Input:
start date : 2-1-2017
end date :10-1-2017
Output
Timestamp Value
2-1-2017 10
3-1-2017 35
.
.
.
.
10-1-2017 25
The original CSV would contain all the data
Timestamp Value
1-12-2016 10
2-12-2016 25
.
.
.
1-1-2017 15
2-1-2017 10
.
.
.
10-1-2017 25
.
.
31-1-2017 50
use pd.read_csv to read the file
df = pd.read_csv('my.csv', index_col='Timestamp', parse_dates=[0])
Then use your inputs to slice
df[start_date:end_date]
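A minimal sketch of that slice (made-up data; label slicing like this assumes a sorted DatetimeIndex and is inclusive on both ends):

```python
import io
import pandas as pd

csv = "Timestamp,Value\n2017-01-01,15\n2017-01-02,10\n2017-01-10,25\n2017-01-31,50\n"
df = pd.read_csv(io.StringIO(csv), index_col="Timestamp", parse_dates=["Timestamp"])
# slice rows between the two labels, endpoints included
print(df["2017-01-02":"2017-01-10"])
```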
It seems you need dayfirst=True in read_csv, with selection by [] if all start and end dates are in df.index:
import pandas as pd
from pandas.compat import StringIO
temp=u"""Timestamp;Value
1-12-2016;10
2-12-2016;25
1-1-2017;15
2-1-2017;10
10-1-2017;25
31-1-2017;50"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
#if necessary add sep
#index_col=[0] convert first column to index
#parse_dates=[0] parse first column to datetime
df = pd.read_csv(StringIO(temp), sep=";", index_col=[0], parse_dates=[0], dayfirst=True)
print (df)
Value
Timestamp
2016-12-01 10
2016-12-02 25
2017-01-01 15
2017-01-02 10
2017-01-10 25
2017-01-31 50
print (df.index.dtype)
datetime64[ns]
print (df.index)
DatetimeIndex(['2016-12-01', '2016-12-02', '2017-01-01', '2017-01-02',
'2017-01-10', '2017-01-31'],
dtype='datetime64[ns]', name='Timestamp', freq=None)
start_date = pd.to_datetime('2-1-2017', dayfirst=True)
end_date = pd.to_datetime('10-1-2017', dayfirst=True)
print (df[start_date:end_date])
Value
Timestamp
2017-01-02 10
2017-01-10 25
If some dates are not in the index you need boolean indexing:
start_date = pd.to_datetime('3-1-2017', dayfirst=True)
end_date = pd.to_datetime('10-1-2017', dayfirst=True)
print (df[(df.index >= start_date) & (df.index <= end_date)])
            Value
Timestamp
2017-01-10     25