Need to read Excel dates as decimals without automatically converting to datetime

I am reading in an Excel sheet that has a column 'Time (hr)' holding times in hours, minutes and seconds, formatted like this: 64:45:00
I need to convert this to 64.75 hours.
When I read this in with read_excel, it is automatically converted to 1900-01-02 16:45.
I have tried using the dtype, converters and date_parser options of read_excel, but I always get an error:
data = xl.parse(header = [0], dtype = {'Time (hr)': np.float64})
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
EDIT:
I found out that some of the values in the Time (hr) column are less than 24 hours and are therefore read in as a time only. For example, 10:45:00 is read in as a datetime.time, so when I tried the solution I got this error:
TypeError: unsupported operand type(s) for -: 'datetime.time' and 'datetime.datetime'

You can try creating a DataFrame from the Excel file first, using test_df = xl.parse(name),
and then convert the date column to strings with test_df['Time (hr)'].dt.strftime("%Y-%m-%d %H:%M"). Appending .astype(int) to this raises a ValueError, because the formatted strings are not valid integers.

My test file dates.xlsx has a Time (hr) column containing the durations 64:45:00, 55:10:00 and 135:59:01.
Read it in and parse the dates as usual:
import datetime as dt
import pandas as pd
df = pd.read_excel('dates.xlsx', parse_dates=['Time (hr)'])
Time (hr)
0 1900-01-02 16:45:00
1 1900-01-02 07:10:00
2 1900-01-05 15:59:01
Excel's day one is 1-Jan-1900, so day zero is:
epoch = dt.datetime(1899, 12, 31)
Subtract the epoch to get a timedelta and then convert to total seconds:
df['seconds'] = (df['Time (hr)'] - epoch).dt.total_seconds()
Time (hr) seconds
0 1900-01-02 16:45:00 233100.0
1 1900-01-02 07:10:00 198600.0
2 1900-01-05 15:59:01 489541.0
Make column for total hours:
df['hours'] = df.seconds / 3600
Time (hr) seconds hours
0 1900-01-02 16:45:00 233100.0 64.750000
1 1900-01-02 07:10:00 198600.0 55.166667
2 1900-01-05 15:59:01 489541.0 135.983611
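The question's edit notes that durations under 24 hours arrive as datetime.time objects, which cannot be subtracted from a datetime epoch. Below is a minimal sketch of one way to handle both cases, assuming the column holds only datetime.datetime (or pandas Timestamp) and datetime.time values. The helper name to_hours is hypothetical, and durations of roughly 60 days or more are not handled, since Excel's fictitious 29-Feb-1900 may shift them by a day.

```python
import datetime as dt

# Same day zero as in the answer above.
EXCEL_EPOCH = dt.datetime(1899, 12, 31)

def to_hours(value):
    """Convert a cell read from Excel into decimal hours (hypothetical helper)."""
    if isinstance(value, dt.datetime):
        # Durations of 24 h or more arrive as datetimes (pd.Timestamp included).
        return (value - EXCEL_EPOCH).total_seconds() / 3600
    if isinstance(value, dt.time):
        # Durations under 24 h arrive as plain times.
        return value.hour + value.minute / 60 + value.second / 3600
    return float("nan")  # anything else is treated as missing

print(to_hours(dt.datetime(1900, 1, 2, 16, 45)))  # 64.75
print(to_hours(dt.time(10, 45)))                  # 10.75
```

With pandas, this could be applied to the whole column as df['hours'] = df['Time (hr)'].apply(to_hours).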

Related

Non-standard Julian day time stamp

I have a timestamp in a non-standard format; it's a concatenation of several elements. I'd like to convert at least the last part of the string into hours/minutes/seconds/decimal seconds so I can calculate the time gap between timestamps (typically of the order of 2-5 seconds).
I have looked at the question "How to convert Julian date to standard date?", but it assumes a 'proper' Julian time.
My time stamp looks like this
1380643373
It is set up as ddd hh mm ss.s
This timestamp represents the 138th day, 06:43:37.3
Is there a datetime method for working with this, or do I need to strip out the various parts (hh, mm, ss.s) and combine them in some way? As I am only interested in the seconds, if I can just extract them I could deal with that by adding 60 if the second timestamp is smaller than the first, i.e. when the event passes over a minute boundary.
If you're only interested in seconds, you can do:
timestamp = 1380643373
seconds = (timestamp % 1000) / 10 # Gives 37.3
timestamp % 1000 gives you the last three digits of timestamp. Then you divide that by 10 to get seconds.
If it's a string, you can take the last three characters by slicing it.
timestamp = "1380643373"
seconds = int(timestamp[-3:]) / 10 # Gives 37.3
It's pretty easy to convert the timestamp to a datetime using the divmod() function repeatedly:
import datetime
base_date = datetime.datetime(2000, 1, 1, 0, 0, 0) # Midnight on Jan 1 2000
timestamp = 1380643373
timestamp, seconds = divmod(timestamp, 1000) # Gives 1380643, 373
seconds = seconds / 10 # Gives 37.3
timestamp, minutes = divmod(timestamp, 100) # Gives 13806, 43
days, hours = divmod(timestamp, 100) # Gives 138, 6
tdelta = datetime.timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds) # Gives datetime.timedelta(days=138, seconds=24217, microseconds=300000)
new_date = base_date + tdelta
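Since the stated goal is the gap between two timestamps, the same divmod() steps can be wrapped in a function that returns a timedelta, so that minute, hour and day boundaries are handled by subtraction. This is only a sketch: the names parse_stamp and gap_seconds are hypothetical, and the second pair of timestamps is made up for illustration.

```python
import datetime

def parse_stamp(stamp):
    """Turn a dddhhmmsss value (last digit = tenths of a second) into a timedelta."""
    rest, tenths = divmod(int(stamp), 1000)   # last three digits encode ss.s
    rest, minutes = divmod(rest, 100)
    days, hours = divmod(rest, 100)
    # Integer milliseconds avoid float rounding in the seconds part.
    return datetime.timedelta(days=days, hours=hours, minutes=minutes,
                              milliseconds=tenths * 100)

def gap_seconds(first, second):
    """Seconds elapsed from the first timestamp to the second."""
    return (parse_stamp(second) - parse_stamp(first)).total_seconds()

print(gap_seconds(1380643373, 1380643401))  # 2.8
print(gap_seconds(1380643598, 1380644012))  # 1.4, crossing a minute boundary
```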

convert the h:m:s in minutes format

I have the following data. The idea is to multiply all the data.
However, the minute column is in h:m:s format, so whenever I try to multiply I get an error.
Moreover, I need to convert the h:m:s values to minutes before multiplying.
I tried the following to convert the column to minutes:
time1 = df['time']
time2 = time1.hour * 60 + time1.minute + time1.second
Create timedeltas with to_timedelta, convert them to seconds with Series.dt.total_seconds, and divide by 60:
df['Minutes'] = pd.to_timedelta(df['(MIN)']).dt.total_seconds().div(60)
If the input values are Python time objects, convert them to strings first:
df['Minutes'] = pd.to_timedelta(df['(MIN)'].astype(str)).dt.total_seconds().div(60)
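A small worked example of both variants, using made-up sample values since the original data is not shown; only the column name '(MIN)' comes from the answer above.

```python
import datetime
import pandas as pd

# Variant 1: the column holds h:m:s strings.
df = pd.DataFrame({"(MIN)": ["00:30:00", "01:15:30"]})
df["Minutes"] = pd.to_timedelta(df["(MIN)"]).dt.total_seconds().div(60)
print(df["Minutes"].tolist())  # [30.0, 75.5]

# Variant 2: the column holds datetime.time objects, so stringify first.
times = pd.Series([datetime.time(0, 30), datetime.time(1, 15, 30)])
minutes = pd.to_timedelta(times.astype(str)).dt.total_seconds().div(60)
print(minutes.tolist())  # [30.0, 75.5]
```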

Comparison between dates starts with -1

I have the following code:
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame ({
'Date':['4/22/2020 14:32:10','4/21/2020 4:32:10','4/20/2020 1:32:10']
})
date ='04/22/2020'
datetime_object = datetime.strptime(date, '%m/%d/%Y')
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y %H:%M:%S')
days_diff = (datetime_object - df['Date']).dt.days
print(days_diff)
0 -1
1 0
2 1
Why does the result not look like the one below? Why does the number of days start with -1 and not with 0?
0 0
1 1
2 2
This is because the number of days is floored.
For the first case, '4/22/2020 14:32:10', the difference is about -14.5/24, or roughly -0.6 days
output: -1
For the second case, '4/21/2020 4:32:10', the difference is about 19.5/24, or roughly 0.8 days
output: 0
For the third case, '4/20/2020 1:32:10', the difference is about 46.5/24, or roughly 1.9 days
output: 1
I hope it helps.
One solution is to convert all the datetimes to dates,
as the following line does for the 'Date' column:
days_diff = (datetime_object.date() - df['Date'].dt.date ).dt.days
In [32]: days_diff
Out[32]:
0 0
1 1
2 2
Name: Date, dtype: int64
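Depending on the pandas version, subtracting a column of plain date objects may come back as an object-dtype Series, on which the .dt accessor is not available. A sketch of an alternative that stays in datetime64 throughout, using Series.dt.normalize() to reset the time of day to midnight (this is a swapped-in technique, not the answer's own method):

```python
from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    "Date": ["4/22/2020 14:32:10", "4/21/2020 4:32:10", "4/20/2020 1:32:10"]
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y %H:%M:%S")
datetime_object = datetime.strptime("04/22/2020", "%m/%d/%Y")

# normalize() sets every time of day to 00:00:00 but keeps the datetime64 dtype,
# so the subtraction yields whole-day timedeltas.
days_diff = (pd.Timestamp(datetime_object) - df["Date"].dt.normalize()).dt.days
print(days_diff.tolist())  # [0, 1, 2]
```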
The issue is that you are subtracting a later datetime from an earlier one, which leaves you with a negative result. In the datetime module, subtracting one datetime object from another creates a timedelta object like so:
days1 = self.toordinal()
days2 = other.toordinal()
secs1 = self._second + self._minute * 60 + self._hour * 3600
secs2 = other._second + other._minute * 60 + other._hour * 3600
base = timedelta(days1 - days2,
                 secs1 - secs2,
                 self._microsecond - other._microsecond)
If we mimic that with your dates, we see the following days and secs computed for each datetime object:
737537 0
737537 52330
Subtracting days2 from days1 and secs2 from secs1 means we pass the following to the timedelta constructor:
0 -52330
So we are asking for a timedelta with a difference of 0 days and negative 52,330 seconds, which is quite correct. However, timedelta is a flexible object: it accepts fractional values and several other units, such as weeks or minutes, and it places no limit on the values. In the seconds argument you can pass 10 seconds or 100,000 seconds, and 100,000 seconds is more than there are in a day. The code takes this into account and uses divmod on the seconds to work out whether they contain any whole days.
days, seconds = divmod(seconds, 24*3600)
d += days
s += int(seconds) # can't overflow
The issue lies in understanding what divmod does: it returns the floor division and the remainder of the calculation. In the positive case that is fine.
print(divmod(52330, 24*3600))
print(divmod(-52330, 24*3600))
(0, 52330)
(-1, 34070)
In the positive case the floor division rounds down to 0 days and returns the remaining seconds. In the negative case, however, -52330 / 86400 is -0.6056..., so floor division rounds this down to -1, and the remainder is the difference between 86400 and 52330, which leaves 34070 seconds.
So you would not face this issue if you always subtracted the older date from the newer one, since you would never end up with a negative difference. In fact, it doesn't make sense to subtract a newer date from an older date.
For the other cases you listed, the difference between 4/21/2020 4:32:10 and 4/22/2020 00:00:00 is indeed 0 days, since the difference is actually only about 20 hours. This behavior is correct: the difference is not 1 day, it is about 20 hours.
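The normalization described above can be observed directly on a timedelta built from the same numbers. The last line shows one possible alternative, truncating total_seconds() toward zero, for cases where a negative partial day should count as 0 rather than -1; that choice is an assumption about what the asker wants, not part of the explanation above.

```python
from datetime import timedelta

# 0 days and -52330 seconds, as passed by the datetime subtraction.
td = timedelta(days=0, seconds=-52330)

# The constructor normalizes so that 0 <= seconds < 86400.
print(td.days, td.seconds)  # -1 34070

# total_seconds() recovers the signed duration.
print(td.total_seconds())   # -52330.0

# int() truncates toward zero instead of flooring.
print(int(td.total_seconds() / 86400))  # 0
```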

Handle different time formats in a dataframe

I am working on a dataframe with a column that mixes different time formats, like
Time ID ...
0 1 hrs 1 min 1 sec 1
1 1 min 1 sec 2
2 1 sec 1
I would like to calculate the mean of the time column grouped by ids.
My problem is that the time format depends on the row.
I tried to use the mean() function on the Time column
df[["ID", "Time"]].groupby(["ID"]).agg(lambda x: x.mean())
but it does not work.
I tried to convert the column to dates and then calculate the mean, but the
format="%H hrs %M min %S sec" only applies to the first case and I get an error:
ValueError: time data '1 min 1 sec' does not match format '%H hrs %M min %S sec'
Convert Time to Timedelta, convert that to seconds, and call mean. Before doing so, you need to replace hrs with hours.
s = pd.to_timedelta(df.Time.replace('hrs', 'hours', regex=True)).dt.total_seconds()
s.groupby(df.ID).mean()
Out[110]:
ID
1 1831.0
2 61.0
Name: Time, dtype: float64
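If the unit spellings vary in ways to_timedelta does not accept, the strings can also be parsed by hand. A minimal sketch using a regular expression, assuming only the three unit spellings seen in the sample data (hrs, min, sec); the helper name duration_seconds is hypothetical.

```python
import re

# Seconds per unit; the unit spellings are taken from the sample data.
UNIT_SECONDS = {"hrs": 3600, "min": 60, "sec": 1}

def duration_seconds(text):
    """Sum every '<number> <unit>' pair found in text (hypothetical helper)."""
    pairs = re.findall(r"(\d+)\s*(hrs|min|sec)", text)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in pairs)

print(duration_seconds("1 hrs 1 min 1 sec"))  # 3661
print(duration_seconds("1 min 1 sec"))        # 61
print(duration_seconds("1 sec"))              # 1
```

The per-ID mean could then be computed with df.Time.map(duration_seconds).groupby(df.ID).mean().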

Round Pandas date to nearest year/month

I am trying to round a pandas datetime column to its nearest year or month but I cannot figure out how to do it. For instance, this minimal example rounds to the closest hour:
pd.Timestamp.now().round('60min')
What I'd like is a way to replace the '60min' in order to round pd.Timestamp.now() to obtain either 2020-01-01 (for the year case) or 2019-08-01 (for the month case) (note that now() is exactly 2019-07-30 16:41:23.612004 at the time of asking!).
The pandas.Series.dt.round docs suggest a freq argument and link to this page, but trying the month/year options listed there returns this error:
ValueError: is a non-fixed frequency
Any idea what I am missing?
If the column is really a datetime column (check with df.dtypes), you can get the year, month and day with the code below.
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['round_Year'] = df['Date']+ pd.offsets.YearBegin(-1)
This moves each date back to the start of its year (a date already on 1 January moves back a full year). Changing -1 to 0 moves dates forward to the start of the next year instead.
df['round_Month'] = df['Date'] + pd.offsets.MonthBegin(-1)
This moves each date back to the start of its month (a date already on the 1st moves back a full month). Changing -1 to 0 moves dates forward to the start of the next month instead.
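The offsets above always move in one direction, whereas the question asks for the nearest month or year. A sketch of one way to get that, by comparing the timestamp with the midpoint of its period; the function name round_to_nearest is hypothetical, and it assumes "M" and "Y" are accepted as period aliases by the installed pandas version.

```python
import pandas as pd

def round_to_nearest(ts, unit):
    """Round ts to the nearest start of a month ("M") or year ("Y")."""
    period = ts.to_period(unit)
    start = period.start_time             # start of the current month/year
    next_start = (period + 1).start_time  # start of the following one
    midpoint = start + (next_start - start) / 2
    # At or past the midpoint, round up; otherwise round down.
    return next_start if ts >= midpoint else start

now = pd.Timestamp("2019-07-30 16:41:23.612004")
print(round_to_nearest(now, "M"))  # 2019-08-01 00:00:00
print(round_to_nearest(now, "Y"))  # 2020-01-01 00:00:00
```

For a whole column, this could be applied as df['Date'].apply(lambda ts: round_to_nearest(ts, "M")).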
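Example of rounding a pandas Timestamp to the nearest half year:
from pandas import Timestamp
def round_date_to_nearest_half_year(ts: Timestamp) -> Timestamp:
    # April through September is closest to 1 July of the same year.
    if 4 <= ts.month <= 9:
        return Timestamp(ts.year, 7, 1)
    # October through December is closest to 1 January of the next year.
    elif ts.month >= 10:
        return Timestamp(ts.year + 1, 1, 1)
    # January through March is closest to 1 January of the same year.
    elif ts.month <= 3:
        return Timestamp(ts.year, 1, 1)
    else:
        raise Exception("Logic error.")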
Test:
print(round_date_to_nearest_half_year(Timestamp("2022-6-5")))
print(round_date_to_nearest_half_year(Timestamp("2022-7-3")))
print(round_date_to_nearest_half_year(Timestamp("2022-12-15")))
print(round_date_to_nearest_half_year(Timestamp("2023-1-5")))
Out:
2022-07-01 00:00:00
2022-07-01 00:00:00
2023-01-01 00:00:00
2023-01-01 00:00:00
