Handle different time formats in a dataframe - python-3.x

I am working on a dataframe with a column that mixes different time formats, like:
Time ID ...
0 1 hrs 1 min 1 sec 1
1 1 min 1 sec 2
2 1 sec 1
I would like to calculate the mean of the time column grouped by ids.
My problem is that the time format depends on the row.
I tried to use the mean() function on the Time column
df[["ID", "Time"]].groupby(["ID"]).agg(lambda x: x.mean())
but it does not work.
I tried to convert the column to a date first and then calculate the mean, but
format="%H hrs %M min %S sec" only applies to the first case, and I get an error:
ValueError: time data '1 min 1 sec' does not match format '%H hrs %M min %S sec' (search)

Convert Time to timedelta, convert that to seconds, and call mean. Before doing so, you need to replace hrs with hours so that pd.to_timedelta can parse the strings.
s = pd.to_timedelta(df.Time.replace('hrs', 'hours', regex=True)).dt.total_seconds()
s.groupby(df.ID).mean()
Out[110]:
ID
1 1831.0
2 61.0
Name: Time, dtype: float64
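As a cross-check that does not depend on pandas' unit parsing, the same means can be reproduced with the standard library alone. This is only a sketch: the helper name to_seconds and the unit table are assumptions made for illustration, and the rows are the three sample rows from the question.

```python
import re
from collections import defaultdict
from statistics import mean

# Seconds per unit; only the three spellings seen in the question are handled.
UNIT_SECONDS = {"hrs": 3600, "min": 60, "sec": 1}


def to_seconds(text):
    """Convert strings such as '1 hrs 1 min 1 sec' or '1 sec' to seconds."""
    pairs = re.findall(r"(\d+)\s*(hrs|min|sec)", text)
    if not pairs:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in pairs)


# (Time, ID) rows copied from the question.
rows = [("1 hrs 1 min 1 sec", 1), ("1 min 1 sec", 2), ("1 sec", 1)]

# Collect the per-row seconds under each ID, then average each group.
seconds_by_id = defaultdict(list)
for time_text, row_id in rows:
    seconds_by_id[row_id].append(to_seconds(time_text))

mean_by_id = {row_id: mean(values) for row_id, values in seconds_by_id.items()}
print(mean_by_id)
```

The per-ID values agree with the pandas output above (1831 seconds for ID 1, 61 seconds for ID 2).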

Related

Comparison between dates starts with -1

I have the following code:
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame ({
'Date':['4/22/2020 14:32:10','4/21/2020 4:32:10','4/20/2020 1:32:10']
})
date ='04/22/2020'
datetime_object = datetime.strptime(date, '%m/%d/%Y')
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y %H:%M:%S')
days_diff = (datetime_object - df['Date']).dt.days
print(days_diff)
0 -1
1 0
2 1
Why doesn't the result look like the one below? Why does the number of days start at -1 and not 0?
0 0
1 1
2 2
This is because the result is floored.
For the first case, '4/22/2020 14:32:10', the difference is about -14.5/24 = ~ -0.6 days
output: -1
For the second case, '4/21/2020 4:32:10', the difference is about 19.5/24 = ~ 0.8 days
output: 0
For the third case, '4/20/2020 1:32:10', the difference is about 46.5/24 = ~ 1.9 days
output: 1
A solution is to convert all the datetimes to dates, as done with the 'Date' column in the following line:
days_diff = (datetime_object.date() - df['Date'].dt.date ).dt.days
In [32]: days_diff
Out[32]:
0 0
1 1
2 2
Name: Date, dtype: int64
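The flooring described above can be reproduced with plain datetime objects. The following is a minimal sketch using the three timestamps from the question; the variable names are illustrative only.

```python
from datetime import datetime

reference = datetime(2020, 4, 22)  # '04/22/2020' parsed at midnight
stamps = [
    datetime(2020, 4, 22, 14, 32, 10),
    datetime(2020, 4, 21, 4, 32, 10),
    datetime(2020, 4, 20, 1, 32, 10),
]

for stamp in stamps:
    delta = reference - stamp
    # Signed difference expressed as a fraction of a day.
    fractional_days = delta.total_seconds() / 86400
    print(f"{stamp}: {fractional_days:+.2f} days, .days = {delta.days}")

# .days is the floor of the fractional difference.
floored = [(reference - stamp).days for stamp in stamps]

# Dropping the time of day first gives whole calendar-day differences.
date_only = [(reference.date() - stamp.date()).days for stamp in stamps]

print(floored)    # [-1, 0, 1]
print(date_only)  # [0, 1, 2]
```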
The issue is that you are subtracting a later datetime from an earlier one, which leaves you with a negative result. In the datetime module, subtracting one datetime object from another creates a timedelta object like so:
days1 = self.toordinal()
days2 = other.toordinal()
secs1 = self._second + self._minute * 60 + self._hour * 3600
secs2 = other._second + other._minute * 60 + other._hour * 3600
base = timedelta(days1 - days2,
secs1 - secs2,
self._microsecond - other._microsecond)
If we mimic that with your dates, we see the following days and secs computed for each datetime object:
737537 0
737537 52330
Subtracting days2 from days1 and secs2 from secs1 means we pass the following to the timedelta object:
0 -52330
So we are asking for a timedelta with a difference of 0 days and negative 52,330 seconds, which is quite correct. However, the timedelta constructor is flexible: it accepts fractional values and several other units, such as weeks or minutes, and it places no limits on the values. In the seconds argument you can pass 10 seconds or 100,000 seconds, and 100,000 seconds is more than there are in a day. The code takes this into account and uses divmod on the seconds to work out whether they contain any extra days:
days, seconds = divmod(seconds, 24*3600)
d += days
s += int(seconds) # can't overflow
The issue lies in understanding what divmod does: it returns the floor division and the remainder of the calculation. In the positive case that's fine:
print(divmod(52330, 24*3600))
print(divmod(-52330, 24*3600))
(0, 52330)
(-1, 34070)
In the positive case, the floor division rounds down to 0 days and returns the remaining seconds. In the negative case, however, -52330 / 86400 is -0.6056..., so floor division rounds this down to -1, and the remainder is the difference between 86400 and 52330, which leaves 34070 seconds.
So you wouldn't face this issue if you always subtract the oldest date from the newest date, so that you never end up with a negative difference. In fact, it doesn't make sense to subtract a newer date from an older date.
For the other cases you listed, the difference between 4/21/2020 4:32:10 and 4/22/2020 00:00:00 is indeed 0 days, since the difference is only about 19.5 hours. This behavior is correct: the difference is not 1 day, it is under 20 hours.
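The normalisation described in this answer can be observed directly on a timedelta built from the same numbers. This is a small sketch; the variable name raw is arbitrary.

```python
from datetime import timedelta

# 0 days and -52330 seconds, as passed to timedelta in the walk-through above.
raw = timedelta(days=0, seconds=-52330)

# timedelta stores a normalised form with 0 <= seconds < 86400,
# so the negative seconds are carried into the days field.
print(raw.days, raw.seconds)            # -1 34070

# The signed duration itself is unchanged by the normalisation.
print(raw.total_seconds())              # -52330.0

# Taking the absolute value gives the magnitude without the -1 day.
print(abs(raw).days, abs(raw).seconds)  # 0 52330
```

Reading total_seconds() rather than .days is one way to avoid being surprised by the sign handling.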

Need to read excel dates as decimals without automatically converting to date time

I am reading in an Excel sheet that has a column 'Time (hr)' holding times in hours, minutes, seconds, formatted like this: 64:45:00
I need to convert this to 64.75 hours
When I read this in with read_excel it automatically converts it to 1900-01-02 16:45
I have tried using dtype, converters, date_parse options in the read_excel function but always get an error
data = xl.parse(header = [0], dtype = {'Time (hr)': np.float64})
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
EDIT:
I found out that some of the values in the Time (hr) column are less than 24 hours therefore are read in as time only. For example 10:45:00 is just read in as a time so when I tried the solution I got this error:
TypeError: unsupported operand type(s) for -: 'datetime.time' and 'datetime.datetime'
You can try creating a dataframe from the Excel file first, using test_df = xl.parse(name),
and then convert the date column to an int type like test_df['Time (hr)'].dt.strftime("%Y-%m-%d %H:%M").astype(int)
Here's what my test file dates.xlsx looks like:
Read it in and parse the dates as usual:
df = pd.read_excel('dates.xlsx', parse_dates=['Time (hr)'])
Time (hr)
0 1900-01-02 16:45:00
1 1900-01-02 07:10:00
2 1900-01-05 15:59:01
Excel's day one is 1-Jan-1900, so day zero is:
import datetime as dt
epoch = dt.datetime(1899, 12, 31)
Subtract the epoch to get a timedelta and then convert to total seconds:
df['seconds'] = (df['Time (hr)'] - epoch).dt.total_seconds()
Time (hr) seconds
0 1900-01-02 16:45:00 233100.0
1 1900-01-02 07:10:00 198600.0
2 1900-01-05 15:59:01 489541.0
Make column for total hours:
df['hours'] = df.seconds / 3600
Time (hr) seconds hours
0 1900-01-02 16:45:00 233100.0 64.750000
1 1900-01-02 07:10:00 198600.0 55.166667
2 1900-01-05 15:59:01 489541.0 135.983611
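The EDIT in the question reports that durations under 24 hours arrive as datetime.time values, which cannot be subtracted from the epoch. One possible way to handle both cases is a small converter applied per cell. This is a sketch under assumptions: to_hours and EXCEL_EPOCH are hypothetical names, the cell types are taken from the error messages quoted in the question, and the 1899-12-31 epoch is assumed to hold for every value (Excel's fictitious 29-Feb-1900 may shift very long durations by a day).

```python
import datetime as dt

# Day zero as used in the answer above (an assumption for long durations).
EXCEL_EPOCH = dt.datetime(1899, 12, 31)


def to_hours(value):
    """Return a duration in decimal hours for a cell read from Excel."""
    if isinstance(value, dt.datetime):
        # 24 hours or more: the cell was read as a date in early 1900.
        return (value - EXCEL_EPOCH).total_seconds() / 3600
    if isinstance(value, dt.time):
        # Under 24 hours: the cell was read as a bare time of day.
        return value.hour + value.minute / 60 + value.second / 3600
    raise TypeError(f"unexpected cell type: {type(value).__name__}")


print(to_hours(dt.datetime(1900, 1, 2, 16, 45)))  # 64.75
print(to_hours(dt.time(10, 45)))                  # 10.75
```

With pandas, this could be applied as df['Time (hr)'].map(to_hours), assuming the column was read as Python objects rather than parsed into a single datetime dtype.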

Spotfire AVG value per hour

I have users that have random values per hour across 24 hours. I want to get their average value per hour as it increases. Such as: a value of 3 at 3pm, then 4 at 4pm, 5 at 5pm, find the average per hour and give the total average once there are no more timestamps.
I've tried this:
case
when (DatePart("hour",[AUDIT_TSP])>0) and (DatePart("hour",[AUDIT_TSP])<1)
then Date([AUDIT_TSP]) & " " & ":00" & ":00" & Second([AUDIT_TSP])
when (DatePart("hour",[AUDIT_TSP])=1) and (DatePart("minute",
[AUDIT_TSP])>0) then Date([AUDIT_TSP]) & " " & ":01" & ":00" &
Second([AUDIT_TSP])
else null end
This was based on "Spotfire: calculate the avg per 15 minutes", which I tweaked for my use, but I couldn't get the code to show the average per hour rather than per 15 minutes. So I figured I would ask here.
My AUDIT_TSP is formatted with DateTime and example values look like:
4/15/2019 6:16:59 AM
4/15/2019 6:20:05 AM
The values are just shipments so, 1 shipment an hour, 2 shipments an hour, etc. Just trying to get the average per hour.
I don't expect the average per hour to show up on the timeline; the values it shows here are the number of shipments for each hour. If the average can be shown on the timeline, then great. If not, I can fall back to showing it in a text box, if that's possible as well.
You could produce an intermediate column to mark each hour with
[hourBin] calculated as: Integer(ToEpochSeconds([time-stamp]) / 3600)
Here [time-stamp] is your timestamp column. ToEpochSeconds(..) counts the number of seconds from a reference date (usually 1st Jan 1970). When you divide it by 3600 and take the integer part, you get an hour counter.
Then average your values for each hour
Avg([value]) OVER ([hourBin])
And/or visualise the average and the spread around it in a box-plot of [value] with [hourBin] as the category.
If you want the intermediate column to look more meaningful you can rescale it by its first value so it starts from 0 (or add any number you wish):
Integer(ToEpochSeconds([time-stamp]) / 3600) - Min(Integer(ToEpochSeconds([time-stamp]) / 3600))
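To make the binning arithmetic concrete, the same idea can be sketched in Python (not Spotfire expression syntax). The rows and their values below are hypothetical; only the timestamp format is taken from the question, and hour_bin is an illustrative name.

```python
from collections import defaultdict
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def hour_bin(stamp):
    """Whole hours since the epoch, mirroring Integer(ToEpochSeconds(x) / 3600)."""
    return int((stamp - EPOCH).total_seconds() // 3600)


# Hypothetical (timestamp, value) rows in the question's DateTime format.
rows = [
    ("4/15/2019 6:16:59 AM", 3),
    ("4/15/2019 6:20:05 AM", 5),
    ("4/15/2019 7:05:00 AM", 4),
]

values_by_bin = defaultdict(list)
for text, value in rows:
    stamp = datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")
    values_by_bin[hour_bin(stamp)].append(value)

# Rescale the bins by the first one so they start from 0, then average each bin.
first_bin = min(values_by_bin)
average_by_hour = {
    bin_id - first_bin: sum(values) / len(values)
    for bin_id, values in sorted(values_by_bin.items())
}
print(average_by_hour)  # {0: 4.0, 1: 4.0}
```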

Computing Average In MM:SS in excel

Input: Minutes:Seconds
Output: Average in Minutes:Seconds
I currently have a sheet where we put in handle times for calls. We want to compute the average handling time in minutes:seconds. Now, currently we have minutes in Column A and seconds in Column B. In Column C, I convert A&B to total seconds. In Column D, I use =AVERAGE(C1:C6) to compute for average.
Question: Is there an easier way to do it? Specifically, is there a formula that will allow me to simply input Minutes:Seconds in a single column and have the average calculated in Minutes:Seconds?
Option 1:
(If you can change your input format):
You need to set the data format as hh:mm:ss
Inputting the data in this format will allow excel to automatically detect the format and as such, allow for you to use the 'average' formula.
For example, if you have 3 entries for: 2 minutes, 1 min 30 secs and 1 minute, the data in col A should look like:
00:02:00
00:01:30
00:01:00
You can then run, for example, the formula:
=AVERAGE(A1:A3)
Note: By default, when data is entered in the format "xx:yy", Excel assumes that xx is the hours and yy is the minutes, so you should include the initial 00: if your times have no hours component.
Option 2:
(If you cannot change your input format):
If you need to stick to the format where col A contains the minutes and col B contains the seconds, you can use the following formula to produce "hh:mm:ss" data in col C:
=(A1/(24*60))+(B1/(24*60*60))
(Excel stores a time as a fraction of one day, so we divide the minutes in col A by 24*60 and the seconds in col B by 24*60*60 to convert both to days)
You can then use a formula similar to the one below to calculate the average time in col C:
=AVERAGE(C1:C3)
Note: You would need to set column C to the 'Custom format':
hh:mm:ss
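The day-fraction arithmetic behind Option 2 can be checked outside Excel. The sketch below mirrors the formula in Python using the three example entries from Option 1; to_day_fraction and format_hms are illustrative names, not Excel functions.

```python
def to_day_fraction(minutes, seconds):
    """Mirror =(A1/(24*60))+(B1/(24*60*60)): a duration as a fraction of a day."""
    return minutes / (24 * 60) + seconds / (24 * 60 * 60)


def format_hms(day_fraction):
    """Render a day fraction the way the hh:mm:ss custom format would."""
    total_seconds = round(day_fraction * 24 * 60 * 60)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# 2 minutes, 1 min 30 secs and 1 minute, as (minutes, seconds) pairs.
entries = [(2, 0), (1, 30), (1, 0)]

fractions = [to_day_fraction(m, s) for m, s in entries]
average = sum(fractions) / len(fractions)  # same as =AVERAGE(C1:C3)
print(format_hms(average))  # 00:01:30
```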

Convert 'x hrs y min z sec' to seconds

a) So I have a huge folder of .csv data with a column about time duration where the cells are 'x min y sec' (e.g. 15 min 29 sec) or 'x hrs y min z sec' (e.g. 1 hrs 48 min 28 sec). The cells are formatted as text.
I want to batch change them to the number of seconds, but I have no idea where to start. I can't get the data in another format.
I thought about somehow using 'hrs', 'min' or 'sec' as delimiters, but I don't know how to move from there. I also thought about using ' ' as delimiters, but then the first column is filled with either hours or minutes depending on the time duration.
I also thought about using PostgreSQL's SELECT EXTRACT(EPOCH FROM INTERVAL '5 days 3 hours'), but I haven't been able to work out how to use this on a column from a table.
b) Is there a better way to change this time format 'Fri Mar 14 11:29:27 EST 2014' to epoch time? Right now I'm thinking of using macros in Excel to get rid of 'Fri' and 'EST', then put the columns back together, then use the to_timestamp function in PostgreSQL.
In Excel if you have data in only those 2 formats and starting from A2 you can use this formula in B2 copied down to get the number of seconds:
=IFERROR(LEFT(A2,FIND("hrs",A2)-1)*3600,0)+SUM(MID(0&A2,FIND({"min","sec"},0&A2)-3,2)*{60,1})
It finds the relevant text then gets the number in front for each and multiplies by the relevant number to get seconds
You can do:
SELECT EXTRACT(EPOCH FROM column_name::interval)
FROM my_table;
The interval can use the regular time units (like hour), abbreviations thereof (hr) and plurals (hours). I am not sure about a combination of plural and abbreviation (hrs) though. If that does not work, UPDATE the column and replace() the sub-string "hrs" to "hours".
If you want to save the number of seconds in your table, then you convert the above statement into an UPDATE statement:
UPDATE my_table SET seconds_column = extract(epoch FROM column_name::interval);
I would split with space as the delimiter, then examine the second column. If it contains the string "hrs", then your seconds answer is:
3600 * column 1 + 60 * column 3 + column 5
Otherwise it is:
60 * column 1 + column 3
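The split-on-space approach in the last answer can be sketched in Python as follows. It assumes, as the question states, that only the two formats 'x min y sec' and 'x hrs y min z sec' occur; duration_to_seconds is an illustrative name. Note that the answer's columns 1, 3 and 5 are indexes 0, 2 and 4 here.

```python
def duration_to_seconds(text):
    """Convert 'x min y sec' or 'x hrs y min z sec' to a number of seconds."""
    parts = text.split()
    # 'x hrs y min z sec' splits into six tokens with 'hrs' in second place.
    if len(parts) == 6 and parts[1] == "hrs":
        return 3600 * int(parts[0]) + 60 * int(parts[2]) + int(parts[4])
    # 'x min y sec' splits into four tokens with 'min' in second place.
    if len(parts) == 4 and parts[1] == "min":
        return 60 * int(parts[0]) + int(parts[2])
    raise ValueError(f"unexpected duration format: {text!r}")


print(duration_to_seconds("15 min 29 sec"))        # 929
print(duration_to_seconds("1 hrs 48 min 28 sec"))  # 6508
```

Raising on anything else makes unexpected cells visible when batch-processing many files, rather than silently producing wrong numbers.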
