Spotfire: Calculate timestamp difference between two dates during working days and hours (8-17)

I have tried to use datediff however it only gives the hour difference between two dates. I would like to receive only the hours that are during the working days and working hours. Example provided below:
Start Date:
19/08/2022 09:42:13
End Date:
22/08/2022 09:54:22
Outcome I receive from Datediff("HH", [Start Date], [End Date]) = 72.20
The actual outcome I would like would be approx. = 8.13

OK so you mean something like NetWorkHours rather than days. (I don't understand your comment about medians.)
Assuming you have a data table with a column for the start date and a column for the end date (and these are of DateTime data type)
you could create a TERR expression that takes the two dates as input and produces a number of worked hours as output.
I have used this answer as a start:
How to calculate networkdays minus holidays between 2 dates
but that does not cover hours. So here is my suggested solution.
The idea is to initially remove the start and end day (which are incomplete days) and calculate the number of whole days minus weekends as in the previous solution.
Then simply multiply it by the working hours in a day.
Then take the first and last day and calculate the hours worked.
Then add the two together.
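The steps above can be sketched in plain Python for reference (my illustration only, not the TERR function itself; it keeps the answer's simplifying assumption that both endpoints fall on different working days, so boundary cases are not handled):

```python
from datetime import datetime, timedelta

START_SHIFT, END_SHIFT = 8, 17  # working day runs 08:00-17:00

def net_working_hours(start, end, holidays=()):
    """Working hours between two datetimes, following the four steps above."""
    # 1) whole days strictly between start and end, minus weekends/holidays
    day = start.date() + timedelta(days=1)
    full_days = 0
    while day < end.date():
        if day.weekday() < 5 and day not in holidays:
            full_days += 1
        day += timedelta(days=1)
    # 2) multiply by working hours per day
    hours_full = full_days * (END_SHIFT - START_SHIFT)
    # 3) partial hours on the first and last day
    begin = start.hour + start.minute / 60
    finish = end.hour + end.minute / 60
    hours_partial = (END_SHIFT - begin) + (finish - START_SHIFT)
    # 4) add together
    return hours_full + hours_partial
```

Under these rules a Friday-morning to Monday-morning span counts only the two partial weekday fragments, since the whole days in between are weekend days.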
Create a TERR expression function (from menu Data > Data Function Properties > Expression Function)
#start and end of working hours
startShift=8
endShift=17
#fill vector with holiday dates if required. Example:
holidayDates <- c(as.Date('18/04/2022',format='%d/%m/%Y'),as.Date('29/08/2022',format='%d/%m/%Y'))
#count complete days excluding holidays and weekends
allDays = seq.Date(from=as.Date(input1)+1, to=as.Date(input2)-1, by=1)
nonHolidays = as.Date(setdiff(allDays, holidayDates), origin="1970-01-01")
weekends =nonHolidays[weekdays(nonHolidays) %in% c("Saturday", "Sunday")]
nonHolidaysWeekends = as.Date(setdiff(nonHolidays, weekends), origin="1970-01-01")
hoursCompleteDays = length(nonHolidaysWeekends) *(endShift-startShift)
#count worked hours for first and last day
beginTime = as.POSIXlt(input1)
beginHour = beginTime$hour + beginTime$min/60
endTime = as.POSIXlt(input2)
endHour = endTime$hour + endTime$min/60
hoursFirstAndLastDay = (endShift-beginHour)+(endHour-startShift)
#add together
output = hoursCompleteDays + hoursFirstAndLastDay
Name the TERR expression function, e.g. TERR_netWorkingHours. This will give you the total hours worked.
Use it by creating a calculated column as:
TERR_netWorkingHours([startDate],[endDate])
where [startDate] and [endDate] are your original columns.

There are two main reasons why my previous answer did not work. Firstly, I tried to adapt an existing answer that worked for full days, but there are too many boundary cases when expanding it to fractional days (e.g. what if my start and end dates are the same, what if they fall over a weekend, etc.). Secondly, a TERR expression function expects to work with vectorized inputs, which does not fit a scenario with so many special cases in the input values.
I think it works now (at least for my examples) if I create a TERR data function instead, which outputs a whole new table. I used the R library data.table to make it a bit more efficient. I heavily modified the algorithm to vectorize the steps into a temporary data table (schedule_df). There may be a more clever way, but I did not find it.
You might be able to output just a column by modifying the data function's input/output settings.
Here it is, hope it helps:
suppressWarnings(suppressPackageStartupMessages(library(data.table)))
setDT(dt)
######## main function
netWorkingHours = function(input1, input2) {
#Helper function
extractHour = function(x) {
x = as.POSIXlt(x)
return (x$hour + x$min/60)
}
#prepare ---
dotimes=FALSE
#start and end of working hours
startShift=8
endShift=17
weekend = c('Saturday','Sunday')
#process
input1d = as.Date(input1)
input2d = as.Date(input2)
#list all days including start and end
allDays = seq.Date(from=input1d, to=input2d, by=1)
Ndays=length(allDays)
#flag included days: if they are not weekends
#can be expanded to holidays
include=ifelse(weekdays(allDays) %in% weekend,0,1)
#start building schedule
schedule_df=data.table(day=allDays,include=include)
schedule_df$index=c(1:Ndays)
#identify boundary days
schedule_df[,boundary:=ifelse(index==1 | index==Ndays,index,0)]
#initialize working hours
schedule_df$start=startShift
schedule_df$end=endShift
#modify start and end hours for boundary days
schedule_df[boundary==1 & max(boundary)>1, start :=extractHour(input1)]
schedule_df[boundary==1 & max(boundary)>1, end :=endShift]
schedule_df[boundary==1 & max(boundary)==1, start :=extractHour(input1)]
schedule_df[boundary==1 & max(boundary)==1, end :=extractHour(input2)]
schedule_df[boundary>1 , start :=startShift]
schedule_df[boundary>1 , end :=extractHour(input2)]
#shift start and end hours by shift limits
schedule_df[,start1:=sapply(start,function(x) max(startShift,x))]
schedule_df[,end1 :=sapply(end,function(x) min(endShift,x))]
#calculate worked hours for each day
schedule_df$worked_hours=0
schedule_df[include==1,worked_hours:=ifelse(end1>start1,end1-start1,0)]
Nincluded = nrow(schedule_df[include==1])
output = ifelse(Nincluded>0,sum(schedule_df[include==1,'worked_hours']),0)
return (output)
}
######################## main
dt[,workedHours:= mapply(netWorkingHours,dt[['date1']],dt[['date2']])]
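For comparison, here is a minimal pandas sketch of the same schedule-table idea (my illustration, not the TERR data function; the name net_working_hours is an assumption, weekends only, no holidays). It builds one row per day, overrides the boundary days' start/end hours, clips to the shift, and sums:

```python
import pandas as pd

START_SHIFT, END_SHIFT = 8, 17  # working day runs 08:00-17:00

def net_working_hours(start, end):
    """Build a one-row-per-day schedule and clip each day's hours to the
    shift, mirroring the data.table approach above."""
    days = pd.date_range(start.normalize(), end.normalize(), freq="D")
    sched = pd.DataFrame({"day": days})
    sched["include"] = sched["day"].dt.weekday < 5  # drop weekends
    sched["start"] = START_SHIFT
    sched["end"] = END_SHIFT
    # boundary days get the actual clock-in / clock-out hours
    sched.loc[sched.index[0], "start"] = start.hour + start.minute / 60
    sched.loc[sched.index[-1], "end"] = end.hour + end.minute / 60
    # clip to the shift window; negative spans (outside shift) count as 0
    sched["start"] = sched["start"].clip(lower=START_SHIFT)
    sched["end"] = sched["end"].clip(upper=END_SHIFT)
    worked = (sched["end"] - sched["start"]).clip(lower=0)
    return worked[sched["include"]].sum()
```

Because start and end live in the same row when the dates coincide, the same-day case falls out of the table naturally instead of needing a special branch.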

Related

How to get 1st calendar day of the current and next month based on a current date variable

I have a date variable calls today_date as below. I need to get the 1st calendar day of the current and next month.
In my case, today_date is 4/17/2021, I need to create two more variables calls first_day_current which should be 4/1/2021, and first_day_next which should be 5/1/2021.
Any suggestions are greatly appreciated
import datetime as dt
today_date
'2021-04-17'
Getting just the first date of a month is quite simple, since it always equals 1. You can even do this without the datetime module to simplify calculations, if today_date is always a string "Year-Month-Day" (or any consistent format: parse it accordingly).
today_date = '2021-04-17'
y, m, d = today_date.split('-')
first_day_current = f"{y}-{m}-01"
y, m = int(y), int(m)
first_day_next = f"{y+(m==12)}-{m%12+1:02d}-01"
If you want to use datetime.date(), then you'll anyway have to convert the string to (year, month, day) ints to give as arguments (or do today_date = datetime.date.today()).
Then .replace(day=1) to get first_day_current.
datetime.timedelta can't add months (only up to weeks), so you'll need other libraries for that. But it's more imports and calculations to do the same thing in effect.
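To stay within the stdlib despite that limitation, one common pattern (my addition, not from the answer above) is replace-then-shift: snap to day 1, jump past the longest possible month, and snap to day 1 again.

```python
import datetime as dt

def first_days(today):
    """Return (first day of current month, first day of next month)."""
    first_current = today.replace(day=1)
    # adding 32 days always lands somewhere in the next month;
    # snapping back to day 1 gives its first calendar day
    first_next = (first_current + dt.timedelta(days=32)).replace(day=1)
    return first_current, first_next
```

This also handles the December-to-January rollover without any special casing.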
I found out pd.offsets could accomplish this task as below -
import datetime as dt
import pandas as pd
today_date #'2021-04-17' this is a variable that is being created in the program
first_day_current = today_date.replace(day=1) # this will be 2021-04-01
next_month = first_day_current + pd.offsets.MonthBegin(n=1)
first_day_next = next_month.strftime('%Y-%m-%d') # this will be 2021-05-01

How to calculate average datetime timestamps in python3

I have some code whose performance I have timestamped, and I want to measure the average time it took to run on multiple computers, but I just can't figure out how to use the datetime module in Python.
Here is how my procedure looks:
1) I have the code which simply writes into a text file the log, where the timestamp looks like
t1=datetime.datetime.now()
...
t2=datetime.datetime.now()
stamp= t2-t1
And that stamp variable is just written to, say, log.txt, so in the log file it looks like 0:07:23.160896; it seems to be in %H:%M:%S.%f format.
2) Then I run a second python script which reads in the log.txt file and it reads the 0:07:23.160896 value as a string.
The problem is I don't know how to work with this value, because if I import it as a datetime it will also append an imaginary year, month and day to it, which I don't want. I simply want to work with hours, minutes, seconds and microseconds, to add them up or take an average.
For example, in LibreOffice I can just add 0:07:23.160896 to 0:00:48.065130, which gives 0:08:11.226026, and then divide by 2, which gives 0:04:05.613013, and I just can't do that in Python, or I don't know how.
I have tried everything, but neither datetime.datetime nor datetime.timedelta allows multiplication and division like that. If I do y=datetime.datetime.strptime('0:07:23.160896','%H:%M:%S.%f') it just gives out 1900-01-01 00:07:23.160896, and I can't take y*2 like that; it doesn't allow arithmetic operations. Plus, if I convert it into a timedelta, it will also multiply the year, which is ridiculous. I simply want to add, subtract and multiply times.
Please help me find a way to do this, and not just for 2 variables: ideally a way to calculate the average of an entire list of timestamps, like average(['0:07:23.160896', '0:00:48.065130', '0:00:14.517086', ...]).
I simply want a way to calculate the average of many timestamps and output it in the same format, just as in LibreOffice you can select a column and use the AVERAGE() function to get the average timestamp of that column.
As you have done, you first read the string into a datetime-object using strptime: t = datetime.datetime.strptime(single_time,'%H:%M:%S.%f')
After that, convert the time part of your datestring into a timedelta, so you can easily calculate with times: tdelta = datetime.timedelta(hours=t.hour, minutes=t.minute, seconds=t.second, microseconds=t.microsecond)
Now you can easily calculate with the timedelta objects, and at the end convert the result back into a string with str(tdsum).
import datetime
times = ['0:07:23.160896', '0:00:48.065130', '0:12:22.324251']
# convert the %H:%M:%S.%f strings into a running timedelta sum
tsum = datetime.timedelta()
count = 0
for single_time in times:
    t = datetime.datetime.strptime(single_time, '%H:%M:%S.%f')
    tdelta = datetime.timedelta(hours=t.hour, minutes=t.minute,
                                seconds=t.second, microseconds=t.microsecond)
    tsum = tsum + tdelta
    count = count + 1
taverage = tsum / count
average_time = str(taverage)
print(average_time)
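The loop above can be condensed into a small helper (a sketch using the same parsing approach; the function name is my own):

```python
import datetime

def average_time(strings):
    """Average a list of 'H:MM:SS.ffffff' strings; return the same format."""
    deltas = []
    for s in strings:
        # parse the clock part, then rebuild it as a pure duration
        t = datetime.datetime.strptime(s, "%H:%M:%S.%f")
        deltas.append(datetime.timedelta(hours=t.hour, minutes=t.minute,
                                         seconds=t.second,
                                         microseconds=t.microsecond))
    # timedelta supports sum and division directly
    return str(sum(deltas, datetime.timedelta()) / len(deltas))
```

With the two values from the question this reproduces the LibreOffice result, 0:04:05.613013.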

Is it possible to schedule same task at different cron in celery without duplicating entry

I want to schedule a Celery task on arbitrary days and also on the last day of every month. Is it possible to store the schedule of a Celery task in such a way that the task gets picked up both in the middle of the month and on the last day of the month? The last day of the month is of course not static. Is it possible to store two different cron expressions for a single task? I tried a cron expression with L for the last day, as suggested on many sites, but that doesn't seem to be standard.
I don't think it's supported out of the box, but it looks fairly doable to write a custom schedule. Celery supports crontab and solar schedules, and they are Python classes, so I assume one can create one's own.
I haven't tested it, but it looks like it should be something along the lines of:
import calendar
import datetime
from celery.schedules import BaseSchedule, schedstate

class last_day_month(BaseSchedule):
    def is_due(self, last_run_at):
        """Return tuple of ``(is_due, next_time_to_run)``.
        Note:
            next time to run is in seconds.
        See Also:
            :meth:`celery.schedules.schedule.is_due` for more information.
        """
        # Same as the one from the crontab schedule class
        rem_delta = self.remaining_estimate(last_run_at)
        rem = max(rem_delta.total_seconds(), 0)
        due = rem == 0
        if due:
            rem_delta = self.remaining_estimate(self.now())
            rem = max(rem_delta.total_seconds(), 0)
        return schedstate(due, rem)

    def remaining_estimate(self, last_run_at):
        """Estimate of next run time.
        Returns when the periodic task should run next as a
        :class:`~datetime.timedelta`.
        """
        last_run_at = self.maybe_make_aware(last_run_at)
        now = self.maybe_make_aware(self.now())
        if last_run_at.month == now.month and last_run_at.year == now.year:
            # Already run this month: next time is next month
            month = now.month + 1 if now.month < 12 else 1
            year = now.year if now.month < 12 else now.year + 1
        else:
            # Otherwise, should run this month next
            month, year = now.month, now.year
        # We've got the month and year, now get the last day
        _first_day, last_day = calendar.monthrange(year, month)
        # Build the datetime for next run
        next_utc = self.maybe_make_aware(
            datetime.datetime(year=year, month=month, day=last_day)
        )
        # Return the timedelta from now
        return next_utc - now
This is to be used in place of the crontab class in your Celery beat schedule.
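The last-day computation itself is just calendar.monthrange; isolated as a plain helper (an illustration, name assumed), the core of remaining_estimate is:

```python
import calendar
import datetime

def last_day_of_month(now):
    """Last calendar day of now's month as a date.
    monthrange returns (weekday of day 1, number of days in the month)."""
    _first_weekday, last = calendar.monthrange(now.year, now.month)
    return datetime.date(now.year, now.month, last)
```

monthrange handles leap years for you, which is exactly why the "L" cron extension isn't needed once the schedule computes the day itself.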

multiple if conditions in lambda does not work as expected

I have a problem trying to piece together a multiple conditional if.
UDF.daysinmonth(x) returns number of days in a month
latest_date.month returns the month of datetime object (eg 3 for 2019-3-10) - latest_date=df['offtake_date'].max()
df.insert(loc=20, column='bbls_mbd_mth',value=df['bbls'] / df['offtake_date'].apply(lambda x: UDF.daysinmonth(x) if x.month!=latest_date.month and x.year!=latest_date.year else latest_date.day))
This doesn't work with if x.month!=latest_date.month and x.year!=latest_date.year: for all of 2019 it returns the latest day in the data rather than the number of days in past months. For 2018 it works fine.
This doesn't work either.
df.insert(loc=20, column='bbls_mbd_mth', value=np.nan)
for i, row in df.iterrows():
    ifor_val = df.at[i, 'bbls'] / latest_date.day
    if ((df.at[i, 'offtake_date'].month != latest_date.month) and
            (df.at[i, 'offtake_date'].year != latest_date.year)):
        ifor_val = df.at[i, 'bbls'] / UDF.daysinmonth(df.at[i, 'offtake_date'])
    df.at[i, 'bbls_mbd_mth'] = ifor_val
But it works when I flip the logic
for i, row in df.iterrows():
    ifor_val = df.at[i, 'bbls'] / UDF.daysinmonth(df.at[i, 'offtake_date'])
    if ((df.at[i, 'offtake_date'].month == latest_date.month) and
            (df.at[i, 'offtake_date'].year == latest_date.year)):
        ifor_val = df.at[i, 'bbls'] / latest_date.day
    df.at[i, 'bbls_mbd_mth'] = ifor_val
I think I am missing something really basic... any help appreciated.
Thanks Julian for the prompt reply. I found out my error: it's a logic error, and the missing bracket in my original post was just a typo.
apply(lambda x: UDF.daysinmonth(x) if x.month!=latest_date.month and x.year!=latest_date.year else latest_date.day)
I wanted to apply latest_date.day only if the date falls in the latest month and year. For every date in the latest year, the condition above behaves the same as
apply(lambda x: UDF.daysinmonth(x) if x.year!=latest_date.year else latest_date.day)
regardless of the month, which is not what I want. I reversed the logic and it worked:
apply(lambda x: latest_date.day if x.month==latest_date.month and x.year==latest_date.year else UDF.daysinmonth(x))
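The underlying pitfall is De Morgan's law: the negation of "month equal and year equal" is "month different or year different", not "and". A toy check (my illustration, plain datetime, no pandas needed):

```python
import datetime

latest = datetime.date(2019, 3, 10)
d = datetime.date(2019, 1, 5)  # same year as latest, different month

# the original condition: both parts must differ, so the matching
# year makes it False even though the month differs
wrong = d.month != latest.month and d.year != latest.year
# correct negation of "is in the latest month": use not(... and ...)
right = not (d.month == latest.month and d.year == latest.year)
```

Here wrong evaluates to False and right to True, which is exactly why flipping the equality test (as in the accepted fix) behaves correctly.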

How to increment counters based on a column value being fixed in a Window?

I have a dataset that indicates, over time, the region where certain users were located. From this dataset I want to calculate the number of nights they spent at each location. By "spending the night" I mean: take the last location seen for a user up to 23:59 of a given day; if all observed locations for that user until 05:00 the next day (or the first one after that, if there is none before) match the last of the previous day, that counts as a night spent at that location.
| Timestamp| User| Location|
|1462838468|49B4361512443A4DA...|1|
|1462838512|49B4361512443A4DA...|1|
|1462838389|49B4361512443A4DA...|2|
|1462838497|49B4361512443A4DA...|3|
|1465975885|6E9E0581E2A032FD8...|1|
|1457723815|405C238E25FE0B9E7...|1|
|1457897289|405C238E25FE0B9E7...|2|
|1457899229|405C238E25FE0B9E7...|11|
|1457972626|405C238E25FE0B9E7...|9|
|1458062553|405C238E25FE0B9E7...|9|
|1458241825|405C238E25FE0B9E7...|9|
|1458244457|405C238E25FE0B9E7...|9|
|1458412513|405C238E25FE0B9E7...|6|
|1458412292|405C238E25FE0B9E7...|6|
|1465197963|6E9E0581E2A032FD8...|6|
|1465202192|6E9E0581E2A032FD8...|6|
|1465923817|6E9E0581E2A032FD8...|5|
|1465923766|6E9E0581E2A032FD8...|2|
|1465923748|6E9E0581E2A032FD8...|2|
|1465923922|6E9E0581E2A032FD8...|2|
I'm guessing I need to use Window functions here, and I've used PySpark for other things in the past, but I'm a bit at a loss as to where to start here.
I think in the end you do need to have a function that takes a series of events and outputs nights spent... something like (example just to get the idea):
def nights_spent(location_events):
    # location_events is a list of events that have time and location
    location_events = sort_by_time(location_events)
    nights = []
    prev_event = None
    for event in location_events:
        if prev_event is not None:
            if next_day(prev_event.time, event.time) \
                    and same_location(prev_event.location, event.location):
                # TODO: How do you handle when prev_event
                # and event are more than 1 day apart?
                nights.append(prev_event.location)
        prev_event = event
    return nights
Then, I think that a good first approach is to first group by user so that you get all events (with location and time) for a given user.
Then you can feed that list of events to the function above, and you'll have all the (user, nights_spent) rows in an RDD.
So, in general, the RDD pipeline would look something like:
nights_spent_per_user = all_events.map(lambda x: (x.user, [(x.time, x.location)])).reduceByKey(lambda a, b: a + b).map(lambda x: (x[0], nights_spent(x[1])))
Hope that helps to get you started.
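To make the sketch concrete, here is a runnable version with assumed helper definitions (my additions: timestamps in epoch seconds, events as (time, location) tuples, and "next day" taken as the following calendar day in UTC; the question's 23:59/05:00 refinement is left out, as in the sketch above):

```python
from datetime import datetime, timezone

def next_day(t1, t2):
    # assumed helper: t2 falls on the calendar day after t1 (epoch secs, UTC)
    d1 = datetime.fromtimestamp(t1, tz=timezone.utc).date()
    d2 = datetime.fromtimestamp(t2, tz=timezone.utc).date()
    return (d2 - d1).days == 1

def nights_spent(location_events):
    """Count a night for each consecutive-day pair at the same location.
    location_events: iterable of (time, location) tuples."""
    events = sorted(location_events)  # tuples sort by time first
    nights = []
    prev = None
    for event in events:
        if prev is not None:
            if next_day(prev[0], event[0]) and prev[1] == event[1]:
                nights.append(prev[1])
        prev = event
    return nights
```

For example, two events at the same location on consecutive days yield one night there, while a location change between days yields none.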
