Graphing timestamps from PostgreSQL in an external program (Excel)

I am looking for a way to graph the time a toy store stock boy spent checked in at his job. The overall graph would have a defined start and end time, and each bar would span the amount of time he/she spent on the job.
The actual database would simply have the times the stock boy clocked in to work and the times he/she clocked out. Example:
timeshifts table
employerID | start_time | end_time
---------------------------------------------
1 | 2014-12-10 09:00:00 | 2014-12-10 09:37:00
1 | 2014-12-10 09:53:00 | 2014-12-10 11:44:00
1 | 2014-12-10 12:00:00 | 2014-12-10 15:00:00
After extracting the data and importing it into (??), my ideal graph output would look something like a timeline with a bar spanning each shift.
I know PostgreSQL can't do this on its own, but I'm not sure whether I need to structure the data in any special format to calculate the X distance (something like age(end_time, start_time) or the like).
Thank you for your help ahead of time!

In PostgreSQL, you can subtract timestamps.
select employer_id, start_time, end_time, end_time - start_time as elapsed_time
from timeshifts;
employer_id | start_time             | end_time               | elapsed_time
------------+------------------------+------------------------+-------------
          1 | 2014-12-10 09:00:00-05 | 2014-12-10 09:37:00-05 | 00:37:00
          1 | 2014-12-10 09:53:00-05 | 2014-12-10 11:44:00-05 | 01:51:00
          1 | 2014-12-10 12:00:00-05 | 2014-12-10 15:00:00-05 | 03:00:00
Whether Excel can recognize the values in "elapsed_time" is a different thing. It might be easier to do the subtraction in Excel.
create temp table timeshifts (
  employer_id integer not null,
  start_time timestamp with time zone not null,
  end_time timestamp with time zone not null,
  check (start_time < end_time),
  primary key (employer_id, start_time)
);
insert into timeshifts values
  (1, '2014-12-10 09:00:00', '2014-12-10 09:37:00'),
  (1, '2014-12-10 09:53:00', '2014-12-10 11:44:00'),
  (1, '2014-12-10 12:00:00', '2014-12-10 15:00:00');
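If the goal is to chart the shifts in Excel, it may also help to export the elapsed time (and the offset of each shift from the start of the day) as plain numbers rather than interval values. A minimal sketch against the temp table above; the column aliases start_offset_minutes and shift_minutes are just illustrative:
-- plain numbers are easier for Excel charts than interval values
select employer_id,
       start_time,
       extract(epoch from (start_time - date_trunc('day', start_time))) / 60 as start_offset_minutes,
       extract(epoch from (end_time - start_time)) / 60 as shift_minutes
from timeshifts
order by employer_id, start_time;
One common way to get the Gantt-style picture in Excel is a stacked bar chart in which the start_offset_minutes segment is made invisible and the shift_minutes segment is shown, so each bar spans exactly the time worked.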

Related

subtract second datetime row from first datetime row of a column if another column shows duplicate values

I have a dataframe with two columns, Order date and Customer (each Customer value occurs only twice, and the data has been sorted). I want to subtract the Order date of the second occurrence of a Customer from the Order date of the first occurrence. Order date is in datetime format.
Here is a sample of the table.
For context, I'm trying to calculate the time it takes for a customer to make a second order.
Order date Customer
4260 2022-11-11 16:29:00 (App admin)
8096 2022-10-22 12:54:00 (App admin)
996 2021-09-22 20:30:00 10013
946 2021-09-14 15:16:00 10013
3499 2022-04-20 12:17:00 100151
... ... ...
2856 2022-03-21 13:49:00 99491
2788 2022-03-18 12:15:00 99523
2558 2022-03-08 12:07:00 99523
2580 2022-03-04 16:03:00 99762
2544 2022-03-02 15:40:00 99762
I have tried deleting by index, but it returns just the first two values.
The expected output should be another dataframe with just the Customer name and the difference, in minutes, between the second and first Order dates of each duplicate customer.
Expected output:
| Customer    | difference in minutes |
| ----------- | --------------------- |
| 1232        | 445.0                 |
| (App Admin) | 3432.0                |
| 1145        | 2455.0                |
| 6653        | 32.0                  |
You can use groupby:
import pandas as pd

# per customer, first row minus last row (the sample is sorted newest first), in minutes
df['Order date'] = pd.to_datetime(df['Order date'])
out = (df.groupby('Customer', as_index=False)['Order date']
         .agg(lambda x: (x.iloc[0] - x.iloc[-1]).total_seconds() / 60)
         .query('`Order date` != 0'))
print(out)
# Output:
Customer Order date
0 (App admin) 29015.0
1 10013 11834.0
4 99523 14408.0
5 99762 2903.0

PySpark: (broadcast) joining two datasets on closest datetimes/unix

I am using PySpark and am close to giving up on my problem.
I have two data sets: one very large (set A) and one that is rather small (set B).
They are of the form:
Data set A:
variable | timestampA
---------------------------------
x | 2015-01-01 09:29:21
y | 2015-01-01 12:01:57
Data set B:
different information | timestampB
-------------------------------------------
info a | 2015-01-01 09:30:00
info b | 2015-01-01 09:30:00
info a | 2015-01-01 12:00:00
info b | 2015-01-01 12:00:00
A has many rows, each with a different time stamp. B has a time stamp every couple of minutes. The main problem is that there are no exact time stamps that match in both data sets.
My goal is to join the data sets on the nearest time stamp. An additional problem arises since I want to join in a specific way.
For each entry in A, I want to attach all of the information from B for the closest timestamp, duplicating the entry in A as needed. So the result should look like:
Final data set
variable | timestampA | information | timestampB
--------------------------------------------------------------------------
x | 2015-01-01 09:29:21 | info a | 2015-01-01 09:30:00
x | 2015-01-01 09:29:21 | info b | 2015-01-01 09:30:00
y | 2015-01-01 12:01:57 | info a | 2015-01-01 12:00:00
y | 2015-01-01 12:01:57 | info b | 2015-01-01 12:00:00
I am very new to PySpark (and also Stack Overflow). I figured that I probably need to use a window function and/or a broadcast join, but I really have no idea where to start and would appreciate any help. Thank you!
You can use broadcast to avoid shuffling.
If I understand correctly, the timestamps in set_B occur at a regular, known interval? If so, you can do the following:
from pyspark.sql import functions as F

# assuming 5 minutes is your interval in set_B; use half of it on each side
interval = 'INTERVAL {} SECONDS'.format(5 * 60 // 2)
res = set_A.join(
    F.broadcast(set_B),
    (set_A['timestampA'] > (set_B['timestampB'] - F.expr(interval)))
    & (set_A['timestampA'] <= (set_B['timestampB'] + F.expr(interval))))
Output:
+--------+-------------------+------+-------------------+
|variable| timestampA| info| timestampB|
+--------+-------------------+------+-------------------+
| x|2015-01-01 09:29:21|info a|2015-01-01 09:30:00|
| x|2015-01-01 09:29:21|info b|2015-01-01 09:30:00|
| y|2015-01-01 12:01:57|info a|2015-01-01 12:00:00|
| y|2015-01-01 12:01:57|info b|2015-01-01 12:00:00|
+--------+-------------------+------+-------------------+
If you don't have a fixed interval, then a cross join followed by taking the minimum of |timestampA - timestampB| can do the trick. You can do that with a window function and row_number, like the following:
from pyspark.sql import Window

w = Window.partitionBy('variable', 'info').orderBy(
    F.abs(F.col('timestampA').cast('int') - F.col('timestampB').cast('int')))
res = res.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn')

Presto: how to specify time interval using the current date and timezone

How to rewrite the following query:
WHERE (
parsedTime BETWEEN
TIMESTAMP '2019-10-29 00:00:00 America/New_York' AND
TIMESTAMP '2019-11-11 23:59:59 America/New_York'
)
but making the interval dynamic: from 14 days ago to current_date?
Presto provides a quite handy interval type among its date and time functions and operators.
-- Creating sample dataset
WITH dataset AS (
  SELECT
    'engineering' as department,
    ARRAY[
      TIMESTAMP '2019-11-05 00:00:00',
      TIMESTAMP '2018-10-29 00:00:00'
    ] as parsedTime_array
)
SELECT department, parsedTime FROM dataset
CROSS JOIN UNNEST(parsedTime_array) as t(parsedTime)
-- Filtering records for the past 14 days from current_date
WHERE (
  parsedTime > current_date - interval '14' day
)
Result
  | department  | parsedTime
--+-------------+-------------------------
1 | engineering | 2019-11-05 00:00:00.000
Update 2019-11-11
Note: current_date returns the current date as of the start of the query. I think Athena always uses UTC time, but I'm not 100% sure. So to extract the current date in a particular time zone, I'd suggest using timestamps with a time zone conversion. Although it is true that
current_timestamp = current_timestamp at TIME ZONE 'America/New_York'
since AT TIME ZONE represents the same instant in time and differs only in the time zone used to print it, the following is not always true, due to the 5-hour offset:
DATE(current_timestamp) = DATE(current_timestamp at TIME ZONE 'America/New_York')
This can be easily verified with:
WITH dataset AS (
  SELECT
    ARRAY[
      TIMESTAMP '2019-10-29 23:59:59 UTC',
      TIMESTAMP '2019-10-30 00:00:00 UTC',
      TIMESTAMP '2019-10-30 04:59:59 UTC',
      TIMESTAMP '2019-10-30 05:00:00 UTC'
    ] as parsedTime_array
)
SELECT
  parsedTime AS "Time UTC",
  DATE(parsedTime) AS "Date UTC",
  DATE(parsedTime at TIME ZONE 'America/New_York') AS "Date NY",
  to_unixtime(DATE(parsedTime)) AS "Unix UTC",
  to_unixtime(DATE(parsedTime at TIME ZONE 'America/New_York')) AS "Unix NY"
FROM
  dataset,
  UNNEST(parsedTime_array) as t(parsedTime)
Result: here we can see that 2 NY timestamps fall into 2019-10-29 and 2 into 2019-10-30, whereas for the UTC timestamps the split is 1 and 3 respectively.
Time UTC | Date UTC | Date NY | Unix UTC | Unix NY
-----------------------------|------------|------------|------------|------------
2019-10-29 23:59:59.000 UTC | 2019-10-29 | 2019-10-29 | 1572307200 | 1572307200
2019-10-30 00:00:00.000 UTC | 2019-10-30 | 2019-10-29 | 1572393600 | 1572307200
2019-10-30 04:59:59.000 UTC | 2019-10-30 | 2019-10-30 | 1572393600 | 1572393600
2019-10-30 05:00:00.000 UTC | 2019-10-30 | 2019-10-30 | 1572393600 | 1572393600
Now, let's fast forward a month. There was a change to winter time in NY on the 3rd of November 2019. However, a timestamp in UTC format is not affected by it. Therefore:
WITH dataset AS (
  SELECT
    ARRAY[
      TIMESTAMP '2019-11-29 23:59:59 UTC',
      TIMESTAMP '2019-11-30 00:00:00 UTC',
      TIMESTAMP '2019-11-30 04:59:59 UTC',
      TIMESTAMP '2019-11-30 05:00:00 UTC'
    ] as parsedTime_array
)
SELECT
  parsedTime AS "Time UTC",
  DATE(parsedTime) AS "Date UTC",
  DATE(parsedTime at TIME ZONE 'America/New_York') AS "Date NY",
  to_unixtime(DATE(parsedTime)) AS "Unix UTC",
  to_unixtime(DATE(parsedTime at TIME ZONE 'America/New_York')) AS "Unix NY"
FROM
  dataset,
  UNNEST(parsedTime_array) as t(parsedTime)
Result: here we can see that 3 NY timestamps fall into 2019-11-29 and 1 into 2019-11-30, whereas for the UTC timestamps the 1/3 split remained the same.
Time UTC | Date UTC | Date NY | Unix UTC | Unix NY
-----------------------------|------------|------------|------------|------------
2019-11-29 23:59:59.000 UTC | 2019-11-29 | 2019-11-29 | 1574985600 | 1574985600
2019-11-30 00:00:00.000 UTC | 2019-11-30 | 2019-11-29 | 1575072000 | 1574985600
2019-11-30 04:59:59.000 UTC | 2019-11-30 | 2019-11-29 | 1575072000 | 1574985600
2019-11-30 05:00:00.000 UTC | 2019-11-30 | 2019-11-30 | 1575072000 | 1575072000
Furthermore, different countries switch to winter/summer time on different dates. For instance in 2019, London (UK) moved clock 1 hour back on 27 October 2019, whereas NY (USA) moved clock 1 hour back on 3 November 2019.
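Tying this back to the original question, a dynamic, New-York-anchored version of the 14-day filter could look something like the following. This is only a sketch in the spirit of the examples above, not a tested drop-in:
WHERE (
  DATE(parsedTime AT TIME ZONE 'America/New_York')
    BETWEEN DATE(current_timestamp AT TIME ZONE 'America/New_York') - INTERVAL '14' DAY
        AND DATE(current_timestamp AT TIME ZONE 'America/New_York')
)
Both sides are reduced to New York calendar dates first, so the window follows the New York clock rather than UTC.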

Generate a interval based time series using Spark SQL

I am new to Spark SQL. I want to generate the following series of start and end times, with an interval of 5 seconds, for the current date. So let's say I am running my job on 1st Jan 2018; I want a series of start and end times that are 5 seconds apart. There will be 17280 records for one day.
START TIME | END TIME
-----------------------------------------
01-01-2018 00:00:00 | 01-01-2018 00:00:04
01-01-2018 00:00:05 | 01-01-2018 00:00:09
01-01-2018 00:00:10 | 01-01-2018 00:00:14
.
.
01-01-2018 23:59:55 | 01-01-2018 23:59:59
01-02-2018 00:00:00 | 01-01-2018 00:00:05
I know I can generate this data frame using a Scala for loop. My constraint is that I can use only queries to do this.
Is there any way I can create this data structure using select * constructs?
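One possible approach, sketched under the assumption of Spark 2.4+ (where sequence() accepts timestamp bounds and an interval step), is to explode a generated sequence of start times and derive each end time from it:
SELECT
  start_time,
  start_time + INTERVAL 4 SECONDS AS end_time
FROM (
  SELECT explode(
    sequence(
      CAST(current_date() AS TIMESTAMP),
      CAST(current_date() AS TIMESTAMP) + INTERVAL 1 DAY - INTERVAL 5 SECONDS,
      INTERVAL 5 SECONDS
    )
  ) AS start_time
) t
sequence() yields one timestamp every 5 seconds from midnight through 23:59:55 (17280 rows), and each end time is simply the corresponding start time plus 4 seconds.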

Time varies in postgres server and excel

I am trying a query which groups the data by months.
test_db=# select date_trunc('month', install_ts) AS month, count(id) AS count from api_booking group by month order by month asc;
month | count
------------------------+-------
2016-08-01 00:00:00+00 | 297
2016-09-01 00:00:00+00 | 2409
2016-10-01 00:00:00+00 | 2429
2016-11-01 00:00:00+00 | 3512
(4 rows)
This is the output in my Postgres db shell.
However, when I try this query in Excel, this is the output:
month | count
------------------------+-------
2016-07-31 17:00:00+00 | 297
2016-08-31 17:00:00+00 | 2409
2016-09-30 17:00:00+00 | 2429
2016-10-31 17:00:00+00 | 3512
(4 rows)
The problem, I think, is that Excel is interpreting the dates in a different time zone.
So how can I tell Excel to read them correctly?
Or is there any other solution to this problem?
Try...
select date(date_trunc('month', install_ts)) AS month, count(id) AS count from api_booking
The date() function strips out the time from a timestamp.
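If the month boundaries themselves should follow the client's local time zone rather than UTC, another option could be to convert before truncating. A sketch, with 'America/Los_Angeles' used purely as a stand-in for whatever zone the Excel client actually applies:
-- truncate to months as seen in the local zone, then return a plain date
select date_trunc('month', install_ts at time zone 'America/Los_Angeles')::date as month,
       count(id) as count
from api_booking
group by 1
order by 1;
Because the result is a date with no time or zone attached, Excel has nothing left to shift.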
