Presto: how to specify a time interval using the current date and time zone

How can I rewrite the following query:
WHERE (
parsedTime BETWEEN
TIMESTAMP '2019-10-29 00:00:00 America/New_York' AND
TIMESTAMP '2019-11-11 23:59:59 America/New_York'
)
so that the interval is dynamic, running from 14 days ago to current_date?

Presto provides quite handy INTERVAL functionality among its date and time functions and operators.
-- Creating sample dataset
WITH dataset AS (
  SELECT
    'engineering' AS department,
    ARRAY[
      TIMESTAMP '2019-11-05 00:00:00',
      TIMESTAMP '2018-10-29 00:00:00'
    ] AS parsedTime_array
)
SELECT department, parsedTime
FROM dataset
CROSS JOIN UNNEST(parsedTime_array) AS t(parsedTime)
-- Filtering records for the past 14 days from current_date
WHERE (
  parsedTime > current_date - interval '14' day
)
Result
  | department  | parsedTime
--+-------------+--------------------------
1 | engineering | 2019-11-05 00:00:00.000
Update 2019-11-11
Note: current_date returns the current date as of the start of the query. I think Athena always uses UTC time, but I'm not 100% sure. So to extract the current date in a particular time zone, I'd suggest using timestamps with a time zone conversion. It is true that
current_timestamp = current_timestamp at TIME ZONE 'America/New_York'
since AT TIME ZONE represents the same instant in time, differing only in the time zone used to print it. However, the following is not always true, due to the 5-hour offset:
DATE(current_timestamp) = DATE(current_timestamp at TIME ZONE 'America/New_York')
This can be easily verified with:
WITH dataset AS (
  SELECT
    ARRAY[
      TIMESTAMP '2019-10-29 23:59:59 UTC',
      TIMESTAMP '2019-10-30 00:00:00 UTC',
      TIMESTAMP '2019-10-30 04:59:59 UTC',
      TIMESTAMP '2019-10-30 05:00:00 UTC'
    ] AS parsedTime_array
)
SELECT
  parsedTime AS "Time UTC",
  DATE(parsedTime) AS "Date UTC",
  DATE(parsedTime AT TIME ZONE 'America/New_York') AS "Date NY",
  to_unixtime(DATE(parsedTime)) AS "Unix UTC",
  to_unixtime(DATE(parsedTime AT TIME ZONE 'America/New_York')) AS "Unix NY"
FROM
  dataset,
  UNNEST(parsedTime_array) AS t(parsedTime)
Result: in NY time, two of the timestamps fall on 2019-10-29 and two on 2019-10-30, whereas in UTC the split is one and three respectively.
Time UTC | Date UTC | Date NY | Unix UTC | Unix NY
-----------------------------|------------|------------|------------|------------
2019-10-29 23:59:59.000 UTC | 2019-10-29 | 2019-10-29 | 1572307200 | 1572307200
2019-10-30 00:00:00.000 UTC | 2019-10-30 | 2019-10-29 | 1572393600 | 1572307200
2019-10-30 04:59:59.000 UTC | 2019-10-30 | 2019-10-30 | 1572393600 | 1572393600
2019-10-30 05:00:00.000 UTC | 2019-10-30 | 2019-10-30 | 1572393600 | 1572393600
Now, let's fast forward a month. NY switched to winter time on 3 November 2019, but timestamps in UTC are not affected by it. Therefore:
WITH dataset AS (
  SELECT
    ARRAY[
      TIMESTAMP '2019-11-29 23:59:59 UTC',
      TIMESTAMP '2019-11-30 00:00:00 UTC',
      TIMESTAMP '2019-11-30 04:59:59 UTC',
      TIMESTAMP '2019-11-30 05:00:00 UTC'
    ] AS parsedTime_array
)
SELECT
  parsedTime AS "Time UTC",
  DATE(parsedTime) AS "Date UTC",
  DATE(parsedTime AT TIME ZONE 'America/New_York') AS "Date NY",
  to_unixtime(DATE(parsedTime)) AS "Unix UTC",
  to_unixtime(DATE(parsedTime AT TIME ZONE 'America/New_York')) AS "Unix NY"
FROM
  dataset,
  UNNEST(parsedTime_array) AS t(parsedTime)
Result: in NY time, three of the timestamps now fall on 2019-11-29 and only one on 2019-11-30, whereas in UTC the one/three split remains the same.
Time UTC | Date UTC | Date NY | Unix UTC | Unix NY
-----------------------------|------------|------------|------------|------------
2019-11-29 23:59:59.000 UTC | 2019-11-29 | 2019-11-29 | 1574985600 | 1574985600
2019-11-30 00:00:00.000 UTC | 2019-11-30 | 2019-11-29 | 1575072000 | 1574985600
2019-11-30 04:59:59.000 UTC | 2019-11-30 | 2019-11-29 | 1575072000 | 1574985600
2019-11-30 05:00:00.000 UTC | 2019-11-30 | 2019-11-30 | 1575072000 | 1575072000
Furthermore, different countries switch between summer and winter time on different dates. For instance, in 2019 London (UK) moved its clocks back 1 hour on 27 October, whereas NY (USA) moved its clocks back 1 hour on 3 November.
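Putting this together: if the goal is "the last 14 days as of today in New York", one option (just a sketch, not verified on Athena) is to derive the date from a zone-converted timestamp and plug it into the original dynamic filter:
-- Sketch: last 14 days relative to the current date in New York
WHERE (
  parsedTime > DATE(current_timestamp AT TIME ZONE 'America/New_York') - interval '14' day
)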

Related

subtract second datetime row from first datetime row of a column if another column shows duplicate values

I have a dataframe with two columns, Order date and Customer (Customer has duplicates, at most two rows per value, and the data has been sorted). I want to subtract the Order date of the second occurrence of a Customer from the Order date of the first occurrence. Order date is in datetime format.
Here is a sample of the table.
For context: I'm trying to calculate the time it takes for a customer to make a second order.
Order date Customer
4260 2022-11-11 16:29:00 (App admin)
8096 2022-10-22 12:54:00 (App admin)
996 2021-09-22 20:30:00 10013
946 2021-09-14 15:16:00 10013
3499 2022-04-20 12:17:00 100151
... ... ...
2856 2022-03-21 13:49:00 99491
2788 2022-03-18 12:15:00 99523
2558 2022-03-08 12:07:00 99523
2580 2022-03-04 16:03:00 99762
2544 2022-03-02 15:40:00 99762
I have tried deleting by index but it returns just the first two values.
The expected output should be another dataframe with just the Customer name and the difference between the second and first Order dates of each duplicate customer, in minutes.
expected output:
| Customer    | difference in minutes |
| ----------- | --------------------- |
| 1232        | 445.0                 |
| (App Admin) | 3432.0                |
| 1145        | 2455.0                |
| 6653        | 32.0                  |
You can use groupby:
import pandas as pd

# Make sure the dates are real datetimes, not strings
df['Order date'] = pd.to_datetime(df['Order date'])
# Per customer: first order minus last order, in minutes; drop single-order customers (diff == 0)
out = (df.groupby('Customer', as_index=False)['Order date']
         .agg(lambda x: (x.iloc[0] - x.iloc[-1]).total_seconds() / 60)
         .query('`Order date` != 0'))
print(out)
# Output:
Customer Order date
0 (App admin) 29015.0
1 10013 11834.0
4 99523 14408.0
5 99762 2903.0

Use a PIVOT TABLE to show occupancy percentage by month from date ranges (Google Sheets)

My first time here.
I have a lease table where each row has the following:
Lease number | Tenant name | unit number | checkin date | checkout date.
I need a table (preferably a PIVOT) that shows the total occupancy (as a %) per month for each year. Keep in mind that the checkin-checkout dates can run from the middle of one month to the middle of another (Oct 3 2020 - Feb 17 2021).
Expected result:
| 2019 | 2020 | 2021
jan | 0% | 58% | 67%
feb | 67% | 21% | 42%
mar ....
As I'm not a professional, I'm looking for a simple/easy way to resolve it.

Computing First Day of Previous Quarter in Spark SQL

How do I derive the first day of the previous quarter for any given date in a Spark SQL query, using the SQL API? A few sample inputs and expected outputs are below:
input_date | start_date
------------------------
2020-01-21 | 2019-10-01
2020-02-06 | 2019-10-01
2020-04-15 | 2020-01-01
2020-07-10 | 2020-04-01
2020-10-20 | 2020-07-01
2021-02-04 | 2020-10-01
The Quarters generally are:
1 | Jan - Mar
2 | Apr - Jun
3 | Jul - Sep
4 | Oct - Dec
Note: I am using Spark SQL v2.4.
Any help is appreciated. Thanks.
Use date_trunc after subtracting 3 months from the input date:
df.withColumn("start_date", to_date(date_trunc("quarter", expr("input_date - interval 3 months"))))
.show()
+----------+----------+
|input_date|start_date|
+----------+----------+
|2020-01-21|2019-10-01|
|2020-02-06|2019-10-01|
|2020-04-15|2020-01-01|
|2020-07-10|2020-04-01|
|2020-10-20|2020-07-01|
|2021-02-04|2020-10-01|
+----------+----------+
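Since the question asks for the SQL API specifically, the same logic can also be expressed directly in a query. This is just a sketch, and the table name input_dates is assumed:
SELECT
  input_date,
  to_date(date_trunc('quarter', input_date - interval 3 months)) AS start_date
FROM input_dates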
Personally, I would create a table containing the dates from now for the next twenty years using Excel or something, and just reference that table.

Generate an interval-based time series using Spark SQL

I am new to Spark SQL. I want to generate a series of start and end times, at 5-second intervals, for the current date. So let's say I am running my job on 1st Jan 2018: I want a series of start and end times spaced 5 seconds apart, giving 17,280 records for the day.
START TIME | END TIME
-----------------------------------------
01-01-2018 00:00:00 | 01-01-2018 00:00:04
01-01-2018 00:00:05 | 01-01-2018 00:00:09
01-01-2018 00:00:10 | 01-01-2018 00:00:14
.
.
01-01-2018 23:59:55 | 01-01-2018 23:59:59
01-02-2018 00:00:00 | 01-01-2018 00:00:05
I know I can generate this data frame using a Scala for loop. My constraint is that I can only use queries to do this.
Is there any way I can create this data structure using select * constructs?
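One possible approach (a sketch, not from the original thread, assuming Spark 2.4+ where the sequence function is available) is to build the start times with sequence and explode, then derive the end times:
-- Sketch: one row per 5-second slot of the current day
SELECT
  start_time,
  start_time + interval 4 seconds AS end_time
FROM (
  SELECT explode(sequence(
           cast(current_date() AS timestamp),
           cast(current_date() AS timestamp) + interval 23 hours 59 minutes 55 seconds,
           interval 5 seconds)) AS start_time
) AS t
This yields 17,280 rows for the current day, matching the expected count.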

Time varies between Postgres server and Excel

I am trying a query that groups the data by month.
test_db=# select date_trunc('month', install_ts) AS month, count(id) AS count from api_booking group by month order by month asc;
month | count
------------------------+-------
2016-08-01 00:00:00+00 | 297
2016-09-01 00:00:00+00 | 2409
2016-10-01 00:00:00+00 | 2429
2016-11-01 00:00:00+00 | 3512
(4 rows)
This is the output in my postgres db shell.
However, when I try this query in Excel, this is the output:
month | count
------------------------+-------
2016-07-31 17:00:00+00 | 297
2016-08-31 17:00:00+00 | 2409
2016-09-30 17:00:00+00 | 2429
2016-10-31 17:00:00+00 | 3512
(4 rows)
The problem, I think, is that Excel is interpreting the dates in a different time zone.
So how can I tell Excel to read them correctly?
Or is there any other solution to this problem?
Try...
select date(date_trunc('month', install_ts)) AS month, count(id) AS count from api_booking
The date() function strips the time component out of a value that has both a date and a time.
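Alternatively, if you want the month boundaries anchored to an explicit zone rather than whatever the client assumes, you could convert before truncating. This is just a sketch, assuming install_ts is a timestamp with time zone:
select date_trunc('month', install_ts at time zone 'UTC')::date AS month, count(id) AS count from api_booking group by month order by month asc;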
