How to Left Join in Presto SQL?

Can't for the life of me figure out a simple left join in Presto, even after reading the documentation. I'm very familiar with Postgres and tested my query there to make sure there wasn't a glaring error on my part. Please reference code below:
select * from
(select cast(order_date as date),
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1)
left join
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1) as b
on a.order_date = b.summary_date
Running that gives the following error:
SQL Error: Failed to run query
Failed to run query
line 1:1: mismatched input 'on' expecting {'(', 'SELECT', 'DESC', 'WITH',
'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'GRANT',
'REVOKE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'CALL', 'PREPARE', 'DEALLOCATE', 'EXECUTE'} (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: a33a6671-07a2-4d7b-bb75-f70f7b82409e)

The first problem I notice is that your join clause assumes the first sub-query is aliased as a, but it is not aliased at all. I recommend aliasing that table to see if that fixes it (I also recommend aliasing the order_date column explicitly outside of the cast() statement since you are joining on that column).
Try this:
select * from
(select cast(order_date as date) as order_date,
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1) as a
left join
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1) as b
on a.order_date = b.summary_date

One option is to declare your subqueries by using a with clause:
with a as
(select cast(order_date as date) as order_date,
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1),
b as
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1)
select * from a
left join b
on a.order_date = b.summary_date;

Related

Synapse Spark SQL Delta Merge Mismatched Input Error

I am trying to update the historical table, but am getting a merge error. When I run this cell:
%%sql
select * from main
UNION
select * from historical
where Summary_Employee_ID=25148
I get a two row table that looks like:
EmployeeID Name
25148 Wendy Clampett
25148 Wendy Monkey
I'm trying to update the Name... using the following merge command
%%sql
MERGE INTO main m
using historical h
on m.Employee_ID=h.Employee_ID
WHEN MATCHED THEN
UPDATE SET
m.Employee_ID=h.Employee_ID,
m.Name=h.Name
WHEN NOT MATCHED THEN
INSERT(Employee,Name)
VALUES(h.Employee,h.Name)
Here's my error:
Error:
mismatched input 'MERGE' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos 0)
Synapse doesn't support the SQL MERGE statement the way Databricks does. However, you can use the Python solution. Note that historical was really my updates...
So for the above, I used:
from delta.tables import DeltaTable

# "path" is the storage location of the main Delta table (the merge target)
main = DeltaTable.forPath(spark, "path")

(main
    .alias("main")
    # historical is a DataFrame holding the updates
    .merge(historical.alias("historical"),
           "main.Employee_ID = historical.Employee_ID")
    # the goal is to update Name when the key already exists
    .whenMatchedUpdate(set={"Name": "historical.Name"})
    .whenNotMatchedInsert(values={
        "Employee_ID": "historical.Employee_ID",
        "Name": "historical.Name"})
    .execute()
)
Your goal is to upsert the target table historical, but in your query the target table is set to main instead of historical, and the UPDATE and INSERT clauses are mixed up accordingly.
Try the following:
%%sql
MERGE INTO historical target
using main source
on source.Employee_ID=target.Employee_ID
WHEN MATCHED THEN
UPDATE SET
target.Name=source.Name
WHEN NOT MATCHED THEN
INSERT(Employee,Name)
VALUES(source.Employee,source.Name)
MERGE is supported in Spark 3.0, which is currently in preview, so this might be worth a try. I did see the same error on a Spark 3.0 pool, but it's quite misleading: it actually means that you're trying to merge on duplicate data, or that you're offering duplicate data to the original set. I've validated this by querying the delta lake and the raw file for duplicates with the serverless SQL pool and PolyBase.
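If you want to check for that kind of duplication yourself, a minimal sketch is to count how often each join key appears on either side (table and column names here follow the MERGE example above):
%%sql
select Employee_ID, count(*) as duplicates
from historical
group by Employee_ID
having count(*) > 1
Run the same check against main; any key returned here appears more than once and would make the MERGE ambiguous.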

PySpark window with condition

I have a dataset with application logs that show when a certain app was launched or closed. Sometimes, the related events may be missing entirely from the logs. I want to match each app start with the related end event (if it exists).
Here's an illustrative dataset:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([['Group1', 'Logon', 'Name1', '2021-02-05T19:03:00.000+0000'],
['Group1', 'Start', 'Name1', '2021-02-05T19:04:00.000+0000'],
['Group1', 'Start', 'Name1', '2021-02-05T19:05:00.000+0000'],
['Group1', 'End', 'Name1', '2021-02-05T19:06:00.000+0000'],
['Group1', 'End', 'Name3', '2021-02-05T19:06:01.000+0000'],
['Group1', 'End', 'Name1', '2021-02-05T19:07:00.000+0000'],
['Group2', 'Start', 'Name1', '2021-02-05T19:04:00.000+0000'],
['Group2', 'Start', 'Name1', '2021-02-05T19:05:00.000+0000'],
['Group2', 'Start', 'Name2', '2021-02-05T19:06:00.000+0000'],
['Group2', 'End', 'Name1', '2021-02-05T19:07:00.000+0000'],
['Group2', 'Close', 'Name1', '2021-02-05T19:07:00.000+0000'],
], ['group', 'type', 'name', 'time'])
df = df.withColumn('time', F.col('time').cast('timestamp'))
For each group separately, I want to assign a common identifier to each 'Start' and 'End' event if they have the same 'name'. In other words, for each 'Start' event I want to find the first 'End' event that has not already been matched to another 'Start' event.
The expected result could be something like the following picture:
I don't mind if the identifier (i.e. 'my_group') is an ID, a timestamp or if it is monotonically increasing across groups. I just want to be able to match the relevant events within each group.
What I've tried
I thought about using window functions in order to identify the end time of 'Start' events and the start time of 'End' events. However, I cannot restrict to searching only for 'End' events (and 'Start' events respectively). Also, I cannot apply the logic described above of finding the first 'End' event that has not already been matched to another 'Start' event.
Here's my code:
app_session_window_down = Window.partitionBy('group', "name").orderBy(F.col("time").cast('long')).rangeBetween(1, Window.unboundedFollowing) #search in the future
app_session_window_up = Window.partitionBy('group', "name").orderBy(F.col("time").cast('long')).rangeBetween(Window.unboundedPreceding, -1) #search in the past
df = df.withColumn("app_time_end", F.when((F.col("type") == 'Start'), F.first(F.col('time'), ignorenulls=True).over(app_session_window_down)).otherwise(F.lit('None')))\
.withColumn("app_time_start", F.when((F.col("type") == 'End'), F.last(F.col('time'), ignorenulls=True).over(app_session_window_up)).otherwise(F.col('app_time_end')))
which gives:
This is nowhere close to what I want to achieve. Any hints?
Explanations are in the inline comments:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'my_group', # the column you wanted
F.when(
F.col('type').isin(['Start', 'End']),
F.row_number().over(Window.partitionBy('group', 'name', 'type').orderBy('time'))
)
).withColumn(
'max_group', # helper column: get maximum row_number for each group ; will be used later
F.least(
F.max(
F.when(
F.col('type') == 'Start', F.col('my_group')
).otherwise(0)
).over(Window.partitionBy('group', 'name')),
F.max(
F.when(
F.col('type') == 'End', F.col('my_group')
).otherwise(0)
).over(Window.partitionBy('group', 'name'))
)
).withColumn(
'my_group', # mask the rows which don't have corresponding 'start'/'end'
F.when(
F.col('my_group') <= F.col('max_group'),
F.col('my_group')
)
).withColumn(
'my_group', # add the group name
F.when(F.col('my_group').isNotNull(), F.concat_ws('_', 'group', 'name', 'my_group'))
).drop('max_group').orderBy('group', 'time')
df2.show()
+------+-----+-----+-------------------+--------------+
| group| type| name| time| my_group|
+------+-----+-----+-------------------+--------------+
|Group1|Logon|Name1|2021-02-05 19:03:00| null|
|Group1|Start|Name1|2021-02-05 19:04:00|Group1_Name1_1|
|Group1|Start|Name1|2021-02-05 19:05:00|Group1_Name1_2|
|Group1| End|Name1|2021-02-05 19:06:00|Group1_Name1_1|
|Group1| End|Name3|2021-02-05 19:06:01| null|
|Group1| End|Name1|2021-02-05 19:07:00|Group1_Name1_2|
|Group2|Start|Name1|2021-02-05 19:04:00|Group2_Name1_1|
|Group2|Start|Name1|2021-02-05 19:05:00| null|
|Group2|Start|Name2|2021-02-05 19:06:00| null|
|Group2| End|Name1|2021-02-05 19:07:00|Group2_Name1_1|
|Group2|Close|Name1|2021-02-05 19:07:00| null|
+------+-----+-----+-------------------+--------------+

How to create temporary view in Spark SQL using a CTE?

I'm attempting to create a temp view in Spark SQL using a CTE, with this statement:
create temporary view cars as (
with models as (
select 'abc' as model
)
select model from models
)
But this error is thrown:
error in SQL statement: ParseException:
mismatched input 'with' expecting {'(', 'SELECT', 'FROM', 'DESC', 'VALUES', 'TABLE', 'INSERT', 'DESCRIBE', 'MAP', 'MERGE', 'UPDATE', 'REDUCE'}(line 2, pos 8)
== SQL ==
create temporary view cars as (
with models as (
--------^^^
select 'abc' as model
)
select model from models
)
Removing the brackets after the first as makes it work:
create temporary view cars as
with models as (
select 'abc' as model
)
select model from models

Presto combining two columns and output as one

I'm trying to combine two columns together in Presto.
This is part of a query, and it has to be formatted in a certain way.
SELECT 'Display' AS channel,
DBM.dated,
DBM.market,
DBM.impressions,
DBM.clicks,
sum(DBM.amount_spent_EUR)+sum(DBm.platform_fee) as DBM.amount_spent_EUR
FROM
(
SELECT
DATE_FORMAT(DATE_PARSE(date,'%Y/%m/%d'),'%Y-%m-%d') AS dated,
trim(SPLIT_PART(insertion_order,'|',3)) AS market,
sum(cast(impressions as double)) as impressions,
sum(cast(clicks as double)) as clicks,
sum(CAST(media_cost_advertiser_currency AS DOUBLE)*1.15) AS amount_spent_EUR,
sum(CAST(media_fee_1_adv_currency AS DOUBLE)*1.15) as platform_fee
FROM ralph_lauren_google_sheet_dbm_data_2
WHERE dated <= {{days_ago 1}}
GROUP BY 1,2
)DBM
The error is as follows:
Query 20190814_125505_19433_rcrut failed: line 1:144: extraneous input
'.' expecting {&lt;EOF&gt;, ',', 'EXCEPT', 'FROM', 'GROUP', 'HAVING',
'INTERSECT', 'LIMIT', 'ORDER', 'UNION', 'WHERE'}
The problem is the alias DBM.amount_spent_EUR, but the column has to come out named exactly like this.
How can I get around this?
You can use double quotes in such cases:
as "DBM.amount_spent_EUR"

How to convert a weird date time string with timezone into a timestamp (PySpark)

I have a column called datetime which is a string of the form
Month Name DD YYYY H:MM:SS,nnn AM/PM TZ
where nnn is the nanosecond precision, AM/PM is self-explanatory, and TZ is the timezone, for example MDT.
For example:
Mar 18 2019 9:48:08,576 AM MDT
Mar 18 2019 9:48:08,623 AM MDT
Mar 18 2019 9:48:09,273 AM MDT
The nanosecond precision is important since the logs are so close in time. TZ is optional as they're all in the same timezone, but ideally I would like to capture this too.
Is PySpark able to handle this? I've tried using unix_timestamp with no luck.
Edit
Tried
%sql
formatw = 'MMM dd yyyy H:mm:ss,SSS a z'
select to_date(string)
from table
Get error:
Error in SQL statement: ParseException:
mismatched input 'format' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'MERGE', 'UPDATE', 'CONVERT', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD', 'OPTIMIZE'}(line 1, pos 0)
I would recommend you take a look at the pyspark.sql.functions.to_date(col, format=None) function.
From the documentation:
Converts a Column of pyspark.sql.types.StringType or pyspark.sql.types.TimestampType into pyspark.sql.types.DateType using the optionally specified format. Specify formats according to SimpleDateFormats. By default, it follows casting rules to pyspark.sql.types.DateType if the format is omitted (equivalent to col.cast("date")).
So, you can use all the date patterns specified in Java's SimpleDateFormat.
If you want to use the Python formats, then I would recommend defining your own UDF using datetime. But, using the Spark one has better performance and it's already defined.
Besides, is it nanoseconds or milliseconds (H:mm:ss,SSS)?
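For reference, a minimal sketch using the equivalent Spark SQL to_timestamp on one of the sample strings; it assumes the ,SSS fraction is milliseconds, and depending on your Spark version the trailing timezone abbreviation may not parse with z, in which case you may need to strip it or adjust the pattern:
select to_timestamp('Mar 18 2019 9:48:08,576 AM MDT', 'MMM dd yyyy h:mm:ss,SSS a z') as parsed_ts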
