How to create temporary view in Spark SQL using a CTE? - apache-spark

I'm attempting to create a temp view in Spark SQL using a CTE, with this statement:
create temporary view cars as (
with models as (
select 'abc' as model
)
select model from models
)
But this error is thrown:
error in SQL statement: ParseException:
mismatched input 'with' expecting {'(', 'SELECT', 'FROM', 'DESC', 'VALUES', 'TABLE', 'INSERT', 'DESCRIBE', 'MAP', 'MERGE', 'UPDATE', 'REDUCE'}(line 2, pos 8)
== SQL ==
create temporary view cars as (
with models as (
--------^^^
select 'abc' as model
)
select model from models
)

Removing the brackets after the first as makes it work:
create temporary view cars as
with models as (
select 'abc' as model
)
select model from models
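Judging from the parse error, the parser accepts only a plain query term (SELECT, FROM, VALUES, ...) inside parentheses after AS, so the WITH clause has to stand unparenthesized. If you're running this from PySpark, here's a minimal sketch of the working form (assuming an active SparkSession named spark):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The CTE must follow `as` directly, with no wrapping parentheses.
spark.sql("""
    create temporary view cars as
    with models as (
        select 'abc' as model
    )
    select model from models
""")

spark.sql("select * from cars").show()
# +-----+
# |model|
# +-----+
# |  abc|
# +-----+
```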

Related

How to insert into a table with select and specify the insert columns in jOOQ?

I'm using jOOQ to generate SQL.
Here is the resulting query:
insert into MY_TABLE -- I want INSERT INTO(firstField,secondField)
select
?,
?
where not exists (
select 1
from MY_TABLE
where (
firstField = ?
)
)
returning id
MY_TABLE DDL:
create table IF NOT EXISTS MY_TABLE
(
id SERIAL PRIMARY KEY,
firstField int not null,
secondField int not null
)
I can't make jOOQ add the field names after insert into MY_TABLE.
My builder:
JooqBuilder.default()
.insertInto(table("MY_TABLE"))
.select(
select(
param(classOf[Int]), // 1
param(classOf[Int]), // 2
)
.whereNotExists(select(inline(1))
.from(table("MY_TABLE"))
.where(
DSL.noCondition()
.and(field("firstField", classOf[Long]).eq(0L))
)
)
).returning(field("id")).getSQL
I've tried:
.insertInto(table("MY_TABLE"),field("firstField"), field("secondField"))
UPD:
I was confused by a compiler exception.
The right solution is:
```scala
JooqBuilder.default()
.insertInto(table("MY_TABLE"),
field("firstField",classOf[Int]),
field("secondField",classOf[Int])
)
.select(
select(
param(classOf[Int]),
param(classOf[Int])
)
.whereNotExists(select(inline(1))
.from(table("MY_TABLE"))
.where(
DSL.noCondition()
.and(field("firstField", classOf[Long]).eq(0L))
)
)
).returning(field("id")).getSQL
```
The thing is that jOOQ takes the field types from insertInto and doesn't compile if the select field types don't match.
I had tried:
.insertInto(table("MY_TABLE"),
field("firstField"),
field("secondField")
)
and it didn't compile, since there was no match with
.select(
select(
param(classOf[Int]), // 1
param(classOf[Int]) // 2
)
I added types to the insertInto fields and got a match: two Ints in the insert, two Ints in the select.
jOOQ generated the expected query:
insert into MY_TABLE (firstField, secondField)
select
?,
?
where not exists (
select 1
from MY_TABLE
where (
firstField = ?
)
)
jOOQ just generates exactly the SQL you tell it to generate. You're not listing firstField, secondField in jOOQ, so jOOQ doesn't list them in SQL. To list them in jOOQ, just add:
// ...
.insertInto(table("MY_TABLE"), field("firstField", classOf[Long]), ...)
// ...
Obviously, even without using the code generator, you can reuse expressions by assigning them to local variables:
val t = table("MY_TABLE")
val f1 = field("firstField", classOf[Long])
val f2 = field("secondField", classOf[Long])
And then:
// ...
.insertInto(t, f1, f2)
// ...
Using the code generator
Note that if you were using the code generator, which jOOQ recommends, your query would be much simpler:
ctx.insertInto(MY_TABLE, MY_TABLE.FIRST_FIELD, MY_TABLE.SECOND_FIELD)
.values(v1, v2)
.onDuplicateKeyIgnore()
.returningResult(MY_TABLE.ID)
.fetch();

Using AVG in Spark with a window function

I have the following SQL query:
Select st.Value,
st.Id,
ntile(2) OVER (PARTITION BY St.Id, St.VarId ORDER By St.Sls),
AVG(St.Value) OVER (PARTITION BY St.Id, St.VarId ORDER By St.Sls, St.Date)
FROM table tb
INNER JOIN staging st on St.Id = tb.Id
I've tried to adapt this to Spark/PySpark using window functions; my code is below:
windowSpec_1 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls")
windowSpec_2 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls", "staging.Date")
df= table.join(
staging,
on=f.col("staging.Id") == f.col("table.Id"),
how='inner'
).select(
f.col("staging.Value"),
f.ntile(2).over(windowSpec_1),
f.avg("staging.Value").over(windowSpec_2)
)
However, I'm getting the following error:
pyspark.sql.utils.AnalysisException: Can't extract value from Value#42928: need struct type but got decimal(16,6)
How can I solve this problem? Is it necessary to group the data?
Maybe you forgot to assign an alias to staging:
df= table.join(
staging.alias("staging"),
on=f.col("staging.Id") == f.col("table.Id"),
how='inner'
).select(
f.col("staging.Value"),
f.ntile(2).over(windowSpec_1),
f.avg("staging.Value").over(windowSpec_2)
)
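Note that the join condition references table.Id as well, so the left-hand DataFrame presumably needs an alias too. A self-contained sketch, with hypothetical toy data standing in for table and staging:

```python
import pyspark.sql.functions as f
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the real tables
table = spark.createDataFrame([(1,), (2,)], ["Id"])
staging = spark.createDataFrame(
    [(1, 1, 1, 10.0, "2023-01-01"), (1, 1, 2, 20.0, "2023-01-02")],
    ["Id", "VarId", "Sls", "Value", "Date"],
)

windowSpec_1 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls")
windowSpec_2 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls", "staging.Date")

# Alias both sides so the qualified column references resolve
df = table.alias("table").join(
    staging.alias("staging"),
    on=f.col("staging.Id") == f.col("table.Id"),
    how="inner",
).select(
    f.col("staging.Value"),
    f.ntile(2).over(windowSpec_1).alias("tile"),
    f.avg("staging.Value").over(windowSpec_2).alias("avg_value"),
)
df.show()
```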

Synapse Spark SQL Delta Merge Mismatched Input Error

I am trying to update the historical table, but am getting a merge error. When I run this cell:
%%sql
select * from main
UNION
select * from historical
where Summary_Employee_ID=25148
I get a two-row table that looks like:
EmployeeID Name
25148 Wendy Clampett
25148 Wendy Monkey
I'm trying to update the Name using the following merge command:
%%sql
MERGE INTO main m
using historical h
on m.Employee_ID=h.Employee_ID
WHEN MATCHED THEN
UPDATE SET
m.Employee_ID=h.Employee_ID,
m.Name=h.Name
WHEN NOT MATCHED THEN
INSERT(Employee,Name)
VALUES(h.Employee,h.Name)
Here's my error:
Error:
mismatched input 'MERGE' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos 0)
Synapse doesn't support SQL MERGE the way Databricks does. However, you can use the Python solution. Note historical was really my updates...
So for the above, I used:
import delta

# `historical` is a DataFrame holding the updates; "path" points at the Delta table
main = delta.DeltaTable.forPath(spark, "path")
(main
    .alias("main")
    .merge(historical.alias("historical"),
           "main.Employee_ID = historical.Employee_ID")
    .whenMatchedUpdate(set = {"Employee_ID": "historical.Employee_ID",
                              "Name": "historical.Name"})
    .whenNotMatchedInsert(values = {"Employee_ID": "historical.Employee_ID",
                                    "Name": "historical.Name"})
    .execute()
)
Your goal is to upsert the target table historical, but in your query the merge target is set to main instead of historical, and the update and insert statements are likewise pointed at the wrong side.
Try the following:
%%sql
MERGE INTO historical target
using main source
on source.Employee_ID=target.Employee_ID
WHEN MATCHED THEN
UPDATE SET
target.Name=source.Name
WHEN NOT MATCHED THEN
INSERT (Employee_ID, Name)
VALUES(source.Employee_ID, source.Name)
MERGE is supported in Spark 3.0, which is currently in preview, so this might be worth a try. I did see the same error on the Spark 3.0 pool, but it's quite misleading: it actually means that you're trying to merge on duplicate data, or that you're offering duplicate data to the original set. I validated this by querying the Delta Lake and the raw file for duplicates with the serverless SQL pool and PolyBase.
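If you want to test the duplicate-data hypothesis, a quick check along these lines may help (a sketch; table and column names taken from the question):

```python
# Keys occurring more than once will break the MERGE, since a target row
# must be matched by at most one source row.
spark.sql("""
    select Employee_ID, count(*) as n
    from historical
    group by Employee_ID
    having count(*) > 1
""").show()
```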

Use Common Table Expression with Pony ORM

I have the following query that contains a common table expression:
WITH example AS (
SELECT unnest(ARRAY['foo', 'bar', 'baz']) as col
)
SELECT *
FROM example
Trying to use it in database.select(query) throws pony.orm.dbapiprovider.ProgrammingError: syntax error at or near "WITH", and database.select(raw_sql(query)) throws TypeError: expected string or bytes-like object.
How can I select data using a CTE with ponyorm?
To use a query containing a CTE, call the execute function on the database and fetch the rows with the returned cursor:
cursor = database.execute("""
WITH example AS (
SELECT unnest(ARRAY['foo', 'bar', 'baz']) as col
)
SELECT *
FROM example
""")
rows = cursor.fetchall()
Note: the cursor is a psycopg2 class, so while this solution does use the pony library, the details may differ depending on the database being used.
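For completeness, a minimal sketch of the whole flow (the bind() arguments are placeholders for your environment; Pony requires an active db_session around database calls):

```python
from pony.orm import Database, db_session

db = Database()
# Placeholder connection settings; adjust for your database.
db.bind(provider="postgres", user="user", password="secret",
        host="localhost", database="mydb")

with db_session:
    cursor = db.execute("""
        WITH example AS (
            SELECT unnest(ARRAY['foo', 'bar', 'baz']) AS col
        )
        SELECT *
        FROM example
    """)
    rows = cursor.fetchall()  # [('foo',), ('bar',), ('baz',)]
```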

How to Left Join in Presto SQL?

Can't for the life of me figure out a simple left join in Presto, even after reading the documentation. I'm very familiar with Postgres and tested my query there to make sure there wasn't a glaring error on my part. Please reference code below:
select * from
(select cast(order_date as date),
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1)
left join
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1) as b
on a.order_date = b.summary_date
Running that gives the following error:
SQL Error: Failed to run query
line 1:1: mismatched input 'on' expecting {'(', 'SELECT', 'DESC', 'WITH',
'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'GRANT',
'REVOKE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'CALL', 'PREPARE', 'DEALLOCATE', 'EXECUTE'} (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: a33a6671-07a2-4d7b-bb75-f70f7b82409e)
The first problem I notice is that your join clause assumes the first sub-query is aliased as a, but it is not aliased at all. I recommend aliasing that table to see if that fixes it (I also recommend aliasing the order_date column explicitly outside of the cast() statement since you are joining on that column).
Try this:
select * from
(select cast(order_date as date) as order_date,
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1) as a
left join
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1) as b
on a.order_date = b.summary_date
One option is to declare your subqueries by using with:
with a as
(select cast(order_date as date) as order_date,
count(distinct(source_order_id)) as prim_orders,
sum(quantity) as prim_tickets,
sum(sale_amount) as prim_revenue
from table_a
where order_date >= date '2018-01-01'
group by 1),
b as
(select summary_date,
sum(impressions) as sem_impressions,
sum(clicks) as sem_clicks,
sum(spend) as sem_spend,
sum(total_orders) as sem_orders,
sum(total_tickets) as sem_tickets,
sum(total_revenue) as sem_revenue
from table_b
where site like '%SEM%'
and summary_date >= date '2018-01-01'
group by 1)
select * from a
left join b
on a.order_date = b.summary_date;
