Merge in Spark SQL - WHEN NOT MATCHED BY SOURCE THEN - apache-spark

I am writing Python and Spark SQL in Databricks, using Spark 2.4.5.
I have two tables.
Create table IF NOT EXISTS db_xsi_ed_faits_shahgholi_ardalan.Destination
(
id Int,
Name string,
Deleted int
) USING Delta;
Create table IF NOT EXISTS db_xsi_ed_faits_shahgholi_ardalan.Source
(
id Int,
Name string,
Deleted int
) USING Delta;
I need to run a MERGE command between my source and destination tables. I wrote the command below:
%sql
MERGE INTO db_xsi_ed_faits_shahgholi_ardalan.Destination AS D
USING db_xsi_ed_faits_shahgholi_ardalan.Source AS S
ON (S.id = D.id)
-- UPDATE
WHEN MATCHED AND S.Name <> D.Name THEN
UPDATE SET
D.Name = S.Name
-- INSERT
WHEN NOT MATCHED THEN
INSERT (id, Name, Deleted)
VALUES (S.id, S.Name, S.Deleted)
-- DELETE
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET
D.Deleted = 1
When I run this command, I get an error.
It seems that Spark does not support NOT MATCHED BY SOURCE! I need a way to achieve the same result.

I wrote the following code, but I am still looking for a better approach:
%sql
MERGE INTO db_xsi_ed_faits_shahgholi_ardalan.Destination AS D
USING db_xsi_ed_faits_shahgholi_ardalan.Source AS S
ON (S.id = D.id)
-- UPDATE
WHEN MATCHED AND S.Name <> D.Name THEN
UPDATE SET
D.Name = S.Name
-- INSERT
WHEN NOT MATCHED THEN
INSERT (id, Name, Deleted)
VALUES (S.id, S.Name, S.Deleted)
;
%sql
-- Logical delete
UPDATE db_xsi_ed_faits_shahgholi_ardalan.Destination
SET Deleted = 1
WHERE db_xsi_ed_faits_shahgholi_ardalan.Destination.id in
(
SELECT
D.id
FROM db_xsi_ed_faits_shahgholi_ardalan.Destination AS D
LEFT JOIN db_xsi_ed_faits_shahgholi_ardalan.Source AS S ON (S.id = D.id)
WHERE S.id is null
)
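The three branches of the intended merge (update, insert, logical delete) can be checked on a tiny data set before running them on Delta tables. The following is only a sketch under assumptions: it uses Python's built-in sqlite3 as a stand-in for Spark SQL, the table names are shortened, and the sample rows are invented for illustration. The logical delete is written as a correlated NOT EXISTS, which expresses the same "row is missing from the source" condition as the IN plus LEFT JOIN version.

```python
import sqlite3

# In-memory SQLite database standing in for the Delta tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Destination (id INT, Name TEXT, Deleted INT);
    CREATE TABLE Source      (id INT, Name TEXT, Deleted INT);
    INSERT INTO Destination VALUES (1, 'a', 0), (2, 'b', 0), (3, 'c', 0);
    INSERT INTO Source      VALUES (1, 'a', 0), (2, 'B', 0), (4, 'd', 0);
""")

# WHEN MATCHED AND S.Name <> D.Name THEN UPDATE
conn.execute("""
    UPDATE Destination
    SET Name = (SELECT S.Name FROM Source AS S WHERE S.id = Destination.id)
    WHERE EXISTS (SELECT 1 FROM Source AS S
                  WHERE S.id = Destination.id AND S.Name <> Destination.Name)
""")

# WHEN NOT MATCHED THEN INSERT
conn.execute("""
    INSERT INTO Destination (id, Name, Deleted)
    SELECT S.id, S.Name, S.Deleted
    FROM Source AS S
    WHERE NOT EXISTS (SELECT 1 FROM Destination AS D WHERE D.id = S.id)
""")

# WHEN NOT MATCHED BY SOURCE: logical delete of rows absent from Source
conn.execute("""
    UPDATE Destination
    SET Deleted = 1
    WHERE NOT EXISTS (SELECT 1 FROM Source AS S WHERE S.id = Destination.id)
""")

result = conn.execute(
    "SELECT id, Name, Deleted FROM Destination ORDER BY id"
).fetchall()
print(result)
```

With this sample data, row 2 is renamed, row 4 is inserted, and row 3 (missing from the source) is flagged as deleted.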

Related

HowTo insert into tableName with select and specifying insert columns at Jooq?

I'm using jOOQ to generate SQL.
Here is the resulting query:
insert into MY_TABLE -- I want INSERT INTO(firstField,secondField)
select
?,
?
where not exists (
select 1
from MY_TABLE
where (
firstField = ?
)
)
returning id
MY_TABLE DDL:
create table IF NOT EXISTS MY_TABLE
(
id SERIAL PRIMARY KEY,
firstField int not null,
secondField int not null
)
I can't make Jooq add field names next to insert into MY_TABLE
My builder:
JooqBuilder.default()
.insertInto(table("MY_TABLE"))
.select(
select(
param(classOf[Int]), // 1
param(classOf[Int]), // 2
)
.whereNotExists(select(inline(1))
.from(table("MY_TABLE"))
.where(
DSL.noCondition()
.and(field("firstField", classOf[Long]).eq(0L))
)
)
).returning(field("id")).getSQL
I've tried
.insertInto(table("MY_TABLE"),field("firstField"), field("secondField"))
UPD:
I was confused by a compiler error.
The correct solution is:
```scala
JooqBuilder.default()
.insertInto(table("MY_TABLE"),
field("firstField",classOf[Int]),
field("secondField",classOf[Int])
)
.select(
select(
param(classOf[Int]),
param(classOf[Int])
)
.whereNotExists(select(inline(1))
.from(table("MY_TABLE"))
.where(
DSL.noCondition()
.and(field("firstField", classOf[Long]).eq(0L))
)
)
).returning(field("id")).getSQL
```
The thing is that Jooq takes field types from insertInto and doesn't compile if select field types don't match.
I've tried
.insertInto(table("MY_TABLE"),
field("firstField"),
field("secondField")
)
and it didn't compile because the untyped fields did not match the types in
.select(
select(
param(classOf[Int]), // 1
param(classOf[Int]) // 2
)
After I added types to the insertInto fields, the types matched: two ints in the insert, two ints in the select.
Jooq generated the expected query:
insert into MY_TABLE (firstField, secondField)
select
?,
?
where not exists (
select 1
from MY_TABLE
where (
firstField = ?
)
)
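To see what an insert-if-absent statement of this shape does at run time, here is a small sketch. It is an illustration under assumptions, not jOOQ output: Python's built-in sqlite3 stands in for PostgreSQL, SERIAL is replaced by SQLite's autoincrement key, `returning id` is left out, and `insert_if_absent` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table MY_TABLE (
    id integer primary key autoincrement,
    firstField int not null,
    secondField int not null
)""")

# Explicit column list, values supplied through a SELECT that only
# produces a row when no row with the same firstField exists yet.
INSERT_IF_ABSENT = """insert into MY_TABLE (firstField, secondField)
select ?, ?
where not exists (
    select 1 from MY_TABLE where firstField = ?
)"""


def insert_if_absent(first_field: int, second_field: int) -> bool:
    """Return True when a row was inserted, False when one already existed."""
    cursor = conn.execute(INSERT_IF_ABSENT, (first_field, second_field, first_field))
    return cursor.rowcount == 1


first_attempt = insert_if_absent(1, 10)   # no row with firstField = 1 yet
second_attempt = insert_if_absent(1, 20)  # skipped: firstField = 1 exists
rows = conn.execute("select firstField, secondField from MY_TABLE").fetchall()
print(first_attempt, second_attempt, rows)
```

Note that the first and third bind values are the same: the value being inserted into firstField is also the one tested in the NOT EXISTS subquery.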
jOOQ just generates exactly the SQL you tell it to generate. You're not listing firstField,secondField in jOOQ, so jOOQ doesn't list them in SQL. To list them in jOOQ, just add:
// ...
.insertInto(table("MY_TABLE"), field("firstField", classOf[Long]), ...)
// ...
Obviously, even without using the code generator, you can reuse expressions by assigning them to local variables:
val t = table("MY_TABLE")
val f1 = field("firstField", classOf[Long])
val f2 = field("secondField", classOf[Long])
And then:
// ...
.insertInto(t, f1, f2)
// ...
Using the code generator
Note that if you were using the code generator, which jOOQ recommends, your query would be much simpler:
ctx.insertInto(MY_TABLE, MY_TABLE.FIRST_FIELD, MY_TABLE.SECOND_FIELD)
.values(v1, v2)
.onDuplicateKeyIgnore()
.returningResult(MY_TABLE.ID)
.fetch();

MssqlRow to json string without knowing structure and data type on compile time [duplicate]

Using PostgreSQL I can have multiple rows of json objects.
select (select ROW_TO_JSON(_) from (select c.name, c.age) as _) as jsonresult from employee as c
This gives me this result:
{"age":65,"name":"NAME"}
{"age":21,"name":"SURNAME"}
But in SQL Server, when I use the FOR JSON AUTO clause, it gives me an array of JSON objects instead of multiple rows.
select c.name, c.age from customer c FOR JSON AUTO
[{"age":65,"name":"NAME"},{"age":21,"name":"SURNAME"}]
How can I get the same result format in SQL Server?
By constructing separate JSON in each individual row:
SELECT (SELECT [age], [name] FOR JSON PATH, WITHOUT_ARRAY_WRAPPER)
FROM customer
There is an alternative form that doesn't require you to know the table structure (but likely has worse performance because it may generate a large intermediate JSON):
SELECT [value] FROM OPENJSON(
(SELECT * FROM customer FOR JSON PATH)
)
An alternative that also needs no knowledge of the structure and should perform better:
SELECT c.id, jdata.*
FROM customer c
cross apply
(SELECT * FROM customer jc where jc.id = c.id FOR JSON PATH , WITHOUT_ARRAY_WRAPPER) jdata (jdata)
Same as Barak Yellin's answer, but lazier:
1-Create this proc
CREATE PROC PRC_SELECT_JSON(@TBL VARCHAR(100), @COLS VARCHAR(1000)='D.*') AS BEGIN
EXEC('
SELECT X.O FROM ' + @TBL + ' D
CROSS APPLY (
SELECT ' + @COLS + '
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER
) X (O)
')
END
2-Can use either all columns or specific columns:
CREATE TABLE #TEST ( X INT, Y VARCHAR(10), Z DATE )
INSERT #TEST VALUES (123, 'TEST1', GETDATE())
INSERT #TEST VALUES (124, 'TEST2', GETDATE())
EXEC PRC_SELECT_JSON #TEST
EXEC PRC_SELECT_JSON #TEST, 'X, Y'
If you're using PHP, add SET NOCOUNT ON; on the first line.
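If the conversion does not have to happen inside SQL Server, a different technique is to build the JSON on the client from the column names the driver reports. The sketch below is an assumption-laden illustration: it uses Python's DB-API with the built-in sqlite3 module as a stand-in for a SQL Server driver, and `rows_as_json` is a hypothetical helper name. It needs no knowledge of the table structure in advance, because the column names are read from `cursor.description` at run time.

```python
import json
import sqlite3
from typing import Iterator


def rows_as_json(cursor) -> Iterator[str]:
    """Yield one JSON object string per result row."""
    # DB-API cursors expose the column names of the last query here.
    columns = [column[0] for column in cursor.description]
    for row in cursor:
        # default=str covers values json cannot encode natively (e.g. dates).
        yield json.dumps(dict(zip(columns, row)), default=str)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, age INT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("NAME", 65), ("SURNAME", 21)],
)

cursor = conn.execute("SELECT name, age FROM customer ORDER BY age DESC")
json_rows = list(rows_as_json(cursor))
for line in json_rows:
    print(line)
```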

Cannot update existing row on conflict in PostgreSQL with Psycopg2

I have the following function defined to insert several rows with iteration in Python using Psycopg2 and PostgreSQL 11.
When I receive the same obj (with same id), I want to update its date.
from typing import Any, Dict, Iterator

import psycopg2.extras
from psycopg2 import Error


def insert_execute_values_iterator(
    connection,
    objs: Iterator[Dict[str, Any]],
    page_size: int = 1000,
) -> None:
    with connection.cursor() as cursor:
        try:
            # Each page of up to page_size rows is sent as one INSERT statement.
            psycopg2.extras.execute_values(cursor, """
                INSERT INTO objs (id, date)
                VALUES %s
                ON CONFLICT (id)
                DO UPDATE SET date = EXCLUDED.date
                """, ((
                    obj['id'],
                    obj['date'],
                ) for obj in objs), page_size=page_size)
        except (Exception, Error) as error:
            print("Error while inserting as in database", error)
When a conflict happens on the unique primary key of the table while inserting an element, I get the error:
Error while inserting as in database ON CONFLICT DO UPDATE command
cannot affect row a second time
HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values.
FYI, the clause works on PostgreSQL directly but not from the Python code.
Use unique VALUE-combinations in your INSERT statement:
create table foo(id int primary key, date date);
This should work:
INSERT INTO foo(id, date)
VALUES(1,'2021-02-17')
ON CONFLICT(id)
DO UPDATE SET date = excluded.date;
This one won't:
INSERT INTO foo(id, date)
VALUES(1,'2021-02-17') , (1, '2021-02-16') -- 2 conflicting rows
ON CONFLICT(id)
DO UPDATE SET date = excluded.date;
You can fix this by using DISTINCT ON() in a SELECT statement:
INSERT INTO foo(id, date)
SELECT DISTINCT ON(id) id, date
FROM (VALUES(1,CAST('2021-02-17' AS date)) , (1, '2021-02-16')) s(id, date)
ORDER BY id, date ASC
ON CONFLICT(id)
DO UPDATE SET date = excluded.date;
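The same de-duplication can also be done on the Python side before the rows reach execute_values. This is a minimal sketch, not the only way to do it: `dedupe_by_id` is a hypothetical helper, and it assumes that the row seen last for a given id should win (the DISTINCT ON query above keeps the earliest date instead, so adjust the rule to your needs).

```python
from typing import Any, Dict, Iterable, List


def dedupe_by_id(objs: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep a single row per id; a later row replaces an earlier one."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for obj in objs:
        latest[obj["id"]] = obj
    return list(latest.values())


incoming = [
    {"id": 1, "date": "2021-02-17"},
    {"id": 2, "date": "2021-02-17"},
    {"id": 1, "date": "2021-02-16"},  # same id again: replaces the first row
]
unique_rows = dedupe_by_id(incoming)
print(unique_rows)
```

The result can then be passed as `objs` to the insert function. The trade-off is that the whole iterator is held in memory, which may matter for very large inputs.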

update and insert into Azure data warehouse using Azure data factory pipelines

I'm trying to run an ADF copy pipeline with update and insert statements that are supposed to replace a MERGE statement. Basically, statements like:
UPDATE TARGET
SET ProductName = SOURCE.ProductName,
TARGET.Rate = SOURCE.Rate
FROM Products AS TARGET
INNER JOIN UpdatedProducts AS SOURCE
ON TARGET.ProductID = SOURCE.ProductID
WHERE TARGET.ProductName <> SOURCE.ProductName
OR TARGET.Rate <> SOURCE.Rate
INSERT Products (ProductID, ProductName, Rate)
SELECT SOURCE.ProductID, SOURCE.ProductName, SOURCE.Rate
FROM UpdatedProducts AS SOURCE
WHERE NOT EXISTS
(
SELECT 1
FROM Products
WHERE ProductID = SOURCE.ProductID
)
If the target is an Azure SQL DB, I would use this approach: https://www.taygan.co/blog/2018/04/20/upsert-to-azure-sql-db-with-azure-data-factory
But if the target is an ADW, the stored procedure option doesn't exist. Any suggestions? Do I have to load a staging table first and then run the update and insert statements from stg_table to target_table? Or is there a way to do it directly from ADF?
If you can't use a stored procedure, my suggestion would be to create a second copy data transform. Run the pre-script on the second transform and drop the table, since it's a temp table that you created in the first.
BEGIN
MERGE Target AS target_sqldb
USING TempTable AS source_tblstg
ON (target_sqldb.Id= source_tblstg.Id)
WHEN MATCHED THEN
UPDATE SET
[Name] = source_tblstg.Name,
[State] = source_tblstg.State
WHEN NOT MATCHED THEN
INSERT([Name], [State])
VALUES (source_tblstg.Name, source_tblstg.State);
DROP TABLE TempTable;
END

How to optimize DELETE .. NOT IN .. SUBQUERY in Firebird

I have this kind of delete query:
DELETE
FROM SLAVE_TABLE
WHERE ITEM_ID NOT IN (SELECT ITEM_ID FROM MASTER_TABLE)
Is there any way to optimize this?
You can use EXECUTE BLOCK for sequential scanning of detail table and deleting records where no master record is matched.
EXECUTE BLOCK
AS
DECLARE VARIABLE C CURSOR FOR
(SELECT d.id
FROM detail d LEFT JOIN master m
ON d.master_id = m.id
WHERE m.id IS NULL);
DECLARE VARIABLE I INTEGER;
BEGIN
OPEN C;
WHILE (1 = 1) DO
BEGIN
FETCH C INTO :I;
IF(ROW_COUNT = 0)THEN
LEAVE;
DELETE FROM detail WHERE id = :I;
END
CLOSE C;
END
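(NOT) IN can usually be optimized by using (NOT) EXISTS instead. Qualify the outer column with its table name; an unqualified ITEM_ID inside the subquery would be resolved against MASTER_TABLE and compare the column with itself.
DELETE
FROM SLAVE_TABLE
WHERE NOT EXISTS (SELECT 1 FROM MASTER_TABLE M WHERE M.ITEM_ID = SLAVE_TABLE.ITEM_ID)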
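One caveat when swapping NOT IN for NOT EXISTS: the two are not equivalent when the subquery can return NULL. The sketch below illustrates this standard SQL behaviour using Python's built-in sqlite3 as a stand-in for Firebird (an assumption; the sample rows and the `remaining_after` helper are invented for illustration). With a NULL ITEM_ID in the master table, NOT IN deletes nothing, while the correlated NOT EXISTS still removes the orphaned row.

```python
import sqlite3
from typing import List


def remaining_after(delete_sql: str) -> List[int]:
    """Run delete_sql on a fresh sample database and return the surviving ids."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE MASTER_TABLE (ITEM_ID INT);
        CREATE TABLE SLAVE_TABLE  (ITEM_ID INT);
        INSERT INTO MASTER_TABLE VALUES (1), (NULL);
        INSERT INTO SLAVE_TABLE  VALUES (1), (2);
    """)
    conn.execute(delete_sql)
    rows = conn.execute(
        "SELECT ITEM_ID FROM SLAVE_TABLE ORDER BY ITEM_ID"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# 2 NOT IN (1, NULL) evaluates to NULL, so the orphan row 2 is kept.
after_not_in = remaining_after(
    "DELETE FROM SLAVE_TABLE "
    "WHERE ITEM_ID NOT IN (SELECT ITEM_ID FROM MASTER_TABLE)"
)

# The correlated NOT EXISTS ignores the NULL and deletes the orphan row 2.
after_not_exists = remaining_after(
    "DELETE FROM SLAVE_TABLE "
    "WHERE NOT EXISTS (SELECT 1 FROM MASTER_TABLE M "
    "WHERE M.ITEM_ID = SLAVE_TABLE.ITEM_ID)"
)

print(after_not_in, after_not_exists)
```

If MASTER_TABLE.ITEM_ID is declared NOT NULL, both forms remove the same rows.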
I am not sure what you are trying to do here, but to me this query indicates that you should be using foreign keys to enforce this kind of constraint, rather than running queries to clean up the mess afterwards.

Resources