How to delete duplicate rows in Azure Synapse

How can I delete duplicate rows in Azure Synapse Analytics? I'd like to delete one of the rows where audit_date = '2022-08-10' and city = 'LA', keeping only one row. I've tried the CTE method (ROW_NUMBER() ...), but unfortunately a SQL pool doesn't support DELETE statements with a CTE.
audit_date | city | number_of_toys | number_of_balloons | number_of_drinks
-----------|------|----------------|--------------------|-----------------
2022-08-10 | LA   | 35             | 100                | 40
2022-08-10 | NY   | 20             | 70                 | 30
2022-08-10 | LA   | 35             | 102                | 40

You can do this using DELETE and ROW_NUMBER(). I have created a similar table with the sample data that you have given.
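A minimal setup sketch along those lines (the table name demo comes from the queries below; the column types are assumptions):

-- Hypothetical reproduction of the sample data; types are assumed.
CREATE TABLE demo
(
    audit_date         DATE,
    city               VARCHAR(10),
    number_of_toys     INT,
    number_of_balloons INT,
    number_of_drinks   INT
);

INSERT INTO demo VALUES ('2022-08-10', 'LA', 35, 100, 40);
INSERT INTO demo VALUES ('2022-08-10', 'NY', 20, 70, 30);
INSERT INTO demo VALUES ('2022-08-10', 'LA', 35, 102, 40);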
Now use ROW_NUMBER() to number the rows, partitioned by audit_date and city and filtered on your condition:

SELECT *,
       ROW_NUMBER() OVER (PARTITION BY audit_date, city
                          ORDER BY audit_date, city) AS row_num
FROM demo
WHERE audit_date = '2022-08-10' AND city = 'LA';
You can then use the following query to run the delete only on the rows where row_num > 1:

DELETE my_table FROM
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY audit_date, city
                              ORDER BY audit_date, city) AS row_num
    FROM demo
    WHERE audit_date = '2022-08-10' AND city = 'LA'
) my_table
WHERE row_num > 1;
This way you can delete duplicate records while retaining one row, using DELETE and ROW_NUMBER() as demonstrated above.
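To verify, a quick duplicate check along these lines (a sketch using the same grouping keys):

SELECT audit_date, city, COUNT(*) AS cnt
FROM demo
GROUP BY audit_date, city
HAVING COUNT(*) > 1;
-- should return no rows once only one LA row remains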

Related

Looping a string list and getting the no-record count from a table

I have string values fetched from a table using listagg(column,',').
I want to loop over this string list, set each value into the WHERE clause of a query on another table, and then count how many times that query returns no records (the number of times with no record).
I'm writing this inside a PL/SQL procedure.
orders table:

order_id | name
---------|------
10       | test1
20       | test2
22       | test3
25       | test4

2nd table:

col_id | product | order_id
-------|---------|---------
1      | pro1    | 10
2      | pro2    | 30
3      | pro2    | 38
Expected result: count (number of times with no record) in the 2nd table:
count = 3
because there are no records for order ids 20, 22, 25 in the 2nd table; there is only a record for order_id 10.
My queries:

SELECT listagg(ord.order_id, ',')
  INTO wk_orderids
  FROM orders ord
 WHERE ord.id_no = wk_id_no;

loop
  -- do my stuff
end loop;

wk_orderids value = '10,20,22,25'
I want to loop over this (wk_orderids), set each value one by one into a SELECT query's WHERE clause, and then get the count (number of times with no record).
If you want to count ORDER_IDs in the 1st table that don't exist in the ORDER_ID column of the 2nd table, then your current approach looks as if you were given a task to do it in the most complicated way: aggregating values, looping through them, injecting values into a WHERE clause (which then requires dynamic SQL)... OK, but why? Why not simply:
select count(*)
from (select order_id from first_table
      minus
      select order_id from second_table
     );
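If the count is needed inside the PL/SQL procedure, a minimal sketch (here order_items is a hypothetical name standing in for the unnamed 2nd table):

DECLARE
  wk_missing_count NUMBER;
BEGIN
  -- Count order_ids present in orders but absent from the 2nd table.
  SELECT COUNT(*)
    INTO wk_missing_count
    FROM (SELECT order_id FROM orders
          MINUS
          SELECT order_id FROM order_items);  -- order_items is an assumed name

  dbms_output.put_line('Number of times with no record: ' || wk_missing_count);
END;
/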

How to optimize a delete on a table which doesn't have a primary key but has a timestamp column?

My process does an INSERT INTO a backup table 'B' from a table 'A' which gets updated daily (truncate and load) in an Azure SQL DB.
A column 'TSP' (e.g. value = 2022-12-19T22:06:01.950994) is present in both tables. The TSP is the same for all rows inserted in a day.
Later in the day, I'm supposed to delete the older data.
Currently I'm using delete from B where TSP < 'today - 1 day' logic.
Is there a way to optimize this delete using an index or something?
SSMS suggested creating a nonclustered index on the TSP column of the table.
I tested it, but there doesn't seem to be much difference.
If this was the data:
50mil TSP1
50mil TSP2
50mil TSP3
My expectation was that it would skip scanning the TSP2 and TSP3 rows and delete only the TSP1 rows, whereas if the table doesn't have an index it would need to scan all 150mil rows.
A batched delete operation can use a view to simplify the execution plan; this is the fast ordered delete pattern, which touches the table only once and in turn reduces the amount of I/O required.
Below are sample queries:

CREATE TABLE tableA
(
    id    INT,
    TSP   DATETIME DEFAULT GETDATE(),
    [Log] NVARCHAR(250)
);

DECLARE @I INT = 1;
WHILE @I <= 1000
BEGIN
    INSERT INTO tableA VALUES (@I, GETDATE() - 1, CONCAT('Log message ', @I));
    SET @I = @I + 1;
END
Option 1: using a CTE

;WITH DeleteData AS
(
    SELECT id, TSP, [Log]
    FROM tableA
    WHERE CAST(TSP AS DATE) = CAST(GETDATE() - 1 AS DATE)
)
DELETE FROM DeleteData;
Option 2: using a SQL view

CREATE VIEW VW_tableA AS
SELECT * FROM tableA WHERE CAST(TSP AS DATE) = CAST(GETDATE() - 1 AS DATE);

DELETE FROM VW_tableA;
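If the table holds tens of millions of rows, the delete can also be batched so each transaction stays small; a sketch of that pattern under the same assumptions (the batch size is arbitrary, and the sargable TSP predicate lets a nonclustered index on TSP be used):

DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    -- Delete in chunks to keep transaction log growth and locking bounded.
    DELETE TOP (100000) FROM tableA
    WHERE TSP < DATEADD(DAY, -1, CAST(GETDATE() AS DATE));
    SET @rows = @@ROWCOUNT;
END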
Reference 1: An article by John Sansom on fast-sql-server-delete.
Reference 2: Similar SO thread.

Using Presto HLL to count rolling wau, mau

I am using Presto SQL and HyperLogLog to calculate dau, wau, and mau. However, I am getting the exact same number for all of them. Can anyone suggest what's wrong with my query?
With dau_hll AS (
SELECT
dt,
platform,
service,
account,
country,
CAST(APPROX_SET(userid) AS VARBINARY) AS job_hll_sketch
FROM
xx
GROUP BY
1,2,3,4,5
)
SELECT dt, platform, service, country,
       CARDINALITY(CAST(job_hll_sketch AS HYPERLOGLOG)) AS dau,
       CARDINALITY(MERGE(CAST(job_hll_sketch AS HYPERLOGLOG)) OVER (
           PARTITION BY dt, platform, service, account, country
           ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)) AS wau,
       CARDINALITY(MERGE(CAST(job_hll_sketch AS HYPERLOGLOG)) OVER (
           PARTITION BY dt, platform, service, account, country
           ORDER BY dt ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)) AS mau
FROM dau_hll

How to compare/subtract records

Table A has 20 records and table B shows 19 records. How do I find the one record that is missing in table B? How do I compare/subtract the records of these two tables to find that one record? I'm running the query in Apache Superset.
The exact answer depends on which column(s) define whether two records are the same. Assuming you wanted to use some primary key column for the comparison, you could try:
SELECT a.*
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.pk = a.pk);
If you wanted to use more than one column to compare records from the two tables, you would just add logic to the EXISTS clause, e.g. for three columns:

WHERE NOT EXISTS (SELECT 1 FROM TableB b
                  WHERE b.col1 = a.col1 AND
                        b.col2 = a.col2 AND
                        b.col3 = a.col3)
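If the database behind Superset supports set operators and both tables share the same column layout, a full-row comparison can also be written as a set difference (a sketch, not specific to any one engine):

SELECT * FROM TableA
EXCEPT
SELECT * FROM TableB;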

How to optimize a join?

I have a query to join the tables. How do I optimize it to run faster?
val q = """
| select a.value as viewedid,b.other as otherids
| from bm.distinct_viewed_2610 a, bm.tets_2610 b
| where FIND_IN_SET(a.value, b.other) != 0 and a.value in (
| select value from bm.distinct_viewed_2610)
|""".stripMargin
val rows = hiveCtx.sql(q).repartition(100)
Table descriptions:
hive> desc distinct_viewed_2610;
OK
value string
hive> desc tets_2610;
OK
id int
other string
the data looks like this:
hive> select * from distinct_viewed_2610 limit 5;
OK
1033346511
1033419148
1033641547
1033663265
1033830989
and
hive> select * from tets_2610 limit 2;
OK
1033759023
103973207,1013425393,1013812066,1014099507,1014295173,1014432476,1014620707,1014710175,1014776981,1014817307,1023740250,1031023907,1031188043,1031445197
The distinct_viewed_2610 table has 1.1 million records, and I am trying to get similar ids for it from table tets_2610, which has 200,000 rows, by splitting the second column.
For 100,000 records it takes 8.5 hrs to complete the job with two machines:
one with 16 GB RAM and 16 cores,
the second with 8 GB RAM and 8 cores.
Is there a way to optimize the query?
Right now you are doing a Cartesian join, which gives you 1.1M * 200K = 220 billion rows. Only after the Cartesian join is the result filtered by where FIND_IN_SET(a.value, b.other) != 0.
Analyze your data.
If the 'other' string contains 10 elements on average, then exploding it will give you about 2M rows in table b. And if, say, only 1/10 of the rows join, you will end up with about 2M/10 = 200K rows because of the INNER JOIN.
If these assumptions are correct, then exploding the array and joining will perform better than a Cartesian join + filter.
select distinct a.value as viewedid, b.otherids
from bm.distinct_viewed_2610 a
inner join (select e.otherid, b.other as otherids
            from bm.tets_2610 b
            lateral view explode(split(b.other, ',')) e as otherid
           ) b on a.value = b.otherid
And you do not need this:
and a.value in (select value from bm.distinct_viewed_2610)
Sorry, I cannot test the query, so please try it yourself.
If you are using ORC format, change to Parquet; and depending on your data I would say choose range partitioning.
Choose proper parallelization to execute fast.
I have answered on the following link; it may help you:
Spark doing exchange of partitions already correctly distributed
Also please read this:
http://dev.sortable.com/spark-repartition/
