ADW - Query performance issues - azure

I have an Azure SQL Warehouse setup of DW500c of gen2 and i have a Data Vault model in it with several tables.
I am trying to execute one query that i think is taking too much time.
Here is the query i have been executing:
SELECT
H_PROFITCENTER.[BK_PROFITCENTER]
,H_ACCOUNT.[BK_ACCOUNT]
,H_LOCALCURRENCY.[BK_CURRENCY]
,H_DOCUMENTCURRENCY.[BK_CURRENCY]
,H_COSTCENTER.[BK_COSTCENTER]
,H_COMPANY.[BK_COMPANY]
,H_CURRENCY.[BK_CURRENCY]
,H_INTERNALORDER.[BK_INTERNALORDER]
,H_VERSION.[BK_VERSION]
,H_COSTELEMENT.[BK_COSTELEMENT]
,H_CALENDARDATE.[BK_DATE]
,H_VALUETYPEREPORT.[BK_VALUETYPEREPORT]
,H_FISCALPERIOD.[BK_FISCALPERIOD]
,H_COUNTRY.[BK_COUNTRY]
,H_FUNCTIONALAREA.[BK_FUNCTIONALAREA]
,SLADI.[LINE_ITEM]
,SLADI.[AMOUNT]
,SLADI.[CREDIT]
,SLADI.[DEBIT]
,SLADI.[QUANTITY]
,SLADI.[BALANCE]
,SLADI.[LOADING_DATE]
FROM [dwh].[L_ACCOUNTINGDOCUMENTITEMS] AS LADI
INNER JOIN [dwh].[SL_ACCOUNTINGDOCUMENTITEMS] AS SLADI ON LADI.[HK_ACCOUNTINGDOCUMENTITEMS] = SLADI.[HK_ACCOUNTINGDOCUMENTITEMS]
LEFT JOIN dwh.H_PROFITCENTERAS H_PROFITCENTER ON H_PROFITCENTER.[HK_PROFITCENTER] = LADI.[HK_PROFITCENTER]
LEFT JOIN dwh.H_ACCOUNT AS H_ACCOUNT ON H_ACCOUNT.[HK_ACCOUNT] = LADI.[HK_ACCOUNT]
LEFT JOIN dwh.H_CURRENCY AS H_LOCALCURRENCY ON H_LOCALCURRENCY.[HK_CURRENCY] = LADI.[HK_LOCALCURRENCY]
LEFT JOIN dwh.H_CURRENCY AS H_DOCUMENTCURRENCY ON H_DOCUMENTCURRENCY.[HK_CURRENCY] = LADI.[HK_DOCUMENTCURRENCY]
LEFT JOIN dwh.H_COSTCENTER AS H_COSTCENTER ON H_COSTCENTER.[HK_COSTCENTER] = LADI.[HK_COSTCENTER]
LEFT JOIN dwh.H_COMPANY AS H_COMPANY ON H_COMPANY.[HK_COMPANY] = LADI.[HK_COMPANY]
LEFT JOIN dwh.H_CURRENCY AS H_CURRENCY ON H_CURRENCY.[HK_CURRENCY] = LADI.[HK_CURRENCY]
LEFT JOIN dwh.H_INTERNALORDERAS H_INTERNALORDER ON H_INTERNALORDER.[HK_INTERNALORDER] = LADI.[HK_INTERNALORDER]
LEFT JOIN dwh.H_VERSION AS H_VERSION ON H_VERSION.[HK_VERSION] = LADI.[HK_VERSION]
LEFT JOIN dwh.H_COSTELEMENT AS H_COSTELEMENT ON H_COSTELEMENT.[HK_COSTELEMENT] = LADI.[HK_COSTELEMENT]
LEFT JOIN dwh.H_DATE AS H_CALENDARDATE ON H_CALENDARDATE.[HK_DATE] = LADI.[HK_CALENDARDATE]
LEFT JOIN dwh.H_VALUETYPEREPORTAS H_VALUETYPEREPORT ON H_VALUETYPEREPORT.[HK_VALUETYPEREPORT] = LADI.[HK_VALUETYPEREPORT]
LEFT JOIN dwh.H_FISCALPERIODAS H_FISCALPERIOD ON H_FISCALPERIOD.[HK_FISCALPERIOD] = LADI.[HK_FISCALPERIOD]
LEFT JOIN dwh.H_COUNTRY AS H_COUNTRY ON H_COUNTRY.[HK_COUNTRY] = LADI.[HK_COUNTRY]
LEFT JOIN dwh.H_FUNCTIONALAREAAS H_FUNCTIONALAREA ON H_FUNCTIONALAREA.[HK_FUNCTIONALAREA] = LADI.[HK_FUNCTIONALAREA]
This query is taking me 22 minutes to execute.
I must say that it returns around 1200000000 rows.
[L_ACCOUNTINGDOCUMENTITEMS] and [SL_ACCOUNTINGDOCUMENTITEMS] are hash distributed by [HK_ACCOUNTINGDOCUMENTITEMS] column and all other tables were created with replicated table distribution.
Also, i activated in azure datawarehouse automatic statistics creation.
Can anyone help me to understand how can i speed it up?

Here are some things to try out to see if you make this faster -
Create a table using 'Create Table as Select' (CTAS) with RoundRobin option for your query and take the timing of that. I have a feeling that returning that large amount of rows to your client could be a big contributor to the time. If the CTAS finishes in lets say 5 minutes, you can safely say that the rest of the time is being taken by return operation.
If not, You can materialize some of the left joins into a table and then add that table to the main query to see if that finishes faster.
You can also look at explain plans to see if you can cut down some steps by aligning the tables on a common key.

Related

WITH SCHEMABINDING on Azure SQL Server Views

I am trying to speed up an Azure SQL Database View and I have read that I should start by using
WITH SCHEMABINDING, however I always get the same error:
Parse error at line: 1, column: 28: Incorrect syntax near 'SCHEMABINDING'. I get the same error on both views below.
We store our data in Azure Data Warehouse ( which I think is called now Azure Synapse Analytics) and I am wondering if this is some sort of limitation that exists and I am unaware of.
I have created non clustered indexes on the tables the view reads and they seem to speed it up, however I am also a bit unsure on that one. I do not know if they automatically kick in or if I need to type WITH (NOEXPAND).
Thanks
CREATE VIEW dbo.vKey1 with SCHEMABINDING AS SELECT KeyUser, Name FROM dbo.Test
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE VIEW [dbo].[email campaigns data test]
with SCHEMABINDING
AS SELECT j.emailname AS 'Email Name',j.emailsubject AS 'Email Subject',COALESCE(l.centre_name,'Non Centre Specific Campaign') AS Centre,j.Category,s.JobId,s.BatchId,s.eventdate_utc As 'Sent Date',COUNT(s.subscriberkey) AS 'Sends',COUNT(b.subscriberkey) as 'Bounce',COUNT(o.subscriberkey) as 'Open',COUNT(c.subscriberkey) as 'Click'
FROM salesforce_marketing_cloud.sent s
JOIN [salesforce_marketing_cloud].[job] j on j.jobid=s.jobid
LEFT JOIN salesforce_marketing_cloud.bounce b on s.jobid=b.jobid AND s.listid=b.listid AND s.batchid=b.batchid AND s.subscriberkey=b.subscriberkey
LEFT JOIN salesforce_marketing_cloud.opens o on s.jobid=o.jobid AND s.listid=o.listid AND s.batchid=o.batchid AND s.subscriberkey=o.subscriberkey
LEFT JOIN salesforce_marketing_cloud.click c on o.jobid=c.jobid AND o.listid=c.listid AND o.batchid=c.batchid AND o.subscriberkey=c.subscriberkey
LEFT JOIN salesforce_marketing_cloud.location mcl on j.fromname=mcl.fromname
LEFT JOIN [location].[location] l ON l.salesforce_marketing_cloud_location_id = mcl.id AND l.[type] = 'Centre'
WHERE s.eventdate_utc>='1/03/2020'
AND (o.IsUnique=1 OR o.IsUnique is Null) AND (c.IsUnique=1 OR c.IsUnique is Null)
GROUP BY j.emailname,j.emailsubject,COALESCE(l.centre_name,'Non Centre Specific Campaign'),j.Category,s.JobId,s.BatchId,s.eventdate_utc;
GO

Is there any table which we can refer to get the partition name in Azure?

I want to know the partition name (similar to PART_NAME in Hive) of the table that has been partitioned specifically on Azure, I've been looking in the System tables (like sys.partitions) for this information but was not able to figure it out, therefore wanted to ask is there any other system tables which I can refer to get the detail about the partition name.
To make the question more clear I've added a screen shot. In the photo I've highlighted a column name on which partition is being performed, I want to retrieve that column name, so is there any way to retrieve that from any System tables?
You have to join a bunch of tables together to get worthwhile Partition information. Here is a sample:
SELECT s.[name] AS [schema]
,t.[name] AS [table]
, i.[name] AS [index_name]
, p.[partition_id]
, p.[partition_number]
, p.[rows] AS [partition_rows]
, rng.[value] AS [partition_value]
, ds.[data_space_id]
, ps.[function_id]
from sys.tables t
join sys.schemas s
on s.[schema_id] = t.[schema_id]
join sys.indexes i
on t.[object_id] = i.[object_id]
join sys.partitions p
on p.[object_id] = t.[object_id]
join sys.data_spaces ds
on i.[data_space_id] = ds.[data_space_id]
join sys.partition_schemes ps
on ps.[data_space_id] = i.[data_space_id]
join sys.partition_functions pf
on ps.[function_id] = pf.[function_id]
join sys.partition_range_values rng
on pf.[function_id] = rng.[function_id] AND rng.[boundary_id] = p.[partition_number]
where s.[name] = #schema_name
and t.[name] = #table_name
Which returns something like the following:

How to get sub query columns in main query with WHERE EXISTS in PostgreSQL?

I am stuck with a query which takes more time in JOIN, I want to use WHERE EXISTS in place of JOIN since as performance wise EXISTS takes less time than it.
I have modified the query and it's executing as per expectation but I am not able to use sub query's columns in my main query
Here is my query
SELECT MAX(st.grade_level::integer) AS grades ,
scl.sid AS org_sourced_id
FROM schedules_53b055b75cd237fde3af904c1e726e12 sch
LEFT JOIN schools scl ON(sch.school_id=scl.school_id)
AND scl.batch_id=sch.batch_id
AND scl.client_id = sch.client_id
AND sch.run_id = scl.run_id
WHERE EXISTS
(SELECT t.term_id,t.abbreviation
FROM terms t
WHERE (sch.term = t.term_id)
AND t.batch_id=sch.batch_id
AND t.client_id = sch.client_id
AND t.run_id = sch.run_id)
AND EXISTS
(SELECT st.grade_level,
st.sid
FROM students st
WHERE (sch.student_id=st.sid)
AND st.batch_id= sch.batch_id
AND st.client_id = sch.client_id
AND st.run_id = sch.run_id)
GROUP BY scl.sid ,
sch.course_name ,
sch.course_number,
sch.school_id
And I am getting this error:
ERROR: missing FROM-clause entry for table "st"
SQL state: 42P01
Character: 29
I have only used one column here just for sample but I have to use more fields from sub query.
My main aim is that how can I achieve this with EXISTS or any alternate solution which is more optimal as performance wise
I am using pg module on Node.js since as back end I am using Node.js.
UPDATE
Query with JOIN
SELECT MAX(st.grade_level::integer) AS grades ,
scl.sid AS org_sourced_id
FROM schedules_53b055b75cd237fde3af904c1e726e12 sch
LEFT JOIN schools scl ON(sch.school_id=scl.school_id)
AND scl.batch_id=sch.batch_id
AND scl.client_id = sch.client_id
AND sch.run_id = scl.run_id
LEFT JOIN terms t ON (sch.term = t.term_id)
AND t.batch_id=sch.batch_id
AND t.client_id = sch.client_id
AND t.run_id = sch.run_id
LEFT JOIN students st ON (sch.student_id=st.sid)
AND st.batch_id= sch.batch_id
AND st.client_id = sch.client_id
AND st.run_id = sch.run_id
GROUP BY scl.sid ,
sch.course_name ,
sch.course_number,
sch.school_id

Effecientcy of using multiple subselects

I'm trying to gather multiple related pieces of data for a master account and create a view (e.g. overdue balance, account balance, debt recovery status, interest hold). Will this approach be effecient? Database platforms are Informix, Oracle and Sql Server. Doing some statistics on Informix I'm just getting 1 sequential scan of auubmast. I assume the sub-selects are quite effecient because they filter down to the account number immediately. I may need many sub-selects before I'm finished. On top of the question of efficiency are there any other 'tidy' approaches?
Thank you.
select
auubmast.acc_num,
auubmast.cls_cde,
auubmast.acc_typ,
(select
sum(auubtrnh.trn_bal)
from auubtrnh, aualtrcd
where aualtrcd.trn_cde = auubtrnh.trn_cde
and auubtrnh.acc_num = auubmast.acc_num
and (auubtrnh.due_dte < current or aualtrcd.trn_typ = 'I')
) as ovd_bal,
(select
sum(auubytdb.ytd_bal)
from auubytdb, auubsvgr
where auubytdb.acc_num = auubmast.acc_num
and auubsvgr.svc_grp = auubmast.svc_grp
and auubytdb.bil_yer = auubsvgr.bil_yer
) as acc_bal,
(select
max(cur_stu)
from audemast
where mdu_acc = auubmast.acc_num
and mdu_ref = 'UB'
) as drc_stu,
(select
hol_typ
from aualhold
where mdu_acc = auubmast.acc_num
and mdu_ref = 'UB'
and pro_num = 2601
and (hol_til is null or hol_til > current)
) as int_hld
from auubmast
In general, the answer to this is that correlated subqueries should be avoided whenever possible.
Using them will result in a full table scan for your view, which is bad. The only times you want to use subqueries like this is if you can limit the range of the main select to only a few rows, or if there really is no other choice.
When you're running into situations like this, you might want to consider adding columns and precalculating them on an update trigger, rather than using subqueries. This will save your database a thrashing.

Cannot link MS Access query with subquery

I have created a query with a subquery in Access, and cannot link it in Excel 2003: when I use the menu Data -> Import External Data -> Import Data... and select the mdb file, the query is not present in the list. If I use the menu Data -> Import External Data -> New Database Query..., I can see my query in the list, but at the end of the import wizard I get this error:
Too few parameters. Expected 2.
My guess is that the query syntax is causing the problem, in fact the query contains a subquery. So, I'll try to describe the query goal and the resulting syntax.
Table Positions
ID (Autonumber, Primary Key)
position (double)
currency_id (long) (references Currency.ID)
portfolio (long)
Table Currency
ID (Autonumber, Primary Key)
code (text)
Query Goal
Join the 2 tables
Filter by portfolio = 1
Filter by currency.code in ("A", "B")
Group by currency and calculate the sum of the positions for each currency group an call the result: sumOfPositions
Calculate abs(sumOfPositions) on each currency group
Calculate the sum of the previous results as a single result
Query
The query without the final sum can be created using the Design View. The resulting SQL is:
SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")));
in order to calculate the final SUM I did the following (in the SQL View):
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
So, the question is: is there a better way for structuring the query in order to make the export work?
I can't see too much wrong with it, but I would take out some of the junk Access puts in and scale down the query to this, hopefully this should run ok:
SELECT Sum(Abs(A.SumOfPosition)) As SumAbs
FROM (SELECT C.code, Sum(P.position) AS SumOfposition
FROM Currency As C INNER JOIN Positions As P ON C.ID = P.currency_id
WHERE P.portfolio=1
GROUP BY C.code
HAVING C.code In ("A","B")) As A
It might be worth trying to declare your parameters in the MS Access query definition and define their datatypes. This is especially important when you are trying to use the query outside of MS Access itself, since it can't auto-detect the parameter types. This approach is sometimes hit or miss, but worth a shot.
PARAMETERS [[Positions].[portfolio]] Long, [[Currency].[code]] Text ( 255 );
SELECT Sum(Abs([temp].[SumOfposition])) AS sumAbs
FROM [SELECT Currency.code, Sum(Positions.position) AS SumOfposition
FROM [Currency] INNER JOIN Positions ON Currency.ID = Positions.currency_id
WHERE (((Positions.portfolio)=1))
GROUP BY Currency.code
HAVING (((Currency.code) In ("A","B")))]. AS temp;
I have solved my problems thanks to the fact that the outer query is doing a trivial sum. When choosing New Database Query... in Excel, at the end of the process, after pressing Finish, an Import Data form pops up, asking
Where do you want to put the data?
you can click on Create a PivotTable report... . If you define the PivotTable properly, Excel will display only the outer sum.

Resources