MySQL slow multiple join query - version 5.7 - mysql-5.7

I have the following query:
SELECT *
FROM `ResearchEntity` AS `ResearchEntity`
LEFT OUTER JOIN `UserEntity` AS `createdBy` ON `ResearchEntity`.`createdById` = `createdBy`.`id`
LEFT OUTER JOIN `Item1Entity` AS `item1` ON `ResearchEntity`.`id` = `item1`.`researchId`
LEFT OUTER JOIN `Item2Entity` AS `item2` ON `ResearchEntity`.`id` = `item2`.`researchId`
LEFT OUTER JOIN `Item3Entity` AS `item3` ON `ResearchEntity`.`id` = `item3`.`researchId`
LEFT OUTER JOIN `Item4Entity` AS `item4` ON `ResearchEntity`.`id` = `item4`.`researchId`
LEFT OUTER JOIN `Item5Entity` AS `item5` ON `ResearchEntity`.`id` = `item5`.`researchId`
LEFT OUTER JOIN `Item6Entity` AS `item6` ON `ResearchEntity`.`id` = `item6`.`researchId`
LEFT OUTER JOIN `Item7Entity` AS `item7` ON `ResearchEntity`.`id` = `item7`.`researchId`
LEFT OUTER JOIN `Item8Entity` AS `item8` ON `ResearchEntity`.`id` = `item8`.`researchId`
LEFT OUTER JOIN `Item9Entity` AS `item9` ON `ResearchEntity`.`id` = `item9`.`researchId`
LEFT OUTER JOIN `Item10Entity` AS `item10` ON `ResearchEntity`.`id` = `item10`.`researchId`
ORDER BY `ResearchEntity`.`id` DESC
LIMIT 20, 40;
UserEntity has 4 records, and each Item*Entity has 15 records.
When I remove the first join (the UserEntity join), the query runs very fast, but with it the query takes 50 seconds.
The query is built dynamically at runtime by an ORM.
Why is UserEntity causing this much trouble?
Thanks

Try changing your SELECT * to select specific fields from each entity; the load of your query will be reduced and the run time should be faster. Something like:
SELECT ResearchEntity.field1 as ReField1, item1.name as Item1Name, ...
-- Edit
Also, try to keep the smaller tables to the left of the JOIN statements.
Ensure the joining fields are indexed and that they are INT columns (a minimal sketch follows below).
If you are sure that one table is costing too much time, consider fetching that table's results separately in your code.
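For instance, a sketch of the indexing advice, assuming the table and column names from the question's query (verify against your actual schema):
-- Index the foreign keys used in the ON clauses.
ALTER TABLE ResearchEntity ADD INDEX idx_research_createdById (createdById);
ALTER TABLE Item1Entity ADD INDEX idx_item1_researchId (researchId);
-- ...repeat for Item2Entity through Item10Entity.
With these in place, EXPLAIN should show ref lookups on the joined tables instead of full scans.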

Related

Databricks AnalysisException: Column 'l' does not exist

I have a very strange occurrence with my code.
I keep on getting the error
AnalysisException: Column 'homepage_url' does not exist
However, when I do a select with cross joins, the column does actually exist.
Can someone take a look at my cross joins and let me know if that is where the problem is?
SELECT DISTINCT
account.xpd_relationshipstatus AS CRM_xpd_relationshipstatus
,REPLACE(owneridname,'Data.Import #','') AS MontaguOwner
,account.ts_montaguoffice AS Montagu_Office
,CAST(account.ts_reminderdatesetto AS DATE) AS CRM_ts_reminderdatesetto
,CAST(account.ts_lastdatestatuschanged AS DATE) AS YearofCRMtslastdatestatuschanged
,organizations.name AS nameCB
,organizations.homepage_url
,iff(e like 'www.%', e, 'www.' + e) AS website
,left(category_list,charindex(',',category_list +',' )-1) AS category_CB
-- ,case when charindex(',',category_list,0) > 0 then left(category_list,charindex(',',category_list)-1) else category_list end as category_CB
,organizations.category_groups_list AS category_groups_CB
FROM basecrmcbreport.account
LEFT OUTER JOIN basecrmcbreport.CRM2CBURL_Lookup
ON account.Id = CRM2CBURL_Lookup.Key
LEFT OUTER JOIN basecrmcbreport.organizations
ON CRM2CBURL_Lookup.CB_URL_KEY = organizations.cb_url
CROSS JOIN (values (charindex('://', homepage_url))) a(a)
CROSS JOIN (values (iff(a = 0, 1, a + 3))) b(b)
CROSS JOIN (values (charindex('/', homepage_url, b))) c(c)
CROSS JOIN (values (iff(c = 0, length(homepage_url) + 1, c))) d(d)
CROSS JOIN (values (substring(homepage_url, b, d - b))) e(e)
Without the cross joins, the query runs and the column resolves fine.
The reason the column is recognized when you join on a SELECT, but not when you join on table-valued functions, is that joins operate only on tables and table-like result sets.
To join on a table-valued function one would normally use CROSS APPLY or OUTER APPLY, but these are not supported in Databricks SQL.
The demo below uses a table demo1 with columns id and team, where team holds values such as 'OG' and 'TS'.
I tried using an inner join on a table-valued function with the following query and got the same error:
select d1.*, a from demo1 d1 inner join (values(if(d1.team = 'OG', 2, 1))) a;
Joining on a SELECT subquery instead works, because that is how joins are meant to be used:
select d1.*, a.no_of_wins
from demo1 d1
inner join (
    select id, case team when 'OG' then 2 when 'TS' then 1 end as no_of_wins
    from demo1
) a on d1.id = a.id;
So the remedy for this problem is to replace every table-valued function you are joining on with a SELECT statement.
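Applied to the original query, one possible rewrite (a sketch, untested; it assumes a runtime with lateral column aliases, available in Databricks from DBR 12.2, so that each alias can reference the previous one) folds the cross-joined VALUES into named expressions in the SELECT list:
SELECT
    homepage_url,
    charindex('://', homepage_url) AS a,
    iff(a = 0, 1, a + 3) AS b,
    charindex('/', homepage_url, b) AS c,
    iff(c = 0, length(homepage_url) + 1, c) AS d,
    substring(homepage_url, b, d - b) AS e,
    iff(e LIKE 'www.%', e, concat('www.', e)) AS website
FROM basecrmcbreport.organizations;
Note concat is used in place of the + operator, since Spark SQL does not concatenate strings with +.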

Spark inner join API returns too many records

I have two identical dataframes, each with 27,817 rows. Trying to inner join these dataframes returns 128,954,989 rows:
dataframe1.join(dataframe2,"_c0").count
res16: Long = 128954989
How do I resolve this?
It happens because your join is creating a cartesian product: the join key is not unique, so every row on one side matches many rows on the other. If you want to keep the rows on the left side of the join you can do a left join like:
dataframe1.join(dataframe2, Seq("_c0"), "left")
Spark also supports other join types (inner, cross, full outer, left, right, left semi, left anti); select the one that fits your need.
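To confirm the diagnosis, count the duplicated join keys; every duplicate on one side multiplies the matches from the other. A diagnostic sketch in Spark SQL (it assumes the dataframes have been registered as temp views named df1 and df2, which is not shown in the question):
SELECT _c0, COUNT(*) AS n
FROM df1
GROUP BY _c0
HAVING COUNT(*) > 1
ORDER BY n DESC;
The inner join returns the sum over shared keys of n1 * n2, which is how 27,817 input rows can explode to 128,954,989.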

ADW - Query performance issues

I have an Azure SQL Data Warehouse setup of DW500c (Gen2), and in it I have a Data Vault model with several tables.
I am trying to execute one query that I think is taking too much time.
Here is the query I have been executing:
SELECT
H_PROFITCENTER.[BK_PROFITCENTER]
,H_ACCOUNT.[BK_ACCOUNT]
,H_LOCALCURRENCY.[BK_CURRENCY]
,H_DOCUMENTCURRENCY.[BK_CURRENCY]
,H_COSTCENTER.[BK_COSTCENTER]
,H_COMPANY.[BK_COMPANY]
,H_CURRENCY.[BK_CURRENCY]
,H_INTERNALORDER.[BK_INTERNALORDER]
,H_VERSION.[BK_VERSION]
,H_COSTELEMENT.[BK_COSTELEMENT]
,H_CALENDARDATE.[BK_DATE]
,H_VALUETYPEREPORT.[BK_VALUETYPEREPORT]
,H_FISCALPERIOD.[BK_FISCALPERIOD]
,H_COUNTRY.[BK_COUNTRY]
,H_FUNCTIONALAREA.[BK_FUNCTIONALAREA]
,SLADI.[LINE_ITEM]
,SLADI.[AMOUNT]
,SLADI.[CREDIT]
,SLADI.[DEBIT]
,SLADI.[QUANTITY]
,SLADI.[BALANCE]
,SLADI.[LOADING_DATE]
FROM [dwh].[L_ACCOUNTINGDOCUMENTITEMS] AS LADI
INNER JOIN [dwh].[SL_ACCOUNTINGDOCUMENTITEMS] AS SLADI ON LADI.[HK_ACCOUNTINGDOCUMENTITEMS] = SLADI.[HK_ACCOUNTINGDOCUMENTITEMS]
LEFT JOIN dwh.H_PROFITCENTER AS H_PROFITCENTER ON H_PROFITCENTER.[HK_PROFITCENTER] = LADI.[HK_PROFITCENTER]
LEFT JOIN dwh.H_ACCOUNT AS H_ACCOUNT ON H_ACCOUNT.[HK_ACCOUNT] = LADI.[HK_ACCOUNT]
LEFT JOIN dwh.H_CURRENCY AS H_LOCALCURRENCY ON H_LOCALCURRENCY.[HK_CURRENCY] = LADI.[HK_LOCALCURRENCY]
LEFT JOIN dwh.H_CURRENCY AS H_DOCUMENTCURRENCY ON H_DOCUMENTCURRENCY.[HK_CURRENCY] = LADI.[HK_DOCUMENTCURRENCY]
LEFT JOIN dwh.H_COSTCENTER AS H_COSTCENTER ON H_COSTCENTER.[HK_COSTCENTER] = LADI.[HK_COSTCENTER]
LEFT JOIN dwh.H_COMPANY AS H_COMPANY ON H_COMPANY.[HK_COMPANY] = LADI.[HK_COMPANY]
LEFT JOIN dwh.H_CURRENCY AS H_CURRENCY ON H_CURRENCY.[HK_CURRENCY] = LADI.[HK_CURRENCY]
LEFT JOIN dwh.H_INTERNALORDER AS H_INTERNALORDER ON H_INTERNALORDER.[HK_INTERNALORDER] = LADI.[HK_INTERNALORDER]
LEFT JOIN dwh.H_VERSION AS H_VERSION ON H_VERSION.[HK_VERSION] = LADI.[HK_VERSION]
LEFT JOIN dwh.H_COSTELEMENT AS H_COSTELEMENT ON H_COSTELEMENT.[HK_COSTELEMENT] = LADI.[HK_COSTELEMENT]
LEFT JOIN dwh.H_DATE AS H_CALENDARDATE ON H_CALENDARDATE.[HK_DATE] = LADI.[HK_CALENDARDATE]
LEFT JOIN dwh.H_VALUETYPEREPORT AS H_VALUETYPEREPORT ON H_VALUETYPEREPORT.[HK_VALUETYPEREPORT] = LADI.[HK_VALUETYPEREPORT]
LEFT JOIN dwh.H_FISCALPERIOD AS H_FISCALPERIOD ON H_FISCALPERIOD.[HK_FISCALPERIOD] = LADI.[HK_FISCALPERIOD]
LEFT JOIN dwh.H_COUNTRY AS H_COUNTRY ON H_COUNTRY.[HK_COUNTRY] = LADI.[HK_COUNTRY]
LEFT JOIN dwh.H_FUNCTIONALAREA AS H_FUNCTIONALAREA ON H_FUNCTIONALAREA.[HK_FUNCTIONALAREA] = LADI.[HK_FUNCTIONALAREA]
This query is taking 22 minutes to execute.
I must say that it returns around 1,200,000,000 rows.
[L_ACCOUNTINGDOCUMENTITEMS] and [SL_ACCOUNTINGDOCUMENTITEMS] are hash-distributed by the [HK_ACCOUNTINGDOCUMENTITEMS] column, and all the other tables were created with replicated table distribution.
Also, I activated automatic statistics creation in Azure Data Warehouse.
Can anyone help me understand how I can speed it up?
Here are some things to try out to see if you can make this faster:
Create a table from your query using CREATE TABLE AS SELECT (CTAS) with the ROUND_ROBIN option and take the timing of that (a minimal sketch follows this list). I have a feeling that returning that large a number of rows to your client could be a big contributor to the time. If the CTAS finishes in, let's say, 5 minutes, you can safely say that the rest of the time is taken by the return operation.
If not, you can materialize some of the left joins into a table and then add that table to the main query to see if it finishes faster.
You can also look at explain plans to see if you can cut down some steps by aligning the tables on a common key.
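A minimal sketch of the CTAS timing test (the target table name is illustrative, and the SELECT list here is shortened; in practice you would paste the full SELECT from the question):
CREATE TABLE dwh.ADI_TIMING_TEST
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT SLADI.[LINE_ITEM], SLADI.[AMOUNT], SLADI.[LOADING_DATE]
FROM [dwh].[L_ACCOUNTINGDOCUMENTITEMS] AS LADI
INNER JOIN [dwh].[SL_ACCOUNTINGDOCUMENTITEMS] AS SLADI
    ON LADI.[HK_ACCOUNTINGDOCUMENTITEMS] = SLADI.[HK_ACCOUNTINGDOCUMENTITEMS];
Comparing its run time against the original 22 minutes separates the join work from the cost of streaming ~1.2 billion rows back to the client.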

JPQL LEFT JOIN is not working

I want to get the list of all Branches, even those that have no accounts with the user role.
Query query = em.createQuery("SELECT NEW com.package.BranchInstructors(b,a) FROM Branch b LEFT JOIN b.accounts a WHERE b.dFlg = 0 AND a.userRole = :role ORDER BY b.name ASC");
query.setParameter("role", "user");
return query.getResultList();
Unfortunately it returns only the Branches that have accounts with the user role, as if it were doing an INNER JOIN instead.
Any idea what's going on?
Just add an a.userRole IS NULL condition to your query, so the NULL userRole rows produced by the left join are not filtered out:
SELECT NEW com.package.BranchInstructors(b,a)
FROM Branch b
LEFT JOIN b.accounts a
WHERE b.dFlg = 0
AND (a.userRole = :role OR a.userRole IS NULL)
ORDER BY b.name ASC
The problem is in your WHERE clause versus your LEFT JOIN condition.
If you LEFT JOIN the Accounts table but then filter on that table in the WHERE clause with an AND condition, the query behaves like an inner JOIN.
So you can put the condition in the LEFT JOIN itself using WITH (Hibernate's syntax; JPA 2.1 standardized it as ON):
Query query = em.createQuery("SELECT NEW com.package.BranchInstructors(b,a) FROM Branch b "
        + "LEFT JOIN b.accounts a WITH a.userRole = :role "
        + "WHERE b.dFlg = 0 ORDER BY b.name ASC");
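In SQL terms, the difference is where the role filter lands. A sketch with illustrative table and column names (the real names depend on your entity mappings):
-- Filter in the join condition: unmatched branches survive with NULL account columns.
SELECT b.*, a.*
FROM branch b
LEFT JOIN account a
    ON a.branch_id = b.id
    AND a.user_role = 'user'
WHERE b.d_flg = 0
ORDER BY b.name ASC;
Moving a.user_role = 'user' into the WHERE clause discards the NULL-extended rows, which is exactly the inner-join behavior observed in the question.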

How to make subquery fast

For an author overview we are looking for a query that shows all the authors, each with their best book. The problem with this query is that it lacks speed: there are only about 1,500 authors, yet the query to generate the overview currently takes 20 seconds.
The main problem seems to be generating the average rating of all the books per person.
The following query is still rather fast:
select
person.id as pers_id,
person.firstname,
person.suffix,
person.lastname,
thriller.title,
year(thriller.orig_pubdate) as year,
thriller.id as thrill_id,
count(user_rating.id) as nr,
AVG(user_rating.rating) as avgrating
from
thriller
inner join
thriller_form
on thriller_form.thriller_id = thriller.id
inner join
thriller_person
on thriller_person.thriller_id = thriller.id
and thriller_person.person_type_id = 1
inner join
person
on person.id = thriller_person.person_id
left outer join
user_rating
on user_rating.thriller_id = thriller.id
and user_rating.rating_type_id = 1
where thriller.id in
(select top 1 B.id from thriller as B
inner join thriller_person as C on B.id=C.thriller_id
and person.id=C.person_id)
group by
person.firstname,
person.suffix,
person.lastname,
thriller.title,
year(thriller.orig_pubdate),
thriller.id,
person.id
order by
person.lastname
However, if we make the subquery a little more complex by selecting the book by its average rating, it takes a full 20 seconds to generate a result set.
The query would then be as follows:
select
person.id as pers_id,
person.firstname,
person.suffix,
person.lastname,
thriller.title,
year(thriller.orig_pubdate) as year,
thriller.id as thrill_id,
count(user_rating.id) as nr,
AVG(user_rating.rating) as avgrating
from
thriller
inner join
thriller_form
on thriller_form.thriller_id = thriller.id
inner join
thriller_person
on thriller_person.thriller_id = thriller.id
and thriller_person.person_type_id = 1
inner join
person
on person.id = thriller_person.person_id
left outer join
user_rating
on user_rating.thriller_id = thriller.id
and user_rating.rating_type_id = 1
where thriller.id in
(select top 1 B.id from thriller as B
inner join thriller_person as C on B.id=C.thriller_id
and person.id=C.person_id
inner join user_rating as D on B.id=D.thriller_id
group by B.id
order by AVG(D.rating))
group by
person.firstname,
person.suffix,
person.lastname,
thriller.title,
year(thriller.orig_pubdate),
thriller.id,
person.id
order by
person.lastname
Anyone got a good suggestion to speed up this query?
Calculating an average requires a table scan since you've got to sum the values and then divide by the number of (relevant) rows. This in turn means that you're doing a lot of rescanning; that's slow. Can you calculate the averages once and store them? That would let your query use those pre-computed values. (Yes, it denormalizes the data, but denormalizing for performance is often necessary; there's a trade-off between performance and minimal data.)
It might be appropriate to use a temporary table as the store of the averages.
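A sketch of that pre-computation in T-SQL (matching the question's top syntax; the temp-table name is illustrative):
-- Build the per-thriller averages once.
SELECT thriller_id,
       COUNT(*) AS nr,
       AVG(CAST(rating AS decimal(10,2))) AS avgrating  -- cast so integer ratings average fractionally
INTO #thriller_avg
FROM user_rating
WHERE rating_type_id = 1
GROUP BY thriller_id;
The main query can then join #thriller_avg on thriller_id, and the correlated subquery can pick each author's best book by ordering on the stored avgrating instead of re-aggregating user_rating for every row.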
