Spark join performance optimization - apache-spark

I have to execute the query below using Spark. How can I optimize the join?
Each DataFrame has millions of records.
SELECT DISTINCT col1,
col2,
col3...
FROM ool
INNER JOIN ooh ON ool.header_key = ooh.header_key
AND ool.org_key = ooh.org_key
INNER JOIN msib ON ool.inventory_item_key = msib.inventory_item_key
AND ool.ship_from_org_key = msib.o_key
INNER JOIN curr ON curr.from_currency = ooh.transactional_curr_code
AND date_format(curr.date, 'yyyy-MM-dd') = date_format(ooh.date, 'yyyy-MM-dd')
INNER JOIN mtl_parameters mp ON ool.ship_from_org_key = mp.o_key
INNER JOIN ood ON ood.o_key = mp.o_key
INNER JOIN ot ON ooh.order_type_key = ot.transaction_type_key
INNER JOIN csu ON ool.ship_to_org_key = csu.site_use_key
INNER JOIN csa ON csu.site_key = csa._site_key
INNER JOIN ps ON csa.party_key = ps.party_key
INNER JOIN hca ON csa.account_key = hca.account_key
INNER JOIN hp ON hca.party_key = hp.party_key
INNER JOIN hl ON ps.location_key = hl.location_key
INNER JOIN csu1 ON ool.invoice_to_key = csu1.use_key
INNER JOIN csa1 ON ool.invoice_to_key = csu1.use_key
AND csu1.cust_acctkey = csa1.custkey
INNER JOIN ps1 ON csa1.party_key = ps1.party_key
INNER JOIN hca1 ON csa1.cust_key = hca1.cust_key
INNER JOIN hp1 ON hca1.party_key = hp1.party_key
INNER JOIN hl1 ON ps1.loc_key = hl1.loc_key
INNER JOIN hou ON ool.or_key = hou.o_key
How can I optimize this join in PySpark?
ooh and ool are the driver DataFrames, and their record counts will be in the hundreds of millions.
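A sketch of two common levers for a query shaped like this, not a verified answer: (1) broadcast-hint the small dimension tables so Spark only shuffles the two large fact tables, and (2) compare dates directly with to_date instead of formatting both sides to strings, which is cheaper and friendlier to the optimizer. Which tables are actually small enough to broadcast is an assumption you must check against spark.sql.autoBroadcastJoinThreshold; the column list here is abbreviated as in the question.

```python
# Assumed-small dimension tables from the question (verify their sizes!):
small_dims = ["curr", "mp", "ood", "ot", "csu", "csa", "ps", "hca",
              "hp", "hl", "csu1", "csa1", "ps1", "hca1", "hp1", "hl1", "hou"]

# Spark SQL join hint: broadcast each small table to every executor so
# the big ool/ooh tables are never shuffled for these joins.
broadcast_hint = "/*+ BROADCAST({}) */".format(", ".join(small_dims))

# Replace the string-formatting date match with a direct date comparison
# (assumes curr.date and ooh.date are date/timestamp typed):
date_condition = "to_date(curr.date) = to_date(ooh.date)"

query = (
    f"SELECT {broadcast_hint} DISTINCT col1, col2, col3 "
    "FROM ool "
    "INNER JOIN ooh ON ool.header_key = ooh.header_key "
    "AND ool.org_key = ooh.org_key "
    "INNER JOIN curr ON curr.from_currency = ooh.transactional_curr_code "
    f"AND {date_condition} "
    # ... remaining joins exactly as in the question ...
)

# spark.sql(query)  # run against a SparkSession; not executed in this sketch
```

Beyond hints, it may also help to repartition (or bucket at write time) both ooh and ool on the shared join keys (header_key, org_key) so the one unavoidable big-to-big join shuffles each side only once, and to cache the joined fact table if it is reused.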

Related

Mysql slow multiple join query - version 5.7

I have the following query:
SELECT *
FROM `ResearchEntity` AS `ResearchEntity`
LEFT OUTER JOIN `UserEntity` AS `createdBy` ON `ResearchEntity`.`createdById` = `createdBy`.`id`
LEFT OUTER JOIN `Item1Entity` AS `item1` ON `ResearchEntity`.`id` = `Item1`.`researchId`
LEFT OUTER JOIN `Item2Entity` AS `item2` ON `ResearchEntity`.`id` = `Item2`.`researchId`
LEFT OUTER JOIN `Item3Entity` AS `item3` ON `ResearchEntity`.`id` = `Item3`.`researchId`
LEFT OUTER JOIN `Item4Entity` AS `item4` ON `ResearchEntity`.`id` = `Item4`.`researchId`
LEFT OUTER JOIN `Item5Entity` AS `item5` ON `ResearchEntity`.`id` = `Item5`.`researchId`
LEFT OUTER JOIN `Item6Entity` AS `item6` ON `ResearchEntity`.`id` = `Item6`.`researchId`
LEFT OUTER JOIN `Item7Entity` AS `item7` ON `ResearchEntity`.`id` = `Item7`.`researchId`
LEFT OUTER JOIN `Item8Entity` AS `item8` ON `ResearchEntity`.`id` = `Item8`.`researchId`
LEFT OUTER JOIN `Item9Entity` AS `item9` ON `ResearchEntity`.`id` = `Item9`.`researchId`
LEFT OUTER JOIN `Item10Entity` AS `item10` ON `ResearchEntity`.`id` = `Item10`.`researchId`
ORDER BY `ResearchEntity`.`id` DESC
LIMIT 20, 40;
UserEntity has 4 records
Item*Entity each has 15 records
When I remove the first join (the UserEntity join), the query runs very fast, but with it, it takes 50 seconds.
The query is built dynamically at runtime using an ORM.
Why is UserEntity causing this much trouble?
Thanks
Try changing your SELECT * to select specific fields from each entity; the load of your query will be reduced and the run time should be faster. Something like:
SELECT ResearchEntity.field1 as ReField1, item1.name as Item1Name, ...
-- Edit
Also, try to keep the smaller tables to the left of the JOIN statements.
Ensure the joining fields are indexed and that they are INT.
If you are sure that one table is taking too much time, consider fetching the results of that table separately in your code.
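Another lever worth trying, sketched here with SQLite in-memory tables and made-up data: if the intent is to page over ResearchEntity rows (as ORM pagination usually intends), apply ORDER BY/LIMIT to ResearchEntity in a derived table first, then join, so the server only joins the page instead of the full fan-out. Note this changes semantics if any join multiplies rows, since the original query's LIMIT applies to the joined result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ResearchEntity (id INTEGER PRIMARY KEY, createdById INTEGER)")
cur.execute("CREATE TABLE UserEntity (id INTEGER PRIMARY KEY, name TEXT)")
# 4 users and 100 research rows, roughly matching the question's shape:
cur.executemany("INSERT INTO UserEntity VALUES (?, ?)",
                [(i, f"user{i}") for i in range(1, 5)])
cur.executemany("INSERT INTO ResearchEntity VALUES (?, ?)",
                [(i, (i % 4) + 1) for i in range(1, 101)])

# Page the driving table first, then join. MySQL's `LIMIT 20, 40`
# means OFFSET 20, 40 rows, mirrored here:
rows = cur.execute("""
    SELECT r.id, u.name
    FROM (SELECT * FROM ResearchEntity
          ORDER BY id DESC LIMIT 40 OFFSET 20) AS r
    LEFT JOIN UserEntity AS u ON r.createdById = u.id
    ORDER BY r.id DESC
""").fetchall()
print(len(rows))    # 40: only the requested page was joined
print(rows[0][0])   # 80: ids 100..81 were skipped by the offset
```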

Is there something wrong with this Hive Query?

I am new to Hive and want to understand what's wrong with this query:
df_tickets = hiveContext.sql("""
    select distinct oe.*, o.*, so.*
    from efms_gold.ms_bvoip_order_extension oe
    join efms_gold.ms_order o
      on oe.ms_order_id = o.ms_order_id
    join efms_gold.ms_sub_order so
      on so.ms_order_id = o.ms_order_id
    left outer join efms_gold.ms_job j
      on j.entity_id = so.ms_sub_order_id
    join efms_gold.ms_task t
      on t.wf_job_id = j.wf_job_id
    where t.name RLIKE 'Error|Correct|Create AOTS Ticket'
      and o.order_type = 900
      and o.entered_date between date_sub(current_date(),3) and date_sub(current_date(),2)
      and j.entity_type = 5
""")
I am getting the below error.
An error occurred while calling o68.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
HashAggregate(keys=[ms_bvoip_order_extension_id#0, ms_order_id#1,
related_ontime_id#2, related_billing_id#3, related_cnr_id#4, billable_ind#5, service_biller_code#6, mrs_ind#7,
initiate_separate_turnup_ind#8, entered_date#9, router_inst_pretest_date#10, site_surviv_ind#11,
main_contract_type#12, on_time_ind#13, porting_ind#14, nb_fmc_ind#15, wip_status_reason#16,
follow_up_date#17, wip_notes#18, sales_escalation_contact#19, sales_escalation_level#20, sales_escalation_office#21,
sales_escalation_email#22, ttu_mismatch_reason#23, ... 407 more fields]

How can I write this query using TypeORM createQueryBuilder?

I want to write this raw query using the TypeORM createQueryBuilder. How can I achieve this?
SELECT *
FROM Location l
INNER JOIN LocationType t ON l.TypeId = t.Id
LEFT JOIN Location p ON l.ParentId = p.Id
LEFT JOIN (SELECT locationId, SUM(Units) TotalUnits FROM ItemInventory GROUP BY LocationId) qty ON l.Id = qty.LocationId

Self-taught (syntax error I can't find): this query works in my system, but not embedded in Excel against the same database

New to this (very new, and self-teaching). I have a query that draws from multiple tables in my system and gets all the appraised values and sales values from a subdivision. In my system, the query runs fine, but when I try to run it embedded in an Excel sheet, I get an error saying there is no column name for 2 "c" and 3 "c". When I put punctuation around the column names, it says there is a syntax error with the alias "as c" at the bottom. I've been awake too long. What am I doing wrong?
select distinct pv.prop_id, ac.file_as_name,
'sale_type' , 'deed_date' , 'sale_date' , 'sale_type' , 'sale_price' ,
(pv.land_hstd_val + pv.land_non_hstd_val + pv.ag_market + pv.timber_market)as land_val,
(pv.imprv_hstd_val + pv.imprv_non_hstd_val)as imprv_val,
pv.market, pv.abs_subdv_cd
from property_val pv with (nolock)
inner join prop_supp_assoc psa with (nolock) on
pv.prop_id = psa.prop_id
and pv.prop_val_yr = psa.owner_tax_yr
and pv.sup_num = psa.sup_num
inner join property p with (nolock) on
pv.prop_id = p.prop_id
inner join owner o with (nolock) on
pv.prop_id = o.prop_id
and pv.prop_val_yr = o.owner_tax_yr
and pv.sup_num = o.sup_num
inner join account ac with (nolock) on
o.owner_id = ac.acct_id
left outer join
(select cop.prop_id,
convert(varchar(20), co.deed_dt, 101)as deed_date,
convert(varchar(20), s.sl_dt, 101)as sale_date,
s.sl_price as sale_price, s.sl_type_cd as sale_type
from chg_of_owner_prop_assoc cop with (nolock)
inner join chg_of_owner co with (nolock) on
co.chg_of_owner_id = cop.chg_of_owner_id
inner join sale s with (nolock) on
co.chg_of_owner_id = s.chg_of_owner_id
where cop.seq_num = 0) as c
on c.prop_id = pv.prop_id
where pv.prop_val_yr = 2016
and(pv.prop_inactive_dt is null or udi_parent ='t')
and pv.abs_subdv_cd in('s3579')
order by pv.abs_subdv_cd, pv.prop_id
Is it SQL Server? Try surrounding column names with square brackets instead of quotes.
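To illustrate the distinction (a sketch using SQLite, which accepts SQL Server-style bracket quoting, with a made-up table): square brackets quote an identifier, while single quotes always produce a string literal. The question's SELECT list quotes 'sale_type', 'deed_date', etc. in single quotes, so it selects the same literal text on every row instead of the columns from the derived table c, which is a likely source of the "no column name" complaints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# A column whose name needs quoting (contains a space):
cur.execute("CREATE TABLE sales ([sale type] TEXT, [sale price] INTEGER)")
cur.execute("INSERT INTO sales VALUES (?, ?)", ("warranty deed", 100000))

# Brackets -> the column's value; single quotes -> the literal string.
row = cur.execute("SELECT [sale type], 'sale type' FROM sales").fetchone()
print(row)  # ('warranty deed', 'sale type')
```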

Inner join on more than one field

I need to do a SELECT with an inner join that relates the tables on more than one field.
Example:
DataSet dt = new Select().From(SubConta.Schema)
.InnerJoin(PlanoContabilSubConta.EmpSubContaColumn, SubConta.CodEmpColumn)
.InnerJoin(PlanoContabilSubConta.FilSubContaColumn, SubConta.CodFilColumn)
.InnerJoin(PlanoContabilSubConta.SubContaColumn, SubConta.TradutorColumn)
.Where(PlanoContabilSubConta.Columns.EmpContabil).IsEqualTo(cEmp)
.And(PlanoContabilSubConta.Columns.FilContabil).IsEqualTo(cFil)
.And(PlanoContabilSubConta.Columns.Conta).IsEqualTo(cTrad)
.ExecuteDataSet();
But the generated SQL is wrong:
exec sp_executesql N'/* GetDataSet() */ SELECT [dbo].[SubContas].[CodEmp], [dbo].[SubContas].[CodFil], [dbo].[SubContas].[Tradutor], [dbo].[SubContas].[Descricao], [dbo].[SubContas].[Inativa], [dbo].[SubContas].[DataImplantacao]
FROM [dbo].[SubContas]
INNER JOIN [dbo].[PlanoContabilSubContas] ON [dbo].[SubContas].[CodEmp] = [dbo].[PlanoContabilSubContas].[EmpSubConta]
INNER JOIN [dbo].[PlanoContabilSubContas] ON [dbo].[SubContas].[CodFil] = [dbo].[PlanoContabilSubContas].[FilSubConta]
INNER JOIN [dbo].[PlanoContabilSubContas] ON [dbo].[SubContas].[Tradutor] = [dbo].[PlanoContabilSubContas].[SubConta]
WHERE EmpContabil = #EmpContabil0
AND FilContabil = #FilContabil1
AND Conta = #Conta2
',N'#EmpContabil0 varchar(1),#FilContabil1 varchar(1),#Conta2 varchar(1)',#EmpContabil0='1',#FilContabil1='1',#Conta2='1'
What should be done to generate this SQL?
exec sp_executesql N'/* GetDataSet() */ SELECT [dbo].[SubContas].[CodEmp], [dbo].[SubContas].[CodFil], [dbo].[SubContas].[Tradutor], [dbo].[SubContas].[Descricao], [dbo].[SubContas].[Inativa], [dbo].[SubContas].[DataImplantacao]
FROM [dbo].[SubContas]
INNER JOIN [dbo].[PlanoContabilSubContas] ON [dbo].[SubContas].[CodEmp] = [dbo].[PlanoContabilSubContas].[EmpSubConta] AND
[dbo].[SubContas].[CodFil] = [dbo].[PlanoContabilSubContas].[FilSubConta] AND
[dbo].[SubContas].[Tradutor] = [dbo].[PlanoContabilSubContas].[SubConta]
WHERE EmpContabil = #EmpContabil0
AND FilContabil = #FilContabil1
AND Conta = #Conta2
',N'#EmpContabil0 varchar(1),#FilContabil1 varchar(1),#Conta2 varchar(1)',#EmpContabil0='1',#FilContabil1='1',#Conta2='1'
It looks like you can't join on multiple columns in SubSonic 2. You could create a SQL view that does the join you want and then have SubSonic generate a model for that view.
