I am new to Hive and want to understand what is wrong with this query:
df_tickets = hiveContext.sql("""select distinct oe.*,o.*,so.*
from
efms_gold.ms_bvoip_order_extension oe
join efms_gold.ms_order o on oe.ms_order_id = o.ms_order_id
join efms_gold.ms_sub_order so on so.ms_order_id = o.ms_order_id
left outer join efms_gold.ms_job j on j.entity_id = so.ms_sub_order_id
join efms_gold.ms_task t on t.wf_job_id = j.wf_job_id
where t.name RLIKE 'Error|Correct|Create AOTS Ticket'
and o.order_type = 900
and o.entered_date between date_sub(current_date(),3) and date_sub(current_date(),2)
and j.entity_type = 5
""")
I am getting the below error.
An error occurred while calling o68.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
HashAggregate(keys=[ms_bvoip_order_extension_id#0, ms_order_id#1,
related_ontime_id#2, related_billing_id#3, related_cnr_id#4, billable_ind#5, service_biller_code#6, mrs_ind#7,
initiate_separate_turnup_ind#8, entered_date#9, router_inst_pretest_date#10, site_surviv_ind#11,
main_contract_type#12, on_time_ind#13, porting_ind#14, nb_fmc_ind#15, wip_status_reason#16,
follow_up_date#17, wip_notes#18, sales_escalation_contact#19, sales_escalation_level#20, sales_escalation_office#21,
sales_escalation_email#22, ttu_mismatch_reason#23, ... 407 more fields]
Related
In Databricks SQL, while executing SQL with the NOT EXISTS operator (using a correlated subquery), it is not working. I am getting the Databricks error SparkUnsupportedOperationException: [INTERNAL_ERROR] Cannot generate code for expression: outer.
Below is the SQL query:
SELECT in_cs.COMM_ID AS CUSTOMER_SERVICE_EPIC_ID,
Data.CUR_VALUE_DATETIME AS VALUE_INSTANT
FROM hive_metastore.RAW_CLARITY.SMRTDTA_ELEM_DATA Data
INNER JOIN hive_metastore.RAW_CLARITY.SMRTDTA_ELEM_VALUE Value
ON Data.HLV_ID = Value.HLV_ID
INNER JOIN hive_metastore.RAW_CLARITY.CLARITY_CONCEPT SmartDataElement
ON Data.ELEMENT_ID = SmartDataElement.CONCEPT_ID
INNER JOIN hive_metastore.RAW_CLARITY.CUST_SERVICE in_cs
ON Data.RECORD_ID_NUMERIC = in_cs.COMM_ID AND NOT EXISTS
( SELECT 1 FROM hive_metastore.RAW_CLARITY.CUST_SERVICE AS cs
LEFT JOIN hive_metastore.RAW_CLARITY.CAL_REFERENCE_CRM AS crc
ON cs.COMM_ID = crc.REF_CRM_ID
LEFT JOIN hive_metastore.RAW_CLARITY.CAL_COMM_TRACKING AS cct
ON crc.COMM_ID = cct.COMM_ID
WHERE cct.COMM_ID IS NULL AND in_cs.COMM_ID = cs.COMM_ID)
I have to execute the below query using Spark. How can I optimize the join?
Each data frame has records in the millions.
SELECT DISTINCT col1,
col2,
col3...
FROM ool
INNER JOIN ooh ON ool.header_key = ooh.header_key
AND ool.org_key = ooh.org_key
INNER JOIN msib ON ool.inventory_item_key = msib.inventory_item_key
AND ool.ship_from_org_key = msib.o_key
INNER JOIN curr ON curr.from_currency = ooh.transactional_curr_code
AND date_format(curr.date, 'yyyy-MM-dd') = date_format(ooh.date, 'yyyy-MM-dd')
INNER JOIN mtl_parameters mp ON ool.ship_from_org_key = mp.o_key
INNER JOIN ood ON ood.o_key = mp.o_key
INNER JOIN ot ON ooh.order_type_key = ot.transaction_type_key
INNER JOIN csu ON ool.ship_to_org_key = csu.site_use_key
INNER JOIN csa ON csu.site_key = csa._site_key
INNER JOIN ps ON csa.party_key = ps.party_key
INNER JOIN hca ON csa.account_key = hca.account_key
INNER JOIN hp ON hca.party_key = hp.party_key
INNER JOIN hl ON ps.location_key = hl.location_key
INNER JOIN csu1 ON ool.invoice_to_key = csu1.use_key
INNER JOIN csa1 ON ool.invoice_to_key = csu1.use_key
AND csu1.cust_acctkey = csa1.custkey
INNER JOIN ps1 ON csa1.party_key = ps1.party_key
INNER JOIN hca1 ON csa1.cust_key = hca1.cust_key
INNER JOIN hp1 ON hca1.party_key = hp1.party_key
INNER JOIN hl1 ON ps1.loc_key = hl1.loc_key
INNER JOIN hou ON ool.or_key = hou.o_key
How can I optimize this join in PySpark?
ooh and ool are the driver dataframes, and their record counts will be in the hundreds of millions.
New to this (very new, and self-teaching). I have a query that draws from multiple tables on my computer system and pulls all the appraised values and sales values from a subdivision. In my system the query runs fine, but when I try to convert it to run embedded in an Excel sheet, it gives me an error saying there is no column name for 2 c and 3 c. When I put quotes around the column names, it says there is a syntax error with the alias "as c" at the bottom. Been awake too long. What am I doing wrong?
select distinct pv.prop_id, ac.file_as_name,
'sale_type' , 'deed_date' , 'sale_date' , 'sale_type' , 'sale_price' ,
(pv.land_hstd_val + pv.land_non_hstd_val + pv.ag_market + pv.timber_market)as land_val,
(pv.imprv_hstd_val + pv.imprv_non_hstd_val)as imprv_val,
pv.market, pv.abs_subdv_cd
from property_val pv with (nolock)
inner join prop_supp_assoc psa with (nolock) on
pv.prop_id = psa.prop_id
and pv.prop_val_yr = psa.owner_tax_yr
and pv.sup_num = psa.sup_num
inner join property p with (nolock) on
pv.prop_id = p.prop_id
inner join owner o with (nolock) on
pv.prop_id = o.prop_id
and pv.prop_val_yr = o.owner_tax_yr
and pv.sup_num = o.sup_num
inner join account ac with (nolock) on
o.owner_id = ac.acct_id
left outer join
(select cop.prop_id,
convert(varchar(20), co.deed_dt, 101)as deed_date,
convert(varchar(20), s.sl_dt, 101)as sale_date,
s.sl_price as sale_price, s.sl_type_cd as sale_type
from chg_of_owner_prop_assoc cop with (nolock)
inner join chg_of_owner co with (nolock) on
co.chg_of_owner_id = cop.chg_of_owner_id
inner join sale s with (nolock) on
co.chg_of_owner_id = s.chg_of_owner_id
where cop.seq_num = 0) as c
on c.prop_id = pv.prop_id
where pv.prop_val_yr = 2016
and(pv.prop_inactive_dt is null or udi_parent ='t')
and pv.abs_subdv_cd in('s3579')
order by pv.abs_subdv_cd, pv.prop_id
Is it SQL Server? Try surrounding column names with square brackets instead of quotes.
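In T-SQL, single quotes make a string literal (which has no column name), while square brackets make a quoted identifier that actually refers to the column. A minimal standalone sketch of the difference (the derived table and its values here are made up for illustration):
-- 'deed_date' is just the text deed_date and comes back as an unnamed column of constants;
-- [deed_date] (or c.deed_date) refers to the deed_date column of the derived table c.
select c.[deed_date], c.[sale_date], c.[sale_price], c.[sale_type]
from (select getdate() as deed_date, getdate() as sale_date,
             0 as sale_price, 'WD' as sale_type) as c;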
I use Spark 2.0.
I'd like to execute the following SQL query:
val sqlText = """
select
f.ID as TID,
f.BldgID as TBldgID,
f.LeaseID as TLeaseID,
f.Period as TPeriod,
coalesce(
(select
f ChargeAmt
from
Fact_CMCharges f
where
f.BldgID = Fact_CMCharges.BldgID
limit 1),
0) as TChargeAmt1,
f.ChargeAmt as TChargeAmt2,
l.EFFDATE as TBreakDate
from
Fact_CMCharges f
join
CMRECC l on l.BLDGID = f.BldgID and l.LEASID = f.LeaseID and l.INCCAT = f.IncomeCat and date_format(l.EFFDATE,'D')<>1 and f.Period=EFFDateInt(l.EFFDATE)
where
f.ActualProjected = 'Lease'
except(
select * from TT1 t2 left semi join Fact_CMCharges f2 on t2.TID=f2.ID)
"""
val query = spark.sql(sqlText)
query.show()
It seems that the inner statement in coalesce gives the following error:
pyspark.sql.utils.AnalysisException: u'Correlated scalar subqueries must be Aggregated: GlobalLimit 1\n+- LocalLimit 1\n
What's wrong with the query?
You have to make sure that your sub-query, by definition (and not by data), only returns a single row. Otherwise the Spark analyzer complains while parsing the SQL statement.
So when Catalyst can't be 100% sure, just by looking at the SQL statement (without looking at your data), that the sub-query returns only a single row, this exception is thrown.
If you are sure that your subquery only returns a single row, you can use one of the following standard aggregation functions so that the Spark analyzer is happy (a sketch follows the list):
first
avg
max
min
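For example, the failing scalar subquery from the question could be wrapped in first so that the analyzer can prove it yields one row. A hedged sketch, trimmed to the relevant columns and reusing the question's Fact_CMCharges table (the inner alias i is new here, introduced to avoid shadowing the outer f):
select
  f.ID as TID,
  coalesce(
    (select first(i.ChargeAmt)
     from Fact_CMCharges i
     where i.BldgID = f.BldgID),
    0) as TChargeAmt1,
  f.ChargeAmt as TChargeAmt2
from Fact_CMCharges f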
This is my statement:
val Porders = sqlContext.sql(
"""SELECT count(STATUS_CD)
FROM s_order
WHERE STATUS_CD = 'pending' AND ROW_ID IN
( SELECT so.ROW_ID FROM s_order so
JOIN s_order_item soi
ON so.ROW_ID = soi.ORDER_ID
JOIN s_order_type sot
ON so.ORDER_TYPE_ID = sot.ROW_ID
JOIN s_product sp
ON soi.PROD_ID = sp.ROW_ID
WHERE (sp.NAME like '%VIP%' OR sp.NAME like '%BIZ%' OR sp.NAME like '%UniFi%')
AND LOWER(sot.NAME) = 'new install')
""")
I receive the following error:
ERROR : java.lang.RuntimeException: [3.3] failure: identifier expected
( SELECT so.ROW_ID FROM s_order so JOIN s_order_item soi
^
What could be the reason?
The reason why this happens is that such subqueries are not supported: see SPARK-4226.
Even a query like
sqlContext.sql(
"""SELECT count(STATUS_CD)
FROM s_order
WHERE STATUS_CD = 'pending' AND ROW_ID IN
(SELECT * FROM s_order)
""")
does not currently work (as of Spark SQL 1.5.1).
Try to replace your subquery with a join, e.g. as described in https://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html
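For this particular statement, the IN subquery can be folded into the outer query as plain joins. A sketch reusing the question's tables; count(DISTINCT so.ROW_ID) stands in for the original count because the joins can produce several rows per order, whereas IN matched each order at most once (this assumes ROW_ID is unique in s_order):
SELECT count(DISTINCT so.ROW_ID)
FROM s_order so
JOIN s_order_item soi ON so.ROW_ID = soi.ORDER_ID
JOIN s_order_type sot ON so.ORDER_TYPE_ID = sot.ROW_ID
JOIN s_product sp ON soi.PROD_ID = sp.ROW_ID
WHERE so.STATUS_CD = 'pending'
  AND (sp.NAME LIKE '%VIP%' OR sp.NAME LIKE '%BIZ%' OR sp.NAME LIKE '%UniFi%')
  AND LOWER(sot.NAME) = 'new install'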