I am new to Hive and want to understand what is wrong with this query:
df_tickets = hiveContext.sql("""select distinct oe.*,o.*,so.*
from
efms_gold.ms_bvoip_order_extension oe
join efms_gold.ms_order o on oe.ms_order_id = o.ms_order_id
join efms_gold.ms_sub_order so on so.ms_order_id = o.ms_order_id
left outer join efms_gold.ms_job j on j.entity_id = so.ms_sub_order_id
join efms_gold.ms_task t on t.wf_job_id = j.wf_job_id
where t.name RLIKE 'Error|Correct|Create AOTS Ticket'
and o.order_type = 900
and o.entered_date between date_sub(current_date(),3) and date_sub(current_date(),2)
and j.entity_type = 5
""")
I am getting the below error.
An error occurred while calling o68.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
HashAggregate(keys=[ms_bvoip_order_extension_id#0, ms_order_id#1,
related_ontime_id#2, related_billing_id#3, related_cnr_id#4, billable_ind#5, service_biller_code#6, mrs_ind#7,
initiate_separate_turnup_ind#8, entered_date#9, router_inst_pretest_date#10, site_surviv_ind#11,
main_contract_type#12, on_time_ind#13, porting_ind#14, nb_fmc_ind#15, wip_status_reason#16,
follow_up_date#17, wip_notes#18, sales_escalation_contact#19, sales_escalation_level#20, sales_escalation_office#21,
sales_escalation_email#22, ttu_mismatch_reason#23, ... 407 more fields]
Related
In Databricks SQL, while executing SQL with the NOT EXISTS operator (using a correlated subquery), it is not working. I am getting the Databricks error SparkUnsupportedOperationException: [INTERNAL_ERROR] Cannot generate code for expression: outer.
Below is the SQL query:
SELECT in_cs.COMM_ID AS CUSTOMER_SERVICE_EPIC_ID,
Data.CUR_VALUE_DATETIME AS VALUE_INSTANT
FROM hive_metastore.RAW_CLARITY.SMRTDTA_ELEM_DATA Data
INNER JOIN hive_metastore.RAW_CLARITY.SMRTDTA_ELEM_VALUE Value
ON Data.HLV_ID = Value.HLV_ID
INNER JOIN hive_metastore.RAW_CLARITY.CLARITY_CONCEPT SmartDataElement
ON Data.ELEMENT_ID = SmartDataElement.CONCEPT_ID
INNER JOIN hive_metastore.RAW_CLARITY.CUST_SERVICE in_cs
ON Data.RECORD_ID_NUMERIC = in_cs.COMM_ID AND NOT EXISTS
( SELECT 1 FROM hive_metastore.RAW_CLARITY.CUST_SERVICE AS cs
LEFT JOIN hive_metastore.RAW_CLARITY.CAL_REFERENCE_CRM AS crc
ON cs.COMM_ID = crc.REF_CRM_ID
LEFT JOIN hive_metastore.RAW_CLARITY.CAL_COMM_TRACKING AS cct
ON crc.COMM_ID = cct.COMM_ID
WHERE cct.COMM_ID IS NULL AND in_cs.COMM_ID = cs.COMM_ID)
I have to execute the below query using Spark. How can I optimize the join?
Each data frame has records in the millions.
SELECT DISTINCT col1,
col2,
col3...
FROM ool
INNER JOIN ooh ON ool.header_key = ooh.header_key
AND ool.org_key = ooh.org_key
INNER JOIN msib ON ool.inventory_item_key = msib.inventory_item_key
AND ool.ship_from_org_key = msib.o_key
INNER JOIN curr ON curr.from_currency = ooh.transactional_curr_code
AND date_format(curr.date, 'yyyy-MM-dd') = date_format(ooh.date, 'yyyy-MM-dd')
INNER JOIN mtl_parameters mp ON ool.ship_from_org_key = mp.o_key
INNER JOIN ood ON ood.o_key = mp.o_key
INNER JOIN ot ON ooh.order_type_key = ot.transaction_type_key
INNER JOIN csu ON ool.ship_to_org_key = csu.site_use_key
INNER JOIN csa ON csu.site_key = csa._site_key
INNER JOIN ps ON csa.party_key = ps.party_key
INNER JOIN hca ON csa.account_key = hca.account_key
INNER JOIN hp ON hca.party_key = hp.party_key
INNER JOIN hl ON ps.location_key = hl.location_key
INNER JOIN csu1 ON ool.invoice_to_key = csu1.use_key
INNER JOIN csa1 ON ool.invoice_to_key = csu1.use_key
AND csu1.cust_acctkey = csa1.custkey
INNER JOIN ps1 ON csa1.party_key = ps1.party_key
INNER JOIN hca1 ON csa1.cust_key = hca1.cust_key
INNER JOIN hp1 ON hca1.party_key = hp1.party_key
INNER JOIN hl1 ON ps1.loc_key = hl1.loc_key
INNER JOIN hou ON ool.or_key = hou.o_key
How can I optimize this join in PySpark?
ooh and ool are the driver dataframes, and their record counts will be in the hundreds of millions.
New to this (very new, and self-teaching). I have a query that draws from multiple tables on my computer system and pulls all the appraised values and sales values from a subdivision. In my system the query runs fine, but when I try to convert it to run embedded in an Excel sheet, it gives me an error saying there is no column name for 2 c and 3 c. When I put quotes around the column names, it says there is a syntax error with the alias "as c" at the bottom. Been awake too long. What am I doing wrong?
select distinct pv.prop_id, ac.file_as_name,
'sale_type' , 'deed_date' , 'sale_date' , 'sale_type' , 'sale_price' ,
(pv.land_hstd_val + pv.land_non_hstd_val + pv.ag_market + pv.timber_market)as land_val,
(pv.imprv_hstd_val + pv.imprv_non_hstd_val)as imprv_val,
pv.market, pv.abs_subdv_cd
from property_val pv with (nolock)
inner join prop_supp_assoc psa with (nolock) on
pv.prop_id = psa.prop_id
and pv.prop_val_yr = psa.owner_tax_yr
and pv.sup_num = psa.sup_num
inner join property p with (nolock) on
pv.prop_id = p.prop_id
inner join owner o with (nolock) on
pv.prop_id = o.prop_id
and pv.prop_val_yr = o.owner_tax_yr
and pv.sup_num = o.sup_num
inner join account ac with (nolock) on
o.owner_id = ac.acct_id
left outer join
(select cop.prop_id,
convert(varchar(20), co.deed_dt, 101)as deed_date,
convert(varchar(20), s.sl_dt, 101)as sale_date,
s.sl_price as sale_price, s.sl_type_cd as sale_type
from chg_of_owner_prop_assoc cop with (nolock)
inner join chg_of_owner co with (nolock) on
co.chg_of_owner_id = cop.chg_of_owner_id
inner join sale s with (nolock) on
co.chg_of_owner_id = s.chg_of_owner_id
where cop.seq_num = 0) as c
on c.prop_id = pv.prop_id
where pv.prop_val_yr = 2016
and(pv.prop_inactive_dt is null or udi_parent ='t')
and pv.abs_subdv_cd in('s3579')
order by pv.abs_subdv_cd, pv.prop_id
Is it SQL Server? Try surrounding column names with square brackets instead of quotes.
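In T-SQL, single quotes make a string literal (which has no column name), while square brackets make a quoted identifier that actually refers to the column. A minimal standalone sketch of the difference (the derived table and its values here are made up for illustration):
-- 'deed_date' is just the text deed_date and comes back as an unnamed column of constants;
-- [deed_date] (or c.deed_date) refers to the deed_date column of the derived table c.
select c.[deed_date], c.[sale_date], c.[sale_price], c.[sale_type]
from (select getdate() as deed_date, getdate() as sale_date,
             0 as sale_price, 'WD' as sale_type) as c;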
I use Spark 2.0.
I'd like to execute the following SQL query:
val sqlText = """
select
f.ID as TID,
f.BldgID as TBldgID,
f.LeaseID as TLeaseID,
f.Period as TPeriod,
coalesce(
(select
f ChargeAmt
from
Fact_CMCharges f
where
f.BldgID = Fact_CMCharges.BldgID
limit 1),
0) as TChargeAmt1,
f.ChargeAmt as TChargeAmt2,
l.EFFDATE as TBreakDate
from
Fact_CMCharges f
join
CMRECC l on l.BLDGID = f.BldgID and l.LEASID = f.LeaseID and l.INCCAT = f.IncomeCat and date_format(l.EFFDATE,'D')<>1 and f.Period=EFFDateInt(l.EFFDATE)
where
f.ActualProjected = 'Lease'
except(
select * from TT1 t2 left semi join Fact_CMCharges f2 on t2.TID=f2.ID)
"""
val query = spark.sql(sqlText)
query.show()
It seems that the inner statement in coalesce gives the following error:
pyspark.sql.utils.AnalysisException: u'Correlated scalar subqueries must be Aggregated: GlobalLimit 1\n+- LocalLimit 1\n
What's wrong with the query?
You have to make sure that your sub-query, by definition (and not by data), only returns a single row. Otherwise the Spark analyzer complains while parsing the SQL statement.
So when Catalyst can't be 100% sure, just by looking at the SQL statement (without looking at your data), that the sub-query returns only a single row, this exception is thrown.
If you are sure that your subquery only returns a single row, you can use one of the following standard aggregation functions so that the Spark analyzer is happy (a sketch follows the list):
first
avg
max
min
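For example, the failing scalar subquery from the question could be wrapped in first so that the analyzer can prove it yields one row. A hedged sketch, trimmed to the relevant columns and reusing the question's Fact_CMCharges table (the inner alias i is new here, introduced to avoid shadowing the outer f):
select
  f.ID as TID,
  coalesce(
    (select first(i.ChargeAmt)
     from Fact_CMCharges i
     where i.BldgID = f.BldgID),
    0) as TChargeAmt1,
  f.ChargeAmt as TChargeAmt2
from Fact_CMCharges f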
This is my statement:
val Porders = sqlContext.sql(
"""SELECT count(STATUS_CD)
FROM s_order
WHERE STATUS_CD = 'pending' AND ROW_ID IN
( SELECT so.ROW_ID FROM s_order so
JOIN s_order_item soi
ON so.ROW_ID = soi.ORDER_ID
JOIN s_order_type sot
ON so.ORDER_TYPE_ID = sot.ROW_ID
JOIN s_product sp
ON soi.PROD_ID = sp.ROW_ID
WHERE (sp.NAME like '%VIP%' OR sp.NAME like '%BIZ%' OR sp.NAME like '%UniFi%')
AND LOWER(sot.NAME) = 'new install')
""")
I receive the following error:
ERROR : java.lang.RuntimeException: [3.3] failure: identifier expected
( SELECT so.ROW_ID FROM s_order so JOIN s_order_item soi
^
What could be the reason?
The reason why this happens is that such subqueries are not supported: see SPARK-4226.
Even a query like
sqlContext.sql(
"""SELECT count(STATUS_CD)
FROM s_order
WHERE STATUS_CD = 'pending' AND ROW_ID IN
(SELECT * FROM s_order)
""")
does not currently work (as of Spark SQL 1.5.1).
Try to replace your subquery with a join, e.g. as described in https://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html
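For this particular statement, the IN subquery can be folded into the outer query as plain joins. A sketch reusing the question's tables; count(DISTINCT so.ROW_ID) stands in for the original count because the joins can produce several rows per order, whereas IN matched each order at most once (this assumes ROW_ID is unique in s_order):
SELECT count(DISTINCT so.ROW_ID)
FROM s_order so
JOIN s_order_item soi ON so.ROW_ID = soi.ORDER_ID
JOIN s_order_type sot ON so.ORDER_TYPE_ID = sot.ROW_ID
JOIN s_product sp ON soi.PROD_ID = sp.ROW_ID
WHERE so.STATUS_CD = 'pending'
  AND (sp.NAME LIKE '%VIP%' OR sp.NAME LIKE '%BIZ%' OR sp.NAME LIKE '%UniFi%')
  AND LOWER(sot.NAME) = 'new install'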