Unable to query complex SQL statements from Hive table using PySpark - apache-spark

Hi, I am trying to query a Hive table from the Spark context.
My code:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table('db.table_name')
bank.show()
Simple queries like this work fine, without any error, but when I try the query below:
query = """with table1 as ( select distinct a,b
from db_first.table_first
order by b )
--select * from table1 order by b
,c as ( select *
from db_first.table_two)
--select * from c
,d as ( select *
from c
where upper(e) = 'Y')
--select * from d
,f as ( select table1.b
,cast(regexp_extract(g,'(\\d+)-(A|B)-
(\\d+)(.*)',1) as Int) aid1
,regexp_extract(g,'(\\d+)-(A|B)-
(\\d+)(.*)',2) aid2
,cast(regexp_extract(g,'(\\d+)-(A|B)-
(\\d+)(.*)',3) as Int) aid3
,from_unixtime(cast(substr(lastdbupdatedts,1,10) as int),"yyyy-MM-dd
HH:mm:ss") lastupdts
,d.*
from d
left outer join table1
on d.hiba = table1.a)
select * from f order by b,aid1,aid2,aid3 limit 100"""
I get the error below. Please help.
ParseExceptionTraceback (most recent call last)
<ipython-input-27-cedb6fad210d> in <module>()
3 hive_context = HiveContext(sc)
4 #bank = hive_context.table("bdalab.test_prodapt_inv")
----> 5 bank = hive_context.table(first)
ParseException: u"\nmismatched input '*' expecting <EOF>(line 1, pos 7)\n\n== SQL ==\nselect *

You need to use the .sql method instead of the .table method when you are running a SQL query.
1. Using the .table method, provide a table name:
>>> hive_context.table("<db_name>.<table_name>").show()
2. Using the .sql method, provide the full query, including the CTE expression:
>>> first = "with cte..."
>>> hive_context.sql(first).show()
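Putting it together for the case above (a minimal sketch; query is the CTE string defined in the question):
from pyspark.sql import HiveContext

hive_context = HiveContext(sc)  # sc is the existing SparkContext

# .table() takes only a table name
bank = hive_context.table('db.table_name')
bank.show()

# .sql() takes a full SQL statement, including WITH (CTE) queries
result = hive_context.sql(query)  # query is the CTE string from the question
result.show()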

Related

EOF in multi-line string error in PySpark

I was running the following query in PySpark; the same SQL runs fine on Hive:
spark.sql(f"""
create table DEFAULT.TMP_TABLE as
select b.customer_id, prob
from
bi_prod.TMP_score_1st_day a,
(select customer_id, prob from BI_prod.Tbl_brand1_scoring where insert_date = 20230101
union all
select customer_id, prob from BI_prod.Tbl_brand2_scoring where insert_date = 20230101)  b
where a.customer_id = b.customer_id
""")
This produces the following error
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
I need to fix this error, but I can't figure out why it is occurring.
I recommend rewriting the code in a more Pythonic way with the DataFrame API:
from pyspark.sql.functions import col

# scores for the first day (table `a` in the original query)
scores = spark.table('bi_prod.TMP_score_1st_day').select('customer_id')

# the two scoring tables that the original query UNIONs together (table `b`)
brand1 = (
    spark.table('bi_prod.Tbl_brand1_scoring')
    .filter(col('insert_date') == 20230101)
    .select('customer_id', 'prob')
)
brand2 = (
    spark.table('bi_prod.Tbl_brand2_scoring')
    .filter(col('insert_date') == 20230101)
    .select('customer_id', 'prob')
)

df = scores.join(brand1.union(brand2), 'customer_id')
df.show(1, vertical=True)
Let me know how this works for you and if you still get the same error.

Execute SQL query in python pandas

How can we execute SQL like the following in Python pandas?
SQL:
select
    a.*,
    b.vol1 / sum(vol1) over (
        partition by a.sale, a.d_id, a.month, a.p_id
    ) vol_r,
    a.vol2 * b.vol1 / sum(b.vol1) over (
        partition by a.sale, a.d_id, a.month, a.p_id
    ) vol_t
from
    sales1 a
    left join sales2 b on a.sale = b.sale
        and a.d_id = b.d_id
        and a.month = b.month
        and a.p_id = b.p_id
I know one way is pandasql, but I get an error because the SQL involves PARTITION BY.
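A windowed sum over a partition can be emulated in plain pandas with groupby(...).transform('sum'). A minimal sketch, assuming sales1 and sales2 are pandas DataFrames with the columns used in the SQL:
import pandas as pd

# the PARTITION BY / join keys from the query
keys = ['sale', 'd_id', 'month', 'p_id']

# left join, as in the SQL; assumes vol1 lives only in sales2
# (rename first if both tables have a vol1 column)
m = sales1.merge(sales2[keys + ['vol1']], on=keys, how='left')

# sum(vol1) over (partition by sale, d_id, month, p_id)
part_sum = m.groupby(keys)['vol1'].transform('sum')

m['vol_r'] = m['vol1'] / part_sum
m['vol_t'] = m['vol2'] * m['vol1'] / part_sum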

Spark SQL passing variables - Synapse (Spark pool)

I have the following Spark SQL (Spark pool, Spark 3.0) code and I want to pass a variable to it. How can I do that? I tried the following:
#cell 1 (toggle parameter cell):
%%pyspark
stat = 'A'
#cell 2:
select * from silver.employee_dim where Status = '$stat'
When you're running your cell as PySpark, you can pass a variable to your query like this:
#cell 1 (toggle parameter cell):
%%pyspark
stat = 'A' # define variable
#cell 2:
%%pyspark
query = "select * from silver.employee_dim where Status='" + stat + "'"
spark.sql(query) # execute SQL
Since you're executing a SELECT statement, I assume you might want to load the result to a DataFrame:
sqlDf = spark.sql(query)
sqlDf.head(5) #select first 5 rows
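An f-string builds the same query a bit more readably (a minor variation, assuming the same stat variable from cell 1; Spark 3.0 pools run Python 3, where f-strings are available):
%%pyspark
query = f"select * from silver.employee_dim where Status = '{stat}'"
sqlDf = spark.sql(query)
sqlDf.head(5) # select first 5 rows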

PySpark DataFrame Code for an HiveQL that takes 3-4 hours

The following HiveQL code takes about 3 to 4 hours, and I am trying to convert it into equivalent PySpark DataFrame code. Any input from DataFrame experts is appreciated.
INSERT OVERWRITE TABLE dlstage.DIBQtyRank_C11 PARTITION (fiscalyearmonth)
SELECT * FROM (
    SELECT a.matnr, a.werks, a.periodstartdate, a.fiscalyear, a.fiscalmonth, b.dy_id, MaterialType,
        COALESCE(a.salk3, 0) salk3, COALESCE(a.lbkum, 0) lbkum,
        SUM(a.valuatedquantity) AS valuatedquantity,
        SUM(a.InventoryValue) AS InventoryValue,
        RANK() OVER (PARTITION BY dy_id, werks, matnr ORDER BY a.max_date DESC) rnk,
        SUM(stprs) stprs, MAX(peinh) peinh, fcurr, fiscalyearmonth
    FROM dlstage.DIBmsegFinal a
    LEFT JOIN dlaggr.dim_fiscalcalendar b ON a.periodstartdate = b.fmth_begin_dte
    WHERE a.max_date >= b.fmth_begin_dte AND a.max_date <= b.dy_id
        AND fiscalyearmonth = CONCAT(fyr_id, LPAD(fmth_nbr, 2, 0))
    GROUP BY a.matnr, a.werks, dy_id, max_date, a.periodstartdate, a.fiscalyear, a.fiscalmonth,
        MaterialType, fcurr, COALESCE(a.salk3, 0), COALESCE(a.lbkum, 0), fiscalyearmonth
) a
WHERE a.rnk = 1 AND a.fiscalyear = '%s'" %(year) + " AND a.fiscalmonth = '%s'" %(mnth)
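A rough DataFrame translation could look like the following (a minimal, untested sketch; year and mnth are the Python variables interpolated in the original script, and all column names are taken from the HiveQL above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

mseg = spark.table('dlstage.DIBmsegFinal').alias('a')
fcal = spark.table('dlaggr.dim_fiscalcalendar').alias('b')

# join and row-level filters from the WHERE clause
joined = (
    mseg.join(fcal, F.col('a.periodstartdate') == F.col('b.fmth_begin_dte'), 'left')
    .where(
        (F.col('a.max_date') >= F.col('b.fmth_begin_dte'))
        & (F.col('a.max_date') <= F.col('b.dy_id'))
        & (F.col('a.fiscalyearmonth') == F.concat(F.col('b.fyr_id'), F.lpad(F.col('b.fmth_nbr'), 2, '0')))
    )
)

# GROUP BY ..., with the two COALESCE expressions as grouping columns
grouped = (
    joined.groupBy(
        'a.matnr', 'a.werks', 'b.dy_id', 'a.max_date', 'a.periodstartdate',
        'a.fiscalyear', 'a.fiscalmonth', 'MaterialType', 'fcurr', 'a.fiscalyearmonth',
        F.coalesce('a.salk3', F.lit(0)).alias('salk3'),
        F.coalesce('a.lbkum', F.lit(0)).alias('lbkum'),
    )
    .agg(
        F.sum('a.valuatedquantity').alias('valuatedquantity'),
        F.sum('a.InventoryValue').alias('InventoryValue'),
        F.sum('stprs').alias('stprs'),
        F.max('peinh').alias('peinh'),
    )
)

# RANK() OVER (PARTITION BY dy_id, werks, matnr ORDER BY max_date DESC),
# then the outer filters, applied after ranking as in the original
w = Window.partitionBy('dy_id', 'werks', 'matnr').orderBy(F.col('max_date').desc())
result = (
    grouped.withColumn('rnk', F.rank().over(w))
    .where((F.col('rnk') == 1) & (F.col('fiscalyear') == year) & (F.col('fiscalmonth') == mnth))
)

# INSERT OVERWRITE ... PARTITION(fiscalyearmonth); note that saveAsTable with
# mode('overwrite') replaces the whole table, while insertInto matches the
# original INSERT OVERWRITE semantics on an existing partitioned table
(result.write.mode('overwrite')
    .partitionBy('fiscalyearmonth')
    .saveAsTable('dlstage.DIBQtyRank_C11'))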

SQL Oracle Sub-query

I am having an issue getting this subquery to run. I am using Toad Data Point (Oracle). I get a syntax error. I have tried several different ways with no luck. I am new to subqueries.
Select *
from FINC.VNDR_ITEM_M as M
where M.ACCT_DOC_NBR = A.ACCT_DOC_NBR
(SELECT A.CLIENT_ID,
A.SRC_SYS_ID,
A.CO_CD,
A.ACCT_NBR,
A.CLR_DT,
A.ASGN_NBR,
A.FISCAL_YR,
A.ACCT_DOC_NBR,
A.LINE_ITEM_NBR,
A.MFR_PART_NBR,
A.POST_DT,
A.DRCR_IND,
A.DOC_CRNCY_AMT,
A.CRNCY_CD,
A.BSL_DT
FROM FINC.VNDR_ITEM_F A
WHERE A.CLR_DT IN (SELECT MAX(B.CLR_DT)
FROM FINC.VNDR_ITEM_F AS B
where (B.ACCT_DOC_NBR = A.ACCT_DOC_NBR and B.FISCAL_YR=A.FISCAL_YR and B.LINE_ITEM_NBR = A.LINE_ITEM_NBR and B.SRC_SYS_ID =A.SRC_SYS_ID and B.POST_DT=A.POST_DT and B.CO_CD=A.CO_CD)
and (B.CO_CD >='1000' and B.CO_CD <= '3000' or B.CO_CD ='7090') and (B.POST_DT Between to_date ('08/01/2018','mm/dd/yyyy')
AND to_date ('08/31/2018', 'mm/dd/yyyy')) and (B.SRC_SYS_ID ='15399') and (B.FISCAL_YR ='2018'))
GROUP BY
A.CLIENT_ID,
A.SRC_SYS_ID,
A.CO_CD,
A.ACCT_NBR,
A.CLR_DT,
A.ASGN_NBR,
A.FISCAL_YR,
A.ACCT_DOC_NBR,
A.LINE_ITEM_NBR,
A.MFR_PART_NBR,
A.POST_DT,
A.DRCR_IND,
A.DOC_CRNCY_AMT,
A.CRNCY_CD,
A.BSL_DT)
Your syntax is broken: you put the subquery at the end, unattached. Reduced to its shape, your query looks like:
select *
from dual as m
where a.dummy = m.dummy
(select dummy from dual)
It is in the wrong place: not joined and not aliased. What you should probably do is:
select *
from dual m
join (select dummy from dual) a on a.dummy = m.dummy
You also have some redundant, unnecessary brackets, but that's a minor flaw. Full code (I cannot test it without access to the data):
select *
from FINC.VNDR_ITEM_M M
join (SELECT A.CLIENT_ID, A.SRC_SYS_ID, A.CO_CD, A.ACCT_NBR, A.CLR_DT, A.ASGN_NBR,
A.FISCAL_YR, A.ACCT_DOC_NBR, A.LINE_ITEM_NBR, A.MFR_PART_NBR, A.POST_DT,
A.DRCR_IND, A.DOC_CRNCY_AMT, A.CRNCY_CD, A.BSL_DT
FROM FINC.VNDR_ITEM_F A
WHERE A.CLR_DT IN (SELECT MAX(B.CLR_DT)
FROM FINC.VNDR_ITEM_F B
where B.ACCT_DOC_NBR = A.ACCT_DOC_NBR
and B.FISCAL_YR=A.FISCAL_YR
and B.LINE_ITEM_NBR = A.LINE_ITEM_NBR
and B.SRC_SYS_ID =A.SRC_SYS_ID
and B.POST_DT=A.POST_DT
and B.CO_CD=A.CO_CD
and (('1000'<=B.CO_CD and B.CO_CD<='3000') or B.CO_CD='7090')
and B.POST_DT Between to_date ('08/01/2018', 'mm/dd/yyyy')
AND to_date ('08/31/2018', 'mm/dd/yyyy')
and B.SRC_SYS_ID ='15399' and B.FISCAL_YR ='2018')
GROUP BY A.CLIENT_ID, A.SRC_SYS_ID, A.CO_CD, A.ACCT_NBR, A.CLR_DT, A.ASGN_NBR,
A.FISCAL_YR, A.ACCT_DOC_NBR, A.LINE_ITEM_NBR, A.MFR_PART_NBR, A.POST_DT,
A.DRCR_IND, A.DOC_CRNCY_AMT, A.CRNCY_CD, A.BSL_DT) A
on M.ACCT_DOC_NBR = A.ACCT_DOC_NBR and M.CO_CD=A.CO_CD;
You need to add an alias to the sub-select (a derived table in standard SQL). Note that Oracle does not accept AS before a table alias:
select *
from
( select .......
) dt
join ....
