EOF in multi-line string error in PySpark - apache-spark

I was running the following query in PySpark; the same SQL runs fine in Hive:
spark.sql(f"""
create table DEFAULT.TMP_TABLE as
select b.customer_id, prob
from
bi_prod.TMP_score_1st_day a,
(select customer_id, prob from BI_prod.Tbl_brand1_scoring where insert_date = 20230101
union all
select customer_id, prob from BI_prod.Tbl_brand2_scoring where insert_date = 20230101)  b
where a.customer_id = b.customer_id
""")
This produces the following error
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
I need to fix this error but can't figure out why it is occurring.

I recommend rewriting the query in a more Pythonic way with the DataFrame API, mirroring the original SQL:
from pyspark.sql.functions import col

# scores from the first-day table (only the join key is needed)
df1 = spark.table('bi_prod.TMP_score_1st_day').select('customer_id')
# union of the two brand scoring tables, filtered on the insert date
df2 = (
    spark.table('bi_prod.Tbl_brand1_scoring')
    .filter(col('insert_date') == 20230101)
    .select('customer_id', 'prob')
    .union(
        spark.table('bi_prod.Tbl_brand2_scoring')
        .filter(col('insert_date') == 20230101)
        .select('customer_id', 'prob')
    )
)
df = df1.join(df2, 'customer_id')
df.show(1, vertical=True)
Let me know how this works for you and if you still get the same error.
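If you also need the result saved as a table, as in the original create table ... as select, a minimal sketch using the table name from the question:
# persist the joined result as a managed table (overwrites if it already exists)
df.write.mode('overwrite').saveAsTable('DEFAULT.TMP_TABLE')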

Related

Using AVG in Spark with window function

I have the following SQL Query:
Select st.Value,
st.Id,
ntile(2) OVER (PARTITION BY St.Id, St.VarId ORDER By St.Sls),
AVG(St.Value) OVER (PARTITION BY St.Id, St.VarId ORDER By St.Sls, St.Date)
FROM table tb
INNER JOIN staging st on St.Id = tb.Id
I've tried to adapt this to Spark/PySpark using window functions; my code is below:
from pyspark.sql import functions as f
from pyspark.sql import Window

windowSpec_1 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls")
windowSpec_2 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls", "staging.Date")
df = table.join(
    staging,
    on=f.col("staging.Id") == f.col("table.Id"),
    how='inner'
).select(
    f.col("staging.Value"),
    f.ntile(2).over(windowSpec_1),
    f.avg("staging.Value").over(windowSpec_2)
)
However, I'm getting the following error:
pyspark.sql.utils.AnalysisException: Can't extract value from Value#42928: need struct type but got decimal(16,6)
How can I solve this problem? Is it necessary to group the data?
Maybe you forgot to assign an alias to staging:
df = table.join(
    staging.alias("staging"),
    on=f.col("staging.Id") == f.col("table.Id"),
    how='inner'
).select(
    f.col("staging.Value"),
    f.ntile(2).over(windowSpec_1),
    f.avg("staging.Value").over(windowSpec_2)
)
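If the table.Id reference also fails to resolve, aliasing the left-hand DataFrame in the same way may help. This is a sketch beyond the original answer, assuming table is the DataFrame variable from the question:
df = table.alias("table").join(
    staging.alias("staging"),
    on=f.col("staging.Id") == f.col("table.Id"),
    how='inner'
)  # keep the .select(...) from the answer above unchanged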

Databricks SQL, error when running update with join

I am trying to run an update on a Delta table and I am getting the following error.
Error in SQL statement: ParseException:
mismatched input 'from' expecting <EOF>(line 3, pos 0)
I really can't figure out why I get this error. Can anyone help me?
update eff
set eff.ACP = ia.COSTVALUE
from views.test_acp_effect eff
left join source_tables_db.ia_master_items ia
on eff.CODE = ia.IMM_CODE
where eff.DXN_Period = (
select td.MY_FISCAL_PERIOD_ABBR
from timedelta td
where current_date() between td.MIN_P_DATE
and td.MAX_P_DATE
)
and eff.CODE = source_tables_db.ia_master_items.IMM_CODE
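One way this is commonly rewritten, offered here only as a hedged sketch: Delta tables do not accept the UPDATE ... FROM join syntax, but the same update can usually be expressed with MERGE INTO. Subqueries are not allowed in the merge condition, so the fiscal period is resolved first. The table and column names are taken from the question, and the code assumes a Databricks notebook where spark is available:
# look up the current fiscal period once, outside the MERGE
period = spark.sql("""
    select td.MY_FISCAL_PERIOD_ABBR
    from timedelta td
    where current_date() between td.MIN_P_DATE and td.MAX_P_DATE
""").collect()[0][0]

# express the join-based update as a MERGE on the matching codes
spark.sql(f"""
    MERGE INTO views.test_acp_effect eff
    USING source_tables_db.ia_master_items ia
    ON eff.CODE = ia.IMM_CODE AND eff.DXN_Period = '{period}'
    WHEN MATCHED THEN UPDATE SET eff.ACP = ia.COSTVALUE
""")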

I'm getting an unexpected failed assertion error when joining Spark Dataframe - Found duplicate rewrite attributes

When I run the code below, I get the error java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes. Prior to updating our Databricks Runtime, this ran smoothly.
top10_df is a DataFrame whose unique keys are the columns in the list groups.
res_df is an aggregation of top10_df by those unique keys, with min and max dates.
Once res_df is created and persisted, it is joined back to top10_df on the unique keys in groups.
from pyspark.sql import functions as fn
from pyspark.sql import Window as w

groups = ['col1', 'col2', 'col3', 'col4']
min_date_created = fn.min('date_created').alias('min_date_created')
max_date_created = fn.max('date_created').alias('max_date_created')
res_df = (top10_df
          .groupBy(groups)
          .agg(min_date_created,
               max_date_created)
          )
res_df.persist()
print(res_df.count())
score_rank = fn.row_number().over(w.partitionBy(groups).orderBy(fn.desc('score')))
unique_issue_id = fn.row_number().over(w.orderBy(groups))
out_df = (top10_df.alias('t10')
          .join(res_df.alias('res'), groups, 'left')
          .where(fn.col('t10.date_created') == fn.col('res.max_date_created'))
          .drop(fn.col('t10.date_created'))
          .drop(fn.col('t10.date_updated'))
          .withColumn('score_rank', score_rank)
          .where(fn.col('score_rank') == 1)
          .drop('score_rank',
                'latest_revision_complete_hash',
                'latest_revision_durable_hash')
          .withColumn('unique_issue_id', unique_issue_id)
          .withColumnRenamed('res.id', 'resource_id')
          )
out_df.persist()
print(out_df.count())
Instead of:
out_df = (top10_df.alias('t10')
.join(res_df.alias('res'),groups,'left')
select and alias the columns of your right-hand-side DataFrame before the join, to disambiguate the duplicate attributes:
out_df = (
    top10_df.alias('t10')
    .join(
        res_df.alias('res').select(
            *[fn.col(c) for c in groups],  # the join keys themselves
            fn.col('min_date_created').alias('min_date_created'),
            fn.col('max_date_created').alias('max_date_created')
        ),
        groups,
        'left'
    )
    # ... the rest of the chain stays as before
)
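If the assertion persists after the alias and select, another workaround sometimes used for this error (not part of the original answer, and only a hedged sketch) is to rebuild the aggregated DataFrame so it carries fresh attribute IDs before being joined back:
# recreate res_df from its RDD and schema, giving it new expression IDs
# and breaking the shared lineage with top10_df
res_df = spark.createDataFrame(res_df.rdd, res_df.schema)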

Unable to query complex SQL statements from a Hive table using PySpark

Hi, I am trying to query a Hive table from the Spark context.
My code:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table('select * from db.table_name')
bank.show()
Simple queries like this work fine, without any error.
But when I try the query below:
query = """with table1 as ( select distinct a,b
from db_first.table_first
order by b )
--select * from table1 order by b
,c as ( select *
from db_first.table_two)
--select * from c
,d as ( select *
from c
where upper(e) = 'Y')
--select * from d
,f as ( select table1.b
,cast(regexp_extract(g,'(\\d+)-(A|B)-(\\d+)(.*)',1) as Int) aid1
,regexp_extract(g,'(\\d+)-(A|B)-(\\d+)(.*)',2) aid2
,cast(regexp_extract(g,'(\\d+)-(A|B)-(\\d+)(.*)',3) as Int) aid3
,from_unixtime(cast(substr(lastdbupdatedts,1,10) as int),"yyyy-MM-dd HH:mm:ss") lastupdts
,d.*
from d
left outer join table1
on d.hiba = table1.a)
select * from f order by b,aid1,aid2,aid3 limit 100"""
I get the error below. Please help.
ParseExceptionTraceback (most recent call last)
<ipython-input-27-cedb6fad210d> in <module>()
3 hive_context = HiveContext(sc)
4 #bank = hive_context.table("bdalab.test_prodapt_inv")
----> 5 bank = hive_context.table(first)
ParseException: u"\nmismatched input '*' expecting <EOF>(line 1, pos 7)\n\n== SQL ==\nselect *
You need to use the .sql method instead of the .table method when you are running a SQL query.
1. Using the .table method, you need to provide a table name:
>>> hive_context.table("<db_name>.<table_name>").show()
2. Using the .sql method, provide your WITH (CTE) expression:
>>> first = "with cte..."
>>> hive_context.sql(first).show()
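Applied to the question above, a minimal sketch assuming the query string defined earlier:
# run the full WITH ... SELECT statement through .sql instead of .table
bank = hive_context.sql(query)
bank.show()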

"resolved attribute(s) missing" when performing join on pySpark

I have the following two PySpark DataFrames:
> df_lag_pre.columns
['date','sku','name','country','ccy_code','quantity','usd_price','usd_lag','lag_quantity']
> df_unmatched.columns
['alt_sku', 'alt_lag_quantity', 'country', 'ccy_code', 'name', 'usd_price']
Now I want to join them on common columns, so I try the following:
> df_lag_pre.join(df_unmatched, on=['name','country','ccy_code','usd_price'])
And I get the following error message:
AnalysisException: u'resolved attribute(s) price#3424 missing from country#3443,month#801,price#808,category#803,subcategory#804,page#805,date#280,link#809,name#806,quantity#807,ccy_code#3439,sku#3004,day#802 in operator !EvaluatePython PythonUDF#<lambda>(ccy_code#3439,price#3424), pythonUDF#811: string;'
Some of the columns that show up in this error, such as price, were part of another DataFrame from which df_lag was built. I can't find any info on how to interpret this message, so any help would be greatly appreciated.
You can perform the join this way in PySpark; please see if this is useful for you:
df1 = df_lag_pre.alias("df1")
df2 = df_unmatched.alias("df2")
join_both = df1.join(df2, (col("df1.name") == col("df2.name")) & (col("df1.country") == col("df2.country")) & (col("df1.ccy_code") == col("df2.ccy_code")) & (col("df1.usd_price") == col("df2.usd_price")), 'inner')
Update: If you get a NameError because col is not defined, add the import below:
from pyspark.sql.functions import col
