Using AVG in Spark with window function - apache-spark

I have the following SQL query:
SELECT st.Value,
       st.Id,
       NTILE(2) OVER (PARTITION BY st.Id, st.VarId ORDER BY st.Sls),
       AVG(st.Value) OVER (PARTITION BY st.Id, st.VarId ORDER BY st.Sls, st.Date)
FROM table tb
INNER JOIN staging st ON st.Id = tb.Id
I've tried to adapt this to Spark/PySpark using window functions; my code is below:
windowSpec_1 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls")
windowSpec_2 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls", "staging.Date")
df = table.join(
    staging,
    on=f.col("staging.Id") == f.col("table.Id"),
    how='inner'
).select(
    f.col("staging.Value"),
    f.ntile(2).over(windowSpec_1),
    f.avg("staging.Value").over(windowSpec_2)
)
However, I'm getting the following error:
pyspark.sql.utils.AnalysisException: Can't extract value from Value#42928: need struct type but got decimal(16,6)
How can I solve this problem? Is it necessary to group the data?

Maybe you forgot to assign an alias to staging:
df = table.join(
    staging.alias("staging"),
    on=f.col("staging.Id") == f.col("table.Id"),
    how='inner'
).select(
    f.col("staging.Value"),
    f.ntile(2).over(windowSpec_1),
    f.avg("staging.Value").over(windowSpec_2)
)
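For completeness, here is a self-contained sketch (assuming table and staging are existing DataFrames and f is pyspark.sql.functions, as in the question) that aliases both sides of the join, since f.col("table.Id") likewise only resolves reliably if the left DataFrame carries the table alias:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# The window specs reference the aliased "staging" columns used in the join below
windowSpec_1 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls")
windowSpec_2 = Window.partitionBy("staging.Id", "staging.VarId").orderBy("staging.Sls", "staging.Date")

df = (
    table.alias("table")           # alias the left side as well
    .join(
        staging.alias("staging"),  # alias the right side
        on=f.col("staging.Id") == f.col("table.Id"),
        how='inner'
    )
    .select(
        f.col("staging.Value"),
        f.col("staging.Id"),       # the original SQL also selects st.Id
        f.ntile(2).over(windowSpec_1).alias("ntile_2"),
        f.avg("staging.Value").over(windowSpec_2).alias("avg_value")
    )
)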

Related

EOF in multi-line string error in PySpark

I was running the following query in PySpark; the same SQL query runs fine on Hive.
spark.sql(f"""
create table DEFAULT.TMP_TABLE as
select b.customer_id, prob
from
bi_prod.TMP_score_1st_day a,
(select customer_id, prob from BI_prod.Tbl_brand1_scoring where insert_date = 20230101
union all
select customer_id, prob from BI_prod.Tbl_brand2_scoring where insert_date = 20230101)  b
where a.customer_id = b.customer_id
""")
This produces the following error
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
I need to fix this error, but I can't figure out why it is occurring.
I recommend rewriting the code in a more Pythonic way.
from pyspark.sql.functions import col

df1 = (
    spark.table('bi_prod.TMP_score_1st_day')
    .filter(col('insert_date') == '20230101')
    .select('customer_id')
)
df2 = (
    spark.table('bi_prod.Tbl_brand2_scoring')
    .filter(col('insert_date') == '20230101')
    .select('customer_id', 'prob')
)
df = df1.join(df2, 'customer_id')
df.show(1, vertical=True)
Let me know how this works for you and if you still get the same error.
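If the goal is to mirror the original SQL exactly (both scoring tables unioned before the join), a sketch along these lines may be closer; the table and column names are taken from the question, and insert_date is compared as a number as in the original query:
from pyspark.sql.functions import col

scores = spark.table('bi_prod.TMP_score_1st_day').select('customer_id')

brand1 = (
    spark.table('bi_prod.Tbl_brand1_scoring')
    .filter(col('insert_date') == 20230101)
    .select('customer_id', 'prob')
)
brand2 = (
    spark.table('bi_prod.Tbl_brand2_scoring')
    .filter(col('insert_date') == 20230101)
    .select('customer_id', 'prob')
)

# Union the two scoring tables, then keep only customers present in the score table
result = scores.join(brand1.unionByName(brand2), 'customer_id')
result.write.mode('overwrite').saveAsTable('DEFAULT.TMP_TABLE')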

I'm getting an unexpected failed assertion error when joining Spark Dataframe - Found duplicate rewrite attributes

When I run the code below, I get the error java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes. Prior to updating our Databricks runtime, this ran smoothly.
top10_df is a DataFrame whose unique keys are the columns listed in groups.
res_df is an aggregation of those unique keys in top10_df with min and max dates.
Once res_df is created and persisted, it is joined back to top10_df on the unique keys in groups.
from pyspark.sql import functions as fn
from pyspark.sql import Window as w

groups = ['col1', 'col2', 'col3', 'col4']

min_date_created = fn.min('date_created').alias('min_date_created')
max_date_created = fn.max('date_created').alias('max_date_created')

res_df = (top10_df
          .groupBy(groups)
          .agg(min_date_created, max_date_created)
          )
res_df.persist()
print(res_df.count())

score_rank = fn.row_number().over(w.partitionBy(groups).orderBy(fn.desc('score')))
unique_issue_id = fn.row_number().over(w.orderBy(groups))

out_df = (top10_df.alias('t10')
          .join(res_df.alias('res'), groups, 'left')
          .where(fn.col('t10.date_created') == fn.col('res.max_date_created'))
          .drop(fn.col('t10.date_created'))
          .drop(fn.col('t10.date_updated'))
          .withColumn('score_rank', score_rank)
          .where(fn.col('score_rank') == 1)
          .drop('score_rank',
                'latest_revision_complete_hash',
                'latest_revision_durable_hash')
          .withColumn('unique_issue_id', unique_issue_id)
          .withColumnRenamed('res.id', 'resource_id')
          )
out_df.persist()
print(out_df.count())
Instead of:
out_df = (top10_df.alias('t10')
          .join(res_df.alias('res'), groups, 'left')
select the columns of the right-hand-side DataFrame explicitly before aliasing it for the join, so the attributes on the right-hand side are fresh and no longer duplicated:
out_df = (
    top10_df.alias('t10')
    .join(
        res_df.select(
            *[fn.col(c) for c in groups],
            fn.col('min_date_created'),
            fn.col('max_date_created')
        ).alias('res'),
        groups,
        'left'
    )
)

Converting SQL query to Spark Data-frame

I want to convert the queries below to a Spark DataFrame (I am pretty new to Spark):
-- Creating group number
select distinct *, DENSE_RANK() OVER(ORDER BY person_id, trust_id) AS group_number;
-- This is what I got so far for above
df = self.spark.sql("select person_id, trust_id, insurance_id, amount, time_of_app, place_of_app from {}".format(self.tables['people']))
df = df.withColumn("group_number", dense_rank().over(Window.partitionBy("person_id", "trust_id").orderBy("person_id", "trust_id")))
-- Different query 1
where group_number in (select group_number from etl_table_people where code like 'H%') group by group_number having count(distinct amount) > 1;
-- Different query 2
where insurance_id = 'V94.12'
group by group_number having count(distinct amount) = 2;
What you are looking for is the window specification (WindowSpec) functionality of Spark.
val windowSpec = Window.partitionBy("person_id", "trust_id").orderBy(col("person_id").desc, col("trust_id").desc)
df.withColumn("group_number", dense_rank() over windowSpec)
And you get your DataFrame from Spark based on your data source. You can refer to this if your source is Hive.
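Since the question is in PySpark, a hedged Python equivalent of the original SQL (which has no PARTITION BY, so the window only orders) might look like this; df is assumed to be the DataFrame already read from the people table:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Global window, matching DENSE_RANK() OVER (ORDER BY person_id, trust_id)
window_spec = Window.orderBy(F.col("person_id"), F.col("trust_id"))

df = df.withColumn("group_number", F.dense_rank().over(window_spec))
Note that a window without partitionBy makes Spark warn about moving all data to a single partition; that is inherent to computing a global ranking.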

Need to Join multiple tables in pyspark:

The query I'm using:
df= (df1.alias('a')
.join(df2, a.id == df2.id, how='inner')
.select('a.*').alias('b')
.join(df3, b.id == df3.id, how='inner'))
error: name 'b' is not defined.
.alias('b') does not create a Python identifier named b; it only sets an internal name on the returned DataFrame. Your a.id is likewise probably not what you expect; it refers to whatever a was defined as earlier.
I can't remember a nice way to access the newly created DF by name right in the expression. I'd go with an intermediate identifier:
df_joined = df1.join(df2, df1.id == df2.id, how='inner')
result_df = df_joined.join(df3, df_joined.id == df3.id, how='inner')
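That said, if you prefer to keep the aliases, you can refer to them through qualified column names with f.col instead of Python variables; a sketch, assuming f is pyspark.sql.functions and each DataFrame has an id column:
from pyspark.sql import functions as f

df = (
    df1.alias('a')
    .join(df2.alias('b'), f.col('a.id') == f.col('b.id'), how='inner')
    .select('a.*')
    .alias('c')    # alias the intermediate result itself, not a Python name
    .join(df3.alias('d'), f.col('c.id') == f.col('d.id'), how='inner')
)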

What does "Correlated scalar subqueries must be Aggregated" mean?

I use Spark 2.0.
I'd like to execute the following SQL query:
val sqlText = """
select
f.ID as TID,
f.BldgID as TBldgID,
f.LeaseID as TLeaseID,
f.Period as TPeriod,
coalesce(
(select
f.ChargeAmt
from
Fact_CMCharges f
where
f.BldgID = Fact_CMCharges.BldgID
limit 1),
0) as TChargeAmt1,
f.ChargeAmt as TChargeAmt2,
l.EFFDATE as TBreakDate
from
Fact_CMCharges f
join
CMRECC l on l.BLDGID = f.BldgID and l.LEASID = f.LeaseID and l.INCCAT = f.IncomeCat and date_format(l.EFFDATE,'D')<>1 and f.Period=EFFDateInt(l.EFFDATE)
where
f.ActualProjected = 'Lease'
except(
select * from TT1 t2 left semi join Fact_CMCharges f2 on t2.TID=f2.ID)
"""
val query = spark.sql(sqlText)
query.show()
It seems that the inner statement in coalesce gives the following error:
pyspark.sql.utils.AnalysisException: u'Correlated scalar subqueries must be Aggregated: GlobalLimit 1\n+- LocalLimit 1\n
What's wrong with the query?
You have to make sure that your sub-query returns a single row by definition (and not merely by the data it happens to contain); otherwise the Spark analyzer complains while resolving the SQL statement.
In other words, when Catalyst cannot be 100% sure, just by looking at the SQL statement (without looking at your data), that the sub-query returns a single row, this exception is thrown.
If you are sure that your subquery only yields a single row, you can wrap the selected column in one of the following standard aggregate functions so the analyzer is satisfied (see the sketch after this list):
first
avg
max
min
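For example, the coalesce subquery from the question could be wrapped in first() so the analyzer can prove a single row is returned. A sketch using the question's table and column names, with the inner alias renamed to c to avoid shadowing the outer f (df_charge is just an illustrative name):
df_charge = spark.sql("""
    select
        f.ID as TID,
        coalesce(
            (select first(c.ChargeAmt)
             from Fact_CMCharges c
             where c.BldgID = f.BldgID),
            0) as TChargeAmt1
    from Fact_CMCharges f
""")
df_charge.show()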
