Convert spark sql to pyspark dataframe API for one query - apache-spark

Below is the query to be converted to the PySpark DataFrame API:
SELECT b.se10,
b.se3,
b.se_aggrtr_indctr,
b.key_swipe_ind
FROM
(SELECT se10,
se3,
se_aggrtr_indctr,
ROW_NUMBER() OVER (PARTITION BY SE10
ORDER BY se_aggrtr_indctr DESC) AS rn,
key_swipe_ind
FROM fraud_details_data_whole
GROUP BY se10,
se3,
se_aggrtr_indctr ,
key_swipe_ind) b
WHERE b.rn<2

Simply use spark.sql:
sql_statement = """SELECT b.se10,
b.se3,
b.se_aggrtr_indctr,
b.key_swipe_ind
FROM
(SELECT se10,
se3,
se_aggrtr_indctr,
ROW_NUMBER() OVER (PARTITION BY SE10
ORDER BY se_aggrtr_indctr DESC) AS rn,
key_swipe_ind
FROM fraud_details_data_whole
GROUP BY se10,
se3,
se_aggrtr_indctr ,
key_swipe_ind) b
WHERE b.rn<2"""
df = spark.sql(sql_statement)
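If you want the DataFrame API itself rather than spark.sql, here is a minimal sketch, assuming fraud_details_data_whole is available as a table. Since the GROUP BY carries no aggregates, it is equivalent to a distinct over the four grouped columns:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# GROUP BY with no aggregates == distinct over the grouped columns
deduped = (spark.table("fraud_details_data_whole")
           .select("se10", "se3", "se_aggrtr_indctr", "key_swipe_ind")
           .distinct())

# ROW_NUMBER() OVER (PARTITION BY se10 ORDER BY se_aggrtr_indctr DESC)
w = Window.partitionBy("se10").orderBy(F.col("se_aggrtr_indctr").desc())

df = (deduped
      .withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") < 2)
      .drop("rn"))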

Related

Spark replaceWhere error even after filtering df

I tested this on 10 date partitions and 3 failed. I'm completely lost on what could be the issue. dt is just a widget for local testing.
dbutils.widgets.text('dt', '', 'dt: yyyy-MM-dd')
# 2022-05-18
silver_df = spark.sql(f"""
select data:id::int as id
, data:name::string as user_name
, substring(data:location::string, 0, 100) as country_name
, data:updated_at::string as updated_at
, to_date(data:updated_at::string) as dt
, dt as processed_dt
from bronze.{table_name}
where dt = '{dt}'
qualify row_number() over(partition by id order by updated_at desc) = 1
""")
(
silver_df
.write
.format('delta')
.partitionBy('dt', 'processed_dt')
.mode('overwrite')
.option('replaceWhere', f"processed_dt == '{dt}'")
.saveAsTable(f"{database}.{table_name}")
)
AnalysisException: Data written out does not match replaceWhere 'processed_dt == '2022-05-18''.
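No answer is quoted here, but one way to narrow this down is to check, before the write, whether silver_df actually contains rows that fall outside the replaceWhere predicate; rows whose processed_dt differs from the widget value (or is NULL) typically produce exactly this AnalysisException. A minimal debugging sketch, reusing the names from the question:
# Hedged debugging sketch: look for rows violating the replaceWhere predicate
violations = silver_df.filter(f"processed_dt != '{dt}' OR processed_dt IS NULL")
print(violations.count())
violations.select('dt', 'processed_dt', 'updated_at').show(10, truncate=False)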

Execute SQL query in python pandas

How can we execute this SQL in Python pandas?
SQL:
select
a.*,
b.vol1 / sum(vol1) over (
partition by a.sale, a.d_id,
a.month, a.p_id
) vol_r,
a.vol2* b.vol1/ sum(b.vol1) over (
partition by a.sale, a.d_id,
a.month, a.p_id
) vol_t
from
sales1 a
left join sales2 b on a.sale = b.sale
and a.d_id = b.d_id
and a.month = b.month
and a.p_id = b.p_id
I know one way is pandasql, but I get an error because the SQL involves PARTITION BY.
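pandasql runs the query through SQLite, which (depending on the version) does not support window functions, but the same result can be built in plain pandas: the LEFT JOIN becomes a merge and SUM(...) OVER (PARTITION BY ...) becomes a groupby().transform('sum'). A minimal sketch, assuming sales1 and sales2 are DataFrames with the columns named in the SQL (vol2 on sales1, vol1 on sales2):
import pandas as pd

keys = ['sale', 'd_id', 'month', 'p_id']

# LEFT JOIN sales2 b ON the four key columns
merged = sales1.merge(sales2[keys + ['vol1']], on=keys, how='left')

# SUM(b.vol1) OVER (PARTITION BY sale, d_id, month, p_id), computed on the joined result
group_sum = merged.groupby(keys)['vol1'].transform('sum')

merged['vol_r'] = merged['vol1'] / group_sum
merged['vol_t'] = merged['vol2'] * merged['vol1'] / group_sum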

Converting SQL query to Spark Data-frame

I want to convert the queries below to a Spark DataFrame (I am pretty new to Spark):
-- Creating group number
select distinct *, DENSE_RANK() OVER(ORDER BY person_id, trust_id) AS group_number;
-- This is what I got so far for above
df = self.spark.sql("select person_id, trust_id, insurance_id, amount, time_of_app, place_of_app from {}".format(self.tables['people']))
df = df.withColumn("group_number", dense_rank().over(Window.partitionBy("person_id", "trust_id").orderBy("person_id", "trust_id")))
-- Different query 1
where group_number in (select group_number from etl_table_people where code like 'H%') group by group_number having count(distinct amount) > 1;
-- Different query 2
where insurance_id = 'V94.12'
group by group_number having count(distinct amount) = 2;
What you are looking for is Spark's window specification (WindowSpec) functions.
val windowSpec = Window.partitionBy("person_id", "trust_id").orderBy(col("person_id").desc, col("trust_id").desc)
df.withColumn("group_number", dense_rank() over windowSpec)
And you create your DataFrame with Spark from whatever data source you use; refer to the Hive data source documentation if your source is Hive.
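Note that the original SQL has no PARTITION BY, so an unpartitioned window mirrors it more closely (Spark will warn that this moves all data to a single partition). A hedged PySpark sketch of that reading:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# DENSE_RANK() OVER (ORDER BY person_id, trust_id): no partitioning, same ordering
w = Window.orderBy("person_id", "trust_id")

df = df.distinct().withColumn("group_number", F.dense_rank().over(w))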

PySpark DataFrame Code for an HiveQL that takes 3-4 hours

The following HiveQL code takes about 3 to 4 hours, and I am trying to convert it into efficient PySpark DataFrame code. Any DataFrame expert's input is much appreciated.
INSERT overwrite table dlstage.DIBQtyRank_C11 PARTITION(fiscalyearmonth)
SELECT * FROM
(SELECT a.matnr, a.werks, a.periodstartdate, a.fiscalyear, a.fiscalmonth,b.dy_id, MaterialType,(COALESCE(a.salk3,0)) salk3,(COALESCE(a.lbkum,0)) lbkum, sum(a.valuatedquantity) AS valuatedquantity, sum(a.InventoryValue) AS InventoryValue,
rank() over (PARTITION by dy_id, werks, matnr order by a.max_date DESC) rnk, sum(stprs) stprs, max(peinh) peinh, fcurr,fiscalyearmonth
FROM dlstage.DIBmsegFinal a
LEFT JOIN dlaggr.dim_fiscalcalendar b ON a.periodstartdate=b.fmth_begin_dte WHERE a.max_date >= b.fmth_begin_dte AND a.max_date <= b.dy_id and
fiscalYearmonth = concat(fyr_id,lpad(fmth_nbr,2,0))
GROUP BY a.matnr, a.werks,dy_id, max_date, a.periodstartdate, a.fiscalyear, a.fiscalmonth, MaterialType, fcurr, COALESCE(a.salk3,0), COALESCE(a.lbkum,0),fiscalyearmonth) a
WHERE a.rnk=1 and a.fiscalYear = '%s'" %(year) + " and a.fiscalmonth ='%s'" %(mnth)
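No DataFrame version is given in the thread, but the query maps to a standard pattern: join, filter, aggregate, add a rank() window column, then keep rnk = 1. A partial, hedged sketch (table, column, and the year/mnth variable names are taken from the query; the unqualified columns are assumed to come from DIBmsegFinal, and the COALESCE handling plus the final partitioned insert are left out):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

a = spark.table("dlstage.DIBmsegFinal").alias("a")
b = spark.table("dlaggr.dim_fiscalcalendar").alias("b")

joined = (a.join(b, F.col("a.periodstartdate") == F.col("b.fmth_begin_dte"), "left")
           .where((F.col("a.max_date") >= F.col("b.fmth_begin_dte")) &
                  (F.col("a.max_date") <= F.col("b.dy_id")) &
                  (F.col("a.fiscalyearmonth") ==
                   F.concat(F.col("b.fyr_id").cast("string"),
                            F.lpad(F.col("b.fmth_nbr").cast("string"), 2, "0")))))

grouped = (joined.groupBy("a.matnr", "a.werks", "b.dy_id", "a.max_date", "a.periodstartdate",
                          "a.fiscalyear", "a.fiscalmonth", "a.MaterialType",
                          "a.fcurr", "a.salk3", "a.lbkum", "a.fiscalyearmonth")
                 .agg(F.sum("a.valuatedquantity").alias("valuatedquantity"),
                      F.sum("a.InventoryValue").alias("InventoryValue"),
                      F.sum("a.stprs").alias("stprs"),
                      F.max("a.peinh").alias("peinh")))

# rank() OVER (PARTITION BY dy_id, werks, matnr ORDER BY max_date DESC)
w = Window.partitionBy("dy_id", "werks", "matnr").orderBy(F.col("max_date").desc())

result = (grouped.withColumn("rnk", F.rank().over(w))
                 .where((F.col("rnk") == 1) &
                        (F.col("fiscalyear") == year) &
                        (F.col("fiscalmonth") == mnth)))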

Spark sql when joining two or more tables using two select statements

This is my statement:
val Porders = sqlContext.sql(
"""SELECT count(STATUS_CD)
FROM s_order
WHERE STATUS_CD = 'pending' AND ROW_ID IN
( SELECT so.ROW_ID FROM s_order so
JOIN s_order_item soi
ON so.ROW_ID = soi.ORDER_ID
JOIN s_order_type sot
ON so.ORDER_TYPE_ID = sot.ROW_ID
JOIN s_product sp
ON soi.PROD_ID = sp.ROW_ID
WHERE (sp.NAME like '%VIP%' OR sp.NAME like '%BIZ%' OR sp.NAME like '%UniFi%')
AND LOWER(sot.NAME) = 'new install')
""")
I receive the following error:
ERROR : java.lang.RuntimeException: [3.3] failure: identifier expected
( SELECT so.ROW_ID FROM s_order so JOIN s_order_item soi
^
What could be the reason?
The reason why this happens is that subqueries are not supported; see SPARK-4226.
Even a query like
sqlContext.sql(
"""SELECT count(STATUS_CD)
FROM s_order
WHERE STATUS_CD = 'pending' AND ROW_ID IN
(SELECT * FROM s_order)
""")
does not currently work (as of Spark SQL 1.5.1).
Try to replace your subquery with a join, e.g. https://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html
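For example, the IN (...) subquery can be rewritten as an inner join against a DISTINCT list of matching ROW_IDs, which is the pattern described on that page. A hedged sketch of the rewritten statement (shown in PySpark form; whether your Spark version accepts the derived table in the FROM clause should be verified):
porders = sqlContext.sql("""
    SELECT count(o.STATUS_CD)
    FROM s_order o
    JOIN (SELECT DISTINCT so.ROW_ID
          FROM s_order so
          JOIN s_order_item soi ON so.ROW_ID = soi.ORDER_ID
          JOIN s_order_type sot ON so.ORDER_TYPE_ID = sot.ROW_ID
          JOIN s_product sp ON soi.PROD_ID = sp.ROW_ID
          WHERE (sp.NAME LIKE '%VIP%' OR sp.NAME LIKE '%BIZ%' OR sp.NAME LIKE '%UniFi%')
            AND LOWER(sot.NAME) = 'new install') m
      ON o.ROW_ID = m.ROW_ID
    WHERE o.STATUS_CD = 'pending'
""")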
