Below is the query to be converted to a PySpark DataFrame:
SELECT b.se10,
b.se3,
b.se_aggrtr_indctr,
b.key_swipe_ind
FROM
(SELECT se10,
se3,
se_aggrtr_indctr,
ROW_NUMBER() OVER (PARTITION BY SE10
ORDER BY se_aggrtr_indctr DESC) AS rn,
key_swipe_ind
FROM fraud_details_data_whole
GROUP BY se10,
se3,
se_aggrtr_indctr ,
key_swipe_ind) b
WHERE b.rn<2
Simply use spark.sql:
sql_statement = """SELECT b.se10,
b.se3,
b.se_aggrtr_indctr,
b.key_swipe_ind
FROM
(SELECT se10,
se3,
se_aggrtr_indctr,
ROW_NUMBER() OVER (PARTITION BY SE10
ORDER BY se_aggrtr_indctr DESC) AS rn,
key_swipe_ind
FROM fraud_details_data_whole
GROUP BY se10,
se3,
se_aggrtr_indctr ,
key_swipe_ind) b
WHERE b.rn<2"""
df = spark.sql(sql_statement)
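If you prefer the DataFrame API over a SQL string, here is a minimal sketch of the same logic, assuming fraud_details_data_whole is available as a table; since the GROUP BY carries no aggregates, it is replaced by dropDuplicates:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Top row per se10, ordered by se_aggrtr_indctr descending
w = Window.partitionBy("se10").orderBy(F.col("se_aggrtr_indctr").desc())

df = (spark.table("fraud_details_data_whole")
      .dropDuplicates(["se10", "se3", "se_aggrtr_indctr", "key_swipe_ind"])  # the aggregate-free GROUP BY
      .withColumn("rn", F.row_number().over(w))
      .where(F.col("rn") < 2)
      .select("se10", "se3", "se_aggrtr_indctr", "key_swipe_ind"))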
Related
I tested this on 10 date partitions and 3 of them failed. I'm completely lost on what the issue could be; dt is just a widget for local testing.
dbutils.widgets.text('dt', '', 'dt: yyyy-MM-dd')
dt = dbutils.widgets.get('dt')  # e.g. 2022-05-18
silver_df = spark.sql(f"""
select data:id::int as id
, data:name::string as user_name
, substring(data:location::string, 0, 100) as country_name
, data:updated_at::string as updated_at
, to_date(data:updated_at::string) as dt
, dt as processed_dt
from bronze.{table_name}
where dt = '{dt}'
qualify row_number() over(partition by id order by updated_at desc) = 1
""")
(
silver_df
.write
.format('delta')
.partitionBy('dt', 'processed_dt')
.mode('overwrite')
.option('replaceWhere', f"processed_dt == '{dt}'")
.saveAsTable(f"{database}.{table_name}")
)
AnalysisException: Data written out does not match replaceWhere 'processed_dt == '2022-05-18''.
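One way to narrow this down is to check which partition values the query actually produces before writing. A small diagnostic sketch, reusing silver_df from above:

# If anything other than the single widget date shows up in processed_dt,
# the replaceWhere predicate "processed_dt == '<dt>'" will reject the write.
silver_df.select('dt', 'processed_dt').distinct().show(truncate=False)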
How can we execute this SQL in Python pandas?
SQL:
select
a.*,
b.vol1 / sum(vol1) over (
partition by a.sale, a.d_id,
a.month, a.p_id
) vol_r,
a.vol2* b.vol1/ sum(b.vol1) over (
partition by a.sale, a.d_id,
a.month, a.p_id
) vol_t
from
sales1 a
left join sales2 b on a.sale = b.sale
and a.d_id = b.d_id
and a.month = b.month
and a.p_id = b.p_id
I know one way is pandasql, but I am getting an error because the SQL involves PARTITION BY.
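In plain pandas, the SUM(...) OVER (PARTITION BY ...) can be reproduced with a groupby transform after the merge. A minimal sketch, assuming vol1 lives in sales2 and vol2 in sales1 (the tiny frames below are hypothetical):

import pandas as pd

# Hypothetical data with the columns referenced in the query
sales1 = pd.DataFrame({'sale': [1, 1], 'd_id': [10, 10], 'month': [1, 1],
                       'p_id': [5, 5], 'vol2': [100.0, 200.0]})
sales2 = pd.DataFrame({'sale': [1], 'd_id': [10], 'month': [1],
                       'p_id': [5], 'vol1': [30.0]})

keys = ['sale', 'd_id', 'month', 'p_id']

# LEFT JOIN on the four keys, as in the SQL
m = sales1.merge(sales2, on=keys, how='left')

# sum(b.vol1) over (partition by sale, d_id, month, p_id)
part_sum = m.groupby(keys)['vol1'].transform('sum')

m['vol_r'] = m['vol1'] / part_sum
m['vol_t'] = m['vol2'] * m['vol1'] / part_sum
print(m)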
I want to convert the queries below to a Spark DataFrame (I am pretty new to Spark):
-- Creating group number
select distinct *, DENSE_RANK() OVER(ORDER BY person_id, trust_id) AS group_number;
-- This is what I got so far for above
df = self.spark.sql("select person_id, trust_id, insurance_id, amount, time_of_app, place_of_app from {}".format(self.tables['people']))
df = df.withColumn("group_number", dense_rank().over(Window.partitionBy("person_id", "trust_id").orderBy("person_id", "trust_id")))
-- Different query 1
where group_number in (select group_number from etl_table_people where code like 'H%') group by group_number having count(distinct amount) > 1;
-- Different query 2
where insurance_id = 'V94.12'
group by group_number having count(distinct amount) = 2;
What you are looking for is Spark's window specification (WindowSpec) functionality.
val windowSpec = Window.partitionBy("person_id","trust_id").orderBy(col("person_id").desc, col("trust_id").desc)
df.withColumn("group_number", dense_rank() over windowSpec)
And you get your DataFrame using Spark based on your data source; you can refer to this if your source is Hive.
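For a PySpark version that matches the original SQL exactly (DENSE_RANK() OVER (ORDER BY person_id, trust_id) has no PARTITION BY, so the window only orders), a minimal sketch applied to the df built above:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

# Order-only window; note that an un-partitioned window pulls all rows
# into a single partition, which can be slow on large data.
w = Window.orderBy("person_id", "trust_id")

df = df.distinct().withColumn("group_number", dense_rank().over(w))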
The following HiveQL code takes about 3 to 4 hours to run, and I am trying to convert it into equivalent PySpark DataFrame code. Any input from DataFrame experts is appreciated.
INSERT overwrite table dlstage.DIBQtyRank_C11 PARTITION(fiscalyearmonth)
SELECT * FROM
(SELECT a.matnr, a.werks, a.periodstartdate, a.fiscalyear, a.fiscalmonth,b.dy_id, MaterialType,(COALESCE(a.salk3,0)) salk3,(COALESCE(a.lbkum,0)) lbkum, sum(a.valuatedquantity) AS valuatedquantity, sum(a.InventoryValue) AS InventoryValue,
rank() over (PARTITION by dy_id, werks, matnr order by a.max_date DESC) rnk, sum(stprs) stprs, max(peinh) peinh, fcurr,fiscalyearmonth
FROM dlstage.DIBmsegFinal a
LEFT JOIN dlaggr.dim_fiscalcalendar b ON a.periodstartdate=b.fmth_begin_dte WHERE a.max_date >= b.fmth_begin_dte AND a.max_date <= b.dy_id and
fiscalYearmonth = concat(fyr_id,lpad(fmth_nbr,2,0))
GROUP BY a.matnr, a.werks,dy_id, max_date, a.periodstartdate, a.fiscalyear, a.fiscalmonth, MaterialType, fcurr, COALESCE(a.salk3,0), COALESCE(a.lbkum,0),fiscalyearmonth) a
WHERE a.rnk=1 and a.fiscalYear = '%s'" %(year) + " and a.fiscalmonth ='%s'" %(mnth)
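A minimal DataFrame-API sketch of the core join / window-rank / filter pattern from the HiveQL above (the remaining aggregations, the fiscalyearmonth predicate, and the partitioned INSERT OVERWRITE are omitted for brevity; year and mnth are the same Python variables as in the original statement):

from pyspark.sql import Window
from pyspark.sql import functions as F

a = spark.table("dlstage.DIBmsegFinal").alias("a")
b = spark.table("dlaggr.dim_fiscalcalendar").alias("b")

# LEFT JOIN on the period start date, constrained by max_date as in the WHERE clause
joined = (a.join(b, F.col("a.periodstartdate") == F.col("b.fmth_begin_dte"), "left")
           .where((F.col("a.max_date") >= F.col("b.fmth_begin_dte")) &
                  (F.col("a.max_date") <= F.col("b.dy_id"))))

# rank() over (partition by dy_id, werks, matnr order by max_date desc)
w = (Window.partitionBy(F.col("b.dy_id"), F.col("a.werks"), F.col("a.matnr"))
           .orderBy(F.col("a.max_date").desc()))

result = (joined
          .withColumn("rnk", F.rank().over(w))
          .where((F.col("rnk") == 1) &
                 (F.col("a.fiscalyear") == year) &      # year and mnth as in the
                 (F.col("a.fiscalmonth") == mnth)))     # original statement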
This is my statement:
val Porders = sqlContext.sql(
"""SELECT count(STATUS_CD)
FROM s_order
WHERE STATUS_CD = 'pending' AND ROW_ID IN
( SELECT so.ROW_ID FROM s_order so
JOIN s_order_item soi
ON so.ROW_ID = soi.ORDER_ID
JOIN s_order_type sot
ON so.ORDER_TYPE_ID = sot.ROW_ID
JOIN s_product sp
ON soi.PROD_ID = sp.ROW_ID
WHERE (sp.NAME like '%VIP%' OR sp.NAME like '%BIZ%' OR sp.NAME like '%UniFi%')
AND LOWER(sot.NAME) = 'new install')
""")
I receive the following error:
ERROR : java.lang.RuntimeException: [3.3] failure: identifier expected
( SELECT so.ROW_ID FROM s_order so JOIN s_order_item soi
^
What could be the reason?
The reason this happens is that subqueries are not supported: see SPARK-4226.
Even a query like
sqlContext.sql(
"""SELECT count(STATUS_CD)
FROM s_order
WHERE STATUS_CD = 'pending' AND ROW_ID IN
(SELECT * FROM s_order)
""")
does not currently work (as of Spark SQL 1.5.1).
Try replacing your subquery with a join; see, for example, https://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html
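A hedged sketch of what that rewrite could look like for the statement above: the IN subquery becomes joins in the outer query, and COUNT(DISTINCT so.ROW_ID) avoids double counting orders that match more than one item (shown with sqlContext.sql from Python, but the same SQL string works from Scala):

porders = sqlContext.sql("""
    SELECT COUNT(DISTINCT so.ROW_ID)
    FROM s_order so
    JOIN s_order_item soi ON so.ROW_ID = soi.ORDER_ID
    JOIN s_order_type sot ON so.ORDER_TYPE_ID = sot.ROW_ID
    JOIN s_product sp     ON soi.PROD_ID = sp.ROW_ID
    WHERE so.STATUS_CD = 'pending'
      AND (sp.NAME LIKE '%VIP%' OR sp.NAME LIKE '%BIZ%' OR sp.NAME LIKE '%UniFi%')
      AND LOWER(sot.NAME) = 'new install'
""")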