I am a newbie in Spark. I want to delete a row using Spark SQL. From what I have read, DELETE is not supported on a temp table, so to run delete-like SQL queries I need to save the table permanently in PySpark, as a Hive table I guess. I have done that too.
My code is:
spark.sql("select b.ENTITYID as ENTITYID, cm.BLDGID as BldgID,cm.LEASID as LeaseID,coalesce(l.SUITID,(select EmptyDefault from EmptyDefault)) as SuiteID,(select CurrDate from CurrDate) as TxnDate,cm.INCCAT as IncomeCat,'??' as SourceCode,(Select CurrPeriod from CurrPeriod)as Period,coalesce(case when cm.DEPARTMENT ='#' then 'null' else cm.DEPARTMENT end, null) as Dept,'Lease' as ActualProjected ,fnGetChargeInd(cm.EFFDATE,cm.FRQUENCY,cm.BEGMONTH,(select CurrPeriod from CurrPeriod))*coalesce (cm.AMOUNT,0) as ChargeAmt,0 as OpenAmt,cm.CURRCODE as CurrencyCode,case when ('PERIOD.DATACLSD') is null then 'Open' else 'Closed' end as GLClosedStatus,'Unposted'as GLPostedStatus ,'Unpaid' as PaidStatus,cm.FRQUENCY as Frequency,0 as RetroPD from CMRECC cm join BLDG b on cm.BLDGID =b.BLDGID join LEAS l on cm.BLDGID =l.BLDGID and cm.LEASID =l.LEASID and (l.VACATE is null or l.VACATE >= ('select CurrDate from CurrDate')) and (l.EXPIR >= ('select CurrDate from CurrDate') or l.EXPIR < ('select RunDate from RunDate')) left outer join PERIOD on b.ENTITYID = PERIOD.ENTITYID and ('select CurrPeriod from CurrPeriod')=PERIOD.PERIOD where ('select CurrDate from CurrDate')>=cm.EFFDATE and (select CurrDate from CurrDate) <= coalesce(cm.EFFDATE,cast(date_add(( select min(cm2.EFFDATE) from CMRECC cm2 where cm2.BLDGID = cm.BLDGID and cm2.LEASID = cm.LEASID and cm2.INCCAT = cm.INCCAT and 'cm2.EFFDATE' > 'cm.EFFDATE'),-1) as timestamp) ,case when l.EXPIR <(select RunDate from RunDate)then (Select RunDate from RunDate) else l.EXPIR end)").write.saveAsTable('Fact_Temp1')
After that I had a permanent table in my PySpark session.
Next, I ran a delete operation:
spark.sql("DELETE from Fact_Temp1 where ActualProjected='Lease' and ChargeAmt=0").show()
I got this error:
pyspark.sql.utils.ParseException: u"\nOperation not allowed: DELETE from(line 1, pos 0)\n\n== SQL ==\nDELETE from Fact_Temp1 where ActualProjected='Lease'and ChargeAmt=0\n^^^\n"
I am a bit confused. Is there a different way to write this? I have no idea why I am getting this error. Kindly guide me; I am using Spark 2.0.
kalyan
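For reference: Spark 2.0's SQL parser does not support DELETE statements at all, which is exactly what the ParseException is saying. A common workaround, sketched here under the assumption that Fact_Temp1 was saved with saveAsTable as above, is to keep only the rows you want and write them back out:

df = spark.table("Fact_Temp1")
# keep everything except the rows the DELETE was meant to remove
kept = df.filter("NOT (ActualProjected = 'Lease' AND ChargeAmt = 0)")
# Spark cannot overwrite a table it is currently reading from,
# so save under a new (hypothetical) name and swap afterwards
kept.write.mode("overwrite").saveAsTable("Fact_Temp2")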
I have this SQL query that I confirmed works in SQLite. It updates two columns in the table. I have 144 columns that need to be updated using the same query. How can I, using Python, pass variables along so that I can use the same query to update all of them?
Here is my query to update one column:
UPDATE GBPAUD_TA AS t1
SET _1m_L3_Time = COALESCE(
(
SELECT
MIN(
CASE t1.Action
WHEN 'Buy' THEN CASE WHEN (t2._1M_55 >= t2.Low AND t2._1M_55 < t2.Open) THEN t2.Date_Time END
WHEN 'Sell' THEN CASE WHEN (t2._1M_55 <= t2.High AND t2._1M_55 < t2.Open) THEN t2.Date_Time END
END
)
FROM GBPAUD_DATA t2
WHERE t2.Date_Time >= t1.Open_Date AND t2.Date_Time <= t1.New_Closing_Time
),
t1._1m_L3_Time
);
UPDATE GBPAUD_TA
SET _1m_L3_Price = (SELECT _1M_55
FROM GBPAUD_DATA
WHERE Date_Time = GBPAUD_TA._1m_L3_Time)
where EXISTS (SELECT _1M_55
FROM GBPAUD_DATA
WHERE Date_Time = GBPAUD_TA._1m_L3_Time)
Here is my query showing the variables that I would need to automatically insert:
UPDATE GBPAUD_TA AS t1
SET Variable1 = COALESCE(
(
SELECT
MIN(
CASE t1.Action
WHEN 'Buy' THEN CASE WHEN (t2.Variable2 >= t2.Low AND t2.Variable2< t2.Open) THEN t2.Date_Time END
WHEN 'Sell' THEN CASE WHEN (t2.Variable2 <= t2.High AND t2.Variable2< t2.Open) THEN t2.Date_Time END
END
)
FROM GBPAUD_DATA t2
WHERE t2.Date_Time >= t1.Open_Date AND t2.Date_Time <= t1.New_Closing_Time
),
t1.Variable1
);
UPDATE GBPAUD_TA
SET Variable3 = (SELECT Variable2
FROM GBPAUD_DATA
WHERE Date_Time = GBPAUD_TA.Variable1)
where EXISTS (SELECT Variable2
FROM GBPAUD_DATA
WHERE Date_Time = GBPAUD_TA.Variable1)
I have a total of 3 variables.
Based upon googling and reading, I found a possible way using host variables: put "?" in place of each variable, combine the variables into a tuple, and then use executemany().
I tried this, but it did not work. It gave me an error:
"cursor.executemany(sql_update_query, SLTuple)
OperationalError: near "?": syntax error"
So what should I do? Any guidance is much appreciated!
Found the answer after I figured out the proper terminology: string formatting and interpolation. Found the answer here.
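For anyone landing here: SQLite's "?" placeholders can only bind values, never identifiers such as column names, which is why executemany() raised the syntax error. A minimal sketch of the string-formatting approach, assuming the column-name pairs are hard-coded and trusted (never user input):

import sqlite3

conn = sqlite3.connect("gbpaud.db")  # hypothetical database file
cur = conn.cursor()

# identifiers are interpolated into the SQL text with str.format();
# the second UPDATE can be templated the same way with Variable3
template = """
UPDATE GBPAUD_TA AS t1
SET {v1} = COALESCE(
    (SELECT MIN(CASE t1.Action
        WHEN 'Buy'  THEN CASE WHEN t2.{v2} >= t2.Low  AND t2.{v2} < t2.Open THEN t2.Date_Time END
        WHEN 'Sell' THEN CASE WHEN t2.{v2} <= t2.High AND t2.{v2} < t2.Open THEN t2.Date_Time END
        END)
     FROM GBPAUD_DATA t2
     WHERE t2.Date_Time >= t1.Open_Date AND t2.Date_Time <= t1.New_Closing_Time),
    t1.{v1})
"""
column_sets = [("_1m_L3_Time", "_1M_55")]  # extend with the other column pairs
for v1, v2 in column_sets:
    cur.execute(template.format(v1=v1, v2=v2))
conn.commit()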
I want to convert the queries below to a Spark DataFrame (I am pretty new to Spark):
-- Creating group number
select distinct *, DENSE_RANK() OVER(ORDER BY person_id, trust_id) AS group_number;
-- This is what I got so far for above
df = self.spark.sql("select person_id, trust_id, insurance_id, amount, time_of_app, place_of_app from {}".format(self.tables['people']))
df = df.withColumn("group_number", dense_rank().over(Window.partitionBy("person_id", "trust_id").OrderBy("person_id", "trust_id")))
-- Different query 1
where group_number in (select group_number from etl_table_people where code like 'H%') group by group_number having count(distinct amount) > 1;
-- Different query 2
where insurance_id = 'V94.12'
group by group_number having count(distinct amount) = 2;
What you are looking for is Spark's window specification (WindowSpec) functions.
val windowSpec = Window.partitionBy("person_id", "trust_id").orderBy(col("person_id").desc, col("trust_id").desc)
df.withColumn("group_number", dense_rank() over windowSpec)
And you build your DataFrame in Spark from whatever your data source is; you can refer to this if your source is Hive.
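Since the question itself uses PySpark, here is a minimal Python sketch of the same idea, assuming a single DataFrame df that also carries the code, amount, and insurance_id columns used in the follow-up queries:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# DENSE_RANK() OVER (ORDER BY ...) has no PARTITION BY, so the window is
# unpartitioned (Spark will warn that this moves all rows to one partition)
w = Window.orderBy("person_id", "trust_id")
df = df.withColumn("group_number", F.dense_rank().over(w))

# query 1: groups that contain an 'H%' code and more than one distinct amount
h_groups = df.filter(F.col("code").like("H%")).select("group_number").distinct()
q1 = (df.join(h_groups, "group_number")
        .groupBy("group_number")
        .agg(F.countDistinct("amount").alias("amounts"))
        .filter("amounts > 1"))

# query 2: a fixed insurance_id and exactly two distinct amounts per group
q2 = (df.filter(F.col("insurance_id") == "V94.12")
        .groupBy("group_number")
        .agg(F.countDistinct("amount").alias("amounts"))
        .filter("amounts = 2"))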
The following HiveQL code takes about 3 to 4 hours to run, and I am trying to convert it into efficient PySpark DataFrame code. Any input from DataFrame experts is much appreciated.
INSERT overwrite table dlstage.DIBQtyRank_C11 PARTITION(fiscalyearmonth)
SELECT * FROM
(SELECT a.matnr, a.werks, a.periodstartdate, a.fiscalyear, a.fiscalmonth,b.dy_id, MaterialType,(COALESCE(a.salk3,0)) salk3,(COALESCE(a.lbkum,0)) lbkum, sum(a.valuatedquantity) AS valuatedquantity, sum(a.InventoryValue) AS InventoryValue,
rank() over (PARTITION by dy_id, werks, matnr order by a.max_date DESC) rnk, sum(stprs) stprs, max(peinh) peinh, fcurr,fiscalyearmonth
FROM dlstage.DIBmsegFinal a
LEFT JOIN dlaggr.dim_fiscalcalendar b ON a.periodstartdate=b.fmth_begin_dte WHERE a.max_date >= b.fmth_begin_dte AND a.max_date <= b.dy_id and
fiscalYearmonth = concat(fyr_id,lpad(fmth_nbr,2,0))
GROUP BY a.matnr, a.werks,dy_id, max_date, a.periodstartdate, a.fiscalyear, a.fiscalmonth, MaterialType, fcurr, COALESCE(a.salk3,0), COALESCE(a.lbkum,0),fiscalyearmonth) a
WHERE a.rnk=1 and a.fiscalYear = '%s'" %(year) + " and a.fiscalmonth ='%s'" %(mnth)
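Not a full conversion, but the expensive rank-and-filter portion maps directly onto a window function. A minimal PySpark sketch, assuming the join of DIBmsegFinal with dim_fiscalcalendar has already been materialized as a DataFrame named joined, and that year and mnth hold the substituted values:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank within (dy_id, werks, matnr) by max_date descending, keep rank 1
w = Window.partitionBy("dy_id", "werks", "matnr").orderBy(F.col("max_date").desc())
result = (joined
          .withColumn("rnk", F.rank().over(w))
          .filter((F.col("rnk") == 1) &
                  (F.col("fiscalyear") == year) &
                  (F.col("fiscalmonth") == mnth)))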
I get an error when adding a query that runs in SQL Developer but not in MS Query; it seems not to like my nested query.
Code I am using:
SELECT ORDER_DATE
,SALES_ORDER_NO
,CUSTOMER_PO_NUMBER
,DELIVER_TO
,STATUS
,ITEM_NUMBER
,DESCRIPTION
,ORD_QTY
,SUM(QUANTITY) AS ON_HAND
,PACKAGE_ID
,PACKAGE_STATUS
,MAX(TRAN_DATE) AS LAST_TRANSACTION
,MIN(DAYS) AS DAYS
FROM (
SELECT TRUNC(SH.ORDER_DATE) AS ORDER_DATE
,SH.SALES_ORDER_NO
,SH.CUSTOMER_PO_NUMBER
,SH.SHIP_CODE AS DELIVER_TO
,SH.STATUS
,SB.ITEM_NUMBER
,IM.DESCRIPTION
,SB.ORD_QTY
,BID.QUANTITY
,SPM.PACKAGE_ID
,CASE
WHEN SPM.SHIPPED = 'Y'
THEN 'SHIPPED'
WHEN SPM.STATUS = 'C'
THEN 'PACKED'
WHEN BID.QUANTITY IS NOT NULL
THEN 'AVAILABLE'
WHEN BID.QUANTITY IS NULL
THEN 'UNAVAILABLE'
END AS PACKAGE_STATUS
,CASE
WHEN SPM.SHIPPED = 'Y'
THEN TRUNC(SPM.BILLING_DATE)
WHEN SPM.STATUS = 'C'
THEN TRUNC(SPM.END_TIME)
WHEN BID.QUANTITY IS NOT NULL
THEN TRUNC(BID.ACTIVATION_TIME)
END AS TRAN_DATE
,CASE
WHEN SPM.SHIPPED = 'Y'
THEN ROUND(SYSDATE - SPM.BILLING_DATE, 0)
WHEN SPM.STATUS = 'C'
THEN ROUND(SYSDATE - SPM.END_TIME, 0)
WHEN BID.QUANTITY IS NOT NULL
THEN ROUND(SYSDATE - BID.ACTIVATION_TIME, 0)
END AS DAYS
FROM SO_HEADER SH
LEFT JOIN SO_BODY SB ON SB.SO_HEADER_TAG = SH.SO_HEADER_TAG
LEFT JOIN SO_PACKAGE_MASTER SPM ON SPM.PACKAGE_ID = SB.PACKAGE_ID
LEFT JOIN ITEM_MASTER IM ON IM.ITEM_NUMBER = SB.ITEM_NUMBER
LEFT JOIN V_BIN_ITEM_DETAIL BID ON BID.ITEM_NUMBER = SB.ITEM_NUMBER
WHERE SH.ORDER_TYPE = 'MSR'
AND (
SB.REASON_CODE IS NULL
OR SB.REASON_CODE NOT LIKE 'CANCEL%'
)
AND (
IM.DESCRIPTION NOT LIKE '%CKV%'
OR IM.DESCRIPTION IS NULL
AND IM.ITEM_NUMBER IS NOT NULL
)
)
WHERE PACKAGE_STATUS <> 'SHIPPED'
GROUP BY ORDER_DATE
,SALES_ORDER_NO
,CUSTOMER_PO_NUMBER
,DELIVER_TO
,STATUS
,ITEM_NUMBER
,DESCRIPTION
,ORD_QTY
,PACKAGE_ID
,PACKAGE_STATUS
ORDER BY (
CASE PACKAGE_STATUS
WHEN 'AVAILABLE'
THEN 1
WHEN 'UNAVAILABLE'
THEN 2
WHEN 'PACKED'
THEN 3
END
)
,LAST_TRANSACTION;
Is there any option I can select that will allow me to run this query?
I am a newbie in Spark. I have created a DataFrame using a SQL query inside PySpark, and I want to save it as a permanent table to take advantage of it in future work. I used the code below
spark.sql("select b.ENTITYID as ENTITYID, cm.BLDGID as BldgID,cm.LEASID as LeaseID,coalesce(l.SUITID,(select EmptyDefault from EmptyDefault)) as SuiteID,(select CurrDate from CurrDate) as TxnDate,cm.INCCAT as IncomeCat,'??' as SourceCode,(Select CurrPeriod from CurrPeriod)as Period,coalesce(case when cm.DEPARTMENT ='#' then 'null' else cm.DEPARTMENT end, null) as Dept,'Lease' as ActualProjected ,fnGetChargeInd(cm.EFFDATE,cm.FRQUENCY,cm.BEGMONTH,(select CurrPeriod from CurrPeriod))*coalesce (cm.AMOUNT,0) as ChargeAmt,0 as OpenAmt,null as Invoice,cm.CURRCODE as CurrencyCode,case when ('PERIOD.DATACLSD') is null then 'Open' else 'Closed' end as GLClosedStatus,'Unposted'as GLPostedStatus ,'Unpaid' as PaidStatus,cm.FRQUENCY as Frequency,0 as RetroPD from CMRECC cm join BLDG b on cm.BLDGID =b.BLDGID join LEAS l on cm.BLDGID =l.BLDGID and cm.LEASID =l.LEASID and (l.VACATE is null or l.VACATE >= ('select CurrDate from CurrDate')) and (l.EXPIR >= ('select CurrDate from CurrDate') or l.EXPIR < ('select RunDate from RunDate')) left outer join PERIOD on b.ENTITYID = PERIOD.ENTITYID and ('select CurrPeriod from CurrPeriod')=PERIOD.PERIOD where ('select CurrDate from CurrDate')>=cm.EFFDATE and (select CurrDate from CurrDate) <= coalesce(cm.EFFDATE,cast(date_add(( select min(cm2.EFFDATE) from CMRECC cm2 where cm2.BLDGID = cm.BLDGID and cm2.LEASID = cm.LEASID and cm2.INCCAT = cm.INCCAT and 'cm2.EFFDATE' > 'cm.EFFDATE'),-1) as timestamp) ,case when l.EXPIR <(select RunDate from RunDate)then (Select RunDate from RunDate) else l.EXPIR end)").write.saveAsTable('FactChargeTempTable')
to make the permanent table, but I am getting this error:
Job aborted due to stage failure: Task 11 in stage 73.0 failed 1 times, most recent failure: Lost task 11.0 in stage 73.0 (TID 2464, localhost): java.lang.RuntimeException: Unsupported data type NullType.
I have no idea why it is happening or how I could solve it. Kindly guide me.
Thank you
kalyan
The error you got, Unsupported data type NullType, indicates that one of the columns of the table you are saving has a NULL column, i.e. a column whose type could only be inferred as NullType. To work around this issue, do a NULL check on the columns of your table and ensure that none of them is all NULL.
Note that if there is even one non-NULL row in a column that is mostly NULLs, Spark is usually able to infer a concrete datatype (e.g. StringType, IntegerType, etc.) instead of NullType.
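A quick way to find the offending column is to inspect the schema; a minimal sketch, assuming the DataFrame is named df:

from pyspark.sql.types import NullType

# columns whose type could only be inferred as NullType (all values NULL);
# in the query above, 'null as Invoice' produces exactly such a column
null_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NullType)]
print(null_cols)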
I have met this error when running a Spark SQL application.
You can cast the NULL to a string first, like this:
lit(null).cast("string")
@Denny Lee is correct. Someone opened a JIRA for this issue and got a similar response. One of the comments suggests the workaround below:
Michael: Yeah, Parquet doesn't have a concept of a null type. I'd probably suggest they cast the null to a type, CAST(NULL AS INT), if they really want to do this, but really you should just omit the column.
I have faced this error too. All values in a specific column of my table were null, and I was doing the below:
Select
null as column_name
I fixed it just by not selecting that column at all. If you don't select that column in your final select, it will be populated as null anyway, without throwing this error.