Spark: Use Temporary Table Twice in Query? - apache-spark

I am working on converting a set of Hive queries to run on spark. So far I have gotten a nice performance boost by creating TEMP tables, where Hive was previously creating new tables on disc. I have run into a query where the TEMP table is being called twice in the same query and these causes a failure. I have tried to write my temp table to disc, but I notice that the "saveAsTable" function is deprecated, and when I try to use it my program fails due to executor timeouts. I would prefer to not have to write to disc anyway. I have considered rewriting the hive query, but would prefer to leave it alone. Do I have any other options?
Sample Query
SELECT d.LEVEL_1,
d.LEVEL_2,
d.CODE
FROM
( SELECT DISTINCT CP.LEVEL_1_ID,
LEVEL_2_ID,
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 CP
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
AND D.LEVEL_1_ID = F.LEVEL_1_ID
AND D.LEVEL_2_ID = F.LEVEL_2_ID
I have reduced the query a bit to try and show the basic concept, but may have broken it during reduction.

Your query has multiple parts. Did you try running -
first
SELECT CP.LEVEL_1_ID,
LEVEL_2_ID,
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 CP
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
AND D.LEVEL_1_ID = F.LEVEL_1_ID
AND D.LEVEL_2_ID = F.LEVEL_2_ID
second
SELECT DISTINCT CP.LEVEL_1_ID,
LEVEL_2_ID,
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 CP
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
AND D.LEVEL_1_ID = F.LEVEL_1_ID
AND D.LEVEL_2_ID = F.LEVEL_2_ID
Also, the second one is your answer. You need not to do another select on top of that. You are missing in second select d.CODE.
I have ran similar self-joins in spark and it works.

Related

Databricks SQL nondeterministic expressions using DELETE FROM

I am trying to execute the following SQL clause using Databricks SQL:
DELETE FROM prod_gbs_gpdi.bronze_data.sapex_ap_posted AS HISTORICAL_DATA
WHERE
HISTORICAL_DATA._JOB_SOURCE_FILE = (SELECT MAX(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
The intention of the query is to delete a set of rows in a historical data table based on a value present in a column of new data table.
For reasons that I cannot understand it is raising an error like:
Error in SQL statement: AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
(HISTORICAL_DATA._JOB_SOURCE_FILE IN (listquery()))
in operator DeleteCommandEdge
It seems it is not accepting a subquery inside the where clause. That's odd for me, as in the Databricks documentation Link it is acceptable.
I even tried other types of predicates, like:
(SELECT FIRST(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
(SELECT DISTINCT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
IN (SELECT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
None of them seems to take effect in executing the query successfully.
What's even odd for me is that I was able to accomplish a similar case with a slightly different query, as it can be seen in this link.
I have created demo_table1 & demo_table2 for querying purpose. I have created the following query carrying the similar purpose. I haven’t considered double aliases and have given straight query using subquery, it also depends on data frame in usage use a normal pandas data frame. it works fine for me.
delete from demo_table1 as t1 where t1.age = (select min(t2.age) from demo_table2 as t2);

AnalysisException: Operation not allowed: `CREATE TABLE LIKE` is not supported for Delta tables;

create table if not exists map_table like position_map_view;
While using this it is giving me operation not allowed error
As pointed in documentation, you need to use CREATE TABLE AS, just use LIMIT 0 in SELECT:
create table map_table as select * from position_map_view limit 0;
I didn't find an easy way of getting CREATE TABLE LIKE to work, but I've got a workaround. On DBR in Databricks you should be able to use SHALLOW CLONE to do something similar:
%sql
CREATE OR REPLACE TABLE $new_database.$new_table
SHALLOW CLONE $original_database.$original_table`;
You'll need to replace $templates manually.
Notes:
This has an added side-effect of preserving the table content in case you need it.
Ironically, creating empty table is much harder and involves manipulating show create table statement with custom code

Cannot execute CTE query via Azure DataBricks

I'm trying to execute CTE query via data bricks getting syntax error for SQL query. Is there any other to use CTE from Data bricks?
Thanks in Advance .
pushdown_query = """(WITH t(x, y) AS (SELECT 1, 2)
SELECT * FROM t WHERE x = 1 AND y = 2) as Test """
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
Error:-
"com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword WITH."
Just remove the as test syntax:
WITH t(x, y) AS (SELECT 1, 2) SELECT * FROM t WHERE x = 1 AND y = 2
This will work.
It is impossible using spark.read jdbc. When you use dbtable or query parameters, effect is to insert your SQL code as a subquery inside larger SELECT statement.
Spark docs for dbtable param are poor, IMHO, but you can see where this heading in query doc.
As an example, spark will issue a query of the following form to the JDBC Source.
SELECT <columns> FROM (<user_specified_query>) spark_gen_alias
The thing we specify gets turned into a subquery.
Add that with the fact that a WITH clause must not be included in a subquery. I'm looking for proof of that claim in the Microsoft docs.
Closest I can find is in doc referred to above,
which says WITH must be followed by single SELECT statement. https://github.com/uglide/azure-content/blob/master/articles/sql-data-warehouse/sql-data-warehouse-migrate-code.md
I think CTE functionality is stripped out of Azure SQL Server, which is also known as Synapse. You may be able to re-write some of your queries to do what you need, without using the standard CTE syntax.
These links should shed some light on the situation.
https://github.com/uglide/azure-content/blob/master/articles/sql-data-warehouse/sql-data-warehouse-migrate-code.md
https://learn.microsoft.com/en-us/azure/azure-sql/database/transact-sql-tsql-differences-sql-server

What is the difference between dynamic.partition=True and dynamic.partition.mode = nonstrict?

Spark 2.0 - pyspark
I seen the following 2 properties paired. What is the difference between them?
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=non-strict;
I know what the outcome is when they are used - you can use dynamic partitioning to load/create multiple partitions, but I don't know the difference between these two similar commands.
When I was running this code
input_field_names=['id','code','num']
df \
.select(input_field_names) \
.write \
.mode('append')\
.insertInto('test_insert_into_partition')
I got an error message that says Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
Using spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict") the code works. It doesn't require me to use the other one.
Why don't I need to set SET hive.exec.dynamic.partition=true; and what else should I know to choose which one to use.
Although there is much to google, here is a short answer.
If you want to insert dynamically into Hive partitions both values need to be set and you can then load many partitions in one go:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict
create table tblename (h string,m string,mv double,country string)partitioned by (starttime string) location '/.../...'
INSERT overwrite table tblename PARTITION(starttime) SELECT h,m,mv,country,starttime from tblename2
Otherwise you need to do like this, setting the partition col val yourself / explicity:
INSERT into table tblename PARTITION(starttime='2017-08-09') SELECT h,m,mv,country from tblname2 where to_date(starttime)='2017-08-09'
The purpose of default value of 'strict' for
hive.exec.dynamic.partition.mode is there to prevent a user from
accidentally overwriting all the partitions, i.e. to avoid data loss.
So, there is not a situation of difference, rather a situation of caution, a but like the safety catch on a firearm, as it were.

USQL nested query performance

I have a USQL query that runs fine on it's own against 400M records in a managed table.
But during development, I don't want to run it against all records all the time, so I pop a where clause in, run it for a tiny subsection of data, and it completes in around 2 minutes (#5 AUs), writing out results to a tsv in my data lake.
Happy with that.
However, I now want to use it as the source for a second query and further processing.
So I create a view with the original USQL (minus the where clause).
Then to test, a new script :
'Select * from MyView WHERE <my original test filter>'.
Now I was expecting that to execute in around the same time as the original raw query. But instead I got to 4 minutes, only 10% through the plan, and cancelled - something is not right.
No expert at reading Job Graphs, but ...
The original script kicks off with 2* 'Extract Combine partition' both reading a couple of hundered MBs, my select on the saved View is reading over 100GB !!
So it is not taking the where clause into account at all at this stage.
Obviously this shows how little I yet understand about how DLA works behind the scenes !
Would someone please help me understand (a) what is going on and (b) a path forward to get the behavior I need ?
Currently having a play with stored procedures to store the 1st result in a table and then call the second query against that - but just seems overkill compared with 'traditional' SQL Server ?!?
All pointers & hints appreciated !
Many Thanks
Original Base Query:
CREATE VIEW IF NOT EXISTS Play.[M3_CycleStartPoints]
AS
//#BASE =
SELECT ROW_NUMBER() OVER (PARTITION BY A.[CTNNumber] ORDER BY A.[SeqNo]) AS [CTNCycleNo], A.[CTNNumber], A.[SeqNo], A.[BizstepDescription], A.[ContainerStatus], A.[FillStatus]
FROM
[Play].[RawData] AS A
LEFT OUTER JOIN
(
SELECT [CTNNumber],[SeqNo]+1 AS [SeqNo],[FillStatus],[ContainerStatus],[BizstepDescription]
FROM [Play].[RawData]
WHERE [FillStatus] == "EMPTY" AND [AssetUsage] == "CYLINDER"
) AS B
ON A.[CTNNumber] == B.[CTNNumber] AND A.[SeqNo] == B.[SeqNo]
WHERE (
(A.[FillStatus] == "FULL" AND
A.[AssetUsage] == "CYLINDER" AND
B.[CTNNumber] == A.[CTNNumber]
) OR (
A.[SeqNo] == 1
)
);
//AND A.[CTNNumber] == "BE52XH7";
//Only used to test when running script as stand-alone & output to tsv
Second Query
SELECT *
FROM [Play].[M3_CycleStartPoints]
WHERE [CTNNumber] == "BE52XH7";
Ok, I think I've got this, or at least in part.
Table valued Functions
http://www.sqlservercentral.com/articles/U-SQL/146839/
to allow the passing of an argument to a view and return the result.
Would be interested in finding some reading material around this subject still though.
Coming from a T-SQL world, seems that there are some fundamental differences I'm still tripping over.

Resources