I'm trying to execute a CTE query via Databricks and am getting a syntax error for the SQL query. Is there any other way to use a CTE from Databricks?
Thanks in advance.
pushdown_query = """(WITH t(x, y) AS (SELECT 1, 2)
SELECT * FROM t WHERE x = 1 AND y = 2) as Test """
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
Error:-
"com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword WITH."
Just remove the "as Test" alias:
WITH t(x, y) AS (SELECT 1, 2) SELECT * FROM t WHERE x = 1 AND y = 2
This will work.
This is impossible with spark.read.jdbc. When you use the dbtable or query parameters, the effect is to insert your SQL code as a subquery inside a larger SELECT statement.
The Spark docs for the dbtable parameter are poor, IMHO, but you can see where this is heading in the docs for the query option:
As an example, spark will issue a query of the following form to the JDBC Source.
SELECT <columns> FROM (<user_specified_query>) spark_gen_alias
The thing we specify gets turned into a subquery.
Combine that with the fact that a WITH clause must not be included in a subquery. I'm still looking for proof of that claim in the Microsoft docs.
The closest I can find is the following doc, which says WITH must be followed by a single SELECT statement: https://github.com/uglide/azure-content/blob/master/articles/sql-data-warehouse/sql-data-warehouse-migrate-code.md
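To make the wrapping concrete, here is a minimal Python sketch that only builds strings and never contacts a database. The helper names wrap_like_spark and has_nested_with are hypothetical, and the exact statement Spark generates may differ by version. The derived-table rewrite is one possible way to express the same query without WITH; it is an assumption of this sketch, not something confirmed by the answers here.

```python
def wrap_like_spark(dbtable: str) -> str:
    """Approximate how the user-supplied text ends up inside an outer SELECT.

    Assumption: this mirrors the documented shape
    SELECT <columns> FROM (<user_specified_query>) spark_gen_alias;
    the real statement is built inside Spark's JDBC data source.
    """
    return f"SELECT * FROM {dbtable}"


def has_nested_with(statement: str) -> bool:
    """Return True if a WITH clause opens directly inside a parenthesis."""
    return "(WITH " in statement.upper()


# The original pushdown query: the CTE lands inside a subquery.
cte_query = (
    "(WITH t(x, y) AS (SELECT 1, 2) "
    "SELECT * FROM t WHERE x = 1 AND y = 2) AS Test"
)

# The same logic written as a derived table, with no WITH clause.
derived_query = (
    "(SELECT * FROM (SELECT 1 AS x, 2 AS y) AS t "
    "WHERE x = 1 AND y = 2) AS Test"
)

print(wrap_like_spark(cte_query))      # WITH appears right after "FROM ("
print(wrap_like_spark(derived_query))  # no WITH anywhere in the statement
```

The first printed statement is the shape that SQL Server rejects with "Incorrect syntax near the keyword WITH"; the second keeps everything as plain nested SELECTs.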
I think CTE functionality is stripped out of Azure SQL Data Warehouse, which is now known as Azure Synapse. You may be able to rewrite some of your queries to do what you need without using the standard CTE syntax.
These links should shed some light on the situation.
https://github.com/uglide/azure-content/blob/master/articles/sql-data-warehouse/sql-data-warehouse-migrate-code.md
https://learn.microsoft.com/en-us/azure/azure-sql/database/transact-sql-tsql-differences-sql-server
Related
I am trying to execute the following SQL clause using Databricks SQL:
DELETE FROM prod_gbs_gpdi.bronze_data.sapex_ap_posted AS HISTORICAL_DATA
WHERE
HISTORICAL_DATA._JOB_SOURCE_FILE = (SELECT MAX(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
The intention of the query is to delete a set of rows in a historical data table based on a value present in a column of new data table.
For reasons that I cannot understand it is raising an error like:
Error in SQL statement: AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
(HISTORICAL_DATA._JOB_SOURCE_FILE IN (listquery()))
in operator DeleteCommandEdge
It seems it is not accepting a subquery inside the WHERE clause. That's odd to me, because according to the Databricks documentation this is allowed.
I even tried other types of predicates, like:
(SELECT FIRST(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
(SELECT DISTINCT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
IN (SELECT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
None of them lets the query execute successfully.
What's even odder to me is that I was able to accomplish a similar case with a slightly different query, as can be seen in this link.
I created demo_table1 and demo_table2 for testing and wrote the following query, which serves a similar purpose. I did not use double aliases; it is a straightforward query with a subquery. The result may also depend on the kind of data frame in use; I used a normal pandas data frame. It works fine for me.
delete from demo_table1 as t1 where t1.age = (select min(t2.age) from demo_table2 as t2);
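If the subquery form keeps failing in your environment, one possible workaround (not taken from the answer above) is to evaluate the scalar first and then run the delete with a plain literal. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so it can run anywhere; the name column and the sample rows are made up for illustration, and on Databricks the same two steps would go through spark.sql instead.

```python
import sqlite3

# In-memory stand-in database; table names follow the answer above,
# while the "name" column and the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE demo_table1 (name TEXT, age INTEGER);
    CREATE TABLE demo_table2 (name TEXT, age INTEGER);
    INSERT INTO demo_table1 VALUES ('a', 20), ('b', 30), ('c', 40);
    INSERT INTO demo_table2 VALUES ('x', 30), ('y', 50);
    """
)

# Step 1: evaluate the scalar subquery on its own.
min_age = conn.execute("SELECT MIN(age) FROM demo_table2").fetchone()[0]

# Step 2: run the delete with that value bound as a parameter,
# so the DELETE itself no longer contains a subquery.
conn.execute("DELETE FROM demo_table1 WHERE age = ?", (min_age,))
conn.commit()

remaining = conn.execute(
    "SELECT name, age FROM demo_table1 ORDER BY age"
).fetchall()
print(remaining)
```

Splitting the statement costs one extra round trip, but it keeps the DELETE predicate a simple comparison against a constant.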
I figured out that running the following code does a full scan of the table:
select max(run_id) from database.table
So I switched my code to work with the following syntax:
select max(run_id) from "database"."table$partitions"
This query works great on Athena, but when I try to execute it with Spark SQL I get the following error:
mismatched input '"database"' expecting <EOF>(line 1, pos 24)
It seems like Spark SQL treats the quotes as the end of the query.
Any ideas on how to make this query work in Spark SQL?
Thanks
My solution for this problem was:
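from pyspark.sql import functions as f

# List the partitions, pull out the value of one partition column, and take the max.
sql_context.sql(f'show partitions {table_name}').agg(
    f.max(f.regexp_extract('partition', rf'''{partition_name}=([^/]+)''', 1))).collect()[0][0]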
Advantage: it doesn't do a full scan of the table.
Disadvantage: it scans all partition levels, and the code isn't elegant.
Anyway, that's the best I found.
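For readers unfamiliar with the output of SHOW PARTITIONS, here is a small plain-Python sketch of what the regular expression in that snippet extracts. The function name max_partition_value and the sample partition strings are hypothetical; it assumes each row looks like key=value pairs separated by slashes.

```python
import re


def max_partition_value(partition_rows, partition_name):
    """Return the largest value of one partition column, compared as strings."""
    # Same pattern as in the Spark snippet: capture everything up to the next "/".
    pattern = re.compile(rf"{re.escape(partition_name)}=([^/]+)")
    values = []
    for row in partition_rows:
        match = pattern.search(row)
        if match:  # rows without this partition column are skipped
            values.append(match.group(1))
    return max(values) if values else None


# Hypothetical rows, shaped like the "partition" column of SHOW PARTITIONS.
sample_rows = [
    "run_id=20230101/region=eu",
    "run_id=20230315/region=us",
    "run_id=20230210/region=eu",
]
latest = max_partition_value(sample_rows, "run_id")
print(latest)
```

Note that the extracted values are strings, so the maximum is lexicographic: "9" sorts after "10". If run_id is numeric and not zero-padded, cast it to a number before taking the max.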
I am working on converting a set of Hive queries to run on Spark. So far I have gotten a nice performance boost by creating TEMP tables where Hive was previously creating new tables on disk. I have run into a query where the TEMP table is referenced twice in the same query, and this causes a failure. I have tried to write my temp table to disk, but I notice that the "saveAsTable" function is deprecated, and when I try to use it my program fails due to executor timeouts. I would prefer not to write to disk anyway. I have considered rewriting the Hive query, but would prefer to leave it alone. Do I have any other options?
Sample Query
SELECT d.LEVEL_1,
d.LEVEL_2,
d.CODE
FROM
( SELECT DISTINCT CP.LEVEL_1_ID,
LEVEL_2_ID,
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 CP
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
AND D.LEVEL_1_ID = F.LEVEL_1_ID
AND D.LEVEL_2_ID = F.LEVEL_2_ID
I have reduced the query a bit to try and show the basic concept, but may have broken it during reduction.
Your query has multiple parts. Did you try running -
first
SELECT CP.LEVEL_1_ID,
LEVEL_2_ID,
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 CP
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
AND D.LEVEL_1_ID = F.LEVEL_1_ID
AND D.LEVEL_2_ID = F.LEVEL_2_ID
second
SELECT DISTINCT CP.LEVEL_1_ID,
LEVEL_2_ID,
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 CP
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
AND D.LEVEL_1_ID = F.LEVEL_1_ID
AND D.LEVEL_2_ID = F.LEVEL_2_ID
Also, the second one is your answer. You do not need another SELECT on top of it. Note that d.CODE is missing from the second SELECT.
I have run similar self-joins in Spark and they work.
Does anyone know of a way in Spark SQL to execute a string variable like the following?
INSERT TableA (Col1,Col2) SELECT Col1,Col2 FROM TableB
I understand that I can obviously write this statement directly. However, I am using a workflow engine where my INSERT/SELECT statement is held in a string variable. If that is not possible, I assume I should use spark-submit. I was looking for other options.
I'm not sure what environment you're in. If this is a Spark application or a Spark shell you always provide queries as strings:
val query = "INSERT TableA (Col1,Col2) SELECT Col1,Col2 FROM TableB"
sqlContext.sql(query)
(See http://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically.)
Spark SQL also supports Hive queries:
insert overwrite table usautomobiles select * from sourcedata
Go Through this link
I'm writing because I have a problem with Cassandra. After importing the data from Pentaho as shown here
http://wiki.pentaho.com/display/BAD/Write+Data+To+Cassandra
when I try to execute the query
Select * FROM mytable;
Cassandra gives me an error message
Syntax error at position 7: unexpected "*" for Select * FROM mytable;.
and doesn't show the results of the query. Why? What does that error mean?
The steps I follow are:
start the cassandra-cli utility;
use the keyspace added from Pentaho (use tpc_h);
run a select to show the data added (Select * FROM mytable;)
The cassandra-cli does not support any CQL version. It has its own syntax, which you can find on DataStax's website.
Just for clarity, in CQL, to select everything from a table (aka column family) called mytable stored in a keyspace called myks, you would use:
SELECT * FROM myks.mytable;
The equivalent in cassandra-cli would roughly* be:
USE myks;
LIST mytable;
* In the cli you are limited to selecting the first 100 rows. If this is a problem, you can use the limit clause to specify how many rows you want:
LIST mytable limit 10000;
As for this:
in Cassandra I have read that it isn't possible to do joins as in SQL; isn't there a shortcut to work around this disadvantage?
There is a reason why joins don't exist in Cassandra. It's the same reason that C* isn't ACID compliant: it sacrifices that functionality for its performance and scalability. So it's not a disadvantage; you just need to rethink your model if you need joins. Also take a look at this question / answer.