Databricks auto merge schema

Does anyone know how to resolve this error?
I have put the following before my merge, but it seems not to like it.
%sql set spark.databricks.delta.schema.autoMerge.enabled = true
Also, the reason for putting this in was that my notebook was failing on schema changes to a Delta Lake table. One of the tables I am loading into has an additional column, and I thought Databricks was able to auto-merge schema changes.

The code works fine in my environment; I'm using Databricks Runtime 10.4.

TL;DR: add a semicolon to the end of each separate SQL statement:
set spark.databricks.delta.schema.autoMerge.enabled = true;
The error is actually a more generic SQL error; the IllegalArgumentException is a clue - though not a very helpful one :)
I was able to reproduce your error:
set spark.databricks.delta.schema.autoMerge.enabled = true
INSERT INTO records SELECT * FROM students
gives: Error in SQL statement: IllegalArgumentException: spark.databricks.delta.schema.autoMerge.enabled should be boolean, but was true
and was able to fix it by adding a ; to the end of the first line:
set spark.databricks.delta.schema.autoMerge.enabled = true;
INSERT INTO records SELECT * FROM students
succeeds.
Alternatively, you could run the SET in a different cell.
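To see why the missing semicolon matters: without a terminator, the SET statement appears to swallow everything that follows as part of the config value, which is why the parser complains the value isn't a boolean. Here is a minimal illustrative sketch in plain Python (this is a toy model of the behavior, not Spark's actual parser):

```python
# Toy model of SET statement parsing -- NOT Spark's real parser.
# Without a ';' terminator, everything after '=' (including the next
# statement) is read as the config value, so it is no longer "true".
def parse_set_value(sql: str) -> str:
    _, _, value = sql.partition("=")   # everything after the first '='
    value = value.split(";", 1)[0]     # a ';' terminates the value
    return value.strip()

fused = ("set spark.databricks.delta.schema.autoMerge.enabled = true\n"
         "INSERT INTO records SELECT * FROM students")
terminated = ("set spark.databricks.delta.schema.autoMerge.enabled = true; "
              "INSERT INTO records SELECT * FROM students")

print(parse_set_value(terminated))           # -> true
print(parse_set_value(fused) == "true")      # -> False
```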

Related

Databricks SQL nondeterministic expressions using DELETE FROM

I am trying to execute the following SQL clause using Databricks SQL:
DELETE FROM prod_gbs_gpdi.bronze_data.sapex_ap_posted AS HISTORICAL_DATA
WHERE
HISTORICAL_DATA._JOB_SOURCE_FILE = (SELECT MAX(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
The intention of the query is to delete a set of rows in a historical data table based on a value present in a column of new data table.
For reasons that I cannot understand, it raises an error like:
Error in SQL statement: AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
(HISTORICAL_DATA._JOB_SOURCE_FILE IN (listquery()))
in operator DeleteCommandEdge
It seems it is not accepting a subquery inside the WHERE clause. That's odd to me, as the Databricks documentation says it is acceptable.
I even tried other types of predicates, like:
(SELECT FIRST(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
(SELECT DISTINCT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
IN (SELECT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
None of them made the query execute successfully.
What's even odder is that I was able to accomplish a similar case with a slightly different query, as can be seen in this link.
I created demo_table1 and demo_table2 for testing and wrote the following query with the same purpose. I didn't use double aliases and used a straightforward query with a subquery; it works fine for me.
delete from demo_table1 as t1 where t1.age = (select min(t2.age) from demo_table2 as t2);
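If the engine keeps rejecting the subquery, one common workaround is to evaluate the scalar subquery first (e.g. via spark.sql(...).collect() in a notebook) and inline the result as a literal in the DELETE. A hedged sketch of just the string-building step; the file name below is hypothetical:

```python
# Sketch: inline a pre-computed scalar into a DELETE instead of using a
# subquery. The file name is hypothetical; in practice it would come
# from running SELECT MAX(_JOB_SOURCE_FILE) ... as a separate query.
def inline_literal(template: str, value: str) -> str:
    # Double any single quotes so the SQL string literal stays valid.
    return template.format(value=value.replace("'", "''"))

delete_sql = inline_literal(
    "DELETE FROM prod_gbs_gpdi.bronze_data.sapex_ap_posted "
    "WHERE _JOB_SOURCE_FILE = '{value}'",
    "file_2023_01.csv",
)
print(delete_sql)
```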

Access "table$partitions" through Spark Sql

I figured out that running the following query does a full scan of the table:
select max(run_id) from database.table
So I switched my code to work with the following syntax:
select max(run_id) from "database"."table$partitions"
This query works great on Athena but when I try to execute it with Spark Sql I get the following error:
mismatched input '"database"' expecting <EOF>(line 1, pos 24)
It seems like Spark SQL identifies the quotes as the end of the query.
Any ideas how to make this query work on spark sql?
Thanks
My solution for this problem was:
sql_context.sql(f'show partitions {table_name}').agg(
    f.max(f.regexp_extract('partition', rf'''{partition_name}=([^/]+)''', 1))
).collect()[0][0]
The advantage: it doesn't do a full scan of the table.
The disadvantage: it scans all partition levels, and the code isn't elegant.
Anyway, that's the best I found.
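For reference, the regexp step can be illustrated with plain Python on the kind of strings SHOW PARTITIONS returns (sample values below are made up). One caveat worth noting: max over strings is lexicographic, so when run_id is numeric it should be compared as a number; the same caveat applies to f.max on a string column:

```python
import re

# Sample partition strings of the form SHOW PARTITIONS returns.
partitions = ["run_id=101/part=0", "run_id=205/part=1", "run_id=99/part=0"]
partition_name = "run_id"

# Same idea as f.regexp_extract: pull out one partition column's value.
values = [re.search(rf"{partition_name}=([^/]+)", p).group(1)
          for p in partitions]

lex_max = max(values)           # '99': lexicographic, wrong for numbers
num_max = max(values, key=int)  # '205': compared numerically
print(lex_max, num_max)
```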

Copying data in and out of Snowflake via Azure Blob Storage

I'm trying to copy into blob storage and then copy out of blob storage. The copy into works:
copy into 'azure://my_blob_url.blob.core.windows.net/some_folder/MyTable'
from (select *
from MyTable
where condition = 'true')
credentials = (azure_sas_token = 'my_token');
But the copy out fails:
copy into MyTable
from 'azure://my_blob_url.blob.core.windows.net/some_folder/MyTable'
credentials = (azure_sas_token = 'my_token');
the error is:
SQL Compilation error: Function 'EXTRACT' not supported within a COPY.
Weirdly enough, it worked once and hasn't worked since. I'm at a loss; nothing turns up details for this.
I know there's an approach I could take using stages, but I don't want to for a bunch of reasons and even when I try with stages the same error presents itself.
Edit:
The cluster key definition is:
cluster by (idLocal, year(_ts), month(_ts), substring(idGlobal, 0, 1));
where the idLocal and idGlobal are varchars and the _ts is a TIMESTAMPTZ
I think I've seen this before with a cluster key on the table (which I don't think is supported with COPY INTO): the EXTRACT function shown in the error comes from the CLUSTER BY on the table.
This is a bit of a hunch, but assuming this isn't occurring for all your tables, hopefully it leads to investigation of the table configuration, and perhaps that might help.
Alex, can you try a different function in the cluster key on your target table, like date_trunc('day', _ts)?
thanks
Chris

Execute multiple SQL queries in one statement using Python

I am new to Python and I want to execute multiple SQL queries in one call, but I am not able to find the appropriate way to do so.
I wrote the following code, but it throws the error "DatabaseError: ORA-00933: SQL command not properly ended."
import cx_Oracle

sql_query = "select x from xyz where p = 'sn'; select * from abs where a = 'qw';"
connection = cx_Oracle.connect('username', 'password', 'server')
cursor = connection.cursor()
cursor.execute(sql_query)  # raises ORA-00933 here
It would be great if someone could suggest the appropriate function for executing multiple queries in one call.
Appreciate your response. Thanks in advance.
What do you want to achieve with this?
Technically you could try to get rows from two tables, or combine rows from different tables, but all of this is done directly in SQL.
Remove the ; (semicolon) at the end of the query and it should run fine. Note that cursor.execute() runs only one statement per call, so if you really need both queries, execute them separately.
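To run both queries, a common pattern is to split the batch and call execute() once per statement. A minimal runnable sketch using the stdlib sqlite3 driver as a stand-in (it has the same one-statement-per-execute rule as cx_Oracle); note that naively splitting on ';' breaks if a semicolon ever appears inside a string literal:

```python
import sqlite3

# Stand-in for the Oracle connection; sqlite3 also allows only one
# statement per execute() call, so the pattern carries over.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE xyz (x INTEGER, p TEXT)")
cur.execute("INSERT INTO xyz VALUES (1, 'sn'), (2, 'other')")

sql_batch = "select x from xyz where p = 'sn'; select count(*) from xyz"
results = []
for stmt in (s.strip() for s in sql_batch.split(";")):
    if stmt:                      # skip empty trailing fragments
        cur.execute(stmt)         # one statement per call
        results.append(cur.fetchall())

print(results)  # -> [[(1,)], [(2,)]]
```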

greenplum database "relation does not exist"

I am getting a "relation does not exist" error while trying to truncate a particular table. The table actually exists in the database.
Also, when I click on this table in pgAdmin, I get the warning for vacuum.
Are these things related?
Adding a few more details:
The TRUNCATE statement is called within a Greenplum function. This job truncates and loads the table on a daily basis (the table is queried in reports). The issue pops up once in a while, and if we restart the same job after a few minutes it succeeds.
Please try the following: select * from schemaname.tablename limit 10; If you don't use the schema name, then you have to set the search path as below and then run your select:
set search_path=schemaname;
