Delta Lake in Databricks - creating a table for existing storage - apache-spark

I currently have an append table in Databricks (Spark 3, Databricks Runtime 7.5):
parsedDf \
    .select("somefield", "anotherField", "partition", "offset") \
    .write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save(f"/mnt/defaultDatalake/{append_table_name}")
It was created with a create table command before, and I don't use INSERT commands to write to it (as seen above).
Now I want to be able to query it with SQL without going through createOrReplaceTempView every time. Is it possible to add a table for the current data without removing it? What changes do I need to support this?
UPDATE:
I've tried:
res = spark.sql(f"CREATE TABLE exploration.oplog USING DELTA LOCATION '/mnt/defaultDataLake/{append_table_name}'")
But I get an AnalysisException:
You are trying to create an external table exploration.dataitems_oplog
from /mnt/defaultDataLake/specificpathhere using Databricks Delta, but the schema is not specified when the
input path is empty.
However, the path isn't empty.

Starting with Databricks Runtime 7.0, you can create a table in the Hive metastore from existing data, automatically discovering the schema, partitioning, etc. (see the documentation for all the details). The base syntax is as follows (replace values in <> with actual values):
CREATE TABLE <database>.<table>
USING DELTA
LOCATION '/mnt/defaultDatalake/<append_table_name>'
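For example, a minimal sketch using the names from the question (one thing to double-check: the location must match the write path exactly, and the question's write used /mnt/defaultDatalake/ while the failing CREATE TABLE used /mnt/defaultDataLake/):
# Register the existing Delta data as a table once; the database is
# assumed to already exist (create it first if it doesn't).
spark.sql("CREATE DATABASE IF NOT EXISTS exploration")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS exploration.oplog
    USING DELTA
    LOCATION '/mnt/defaultDatalake/{append_table_name}'
""")

# From now on the data can be queried with plain SQL,
# without createOrReplaceTempView:
spark.sql("SELECT somefield, offset FROM exploration.oplog LIMIT 10").show()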
P.S. There is more documentation on different aspects of managed vs. unmanaged tables that could be useful to read.
P.P.S. This works just fine for me on DBR 7.5 ML.

Related

How to create an external unmanaged table in delta lake in Azure Databricks

I have a scenario where I am reading files from my blob storage and then creating a Delta table in Azure Databricks. When you create a Delta table in Databricks, there are delta files which are created by default and which we can't access.
The other way is to create an unmanaged Delta table and specify your own path to store these delta files. I would like to know how to do this. How can I specify where I want to store my delta files? And what does an external table mean; how can I specify the path for an external table?
I tried the code below and it fails on the CREATE EXTERNAL TABLE command:
spark.conf.set("fs.azure.account.auth.type.xyzstorageaccount.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.xyzstorageaccount.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.xyzstorageaccount.dfs.core.windows.net", "<sas token>")
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS axytable
LOCATION 'abfss://xyzstorageaccount/tables';
I know I might be doing something wrong here and I have not completely understood what an external table actually means. Does it stay inside the Databricks cluster, and instead of providing my storage account path, should I provide a Databricks path? How do I create a custom catalog, which also seems to be a requirement here? Also, the code below works and I am able to write to the storage account at the path provided, but I fail to understand it. Any SME who can help me out here?
path = "abfss://xxx#xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
.format('delta')
.outputMode("append")
.trigger(once=True)
.option("mergeSchema", "true")
.option('checkpointLocation', path+"/bronze_checkpoint")
.start(path + "/myTable"))
This documentation provides a good description of what managed tables are and how they differ from unmanaged tables. In a nutshell, managed tables are created in a "default" location, and both the data and the table metadata are managed by the Hive metastore or Unity Catalog, so when you drop a table, the actual data is deleted as well. Unmanaged tables are different: only the metadata is controlled by the Hive metastore or Unity Catalog; if you drop the table, only the table definition is dropped, not the data.
You can create an unmanaged table in different ways:
Create it from scratch using the syntax create table <name> (columns definition) using delta location 'path' (doc)
Create a table for existing data using the syntax create table <name> using delta location 'path' - you don't need to provide a columns definition (doc); hedged sketches of these first two options follow the streaming example below
Provide the path option with the path to the data when saving the table using saveAsTable (for batch) or toTable (streaming). In your example it would be the following:
path = "abfss://xxx#xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
.format('delta')
.outputMode("append")
.trigger(once=True)
.option("mergeSchema", "true")
.option('checkpointLocation', path+"/bronze_checkpoint")
.option('path', path + "/myTable")
.toTable('<table_name>')
)
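For completeness, a hedged sketch of the first two options (the container, account, and column list in these statements are placeholders; note that an abfss location needs the container@account.dfs.core.windows.net form, which the failing CREATE EXTERNAL TABLE above was missing):
# Option 1: create a brand-new unmanaged table with an explicit schema.
spark.sql("""
    CREATE TABLE IF NOT EXISTS axytable (id BIGINT, name STRING)
    USING DELTA
    LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/tables/axytable'
""")

# Option 2: register already-written Delta data; the schema is read from
# the Delta transaction log, so no column list is needed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS axytable
    USING DELTA
    LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/tables/axytable'
""")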

Querying snowflake metadata using spark connector

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and I am getting an "Object 'SHOW' does not exist or not authorized" error.
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", "show tables") \
    .load()
df.show()
A sample query like "SELECT 1" works as expected.
I know that I could install the native Snowflake Python driver, but I want to avoid that solution if possible because I have already opened the session using Spark.
There is also a way using the "Utils.runQuery" function, but I understand it is relevant only for DDL statements (it doesn't return the actual results).
Thanks!
When using DataFrames, the Snowflake connector supports SELECT queries only.
This is documented in our docs.
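If the goal is just the table list, one possible workaround is to express the lookup as a SELECT against Snowflake's INFORMATION_SCHEMA, which the connector does support. A sketch, reusing the options dict from the question:
# SHOW TABLES is rejected, but the same metadata is exposed through
# INFORMATION_SCHEMA, which is reachable with a plain SELECT.
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query",
            "SELECT table_name, table_type "
            "FROM information_schema.tables "
            "WHERE table_schema = CURRENT_SCHEMA()") \
    .load()
df.show()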

Write Spark DataFrame to an existing Delta Table by providing TABLE NAME instead of TABLE PATH

I am trying to write a Spark dataframe into an existing Delta table.
I have multiple scenarios where I could save data into different tables, as shown below.
SCENARIO-01:
I have an existing Delta table and I have to write a dataframe into that table with the mergeSchema option, since the schema may change with each load.
I am doing this with the command below, providing the Delta table path:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
I just want to know whether this can be done by providing the existing Delta TABLE NAME instead of the Delta PATH.
This has been resolved by updating the data write command as below:
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way?
SCENARIO-02:
I have to update the existing record if it already exists and insert a new record if not.
For this I am currently doing as shown below:
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
.as("t")
.merge(
finalDataFrame.as("s"),
"t.id = s.id AND t.name= s.name")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute()
I tried the script below:
destMasterTable.as("t")
  .merge(
    vehMasterDf.as("s"),
    "t.id = s.id")
  .whenNotMatched().insertAll()
  .execute()
but I am getting the error below (even with alias instead of as):
error: value as is not a member of String
destMasterTable.as("t")
Here also I am using the Delta table path as the destination. Is there any way we could provide the Delta TABLE NAME instead of the TABLE PATH?
It would be good to provide the TABLE NAME instead of the TABLE PATH, so that if we change the table path later it will not affect the code.
I have not seen anywhere in the Databricks documentation a table name being provided along with mergeSchema and autoMerge.
Is it possible to do so?
To use existing data as a table instead of a path, you either need to have used saveAsTable from the beginning, or register the existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (the syntax could be slightly different depending on whether you're running on Databricks or OSS Spark, and on the version of Spark):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
After that, you can use saveAsTable.
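A hedged sketch tying this to the first scenario (the table name here is illustrative, and when appending to an existing Delta table the partitioning must match what the table was created with):
# Register the existing Delta folder once...
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS mydb.final_table
    USING delta
    LOCATION '{finalDF01DestFolderPath}'
""")

# ...then write by name instead of by path.
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
    .partitionBy("part01", "part02").saveAsTable("mydb.final_table")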
For the second question: it looks like destMasterTable is just a String. To refer to an existing table, you need to use the forName function from the DeltaTable object (doc):
DeltaTable.forName(destMasterTable)
.as("t")
...
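In PySpark the full merge then looks roughly like this (a sketch; the table name is illustrative, and DeltaTable.forName takes the SparkSession plus the table name):
from delta.tables import DeltaTable

# Schema evolution for merges, as in the first scenario.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")

target = DeltaTable.forName(spark, "mydb.veh_master")  # by name, not path

target.alias("t") \
    .merge(vehMasterDf.alias("s"), "t.id = s.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()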

Spark JDBC overwrite mode not working as expected

I would like to perform update and insert operations using Spark.
Please find the image reference of the existing table.
Here I am updating the location and inserttime for id 101 and inserting 2 more records,
then writing to the target with mode overwrite:
(df.write.format("jdbc")
    .option("url", "jdbc:mysql://localhost/test")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "temptgtUpdate")
    .option("user", "root")
    .option("password", "root")
    .option("truncate", "true")
    .mode("overwrite")
    .save())
After executing the above command, the data inserted into the DB table is corrupted.
Data in the dataframe:
Could you please let me know your observations and solutions?
The Spark JDBC writer supports the following modes:
append: Append contents of this DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error (default case): Throw an exception if data already exists.
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Since you are using "overwrite" mode, Spark recreates your table using the dataframe's column types and lengths. If you want your own table definition, create the table first and use "append" mode.
"I would like to perform update and insert operations using Spark"
There is no equivalent to the SQL UPDATE statement in Spark SQL, nor is there an equivalent of the SQL DELETE WHERE statement. Instead, you will have to delete the rows requiring an update outside of Spark, then write the Spark dataframe containing the new and updated records to the table using append mode (in order to preserve the remaining existing rows in the table).
In cases where you need to perform UPSERT / DELETE operations in your PySpark code, I suggest you use the pymysql library and execute your upsert/delete operations with it. Please check this post for more info and a code sample for reference: Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array.
Please modify the code sample as per your needs.
I wouldn't recommend TRUNCATE, since it would actually drop the table and create a new one. While doing this, the table may lose column-level attributes that were set earlier... so be careful while using TRUNCATE, and be sure that dropping and recreating the table is acceptable.
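A minimal sketch of the pymysql approach, assuming the target table temptgtUpdate has a primary key on id and using the column names mentioned in the question (collect() is only sensible for small dataframes):
import pymysql

# Connection values mirror the JDBC options from the question.
conn = pymysql.connect(host="localhost", user="root",
                       password="root", database="test")
try:
    with conn.cursor() as cur:
        # Upsert: relies on a PRIMARY KEY / UNIQUE constraint on `id`.
        sql = ("INSERT INTO temptgtUpdate (id, location, inserttime) "
               "VALUES (%s, %s, %s) "
               "ON DUPLICATE KEY UPDATE location = VALUES(location), "
               "inserttime = VALUES(inserttime)")
        rows = [tuple(r) for r in df.select("id", "location", "inserttime").collect()]
        cur.executemany(sql, rows)
    conn.commit()
finally:
    conn.close()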
The upsert logic works fine when following the steps below:
df = (spark.read.format("csv")
      .load("file:///C:/Users/test/Desktop/temp1/temp1.csv", header=True,
            delimiter=','))
and then doing this:
(df.write.format("jdbc")
    .option("url", "jdbc:mysql://localhost/test")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "temptgtUpdate")
    .option("user", "root")
    .option("password", "root")
    .option("truncate", "true")
    .mode("overwrite")
    .save())
Still, I am unable to understand why it fails when I write using the dataframe directly.

Loading Snowflake from Databricks changes table structure

I'm doing a POC to load a Snowflake table from a dataframe in Databricks. I've successfully loaded the table; however, the load changes the table's structure.
For example, in Snowflake I created this table:
CREATE OR REPLACE TABLE FNT_DAYS
(
FNT_DT_PK TIMESTAMP_NTZ NOT NULL,
OPEN_DT_FLG VARCHAR(1),
HOLIDAY_DT_FLG VARCHAR(1),
LOAD_USR VARCHAR(10)
);
ALTER TABLE FNT_DAYS ADD CONSTRAINT FNT_DAYS_PK PRIMARY KEY (FNT_DT_PK);
When running my code in Databricks using Python, the table gets successfully loaded; however, the structure of the table changes to this:
CREATE OR REPLACE TABLE FNT_DAYS
(
FNT_DT_PK TIMESTAMP_NTZ,
OPEN_DT_FLG VARCHAR(16777216),
HOLIDAY_DT_FLG VARCHAR(16777216),
LOAD_USR VARCHAR(10)
);
Note that the primary key constraint is gone, the FNT_DT_PK field is no longer NOT NULL, and every VARCHAR field's length has been changed to 16777216.
My Python code in Databricks is very straightforward:
%python
options = dict(sfUrl="mysnflk.snowflakecomputing.com",
               sfUser="me",
               sfPassword="******",
               sfDatabase="SNF_DB",
               sfSchema="PUBLIC",
               sfWarehouse="SNF_WH")

df = spark.sql("select * from exc.test")

df.write \
    .format("snowflake") \
    .mode("overwrite") \
    .options(**options) \
    .option("dbtable", "FNT_DAYS") \
    .save()
Do you have an idea of why the table structure is changed in Snowflake?
If you look at the query_history in Snowflake, do you see that the table is being recreated by the df.write command? It seems that it is recreating the table and using the data types of the dataframe to define your new table. I don't know exactly what is causing that, but I do see that the Snowflake example (https://docs.snowflake.net/manuals/user-guide/spark-connector-use.html#id1) uses a slightly different syntax for the mode than yours.
I should also note that the length of those VARCHAR fields will not hurt you in any way in Snowflake. Length does not affect storage or performance, and those lengths just mean that the connector is using VARCHAR as the data type without a length specified. Also, primary key constraints are not enforced in Snowflake, so I'm not sure how important that is to you. The only thing I'd be concerned about is your NOT NULL, which Snowflake does enforce.
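To check the first point from Databricks itself, a hedged sketch that reads the recent statement history back through the same connector (QUERY_HISTORY is a Snowflake INFORMATION_SCHEMA table function; options is the dict from the question):
hist = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query",
            "SELECT query_text, start_time "
            "FROM TABLE(information_schema.query_history()) "
            "WHERE query_text ILIKE '%FNT_DAYS%' "
            "ORDER BY start_time DESC") \
    .load()
hist.show(truncate=False)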
