Write spark Dataframe to an exisitng Delta Table by providing TABLE NAME instead of TABLE PATH - apache-spark

I am trying to write spark dataframe into an existing delta table.
I do have multiple scenarios where I could save data into different tables as shown below.
SCENARIO-01:
I have an existing delta table and I have to write dataframe into that table with option mergeSchema since the schema may change for each load.
I am doing the same with below command by providing delta table path
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").save(finalDF01DestFolderPath)
Just want to know whether this can be done by providing exisiting delta TABLE NAME instead of delta PATH.
This has been resolved by updating data write command as below.
finalDF01.write.format("delta").option("mergeSchema", "true").mode("append") \
.partitionBy("part01","part02").saveAsTable(finalDF01DestTableName)
Is this the correct way ?
SCENARIO 02:
I have to update the existing table if the record already exists and if not insert a new record.
For this I am currently doing as shown below.
spark.sql("SET spark.databricks.delta.schema.autoMerge.enabled = true")
DeltaTable.forPath(DestFolderPath)
.as("t")
.merge(
finalDataFrame.as("s"),
"t.id = s.id AND t.name= s.name")
.whenMatched().updateAll()
.whenNotMatched().insertAll()
.execute()
I tried with below script.
destMasterTable.as("t")
.merge(
vehMasterDf.as("s"),
"t.id = s.id")
.whenNotMatched().insertAll()
.execute()
but getting below error(even with alias instead of as).
error: value as is not a member of String
destMasterTable.as("t")
Here also I am using delta table path as destination, Is there any way so that we could provide delta TABLE NAME instead of TABLE PATH?
It will be good to provide TABLE NAME instead of TABLE PATH, In case if we chage the table path later will not affect the code.
I have not seen anywhere in databricks documentation providing table name along with mergeSchema and autoMerge.
Is it possible to do so?

To use existing data as a table instead of path you either were need to use saveAsTable from the beginning, or just register existing data in the Hive metastore using the SQL command CREATE TABLE USING, like this (syntax could be slightly different depending on if you're running on Databricks, or OSS Spark, and depending on the version of Spark):
CREATE TABLE IF NOT EXISTS my_table
USING delta
LOCATION 'path_to_existing_data'
after that, you can use saveAsTable.
For the second question - it looks like destMasterTable is just a String. To refer to existing table, you need to use function forName from the DeltaTable object (doc):
DeltaTable.forName(destMasterTable)
.as("t")
...

Related

How to create an external unmanaged table in delta lake in Azure Databricks

I have this scenario where I am reading files from my blob storage and then creating a delta table in Azure Databricks. When you create a delta table in Datbaricks , there are delta files which are created by default which we cant access.
Other way is to create a unmanaged delta table and specify your own path to store these delta files. I would like to know how to do this. How can i specify where I want to store my delta files. and what does this external table mean, how can i specify the path for external table?
I tried below code and it fails on creating external table command:
spark.conf.set("fs.azure.account.auth.type.xyzstorageaccount.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.xyzstorageaccount.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.xyzstorageaccount.dfs.core.windows.net", "<sas token>")
%sql
CREATE EXTERNAL TABLE IF NOT EXISTS axytable
LOCATION 'abfss://xyzstorageaccount/tables';
I know I might be doing something wrong here and I have not completely understood what external table actually means.. Does it stays inside the Databricks cluster and instead of providing my storage account path , should I provide Databricks path? How do i create a custom catalog which also seems to be a requirement here? Also this code works here and i am able to write it in storage account by the path provided but I fail to understand it.. Any SME who can help me out here?
path = "abfss://xxx#xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
.format('delta')
.outputMode("append")
.trigger(once=True)
.option("mergeSchema", "true")
.option('checkpointLocation', path+"/bronze_checkpoint")
.start(path + "/myTable"))
This documentation provide good description of what managed tables are and how are they different from unmanaged tables. In nutshell, managed tables are created in a "default" location, and both data & table metadata a managed by Hive metastore or Unity Catalog, so when you drop a table, actual data is deleted as well. Unmanaged tables are different as only metadata are controlled by Hive metastore or Unity Catalog - if you drop table, only table definition will be dropped, but not data.
You can create unamanged table different ways:
Create from scratch using syntax create table <name> (columns definition) using delta location 'path' (doc)
Create table for existing data using syntax create table name using delta location 'path' (you don't need to provide columns definition) (doc)
Provide path option with path to data when saving table using saveAsTable (for batch), or toTable (streaming). In your example it will be following:
path = "abfss://xxx#xyzstorageacount.dfs.core.windows.net/XYY"
(DF.writeStream
.format('delta')
.outputMode("append")
.trigger(once=True)
.option("mergeSchema", "true")
.option('checkpointLocation', path+"/bronze_checkpoint")
.option('path', path + "/myTable")
.toTable('<table_name>')
)

rescued_data column of schemaEvolutionMode is not showing in output of readStream in Pyspark(databricks)

I am new to databricks (pyspark), I have few queries regarding pyspark syntax:
Do we need to follow any specific order when using options in readStream and writeStream. For example:-
(dataframe.readStream
.format("cloudFile)
.option("cloudFiles.format": "avro")
.option("multiline", True)
.schema(schema)
.load(path))
Delta table creation w/ both tableName and location option, is that right?? If I use both only I can see the files like .parquet, _delta log, checkpoint in the specified path and if I use tableName only I can see the table in hive meta store/spark catalog.bronze of SQL editor in databricks
The syntax i use, is it ok to use both .tableName() and .location() option
(DeltaTable.createIfNotExists(spark)
.tableName("%s.%s_%s" % (layer, domain, deltaTable))
.addColumn("x", "INTEGER")
.location(path)
.execute())
1st question: Do we need to follow any specific order when using options in readStream and writeStream.
Answer: No order required you can use different options like format, option, schema, etc in any order
2nd Question: Delta table creation w/ both tableName and location option, is that right?? If I use both only I can see the files like .parquet, _delta log, checkpoint in the specified path and if I use tableName only I can see the table in hive meta store/spark catalog.bronze of SQL editor in databricks
Answer: If you specify the location explicitly, it is external table (this table is created under specified location and once table alone is deleted the data still persists under data store location like mounted ADLS) and if you don't specify location it is managed table (create under default location and this table can be deleted once the table is deleted)
And to the question in the heading/top of the question about rescued data column:
Answer: Even though we use option to see rescued data, we need to select that rescued data column so that we can see it in the resultant dataframe. If we don't select it in: df.select(), then we can't see it in the result.
Please correct me if I am wrong.

Spark jdbc overwrite mode not working as expected

I would like to perform update and insert operation using spark
please find the image reference of existing table
Here i am updating id :101 location and inserttime and inserting 2 more records:
and writing to the target with mode overwrite
df.write.format("jdbc")
.option("url", "jdbc:mysql://localhost/test")
.option("driver","com.mysql.jdbc.Driver")
.option("dbtable","temptgtUpdate")
.option("user", "root")
.option("password", "root")
.option("truncate","true")
.mode("overwrite")
.save()
After executing the above command my data is corrupted which is inserted into db table
Data in the dataframe
Could you please let me know your observations and solutions
Spark JDBC writer supports following modes:
append: Append contents of this :class:DataFrame to existing data.
overwrite: Overwrite existing data.
ignore: Silently ignore this operation if data already exists.
error (default case): Throw an exception if data already exists
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Since you are using "overwrite" mode it recreate your table as per then column length, if you want your own table definition create table first and use "append" mode
i would like to perform update and insert operation using spark
There is no equivalent in to SQL UPDATE statement with Spark SQL. Nor is there an equivalent of the SQL DELETE WHERE statement with Spark SQL. Instead, you will have to delete the rows requiring update outside of Spark, then write the Spark dataframe containing the new and updated records to the table using append mode (in order to preserve the remaining existing rows in the table).
In case where you need to perform UPSERT / DELETE operations in your pyspark code, i suggest you to use pymysql libary, and execute your upsert/delete operations. Please check this post for more info, and code sample for reference : Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
Please modify the code sample as per your needs.
I wouldn't recommend TRUNCATE, since it would actually drop the table, and create new table. While doing this, the table may lose column level attributes that were set earlier...so be careful while using TRUNCATE, and be sure, if it's ok for dropping the table/recreate the table.
Upsert logic is working fine when following below steps
df = (spark.read.format("csv").
load("file:///C:/Users/test/Desktop/temp1/temp1.csv", header=True,
delimiter=','))
and doing this
(df.write.format("jdbc").
option("url", "jdbc:mysql://localhost/test").
option("driver", "com.mysql.jdbc.Driver").
option("dbtable", "temptgtUpdate").
option("user", "root").
option("password", "root").
option("truncate", "true").
mode("overwrite").save())
Still, I am unable to understand the logic why its failing when i am writing using the data frame directly

How to do append insertion in sparksql?

I have a api endpoint written by sparksql with the following sample code. Every time api accept a request it will run sparkSession.sql(sql_to_hive) which would create a single file in HDFS. Is there any way to do insert by appending data to existing file in HDFS ? Thanks.
sqlContext = SQLContext(sparkSession.sparkContext)
df = sqlContext.createDataFrame(ziped_tuple_list, schema=schema)
df.registerTempTable('TMP_TABLE')
sql_to_hive = 'insert into log.%(table_name)s partition%(partition)s select %(title_str)s from TMP_TABLE'%{
'table_name': table_name,
'partition': partition_day,
'title_str': title_str
}
sparkSession.sql(sql_to_hive)
I don't think this is possible case to append data to the existing file.
But you can work around this case by using either of these ways
Approach1
Using Spark, write to intermediate temporary table and then insert overwrite to final table:
existing_df=spark.table("existing_hive_table") //get the current data from hive
current_df //new dataframe
union_df=existing_df.union(current_df)
union_df.write.mode("overwrite").saveAsTable("temp_table") //write the data to temp table
temp_df=spark.table("temp_table") //get data from temp table
temp_df.repartition(<number>).write.mode("overwrite").saveAsTable("existing_hive_table") //overwrite to final table
Approach2:
Hive(not spark) offers overwriting and select same table .i.e
insert overwrite table default.t1 partition(partiton_column)
select * from default.t1; //overwrite and select from same t1 table
If you are following this way then there needs to be hive job triggered once your spark job finishes.
Hive will acquire lock while running overwrite/select the same table so if any job which is writing to table will wait.
In Addition: Orc format will offer alter table concatenate which will merge small ORC files to create a new larger file.
alter table <db_name>.<orc_table_name> [partition_column="val"] concatenate;
We can also use distributeby,sortby clauses to control number of files, refer this and this link for more details.
Another Approach3 is by using hadoop fs -getMerge to merge all small files into one (this method works for text files and i haven't tried for orc,avro ..etc formats).
When you write the resulted dataframe:
result_df = sparkSession.sql(sql_to_hive)
set it’s mode to append:
result_df.write.mode(SaveMode.Append).

How to create an EXTERNAL Spark table from data in HDFS

I have loaded a parquet table from HDFS into a DataFrame:
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
I now want to expose this table to Spark SQL but this must be a persitent table because I want to access it from a JDBC connection or other Spark Sessions.
Quick way could be to call df.write.saveAsTable method, but in this case it will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore, creating another copy of the data in HDFS.
I don't want to have two copies of the same data, so I would want create like an external table to point to existing data.
To create a Spark External table you must specify the "path" option of the DataFrameWriter. Something like this:
df.write.
option("path","hdfs://user/zeppelin/my_mytable").
saveAsTable("my_table")
The problem though, is that it will empty your hdfs path hdfs://user/zeppelin/my_mytable eliminating your existing files and will cause an org.apache.spark.SparkException: Job aborted.. This looks like a bug in Spark API...
Anyway, the workaround to this (tested in Spark 2.3) is to create an external table but from a Spark DDL. If your table have many columns creating the DDL could be a hassle. Fortunately, starting from Spark 2.0, you could call the DDL SHOW CREATE TABLE to let spark do the hard work. The problem is that you can actually run the SHOW CREATE TABLE in a persistent table.
If the table is pretty big, I recommend to get a sample of the table, persist it to another location, and then get the DDL. Something like this:
// Create a sample of the table
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
df.limit(1).write.
option("path", "/user/zeppelin/my_table_tmp").
saveAsTable("my_table_tmp")
// Now get the DDL, do not truncate output
spark.sql("SHOW CREATE TABLE my_table_tmp").show(1, false)
You are going to get a DDL like:
CREATE TABLE `my_table_tmp` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
`serialization.format` '1',
path 'hdfs:///user/zeppelin/my_table_tmp')
Which you would want to change to have the original name of the table and the path to the original data. You can now run the following to create the Spark External table pointing to your existing HDFS data:
spark.sql("""
CREATE TABLE `my_table` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
`serialization.format` '1',
path 'hdfs:///user/zeppelin/my_table')""")

Resources