Strategies to prevent duplicate data in Azure SQL Data Warehouse

At the moment I am setting up an Azure SQL Data Warehouse. I am using Databricks for the ETL process with JSON files from Azure Blob Storage.
What is the best practice to make sure that duplicate dimensions or facts are not imported into the Azure SQL Data Warehouse?
This could happen for facts, e.g. in the case of an exception during the loading process. And for dimensions this could happen as well if I did not check which data already exists.
I am using the following code to import data into the data warehouse, and I found no "mode" which would import only data that does not already exist:
spark.conf.set(
    "spark.sql.parquet.writeLegacyFormat",
    "true")

renamedColumnsDf.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", sqlDwUrlSmall) \
    .option("dbtable", "SampleTable") \
    .option("forward_spark_azure_storage_credentials", "True") \
    .option("tempdir", tempDir) \
    .mode("overwrite") \
    .save()

Ingest to a staging table, then CTAS to your fact table with a NOT EXISTS clause to eliminate duplicates.
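As an illustration, here is a minimal sketch of that pattern with the same connector. The table names FactSample and FactSample_Staging and the BusinessKey column are assumptions; the de-duplicating INSERT is passed through the connector's postActions option so it runs right after the staging load succeeds. A CTAS into a new table followed by a rename is the other common Synapse variant.

# Hypothetical de-dup statement; adjust the key columns to your model.
dedupe_sql = """
    INSERT INTO dbo.FactSample
    SELECT s.*
    FROM dbo.FactSample_Staging AS s
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.FactSample AS f
        WHERE f.BusinessKey = s.BusinessKey
    )
"""

renamedColumnsDf.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", sqlDwUrlSmall) \
    .option("dbtable", "FactSample_Staging") \
    .option("forward_spark_azure_storage_credentials", "True") \
    .option("tempdir", tempDir) \
    .option("postActions", dedupe_sql) \
    .mode("overwrite") \
    .save()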

Related

Consistent SQL database snapshot using Spark

I am trying to export a snapshot of a PostgreSQL database to Parquet files using Spark.
I am dumping each table in the database to a separate Parquet file.
tables_names = ["A", "B", "C", ...]

for table_name in tables_names:
    table = (spark.read
        .format("jdbc")
        .option("driver", driver)
        .option("url", url)
        .option("dbtable", table_name)
        .option("user", user)
        .load())
    table.write.mode("overwrite").saveAsTable(table_name)
The problem, however, is that I need the tables to be consistent with each other.
Ideally, the table loads should be executed in a single transaction so they see the same version of the database.
The only solution I can think of is to select all tables in a single query using UNION/JOIN, but then I would need to identify each table's columns, which is something I am trying to avoid.
Unless you force all future connections to the database (not the instance) to be read-only and terminate those in flight, by setting the PostgreSQL configuration parameter default_transaction_read_only to true, then no, you cannot do this with the per-table approach in your code.
Note that a session can override the global setting.
That means your second option (a single query with UNION/JOIN) will work, thanks to MVCC, but it is not elegant, and how will it perform from a Spark context over JDBC?
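A rough sketch of that read-only approach, under stated assumptions: the admin connection uses psycopg2, and the database name mydb and the admin_url variable are illustrative. It wraps around the per-table export loop from the question.

import psycopg2  # illustrative admin connection, separate from Spark

admin = psycopg2.connect(admin_url)
admin.autocommit = True
with admin.cursor() as cur:
    # Make all new transactions in this database read-only by default.
    cur.execute("ALTER DATABASE mydb SET default_transaction_read_only = true")
    # Terminate in-flight sessions so no writer that started earlier survives.
    cur.execute("""
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE datname = 'mydb' AND pid <> pg_backend_pid()
    """)

# ... run the per-table Spark export loop from the question here ...

with admin.cursor() as cur:
    cur.execute("ALTER DATABASE mydb SET default_transaction_read_only = false")
admin.close()

Keep the caveat above in mind: a session can still override the setting for itself, so this only blocks well-behaved writers.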

Apache Hudi create and append Upsert table (Parquet-format) on Dataproc & Cloud Storage

Dataproc noob here again.
My main goal is to ingest tables from on-premise sources, store them as Parquet files in a Cloud Storage bucket, and create/update tables in BigQuery from those files. Following my previous post about the Dataproc and Hudi configuration, I was able to deploy, ingest from on-premise sources via Dataproc/PySpark/Hudi, and store the tables in Cloud Storage.
The next question is about the 'upsert' setting in 'hudi_options' and how new results can be appended to the Parquet file in the Cloud Storage bucket. It is not clear to me whether you can update/change a Parquet file with Hudi.
I want to avoid deleting previous loads and only store one Parquet file per table.
Upsert code:
table_location = "gs://bucket/{}/".format(table_name)

updates = spark.read.format("jdbc") \
    .option("url", url) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("query", query) \
    .load()

hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'a,b,c,d,e',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'x',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'path': table_location,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms'
}

updates.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save()
However, it generates another Parquet file in Cloud Storage.
Is there anything I'm missing in the configuration?
Thanks!
Apache Hudi and the other open data lakehouse formats (Apache Iceberg and Delta) create data snapshots in order to bring ACID transactions to the Parquet format. In each commit they read the existing data in the changed partitions, apply the changes in memory and temporary files (appending new data, deleting or updating existing data), then re-write the whole data for those partitions. Later they delete the old files based on the cleaning policy; until then you can use them for time-travel queries or for rollback.
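For example, as long as the old files have not been cleaned, an earlier snapshot can still be queried with a time-travel read. A hedged sketch, assuming a Hudi version that supports the as.of.instant read option; the instant timestamp is illustrative:

# Read the table as of an earlier commit instant (illustrative timestamp).
old_snapshot = spark.read.format("hudi") \
    .option("as.of.instant", "20230101000000") \
    .load(table_location)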
I want to avoid delete previous loads
As long as your OPERATION_OPT_KEY is not insert_overwrite_table or insert_overwrite, and the Spark save mode is not overwrite, don't worry about the previous loads.
and only store one Parquet-file per table
If you want to compact the data and have a single big Parquet file per table, you need to increase hoodie.parquet.max.file.size, which defaults to 120 MB, because Hudi splits Parquet files that exceed this size into chunks.
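As a hedged illustration (the values are arbitrary, and both settings are specified in bytes), this could be added to the hudi_options dictionary from the question:

hudi_options.update({
    # Target up to ~1 GB per Parquet file instead of the ~120 MB default.
    'hoodie.parquet.max.file.size': 1024 * 1024 * 1024,
    # Files below this size are treated as "small", so Hudi keeps packing new records into them.
    'hoodie.parquet.small.file.limit': 900 * 1024 * 1024,
})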

Write data to specific partitions in Azure Dedicated SQL pool

At the moment, we are using the steps in the article below to do a full load of the data from one of our Spark data sources (a Delta Lake table) and write it to a table in SQL DW.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Specifically, the write is carried out using,
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .option("maxStrLength", 4000) \
    .mode("overwrite") \
    .save()
Now, our source data, by virtue of being a Delta Lake table, is partitioned on countryid. We would like to load/refresh only certain partitions in the SQL DWH, instead of the full drop-table-and-load (because we specify "overwrite") that is happening now. I tried adding an additional option (partitionBy, countryid) to the above script, but that doesn't seem to work.
Also the above article doesn't mention partitioning.
How do I work around this?
There might be better ways to do this, but this is how I achieved it. If the target Synapse table is partitioned, we can leverage the "preActions" option provided by the Synapse connector to delete the existing data for that partition. Then we append the new data pertaining to that partition (read as a dataframe from the source), instead of overwriting the whole table.
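A rough sketch of that approach, not a definitive implementation: the table name and partition value are illustrative, and the preActions SQL deletes the target partition's rows just before the connector appends the refreshed data.

from pyspark.sql.functions import col

country_id = 826  # illustrative partition value
delete_sql = "DELETE FROM dbo.FactSales WHERE countryid = {}".format(country_id)

df.where(col("countryid") == country_id).write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "dbo.FactSales") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .option("preActions", delete_sql) \
    .mode("append") \
    .save()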

Writing from Databricks to Synapse (Azure DW) very slow

We are using Databricks and its SQL DW connector to load data into Synapse. I have a dataset with 10,000 rows and 40 columns, and it takes 7 minutes!
Loading the same dataset using Data Factory with PolyBase and the staging option takes 27 seconds. Same with bulk copy.
What could be wrong? Am I missing some configuration? Or is this business as usual?
Connection configuration:
df_insert.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", sqlDwUrlSmall) \
    .option("dbtable", t_insert) \
    .option("forward_spark_azure_storage_credentials", "True") \
    .option("tempdir", tempDir) \
    .option("maxStrLength", maxStrLength) \
    .mode("append") \
    .save()
You can try changing the write semantics (see the Databricks documentation).
Using the COPY write semantics, I was able to load data into Synapse faster.
You can configure it before running the write command, in this way:
spark.conf.set("spark.databricks.sqldw.writeSemantics", "copy")

Azure Event Hubs to Databricks, what happens to the dataframes in use

I've been developing a proof of concept with Azure Event Hubs streaming JSON data to an Azure Databricks notebook, using PySpark. Based on the examples I've seen, I've created my rough code as follows, taking the data from the event hub to the Delta table I'll be using as a destination:
from pyspark.sql.functions import col, to_date

connectionString = "My End Point"
ehConf = {'eventhubs.connectionString': connectionString}

df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

readEventStream = df \
    .withColumn("body", df["body"].cast("string")) \
    .withColumn("date_only", to_date(col("enqueuedTime")))

readEventStream.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .table("testSink")
After reading around and googling, what happens to the df and readEventStream dataframes? Will they just keep growing as they retain the data, or will they be emptied during normal processing? Or are they just a temporary store before the data is dumped into the Delta table? Is there a way of setting X amount of items streamed before writing out to the Delta table?
Thanks
I carefully reviewed the descriptions of the APIs you used in the official PySpark documentation for the pyspark.sql module. I think the growing memory usage is caused by the table(tableName) function, which is intended for a DataFrame, not for a streaming DataFrame.
So the table function creates a data structure that holds the streaming data in memory.
I recommend using start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options) to complete the stream write operation first, and then getting a table from the Delta lake again. Also, there does not seem to be a way in PySpark to set X amount of items streamed before writing out to the Delta table.
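A minimal sketch of that suggestion; the output path is illustrative, sitting alongside the existing checkpoint location:

query = readEventStream.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .start("/delta/testSink/data")

# Register the path as a table afterwards and query it like any other Delta table.
spark.sql("CREATE TABLE IF NOT EXISTS testSink USING DELTA LOCATION '/delta/testSink/data'")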
