Writing from Databricks to Synapse (Azure DW) very slow - databricks

We are using Databricks and its SQL DW connector to load data into Synapse. I have a dataset with 10 000 rows and 40 columns. It takes 7 minutes!
Loading the same dataset using Data Factory with PolyBase and the staging option takes 27 seconds. Same with bulk copy.
What could be wrong? Am I missing some configuration? Or is this business as usual?
Connection configuration:
(df_insert.write
    .format("com.databricks.spark.sqldw")
    .option("url", sqlDwUrlSmall)
    .option("dbtable", t_insert)
    .option("forward_spark_azure_storage_credentials", "True")
    .option("tempdir", tempDir)
    .option("maxStrLength", maxStrLength)
    .mode("append")
    .save())

You can try changing the write semantics (see the Databricks documentation).
Using the COPY write semantics, I was able to load data into Synapse faster.
You can configure it before running the write command, like this:
spark.conf.set("spark.databricks.sqldw.writeSemantics", "copy")

Related

Slow performance writing from pyspark dataframe to Azure Synapse pool

I am writing data from a Spark dataframe in an Azure Databricks notebook into a dedicated Synapse pool. The problem is this takes an extremely long time given the small size of the data involved.
Read performance is fine; this syntax will happily read 100,000 rows in a couple of seconds. However, it takes about 25 minutes to write a similar number of rows (of 3 columns).
Are there any options I should be adding to improve write performance? Or is there a faster way of completing the same task?
(df.write
.format("jdbc")
.option("url", f"jdbc:sqlserver://<blahblah>.sql.azuresynapse.net:1433;database=<blahblah>;user=<blah>#<blahblah>;password={password};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;")
.option("dbtable", "dbo.newtablename")
.option("user", <username>)
.option("password", <password>)
.option("createTableColumnTypes", "column1 VARCHAR(36), column2 VARCHAR(1), column3 VARCHAR(1)")
.save()
)
I added the createTableColumnTypes option as the default column type did not permit a columnstore index to be created.
Other options I have reviewed in the documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) do not seem relevant.
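One thing worth trying (not part of the original question) is the Databricks Synapse connector used elsewhere on this page instead of the generic JDBC source: it stages the rows in Azure storage and loads them with PolyBase/COPY rather than issuing row-by-row inserts. A hedged sketch, assuming a storage container is available for tempDir and that the placeholder names are filled in:
# Assumes the storage account key (or a SAS) is configured in the Spark conf
# so the connector can forward it to Synapse; all <...> values are placeholders.
(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", f"jdbc:sqlserver://<blahblah>.sql.azuresynapse.net:1433;database=<blahblah>;user=<blah>@<blahblah>;password={password};encrypt=true")  # same JDBC URL as in the question
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.newtablename")
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
    .mode("append")
    .save())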

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data, this puts a lot of pressure on the data warehouse, and it is becoming less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark). First of all, what do you think about this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Process the SQL scripts that were previously done in Redshift
Transfer the newly created table from Spark to Redshift
Is this a use case that someone here has already put into production?
In your opinion, what would be the best way to interact with the Spark cluster: EmrAddStepsOperator vs PythonOperator + PySpark?
You can use one of two connectors:
spark-redshift connector: an open-source connector developed and maintained by Databricks
EMR spark-redshift connector: developed by AWS and based on the first one, but with some improvements (github).
To load data from Redshift into Spark, you can read the table and process it in Spark:
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "my_table") \
.option("tempdir", "s3a://path/for/temp/data") \
.load()
Or take advantage of Redshift for part of your processing by reading from a query result (you can filter, join, or aggregate your data in Redshift before loading it into Spark):
df = sql_context.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("query", "select x, count(*) my_table group by x") \
.option("tempdir", "s3a://path/for/temp/data") \
.load()
You can do whatever you want with the loaded dataframe, and you can store the result in another data store if needed. You can use the same connector to write the result (or any other dataframe) back to Redshift:
df.write \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "my_table_copy") \
.option("tempdir", "s3n://path/for/temp/data") \
.mode("error") \
.save()
P.S.: the connector is fully supported by Spark SQL, so you can add the dependencies to your EMR cluster, then use the SparkSqlOperator to extract, transform and re-load your Redshift tables (SQL syntax example), or the SparkSubmitOperator if you prefer Python/Scala/Java jobs.
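For the orchestration side, here is a hedged sketch of the SparkSubmitOperator route mentioned in the P.S., assuming an Airflow Spark connection named spark_default and a hypothetical PySpark script (containing the connector code above) uploaded to S3; the DAG id and paths are illustrative only:
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="redshift_spark_processing",   # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,               # trigger manually, or set a cron expression
    catchup=False,
) as dag:

    process_redshift_tables = SparkSubmitOperator(
        task_id="process_redshift_tables",
        application="s3://my-bucket/jobs/process_redshift.py",  # hypothetical job using the spark-redshift connector
        conn_id="spark_default",
        # Add the spark-redshift connector package/JAR matching your Spark version,
        # e.g. via the `packages` or `jars` argument.
    )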

Write data to specific partitions in Azure Dedicated SQL pool

At the moment, we are using the steps in the article below to do a full load of the data from one of our Spark data sources (a Delta Lake table) and write it to a table in SQL DW.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Specifically, the write is carried out using:
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.option("maxStrLength",4000).mode("overwrite").save()
Now, our source data, by virtue of being a Delta Lake table, is partitioned by countryid. And we would like to load/refresh only certain partitions in the SQL DWH, instead of the full drop-table-and-load (because we specify "overwrite") that is happening now. I tried adding an additional option (partitionBy, countryid) to the above script, but that doesn't seem to work.
Also the above article doesn't mention partitioning.
How do I work around this?
There might be better ways to do this, but this is how I achieved it. If the target Synapse table is partitioned, then we can leverage the "preActions" option provided by the Synapse connector to delete the existing data for that partition, and then append the new data pertaining to that partition (read as a dataframe from the source) instead of overwriting the whole table.
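As a hedged sketch of that approach: df_country stands for the source dataframe already filtered to the partition being refreshed, and <country-id>, the table name, and the connection values are placeholders; preActions runs the given SQL against Synapse before the data is appended.
# Delete the rows for the partition being refreshed, then append the new data.
(df_country.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "<your-table-name>")
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
    .option("preActions", "DELETE FROM <your-table-name> WHERE countryid = <country-id>")
    .option("maxStrLength", 4000)
    .mode("append")
    .save())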

Azure Event Hubs to Databricks, what happens to the dataframes in use

I've been developing a proof of concept on Azure Event Hubs, streaming JSON data to an Azure Databricks notebook using PySpark. Based on the examples I've seen, I've created my rough code as follows, taking the data from the Event Hub to the Delta table I'll be using as a destination:
from pyspark.sql.functions import col, to_date

connectionString = "My End Point"
ehConf = {'eventhubs.connectionString': connectionString}
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readEventStream = df.withColumn("body", df["body"].cast("string")) \
    .withColumn("date_only", to_date(col("enqueuedTime")))
readEventStream.writeStream.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/delta/testSink/streamprocess") \
.table("testSink")
After reading around and googling: what happens to the df and readEventStream dataframes? Will they just get bigger as they retain the data, or will they be emptied during normal processing? Or are they just a temporary store before the data is dumped to the Delta table? Is there a way of setting X amount of items streamed before writing out to the Delta table?
Thanks
I carefully reviewed the descriptions of the APIs you used against the official PySpark documentation for the pyspark.sql module, and I think the growing memory usage is caused by the table(tableName) function, which is designed for a DataFrame, not a streaming DataFrame.
So the table function creates a data structure in memory that the streaming data keeps filling.
I recommend using start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options) to complete the stream write operation first, and then reading the table back from Delta Lake. There also does not seem to be a way in PySpark to set X amount of items streamed before writing out to the Delta table.
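A minimal sketch of that suggestion, assuming /delta/testSink is a writable path (the path-based sink and the follow-up read are illustrative, not from the original code):
# Write the stream to a Delta path with start() ...
query = (readEventStream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/testSink/streamprocess")
    .start("/delta/testSink"))

# ... and read the sink back as an ordinary (batch) Delta table when needed.
testSink = spark.read.format("delta").load("/delta/testSink")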

Strategies to prevent duplicate data in Azure SQL Data Warehouse

At the moment I am setting up an Azure SQL Data Warehouse. I am using Databricks for the ETL process with JSON-files from Azure Blob Storage.
What is the best practice to make sure to not import duplicate dimensions or facts into the Azure SQL Data Warehouse?
This could happen for facts, e.g. in the case of an exception during the loading process. And for dimensions this could happen as well if I did not check which data already exists.
I am using the following code to import data into the data warehouse, and I found no "mode" which would only import data that does not already exist:
spark.conf.set(
"spark.sql.parquet.writeLegacyFormat",
"true")
(renamedColumnsDf.write
    .format("com.databricks.spark.sqldw")
    .option("url", sqlDwUrlSmall)
    .option("dbtable", "SampleTable")
    .option("forward_spark_azure_storage_credentials", "True")
    .option("tempdir", tempDir)
    .mode("overwrite")
    .save())
Ingest to a staging table, then CTAS to your fact table with a NOT EXISTS clause to eliminate duplicates.
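A hedged sketch of one way to wire that up with the same connector, using an INSERT ... WHERE NOT EXISTS in postActions rather than a full CTAS-and-swap; the staging table name and the key column are hypothetical:
# Overwrite a staging table, then let Synapse move only the new rows into the
# fact table after the load succeeds. "StagingSampleTable" and "FactKey" are
# hypothetical names for illustration.
(renamedColumnsDf.write
    .format("com.databricks.spark.sqldw")
    .option("url", sqlDwUrlSmall)
    .option("dbtable", "StagingSampleTable")
    .option("forward_spark_azure_storage_credentials", "True")
    .option("tempdir", tempDir)
    .option("postActions", """
        INSERT INTO SampleTable
        SELECT s.* FROM StagingSampleTable s
        WHERE NOT EXISTS (SELECT 1 FROM SampleTable t WHERE t.FactKey = s.FactKey)""")
    .mode("overwrite")
    .save())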
