At the moment, we are following the steps in the article below to do a full load of data from one of our Spark data sources (a Delta Lake table) and write it to a table in SQL DW.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Specifically, the write is carried out using:
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .option("maxStrLength", 4000) \
    .mode("overwrite") \
    .save()
Now, our source data, by virtue of being a Delta Lake table, is partitioned by countryid. We would like to load/refresh only certain partitions into the SQL DWH, instead of the full drop-and-reload (because we specify "overwrite") that happens now. I tried adding an additional option (partitionBy, countryid) to the above script, but that doesn't seem to work.
Also, the above article doesn't mention partitioning.
How do I work around this?
There might be better ways to do this, but this is how I achieved it. If the target Synapse table is partitioned, we can leverage the "preActions" option provided by the Synapse connector to delete the existing data for that partition, and then append the new data for that partition (read as a DataFrame from the source) instead of overwriting the whole table. A sketch of this is shown below.
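A minimal sketch of that approach, reusing the placeholders from the snippet above; country_id_to_refresh is a hypothetical value chosen for illustration:

country_id_to_refresh = 42  # hypothetical partition value to reload

(df.filter(f"countryid = {country_id_to_refresh}")
    .write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "<your-table-name>")
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>")
    .option("maxStrLength", 4000)
    # delete the existing rows for this partition in Synapse before the load runs
    .option("preActions", f"DELETE FROM <your-table-name> WHERE countryid = {country_id_to_refresh}")
    # append the refreshed partition instead of overwriting the whole table
    .mode("append")
    .save())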
I am trying to export a snapshot of a PostgreSQL database to Parquet files using Spark.
I am dumping each table in the database to a separate Parquet file.
tables_names = ["A", "B", "C", ...]

for table_name in tables_names:
    table = (spark.read
             .format("jdbc")
             .option("driver", driver)
             .option("url", url)
             .option("dbtable", table_name)
             .option("user", user)
             .load())
    table.write.mode("overwrite").saveAsTable(table_name)
The problem, however, is that I need the tables to be consistent with each other.
Ideally, the table loads should be executed in a single transaction so they see the same version of the database.
The only solution I can think of is to select all tables in a single query using UNION/JOIN, but then I would need to identify each table's columns, which is something I am trying to avoid.
Unless you force all future connections to the database (not the instance) to be read-only and terminate those in flight, by setting the PostgreSQL configuration parameter default_transaction_read_only to true, then no, you cannot get a consistent snapshot with the per-table approach in your code.
Note that a session can override the global setting.
That means your second option (a single query) will work thanks to MVCC, but it is not elegant, and how will it perform from a Spark context over JDBC?
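For reference, a minimal sketch of setting that parameter at the database level; it assumes psycopg2 and sufficient privileges, and the host, database name and credentials are placeholders:

import psycopg2

# Make all *future* sessions read-only; sessions already in flight must still be
# terminated, and a session can still override this with
# SET default_transaction_read_only = off.
conn = psycopg2.connect(host="pg-host", dbname="mydb", user="admin", password="secret")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("ALTER DATABASE mydb SET default_transaction_read_only = on;")
conn.close()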
Dataproc noob here again.
My main goal is to ingest tables from on-premise sources, store them as Parquet files in a Cloud Storage bucket, and create/update tables in BigQuery from those files. Following my previous post about Dataproc and Hudi configuration, I was able to deploy and ingest from on-premise sources via Dataproc/PySpark/Hudi and store the data in Cloud Storage.
The next question is about the 'upsert' operation in 'hudi_options' and how new results can be appended to the Parquet file in the Cloud Storage bucket. It is not clear to me whether you can update/change a Parquet file with Hudi.
I want to avoid deleting previous loads and only store one Parquet file per table.
Upsert code:
table_location = "gs://bucket/{}/".format(table_name)

updates = spark.read.format("jdbc") \
    .option("url", url) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("query", query) \
    .load()
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'a,b,c,d,e',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'x',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'path': table_location,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms'
}

updates.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save()
However, each run generates another Parquet file in Cloud Storage.
Is there anything I'm missing in the configuration?
Thanks!
Apache Hudi and the other open data lakehouse formats (Apache Iceberg and Delta) create data snapshots in order to bring ACID transactions to the Parquet format. In each commit, they read the existing data in the changed partitions, apply the changes in memory and temp files (appending new data, deleting or updating existing data), then re-write the whole data for those partitions. Later they delete the old files based on the cleaning policy, and until then you can use them for time-travel queries or for rollback.
I want to avoid deleting previous loads
As long as your OPERATION_OPT_KEY is not insert_overwrite_table or insert_overwrite, and your Spark save mode is not overwrite, don't worry about the previous loads.
and only store one Parquet file per table
If you want to compact the data and have a single big Parquet file per table, you need to increase hoodie.parquet.max.file.size, which is 120 MB by default, because Hudi splits Parquet files that exceed this size into chunks. A possible tweak is sketched below.
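For example, one could raise the file-size limits in the hudi_options dict from the question (the exact values here are illustrative, not recommendations):

hudi_options.update({
    # let a single parquet file grow to ~1 GB before Hudi splits it (default is ~120 MB)
    'hoodie.parquet.max.file.size': str(1024 * 1024 * 1024),
    # files smaller than this are considered "small" and get merged on later writes
    'hoodie.parquet.small.file.limit': str(512 * 1024 * 1024),
})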
We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data, this puts a lot of pressure on the data warehouse and it is becoming less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark). First of all, what do you think about this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Process the SQL scripts that were previously done in Redshift
Transfer the newly created table from Spark to Redshift
Is it a use case that someone here has already put in production?
What would, in your opinion, be the best way to interact with the Spark cluster? EmrAddStepsOperator vs PythonOperator + PySpark?
You can use one of two connectors:
spark-redshift connector: an open-source connector developed and maintained by Databricks
EMR spark-redshift connector: developed by AWS and based on the first one, but with some improvements (github).
To load data from Redshift into Spark, you can read the table and process it in Spark:
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
Or take advantage of Redshift for part of your processing by reading from a query result (you can filter, join, or aggregate your data in Redshift before loading it into Spark):
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
You can do what you want with the loaded dataframe, and you can store the result in another data store if needed. You can use the same connector to load the result (or any other dataframe) into Redshift:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .mode("error") \
    .save()
P.S.: the connector is fully supported by Spark SQL, so you can add the dependencies to your EMR cluster and then use the SparkSqlOperator to extract, transform, and re-load your Redshift tables (SQL syntax example), or the SparkSubmitOperator if you prefer Python/Scala/Java jobs. A minimal SparkSubmitOperator sketch follows.
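As an illustration only: the DAG id, connection id and application path below are placeholders, and it assumes the Spark provider package is installed in your MWAA environment and the connector jars are available on the cluster.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="redshift_offload",            # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = SparkSubmitOperator(
        task_id="transform_redshift_tables",
        application="s3://my-bucket/jobs/transform.py",  # placeholder path to the PySpark job sketched above
        conn_id="spark_default",                         # Airflow connection pointing at the Spark cluster
    )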
I've been developing a proof of concept on Azure Event Hubs, streaming JSON data to an Azure Databricks notebook using PySpark. Based on the examples I've seen, I've created my rough code as follows, taking the data from the event hub to the Delta table I'll be using as a destination:
from pyspark.sql.functions import col, to_date

connectionString = "My End Point"
ehConf = {'eventhubs.connectionString': connectionString}

# stream from Event Hubs
df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

readEventStream = df.withColumn("body", df["body"].cast("string")) \
    .withColumn("date_only", to_date(col("enqueuedTime")))

# write the stream to a Delta table
readEventStream.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .table("testSink")
After reading around and googling, what happens to the df and readEventStream DataFrames? Will they just get bigger as they retain the data, or will they be emptied as part of normal processing? Or are they just a temporary store before the data is dumped to the Delta table? Is there a way of setting X amount of items streamed before writing out to the Delta table?
Thanks
I carefully reviewed the description of the APIs you used in your code in the official PySpark documentation for the pyspark.sql module, and I think the growing memory usage is caused by the function table(tableName), which is intended for a DataFrame, not for a streaming DataFrame.
So the table function creates a data structure in memory to fill with the streaming data.
I recommend you use start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options) to complete the stream write operation first, and then read the table from Delta Lake again. There also does not seem to be a way in PySpark to set X amount of items streamed before writing out to the Delta table. A sketch of the suggested change is below.
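A minimal sketch of that change, reusing the stream from the question; the /delta/testSink data path is an assumption for where the table files should land:

# write the stream out with start(path) instead of table()
query = readEventStream.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .start("/delta/testSink")

# later, read the sink back from Delta Lake as a regular (batch) DataFrame
testSink = spark.read.format("delta").load("/delta/testSink")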
At the moment I am setting up an Azure SQL Data Warehouse. I am using Databricks for the ETL process with JSON files from Azure Blob Storage.
What is the best practice for making sure not to import duplicate dimensions or facts into the Azure SQL Data Warehouse?
This could happen for facts, e.g. in the case of an exception during the loading process, and for dimensions as well if I did not check which data already exists.
I am using the following code to import data into the data warehouse, and I found no "mode" which would only import data that does not already exist:
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

(renamedColumnsDf.write
    .format("com.databricks.spark.sqldw")
    .option("url", sqlDwUrlSmall)
    .option("dbtable", "SampleTable")
    .option("forward_spark_azure_storage_credentials", "True")
    .option("tempdir", tempDir)
    .mode("overwrite")
    .save())
Ingest to a staging table, then CTAS to your fact table with a NOT EXISTS clause to eliminate duplicates.
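A possible sketch of that pattern with the Synapse connector, reusing the variables from the question and its postActions option to run the de-duplicating load right after the staging write. The staging table and key column names are hypothetical, and an INSERT ... WHERE NOT EXISTS stands in for the CTAS here:

# SQL executed in Synapse after the staging table is (re)loaded;
# only rows whose key is not already in the fact table are copied over
dedup_sql = """
INSERT INTO SampleTable
SELECT s.*
FROM SampleTable_Staging s
WHERE NOT EXISTS (
    SELECT 1 FROM SampleTable t WHERE t.Id = s.Id
)
"""

(renamedColumnsDf.write
    .format("com.databricks.spark.sqldw")
    .option("url", sqlDwUrlSmall)
    .option("dbtable", "SampleTable_Staging")   # hypothetical staging table
    .option("forward_spark_azure_storage_credentials", "True")
    .option("tempdir", tempDir)
    .option("postActions", dedup_sql)
    .mode("overwrite")
    .save())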