I have found a ton of examples showing how to merge data using Databricks Delta Table Merge to load data into a SQL DB. However, I'm trying to find examples where trying to load data into a SQL DB without Databricks Delta Merge fails.
This is because I'm having trouble getting my head around a situation where I should be using Databricks Delta Merge.
Therefore, can someone point me to a link showing where loading data into a SQL DB from Databricks would fail without Databricks Delta Merge, or alternatively the steps I would have to take to merge without Databricks Delta Lake Merge?
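To clarify what I mean by merging without Delta Merge, I assume it would be something like the hand-rolled upsert sketched below, where the paths, the key column id, and the dataframe names are just illustrative guesses:
# Hypothetical upsert without MERGE: keep the target rows that are untouched by the
# incoming batch, then rewrite the table with both parts.
target_df = spark.read.format("delta").load("/delta/target")    # existing table (placeholder path)
updates_df = spark.read.parquet("/staging/updates")             # incoming rows (placeholder path)

unchanged = target_df.join(updates_df, on="id", how="left_anti")  # rows with no incoming match
merged = unchanged.unionByName(updates_df)                        # assumes identical schemas

# Full rewrite of the target instead of a row-level, transactional upsert.
merged.write.format("delta").mode("overwrite").save("/delta/target")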
Related
I have a PySpark streaming pipeline which reads data from a Kafka topic; the data goes through various transformations and finally gets merged into a Databricks delta table.
In the beginning we were loading data into the delta table by using the merge function as given below.
This incoming dataframe inc_df had data for all partitions.
MERGE INTO main_db.main_delta_table main_dt USING inc_df df ON
main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND
main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND
main_dt.rule_read_start=df.rule_read_start AND
main_dt.company = df.company
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
We were executing the above query on table level.
I have given a very basic diagram of the process in the image below.
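For context, a merge like this is typically driven from the streaming query via foreachBatch, roughly as sketched below; the broker, topic, checkpoint path and transformation step are placeholders rather than my exact pipeline code:
from delta.tables import DeltaTable

def merge_batch(inc_df, batch_id):
    # Apply the same MERGE for every micro-batch of transformed Kafka records.
    main_dt = DeltaTable.forName(spark, "main_db.main_delta_table")
    (main_dt.alias("main_dt")
        .merge(inc_df.alias("df"),
               "main_dt.continent = df.continent AND main_dt.str_id = df.str_id AND "
               "main_dt.rule_date = df.rule_date AND main_dt.rule_id = df.rule_id AND "
               "main_dt.rule_read_start = df.rule_read_start AND main_dt.company = df.company")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "some_topic")                   # placeholder
    .load()
    # ... parsing and the various transformations go here ...
    .writeStream
    .foreachBatch(merge_batch)
    .option("checkpointLocation", "/checkpoints/main_delta_merge")  # placeholder
    .start())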
But my delta table is partitioned on continent and year.
For example, this is what my partitioned delta table looks like.
So I tried implementing the merge at partition level and running the merge activity on multiple partitions in parallel.
i.e. I have created separate pipelines with partition-level filters in the queries. The image can be seen below.
MERGE INTO main_db.main_delta_table main_dt USING inc_df df ON
main_dt.continent in ('AFRICA') AND main_dt.year in ('202301') AND
main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND
main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND
main_dt.rule_read_start=df.rule_read_start AND
main_dt.company = df.company
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
But I am seeing an error with concurrency.
- com.databricks.sql.transaction.tahoe.ConcurrentAppendException: Files were added to partition [continent=AFRICA, year=2021] by a concurrent update. Please try the operation again.
I understand that the error is telling me that it cannot update files concurrently.
But I have a huge volume of data in production, and I don't want to perform the merge at table level, where there are almost 1 billion records, without proper filters.
Trial 2:
As an alternative approach, I saved my incremental dataframe to an S3 bucket (like a staging dir) and ended my streaming pipeline there.
Then I have a separate PySpark job that reads data from that S3 staging dir and performs the merge into my main delta table, once again at partition level (I have specified the partitions in those jobs as filters).
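To make this concrete, one of those partition-scoped batch jobs looks roughly like the sketch below; the staging path, partition values and table name are placeholders rather than my real configuration:
from delta.tables import DeltaTable

# Read only this job's partition from the staged increments.
staged_df = (spark.read.parquet("s3://my-bucket/staging/")          # placeholder staging dir
             .filter("continent = 'AFRICA' AND year = '202301'"))   # placeholder partition

main_dt = DeltaTable.forName(spark, "main_db.main_delta_table")
(main_dt.alias("main_dt")
    .merge(staged_df.alias("df"),
           "main_dt.continent = df.continent AND main_dt.str_id = df.str_id AND "
           "main_dt.rule_date = df.rule_date AND main_dt.rule_id = df.rule_id AND "
           "main_dt.rule_read_start = df.rule_read_start AND main_dt.company = df.company")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())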
But I am facing the same exception/error there as well.
Could anyone let me know how I can design and optimise my streaming pipeline to merge data into the delta table at partition level, with multiple jobs running in parallel (each job running on an individual partition)?
Trial 3:
I also made another attempt using a different approach, as mentioned in the link and the ConcurrentAppendException section of that page.
from delta.tables import DeltaTable

base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
(base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company, continent='Africa'")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
and
base_delta = DeltaTable.forPath(spark, 's3://PATH_OF_BASE_DELTA_TABLE')
(base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    condition="main_dt.continent=df.continent AND main_dt.str_id=df.str_id AND main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company, continent='ASIA'")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
I ran the above merge operations in two separate pipelines.
But I am still facing the same issue.
In your trial 3, you need to change the merge condition.
Instead of
condition="main_dt.continent=df.continent AND [...]"
it should be
condition="main_dt.continent='Africa' AND [...]"
You should also delete the trailing , continent='Africa' (and , continent='ASIA') from the end of the condition; a comma-separated clause is not valid there.
Here is the documentation for reference.
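Putting that together, the corrected merge for the Africa pipeline would look roughly like this, reusing your table path and dataframe names:
(base_delta.alias("main_dt").merge(
    source=final_incremental_df.alias("df"),
    # Literal partition predicate, so concurrent merges on different partitions don't conflict.
    condition="main_dt.continent='Africa' AND main_dt.str_id=df.str_id AND "
              "main_dt.rule_date=df.rule_date AND main_dt.rule_id=df.rule_id AND "
              "main_dt.rule_read_start=df.rule_read_start AND main_dt.company = df.company")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())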
I have a test case which generates parquet & delta data into storage. I can then use Azure Databricks to load this into a dataframe and query it.
However, if I download this parquet & delta data, remove it from storage, and re-upload the same data back into its original location, I can no longer run the same query on the same data; the error '...is not a Delta table' is the response. Is there a way to achieve this?
Overall, I'm trying to set up a system to run a test, where the test relies on this parquet & delta data already being present. I was hoping to be able to just upload this data directly before running the test (rather than running a system to generate this initial data); however, as noted here, a straight upload of archived data doesn't initially work.
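For reference, the read that fails after the re-upload is of this form (the path is a placeholder, not the real storage location):
# After the re-upload this raises "... is not a Delta table", which suggests the table's
# _delta_log folder is no longer seen as valid at that location.
df = spark.read.format("delta").load("abfss://container@account.dfs.core.windows.net/test-data/")
df.count()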
I am very new to Spark as well as to the Data Factory resource in Azure.
I would like to use Spark on Azure to load and transform 2 files containing data.
Here is what I would like to achieve in more details:
Use a data factory on Azure and create a pipeline and use a Spark activity
Load 2 files containing data in JSONL format
Transform them by doing a "JOIN" on a given field that is existing in both
Output a new file containing the merged data
Can anyone help me achieve that?
As of now, I don't even understand how to load 2 files to work with in a Spark activity from a data factory pipeline...
The two easiest ways to use Spark in an Azure Data Factory (ADF) pipeline are either via a Databricks cluster and the Databricks activity, or via an Azure Synapse Analytics workspace with its built-in Spark notebooks and a Synapse pipeline (which is mostly ADF under the hood).
I was easily able to load a json lines file (using this example) in a Synapse notebook using the following python:
%%pyspark
df1=spark.read.load(['abfss://someLake@someStorage.dfs.core.windows.net/raw/json lines example.jsonl'],format='json')
df1.createOrReplaceTempView("main")
df2=spark.read.load(['abfss://someLake@someStorage.dfs.core.windows.net/raw/json lines example 2.jsonl'],format='json')
df2.createOrReplaceTempView("ages")
You can now either join them in SQL, e.g.:
%%sql
-- Join in SQL
SELECT m.name, a.age
FROM main m
INNER JOIN ages a ON m.name = a.name
Or join them in Python:
df3 = df1.join(df2, on="name")
df3.show()
My results:
I haven't tested this in Databricks but I imagine it's similar. You might have to set up some more permissions there, as the Synapse integration is slightly easier.
You could also look at Mapping Data Flows in ADF which uses Spark clusters under the hood and offers a low-code / GUI-based experience.
There is literally a JOIN transformation built into ADF in the Data Flow activity that executes on Spark for you without needing to know anything about clusters or Spark programming :)
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-join
I have an ETL pipeline where data comes from Redshift: I read the data into (py)spark dataframes, perform calculations, and dump the result back to a target in Redshift. So the flow is Redshift source schema --> Spark 3.0 --> Redshift target schema. This is done in EMR using the spark-redshift library provided by Databricks. But my data has millions of records, and doing a full load every time is not a good option.
How can I perform incremental loads/upserts with the spark-redshift library? The option I wanted to go with is Delta Lake (open source and guarantees ACID), but we cannot simply read and write Delta files to Redshift Spectrum using the Delta Lake integration.
Please guide me on how I can achieve this, and whether there are any alternatives.
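For reference, the current full-load flow looks roughly like the sketch below; the format name matches the Databricks spark-redshift connector, and the JDBC URL, IAM role, temp dir and table names are placeholders, not my real settings:
# Read the source table from Redshift into a Spark dataframe.
source_df = (spark.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://cluster:5439/db?user=USER&password=PASS")  # placeholder
    .option("dbtable", "source_schema.some_table")                             # placeholder
    .option("tempdir", "s3://my-bucket/redshift-temp/")                        # placeholder
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy")    # placeholder
    .load())

transformed_df = source_df  # ... the actual calculations go here ...

# Dump the full result back to the target schema in Redshift.
(transformed_df.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://cluster:5439/db?user=USER&password=PASS")
    .option("dbtable", "target_schema.some_table")
    .option("tempdir", "s3://my-bucket/redshift-temp/")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy")
    .mode("overwrite")
    .save())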
I have saved one dataframe in my delta lake, below is the command:
df2.write.format("delta").mode("overwrite").partitionBy("updated_date").save("/delta/userdata/")
Also I can load and see the delta lake /userdata:
dfres=spark.read.format("delta").load("/delta/userdata")
But here I have one doubt: when I am moving several parquet files from blob to the delta lake and creating dataframes, how would someone else know which files I have moved and how can they work on those deltas? Is there any command to list all the dataframes in the delta lake in Databricks?
Break down the problem into:
Find the paths of all tables you want to check. Managed tables in the default location are stored at spark.conf.get("spark.sql.warehouse.dir") + s"/$tableName". If you have external tables, it is better to use catalog.listTables() followed by catalog.getTableMetadata(ident).location.getPath. Any other paths can be used directly.
Determine which paths belong to Delta tables using DeltaTable.isDeltaTable(path).
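A minimal PySpark sketch of that approach, assuming managed tables sitting in the default warehouse location:
from delta.tables import DeltaTable

warehouse_dir = spark.conf.get("spark.sql.warehouse.dir")

# Walk the catalog of the current database and report which tables are backed by Delta.
for table in spark.catalog.listTables():
    path = f"{warehouse_dir}/{table.name}"  # default location for managed tables only
    if DeltaTable.isDeltaTable(spark, path):
        print(f"{table.name} is a Delta table at {path}")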
Hope this helps.