Databricks notebooks lineage in Azure Purview - databricks

If I read a file from ADLS into a PySpark DataFrame and write it back to another ADLS folder in a different file format, will that lineage be captured in the Hive metastore? Can lineage be shown for this kind of operation?

Currently this lineage won't show up out of the box. However, Purview uses Apache Atlas behind the scenes, so you can probably capture this lineage using the API.
Here's an example of where Spline was used to track lineage from notebooks:
https://intellishore.dk/data-lineage-from-databricks-to-azure-purview/
This article talks about how to get started with the Purview REST API:
https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058
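To make the "capture it through the API" idea more concrete, here is a minimal sketch of the kind of payload the Atlas v2 bulk entity endpoint (`/api/atlas/v2/entity/bulk`) accepts: a single `Process` entity whose `inputs` and `outputs` point at the source and target folders. The function name, the `notebook://` qualified-name scheme, the storage URLs, and the use of `azure_datalake_gen2_path` as the dataset type are assumptions for illustration. Check the type names and qualified names your own Purview account uses, and note that authentication and the actual HTTP call are left out.

```python
import json


def build_lineage_payload(process_name, source_qn, sink_qn,
                          dataset_type="azure_datalake_gen2_path"):
    """Build an Atlas v2 bulk-entity payload: one Process linking source to sink."""

    def ref(qualified_name):
        # Reference an already-catalogued entity by its unique qualifiedName.
        return {
            "typeName": dataset_type,
            "uniqueAttributes": {"qualifiedName": qualified_name},
        }

    process = {
        "typeName": "Process",
        "guid": "-1",  # a negative GUID marks an entity that does not exist yet
        "attributes": {
            "name": process_name,
            # Hypothetical naming scheme; any stable, unique string works.
            "qualifiedName": f"notebook://{process_name}",
            "inputs": [ref(source_qn)],
            "outputs": [ref(sink_qn)],
        },
    }
    return {"entities": [process]}


payload = build_lineage_payload(
    "csv_to_parquet",
    "https://myaccount.dfs.core.windows.net/raw/input/",       # placeholder
    "https://myaccount.dfs.core.windows.net/curated/output/",  # placeholder
)
print(json.dumps(payload, indent=2))
```

The notebook itself would have to build and send such a payload after each write, which is the manual work that tools like Spline automate.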

You can use the OpenLineage based Databricks to Purview Solution Accelerator to ingest the lineage provided by Databricks. By deploying the solution accelerator, you'll have a set of Azure Functions and a Databricks cluster that can extract the logical plan from a Databricks notebook / job and transform it automatically to Apache Atlas / Microsoft Purview entities.
Supports table level lineage from Spark Notebooks and jobs for the following data sources:
Azure SQL
Azure Synapse Analytics
Azure Data Lake Gen 2
Azure Blob Storage
Delta Lake
Supports Spark 3.1 and 3.0 (Interactive and Job clusters) / Spark 2.x (Job clusters)
Databricks Runtimes between 6.4 and 10.3 are currently supported
Can be configured per cluster or for all clusters as a global configuration
Once configured, does not require any code changes to notebooks or jobs

Related

Azure synapse equivalent commands

I am looking for the equivalents of the commands below in an Azure Synapse Analytics notebook. These are from Databricks:
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
These are some Delta cache and auto optimization features which are applicable to Databricks only. Databricks is code oriented while Synapse allows UI based analytics.
Synapse is a separate data analytics service provided by Azure with a different feature set. Databricks and Synapse do not necessarily share the same features.
Synapse Apache Spark allows you to optimize components like data serialization, joins, shuffle, job execution, etc. These components are different from those in Databricks, but the objectives are quite similar.
You can learn more in the article Optimize Apache Spark jobs in Azure Synapse Analytics.
Got the solution
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1 * 1024 * 1024 * 1024)
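The value passed above is a byte count: 1 * 1024 * 1024 * 1024 is 1 GiB, far above Spark's default broadcast threshold of 10 MB. A small sketch of the arithmetic follows; the helper name `gib_to_bytes` is made up for illustration, and the `spark.conf.set` call is left as a comment because it needs a live Spark session.

```python
def gib_to_bytes(gib: int) -> int:
    """Convert gibibytes to bytes (1 GiB = 1024**3 bytes)."""
    return gib * 1024 ** 3


threshold = gib_to_bytes(1)
print(threshold)  # 1073741824

# In a notebook where `spark` is already defined:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", threshold)
```

Note that this setting tunes broadcast joins. It is not a counterpart of the Delta cache or auto-optimize flags from the question.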

Databricks + ADF + ADLS2 + Hive = Azure Synapse

I have no experience with Azure Synapse, but my understanding is that it is the same as Databricks, ADF, ADLS2 and Hive in SQL DWH, all together in one workspace under a different name.
Am I wrong?
Yes, in many contexts Azure Synapse and Databricks provide the same big data analytics approach, but there are also a few differences between these services.
With the new functionality in Synapse, we now see some features similar to those in Databricks (e.g. Spark, Delta), which raises the question of how Synapse compares to Databricks and when to use which.
Yes, both have Spark but…
Databricks
has a proprietary data processing engine (Databricks Runtime) built on a highly optimized version of Apache Spark offering 50x performance
already has support for Spark 3.0
allows users to opt for GPU enabled clusters and choose between standard and high-concurrency cluster mode
Synapse
Open-source Apache Spark (thus not including all features of Databricks Runtime)
has built-in support for .NET for Spark applications
Yes, both have notebooks
Synapse
Nteract Notebooks
has co-authoring of Notebooks, but one person needs to save the Notebook before another person sees the change
doesn’t have automated versioning
Databricks
Databricks Notebooks
has real-time co-authoring (both authors see the changes in real time)
has automated versioning
Yes, both can access data from a data lake
Synapse
When creating Synapse, you can select a data lake which will be your primary data lake (can query it directly from the scripts and notebooks)
Databricks
You need to mount a data lake before using it
Yes, both leverage Delta
Synapse
Delta Lake is open source
Databricks
Has Databricks Delta which is built on the open source but offers some extra optimizations
No, they are not the same
Synapse
Has both a traditional SQL engine (to fit the traditional BI developers) as well as a Spark engine (to fit data scientists, analysts & engineers)
Is a data warehouse (i.e. Synapse Analytics) + an interface tool (i.e. Synapse Studio)
Databricks
Is not a data warehouse tool but rather a Spark-based notebook tool
Has a focus on Spark, Delta Engine, MLflow and MLR
No, they don’t offer the same developer experience
Synapse
Currently offers a Spark development experience only through Synapse Studio (not through local IDEs)
Doesn’t yet have Git integrated within the Synapse Studio notebooks
Databricks
Offers a developer experience within the Databricks UI, through Databricks Connect (i.e. remote connect from Visual Studio Code, PyCharm, etc.) and soon through Jupyter & RStudio UI within Databricks
See also: When to use Synapse and when Databricks?
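As an illustration of the "you need to mount a data lake before using it" point for Databricks, here is a minimal sketch of the settings usually passed to `dbutils.fs.mount` for ADLS Gen2 with a service principal. The helper names, the container and account names, and the mount point are placeholders, and the secret would normally come from a secret scope rather than a literal. The mount call is left as a comment because `dbutils` only exists inside a Databricks notebook.

```python
def adls_oauth_configs(client_id, client_secret, tenant_id):
    """Hadoop ABFS settings for OAuth client-credentials access to ADLS Gen2."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_source(container, account):
    """Root URI of a container in an ADLS Gen2 storage account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/"


configs = adls_oauth_configs("<application-id>", "<client-secret>", "<tenant-id>")
source = abfss_source("raw", "myaccount")  # placeholder names

# Inside a Databricks notebook, where `dbutils` is predefined:
# dbutils.fs.mount(source=source, mount_point="/mnt/raw", extra_configs=configs)
```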

Does Azure Databricks and Delta Layer make it a Lakehouse?

Even after going through many resources, I have failed to understand what constitutes a lakehouse, hence my question below.
If we have Azure Gen 2 Storage, ADF, and Azure Databricks with the possibility of converting the incoming CSV files into Delta tables, can that be called a "Lakehouse" architecture, or is it called a "Delta Lake"?
Or is it the "SQL analytics" engine over and above the Delta Lake layer that makes it a "Lakehouse"?
Please clarify.
At a high level, a Lakehouse must have the following properties:
1. Open direct access data formats (Apache Parquet, Delta Lake etc.)
2. First class support for machine learning and data science workloads
3. State of the art performance
Databricks is the first Lakehouse because it meets the above three properties. Specifically, if you are using Databricks with ADLS and converting all your data (JSON, CSV, Parquet, messages etc.) into Delta tables that are available within Databricks, then that is the making of a Lakehouse, but it still needs to be built and supported. The Databricks platform satisfies points 2 and 3 above, and Delta Lake satisfies points 1 and 3 (performance relies on both the engine and the storage, which is why 3 is mentioned twice).
Leveraging Databricks and accessing data stored in Delta is a Lakehouse. By adding Databricks SQL (formerly SQL Analytics) we allow more users to access and use the Lakehouse. In Databricks SQL, users are using the same compute and data as the data engineer does in Databricks; they just have a different UI that they are familiar with. Additionally, Databricks SQL is optimized for SQL and BI workloads, while the notebook environment is better for engineering and data science.
As a fun read, you should check out the Lakehouse whitepaper.
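The "converting incoming CSV files into Delta tables" step from the question can be sketched roughly as follows. This is only an outline under assumptions: the function name and paths are hypothetical, `spark` is the session a Databricks notebook already provides, and the header and schema-inference options depend on what the files actually look like.

```python
def csv_to_delta(spark, csv_path, delta_path):
    """Read CSV files and rewrite them as a Delta table at delta_path."""
    df = (spark.read
          .option("header", "true")       # assumes the files carry a header row
          .option("inferSchema", "true")  # convenient, but costs an extra pass
          .csv(csv_path))
    # Overwrite keeps the sketch idempotent; a daily load would more likely append.
    df.write.format("delta").mode("overwrite").save(delta_path)
    return df


# In a Databricks notebook (paths are placeholders):
# csv_to_delta(spark, "/mnt/raw/incoming/", "/mnt/delta/incoming/")
```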

How to connect Spark Structured Streaming to blob/file creation events from Azure Data Lake Storage Gen2 or Blob Storage

I am new to Spark Structured Streaming and its concepts. I was reading through the documentation for Azure HDInsight clusters here, and it mentions that structured streaming applications run on the HDInsight cluster and connect to streaming data from ".. Azure Storage, or Azure Data Lake Storage". I was looking at how to get started with streaming that listens to new-file-created events from the storage or ADLS. The Spark documentation does provide an example, but I am looking for how to tie streaming to the blob/file creation event, so that I can store the file content in a queue from my Spark job. It would be great if anyone could help me out with this.
Happy to help with this, but can you be more precise about the requirement? Yes, you can run Spark Structured Streaming jobs on Azure HDInsight. Basically, mount the Azure Blob Storage to the cluster and then you can read the data available in the blob directly.
val df = spark.read.option("multiLine", true).json("PATH OF BLOB")
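The line above is a one-off batch read. For the streaming case the question asks about, a rough PySpark sketch would use the file source of Structured Streaming. As far as I know, that source discovers new files by listing the input directory on each trigger rather than by subscribing to storage events. The helper names, container, account, and paths are placeholders, and the console sink is only for experimenting.

```python
def abfss_uri(container, account, path=""):
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def start_file_stream(spark, input_uri, checkpoint_uri, schema_ddl):
    """Start a streaming query that picks up JSON files as they land in input_uri."""
    stream = (spark.readStream
              .schema(schema_ddl)            # file sources need an explicit schema
              .option("multiLine", "true")   # mirrors the batch read above
              .json(input_uri))
    return (stream.writeStream
            .format("console")               # swap for a real sink later
            .option("checkpointLocation", checkpoint_uri)
            .start())


input_uri = abfss_uri("landing", "myaccount", "/incoming/")
print(input_uri)

# With a live session (names are placeholders):
# query = start_file_stream(spark, input_uri,
#                           abfss_uri("landing", "myaccount", "checkpoints/incoming"),
#                           "id STRING, body STRING")
```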
Support for Azure Data Lake Gen2 (ADL2) has been released for Hadoop 3.2 only. Open-source Spark 2.4.x supports Hadoop 2.7 and, if you compile it yourself, Hadoop 3.1. Spark 3 will support Hadoop 3.2, but it is not released yet (only a preview release).
Databricks offers support for ADL2 natively.
My solution to tackle this problem was to manually patch and compile Spark 2.4.4 with Hadoop 3.2 to be able to use the ADL2 libs from Microsoft.

batch processing in azure

We are planning to do batch processing on a daily basis. We generate 1 GB of CSV files every day and will manually put them into Azure Data Lake Store. I have read the Microsoft Azure documentation on batch processing and have decided to use Spark for it. My question is: after we transform the data using RDDs/DataFrames, what would be the next step? How can we visualize the data? Since this process is supposed to run every day, once the data transformation is done in Spark, do we need to push the data to some kind of data store such as Hive, HDFS or Cosmos DB before we can visualize it?
There are several options for doing this on Azure. It really depends on your requirements (e.g. number of users, needed visualizations, etc.). Some examples:
Running Spark on Azure Databricks, you could use the Notebook capabilities to visualize your data
Use HDInsight with Jupyter or Zeppelin Notebooks
Define Spark tables on Azure Databricks and visualize them with Power BI
Load the data with Azure Data Factory V2 to Azure SQL DB or Azure SQL Data Warehouse and visualize it with Power BI.
For time-series data you could push the data via Spark to Azure Event Hubs (see the example notebook with an Event Hubs sink in the following documentation) and consume it via Azure Time Series Insights. If you have an event data stream, this could also replace your batch-oriented architecture in the future. Parquet files will be used by Azure Time Series Insights as long-term storage (see the following link). For Spark, also have a look at the Time Series Package, which adds some time series capabilities to Spark.
