Databricks + ADF + ADLS2 + Hive = Azure Synapse

I have no experience with Azure Synapse, but my understanding is that it is the same as Databricks, ADF, ADLS2, and Hive in SQL DWH, all together in one workspace under a different name.
Am I wrong?

Yes, in many contexts Azure Synapse and Databricks provide the same Big Data analytics approach, but there are also a few differences between these services.
With the new functionality in Synapse, we now see some functionality similar to Databricks (e.g. Spark, Delta), which raises the question of how Synapse compares to Databricks and when to use which.
Yes, both have Spark but…
Databricks
- Has a proprietary data processing engine (Databricks Runtime) built on a highly optimized version of Apache Spark, offering up to 50x performance
- Already has support for Spark 3.0
- Allows users to opt for GPU-enabled clusters and to choose between standard and high-concurrency cluster modes
Synapse
- Uses open-source Apache Spark (thus not including all features of Databricks Runtime)
- Has built-in support for .NET for Spark applications
Yes, both have notebooks
Synapse
- Uses Nteract notebooks
- Has co-authoring of notebooks, but one person needs to save the notebook before another person sees the change
- Doesn't have automated versioning
Databricks
- Uses Databricks notebooks
- Has real-time co-authoring (both authors see the changes in real time)
- Has automated versioning
Yes, both can access data from a data lake
Synapse
- When creating Synapse, you can select a data lake that becomes your primary data lake (you can query it directly from scripts and notebooks)
Databricks
- Requires you to mount a data lake before using it (see the sketch below)
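For illustration, a minimal sketch of what that mount step can look like in a Databricks notebook, using OAuth with a service principal. The storage account, container, secret scope, and key names below are hypothetical placeholders:

```python
# Minimal sketch: mount an ADLS Gen2 container in Databricks (runs inside a
# Databricks notebook, where `dbutils` is available). All account, container,
# and secret names are hypothetical.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)

# After mounting, the lake is addressable like a local path:
df = spark.read.parquet("/mnt/datalake/some/folder")
```

In Synapse, by comparison, the primary data lake is queryable directly via its abfss:// path from notebooks, with no mount step.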
Yes, both leverage Delta
Synapse
- Delta Lake is open source
Databricks
- Has Databricks Delta, which is built on the open-source version but offers some extra optimizations (see the sketch below)
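One concrete example, as a hedged sketch: at the time of writing, OPTIMIZE with Z-ordering was a Databricks Delta feature not available in open-source Delta Lake (table and column names below are hypothetical):

```python
# Hedged sketch: compact small files and co-locate related data by column.
# Table and column names are hypothetical.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```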
No, they are not the same
Synapse
- Has both a traditional SQL engine (to fit traditional BI developers) and a Spark engine (to fit data scientists, analysts & engineers)
- Is a data warehouse (i.e. Synapse Analytics) + an interface tool (i.e. Synapse Studio)
Databricks
- Is not a data warehouse tool but rather a Spark-based notebook tool
- Has a focus on Spark, Delta Engine, MLflow and MLR
No, they don’t offer the same developer experience
Synapse
- For Spark development, currently offers a developer experience only through Synapse Studio (not through local IDEs)
- Doesn't yet have Git integration within the Synapse Studio notebooks
Databricks
- Offers a developer experience within the Databricks UI and through Databricks Connect (i.e. remote connect from Visual Studio Code, PyCharm, etc.; see the sketch below), with Jupyter & RStudio UIs within Databricks coming soon
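As an illustration of the Databricks Connect route: the (legacy) package makes a remote cluster look like a local Spark session, so IDE code stays ordinary PySpark. A minimal sketch, assuming `databricks-connect` has already been installed and configured against your workspace:

```python
# Minimal sketch: with databricks-connect installed and configured
# (`pip install databricks-connect`, then `databricks-connect configure`),
# the ordinary SparkSession API transparently targets the remote cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # connects to the configured remote cluster

# This count executes on the Databricks cluster, not locally.
print(spark.range(1000).count())
```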
Also check "When to use Synapse and when Databricks?".

Related

Can I use Azure Synapse functionality outside the Azure environment?

Forum,
I am currently looking into Azure Synapse as an option for migrating our on-prem data architecture. I am excited by the functionality it offers: SQL pools, Spark pools, and the accompanying notebooks. I get that Synapse can function as an all-in-one data platform, where my data scientists and data analysts can use its functionality to deliver insights at will. However, a large part of the work my team does is creating data products.
We currently have a Kubernetes cluster with several stand-alone APIs that perform data-science operations in the larger whole of our software. They can be thought of as microservices. Most of the ETL is done in our SQL Server, and the microservices in our K8s cluster (usually Python + some Python packages + FastAPI) typically get the required data from our SQL Server through some SQL query with an ODBC connector.
Now my question is: how suitable is Synapse for such an architecture? Can I call upon the SQL pool or Spark pool to do the heavy data lifting from outside the Azure environment, say from a Kubernetes pod?
Unfortunately, you can't integrate Azure Synapse Analytics with Kubernetes Services.
While Synapse SQL helps perform SQL queries, Apache Spark executes batch/stream processing on Big Data. The SQL pool is used to work with data stored in a dedicated SQL pool, while Spark SQL can be integrated with existing data preparation or data science projects that you may hold in Azure Databricks or Azure Machine Learning Services.
Also, as per this third-party document, Azure Synapse Analytics can't integrate with Kubernetes Services.
As a workaround, you can copy/move your data from Kubernetes to Azure services like a dedicated SQL pool, Azure Blob Storage, or Azure Data Lake Storage, and then integrate it with a Synapse pipeline or Spark pool.
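For what it's worth, a dedicated SQL pool speaks the same TDS protocol as SQL Server, so once data has been copied there, a microservice in a pod could keep using the familiar ODBC pattern. A hedged sketch (server, database, table, and credentials are hypothetical; the Microsoft ODBC driver must be installed in the pod image):

```python
# Sketch: query a Synapse dedicated SQL pool from outside Azure with pyodbc,
# the same pattern the existing microservices use against SQL Server.
# All names and credentials below are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;"  # dedicated SQL pool endpoint
    "DATABASE=mydedicatedpool;"
    "UID=sqladminuser;"
    "PWD=<password>;"
    "Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT TOP 10 * FROM dbo.sales")
for row in cursor.fetchall():
    print(row)
```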

Databricks notebooks lineage in Azure Purview

If I read a file from ADLS into a PySpark DataFrame and write it back to another ADLS folder in a different file format, will that lineage be captured in the Hive metastore? Can lineage be shown for this kind of operation?
Currently this lineage won't show up out of the box; however, Purview uses Apache Atlas behind the scenes, so you can probably capture this lineage using the API.
Here's an example of where Spline was used to track lineage from notebooks:
https://intellishore.dk/data-lineage-from-databricks-to-azure-purview/
This article talks about how to get started with the Purview REST API:
https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058
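To make the API route concrete, here is a hedged sketch of calling Purview's Atlas-compatible REST endpoint from Python with a service principal. The account name, tenant, and entity GUID are placeholders; the lineage endpoint shown follows the standard Atlas v2 layout:

```python
# Hedged sketch: query Purview's Atlas v2 REST API with a service principal.
# Account name, tenant/client IDs, and entity GUID are hypothetical.
import requests
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
token = credential.get_token("https://purview.azure.net/.default").token

endpoint = "https://<purview-account>.purview.azure.com/catalog/api/atlas/v2"
headers = {"Authorization": f"Bearer {token}"}

# Fetch lineage for a known entity GUID (e.g. a registered ADLS path).
resp = requests.get(f"{endpoint}/lineage/<entity-guid>", headers=headers)
resp.raise_for_status()
print(resp.json())
```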
You can use the OpenLineage based Databricks to Purview Solution Accelerator to ingest the lineage provided by Databricks. By deploying the solution accelerator, you'll have a set of Azure Functions and a Databricks cluster that can extract the logical plan from a Databricks notebook / job and transform it automatically to Apache Atlas / Microsoft Purview entities.
Supports table-level lineage from Spark notebooks and jobs for the following data sources:
- Azure SQL
- Azure Synapse Analytics
- Azure Data Lake Gen 2
- Azure Blob Storage
- Delta Lake
Additionally, it:
- Supports Spark 3.1 and 3.0 (interactive and job clusters) / Spark 2.x (job clusters), with Databricks Runtimes between 6.4 and 10.3 currently supported
- Can be configured per cluster or for all clusters as a global configuration (see the sketch after this list)
- Once configured, does not require any code changes to notebooks or jobs
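For illustration, a hedged sketch of the general shape of that cluster-level configuration. The exact keys and the ingestion endpoint depend on the OpenLineage version and on how the accelerator's Azure Functions are deployed, so every name below is an assumption rather than the accelerator's documented values:

```python
# Hedged sketch: the general shape of the Spark configuration an OpenLineage
# listener expects. In Databricks these would be set as cluster-level Spark
# config entries; keys and endpoint are assumptions, not documented values.
openlineage_conf = {
    # Register the OpenLineage listener with Spark.
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    # Where lineage events are sent (the accelerator fronts Purview with an
    # Azure Functions endpoint; hypothetical URL).
    "spark.openlineage.host": "https://<functions-app>.azurewebsites.net",
    # Logical namespace under which jobs are grouped.
    "spark.openlineage.namespace": "<workspace-name>",
}
```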

Does Azure Databricks and Delta Layer make it a Lakehouse?

Even after going through many resources, I have failed to understand what constitutes a lakehouse, hence my question below.
If we have Azure Gen 2 Storage, ADF, and Azure Databricks with the possibility of converting the incoming CSV files into Delta tables can that be called a "Lakehouse" architecture or is it called a "Delta Lake"?
Or is it the "SQL analytics" engine over and above the Delta Lake layer that makes it a "Lakehouse"?
Please clarify.
At a high level, a Lakehouse must have the following properties:
- Open, direct-access data formats (Apache Parquet, Delta Lake, etc.)
- First-class support for machine learning and data science workloads
- State-of-the-art performance
Databricks is the first Lakehouse because it meets the above three properties. Specifically, if you are using Databricks with ADLS and converting all your data (JSON, CSV, Parquet, messages, etc.) into Delta tables that are available within Databricks, then that is the making of a Lakehouse, but it still needs to be built and supported. The Databricks platform satisfies points 2 and 3 above, and Delta Lake satisfies 1 and 3 (performance relies on both the engine and the storage, which is why 3 is mentioned twice).
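To ground the CSV-to-Delta step mentioned in the question, a minimal sketch in PySpark (the paths are hypothetical ADLS Gen2 locations):

```python
# Minimal sketch: convert incoming CSV files into a Delta table.
# Paths are hypothetical ADLS Gen2 locations.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("abfss://raw@mylake.dfs.core.windows.net/incoming/"))

(df.write
   .format("delta")
   .mode("overwrite")
   .save("abfss://curated@mylake.dfs.core.windows.net/sales_delta/"))

# Optionally register the table in the metastore so SQL users can query it.
spark.sql(
    "CREATE TABLE IF NOT EXISTS sales USING DELTA "
    "LOCATION 'abfss://curated@mylake.dfs.core.windows.net/sales_delta/'"
)
```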
Leveraging Databricks and accessing data stored in Delta is a Lakehouse. By adding Databricks SQL (formerly SQL Analytics) we allow more users to access and use the Lakehouse. In Databricks SQL, users work on the same compute and data as the data engineers do in Databricks; they just have a different UI that they are familiar with. Additionally, Databricks SQL is optimized for SQL and BI workloads, while the notebook environment is better for engineering and data science.
As a fun read, you should check out the Lakehouse whitepaper.

Load data from Databricks to Azure Analysis Services (AAS)

Objective
I'm storing data in Delta Lake format in ADLS Gen2. The tables are also available through the Hive catalog.
It's important to note that we're currently using PowerBI, but in the future we may switch to Excel over AAS.
Question
What is the best way (or hack) to connect AAS to my ADLS gen2 data in Delta Lake format?
The issue
There is no Databricks/Hive connector among the AAS-supported sources. AAS supports ADLS Gen2 through the Blob connector, but AFAIK it doesn't support the Delta Lake format, only Parquet.
Possible solution
From this article I see that the issue may potentially be solved with the PowerBI on-premises data gateway:
One example is the integration between Azure Analysis Services (AAS) and Databricks; Power BI has a native connector to Databricks, but this connector hasn't yet made it to AAS. To compensate for this, we had to deploy a Virtual Machine with the Power BI Data Gateway and install Spark drivers in order to make the connection to Databricks from AAS. This wasn't a show stopper, but we'll be happy when AAS has a more native Databricks connection.
The issue with this solution is that we're planning to stop using PowerBI. I don't quite understand how it works, what PBI license it requires, or what implementation/maintenance effort it involves. Could you please provide deeper insight into how it would work?
UPD, 26 Dec 2020
Now that Azure Synapse Analytics is GA, it has full support for SQL on-demand (serverless). That means serverless Synapse may theoretically be used as glue between AAS and Delta Lake. See "Direct Query Databricks' Delta Lake from Azure Synapse".
At the same time, is it possible to query the Databricks catalog (internal/external) from Synapse on-demand using ODBC? Synapse supports ODBC as an external source.
Power BI Dataflows now supports Parquet files, so you can load from those files into Power BI; however, the standard design pattern is to use Azure SQL Data Warehouse to load the files and then layer Azure Analysis Services (AAS) over that. AAS does not support Parquet, so you would have to create a CSV version of the final table or load it into a SQL database.
As mentioned, the typical architecture is to have Databricks do some or all of the ETL, then have Azure SQL DW sit over it.
Azure SQL DW has now morphed into Azure Synapse, and this has the benefit that a Databricks/Spark database now has a shadow copy accessible through the SQL-on-demand functionality. SQL on-demand doesn't require you to have an instance of the data warehouse component of Azure Synapse; it runs on demand, and you pay per TB of data queried. A good outline of how it can help is here. The other option is to have Azure Synapse load the data from an external table into that service, then connect AAS to that.

Distinct difference between Azure Databricks and Azure Synapse Analytics

Can someone explain the distinct differences between these two products in all major aspects? As far as I am aware from reading the official documents, both can host database systems and provide data-cleaning pipelines? Both are in the cloud?
Databricks:
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Synapse Analytics:
Azure Synapse is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs.
They do overlap to some extent, but they are not the same thing. Databricks is pretty much managed Apache Spark, whereas Synapse Analytics is a managed SQL data warehouse.
