I am struggling to understand something here and I'm sure the answer is simple. When I run this command in a Databricks notebook:
(df.write
  .format("delta")                        # write in Delta Lake format
  .option("path", "/file/path/location")  # external location for the table
  .saveAsTable("MyTable"))
It creates a Delta table. Okay, great!
But if I run the same command in an Azure Synapse Spark notebook, it also creates a table.
Does Synapse now support Delta tables? According to this Stack Overflow post it does not.
So my question is: what's the difference?
Thanks in advance!
Azure Synapse Analytics has a number of engines, such as Spark and SQL. Synapse Spark pools support Delta Lake. Synapse Serverless SQL pools recently added support for reading from Delta Lake. Synapse Dedicated SQL Pools do not support Delta Lake at this moment.
That other Stack Overflow answer is out of date; it was only discussing Synapse Serverless SQL Pools. I added a comment asking the author to update it.
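If you want to confirm what either engine actually created, you can inspect the table's provider. A quick check, using the table name from the question above:

# The "Provider" row reads "delta" if the table was written in Delta format.
spark.sql("DESCRIBE EXTENDED MyTable").filter("col_name = 'Provider'").show(truncate=False)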
Related
I'm looking for the Azure Synapse Analytics notebook equivalents of the commands below, which are from Databricks.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
These are Delta cache and auto-optimization features that apply to Databricks only. Databricks is code-oriented, while Synapse also allows UI-based analytics.
Synapse is a separate data analytics service provided by Azure with a different feature set; it is not a given that Databricks and Synapse share the same features.
Synapse Apache Spark lets you optimize components like data serialization, joins, shuffle, job execution, etc. These components are different from Databricks', but the objectives are quite similar.
You can learn more in Optimize Apache Spark jobs in Azure Synapse Analytics.
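For what it's worth, Synapse Spark pools expose a similar optimize-write setting under a spark.microsoft.delta prefix rather than spark.databricks. I believe the key below is the documented one, but verify it against the current Synapse docs before relying on it:

# Believed Synapse-side counterpart of Databricks' optimizeWrite; verify the
# exact key against current Synapse documentation.
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")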
Got the solution:
# Raise the broadcast-join threshold to 1 GiB (the default is 10 MB); this is
# a standard Spark setting, so it works in both Databricks and Synapse.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1 * 1024 * 1024 * 1024)
I have no experience with Azure Synapse, but my understanding is that it is the same as Databricks, ADF, ADLS2, and Hive in SQL DWH, all together in one workspace under a different name.
Am I wrong?
Yes, in many contexts Azure Synapse and Databricks provide the same big data analytics approach, but there are also a few differences between these services.
With the new functionality in Synapse, we now see some functionality similar to Databricks (e.g. Spark, Delta), which raises the question of how Synapse compares to Databricks and when to use which.
Yes, both have Spark but…
Databricks
has a proprietary data processing engine (Databricks Runtime) built on a highly optimized version of Apache Spark, claimed to offer up to 50x performance
already has support for Spark 3.0
allows users to opt for GPU enabled clusters and choose between standard and high-concurrency cluster mode
Synapse
Open-source Apache Spark (thus not including all features of Databricks Runtime)
has built-in support for .NET for Spark applications
Yes, both have notebooks
Synapse
Nteract Notebooks
has co-authoring of Notebooks, but one person needs to save the Notebook before another person sees the change
doesn’t have automated versioning
Databricks
Databricks Notebooks
Has real-time co-authoring (both authors see the changes in real time)
Has automated versioning
Yes, both can access data from a data lake
Synapse
When creating Synapse, you can select a data lake which will be your primary data lake (can query it directly from the scripts and notebooks)
Databricks
You need to mount a data lake before using it
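For context, a minimal sketch of the Databricks mount step mentioned above; the container, storage account, secret scope/key, tenant ID, and application ID are all placeholders:

# All names below (container, storage account, secret scope/key, tenant and
# application IDs) are placeholders for illustration.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
# In Synapse, by contrast, the primary data lake is addressable directly via
# its abfss:// path from notebooks, with no mount step.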
Yes, both leverage Delta
Synapse
Delta Lake is open source
Databricks
Has Databricks Delta, which is built on the open-source version but offers some extra optimizations
No, they are not the same
Synapse
Has both a traditional SQL engine (to fit the traditional BI developers) as well as a Spark engine (to fit data scientists, analysts & engineers)
Is a data warehouse (i.e. Synapse Analytics) + an interface tool (i.e. Synapse Studio)
Databricks
Is not a data warehouse tool but rather a Spark-based notebook tool
Has a focus on Spark, Delta Engine, MLflow and MLR
No, they don’t offer the same developer experience
Synapse
For Spark development, currently offers a developer experience only through Synapse Studio (not through local IDEs)
Doesn't yet have Git integrated within the Synapse Studio Notebooks
Databricks
Offers a developer experience within Databricks UI, Databricks Connect (i.e. remote connect from Visual Studio Code, Pycharm, etc.) and soon Jupyter & RStudio UI within Databricks
Check out When to use Synapse and when Databricks?
This diagram from this URL states that Azure Synapse cannot query external relational stores but Azure Databricks can.
But here I see it is possible with Azure Synapse. We could also use PolyBase in Azure Synapse. None of these articles is outdated, so what am I missing?
Your second URL is for external tables, which are not the same as external relational stores (Azure SQL, MySQL, PostgreSQL, etc.). I do not believe any of the Synapse engines can connect directly to relational data stores (although I'm not certain of Spark's limitations in this regard), but Pipelines can. While they both use Spark, Databricks is a separate product and not related to Synapse.
PolyBase uses external tables, which are metadata references over blobs in storage (Blob or ADLS). Synapse supports external tables in both Dedicated SQL Pools and Serverless SQL. Spark tables are also queryable from Serverless SQL because they are stored as Parquet files in ADLS. I believe this is also implemented as an external table reference, although it does not display as such in the Workspace UI.
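On the Spark uncertainty above: a generic JDBC read does work from a Spark pool, since it is plain Apache Spark functionality rather than anything Synapse-specific. A hedged sketch with placeholder connection details:

# Placeholder server, database, table, and credentials; any JDBC-reachable
# relational store can be read this way from Spark.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
      .option("dbtable", "dbo.MyTable")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())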
Objective
I'm storing data in Delta Lake format in ADLS Gen2. The tables are also available through the Hive catalog.
It's important to note that we're currently using Power BI, but in the future we may switch to Excel over AAS.
Question
What is the best way (or hack) to connect AAS to my ADLS gen2 data in Delta Lake format?
The issue
Databricks/Hive is not among the sources AAS supports. AAS supports ADLS Gen2 through the Blob connector, but AFAIK it doesn't support the Delta Lake format, only Parquet.
Possible solution
From this article I see that the issue may potentially be solved with the Power BI on-premises data gateway:
"One example is the integration between Azure Analysis Services (AAS) and Databricks; Power BI has a native connector to Databricks, but this connector hasn't yet made it to AAS. To compensate for this, we had to deploy a Virtual Machine with the Power BI Data Gateway and install Spark drivers in order to make the connection to Databricks from AAS. This wasn't a show stopper, but we'll be happy when AAS has a more native Databricks connection."
The issue with this solution is that we're planning to stop using Power BI. I don't quite understand how it works, what PBI license it needs, or what implementation/maintenance effort it requires. Could you please provide deeper insight into how it would work?
UPD, 26 Dec 2020
Now that Azure Synapse Analytics is GA, it has full support for SQL on-demand. That means serverless Synapse may theoretically be used as glue between AAS and Delta Lake. See "Direct Query Databricks' Delta Lake from Azure Synapse".
At the same time, is it possible to query the Databricks catalog (internal/external) from Synapse on-demand using ODBC? Synapse supports ODBC as an external source.
Power BI Dataflows now supports Parquet files, so you can load from those files to Power BI; however, the standard design pattern is to use Azure SQL Data Warehouse to load the file, then layer Azure Analysis Services (AAS) over that. AAS does not support Parquet; you would have to create a CSV version of the final table, or load it into a SQL Database.
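If you go the CSV route, the export from Delta is straightforward on the Spark side. A minimal sketch, with both paths as placeholders:

# Materialize the current Delta snapshot as CSV so tools without Delta
# support can read it; both paths below are placeholders.
(spark.read.format("delta").load("/delta/my_table")
      .write.mode("overwrite")
      .option("header", "true")
      .csv("/export/my_table_csv"))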
As mentioned, the typical architecture is to have Databricks do some or all of the ETL, then have Azure SQL DW sit over it.
Azure SQL DW has now morphed into Azure Synapse, which has the benefit that a Databricks/Spark database now has a shadow copy accessible through the SQL on-demand functionality. SQL on-demand doesn't require you to have an instance of the data warehouse component of Azure Synapse; it runs on demand, and you pay per TB of data queried. A good outline of how it can help is here. The other option is to have Azure Synapse load the data from an external table into that service, then connect AAS to that.
Databricks has just released a public preview of Delta Lake and Presto integration. I'm new to Azure, and the link has multiple mentions of EMR and Athena but lacks Azure keywords. So I have to ask a stupid question:
Am I right that the Presto integration is only for AWS, since Azure doesn't have a Presto PaaS?
P.S. Does Databricks have any plans for Delta Lake and Synapse/PolyBase integration in the near future?
The Delta Lake Presto integration is based on "symlinks", which Presto has supported for a long time. On Azure, you can conveniently provision Presto using:
Starburst Presto Kubernetes + Azure AKS (recommended)
Starburst Presto for HDInsights, freely available in HDInsights Marketplace (more of a "legacy" option)
there are also Azure offerings from Qubole and perhaps others
Bear in mind, however, that the "symlinks"-based integration has certain limits. Here at Starburst, we're working on native Delta Lake support, without the need to create "symlinks".
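For reference, generating the "symlink" manifest on the Delta side is a one-liner through the Delta Lake Python API; the table path below is a placeholder:

from delta.tables import DeltaTable

# Writes a _symlink_format_manifest directory that Presto/Athena-style
# engines read; "/delta/events" is a placeholder path.
deltaTable = DeltaTable.forPath(spark, "/delta/events")
deltaTable.generate("symlink_format_manifest")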