Databricks Serverless Compute - I know this is still in preview, is available by request, and is only available on AWS.
Can it be used to read and write (update) Delta tables, or is it read-only?
And is it suited to running small, transactional queries, or is Azure SQL the better choice for that?
Performance from Azure SQL seems faster than Databricks for small queries.
Since Databricks has to traverse the Hive metastore when querying Delta tables, will this impact performance?
According to the release notes (June 17, 2021), the new Photon executor is switched on for SQL endpoints, and it also supports writes to Delta tables (and to Parquet).
If you want to run a lot of small queries on a set of data, I'd say Azure SQL (or operations on a Spark DataFrame loaded from the Delta table) should always outperform the same thing expressed in SQL running directly against a Delta Lake table, since the latter has to negotiate the versioned Parquet files and the Delta Lake transaction log on your behalf.
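To illustrate the DataFrame route, here is a minimal PySpark sketch: load the Delta table once, cache it, and serve the small queries from memory so each one avoids renegotiating the transaction log. The storage path and column names below are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the Delta table once and cache it, so repeated small queries hit
# the in-memory DataFrame instead of renegotiating the versioned Parquet
# files and transaction log on every call. Path and columns are
# hypothetical placeholders.
df = spark.read.format("delta").load(
    "abfss://data@myaccount.dfs.core.windows.net/tables/orders"
)
df.cache()

# Small, repeated lookups now avoid the Delta log overhead.
df.filter(df.order_id == 12345).show()
df.filter(df.customer_id == 42).count()
```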
I am trying to load data (tabular data in tables, in a schema named 'x') from a Spark pool in Azure Synapse. I can't seem to find out how to do that. Until now I have only linked Synapse and my pool to the ML studio. How can I do that?
The Lake Database contents are stored as Parquet files and exposed via your Serverless SQL endpoint as External Tables, so you can technically just query them via the endpoint. This is true for any tool or service that can connect to SQL, like Power BI, SSMS, Azure Machine Learning, etc.
WARNING, HERE THERE BE DRAGONS: Due to the manner in which the serverless engine allocates memory for text columns, using this approach may result in significant performance issues, up to and including service interruption. Speaking from personal experience, this approach is NOT recommended. I recommend that you limit use of the Lake Database to Spark workloads or very limited investigation in the SQL pool. Fortunately, there are a couple of ways to sidestep these problems.
Approach 1: Read directly from your Lake Database's storage location. This will be in your workspace's root container (declared at creation time) under the following path structure:
synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/
These are just Parquet files, so there are no special rules about accessing them directly.
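As a minimal PySpark sketch of Approach 1 (the storage account, container, workspace, database, and table names are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Lake Database table's Parquet files straight from the
# workspace's root container, bypassing the SQL endpoint entirely.
path = (
    "abfss://root@myaccount.dfs.core.windows.net/"
    "synapse/workspaces/myworkspace/warehouse/mydb.db/mytable/"
)
df = spark.read.parquet(path)
df.show(5)
```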
Approach 2: You can also create views over your Lake Database's external tables in a serverless database and use the WITH clause to explicitly assign properly sized schemas. Similarly, you can ignore the external tables altogether and use OPENROWSET over the same storage mentioned above. I recommend this approach if you need to access your Lake Database via the SQL endpoint.
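Here is a minimal sketch of the OPENROWSET route, queried from Python over pyodbc. The server name, credentials, storage URL, and column types are hypothetical placeholders; the WITH clause is what lets you assign right-sized types instead of the defaults the external tables would give you.

```python
import pyodbc

# Connect to the Synapse serverless SQL endpoint (all connection
# details below are hypothetical placeholders).
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=mydb;Uid=myuser;Pwd=mypassword;"
)

# OPENROWSET over the Lake Database storage path, with an explicit
# WITH clause so each column gets a properly sized type.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/root/synapse/workspaces/myworkspace/warehouse/mydb.db/mytable/*.parquet',
    FORMAT = 'PARQUET'
) WITH (
    id INT,
    name VARCHAR(100),
    created_at DATETIME2
) AS rows
"""
for row in conn.cursor().execute(sql):
    print(row)
```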
Even after going through many resources, I have failed to understand what constitutes a lakehouse, hence my question below.
If we have Azure Data Lake Storage Gen2, ADF, and Azure Databricks, with the possibility of converting the incoming CSV files into Delta tables, can that be called a "Lakehouse" architecture, or is it called a "Delta Lake"?
Or is it the "SQL analytics" engine over and above the Delta Lake layer that makes it a "Lakehouse"?
Please clarify.
At a high level, a Lakehouse must have the following properties:
Open, direct-access data formats (Apache Parquet, Delta Lake, etc.)
First-class support for machine learning and data science workloads
State-of-the-art performance
Databricks is the first Lakehouse because it meets the above three properties. Specifically, if you are using Databricks with ADLS and converting all your data (JSON, CSV, Parquet, messages, etc.) into Delta tables that are available within Databricks, then that is the making of a Lakehouse, but it still needs to be built and supported. The Databricks platform satisfies points 2 and 3 above, and Delta Lake satisfies 1 and 3 (performance relies on the engine and the storage, which is why 3 is mentioned twice).
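As a minimal sketch of the "convert incoming CSV files into Delta tables" step (paths are hypothetical placeholders, and this assumes a Databricks or otherwise Delta-enabled Spark session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Land the incoming CSV files as a Delta table so downstream users get
# ACID transactions and fast reads over open Parquet storage.
raw = spark.read.option("header", "true").csv(
    "abfss://landing@myaccount.dfs.core.windows.net/incoming/"
)
raw.write.format("delta").mode("append").save(
    "abfss://lake@myaccount.dfs.core.windows.net/bronze/events"
)
```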
Leveraging Databricks and accessing data stored in Delta is a Lakehouse. By adding Databricks SQL (formerly SQL Analytics) we allow more users to access and use the Lakehouse. In Databricks SQL, users are using the same compute and data as the data engineers do in Databricks; they just have a different UI that they are familiar with. Additionally, Databricks SQL is optimized for SQL and BI workloads, while the notebook environment is better for engineering and data science.
As a fun read, you should check out the Lakehouse whitepaper.
I have files loaded into an Azure storage account (Gen2) and am using Azure Synapse Analytics to query them. Following the documentation here: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables, I should be able to create a Spark SQL table to query the partitioned data, and thus subsequently use the metadata from Spark SQL in my SQL on-demand query, given this line in the doc: "When a table is partitioned in Spark, files in storage are organized by folders. Serverless SQL pool will use partition metadata and only target relevant folders and files for your query."
My data is partitioned in ADLS Gen2 as:
Running the query in a Spark notebook in Synapse Analytics returns in just over 4 seconds, as it should given the partitioning:
However, running the same query on the SQL on-demand side never completes:
This result, and the extreme reduction in performance compared to the Spark pool, runs completely counter to what the documentation states. Is there something I am missing in the query to make SQL on-demand use the partitions?
The filepath() and filename() functions can be used in the WHERE clause to filter the files to be read, which achieves the pruning you are looking for.
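For example (a sketch against the serverless endpoint via pyodbc; the storage URL, partition layout, and connection details are hypothetical placeholders):

```python
import pyodbc

# Connect to the serverless SQL endpoint (placeholders throughout).
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=mydb;Uid=myuser;Pwd=mypassword;"
)

# filepath(1) returns the value matched by the first wildcard in the
# BULK path (here the year), filepath(2) the second (the month), so
# the WHERE clause prunes to a single partition folder instead of
# scanning every file.
sql = """
SELECT COUNT(*) AS cnt
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/data/parquet/year=*/month=*/*.parquet',
    FORMAT = 'PARQUET'
) AS rows
WHERE rows.filepath(1) = '2021' AND rows.filepath(2) = '06'
"""
print(conn.cursor().execute(sql).fetchval())
```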
I am using Azure Databricks with the latest runtime for the clusters. I have some confusion regarding the VACUUM operation in Delta Lake. We know we can set a retention duration on the deleted data; however, for the actual data to be deleted after the retention period is over, do we need to keep the cluster up for the entire duration?
In simple words: do we need to have a cluster always in a running state in order to leverage Delta Lake?
You don't need to keep a cluster up and running at all times. You can schedule a VACUUM job to run daily (or weekly) to clean up stale data older than the retention threshold. Delta Lake doesn't require an always-on cluster: all the data and metadata are stored in storage (S3/ADLS/ABFS/HDFS), so there is no need to keep anything up and running.
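A minimal sketch of such a scheduled job, assuming a Delta-enabled Spark session and a hypothetical table path:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Run this as a scheduled (e.g. daily) job: the cluster only needs to
# be up while it runs. The table path is a hypothetical placeholder.
table = DeltaTable.forPath(
    spark, "abfss://lake@myaccount.dfs.core.windows.net/bronze/events"
)

# Remove files no longer referenced by the table and older than the
# retention threshold (168 hours = 7 days, the default).
table.vacuum(168)
```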
Apparently you do need a cluster to be up and running at all times to query the data available in Databricks tables.
If you have configured an external metastore for Databricks, then you can use a wrapper like Apache Hive by pointing it at that external metastore DB and query the data through the Hive layer without using Databricks.
We are planning to do batch processing on a daily basis. We generate 1 GB of CSV files every day and will manually put them into Azure Data Lake Store. I have read the Microsoft Azure documents regarding batch processing and have decided to use Spark for it. My question is: after we transform the data using RDDs/DataFrames, what would be the next step? How can we visualize the data? Since this process is supposed to run every day, once the data transformation is done using Spark, do we need to push the data to a data store like Hive/HDFS or Cosmos DB before we can visualize it?
There are several options for doing this on Azure. It really depends on your requirements (e.g. number of users, needed visualizations, etc.). Some examples:
Running Spark on Azure Databricks, you could use the Notebook capabilities to visualize your data
Use HDInsight with Jupyter or Zeppelin Notebooks
Define Spark tables on Azure Databricks and visualize them with Power BI (see the sketch after this list)
Load the data with Azure Data Factory V2 into Azure SQL DB or Azure SQL Data Warehouse and visualize it with Power BI.
For time-series data you could push the data via Spark to Azure Event Hubs (see the example notebook with an Event Hubs sink in the following documentation) and consume it via Azure Time Series Insights. If you have an event-data stream, this could also replace your batch-oriented architecture in the future. Parquet files will be used by Azure Time Series Insights as long-term storage (see the following link). For Spark, also have a look at the Time Series Package, which adds some time-series capabilities to Spark.
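As a sketch of the "Spark tables plus Power BI" option (paths, table names, and the transformation itself are hypothetical placeholders): read the day's CSV drop, transform it, and save it as a table that Power BI can reach through the cluster's JDBC/ODBC endpoint.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Daily batch: read the day's CSV files from the Data Lake Store and
# apply a transformation (placeholders throughout).
daily = spark.read.option("header", "true").csv(
    "adl://mydatalake.azuredatalakestore.net/landing/2021-06-17/"
)
cleaned = daily.dropDuplicates().filter("amount IS NOT NULL")

# saveAsTable registers the result in the metastore, so BI tools such
# as Power BI can query it by name over the cluster's SQL endpoint.
cleaned.write.mode("overwrite").saveAsTable("sales_daily")
```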