How to access Delta Lake tables without a Databricks cluster running - databricks

I have created Delta Lake tables on a Databricks cluster, and I am able to access these tables from an external system/application. However, I need to keep the cluster up and running all the time to be able to access the table data.
Question:
Is it possible to access the Delta Lake tables when the cluster is down?
If yes, then how can I set that up?
I tried looking it up in the docs and found that the Databricks Premium tier has Table Access Controls (disabled otherwise). It says:
Enabling Table Access Control will allow users to control who can
select, create, and modify databases, tables, views, and functions
that they create.
I also found this doc
I don't think this is the option for my requirement. Please suggest.

The solution I found is to store all Delta Lake tables on ADLS Gen2 storage. That way the data is accessible to external resources irrespective of the Databricks cluster's state.
While reading a file or writing into a table the cluster needs to be up and running; the rest of the time it can be shut down.
From the docs: in Databricks we can create Delta tables of two types, managed and unmanaged. Managed tables are those whose data is stored in DBFS (the Databricks File System), while unmanaged tables are those for which an external ADLS Gen2 location can be specified.
dataframe.write.mode("overwrite").option("path","abfss://[ContainerName]@[StorageAccount].dfs.core.windows.net").saveAsTable("table")
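For example, once the table lives in ADLS Gen2, an external Spark application can read it by path while the Databricks cluster is shut down. A minimal sketch, assuming the delta-spark and hadoop-azure packages are on the classpath; the storage account key and the [table-path] suffix are placeholders I've added, not values from the post:

from pyspark.sql import SparkSession

# Sketch only: delta-spark and hadoop-azure must be available to this Spark install;
# <storage-account-key> and [table-path] are placeholders.
spark = (
    SparkSession.builder
    .appName("read-delta-without-databricks")
    # Enable Delta Lake support
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Authenticate to ADLS Gen2 with a storage account key (a service principal also works)
    .config("spark.hadoop.fs.azure.account.key.[StorageAccount].dfs.core.windows.net",
            "<storage-account-key>")
    .getOrCreate()
)

# Read the Delta table directly by path -- no Databricks cluster or metastore involved
df = spark.read.format("delta").load(
    "abfss://[ContainerName]@[StorageAccount].dfs.core.windows.net/[table-path]"
)
df.show()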

Related

Get data from Azure Synapse to Azure Machine Learning

I am trying to load the data (tabular data in tables, in a schema named 'x') from a Spark pool in Azure Synapse. I can't seem to find how to do that. Until now I have only linked Synapse and my pool to the ML studio. How can I do that?
The Lake Database contents are stored as Parquet files and exposed via your Serverless SQL endpoint as External Tables, so you can technically just query them via the endpoint. This is true for any tool or service that can connect to SQL, like Power BI, SSMS, Azure Machine Learning, etc.
WARNING, HERE THERE BE DRAGONS: Due to the manner in which the serverless engine allocates memory for text queries, using this approach may result in significant performance issues, up to and including service interruption. Speaking from personal experience, this approach is NOT recommended. I recommend that you limit use of the Lake Database to Spark workloads or very limited investigation in the SQL pool. Fortunately there are a couple of ways to sidestep these problems.
Approach 1: Read directly from your Lake Database's storage location. This will be in your workspace's root container (declared at creation time) under the following path structure:
synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/
These are just Parquet files, so there are no special rules about accessing them directly.
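For example, any Spark session with access to that storage account (such as a Synapse Spark pool notebook, where spark already exists) can read them like ordinary Parquet data. A minimal sketch; <root-container> and <storage-account> are hypothetical placeholders for your workspace's root storage, and the braced parts come from the path template above:

# Hypothetical placeholders: <root-container>/<storage-account> are your workspace's
# root storage; fill in {workspacename}, {databasename} and {tablename} as above.
path = ("abfss://<root-container>@<storage-account>.dfs.core.windows.net/"
        "synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/")

# Plain Parquet read -- no serverless SQL endpoint involved
df = spark.read.parquet(path)
df.show()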
Approach 2: You can also create views over your Lake Database external tables in a serverless database and use the WITH clause to explicitly assign properly sized schemas. Similarly, you can ignore the external tables altogether and use OPENROWSET over the same storage mentioned above. I recommend this approach if you need to access your Lake Database via the SQL endpoint.

Azure Databricks Cluster Questions

I am new to Azure and am trying to understand the things below. It would be helpful if anyone can share their knowledge on this.
Can a table created in Cluster A be accessed in Cluster B if Cluster A is down?
What is the connection between the cluster and the data in the tables?
You need to have a running process (cluster) to be able to access the metastore and read the data, because the data is stored in the customer's location and is not directly accessible from the control plane that runs the UI.
When you write data into a table, that data will be available to another cluster under the following conditions:
both clusters are using the same metastore
the user has the correct permissions (which can be enforced via Table ACLs)
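For illustration: clusters in the same workspace share the workspace's Hive metastore by default, while one common way to make "the same metastore" true across separate workspaces is an external Hive metastore set through each cluster's Spark config. A hedged sketch of the relevant keys, shown as a Python dict purely for readability (in practice these key/value pairs go into the cluster's Spark config), with every value a placeholder I've chosen:

# Sketch only -- placeholder values; the metastore version is an assumption.
shared_metastore_conf = {
    "spark.sql.hive.metastore.version": "2.3.9",
    "spark.sql.hive.metastore.jars": "builtin",
    "spark.hadoop.javax.jdo.option.ConnectionURL":
        "jdbc:sqlserver://<server>.database.windows.net:1433;database=<metastore-db>",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName":
        "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "<user>",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "<password>",
}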

Databricks Delta Tables - Where are they normally stored?

I'm beginning my journey into Delta tables and one thing that still confuses me is where the best place is to save your Delta tables if you need to query them later.
For example, I'm migrating several tables from on-prem to Azure Databricks into individual Delta tables. My question is: should I save the individual Delta tables, which could be significant in size, into DBFS (Databricks internal storage), or should I mount a blob storage location and save the Delta Lake tables there? What do people normally do in these situations?
I usually recommend that people store data in a separate storage account (either mounted, or accessed directly), and don't use the workspace's internal storage for those tasks. The primary reason is that it's easier to share this data with other workspaces, or with other systems if necessary. Internal storage should be used primarily for temp files, libraries, init scripts, etc.
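As a concrete illustration of the mounted option, here is a hedged sketch of mounting an ADLS Gen2 container with a service principal and writing a Delta table to it. Every identifier (<application-id>, <secret-scope>, <container>, <storage-account>, the /mnt/datalake mount point, and the df DataFrame) is a placeholder I've introduced, not something from the question:

# Sketch: mount an ADLS Gen2 container via OAuth (service principal); all values are placeholders
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)

# Write a migrated table as Delta to the external storage instead of the DBFS root
df.write.format("delta").mode("overwrite").save("/mnt/datalake/bronze/my_table")

Accessing the storage account directly with abfss:// paths (no mount) works the same way once the account credentials are set in the cluster's Spark config.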
There are a number of useful guides available that can help:
Azure Databricks Best Practices, which specifically talks about internal storage
About securing access to Azure Data Lake

Is it possible to access Databricks DBFS from Azure Data Factory?

I am trying to use the Copy Data Activity to copy data from Databricks DBFS to another place on the DBFS, but I am not sure if this is possible.
When I select Azure Delta Storage as a dataset source or sink, I am able to access the tables in the cluster and preview the data, but when validating it says that the tables are not Delta tables (which they aren't, but I don't seem to be able to access the persistent data on DBFS).
Furthermore, what I want to access is the DBFS, not the cluster tables. Is there an option for this?

Hadoop on Azure using IaaS

I am looking at setting up a Hadoop cluster for big data analytics using a virtualized environment in Azure. As the data volume is very high, I am looking at having the data stored in secondary storage like Azure Data Lake Store, while the Hadoop cluster storage acts as the primary storage.
I would like to know how this can be configured so that when I create a Hive table and partition, part of the data can reside in primary storage and the rest in secondary storage?
You can't mix file systems within a Hive table by default. The Hive metastore only stores one filesystem location for a database/table definition.
You might try to use Waggle Dance to set up a federated Hive solution, but that's probably more work than simply allowing the Hive data to exist in Azure.
I don't know about Hadoop and Hive, but you could combine Azure Data Lake Store (ADLS) and Azure SQL Data Warehouse (ADW), i.e. use PolyBase in ADW to create an external table over the 'cold' data in ADLS and an internal table for your 'warm' data. ADW has the advantage that you can pause it.
Optionally create a view over the top to combine the external and internal table.

Resources