Common metadata in databricks cluster - apache-spark

I have 3-4 clusters in my Databricks workspace on the Azure cloud platform. I want to maintain a common metastore for all the clusters. Let me know if anyone has implemented this.

I recommend configuring an external Hive metastore. By default, Databricks spins up its own metastore behind the scenes, but you can create your own database (Azure SQL works, as do MySQL and Postgres) and specify it at cluster startup.
Here are detailed steps:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
Things to be aware of:
Data tab in Databricks: you can choose the cluster and see the different metastores.
To avoid using a SQL username and password, look at managed identities: https://learn.microsoft.com/en-us/azure/stream-analytics/sql-database-output-managed-identity
Automate external Hive metastore connections by using initialization scripts for your clusters.
Permissions management on your sources: in the case of ADLS Gen2, consider using credential passthrough.
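As a rough illustration of what "specify it during the cluster startup" amounts to, here is a minimal sketch that assembles the Spark configuration entries described in the linked external Hive metastore guide. The helper names, the JDBC URL, the secret scope and key names, and the Hive version are all placeholders I made up for the example; check the guide for the values that match your runtime and database.

```python
# Sketch only: builds the Spark config entries for an external Hive metastore.
# Function names and all example values below are hypothetical placeholders.

def external_metastore_spark_conf(jdbc_url, user, password, driver,
                                  hive_version, jars="builtin"):
    """Return Spark config entries that point a cluster at an external metastore."""
    return {
        "spark.sql.hive.metastore.version": hive_version,
        "spark.sql.hive.metastore.jars": jars,
        "spark.hadoop.javax.jdo.option.ConnectionURL": jdbc_url,
        "spark.hadoop.javax.jdo.option.ConnectionUserName": user,
        "spark.hadoop.javax.jdo.option.ConnectionPassword": password,
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": driver,
    }


def render_spark_conf(conf):
    """Render entries as 'key value' lines, one per line, for the cluster Spark config."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    conf = external_metastore_spark_conf(
        # Placeholder server and database names.
        jdbc_url="jdbc:sqlserver://example-server.database.windows.net:1433;database=example_metastore",
        # Secret references keep credentials out of plain text; scope/key names are made up.
        user="{{secrets/example-scope/metastore-user}}",
        password="{{secrets/example-scope/metastore-password}}",
        driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
        hive_version="2.3.9",  # assumption: must match your metastore schema version
    )
    print(render_spark_conf(conf))
```

Applying the same set of entries to every cluster (through the cluster configuration or an init script) is what makes them share one metastore.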

Related

Is there a way to access internal metastore of Azure HDInsight to fire queries on Hive metastore tables?

I am trying to access the internal Hive metastore tables like HIVE.SDS, HIVE.TBLS etc.
I have an HDInsight Hadoop cluster running with the default internal metastore. From the Ambari screen, I got the advanced settings required for the connection, such as
javax.jdo.option.ConnectionDriverName, javax.jdo.option.ConnectionURL, and javax.jdo.option.ConnectionUserName, as well as the password.
When I try connecting to the SQL Server instance (the internal Hive metastore) from a local machine, I get a message telling me to add my IP address to the allowed list. However, since this Azure SQL server is not visible in the list of Azure SQL servers in the portal, I cannot whitelist my IP.
So I logged in to the cluster over SSH as SSHUSER and tried accessing the HIVE database from within the cluster, using the metastore credentials provided in Ambari. I am still not able to access it. I am using sqlcmd to connect to SQL Server.
Does HDInsight prevent direct access to internal metastores? Is an external metastore the only way forward? Any leads would be helpful.
Update: I created an external SQL Server instance, used it as an external metastore, and was able to access it programmatically.
No luck with the internal one yet.
There is no way to access the internal metastore of an HDInsight cluster. Internal metastores live in an internal subscription that only the product groups (PGs) can access.
If you want more control over your metastores, it is recommended to bring your own "external" metastore.

How to connect Azure Data Factory with SQL Endpoints instead of interactive cluster?

Is it possible to connect Azure Data Factory to Azure Databricks SQL endpoints (Delta tables and views) instead of an interactive cluster? I tried the Azure Databricks Delta Lake connector, but it only has options for clusters, not endpoints.
Unfortunately, you cannot connect Azure Data Factory to Azure Databricks SQL endpoints.
Note: With the compute option, you can connect to an Azure Databricks workspace using the following cluster options:
New Job cluster
Existing interactive cluster
Existing instance pool
Note: With the data store option (Azure Databricks Delta Lake), you can connect only to existing interactive clusters.
We would appreciate it if you could share this request on our feedback channel, which is open for the user community to upvote and comment on. This allows our product teams to prioritize your request against the existing feature backlog and gives insight into the potential impact of implementing the suggested feature.

Azure Databricks Cluster Questions

I am new to Azure and am trying to understand the points below. It would be helpful if anyone could share their knowledge on this.
Can a table created in Cluster A be accessed from Cluster B if Cluster A is down?
What is the connection between the cluster and the data in the tables?
You need a running process (a cluster) to access the metastore and read data, because the data is stored in the customer's location and is not directly accessible from the control plane that runs the UI.
When you write data into a table, that data is available from another cluster under the following conditions:
both clusters use the same metastore
the user has the correct permissions (these can be enforced via Table ACLs)

How to submit custom spark application on Azure Databricks?

I have created a small application that submits a Spark job at certain intervals and creates some analytical reports. These jobs can read data from a local filesystem or a distributed filesystem (HDFS, ADLS, or WASB). Can I run this application on an Azure Databricks cluster?
The application works fine on an HDInsight cluster, as I was able to access the nodes. I kept my deployable jar in one location and started it with a start script; similarly, I could stop it with a stop script that I prepared.
One thing I found is that Azure Databricks has its own file system: ADFS. I can add support for this file system, but will I then be able to deploy and run my application the way I did on the HDInsight cluster? If not, is there a way to submit jobs to an Azure Databricks cluster from an edge node, my HDInsight cluster, or any other on-premises cluster?
Have you looked at Jobs? https://docs.databricks.com/user-guide/jobs.html. You can submit jars to spark-submit just like on HDInsight.
The Databricks file system is DBFS; ABFS is used for Azure Data Lake. You should not need to modify your application for these, since the file paths will be handled by Databricks.
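To make the Jobs suggestion a little more concrete, here is a minimal sketch of a request body for a one-time spark-submit run through the Databricks Jobs REST API (the runs/submit endpoint). It only builds and prints the JSON; it does not send anything. The function name, jar location, main class, application argument, and cluster sizing are hypothetical placeholders, and the exact fields accepted depend on the Jobs API version you call, so treat this as an assumption to verify against the Jobs documentation.

```python
# Sketch only: assembles a spark-submit run payload for the Databricks Jobs API.
# All concrete values are placeholders; nothing is sent over the network.
import json


def build_spark_submit_run(run_name, jar_uri, main_class, app_args, new_cluster):
    """Build a runs/submit style payload that launches a jar via spark-submit."""
    # Same argument order as a spark-submit command line:
    # options first, then the application jar, then the application's own arguments.
    parameters = ["--class", main_class, jar_uri, *app_args]
    return {
        "run_name": run_name,
        "new_cluster": new_cluster,
        "spark_submit_task": {"parameters": parameters},
    }


if __name__ == "__main__":
    payload = build_spark_submit_run(
        run_name="analytics-report",                      # hypothetical name
        jar_uri="dbfs:/FileStore/jars/report-app.jar",    # hypothetical jar on DBFS
        main_class="com.example.ReportApp",               # hypothetical main class
        app_args=["abfss://data@exampleaccount.dfs.core.windows.net/input"],  # hypothetical input
        new_cluster={
            "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
            "node_type_id": "Standard_DS3_v2",    # placeholder Azure VM size
            "num_workers": 2,
        },
    )
    print(json.dumps(payload, indent=2))
```

A scheduler running on an edge node or an on-premises machine could POST a body like this to the workspace at the desired interval, which covers the "submit from outside the cluster" part of the question.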

Is it possible to take a snapshot of an existing HDInsight cluster in Azure?

Currently we have an HDInsight cluster that we might have to shut down or delete for a few days. We need the cluster back in the same state we left it in. How can we preserve a snapshot of this cluster and restore it after a few days?
It depends on how you created the HDInsight cluster. When you created the cluster, did you specify external metastores, so that your Hive metastore runs on your own Azure SQL database and not on the one that HDInsight created?
Check this documentation.
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters#use-hiveoozie-metastore
If you did not use external metastores when you created the cluster, unfortunately you will lose that state. Your data, however, will persist in Azure Blob storage or Azure Data Lake Store.
