Hive ODBC connecting to HDInsight - Azure

I'm setting up a SQL Server VM in Azure and I want it to be able to connect to Hive on an HDInsight cluster. I'm trying to set up the ODBC DSN and I'm unsure what the various settings are and how to find them in my Azure portal:
Hostname
Username
Password (can I reset this if I've forgotten it?)
Cheers, Chris.

Hostname: the HDInsight cluster name, i.e. <clustername>.azurehdinsight.net
Username: the HDInsight cluster username
Password: the HDInsight cluster password
I don't think you can recover the password. You can delete the HDInsight cluster and create another one. Because Hadoop jobs are batch jobs, and an HDInsight cluster usually contains multiple nodes, people usually create a cluster, run a MapReduce job, and delete the cluster right after the job completes. It is too costly to leave an HDInsight cluster sitting idle in the cloud.
Because an HDInsight cluster uses Azure Blob storage for data storage, deleting the cluster will not impact the data.
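To make the three DSN fields concrete, here is a minimal sketch of a DSN-less connection string for the Microsoft Hive ODBC Driver against HDInsight. The cluster name, username, and password are placeholders, and the driver name and `AuthMech` value are assumptions based on the Microsoft Hive ODBC driver; check your installed driver's documentation.

```python
# Sketch of a connection string for the Microsoft Hive ODBC Driver (assumed
# driver name). All values below are placeholders - substitute your own.
params = {
    "Driver": "{Microsoft Hive ODBC Driver}",
    "Host": "mycluster.azurehdinsight.net",  # Hostname = <cluster name>.azurehdinsight.net
    "Port": "443",                           # HDInsight exposes HiveServer2 over HTTPS
    "HiveServerType": "2",
    "AuthMech": "6",                         # assumed: Windows Azure HDInsight Service auth
    "UID": "admin",                          # the cluster login username chosen at creation
    "PWD": "cluster-password",
}
conn_str = ";".join(f"{k}={v}" for k, v in params.items())
print(conn_str)
# With pyodbc installed you would then connect with:
#   import pyodbc
#   conn = pyodbc.connect(conn_str, autocommit=True)
```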

Related

Common metadata in databricks cluster

I have 3-4 clusters in my Databricks instance on the Azure cloud platform. I want to maintain a common metastore for all the clusters. Let me know if anyone has implemented this.
I recommend configuring an external Hive metastore. By default, Databricks spins up its own metastore behind the scenes, but you can create your own database (Azure SQL works, as do MySQL and Postgres) and specify it during cluster startup.
Here are detailed steps:
https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
Things to be aware of:
Data tab in Databricks - you can choose the cluster and see the different metastores.
To avoid using a SQL username & password, look at Managed Identities: https://learn.microsoft.com/en-us/azure/stream-analytics/sql-database-output-managed-identity
Automate external Hive metastore connections by using initialization scripts for your cluster.
Permissions management on your sources. In the case of ADLS Gen2, consider using credential passthrough.
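The cluster Spark config described in the linked document looks roughly like the sketch below. The JDBC server, database name, user, and password are hypothetical placeholders, and the metastore version should match whatever schema you actually provisioned.

```python
# Sketch of the Spark config keys for pointing Databricks at an external Hive
# metastore (pasted into the cluster's "Spark config" box). Server, database,
# user and password are hypothetical placeholders.
metastore_conf = {
    "spark.hadoop.javax.jdo.option.ConnectionURL":
        "jdbc:sqlserver://myserver.database.windows.net:1433;database=metastore",
    "spark.hadoop.javax.jdo.option.ConnectionDriverName":
        "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "spark.hadoop.javax.jdo.option.ConnectionUserName": "metastore_user",
    "spark.hadoop.javax.jdo.option.ConnectionPassword": "metastore_password",
    # Match this to the metastore schema version you provisioned.
    "spark.sql.hive.metastore.version": "3.1.0",
    "spark.sql.hive.metastore.jars": "maven",
}
# Rendered as the newline-separated "key value" text the config box expects:
print("\n".join(f"{k} {v}" for k, v in metastore_conf.items()))
```

An init script that writes the same keys lets every new cluster pick up the shared metastore automatically, which is the automation point mentioned above.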

How to submit custom spark application on Azure Databricks?

I have created a small application that submits a Spark job at certain intervals and creates some analytical reports. These jobs can read data from a local filesystem or a distributed filesystem (HDFS, ADLS or WASB). Can I run this application on an Azure Databricks cluster?
The application works fine on an HDInsight cluster, as I was able to access the nodes. I kept my deployable jar in one location and started it using a start script; similarly, I could stop it using a stop script that I prepared.
One thing I found is that Azure Databricks has its own file system, ADFS. I can add support for this file system too, but then will I be able to deploy and run my application as I did on the HDInsight cluster? If not, is there a way I can submit jobs from an edge node, my HDInsight cluster, or any other on-prem cluster to an Azure Databricks cluster?
Have you looked at Jobs? https://docs.databricks.com/user-guide/jobs.html. You can submit jars to spark-submit just like on HDInsight.
The Databricks file system is DBFS, not ADFS - ABFS is the driver used for Azure Data Lake. You should not need to modify your application for these; the file paths are handled by Databricks.
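As a sketch of how the start/stop scripts map onto Jobs, the payload below is roughly what the Jobs API (2.0) `jobs/create` endpoint accepts for running an existing jar on a schedule. The jar path, main class, runtime version, and schedule are all hypothetical placeholders.

```python
import json

# Hypothetical Jobs API 2.0 payload for running an existing jar as a scheduled
# Databricks job - roughly what the linked Jobs page configures in the UI.
job_spec = {
    "name": "analytics-report",
    "new_cluster": {
        "spark_version": "11.3.x-scala2.12",   # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "libraries": [{"jar": "dbfs:/jars/analytics-report.jar"}],  # placeholder jar
    "spark_jar_task": {
        "main_class_name": "com.example.ReportJob",  # hypothetical main class
        "parameters": ["--input", "dbfs:/data/events"],
    },
    # A cron-style schedule replaces the hand-rolled start/stop scripts.
    "schedule": {"quartz_cron_expression": "0 0 */6 * * ?", "timezone_id": "UTC"},
}
payload = json.dumps(job_spec)
```

You would POST this payload to `/api/2.0/jobs/create` on your workspace with a personal access token; the cluster is created for the run and torn down afterwards, so there is no long-lived node to SSH into.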

Spark Deployment in Azure cloud

Is it possible to deploy Spark code in the Azure cloud without the YARN component? Thanks in advance.
Yes, you can deploy an Apache Spark cluster in Azure HDInsight without YARN.
Spark clusters in HDInsight include the following components, available on the clusters by default:
1) Spark Core (includes Spark Core, Spark SQL, Spark Streaming APIs, GraphX, and MLlib)
2) Anaconda
3) Livy
4) Jupyter notebook
5) Zeppelin notebook
Spark clusters on HDInsight also provide an ODBC driver for connectivity to Spark clusters in HDInsight from BI tools such as Microsoft Power BI and Tableau.
Refer to the following sites for more information:
Create an Apache Spark cluster in Azure HDInsight
Introduction to Spark on HDInsight
I don't think it is possible to deploy an HDInsight cluster without YARN. Refer to the HDInsight documentation:
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-hadoop-introduction
https://learn.microsoft.com/en-sg/azure/hdinsight/hdinsight-component-versioning
YARN is the resource manager for Hadoop. Is there any particular reason you would not want to use YARN while working with an HDInsight Spark cluster?
If you want to use standalone mode, you can change the master URL when submitting the job with the spark-submit command.
I have some examples in my repo of spark-submit both in local mode and on an HDInsight cluster:
https://github.com/NileshGule/learning-spark
You can refer to
local mode : https://github.com/NileshGule/learning-spark/blob/master/src/main/java/com/nileshgule/movielens/MovieLens.md
HDInsight Spark cluster : https://github.com/NileshGule/learning-spark/blob/master/Azure.md
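The master URL change mentioned above is the only difference between the deployment modes; a minimal sketch (the class name, jar path, and standalone master host are placeholders, not taken from the linked repo):

```python
# Sketch: switching between local, standalone, and YARN deployments only
# changes the --master argument passed to spark-submit.
def spark_submit_cmd(master: str) -> list[str]:
    return [
        "spark-submit",
        "--master", master,                 # local[*], spark://host:7077, or yarn
        "--class", "com.example.MovieLens",  # placeholder main class
        "target/learning-spark.jar",         # placeholder jar path
    ]

local_cmd      = spark_submit_cmd("local[*]")
standalone_cmd = spark_submit_cmd("spark://master-host:7077")  # standalone mode
yarn_cmd       = spark_submit_cmd("yarn")                      # HDInsight default
# subprocess.run(standalone_cmd) would launch against the standalone master.
```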

Is it possible to take a snapshot of an existing HDInsight cluster in Azure

Currently we have an HDInsight cluster which we might have to shut down or delete for a few days. We need the cluster to be in the same state as we left it. What are the ways we can preserve the current snapshot of this cluster and restore it after a few days?
It depends on how you created the HDInsight cluster. When you created the cluster, did you specify external metastores, so that your Hive metastore is running on your own Azure SQL database and not the one that HDInsight created?
Check this documentation.
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters#use-hiveoozie-metastore
If you didn't use external metastores when you created the cluster, unfortunately, you will lose that state. Your data, however, will be persisted in Azure Blob storage or Azure Data Lake Store.
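For reference, the external metastore settings land in the cluster definition's `hive-site` configuration when the cluster is created (e.g. via an ARM template or SDK). A rough sketch, where the server, database, and credentials are placeholders and the exact template shape should be checked against the linked provisioning document:

```python
# Sketch of hive-site values set at cluster creation so the Hive metastore
# lives in your own Azure SQL database and survives cluster deletion.
# Server, database, and credentials below are placeholders.
hive_site = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:sqlserver://myserver.database.windows.net;"
        "database=hivemetastore;encrypt=true",
    "javax.jdo.option.ConnectionDriverName":
        "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "javax.jdo.option.ConnectionUserName": "metastore_user",
    "javax.jdo.option.ConnectionPassword": "metastore_password",
}
# Assumed placement inside the cluster definition payload:
cluster_definition = {"configurations": {"hive-site": hive_site}}
```

Because the database outlives the cluster, a new cluster created with the same configuration sees the same tables, which is as close to a "snapshot" of Hive state as HDInsight offers.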

headnodehost in Azure HDInsight

What is this headnodehost in Azure HDInsight? I set up an HBase cluster. There are headnodes in this HBase cluster. When I RDP to the cluster and open the Hadoop Name Node status web link from the desktop, it opens a web browser with the link set to headnodehost:30070. Is headnodehost the same as the headnodes? The hostname command in the RDP session gives me "headnode0" rather than "headnodehost".
Each HDInsight cluster has two headnodes for high availability. This is documented at https://azure.microsoft.com/en-us/documentation/articles/hdinsight-high-availability/
