Can I use Hive on Azure Databricks without Hadoop/HDInsight?

The docs say "Every Databricks deployment has a central Hive metastore...", in addition to supporting an external metastore for existing Hive installations.
I have an Azure Databricks workspace with an underlying Spark cluster, and data files stored on DBFS and Blob Storage. Do I need an HDInsight cluster with an external metastore to be able to create and use Hive tables? Or can I use the above-mentioned central metastore to create Hive tables on data stored on DBFS or Blob Storage?

@Gadam No, you do not. Azure Databricks provisions its own Hive metastore, but if you are already using one with HDInsight, Databricks can be configured to use that one instead (as an external metastore).
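A minimal sketch of creating a Hive table against the workspace's central metastore from a notebook (the table name and DBFS mount path are hypothetical):

CREATE TABLE default.sales (id INT, amount DOUBLE, sale_date DATE)
USING PARQUET
LOCATION 'dbfs:/mnt/sales-data/';  -- data files already sitting on DBFS or a mounted Blob container

SELECT COUNT(*) FROM default.sales;  -- resolved through the built-in metastore, no HDInsight involved

The external metastore configuration is only needed if you want Databricks and an existing HDInsight Hive installation to share the same table definitions.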

Related

Databricks notebooks lineage in Azure Purview

If I read a file from ADLS into a PySpark DataFrame and write it back to another ADLS folder in a different file format, will that lineage be captured in the Hive metastore? Can lineage be shown for this kind of operation?
Currently this lineage won't show up out of the box. However, Purview uses Apache Atlas behind the scenes, so you can probably capture this lineage using the API.
Here's an example of where Spline was used to track lineage from notebooks:
https://intellishore.dk/data-lineage-from-databricks-to-azure-purview/
This article talks about how to get started with the Purview REST API:
https://techcommunity.microsoft.com/t5/azure-architecture-blog/exploring-purview-s-rest-api-with-python/ba-p/2208058
You can use the OpenLineage-based Databricks to Purview Solution Accelerator to ingest the lineage provided by Databricks. By deploying the solution accelerator, you'll have a set of Azure Functions and a Databricks cluster that can extract the logical plan from a Databricks notebook / job and transform it automatically into Apache Atlas / Microsoft Purview entities.
It supports table-level lineage from Spark notebooks and jobs for the following data sources:
Azure SQL
Azure Synapse Analytics
Azure Data Lake Gen 2
Azure Blob Storage
Delta Lake
Supports Spark 3.1 and 3.0 (Interactive and Job clusters) / Spark 2.x (Job clusters)
Databricks Runtimes between 6.4 and 10.3 are currently supported
Can be configured per cluster or for all clusters as a global configuration
Once configured, does not require any code changes to notebooks or jobs
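For reference, a Spark SQL sketch of the kind of read-and-rewrite job described in the question, whose logical plan the accelerator would translate into Purview entities (the storage account, containers, and table name are hypothetical):

CREATE TABLE curated_events
USING DELTA
LOCATION 'abfss://curated@mydatalake.dfs.core.windows.net/events/'
AS SELECT * FROM parquet.`abfss://raw@mydatalake.dfs.core.windows.net/events/`;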

Migrating a Hive table to a different environment

I have a Hive table on Azure HDInsight (WASB) that I want to migrate / copy from the Production to the QA environment. It looks like I can only do it via export / import:
1) Export the tables (stored as Parquet) to files, with metadata included
2) AzCopy from Prod Storage to QA Storage
3) Import tables
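A minimal HiveQL sketch of steps 1 and 3 (the database, table, and storage paths are hypothetical; the exported directory is what step 2 copies with AzCopy):

-- on the Production cluster: write the table's data and metadata to a WASB directory
EXPORT TABLE salesdb.transactions TO 'wasb://backup@prodstorage.blob.core.windows.net/export/transactions';

-- (step 2: AzCopy the export directory from the Production storage account to the QA storage account)

-- on the QA cluster: recreate the table from the copied export
IMPORT TABLE transactions FROM 'wasb://backup@qastorage.blob.core.windows.net/export/transactions';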
Azure HDInsight only supports exporting/importing the Hive metastore.
If you want to retain your Hive tables after you delete an HDInsight cluster, use a custom metastore. You can then attach the metastore to another HDInsight cluster.
Note: An HDInsight metastore that is created for one HDInsight cluster version cannot be shared across different HDInsight cluster versions.
References:
How do I export a Hive metastore and import it on another cluster?
Hive metastore best practices

Hadoop on Azure using IaaS

I am looking at setting up a Hadoop cluster for big data analytics in a virtualized environment on Azure. As the data volume is very high, I am looking at keeping the data in secondary storage such as Azure Data Lake Store, with the Hadoop cluster storage acting as the primary storage.
I would like to know how this can be configured so that when I create a Hive table and partition it, part of the data can reside in the primary storage and the rest in the secondary storage?
You can't mix file systems in a Hive table by default. The Hive metastore only holds one filesystem location for a database / table definition.
You might try Waggle Dance to set up a federated Hive solution, but that is probably more work than simply letting the Hive data live in Azure.
I don't know about Hadoop and Hive, but you could combine Azure Data Lake Store (ADLS) and Azure SQL Data Warehouse (ADW), i.e. use PolyBase in ADW to create an external table over the 'cold' data in ADLS and an internal table for your 'warm' data. ADW has the advantage that you can pause it.
Optionally create a view over the top to combine the external and internal table.
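A rough T-SQL sketch of that setup; the data source, file format, table, and view names are hypothetical, and the database-scoped credential for ADLS is assumed to exist already:

-- expose the 'cold' data in ADLS as an external table via PolyBase
CREATE EXTERNAL DATA SOURCE AdlsStore
WITH (TYPE = HADOOP, LOCATION = 'adl://myadls.azuredatalakestore.net', CREDENTIAL = AdlsCredential);

CREATE EXTERNAL FILE FORMAT ParquetFormat WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.SalesCold (SaleId INT, Amount DECIMAL(18,2), SaleDate DATE)
WITH (LOCATION = '/sales/cold/', DATA_SOURCE = AdlsStore, FILE_FORMAT = ParquetFormat);

-- the 'warm' data stays in an internal table, dbo.SalesWarm; a view presents both tiers as one table
CREATE VIEW dbo.SalesAll AS
SELECT SaleId, Amount, SaleDate FROM dbo.SalesWarm
UNION ALL
SELECT SaleId, Amount, SaleDate FROM dbo.SalesCold;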

Azure ML - Import Hive Query Failing - Hive over ADLS

We are working on the Azure ML and ADLS combination. Since the HDInsight cluster works over ADLS, we are trying to take the Hive query and HDFS route and are running into problems.
We would appreciate help with reading data via a Hive query and writing it to HDFS. Below is the error URL for reference:
https://studioapi.azureml.net/api/sharedaccess?workspaceId=025ba20578874d7086e6c495cc49a3f2&signature=ZMUCNMwRjlrksrrmsrx5SaGedSgwMmO%2FfSHvq190%2F1I%3D&sharedAccessUri=https%3A%2F%2Fesprodussouth001.blob.core.windows.net%2Fexperimentoutput%2Fccf9a206-730d-4773-b44e-a2dd8c6e87b9%2Fccf9a206-730d-4773-b44e-a2dd8c6e87b9.txt%3Fsv%3D2015-02-21%26sr%3Db%26sig%3DHkuFm8B2Ba1kEWWIwanqlv%2FcQPWVz0XYveSsZnEa0Wg%3D%26st%3D2017-10-16T18%3A31%3A06Z%26se%3D2017-10-17T18%3A36%3A06Z%26sp%3Dr
Azure Machine Learning supports Hive but not over ADLS.

HDInsight Azure Blob Storage Change

On an HDInsight cluster, a Hive table is created using a CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE HTable(t1 string, t2 string, t3 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasb://$containerName@$storageAccountName.blob.core.windows.net/HTable/data/';
Then some existing files are changed and some new files are added to the Azure Blob container mentioned in the CREATE statement.
Does a new Hive query pick up the changes made to the Blob container without loading data into the Hive table again?
Yes, your table definition is saved in the Hive metastore, so you can subsequently just query HTable and the data will be there. Hive on HDInsight generally follows the same rules that apply to Hive on HDFS.
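For instance (a minimal sketch; the repair statement only matters if the table were partitioned and partition directories were added directly in storage):

SELECT COUNT(*) FROM HTable;   -- reflects whatever files are currently under the table's LOCATION
-- MSCK REPAIR TABLE HTable;   -- only needed for partitioned tables whose partitions were added outside Hive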
For a more advanced discussion, you can play some tricks, but you need to know what you're doing. Because HDInsight storage can outlive the cluster, with HDInsight it is feasible to tear down the cluster, redeploy a new HDInsight cluster, and still have the Hive data. You can even keep the Hive metastore, as it is a separate database (an Azure SQL DB). With an HDFS-based cluster, recycling the cluster leads to the loss of all HDFS data.
