HDInsight Azure Blob Storage Change

On an HDInsight cluster, a Hive table is created using a CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE HTable(t1 string, t2 string, t3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION 'wasb://$containerName@$storageAccountName.blob.core.windows.net/HTable/data/';
Then some existing files are changed and some new files are added to the Azure Blob container named in the CREATE statement.
Does a new Hive query pick up the changes made to the Blob container without loading the data into the Hive table again?

Yes. Your table definition is saved in the Hive metastore, and an external table reads whatever files are in its LOCATION at query time, so you can simply query HTable and the data will be there. In general, Hive on HDInsight follows the same rules that apply to Hive on HDFS.
For a more advanced scenario you can play some tricks, but you need to know what you're doing. Because HDInsight storage outlives the cluster, it is feasible to tear down the cluster, deploy a new HDInsight cluster, and still have the Hive data. You can even keep the Hive metastore, as it is a separate database (an Azure SQL database). With an HDFS-based cluster, recycling the cluster leads to the loss of all HDFS data.
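As a minimal sketch of the first point, assuming the HTable definition from the question is already in the metastore, a plain query is enough to see files that were added or changed under the table's LOCATION. The aggregate below is only an illustration; any SELECT behaves the same way.

```sql
-- Assumes the external table HTable from the question already exists.
-- Hive lists the files under the table's LOCATION when the query runs,
-- so files added or changed in the container are read without a reload.
SELECT t1, COUNT(*) AS row_count
FROM HTable
GROUP BY t1;
```

HTable is not partitioned. For a partitioned external table, new partition folders would first have to be registered in the metastore, for example with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION.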

Related

Synapse Analytics sql on-demand sync with spark pool is very slow to query

I have files loaded into an Azure Data Lake Storage Gen2 account and am using Azure Synapse Analytics to query them. Following the documentation here: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables, I should be able to create a Spark SQL table over the partitioned data and then use the Spark metadata in my SQL on-demand query, given this line in the doc: "When a table is partitioned in Spark, files in storage are organized by folders. Serverless SQL pool will use partition metadata and only target relevant folders and files for your query."
My data is partitioned in ADLS gen2 as:
Running the query in a spark notebook in Synapse Analytics returns in just over 4 seconds, as it should given the partitioning:
However, now running the same query in the sql on demand sql side script never completes:
This result, and the extreme drop in performance compared to the Spark pool, runs completely counter to what the documentation says. Is there something I am missing in the query to make SQL on-demand use the partitions?
The filepath() and filename() functions can be used in the WHERE clause to filter the files to be read. With them you can achieve the pruning you have been looking for.
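A hedged sketch of what that could look like in a serverless SQL pool script. The question's actual folder layout is not shown, so the year=/month= layout, the storage account, the container, and the events folder below are all hypothetical placeholders. filepath(n) returns the part of the path matched by the n-th wildcard in the BULK pattern.

```sql
-- Hypothetical layout: .../events/year=2020/month=01/*.parquet
SELECT COUNT(*) AS row_count
FROM OPENROWSET(
        BULK 'https://<storage-account>.dfs.core.windows.net/<container>/events/year=*/month=*/*.parquet',
        FORMAT = 'PARQUET'
     ) AS r
WHERE r.filepath(1) = '2020'   -- value matched by the first wildcard (year)
  AND r.filepath(2) = '01';    -- value matched by the second wildcard (month)
```

Filtering on filepath() lets the pool skip folders that do not match, rather than scanning every file and filtering afterwards.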

Is it possible to read an Azure Databricks table from Azure Data Factory?

I have a table in an Azure Databricks cluster, and I would like to replicate this data into an Azure SQL Database so that other users can analyze it from Metabase.
Is it possible to access Databricks tables through Azure Data Factory?
No, unfortunately not. Databricks tables are typically temporary and last only as long as your job/session is running. See here.
You would need to persist your Databricks table to some storage in order to access it. Change your Databricks job to dump the table to Blob storage as its final action. In the next step of your Data Factory job, you can then read the dumped data from the storage account and process it further.
Another option may be Databricks Delta, although I have not tried this yet...
If you register the table in the Databricks Hive metastore, then ADF could read from it using the ODBC source in ADF, though this would require an IR.
Alternatively, you could write the table to external storage such as Blob storage or a data lake. ADF can then read that file and push it to your SQL database.

Can I use Hive on Azure Databricks without Hadoop/HDInsight?

The docs say "Every Databricks deployment has a central Hive metastore..." and also mention an external metastore for existing Hive installations.
I have an Azure Databricks workspace with an underlying Spark cluster, and data files stored on DBFS and Blob Storage. Do I need an HDInsight cluster with an external metastore to be able to create and use Hive tables? Or can I use the above-mentioned central metastore to create Hive tables on data stored on DBFS or Blob Storage?
@Gadam Nope, you do not. Azure Databricks provisions its own Hive metastore, but if you are already using one with HDInsight, Databricks can be configured to use it as well (as an external metastore).
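For illustration only, a table can be registered in the workspace's built-in metastore directly from a Databricks notebook with Spark SQL. The table name events and the dbfs:/mnt/mydata/events/ path are hypothetical, and the sketch assumes CSV files with a header row.

```sql
-- Hypothetical table name and path; assumes CSV files with a header row.
-- The table is registered in the workspace metastore and points at
-- the existing files, which are not copied or moved.
CREATE TABLE IF NOT EXISTS events
USING CSV
OPTIONS (header 'true', inferSchema 'true')
LOCATION 'dbfs:/mnt/mydata/events/';

SELECT COUNT(*) AS row_count FROM events;
```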

Hadoop on Azure using IaaS

I am looking at setting up a Hadoop cluster for big data analytics using the virtualized environment in Azure. As the data volume is very high, I am looking at storing data in secondary storage such as Azure Data Lake Store, with the Hadoop cluster storage acting as the primary storage.
I would like to know how this can be configured so that when I create a Hive table and partitions, part of the data can reside in primary storage and the rest in secondary storage.
Thanks
Regards,
Madhu
You can't mix file systems with a Hive table by default. The Hive metastore holds only one filesystem location for a database / table definition.
You might try to use Waggle Dance to set up a federated Hive solution, but it's probably more work than simply allowing the Hive data to exist in Azure.
I don't know about Hadoop and Hive, but you could combine Azure Data Lake Store (ADLS) and Azure SQL Data Warehouse (ADW), i.e. use PolyBase in ADW to create an external table on the 'cold' data in ADLS and an internal table for your 'warm' data. ADW has the advantage that you can pause it.
Optionally create a view over the top to combine the external and internal table.
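A rough sketch of such a view, assuming both tables already exist with matching columns. The names dbo.SalesWarm (internal table), dbo.SalesColdExt (PolyBase external table over ADLS), and the column list are hypothetical.

```sql
-- Hypothetical object names; both tables must expose the same columns.
CREATE VIEW dbo.SalesAll
AS
SELECT sale_id, sale_date, amount FROM dbo.SalesWarm       -- internal 'warm' table
UNION ALL
SELECT sale_id, sale_date, amount FROM dbo.SalesColdExt;   -- external 'cold' table over ADLS
```

Queries against dbo.SalesAll then see warm and cold rows as one table, at the cost of reading from ADLS whenever the cold branch is touched.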

Is it possible to take snapshot of existing HDInsight cluster in azure

Currently we have an HDInsight cluster that we might have to shut down or delete for a few days. We need the cluster back in the same state we left it in. How can we preserve a snapshot of this cluster and restore it after a few days?
It depends on how you created the HDInsight cluster. When you created the cluster, did you specify an external metastore, so that your Hive metastore runs on your own Azure SQL database and not the one that HDInsight created?
Check this documentation.
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters#use-hiveoozie-metastore
If you didn't use an external metastore when you created the cluster, unfortunately you will lose that state. Your data, however, will be persisted in Azure Blob storage or Azure Data Lake Store.
