BigQuery as metastore for Dataproc - apache-spark

We are trying to migrate an on-premise PySpark script, which creates and drops tables in Hive as part of its data transformations, to the GCP platform.
Hive is replaced by BigQuery.
In this case, the Hive reads and writes are converted to BigQuery reads and writes using the spark-bigquery-connector.
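For example, with the spark-bigquery-connector (assumed to be available on the Dataproc cluster), the converted reads and writes look roughly like the sketch below; the project, dataset, table, and bucket names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-to-bq-migration").getOrCreate()

# Read: replaces spark.table("db.hive_table") / HiveQL SELECTs
df = (spark.read.format("bigquery")
      .option("table", "my_project.my_dataset.source_table")  # placeholder
      .load())

transformed = df.filter("t1 IS NOT NULL")  # stand-in for the real transformations

# Write: replaces INSERT INTO / saveAsTable against Hive
(transformed.write.format("bigquery")
 .option("table", "my_project.my_dataset.target_table")  # placeholder
 .option("temporaryGcsBucket", "my-staging-bucket")       # staging bucket for the indirect write method
 .mode("overwrite")
 .save())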
However, the problem lies with creating and dropping BigQuery tables via Spark SQL: by default, Spark SQL runs CREATE and DROP statements against Hive (backed by the Hive metastore), not against BigQuery.
I wanted to check whether there is a plan to incorporate DDL statement support into the spark-bigquery-connector as well.
Also, from an architecture perspective, is it possible to back the Spark SQL metastore with BigQuery so that any CREATE or DROP statement can be run on BigQuery from Spark?

I don't think Spark SQL will support BigQuery as a metastore, nor will the BQ connector support BQ DDL. On Dataproc, Dataproc Metastore (DPMS) is the recommended solution for the Hive and Spark SQL metastore.
In particular, for on-prem to Dataproc migrations, it is more straightforward to migrate to DPMS; see this doc.
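One workaround worth noting (it is not part of the connector) is to issue the DDL directly against BigQuery from the PySpark driver with the google-cloud-bigquery client library, while keeping reads and writes on the connector. A minimal sketch, assuming the library is installed on the cluster and using placeholder dataset/table names:

from google.cloud import bigquery

# The client picks up the cluster's service-account credentials automatically.
client = bigquery.Client()

# DDL goes straight to BigQuery instead of spark.sql("CREATE/DROP TABLE ..."),
# which would hit the Hive metastore.
client.query("CREATE TABLE IF NOT EXISTS my_dataset.staging_table (t1 STRING, t2 STRING)").result()
# ... connector-based reads/writes run here ...
client.query("DROP TABLE IF EXISTS my_dataset.staging_table").result()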

Related

Spark table vs Hive table: difference and usage

Is there a difference between a Spark table and a Hive table? How are they different, and do both get registered in the Hive metastore?
I also wanted to understand whether a table created by Spark (but not by Hive) can be registered in the Hive metastore.
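One way to see the registration behaviour, assuming a Spark session with Hive support enabled and placeholder names, is that a table created with saveAsTable does land in the Hive metastore while a temp view does not:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Persisted table: registered in the Hive metastore, so Hive can see it too
df.write.mode("overwrite").saveAsTable("spark_created_table")

# Temp view: session-scoped only, never registered in the metastore
df.createOrReplaceTempView("spark_temp_view")

spark.sql("SHOW TABLES").show()  # spark_created_table is permanent, spark_temp_view is temporary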

Hive in Azure Synapse

We have a Scala JAR file which runs on an on-prem Hadoop cluster. It creates Hive tables on Parquet files and does further Spark processing on the Hive tables.
May I know your suggestion on how to run such Hive-based Spark processing in a Synapse Spark pool, if possible without changing our code?
Currently, Spark only works with external Hive tables and non-transactional/non-ACID managed Hive tables; it doesn't support Hive ACID/transactional tables.
For more details, refer to Use external Hive Metastore for Synapse Spark Pool (Preview)
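As an illustration of the supported case, a non-ACID external table over existing Parquet files can be created and processed from the Spark pool roughly as below; the table name and abfss path are placeholders, and the pool is assumed to be attached to the external Hive metastore:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# External, non-transactional table over existing Parquet files (placeholder location)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (t1 STRING, t2 STRING)
    STORED AS PARQUET
    LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/sales/'
""")

# Further Spark processing on the Hive table, as in the on-prem job
spark.table("sales_ext").groupBy("t1").count().show()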

Migrate a Hive table to a different environment

I have a Hive table on Azure HDInsight (WASB) that I want to migrate/copy from the Production to the QA environment; it looks like I can only do it via export/import:
1) Export tables from parquet to files (metadata included)
2) AzCopy from Prod Storage to QA Storage
3) Import tables
Azure HDInsight supports only export/import of the Hive metastore; for individual tables, a manual re-registration alternative is sketched after the references below.
If you want to retain your Hive tables after you delete an HDInsight cluster, use a custom metastore. You can then attach the metastore to another HDInsight cluster.
Note: An HDInsight metastore that is created for one HDInsight cluster version cannot be shared across different HDInsight cluster versions.
References:
How do I export a Hive metastore and import it on another cluster?
Hive metastore best practices
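If the metastore export/import route is not practical and only a handful of tables need to move, a manual alternative (not the Hive EXPORT/IMPORT commands) is to AzCopy the data files and then re-register each table in the QA metastore from Spark; a sketch with placeholder names and locations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# After AzCopy has copied the table's files into the QA storage account,
# re-create the table definition over the copied location (placeholders).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS prod_table (t1 STRING, t2 STRING, t3 STRING)
    STORED AS PARQUET
    LOCATION 'wasb://qa-container@qastorageaccount.blob.core.windows.net/prod_table/data/'
""")

spark.sql("SELECT COUNT(*) FROM prod_table").show()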

Azure ML - Import Hive Query Failing - Hive over ADLS

We are working on an Azure ML and ADLS combination. Since the HDInsight cluster works over ADLS, we are trying to use the Hive query and HDFS route and are running into problems.
We would appreciate your help in solving the problem of reading data from a Hive query and writing to HDFS. Below is the error URL for reference:
https://studioapi.azureml.net/api/sharedaccess?workspaceId=025ba20578874d7086e6c495cc49a3f2&signature=ZMUCNMwRjlrksrrmsrx5SaGedSgwMmO%2FfSHvq190%2F1I%3D&sharedAccessUri=https%3A%2F%2Fesprodussouth001.blob.core.windows.net%2Fexperimentoutput%2Fccf9a206-730d-4773-b44e-a2dd8c6e87b9%2Fccf9a206-730d-4773-b44e-a2dd8c6e87b9.txt%3Fsv%3D2015-02-21%26sr%3Db%26sig%3DHkuFm8B2Ba1kEWWIwanqlv%2FcQPWVz0XYveSsZnEa0Wg%3D%26st%3D2017-10-16T18%3A31%3A06Z%26se%3D2017-10-17T18%3A36%3A06Z%26sp%3Dr
Azure Machine Learning supports Hive but not over ADLS.

HDInsight Azure Blob Storage Change

On an HDInsight cluster, a Hive table is created using a CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE HTable(t1 string, t2 string, t3 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION 'wasb://$containerName@$storageAccountName.blob.core.windows.net/HTable/data/';
Then some existing files are changed and some files are added to the Azure Blob container mentioned in the CREATE statement.
Does a new Hive query pick up the changes made to the Blob container without loading data into the Hive table again?
Yes, your table definition is saved in the Hive metastore. You can subsequently simply query HTable and the data will be there. Normally, Hive on HDInsight follows the same rules that apply to Hive on HDFS.
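For instance, after files are added or changed in the table's blob location, the next query simply lists that location again; a small sanity check from Spark with Hive support (the behaviour is the same from Hive itself):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Count before new files are uploaded to the table's wasb:// location
spark.sql("SELECT COUNT(*) FROM HTable").show()

# ... files are added or modified in the blob container outside of Hive ...

# No reload is needed; the refresh only matters if Spark cached the file listing
spark.sql("REFRESH TABLE HTable")
spark.sql("SELECT COUNT(*) FROM HTable").show()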
For a more advanced discussion, you can play some tricks, but you need to know what you're doing. Because HDInsight storage can outlive the cluster, with HDInsight it is feasible to tear down the cluster, redeploy a new HDInsight cluster, and still have the Hive data. You can even keep the Hive metastore, since it is a separate database (an Azure SQL DB). With an HDFS-based cluster, recycling the cluster leads to loss of all HDFS data.