Can someone please let me know whether it is possible to read/write Hive ACID tables from Spark on AWS EMR?
Is there a difference between a Spark table and a Hive table? How do they differ, and do both get registered in the Hive metastore?
I wanted to understand whether a table that was created by Spark, but not by Hive, can be registered in the Hive metastore.
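For illustration, here is a minimal PySpark sketch of that situation, assuming Hive support is enabled and a Hive metastore is configured for the session; the database and table names are hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the configured Hive metastore
# (hive-site.xml on the classpath, or spark.sql.catalogImplementation=hive).
spark = (
    SparkSession.builder
    .appName("spark-table-vs-hive-table")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# saveAsTable creates a managed table whose metadata lands in the Hive
# metastore, so Hive and other Spark sessions can see it as demo_db.events.
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
df.write.mode("overwrite").saveAsTable("demo_db.events")

# The table is visible through the metastore like any other Hive table.
spark.sql("SHOW TABLES IN demo_db").show()
```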
We are trying to migrate a pyspark script from on-premise, which creates and drops tables in Hive and performs data transformations, to the GCP platform.
Hive is replaced by BigQuery.
In this case, the Hive reads and writes are converted to BigQuery reads and writes using the spark-bigquery-connector.
However, the problem lies with creating and dropping BigQuery tables via Spark SQL, because Spark SQL will by default run the CREATE and DROP statements against Hive (backed by the Hive metastore), not against BigQuery.
I wanted to check whether there is a plan to incorporate DDL statement support into the spark-bigquery-connector as well.
Also, from an architecture perspective, is it possible to base the metastore for Spark SQL on BigQuery, so that any CREATE or DROP statement can be run against BigQuery from Spark?
I don't think Spark SQL will support BigQuery as a metastore, nor will the BigQuery connector support BigQuery DDL. On Dataproc, Dataproc Metastore (DPMS) is the recommended solution for the Hive and Spark SQL metastore.
In particular, for on-prem to Dataproc migration, it is more straightforward to migrate to DPMS; see this doc.
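For reference, a minimal sketch of the read/write conversion the question describes, assuming the spark-bigquery-connector jar is on the classpath (e.g. via --packages or the Dataproc connector setup); the project, dataset, table, and bucket names are hypothetical, and note the connector handles reads/writes but not DDL issued through spark.sql():

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-to-bigquery").getOrCreate()

# Read: what used to be spark.table("db.sales") against Hive becomes a
# BigQuery read through the connector.
sales = (
    spark.read.format("bigquery")
    .option("table", "my-project.analytics.sales")
    .load()
)

transformed = sales.groupBy("region").count()

# Write: the connector writes the DataFrame to a BigQuery table; a GCS
# staging bucket is used for the indirect write method.
(
    transformed.write.format("bigquery")
    .option("table", "my-project.analytics.sales_by_region")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```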
We have a Scala JAR file which runs on an on-prem Hadoop cluster. It creates Hive tables on Parquet files and does further Spark processing on those Hive tables.
May I know your suggestion on how to run such Hive-based Spark processing in a Synapse Spark pool, if possible without changing our code?
Currently, Spark only works with external Hive tables and non-transactional/non-ACID managed Hive tables; it doesn't support Hive ACID/transactional tables.
For more details, refer to Use external Hive Metastore for Synapse Spark Pool (Preview)
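As a minimal sketch of the supported pattern (an external, non-ACID Hive table over Parquet), assuming the Spark pool is linked to an external Hive metastore; the storage path, database, and table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-external-table-on-parquet")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS staging")

# Register an EXTERNAL (non-transactional) table over existing Parquet files;
# this is the kind of Hive table Spark can read and write today.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
    STORED AS PARQUET
    LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/orders/'
""")

# Further Spark processing on the Hive table works as it does on-prem.
spark.table("staging.orders").groupBy().sum("amount").show()
```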
As per this article from Microsoft, all the clusters pointing to an external shared Hive metastore have to be of the same HDInsight version. Does that mean the clusters can be of varying types as long as they are on the same HDInsight version? For a given HDInsight version, the cluster type could be Hadoop, Spark, Interactive Query, etc.
A custom metastore lets you attach multiple clusters and cluster types to that metastore. For example, a single metastore can be shared across Interactive Query, Hive, and Spark clusters in HDInsight.
Example: if you have a Hadoop cluster on HDI 3.6, its metastore can be shared with a Spark cluster on HDI 3.6.
Important points to remember:
If you share a metastore across multiple clusters, ensure all the clusters are the same HDInsight version. Different Hive versions use different metastore database schemas.
You can't share a metastore across Hive 2.1 and Hive 3.1 versioned clusters. Example: you can't share a Hive metastore between HDInsight 4.0 and HDInsight 3.6.
Hope this helps.
Does Spark SQL on the DSE platform have these features?
Create Hive External tables to Read from Cassandra (DSEFS) and S3
To be done via the C* Metastore
Enforce RBAC (role-based access control) from the C* metastore
Run 'Hive on Spark' in a DSE environment.
Can 'Hive on Spark' running on a separate HDFS cluster read from DSE's C* metastore?
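This doesn't answer the DSE-specific metastore and RBAC questions above, but for context, a generic sketch of exposing a Cassandra table to Spark SQL using the open-source spark-cassandra-connector (which DSE Analytics bundles); the connection host, keyspace, and table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-cassandra-from-spark-sql")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# Load a Cassandra table through the connector and expose it to Spark SQL
# as a temporary view.
orders = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="shop", table="orders")
    .load()
)
orders.createOrReplaceTempView("orders")

spark.sql("SELECT count(*) FROM orders").show()
```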