Databricks: global unmanaged table, partition metadata sync guarantees - apache-spark

Objective
I want to create Databricks global unmanaged tables from ADLS data and use them from multiple clusters (automated and interactive). So I run CREATE TABLE my_table ... first, then MSCK REPAIR TABLE my_table. I'm using the Databricks internal Hive metastore.
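A minimal sketch of that flow, assuming a Parquet layout; the table name, schema, and ADLS path are placeholders:

    # Sketch only: table name, schema, and ADLS path are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Unmanaged (external) table over partitioned data that already sits in ADLS
    spark.sql("""
        CREATE TABLE IF NOT EXISTS my_table (id BIGINT, value STRING, dt STRING)
        USING PARQUET
        PARTITIONED BY (dt)
        LOCATION 'abfss://container@account.dfs.core.windows.net/path/to/data'
    """)

    # Register the partition directories that already exist in storage
    spark.sql("MSCK REPAIR TABLE my_table")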
The issue
Sometimes the result of MSCK REPAIR isn't synced across clusters at all, for hours: cluster #1 sees the partitions immediately, while cluster #2 doesn't see any data for some time.
Sometimes it is synced; still, I can't understand why it doesn't work in the other cases.
Question
Does Databricks use a separate internal Hive metastore per cluster? If yes, are there any guarantees about sync-up between clusters?

I believe each Databricks deployment has a single Hive metastore: https://docs.databricks.com/data/metastores/index.html.
So if the metastore is being updated immediately, the next most likely problem is that stale table metadata is being cached, so you aren't seeing the updates. Have you tried running
REFRESH TABLE <database>.<table>;
on the cluster that was having the sync issues?
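If it is a caching issue, something like this on the lagging cluster should force Spark to re-read the table's metadata and file listing (placeholder names):

    # On the cluster that doesn't see the new partitions yet (placeholder names)
    spark.sql("REFRESH TABLE my_db.my_table")       # invalidate cached metadata/file listing
    spark.catalog.refreshTable("my_db.my_table")    # same thing via the catalog API
    spark.sql("SELECT dt, COUNT(*) AS n FROM my_db.my_table GROUP BY dt").show()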

Related

On HDInsight 4.0, how can I save/update a table in Hive and have it readable from Oozie/Spark/Jupyter?

Today we have this scenario:
Azure HDInsight 4.0 cluster
Workflows running on Oozie
On this version, Spark and Hive no longer share metadata.
We came from HDInsight 3.6; to cope with this change, we now use the Hive Warehouse Connector.
Before: df.write.saveAsTable("tableName", mode="overwrite")
Now: df.write.mode("overwrite").format(HiveWarehouseSession().HIVE_WAREHOUSE_CONNECTOR).option('table', "tableName").save()
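For reference, this write runs inside an HWC session set up roughly like this (a sketch using the pyspark_llap package shipped with HDInsight 4.0, assuming the HWC configs such as the HiveServer2 Interactive JDBC URL are already set in Ambari; table name is a placeholder):

    # Sketch of a typical HWC session on HDInsight 4.0 (placeholder names).
    from pyspark.sql import SparkSession
    from pyspark_llap import HiveWarehouseSession

    spark = SparkSession.builder.appName("hwc-example").getOrCreate()
    hive = HiveWarehouseSession.session(spark).build()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # Write into a Hive managed table through the connector
    df.write.mode("overwrite") \
        .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR) \
        .option("table", "tableName") \
        .save()

    # Read it back through HiveServer2 Interactive (LLAP)
    hive.executeQuery("SELECT * FROM tableName").show()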
The problem is at this point: using HWC makes it possible to save tables in Hive.
But the Hive databases/tables are not visible to Spark, Oozie and Jupyter; they only see tables in the Spark scope.
So this is a major problem for us, because it is not possible to read data from Hive managed tables and use it in an Oozie workflow.
To be able to save a table in Hive and have it visible across the whole cluster, I made these configurations in Ambari:
hive > hive.strict.managed.tables = false
spark2 > metastore.catalog.default = hive
And now it is possible to save a table in Hive the "old" way, with df.write.saveAsTable.
But there is a problem when the table is updated/overwritten:
pyspark.sql.utils.AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:java.security.AccessControlException: Permission denied: user=hive,
path="wasbs://containerName#storageName.blob.core.windows.net/hive/warehouse/managed/table"
:user:supergroup:drwxr-xr-x);'
So, I have two questions:
Is this the correct way to save a table in Hive so that it is visible on the whole cluster?
How can I avoid this permission error on table overwrite? Keep in mind, this error occurs when we execute the Oozie workflow.
Thanks!

How to merge Hive partitioned and bucketed files into one big file?

I am working on an Azure HDInsight cluster for big data processing. A few days back I created a partitioned and bucketed table in Hive by merging many files.
Since Azure does not give any option to stop the cluster, I had to delete the cluster to save cost. The data is stored independently in an Azure storage account. When I create a new cluster using the same storage account, I can see the database and the table using HDFS commands, but Hive cannot read that database or table; maybe Hive does not have metadata about it.
The only option I am left with is to merge all those partitioned and bucketed files into a single file and then create the table again. So is there any way by which I can migrate that table to another database, or merge it so that it would be easier to migrate?
You can create an EXTERNAL TABLE (with the same properties as before) pointing to that HDFS location. Since you mentioned it has partitions, you can run MSCK REPAIR TABLE table-name so that you can see the partitions as well.
Hope this helps
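A rough sketch of that approach, with placeholder schema, bucketing spec and storage location that must match the original table definition:

    # Sketch: recreate only the metadata over the files that survived in the storage account.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_table (id BIGINT, value STRING)
        PARTITIONED BY (dt STRING)
        CLUSTERED BY (id) INTO 8 BUCKETS
        STORED AS ORC
        LOCATION 'wasbs://container@account.blob.core.windows.net/hive/warehouse/my_table'
    """)
    # Some Spark versions cannot create bucketed Hive tables; if so, run the DDL above
    # from Beeline/Hive instead, then run the repair step from Spark or Hive.
    spark.sql("MSCK REPAIR TABLE my_db.my_table")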

Possible memory leak on Hadoop cluster? (Hive, HiveServer2, Zeppelin, Spark)

The heap usage of HiveServer2 is constantly increasing (first pic).
There are applications such as NiFi, Zeppelin and Spark connected to Hive: NiFi uses PutHiveQL, Zeppelin uses JDBC (Hive), and Spark uses Spark SQL. I couldn't find any clue to this.
Hive requires a lot of resources to establish a connection. So the first suspect is the large number of queries in your PutHiveQL processor, because Hive needs to open a connection for every one of them. Keep an eye on your Hive job browser (you can use Hue for this purpose).
Possible resolution: e.g. if you use insert queries, write the data as ORC files and insert it in bulk. If you use update queries, use a temporary table and a merge query.
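A sketch of the second suggestion, assuming the PyHive client and a transactional (ACID) target table; the host, paths and all table/column names are hypothetical:

    # Sketch: replace many per-row statements with one bulk load plus one merge.
    from pyhive import hive

    conn = hive.connect(host="hiveserver2-host", port=10000, username="etl")
    cur = conn.cursor()

    # 1) Land the incoming batch as ORC files in a staging table (one bulk load
    #    instead of one INSERT per record through PutHiveQL).
    cur.execute("CREATE TABLE IF NOT EXISTS staging_events (id BIGINT, payload STRING) STORED AS ORC")
    cur.execute("LOAD DATA INPATH '/tmp/incoming_orc' INTO TABLE staging_events")

    # 2) Apply the whole batch with a single MERGE instead of many UPDATE statements.
    cur.execute("""
        MERGE INTO events AS t
        USING staging_events AS s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET payload = s.payload
        WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.payload)
    """)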

Running Spark App: Persist Metastore

I work on a Spark 2.1 application that also uses SparkSQL and saves data with dataframe.write.saveAsTable(tbl). My understanding is that an in-memory Derby DB is used for the Hive metastore (right?). This means that a table I create in the first execution is not available in any subsequent execution. In many cases that might be the intended behaviour, but I would like to persist the metastore across executions (since this is also the behaviour I have in my production system).
So, a simple question: how can I change the configuration to persist the metastore on disk?
One remark: I am not starting the Spark job with spark-shell or spark-submit, but as a standalone Scala application.
It is already persisted on disk. As long as both sessions use the same working directory or the same explicit metastore configuration, the permanent table will be persisted between sessions.
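If relying on the working directory feels too fragile, one option is to pin both the warehouse and the embedded Derby metastore to fixed paths when building the session. A sketch in PySpark, assuming the Hive support jars are on the classpath; the Scala SparkSession builder takes the same options, and the paths are placeholders:

    # Sketch: fixed locations for the warehouse and the embedded Derby metastore,
    # so every run of the standalone app sees the same tables (paths are placeholders).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("persistent-metastore-app")
        .config("spark.sql.warehouse.dir", "/data/spark-warehouse")
        # Store the Derby metastore at a fixed path instead of ./metastore_db
        .config("spark.hadoop.javax.jdo.option.ConnectionURL",
                "jdbc:derby:;databaseName=/data/metastore_db;create=true")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.range(5).write.mode("overwrite").saveAsTable("tbl")  # visible again on the next run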

Spark Permanent Tables Management

I have a question regarding best practices for managing permanent tables in Spark. I have been working with Databricks previously, and in that context Databricks manages permanent tables, so you do not have to 'create' or reference them each time a cluster is launched.
Let's say that in a Spark cluster session a permanent table is created with the saveAsTable command, using the option to partition the table. The data is stored in S3 as parquet files.
Next day, a new cluster is created and it needs to access that table for different purposes:
SQL query for exploratory analysis
ETL process for appending a new chunk of data
What is the best way to make the saved table available again as the same table, with the same structure/options/path? Maybe there is a way to store Hive metastore settings so they can be reused between Spark sessions? Or maybe, each time a Spark cluster is created, I should do CREATE EXTERNAL TABLE with the correct options to specify the format (parquet), the partitioning and the path?
Furthermore, if I want to access those parquet files from another application, e.g. Apache Impala, is there a way to store and retrieve the Hive metastore information, or does the table have to be created again?
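What I have in mind for the CREATE EXTERNAL TABLE option is roughly this (a sketch; database/table names, schema and the S3 path are placeholders):

    # Sketch: re-register an existing partitioned parquet dataset on a fresh cluster.
    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.events (id BIGINT, value STRING, dt STRING)
        USING PARQUET
        PARTITIONED BY (dt)
        LOCATION 's3a://my-bucket/warehouse/events'
    """)
    spark.sql("MSCK REPAIR TABLE analytics.events")  # pick up the partitions already in S3

    # 1) SQL query for exploratory analysis
    spark.sql("SELECT dt, COUNT(*) AS n FROM analytics.events GROUP BY dt").show()

    # 2) ETL process appending a new chunk of data
    new_chunk = spark.createDataFrame([(100, "x", "2020-01-02")], ["id", "value", "dt"])
    new_chunk.write.mode("append").insertInto("analytics.events")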
Thanks
