Spark Permanent Tables Management - apache-spark

I have a question regarding best practices for managing permanent tables in Spark. I have previously worked with Databricks, and in that context Databricks manages permanent tables, so you do not have to 'create' or reference them each time a cluster is launched.
Let's say that in a Spark cluster session, a permanent table is created with the saveAsTable command, using the option to partition the table. The data is stored in S3 as parquet files.
The next day, a new cluster is created and it needs to access that table for different purposes:
SQL query for exploratory analysis
ETL process for appending a new chunk of data
What is the best way to make the saved table available again as the same table, with the same structure/options/path? Is there maybe a way to store Hive metastore settings so they can be reused between Spark sessions? Or should I, each time a Spark cluster is created, run CREATE EXTERNAL TABLE with the correct options to specify the format (parquet), the partitioning and the path?
Furthermore, if I want to access those parquet files from another application, e.g. Apache Impala, is there a way to store and retrieve the Hive metastore information, or does the table have to be created again?
Thanks
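A minimal PySpark sketch of the re-registration approach mentioned in the question; the table name, schema, and S3 path are illustrative, not from the original setup:

from pyspark.sql import SparkSession

# A session with Hive support records table metadata in the configured metastore,
# so later clusters pointing at the same metastore can reuse the definition.
spark = SparkSession.builder.appName("register-table").enableHiveSupport().getOrCreate()

# Re-register the existing parquet files as an external table on the new cluster.
# The partition column must appear both in the schema and in PARTITIONED BY.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING, dt STRING)
    USING parquet
    PARTITIONED BY (dt)
    LOCATION 's3a://my-bucket/warehouse/events'
""")

# Discover the partitions that already exist under the S3 prefix.
spark.sql("MSCK REPAIR TABLE events")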

Related

Question regarding Spark SQL tables and databases

I am new to Spark and am confused about the points below:
When we create new databases and global tables for our own analysis (using the DataFrame API or Spark SQL), where do these get created/stored? Are they stored in Spark memory or in external storage (Hive/HDFS/an RDBMS, etc.) from which Spark reads the data? Do temporary views/local tables only get created in Spark memory?
Thanks!
When we create new databases and global tables for our own analysis (using the DataFrame API or Spark SQL), where do these get created/stored?
It depends on your infrastructure. For example:
If you're in an on-prem environment, the underlying data is on HDFS
If you're in Azure Databricks, the underlying data is on Azure Storage
If you're in Databricks Cloud, the underlying data is on Amazon S3
Are they stored in Spark memory or in external storage (Hive/HDFS/an RDBMS, etc.) from which Spark reads the data?
The data is only loaded into memory when you call df.cache()
Do temporary views/local tables only get created in Spark memory?
Yes
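As a small illustration of the difference (names below are made up): a temp view only lives in the current session's catalog, a saved table writes files plus metastore entries that outlive the session, and neither puts data in memory until you cache it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Temporary view: an entry in this session's in-memory catalog only.
df.createOrReplaceTempView("tmp_vals")

# Permanent table: files go to the warehouse location (HDFS/S3/Azure storage)
# and the definition goes to the metastore, so other sessions can see it.
df.write.mode("overwrite").saveAsTable("persisted_vals")

# Only caching actually pins the data in Spark executor memory.
spark.table("persisted_vals").cache().count()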

Databricks: global unmanaged table, partition metadata sync guarantees

Objective
I want to create Databricks global unmanaged tables from ADLS data and use them from multiple clusters (automated and interactive). So I'm doing CREATE TABLE my_table ... first, then MSCK REPAIR TABLE my_table. I'm using Databricks internal Hive metastore.
The issue
Sometimes MSCK REPAIR wasn't synced across clusters (at all, for hours). That means cluster #1 saw the partitions immediately, while cluster #2 didn't see any data for some time.
Sometimes it is synced; I still can't understand why it doesn't work in the other cases.
Question
Does Databricks use a separate internal Hive metastore per cluster? If yes, are there any guarantees about sync-up between clusters?
I believe each Databricks deployment has a single Hive metastore: https://docs.databricks.com/data/metastores/index.html.
So if the metastore is being updated immediately, then the next most likely problem is that the old table metadata is being cached, so you aren't seeing the updates. Have you tried running
REFRESH TABLE <database>.<table>;
on the cluster that was having the sync issues?
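For example (table and database names are hypothetical), on the cluster that lags behind you could re-discover the partitions and drop the cached metadata before querying:

# Run on the cluster that does not yet see the new partitions.
spark.sql("MSCK REPAIR TABLE my_db.my_table")    # sync partition directories into the metastore
spark.sql("REFRESH TABLE my_db.my_table")        # invalidate this cluster's cached metadata/file listing
spark.sql("SELECT COUNT(*) FROM my_db.my_table").show()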

How to merge partitioned and bucketed Hive files into one big file?

I am working on an Azure HDInsight cluster for big data processing. A few days back I created a partitioned and bucketed table in Hive by merging many files.
Since Azure does not give any option to stop the cluster, I had to delete the cluster to save cost. The data is stored independently in an Azure storage account. When I create a new cluster using the same storage account, I can see the database and the table using HDFS commands, but Hive cannot read that database or table; presumably Hive does not have the metadata for them.
The only option I am left with is to merge all those partitioned and bucketed files into a single file and then create the table again. So is there any way I can migrate that table to another database, or merge it so that it would be easier to migrate?
You can create an EXTERNAL TABLE (with the same properties as before) pointing to that HDFS location. Since you mentioned it has partitions, you can run MSCK REPAIR TABLE table-name so that the partitions become visible as well.
Hope this helps
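A sketch of that suggestion from PySpark, assuming Spark on the new cluster shares the Hive metastore (as it typically does on HDInsight); the schema, storage path, and Parquet format are placeholders, and any CLUSTERED BY ... INTO n BUCKETS clause should be copied from the original DDL (ideally issued from Hive/Beeline):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Names, schema, format, and path are illustrative; match them to the original table.
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_table (id BIGINT, payload STRING)
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'wasbs://container@account.blob.core.windows.net/path/to/table'
""")

# Register the partitions that already exist under that location.
spark.sql("MSCK REPAIR TABLE my_db.my_table")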

Is it possible to create persistent view in Spark?

I'm learning Spark and found that I can create a temp view by calling one of the following PySpark APIs:
df.createGlobalTempView("people")
df.createTempView("people")
df.createOrReplaceTempView("people")
Can I create a permanent view so that it becomes available to every user of my Spark cluster? I think this would save people time if the views are already defined for them.
Yes, but you'll have to use SQL:
spark.sql("CREATE VIEW persistent_people AS SELECT * FROM people")
By paradigm, Spark doesn't have any persistence capabilities of its own, since it's a data processing engine, not a data warehouse.
If you want to provide session-independent views, you need to work with an existing Hive deployment or use an approach with a Spark-owned metastore. For more details, please refer to the Spark documentation on Hive interaction.
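A short sketch, assuming all sessions share a Hive metastore and using illustrative names; note that a persistent view has to be defined over a permanent table, because a view over a temp view is rejected (the temp view would die with the session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# The view must reference a permanent table; CREATE VIEW over a temp view
# raises an AnalysisException because the temp view is session-scoped.
df.write.mode("overwrite").saveAsTable("people")
spark.sql("CREATE OR REPLACE VIEW persistent_people AS SELECT id, name FROM people")

# Any later session that points at the same metastore can now run:
spark.sql("SELECT * FROM persistent_people").show()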

What specific benefits can we get by using Spark SQL to access Hive tables compared to using JDBC to read tables from SQL Server?

I just got this question while designing the storage part of a Hadoop-based platform. If we want data scientists to have access to tables that are already stored in a relational database (e.g. SQL Server on an Azure Virtual Machine), will there be any particular benefits to importing the tables from SQL Server into HDFS (e.g. WASB) and creating Hive tables on top of them?
In other words, since Spark allows users to read data from other databases using JDBC, is there any performance improvement if we persist the tables from the database in an appropriate format (Avro, Parquet, etc.) in HDFS and use Spark SQL to access them using HQL?
I am sorry if this question has been asked before; I have done some research but could not find a comparison between the two approaches.
I think there will be a big performance improvement, as the data is local (assuming Spark is running on the same Hadoop cluster where the data is stored on HDFS). With JDBC, if the processing is interactive, the user has to wait for the data to be loaded over JDBC from another machine (network latency and I/O throughput), whereas if that load is done upfront, the user (data scientist) can concentrate on the analysis straight away.
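To make the comparison concrete, a hedged sketch of the two access paths (connection details, table names, and partitioning bounds are placeholders; the SQL Server JDBC driver must be available on the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Path A: read over JDBC on every run -- rows cross the network each time,
# and parallelism depends on configuring partitioned reads.
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myvm.example.com:1433;databaseName=sales")
    .option("dbtable", "dbo.transactions")
    .option("user", "reader")
    .option("password", "<secret>")
    .option("partitionColumn", "id")
    .option("lowerBound", 1)
    .option("upperBound", 10000000)
    .option("numPartitions", 8)
    .load())

# Path B: land the data once in Parquet next to the compute and register it,
# so repeated Spark SQL queries read local, columnar files.
jdbc_df.write.mode("overwrite").format("parquet").saveAsTable("transactions_parquet")
spark.sql("SELECT COUNT(*) FROM transactions_parquet").show()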
