External Hive Table Refresh table vs MSCK Repair - apache-spark

I have an external Hive table stored as Parquet, partitioned on a column (say as_of_dt), and data gets inserted via Spark Streaming.
Every day a new partition gets added. I run msck repair table so that the Hive metastore picks up the newly added partition info. Is this the only way, or is there a better way? I am concerned about downstream users querying the table: will msck repair cause any issues such as unavailable or stale data? I was going through the HiveContext API and saw the refreshTable option. Does it make sense to use refreshTable instead?

To answer your question directly: msck repair table compares the partition directories under the table location with what the Hive metastore knows, and registers any partitions that are missing from the metastore. By default it only adds partitions. If you deleted a handful of partitions and don't want them to show up in the show partitions output for the table, recent Hive releases let you run msck repair table with the drop partitions or sync partitions option; on older releases you have to remove them with alter table ... drop partition. Msck repair can take more time than an invalidate or refresh statement, but those do different things: Invalidate Metadata is an Impala statement that reloads Impala's cached metadata from the Hive metastore, and Refresh Table runs in Spark SQL and only invalidates the metadata Spark has cached for the table.
The Hive metastore should be fine if you complete the add partition step somewhere in the processing. However, if you ever want to access the Hive table through Spark SQL, you will also need to refresh the metadata that Spark caches for the table (and likewise in Impala or any other engine that caches table metadata).
Anytime you update or change the contents of a Hive table, Spark's cached metadata can fall out of sync, so queries through the spark.sql command set may fail or return stale results. If you want to query that data, you need to keep Spark's view of the table in sync.
If you have a Spark version that allows for it, you should add partitions to and refresh Hive tables from within Spark, so everything stays in sync. Below is how I do it:
//Non-Partitioned Table
//Write the files, then invalidate Spark's cached metadata for the table
outputDF.write.format("parquet").mode("overwrite").save(fileLocation)
spark.sql("refresh table " + tableName)
//Partitioned Table
//Write the partition's files, register the partition in the metastore, then refresh
outputDF.write.format("parquet").mode("overwrite").save(fileLocation + "/" + partition)
val addPartitionsStatement = "alter table " + tableName + " add if not exists partition(partitionKey='" + partition + "') location '" + fileLocation + "/" + partition + "'"
spark.sql(addPartitionsStatement)
spark.sql("refresh table " + tableName)

It looks like refreshTable only refreshes the metadata cached by Spark; it does not affect the Hive metadata.
Doc says:
Invalidate and refresh all the cached the metadata of the given table.
For performance reasons, Spark SQL or the external data source library
it uses might cache certain metadata about a table, such as the
location of blocks. When those change outside of Spark SQL, users
should call this function to invalidate the cache.
The method does not update the Hive metadata, so the repair (or an explicit add partition) is still necessary.

Related

How to perform MSCK REPAIR TABLE to load only specific partitions

I have more than 2 months of data in AWS S3 that is partitioned and stored by day. I want to start querying the data through the external table that I created.
Currently I see only a couple of partitions, and I want to make sure my metadata picks up all of them. I tried running msck repair table tablename in Hive after logging in to the EMR cluster's master node. However, maybe due to the data volume, the command is taking a long time to execute.
Can I run msck repair table so that it loads only a specific day? Does msck allow loading specific partitions?
You can use
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
...as described in Hive DDL doc.
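As a rough sketch of how this could be scripted for a range of days, the helper below generates one ALTER TABLE ... ADD PARTITION statement per day. The object name PartitionDdl, the function addDailyPartitions, and the directory layout baseLocation/key=value are assumptions made for illustration; adjust the location pattern to match how the data is actually laid out in S3.

```scala
import java.time.LocalDate

// Hypothetical helper, not part of Hive or Spark.
object PartitionDdl {
  /** Returns one ADD PARTITION statement for every day from start to end, inclusive.
    * Assumes each day lives under <baseLocation>/<key>=<yyyy-MM-dd>.
    */
  def addDailyPartitions(table: String, key: String, baseLocation: String,
                         start: LocalDate, end: LocalDate): Seq[String] = {
    require(!end.isBefore(start), "end must not be before start")
    // Walk the calendar one day at a time until we pass the end date.
    val days = Iterator.iterate(start)(_.plusDays(1)).takeWhile(d => !d.isAfter(end)).toList
    days.map { d =>
      // LocalDate.toString yields the ISO form yyyy-MM-dd.
      s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION ($key='$d') LOCATION '$baseLocation/$key=$d'"
    }
  }
}
```

Each generated string can then be run through the Hive CLI, Beeline, or spark.sql, which registers only the requested days instead of scanning the whole table location.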

Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive-External table which is created on top of S3 (Files are in CSV format). When I run the hive query it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL has a where condition on the partition column), the plan does not show that a partition filter is applied. For a Hive managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of the partitions of Hive external tables in Spark? Thanks.
Update :
For some reason, only the Spark plan is not showing the Partition Filters. However, when you look at the data loaded, it is only loading the data needed from the partitions.
Ex: with where rating=0 it loads only one file of 1 MB; when I don't have a filter it reads all 3 partitions, 3 MB in total.
tl;dr: set the following before running the SQL against the external table
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not due to the table being external or managed.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. File format (I believe it is ORC in this case, from the screen capture)
Where the table was created (Hive or Spark)
If the table was created using Spark APIs, it is considered a Datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both kinds of table is stored in the Hive metastore; the only difference is in the provider field of TBLPROPERTIES of the tables (describe extended <tblName>). The value of the property is orc or empty for a Spark table and hive for a Hive table.
How Spark uses this information
When provider is not hive (Datasource table), Spark uses its native way of processing the data.
If provider is hive, Spark uses Hive code to process the data.
File format
Spark provides config flags that instruct the engine to use the Datasource way of processing the data for the following file formats: ORC and Parquet
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
  .doc("When set to true, the built-in ORC reader and writer are used to process " +
    "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
  .doc("When set to true, the built-in Parquet reader and writer are used to process " +
    "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
  .booleanConf
  .createWithDefault(true)
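The decision described above can be summarised as a small model. This is only a sketch of the rules as stated in this answer, not Spark's actual implementation; the names ReadPath, DataSource, HiveSerde, and choose are made up for illustration, and the two boolean parameters stand in for the convertMetastoreOrc and convertMetastoreParquet flags.

```scala
// Hypothetical model of the rules described in the answer, not Spark source code.
object ReadPath {
  sealed trait Path
  case object DataSource extends Path // Spark's built-in reader
  case object HiveSerde extends Path  // Hive serde code path

  /** provider: the table's provider property, if any.
    * format: the table's file format, e.g. "orc", "parquet", "csv".
    */
  def choose(provider: Option[String], format: String,
             convertOrc: Boolean = true, convertParquet: Boolean = true): Path =
    provider.map(_.toLowerCase) match {
      case Some("hive") =>
        // Hive native table: built-in reader only for ORC/Parquet with the flag on.
        format.toLowerCase match {
          case "orc" if convertOrc         => DataSource
          case "parquet" if convertParquet => DataSource
          case _                           => HiveSerde
        }
      // Provider is not hive (or is empty): treated as a Datasource table.
      case _ => DataSource
    }
}
```

Under this model, a Hive-created ORC table with the flag switched off, or a Hive-created table in any other format such as CSV, goes through the Hive serde path.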
I also ran into this kind of problem with multiple joins of internal and external tables.
None of the tricks worked, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?

Live sync of Spark RDD from Cassandra table

I'm looking for a way to keep a Spark RDD in sync with a Cassandra table. I know it is possible to load a full Cassandra table into an RDD as a one shot operation but would like to keep the RDD synchronized with updates happening to the Cassandra table.
This would avoid reloading the full table into Spark every time I need fresh data (which can take a long time if the table is big).
Any hints?

Hive Bucketed Tables enabled for Transactions

So we are trying to create a Hive table with ORC format bucketed and enabled for transactions using the below statement
create table orctablecheck ( id int,name string) clustered by (sno) into 3 buckets stored as orc TBLPROPERTIES ( 'transactional'='true')
The table is created in Hive and shows up in Beeline, both in the metastore and in Spark SQL (which we have configured to run on top of Hive JDBC).
We are now inserting data into this table via Hive. However, after insertion the data does not show up in Spark SQL; it only shows up correctly in Hive.
The data only appears in Spark SQL if we restart the Thrift Server.
Is the transactional attribute set on your table? In my experience, the storage layout of Hive transactional tables does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of the command below in the Hive console.
desc extended <tablename> ;
If you need to access a transactional table, consider running a major compaction and then try accessing the table again
ALTER TABLE <tablename> COMPACT 'major';
I created a transactional table in Hive, and stored data in it using Spark (records 1,2,3) and Hive (record 4).
After major compaction,
I can see all 4 records in Hive (using beeline)
only records 1,2,3 in spark (using spark-shell)
unable to update records 1,2,3 in Hive
update to record 4 in Hive is ok

What is the metastore for in Spark?

I am using SparkSQL in Python. I have created a partitioned table (a few hundred partitions) and stored it as a Hive internal table using the hiveContext. The Hive warehouse is located in S3.
When I simply do df = hiveContext.table("mytable"), it takes over a minute to go through all the partitions the first time. I thought the metastore stored all the metadata. Why would Spark still need to go through each partition? Is it possible to avoid this step so my startup can be faster?
The key here is that it takes this long to load the file metadata only on the first query. The reason is that SparkSQL doesn't store the partition metadata in the Hive metastore. For Hive partitioned tables, the partition information needs to be stored in the metastore. How the table was created dictates how this behaves. From the information provided, it sounds like you created a SparkSQL table.
SparkSQL stores the table schema (which includes the partition columns) and the root directory of your table, but still discovers each partition directory on S3 dynamically when the query is run. My understanding is that this is a tradeoff so you don't need to manually add new partitions whenever the table is updated.
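To illustrate what discovering partitions from directories involves, here is a minimal sketch that reads partition values out of a Hive-style key=value path. It is not Spark's discovery code; the object PartitionPath and its parse function are hypothetical, and the sketch assumes the common layout where every partition level is a directory named column=value.

```scala
// Hypothetical illustration of reading partition values from a path.
object PartitionPath {
  /** Extracts column -> value pairs from the key=value segments of a relative path.
    * Segments without '=' (such as the data file name) are ignored.
    */
  def parse(relativePath: String): Map[String, String] =
    relativePath.split('/').iterator
      .filter(_.contains("="))
      .map { segment =>
        // Split on the first '=' only, so values may themselves contain '='.
        val Array(key, value) = segment.split("=", 2)
        key -> value
      }
      .toMap
}
```

With a few hundred partitions on S3, every one of those directories has to be listed before the first query can run, which is consistent with the one-off delay described in the question.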
