How to perform MSCK REPAIR TABLE to load only specific partitions - apache-spark

I have more than two months of data in AWS S3, partitioned and stored by day. I want to start querying it through the external table that I created.
Currently I see only a couple of partitions, and I want to make sure my metadata picks up all of them. I tried running msck repair table tablename in Hive after logging in to the EMR cluster's master node, but, possibly due to the data volume, that command takes a very long time to execute.
Can I run msck repair table so that it loads only a specific day? Does msck allow loading specific partitions?

You can use
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], ...];
...as described in the Hive DDL documentation.
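For example, to register a single day's partition without scanning the whole table, the statement can be issued from Hive or through Spark SQL. Below is a minimal Scala sketch; the table name mydb.events, partition column dt, and S3 path are hypothetical:

import org.apache.spark.sql.SparkSession

// Assumes a SparkSession with Hive metastore support; table, column, and path names are placeholders.
val spark = SparkSession.builder()
  .appName("add-single-partition")
  .enableHiveSupport()
  .getOrCreate()

// Register only one day's partition instead of repairing the whole table.
spark.sql("""
  ALTER TABLE mydb.events ADD IF NOT EXISTS
  PARTITION (dt = '2019-05-01')
  LOCATION 's3://my-bucket/events/dt=2019-05-01/'
""")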

Related

Hive - Copy database schema with partitions and recreate in another hive instance

I have copied the data and folder structure for a database with partitioned hive tables from one HDFS instance to another.
How can I do the same with the Hive metadata? I need the new HDFS instance's Hive to have this database and its tables defined with their existing partitioning, just like in the original location. And, of course, they need to keep their original schemas, with only the HDFS external table locations being updated.
Happy to use direct hive commands, spark, or any general CLI utilities that are open source and readily available. I don't have an actual hadoop cluster (this is cloud storage), so please avoid answers that depend on map reduce/etc (like Sqoop).
Use Hive command:
SHOW CREATE TABLE tablename;
This will print the CREATE TABLE statement. Copy it, change the table type to external, and adjust the location, schema, and column names if necessary, then execute it.
After you have created the table, use this command to create the partition metadata:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual: RECOVER PARTITIONS
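As a sketch of the whole workflow driven from Spark (a minimal example assuming Hive support is enabled; the database and table names are hypothetical, and the printed DDL still needs manual editing for the new location before it is re-executed on the target metastore):

// Dump the DDL from the source metastore.
spark.sql("SHOW CREATE TABLE sourcedb.mytable").show(false)

// After editing the DDL (external type, new location) and running it against the
// target metastore, rebuild the partition metadata there.
spark.sql("MSCK REPAIR TABLE targetdb.mytable")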

SparkSQL on hive partitioned external table on amazon s3

I am planning to use SparkSQL (not PySpark) on top of data in Amazon S3, so I believe I need to create a Hive external table and then use SparkSQL. But the S3 data is partitioned, and I want those partitions reflected in the Hive external table as well.
What is the best way to manage the Hive table on a daily basis? Every day new partitions can be created or old partitions overwritten, so what should be done to keep the Hive external table up to date?
Create an intermediate table and load it into your Hive table with an INSERT OVERWRITE on the date partition.
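A minimal sketch of that daily load from Spark SQL, assuming a staging table staging_events and a target external table events partitioned by dt (all table and column names here are hypothetical):

// Allow dynamic partition overwrite for the daily load; placeholder names throughout.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// Overwrite only the partitions present in the staging data.
spark.sql("""
  INSERT OVERWRITE TABLE events PARTITION (dt)
  SELECT id, payload, dt FROM staging_events
""")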

External Hive Table Refresh table vs MSCK Repair

I have an external Hive table stored as Parquet, partitioned on a column (say as_of_dt), and data gets inserted via Spark Streaming.
Every day a new partition gets added. I am running msck repair table so that the Hive metastore gets the newly added partition info. Is this the only way, or is there a better one? I am concerned that, with downstream users querying the table, msck repair might cause unavailability of data or stale data. I was going through the HiveContext API and saw the refreshTable option. Does it make sense to use refreshTable instead?
To directly answer your question: msck repair table checks whether the partitions for a table are active. That means if you deleted a handful of partitions and don't want them to show up in the show partitions output for the table, msck repair table should drop them. msck repair can take more time than an invalidate or refresh statement; however, INVALIDATE METADATA is an Impala statement that just reloads metadata from the Hive metastore into Impala, while REFRESH runs only in Spark SQL and updates Spark's metadata cache.
The Hive metastore should be fine if you complete the add-partition step somewhere in your processing; however, if you ever want to access the Hive table through Spark SQL, you will also need to update the metadata through Spark (or Impala, or another process that updates the Spark metadata).
Any time you update or change the contents of a Hive table, the Spark metastore can fall out of sync, leaving you unable to query the data through the spark.sql command set. In other words, if you want to query that data, you need to keep the Spark metastore in sync.
If you have a Spark version that allows it, you should refresh and add partitions to Hive tables from within Spark, so that all metastores stay in sync. Below is how I do it:
//Non-Partitioned Table
// Write the data, then refresh Spark's cached metadata for the table.
outputDF.write.format("parquet").mode("overwrite").save(fileLocation)
spark.sql("refresh table " + tableName)

//Partitioned Table
// Write the partition's data, register it in the Hive metastore, then refresh Spark's cache.
outputDF.write.format("parquet").mode("overwrite").save(fileLocation + "/" + partition)
val addPartitionsStatement = "alter table " + tableName +
  " add if not exists partition(partitionKey='" + partition + "') location '" +
  fileLocation + "/" + partition + "'"
spark.sql(addPartitionsStatement)
spark.sql("refresh table " + tableName)
It looks like refreshTable does refresh the cached metadata without affecting the Hive metadata.
The doc says:
Invalidate and refresh all the cached metadata of the given table.
For performance reasons, Spark SQL or the external data source library
it uses might cache certain metadata about a table, such as the
location of blocks. When those change outside of Spark SQL, users
should call this function to invalidate the cache.
The method does not update the Hive metadata, so repair is still necessary.
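Put differently, the two operations target different layers. A short sketch, assuming a SparkSession with Hive support and a metastore-backed table named mydb.events (a placeholder name):

// Refresh Spark SQL's cached metadata only (does not touch the Hive metastore).
spark.catalog.refreshTable("mydb.events")

// Rebuild the partition entries in the Hive metastore from the directories on storage.
spark.sql("MSCK REPAIR TABLE mydb.events")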

Hive Bucketed Tables enabled for Transactions

So we are trying to create a Hive table in ORC format, bucketed and enabled for transactions, using the statement below:
create table orctablecheck (id int, name string) clustered by (id) into 3 buckets stored as orc TBLPROPERTIES ('transactional'='true')
The table gets created in Hive and is also reflected in Beeline, both in the metastore as well as in Spark SQL (which we have configured to run on top of the Hive JDBC).
We are now inserting data into this table via Hive. However, after insertion, the data doesn't show up in Spark SQL; it only shows up correctly in Hive.
The table only shows the data if we restart the Thrift Server.
Is the transactional attribute set on your table? I observed that the Hive transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of the command below in the Hive console:
desc extended <tablename> ;
If you need to access a transactional table, consider running a major compaction and then try accessing the table again:
ALTER TABLE <tablename> COMPACT 'major';
I created a transactional table in Hive, and stored data in it using Spark (records 1,2,3) and Hive (record 4).
After a major compaction:
- I can see all 4 records in Hive (using Beeline)
- only records 1, 2, 3 in Spark (using spark-shell)
- I am unable to update records 1, 2, 3 in Hive
- an update to record 4 in Hive is OK

What is the metastore for in Spark?

I am using SparkSQL in Python. I have created a partitioned table (a few hundred partitions) and stored it as a Hive internal table using the hiveContext. The Hive warehouse is located in S3.
When I simply do df = hiveContext.table("mytable"), it takes over a minute to go through all the partitions the first time. I thought the metastore stored all the metadata. Why would Spark still need to go through each partition? Is it possible to avoid this step so my startup can be faster?
The key here is that it takes this long to load the file metadata only on the first query. The reason is that SparkSQL doesn't store the partition metadata in the Hive metastore, whereas for Hive partitioned tables the partition information is kept in the metastore. How the table was created dictates this behavior; from the information provided, it sounds like you created a SparkSQL table.
SparkSQL stores the table schema (which includes partition information) and the root directory of your table, but still discovers each partition directory on S3 dynamically when the query is run. My understanding is that this is a tradeoff so you don't need to manually add new partitions whenever the table is updated.
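One way to avoid the per-query directory scan, sketched below, is to define the table with Hive-style DDL so the partition metadata lives in the metastore, and register the partitions there once instead of discovering them at query time. The table name, columns, and S3 path are hypothetical, and a Hive-enabled SparkSession is assumed:

// Define the table through Hive DDL so partitions are tracked by the metastore
// (placeholder names and path).
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS mydb.events (id BIGINT, payload STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
  LOCATION 's3://my-bucket/warehouse/events/'
""")

// Register the existing partition directories in the metastore once.
spark.sql("MSCK REPAIR TABLE mydb.events")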