Hive - Copy database schema with partitions and recreate in another hive instance - apache-spark

I have copied the data and folder structure for a database with partitioned hive tables from one HDFS instance to another.
How can I do the same with the Hive metadata? The new HDFS instance's Hive needs to have this database and its tables defined with their existing partitioning, just as in the original location. And, of course, they need to keep their original schemas, with the HDFS external table locations updated to the new instance.
Happy to use direct Hive commands, Spark, or any general CLI utilities that are open source and readily available. I don't have an actual Hadoop cluster (this is cloud storage), so please avoid answers that depend on MapReduce etc. (like Sqoop).

Use Hive command:
SHOW CREATE TABLE tablename;
This will print the CREATE TABLE statement. Copy it, change the table type to EXTERNAL, and adjust the location, schema, and column names if necessary, then execute it on the new instance.
After you have created the table, use this command to create the partition metadata:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
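For example, a minimal PySpark sketch of the whole workflow; the database, table, and location names are placeholders, and SHOW CREATE TABLE support for Hive tables varies by Spark version:
from pyspark.sql import SparkSession

# Assumes Hive support is enabled against the source metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Dump the DDL from the source metastore
ddl = spark.sql("SHOW CREATE TABLE sourcedb.tablename").collect()[0][0]

# 2. Point the DDL at the new storage location (edit further by hand if needed,
#    e.g. make the table EXTERNAL or rename the database)
new_ddl = ddl.replace("hdfs://old-namenode:8020/data/tablename",
                      "hdfs://new-namenode:8020/data/tablename")

# 3. Run the edited DDL against the target metastore, then rebuild the partition metadata
spark.sql(new_ddl)
spark.sql("MSCK REPAIR TABLE tablename")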

Related

Write a spark DataFrame to a table

I am trying to understand the spark DataFrame API method called saveAsTable.
I have the following questions:
If I simply write a DataFrame using the saveAsTable API,
df7.write.saveAsTable("t1"), (assuming t1 did not exist earlier), will the newly created table be a Hive table which can be read outside Spark using HiveQL?
Does Spark also create some non-Hive tables (created using the saveAsTable API but not readable outside Spark using HiveQL)?
How can I check whether a table is a Hive table or a non-Hive table?
(I am new to big data processing, so pardon me if the question is not phrased properly.)
Yes. The newly created table will be a Hive table and can be queried from the Hive CLI (only if the DataFrame is created from a single, non-partitioned input HDFS path).
Below is the documentation comment from the DataFrameWriter.scala class. Documentation link
When the DataFrame is created from a non-partitioned
HadoopFsRelation with a single input path, and the data source
provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and
Parquet), the table is persisted in a Hive compatible format, which
means other systems like Hive will be able to read this table.
Otherwise, the table is persisted in a Spark SQL specific format.
Yes, you can. Your table can be partitioned by a column, but it cannot use bucketing (that is a known incompatibility between Spark and Hive).
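As a rough way to check from Spark itself whether a saved table is Hive compatible (a sketch; the DataFrame is a placeholder and the exact fields reported vary by Spark version):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df7 = spark.range(100).withColumnRenamed("id", "value")  # placeholder DataFrame
df7.write.saveAsTable("t1")

# DESCRIBE FORMATTED shows the provider and SerDe; a Hive-compatible table reports a
# Hive SerDe (e.g. a Parquet/ORC SerDe) rather than a Spark-specific source format.
spark.sql("DESCRIBE FORMATTED t1").show(100, truncate=False)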

Delta Lake Table metadata

Where does Delta Lake store the table metadata info? I am using Spark 2.6 (not Databricks) on my standalone machine. My assumption was that if I restart Spark, the table created in Delta Lake would be dropped (trying this from a Jupyter notebook). But that is not the case.
There are two types of tables in Apache Spark: external tables and managed tables. When a table is created with the LOCATION keyword in the CREATE TABLE statement, it is an external table. Otherwise, it is a managed table and its location is under the directory specified by the Spark SQL conf spark.sql.warehouse.dir, whose default value is the spark-warehouse directory in the current working directory.
Besides the data, Spark also needs to store the table metadata in the Hive Metastore, so that Spark knows where the data is when a user queries the table by name. The Hive Metastore is usually a database. If a user doesn't specify one, Spark uses an embedded database called Derby to store the table metadata on the local file system.
The DROP TABLE command has different behaviors depending on the table type. When a table is a managed table, DROP TABLE removes the table from the Hive Metastore and deletes the data. If the table is an external table, DROP TABLE removes the table from the Hive Metastore but keeps the data on the file system; the data files of an external table need to be deleted from the file system manually by the user.
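A small sketch of the two cases (the warehouse directory, table names, and location are placeholders; the same managed/external distinction applies to Delta tables when the Delta Lake package is on the classpath):
from pyspark.sql import SparkSession

# Placeholder warehouse directory; by default it would be ./spark-warehouse
spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", "/tmp/my-warehouse")
         .enableHiveSupport()
         .getOrCreate())

# Managed table: data lives under spark.sql.warehouse.dir; DROP TABLE deletes the files
spark.sql("CREATE TABLE managed_t (id INT) USING parquet")

# External table: LOCATION is user supplied; DROP TABLE keeps the files on disk
spark.sql("CREATE TABLE external_t (id INT) USING parquet LOCATION '/tmp/external_t'")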

SparkSQL on hive partitioned external table on amazon s3

I am planning to use SparkSQL (not PySpark) on top of data in Amazon S3. So I believe I need to create a Hive external table and then can use SparkSQL. But the S3 data is partitioned and I want the partitions reflected in the Hive external table as well.
What is the best way to manage the Hive table on a daily basis, since every day new partitions can be created or old partitions can be overwritten? What should be done to keep the Hive external table up to date?
Create an intermediate table and load into your Hive table with INSERT OVERWRITE ... PARTITION on the date.
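For example, a hedged sketch of that daily load (the table names, partition column, and date are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Load the day's data into a staging (intermediate) table first, then
# overwrite only that day's partition in the external table.
spark.sql("""
  INSERT OVERWRITE TABLE events PARTITION (dt = '2020-01-01')
  SELECT col1, col2 FROM events_staging WHERE dt = '2020-01-01'
""")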

Write files inside Hive table hdfs folder and make them available to be queried from Hive

I am using Spark 2.2.1, which has a useful option to specify how many records to write into each output file; this feature allows avoiding a repartition before writing.
However, it seems this option is usable only with the FileWriter interface and not with the DataFrameWriter one:
written this way, the option is ignored:
df.write.mode("overwrite")
.option("maxRecordsPerFile", 10000)
.insertInto(hive_table)
while written this way, it works:
df.write.option("maxRecordsPerFile", 10000)
.mode("overwrite").orc(path_hive_table)
so I am directly writing ORC files into the warehouse folder of the specified Hive table. The problem is that if I query the Hive table after the insertion, this data is not recognized by Hive.
Do you know if there is a way to write partition files directly inside the table's folder and make them available through the Hive table as well?
Debug steps:
1. Check the file format your Hive table consumes:
SHOW CREATE TABLE table_name;
and check the "STORED AS" clause.
For better efficiency, save your output as Parquet and write it under the partition location (you can see that in "LOCATION" in the query above). If the table uses another specific format, write the files in that format instead.
2. If you are saving data into a partition by manually creating the partition folder, avoid that. Create the partition using:
alter table {table_name} add partition ({partition_column}={value});
3. After creating the output files in Spark, you can reload them and check for "_corrupt_record" (print the DataFrame to verify).
Adding to this, I also found out that the command 'MSCK REPAIR TABLE' automatically discovers new partitions inside the Hive table folder.
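Putting the steps together, a hedged sketch (the table name, partition column, warehouse path, and DataFrame are assumptions):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.range(100000).withColumnRenamed("id", "value")  # placeholder DataFrame

# Write ORC files directly under the table's partition directory,
# keeping at most 10000 records per output file
(df.write
   .option("maxRecordsPerFile", 10000)
   .mode("overwrite")
   .orc("/user/hive/warehouse/mydb.db/my_table/dt=2020-01-01"))

# Register the new partition so Hive can see the files...
spark.sql("ALTER TABLE mydb.my_table ADD IF NOT EXISTS PARTITION (dt='2020-01-01')")
# ...or discover all missing partitions at once
spark.sql("MSCK REPAIR TABLE mydb.my_table")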

What is the metastore for in Spark?

I am using SparkSQL in Python. I have created a partitioned table (a few hundred partitions) and stored it as a Hive internal table using the hiveContext. The Hive warehouse is located in S3.
When I simply do df = hiveContext.table("mytable"), it takes over a minute to go through all the partitions the first time. I thought the metastore stored all the metadata. Why does Spark still need to go through each partition? Is it possible to avoid this step so my startup can be faster?
The key here is that it takes this long to load the file metadata only on the first query. The reason is that SparkSQL doesn't store the partition metadata in the Hive metastore. For Hive partitioned tables, the partition information is kept in the metastore. How the table was created dictates how this behaves. From the information provided, it sounds like you created a SparkSQL table.
SparkSQL stores the table schema (which includes partition information) and the root directory of your table, but still discovers each partition directory on S3 dynamically when the query is run. My understanding is that this is a tradeoff so you don't need to manually add new partitions whenever the table is updated.
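One way to avoid the per-query discovery is to declare the table through Hive DDL so the partition list lives in the metastore (a sketch; the table name, schema, and S3 path are placeholders, and the exact behavior depends on the Spark version):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# A table declared through Hive DDL keeps its partition list in the metastore,
# so Spark does not have to list every S3 directory before the first query.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS mytable (value STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
  LOCATION 's3a://my-bucket/mytable/'
""")
spark.sql("MSCK REPAIR TABLE mytable")  # populate the partition list once

df = spark.table("mytable").where("dt = '2020-01-01'")  # partition-pruned read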
