Replace a hive partition from Spark - apache-spark

Is there a way I can replace an existing Hive partition from a Spark program? I want to replace only the latest partition; the rest of the partitions remain the same.
Below is the idea I am trying to work on:
We get transactional data from our RDBMS systems into HDFS every minute. A Spark program (running every 5 or 10 minutes) reads the data, performs the ETL and writes the output into a Hive table.
Since overwriting the entire Hive table would be expensive,
we would like to overwrite only today's partition.
At the end of the day, the source and destination partitions would switch to the next day.
Thanks in advance

Since you know the Hive table's location and the table is partitioned on date, append the current date to the location and overwrite that HDFS path:
df.write.format(source).mode("overwrite").save(path)
Once the write is completed, run MSCK REPAIR TABLE on the Hive table.
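A minimal sketch of that approach in Scala, assuming the table is partitioned by a date column and stored under a known HDFS location; the database, table, path and column names (etl_db.events, /warehouse/etl_db/events, event_date) are placeholders, not from the original post:

import java.time.LocalDate
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("replace-todays-partition")
  .enableHiveSupport()
  .getOrCreate()

val today = LocalDate.now().toString                // e.g. "2024-05-01"
val tableLocation = "/warehouse/etl_db/events"      // base HDFS location of the Hive table (assumed)

// df holds only today's ETL output (produced by the upstream read/transform steps)
val df = spark.table("etl_db.staging_events").where(s"event_date = '$today'")

// Overwrite just today's partition directory; the other partitions are untouched
df.write
  .format("parquet")
  .mode("overwrite")
  .save(s"$tableLocation/event_date=$today")

// Make sure the metastore knows about the (possibly new) partition
spark.sql("MSCK REPAIR TABLE etl_db.events")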

Related

Partition strategy for hive

I have a monthly Spark job that processes data and saves it into Hive/Impala tables (the file storage format is Parquet). The granularity of the table is daily data, but the source data for this job also arrives from a monthly job.
I'm trying to see how best to partition the table. I'm thinking of partitioning it based on a month key. Does anyone see any problems with this approach, or have other suggestions? Thanks.
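One way to sketch the month-key idea in Scala, assuming the daily-grain data carries a date column from which a month key can be derived; the table and column names (staging.daily_metrics, reporting.daily_metrics, event_date, month_key) are illustrative only:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format}

val spark = SparkSession.builder()
  .appName("monthly-load")
  .enableHiveSupport()
  .getOrCreate()

// daily-grain source data produced by the monthly job (assumed)
val daily = spark.table("staging.daily_metrics")

// derive a month key such as "2021-03" and partition the target table by it
daily
  .withColumn("month_key", date_format(col("event_date"), "yyyy-MM"))
  .write
  .partitionBy("month_key")
  .format("parquet")
  .mode("append")
  .saveAsTable("reporting.daily_metrics")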

Apache Hive: CREATE TABLE statement without schema over parquet can fail to infer partition column

I have a partitioned parquet at the following path:
/path/to/partitioned/parq/
with partitions like:
/path/to/partitioned/parq/part_date=2021_01_01_01_01_01
/path/to/partitioned/parq/part_date=2021_01_02_01_01_01
/path/to/partitioned/parq/part_date=2021_01_03_01_01_01
When I run a Spark SQL CREATE TABLE statement like:
CREATE TABLE IF NOT EXISTS
my_db.my_table
USING PARQUET
LOCATION '/path/to/partitioned/parq'
The partition column part_date shows up in my dataset, but DESCRIBE EXTENDED indicates there are no PARTITIONS. SHOW PARTITIONS my_db.my_table shows no partition data.
This seems to happen intermittently: sometimes Spark infers the partitions, other times it doesn't. This causes issues downstream when we add a partition and then try to MSCK REPAIR TABLE my_db.my_table, which fails because you can't run that on non-partitioned tables.
I see that if you DO declare a schema, you can force the PARTITIONED BY part of the clause, but we do not have the luxury of a schema, just the files underneath.
Why is Spark intermittently unable to determine the partition columns from a Parquet layout in this shape?
Unfortunately, with Hive you need to specify the schema explicitly, even though Parquet obviously carries it itself.
You need to add a PARTITIONED BY clause to the DDL.
Then use ALTER TABLE statements to add each partition separately with its location, as in the sketch below.
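A sketch of that workaround issued through spark.sql; only part_date, the table name my_db.my_table and the example locations come from the question, while the data columns (id, payload) are assumed placeholders:

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("register-partitions")
  .enableHiveSupport()
  .getOrCreate()

// Declare the schema, including the partition column, explicitly in the DDL
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_db.my_table (id BIGINT, payload STRING, part_date STRING)
  USING PARQUET
  PARTITIONED BY (part_date)
  LOCATION '/path/to/partitioned/parq'
""")

// Register each partition with its location instead of relying on inference
spark.sql("""
  ALTER TABLE my_db.my_table ADD IF NOT EXISTS
  PARTITION (part_date = '2021_01_01_01_01_01')
  LOCATION '/path/to/partitioned/parq/part_date=2021_01_01_01_01_01'
""")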

Write large data set around 100 GB having just one partition to hive using spark

I am trying to write a large dataset to a partitioned Hive table (partitioned by date) using Spark. The dataset results in just one date, so just one partition. It is taking a long time to write to the table, and it also causes shuffling while writing. My code does not contain any joins; it has just some map functions, filters and a union. How can I efficiently write this kind of data to a Hive table? See the image of the Spark UI here.
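Without the actual job it is hard to be prescriptive, but one knob that often matters in this situation is controlling the number of tasks and output files explicitly before the write. A minimal sketch under that assumption; the table and column names (staging.events_for_today, db.events, load_date) and the partition count are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("single-date-write")
  .enableHiveSupport()
  .getOrCreate()

// Only the date partition actually written is replaced, not the whole table
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// result of the map/filter/union pipeline, all rows belonging to a single date (assumed)
val result = spark.table("staging.events_for_today")

result
  .repartition(200)            // spread ~100 GB across many tasks/files instead of a few huge ones
  .write
  .mode("overwrite")
  .insertInto("db.events")     // db.events is assumed to already exist, partitioned by load_date,
                               // with columns in the same order as result (insertInto is positional)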

Apache Hive Add TIMESTAMP partition using alter table statement

I'm currently running MSCK REPAIR TABLE SCHEMA.TABLENAME for all my tables after data is loaded.
As the partitions grow, this statement is taking much longer (sometimes more than 5 minutes) for one table. I know it scans and parses through all partitions in S3 (where my data is) and then adds the latest partitions into the Hive metastore.
I want to replace MSCK REPAIR with ALTER TABLE ADD PARTITION statements. MSCK REPAIR works perfectly fine for adding the latest partitions, but I'm facing a problem with the TIMESTAMP value in the partition when using ALTER TABLE ADD PARTITION.
I have a table with four partition columns (part_dt STRING, part_src STRING, part_src_file STRING, part_ldts TIMESTAMP).
After running MSCK REPAIR, the SHOW PARTITIONS command gives me the output below:
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39
But when I drop the above partition from the metastore and recreate it using ALTER TABLE ADD PARTITION:
hive> alter table hub_cont add partition(part_dt='20181016',part_src='asfs',part_src_file='kjui',part_ldts='2019-05-02 06:30:39');
OK
Time taken: 1.595 seconds
hive> show partitions hub_cont;
OK
part_dt=20181016/part_src=asfs/part_src_file=kjui/part_ldts=2019-05-02 06%3A30%3A39.0
Time taken: 0.128 seconds, Fetched: 1 row(s)
It is adding .0 at the end of the timestamp value. When I query the table for this partition, it gives me 0 records.
Is there a way to add a partition that has a timestamp value without getting this zero added at the end? I'm unable to figure out what MSCK REPAIR is handling in this case that the ALTER TABLE statement is not.
The same happens if you insert dynamic partitions: new partitions are created with .0 because the default timestamp string representation format includes a milliseconds part. REPAIR TABLE finds new folders and adds the partitions to the metastore, and it also works correctly because a timestamp string without milliseconds is still compatible with the TIMESTAMP type.
The solution is to use STRING instead of TIMESTAMP and remove the milliseconds explicitly.
But first of all, double-check that you really have millions of rows in a single partition and really need timestamp-grain partitioning rather than DATE, and that this partition column is really significant (for example, if it is functionally dependent on another partition column such as part_src_file, you can get rid of it completely). Too many partitions will cause performance degradation.
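A sketch of that suggestion in Scala, assuming the hub_cont table from the question has been recreated with part_ldts declared as STRING; the other partition values are taken from the question, and formatting the value without a milliseconds part is the key point:

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("add-partition")
  .enableHiveSupport()
  .getOrCreate()

// Format the load timestamp explicitly without milliseconds
val partLdts = LocalDateTime.now()
  .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))   // e.g. "2019-05-02 06:30:39"

// With part_ldts declared as STRING in the DDL (per the suggestion above),
// the value is stored exactly as written, with no ".0" appended
spark.sql(s"""
  ALTER TABLE hub_cont ADD IF NOT EXISTS PARTITION (
    part_dt = '20181016',
    part_src = 'asfs',
    part_src_file = 'kjui',
    part_ldts = '$partLdts'
  )
""")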

How to store Spark data frame as a dynamic partitioned Hive table in Parquet format?

The current raw data is in Hive. I want to join several multi-terabyte partitioned Hive tables and then output the result as a partitioned Hive table in Parquet format.
I am considering loading all partitions of the Hive tables as Spark DataFrames and then doing the join, group by, etc. Is this the right way to do it?
Finally I will need to save the data: can we save a Spark DataFrame as a dynamically partitioned Hive table in Parquet format? How do we deal with the metadata?
If one of the several datasets is sufficiently smaller than the others, you may want to consider using a broadcast join for data transfer efficiency.
Depending on the nature of the data, you could try grouping by first and then joining, so that each machine only needs to process a specific subset of the data, reducing the amount of data transferred during the task run.
Hive supports storing data in Parquet format directly: https://cwiki.apache.org/confluence/display/Hive/Parquet. Have you given it a try?
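For the last part of the question (saving the result as a dynamically partitioned Hive table in Parquet), a minimal Scala sketch; the table names, the join key and the partition column (db.table_a, db.table_b, db.joined_result, join_key, part_date) are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("dynamic-partitioned-save")
  .enableHiveSupport()
  .getOrCreate()

val a = spark.table("db.table_a")
val b = spark.table("db.table_b")

// broadcast hint, as suggested above, if one side is small enough to fit in memory
val joined = a.join(broadcast(b), Seq("join_key"))

// partitionBy writes one sub-directory per distinct part_date value;
// saveAsTable records the table and its partitioning in the Hive metastore,
// which takes care of the metadata question
joined.write
  .format("parquet")
  .partitionBy("part_date")
  .mode("overwrite")
  .saveAsTable("db.joined_result")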
