Write a large data set (around 100 GB) that has just one partition to Hive using Spark - apache-spark

I am trying to write a large dataset to a partitioned Hive table (partitioned by date) using Spark. The dataset contains just one date, so it produces just one partition. The write to the table takes a long time, and it also causes shuffling. My code does not contain any joins; it has only some map functions, filters, and unions. How can I efficiently write this kind of data to a Hive table? The original post included an image of the Spark UI.

Related

Spark Job stuck writing dataframe to partitioned Delta table

I am running Databricks to read CSV files and then save them as a partitioned Delta table.
The file contains 179619219 records in total. It is being split on column A (8419 unique values), year (10 years), and month.
df.write.partitionBy("A","year","month").format("delta") \
.mode("append").save(path)
The job gets stuck on the write step and aborts after running for 5-6 hours.
This is a very bad partitioning scheme. Column A simply has too many unique values, and the additional partitioning columns create even more partitions. Spark will need to create on the order of 90k partitions from A and year alone (8419 × 10 = 84,190), and month can multiply that further. Each partition requires its own separate (small) files, and small files harm performance.
For non-Delta tables, partitioning is primarily used for data skipping when reading data. For Delta Lake tables, partitioning may matter less, because Delta on Databricks includes features such as data skipping, and you can apply ZOrder.
I would recommend a different partitioning scheme, for example year + month only, and running OPTIMIZE with ZOrder on column A after the data is written. This creates only a few partitions with bigger files.
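To make the numbers in this answer concrete, here is a rough sketch that multiplies the cardinalities quoted in the question and then applies the suggested year + month layout. The helper names and the `spark`, `df`, and `path` arguments are assumptions for illustration, the directory counts are upper bounds (not every combination has to occur in the data), and `OPTIMIZE ... ZORDER BY` is Databricks Delta SQL that may not exist in other Spark environments.

```python
from math import prod


def max_partition_dirs(cardinalities):
    """Upper bound on partition directories: product of the distinct counts."""
    return prod(cardinalities)


# Figures quoted in the question: 8419 values of A, 10 years, 12 months.
a_year = max_partition_dirs([8419, 10])            # A x year
a_year_month = max_partition_dirs([8419, 10, 12])  # A x year x month
year_month = max_partition_dirs([10, 12])          # suggested scheme
avg_rows = 179_619_219 / a_year_month              # average rows per directory


def write_by_year_month(spark, df, path):
    """Hypothetical helper: partition by year and month only, then Z-order on A.

    Assumes a Databricks runtime where Delta's OPTIMIZE command is available.
    """
    (df.write
       .partitionBy("year", "month")
       .format("delta")
       .mode("append")
       .save(path))
    # Compact small files and co-locate rows with similar values of A.
    spark.sql(f"OPTIMIZE delta.`{path}` ZORDER BY (A)")
```

With the original scheme the average directory would hold fewer than 200 rows, while year + month yields at most 120 directories.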

Spark DataFrame Repartition and Parquet Partition

1) I am using repartition on columns to store the data in Parquet, but I see that the number of Parquet partitioned files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?
2) When I write the data to a Parquet partition using RDD repartition and then read the data back from the Parquet partition, is there any condition under which the RDD partition numbers will be the same during read and write?
3) How is bucketing a dataframe using a column id different from repartitioning a dataframe via the same column id?
4) When considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?
You are asking about a couple of things here: partitioning, bucketing, and balancing of data.
Partitioning:
Partitioning data is often used to distribute load horizontally; this has performance benefits and helps organize data in a logical fashion.
Partitioning a table changes how the persisted data is structured: subdirectories are created that reflect the partitioning structure.
This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.
In Spark, this is done with df.write.partitionBy(column*), which groups data by the partitioning columns into the same subdirectory.
Bucketing:
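Bucketing is another technique for decomposing data sets into more manageable parts. Based on the columns provided, the entire data set is hashed into a user-defined number of buckets (files).
It is comparable to Hive's CLUSTERED BY ... INTO n BUCKETS, and hashes rows in the same spirit as Distribute By.
In Spark, this is done with df.write.bucketBy(n, column*), which groups rows by the hash of the bucketing columns into the same file. The number of buckets is controlled by n; each writing task can emit its own file per bucket, so the file count can exceed n.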
Repartition:
It returns a new DataFrame that is hash partitioned on the given partitioning expressions into the given number of in-memory partitions.
Spark manages data in these partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.
In Spark, this is done with df.repartition(n, column*), which groups data by the partitioning columns into the same in-memory partition. Note that no data is persisted to storage; this is just internal balancing of data based on constraints similar to bucketBy.
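The three operations described above can be put side by side in a short PySpark sketch. The helper names, the column arguments, the output path, and the table name are all hypothetical; the calls themselves (`partitionBy`, `bucketBy` with `saveAsTable`, `repartition`) are standard DataFrame and DataFrameWriter methods.

```python
def hive_partition_dir(partition_values):
    """Subdirectory that partitionBy creates for one combination of values."""
    return "/".join(f"{col}={val}" for col, val in partition_values.items())


example_dir = hive_partition_dir({"year": 2020, "month": 1})


def three_ways(df, out_path, table_name, part_col, key_col, n=8):
    """Hypothetical helper contrasting the three operations."""
    # 1. partitionBy: one subdirectory per distinct value of part_col,
    #    for example out_path/<part_col>=<value>/
    df.write.partitionBy(part_col).mode("overwrite").parquet(out_path)

    # 2. bucketBy: rows hashed on key_col into n buckets; it must be
    #    combined with saveAsTable rather than a path-based save.
    (df.write
       .bucketBy(n, key_col)
       .sortBy(key_col)
       .mode("overwrite")
       .saveAsTable(table_name))

    # 3. repartition: in-memory only; returns a new DataFrame, writes nothing.
    return df.repartition(n, key_col)
```

Here `example_dir` is the string `year=2020/month=1`, which is the directory naming that a two-column partitionBy produces.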
Tl;dr
1) I am using repartition on columns to store the data in Parquet, but I see that the number of Parquet partitioned files is not the same as the number of RDD partitions. Is there no correlation between RDD partitions and Parquet partitions?
repartition correlates with bucketBy, not partitionBy. The number of files written under partitionBy is governed by other settings such as spark.sql.shuffle.partitions and spark.default.parallelism, because they determine how many tasks hold rows for each partition value.
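2) When I write the data to a Parquet partition using RDD repartition and then read the data back from the Parquet partition, is there any condition under which the RDD partition numbers will be the same during read and write?
At read time, the number of partitions does not depend on how the data was repartitioned before the write. It is derived from the file sizes together with settings such as spark.sql.files.maxPartitionBytes and spark.default.parallelism.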
3) How is bucketing a dataframe using a column id different from repartitioning a dataframe via the same column id?
They work similarly, except that bucketing is a write operation and is used for persistence.
4) When considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?
repartition works on both datasets in memory. If one or both datasets are persisted, look into bucketBy as well.
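One way to picture why the file count differs from the in-memory partition count is a small plain-Python model (not Spark itself). It assumes that each writing task produces one file per distinct partition-column value it holds and ignores options such as maxRecordsPerFile; the function name and sample values are made up for illustration.

```python
from collections import defaultdict


def count_output_files(task_rows):
    """Count the files a partitionBy write would produce, per directory.

    task_rows[i] is the list of partition-column values held by in-memory
    partition i. Each task writes one file per distinct value it holds.
    """
    files_per_dir = defaultdict(int)
    for values in task_rows:
        for value in set(values):
            files_per_dir[value] += 1
    return dict(files_per_dir)


# Four tasks whose rows are mixed across two dates: every task touches both.
mixed = [["d1", "d2"], ["d1", "d2"], ["d1", "d2"], ["d1", "d2"]]
# After repartitioning on the date column, each date lives in a single task.
clustered = [["d1", "d1", "d1", "d1"], ["d2", "d2", "d2", "d2"], [], []]

mixed_files = count_output_files(mixed)
clustered_files = count_output_files(clustered)
```

In this model the mixed layout gives four files in each date directory, while the clustered layout gives one file per directory, even though both start from four in-memory partitions.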

Replace a hive partition from Spark

Is there a way to replace an existing Hive partition from a Spark program? I want to replace only the latest partition; the rest of the partitions should remain the same.
Below is the idea I am working on.
We get transactional data from our RDBMS systems coming into HDFS every minute. A Spark program (running every 5 or 10 minutes) reads the data, performs the ETL, and writes the output into a Hive table.
Since overwriting the entire Hive table would be a huge operation, we would like to overwrite only today's partition of the Hive table.
At the end of the day, the source and destination partitions would change to the next day.
Thanks in advance
Since you know the Hive table's location and the table is partitioned by date, append the current date to the location and overwrite that HDFS path:
df.write.format(source).mode("overwrite").save(path)
Once the write has completed, run MSCK REPAIR TABLE on the Hive table.
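A minimal sketch of that approach in PySpark could look like the following. The helper names, the warehouse path, and the `dt` column name are hypothetical, and the sketch assumes `df` already contains only the rows for the partition being replaced.

```python
def partition_path(table_location, column, value):
    """Directory of a single Hive partition, e.g. <location>/dt=2024-01-15."""
    return f"{table_location.rstrip('/')}/{column}={value}"


def overwrite_one_partition(spark, df, table, table_location, column, value,
                            fmt="parquet"):
    """Hypothetical helper: rewrite only one partition directory of a table."""
    target = partition_path(table_location, column, value)
    # The partition value is encoded in the directory name, so the column
    # itself is dropped from the data files.
    df.drop(column).write.format(fmt).mode("overwrite").save(target)
    # Register the directory in the metastore in case the partition is new.
    spark.sql(f"MSCK REPAIR TABLE {table}")


example = partition_path("/warehouse/db.db/events/", "dt", "2024-01-15")
```

Only the files under the one date directory are replaced; the other partitions of the table are not touched.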

Spark-Hive partitioning

The Hive table was created with 4 buckets.
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen in the Hive table, it ends up with 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4, as that makes the system very slow. I have also tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables using Spark SQL. When creating a bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema. I don't see that in the CREATE TABLE statement you posted. Assuming that it does exist and you know the clustering column, you could add the
.bucketBy(numBuckets, colName)
call when using the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
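As a hedged illustration of the bucketBy suggestion, the sketch below writes a Spark-bucketed table. The helper names and the table and column arguments are hypothetical, and the clustering column is an assumption because the posted CREATE TABLE statement does not name one. Two caveats: bucketBy works with saveAsTable (not insertInto), and the layout Spark produces is Spark's own bucketing format rather than Hive's. Repartitioning to 4 on the bucketing column keeps the file count down but also limits the write to 4 tasks, which may bring back the slowness mentioned in the question.

```python
def max_files_per_dir(num_tasks, num_buckets, aligned=False):
    """Upper bound on bucket files written into one partition directory.

    Each task can emit one file per bucket it holds. If the data was first
    repartitioned into num_buckets partitions on the bucketing column, each
    task is assumed to hold a single bucket.
    """
    return num_buckets if aligned else num_tasks * num_buckets


def write_bucketed(df, table_name, bucket_col, num_buckets=4):
    """Hypothetical helper: write a Spark-bucketed, partitioned ORC table."""
    (df.repartition(num_buckets, bucket_col)
       .write
       .partitionBy("traffic_date_hour")
       .bucketBy(num_buckets, bucket_col)
       .sortBy(bucket_col)
       .format("orc")
       .mode("overwrite")
       .saveAsTable(table_name))  # bucketBy requires saveAsTable


files_unaligned = max_files_per_dir(128, 4)
files_aligned = max_files_per_dir(4, 4, aligned=True)
```

With 128 tasks and no alignment the bound is 512 files per directory; with the aligned repartition it drops to 4.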

How to store Spark data frame as a dynamic partitioned Hive table in Parquet format?

The current raw data is in Hive. I want to join several partitioned Hive tables, each terabytes in size, and then output the result as a partitioned Hive table in Parquet format.
I am considering loading all partitions of the Hive tables as Spark dataframes and then doing the join, group by, and so on. Is this the right way to do it?
Finally, I will need to save the data. Can we save a Spark dataframe as a dynamically partitioned Hive table in Parquet format? How should the metadata be handled?
If one of the data sets is sufficiently smaller than the others, you may want to consider using a broadcast for data transfer efficiency.
Depending on the nature of the data, you could try the group by first and then the join, so that each machine only needs to process a specific set of data, which reduces the amount of data transferred during the task run.
Hive supports storing data in Parquet format directly: https://cwiki.apache.org/confluence/display/Hive/Parquet. Have you tried it?
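The broadcast suggestion and the partitioned Parquet output can be sketched together in PySpark as follows. The helper names and all arguments are hypothetical; the 10 MB constant mirrors the default value of spark.sql.autoBroadcastJoinThreshold, and whether a larger table can safely be broadcast depends on executor memory.

```python
AUTO_BROADCAST_DEFAULT = 10 * 1024 * 1024  # default autoBroadcastJoinThreshold


def small_enough_to_broadcast(size_bytes, threshold=AUTO_BROADCAST_DEFAULT):
    """Rule of thumb: broadcast only tables at or below the threshold."""
    return 0 <= size_bytes <= threshold


def join_and_save(big_df, small_df, key, partition_col, table_name):
    """Hypothetical helper: broadcast join, then a partitioned Parquet table."""
    from pyspark.sql.functions import broadcast  # lazy import; needs PySpark

    # Ship the small table to every executor instead of shuffling the big one.
    joined = big_df.join(broadcast(small_df), on=key, how="inner")
    (joined.write
           .format("parquet")
           .partitionBy(partition_col)
           .mode("overwrite")
           .saveAsTable(table_name))  # saveAsTable also records the metadata
    return joined
```

Because the write goes through saveAsTable, the table and its partitions are registered in the metastore as part of the same call.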
