How to merge partitions in HDFS? - apache-spark

Assuming I have a partitioned table in my HDFS, that gets new information all the time. New data will be partitioned by days by default, while all of the other files are partitioned by months. How can I merge partitions so by this example I would be able to merge all days partitions that came in the last month to be a month partition? Is there a way to repartition only some of the table’s partitions? I’d like to repartition only some of my partitions so only partitions that are small enough would be merged.
Also, does it even possible to merge partitions or should I try to read them, delete and write again to one partition? I'm thinking of something like concatenating the files.
I’d like to know what is the best way to merge only some partitions of a table.

Related

Selectively overwrite partitions in Delta Live Table pipeline

I have relatively big table and it is overwritten in DLT pipeline. It is partitioned by date and in most cases I change small portion of data (connected to last couple of partitions). Is it possible to selectively overwrite only specified partitions?

Performance of pyspark + hive when a table has many partition columns

I am trying to understand the performance impact on the partitioning scheme when Spark is used to query a hive table. As an example:
Table 1 has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2 has 1 partition column
date=20210101/...data...
Anecdotally I have found that queries on the second type of table are faster, but I don't know why, and I don't why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of query pruning.
The above is meant as an example query to demonstrate what I am trying to understand. But in case details are important
This is using s3 not HDFS
The data in the table is very small, and there are not a large number of partitons
The time for running the query on the first table is ~2 minutes, and ~10 seconds on the second
Data is stored as parquet
Except all other factors which you did not mention: storage type, configuration, cluster capacity, the number of files in each case, your partitioning schema does not correspond to the use-case.
Partitioning schema should be chosen based on how the data will be selected or how the data will be written or both. In your case partitioning by year, month, day separately is over-partitioning. Partitions in Hive are hierarchical folders and all of them should be traversed (even if using metadata only) to determine the data path, in case of single date partition, only one directory level is being read. Two additional folders: year+month+day instead of date do not help with partition pruning because all columns are related and used together always in the where.
Also, partition pruning probably does not work at all with 3 partition columns and predicate like this: where date = concat(year, month, day)
Use EXPLAIN and check it and compare with predicate like this where year='some year' and month='some month' and day='some day'
If you have one more column in the WHERE clause in the most of your queries, say category, which does not correlate with date and the data is big, then additional partition by it makes sense, you will benefit from partition pruning then.

Spark Job stuck writing dataframe to partitioned Delta table

Running databricks to read csv files and then saving as a partitioned delta table.
Total records in file are 179619219 . It is being split on COL A (8419 unique values) and Year ( 10 Years) and Month.
df.write.partitionBy("A","year","month").format("delta") \
.mode("append").save(path)
Job gets stuck on the write step and aborts after running for 5-6 hours
This is very bad partitioning schema. You simply have too many unique values for column A, and additional partitioning is creating even more partitions. Spark will need to create at least 90k partitions, and this will require creation a separate files (small), etc. And small files are harming the performance.
For non-Delta tables, partitioning is primarily used to perform data skipping when reading data. But for Delta lake tables, partitioning may not be so important, as Delta on Databricks includes things like data skipping, you can apply ZOrder, etc.
I would recommend to use different partitioning schema, for example, year + month only, and do OPTIMIZE with ZOrder on A column after the data is written. This will lead to creation of only few partitions with bigger files.

Saving Spark DataFrame to Parquet partitioned by Date

I have huge dataframe that has several columns, one of which is callDate(DateType). I want to save that dataframe to parquet on S3 and partition it by this call_date column. This will be initial load for our project(containing history data), and afterwards in production, after a day finishes it should add up new partition and not delete older ones.
In a case when I omit .partitionBy method, job finishes in 12 minutes. Action example:
allDataDF.write.mode("overwrite").parquet(resultPath)
On the other hand when I do this:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
allDataDF.write.mode("overwrite").partitionBy("call_date").parquet(resultPath)
job doesn't finish in 30 minutes. I am not doing any repartition before partitionBy, so I guessed speed should be somewhat similar, as every executor should save it's own partition to the specific date? What am I missing here?

Spark DataFrame Repartition and Parquet Partition

I am using repartition on columns to store the data in parquet. But
I see that the no. of parquet partitioned files are not same with the
no. of Rdd partitions. Is there no correlation between rdd partitions
and parquet partitions?
When I write the data to parquet partition and I use Rdd
repartition and then I read the data from parquet partition , is
there any condition when the rdd partition numbers will be same
during read / write?
How is bucketing a dataframe using a column id and repartitioning a
dataframe via the same column id different?
While considering the performance of joins in Spark should we be
looking at bucketing or repartitioning (or maybe both)
Couple of things here that you;re asking - Partitioning, Bucketing and Balancing of data,
Partitioning:
Partitioning data is often used for distributing load horizontally, this has performance benefit, and helps in organizing data in a logical fashion.
Partitioning tables changes how persisted data is structured and will now create subdirectories reflecting this partitioning structure.
This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.
In Spark, this is done by df.write.partitionedBy(column*) and groups data by partitioning columns into same sub directory.
Bucketing:
Bucketing is another technique for decomposing data sets into more manageable parts. Based on columns provided, the entire data is hashed into a user-defined number of buckets (files).
Synonymous to Hive's Distribute By
In Spark, this is done by df.write.bucketBy(n, column*) and groups data by partitioning columns into same file. number of files generated is controlled by n
Repartition:
It returns a new DataFrame balanced evenly based on given partitioning expressions into given number of internal files. The resulting DataFrame is hash partitioned.
Spark manages data on these partitions that helps parallelize distributed data processing with minimal network traffic for sending data between executors.
In Spark, this is done by df.repartition(n, column*) and groups data by partitioning columns into same internal partition file. Note that no data is persisted to storage, this is just internal balancing of data based on constraints similar to bucketBy
Tl;dr
1) I am using repartition on columns to store the data in parquet. But I see that the no. of parquet partitioned files are not same with the no. of Rdd partitions. Is there no correlation between rdd partitions and parquet partitions?
repartition has correlation to bucketBy not partitionedBy. partitioned files is governed by other configs like spark.sql.shuffle.partitions and spark.default.parallelism
2) When I write the data to parquet partition and I use Rdd repartition and then I read the data from parquet partition , is there any condition when the rdd partition numbers will be same during read / write?
during read time, the number of partitions will be equal to spark.default.parallelism
3) How is bucketing a dataframe using a column id and repartitioning a dataframe via the same column id different?
Working similar, except, bucketing is a write operation and is used for persistence.
4) While considering the performance of joins in Spark should we be looking at bucketing or repartitioning (or maybe both)
repartition of both datasets are in memory, if one or both the datasets are persisted, then look into bucketBy also.

Resources