How to ensure output dataframe is dynamically partitioned such that each partition is around 128mb? - apache-spark

In Spark, I have a few jobs chained together (i.e. the output of one is the input to the next). The issue I am facing: say my input dataset to the first job is 10GB today, and before writing the output I repartition it with a hardcoded number (coalesce or repartition) so that each partition is around 128mb. As the data grows, that hardcoded number obviously produces larger partitions, and within a few months the downstream jobs start to slow down because of the larger partition sizes.
One way I tried to keep partitions around 128mb is by dividing the total dataset size (row count) by a static number. That is, if a dataset with 1M rows is 500mb, I estimate that roughly 250k rows come to about 128mb, so the number of partitions I need is df.count() / 250,000.
That sort of works, but is there a better or more straightforward way to accomplish this without significantly affecting the performance of the job?
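For reference, a rough sketch of what I do today (the numbers are estimates for my data, and output_path is a placeholder):

rows_per_128mb = 250_000                 # ~1M rows is ~500mb, so ~250k rows is ~128mb

num_partitions = max(1, df.count() // rows_per_128mb)   # note: count() runs an extra job
df.repartition(num_partitions).write.parquet(output_path)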

Related

How to specify file size using repartition() in spark

I'm using pyspark and I have a large data source that I want to repartition, specifying the file size per partition explicitly.
I know using the repartition(500) function will split my parquet into 500 files with almost equal sizes.
The problem is that new data gets added to this data source every day. On some days there might be a large input, and on some days there might be smaller inputs. So when looking at the partition file size distribution over a period of time, it varies between 200KB to 700KB per file.
I was thinking of specifying the max size per partition so that I get more or less the same file size per file per day irrespective of the number of files.
This will help me when running my job on this large dataset later on to avoid skewed executor times and shuffle times etc.
Is there a way to specify it using the repartition() function or while writing the dataframe to parquet?
You could consider writing your result with the parameter maxRecordsPerFile.
storage_location = "..."  # target output path (elided)
estimated_records_with_desired_size = 2000

result_df.write \
    .option("maxRecordsPerFile", estimated_records_with_desired_size) \
    .parquet(storage_location, compression="snappy")
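If you need to derive estimated_records_with_desired_size, one rough approach (a sketch; the sample fraction, scratch path and 128 MB target are assumptions, and it relies on the internal _jvm/_jsc accessors) is to write a small sample, measure its size on disk, and scale:

sample_path = "/tmp/size_probe"            # hypothetical scratch location
target_file_bytes = 128 * 1024 * 1024      # desired output file size

sample_df = result_df.sample(fraction=0.01, seed=42)
sample_count = sample_df.count()
sample_df.coalesce(1).write.mode("overwrite").parquet(sample_path)

# Measure the bytes actually written via the JVM Hadoop FileSystem API.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
sample_bytes = fs.getContentSummary(jvm.org.apache.hadoop.fs.Path(sample_path)).getLength()

bytes_per_record = sample_bytes / max(sample_count, 1)
estimated_records_with_desired_size = int(target_file_bytes / bytes_per_record)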

Resolving small file issue in pyspark

I am reading from a partitioned table that has close to 4 billion records.
The files that I am reading from is my source, and I have no control over it to alter the records.
While reading the files through dataframes, for each partition I am creating 2000 files of size less than 2KB. This is because the shuffle partitions are set to 2000 to increase execution speed.
Approach followed to resolve this issue:
I have looped over the HDFS paths of the table after its execution completed, and created a list of data paths [/dv/hdfs/..../table_name/partition_value=01, /dv/hdfs/..../table_name/partition_value=02, ...]
For each such path, I have calculated the disk usage and block size from the cluster and derived the appropriate number of partitions as
no_of_partitions = ceil(disk_usage / block_size), and then written the data into another location with the same partition_id, such as [/dv/hdfs/..../table2_name/partition_value=01]. A rough sketch of that per-path calculation is shown after this question.
Now, though this works in growing the small files from 2KB to an average block size of 82 MB, it takes about 2.5 minutes per partition. With 256 such partitions, it takes more than 10 hours to finish the execution.
Kindly suggest any other method by which this could be achieved in less than 2 hours.
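Roughly, the per-path calculation described above looks like this (partition_paths and get_disk_usage are placeholders for my own lookups, and the block size is an assumption):

import math

block_size = 128 * 1024 * 1024                 # cluster block size in bytes (assumption)

for path in partition_paths:                   # e.g. .../table_name/partition_value=01
    disk_usage = get_disk_usage(path)          # hypothetical helper returning bytes used
    num_partitions = math.ceil(disk_usage / block_size)
    (spark.read.parquet(path)
          .repartition(num_partitions)
          .write.mode("overwrite")
          .parquet(path.replace("table_name", "table2_name")))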
Although you have 2000 shuffle partitions, you can and should control the output files.
Generating small files in Spark is itself a performance degradation for subsequent read operations.
Now, to control the small files issue you can do the following:
while writing the dataframe to HDFS, repartition it on the partition column and control the number of output files per partition:
df.repartition(partition_col).write.option("maxRecordsPerFile", 100000).partitionBy(partition_col).parquet(path)
This will generate files of at most 100000 records each within every partition, solving your small files issue. This will improve the overall read and write performance of your job.
Hope it helps.

Spark coalescing on the number of objects in each partition

We are starting to experiment with spark on our team.
After we do a reduce job in Spark, we would like to write the result to S3; however, we would like to avoid collecting the Spark result.
For now, we are writing the files from Spark's foreachPartition on the RDD, but this results in a lot of small files. We would like to aggregate the data into a few files, partitioned by the number of objects written to each file.
So for example, our total data is 1M objects (this is constant); we would like to produce files of 400K objects each, while our current partitions produce files of around 20K objects (this varies a lot per job). Ideally we want to produce 3 files containing 400K, 400K and 200K objects, instead of 50 files of 20K objects each.
Does anyone have a good suggestion?
My thought process is to let each partition decide which file it should write to, by assuming that each partition will produce roughly the same number of objects.
So for example, partition 0 will write to the first file, while partition 21 will write to the second file, since its assumed starting object index is 20,000 * 21 = 420,000, which is bigger than the file size limit.
Partition 41 will write to the third file, since its starting index is bigger than 2 * the file size limit.
This will not always result in the exact 400K file size limit, though; it is more of an approximation.
I understand that there is coalesce, but as I understand it, coalesce reduces the number of partitions based on the desired partition count. What I want is to coalesce the data based on the number of objects in each partition; is there a good way to do that?
What you want to do is repartition the data into three partitions; the data will be split into approximately 333K records per partition. The split will be approximate; it will not be exactly 333,333 records per partition. I do not know of a way to get the exact 400K/400K/200K split you want.
If you have a DataFrame `df`, you can repartition it into n partitions as
df.repartition(n)
Since you want a maximum number of records per partition, I would recommend this (you don't specify Scala or pyspark, so I'm going with Scala; you can do the same in pyspark):
val maxRecordsPerPartition = ???
val numPartitions = (df.count() / maxRecordsPerPartition).toInt + 1

df
  .repartition(numPartitions)
  .write
  .format("json")
  .save("/path/file_name.json")
This will guarantee that each of your partitions holds fewer than maxRecordsPerPartition records.
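If you happen to be on pyspark, the equivalent sketch (the cap and the output path are placeholders) would be:

max_records_per_partition = 400_000        # desired cap per output partition

num_partitions = df.count() // max_records_per_partition + 1
(df.repartition(num_partitions)
   .write
   .format("json")
   .save("/path/file_name.json"))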
We have decided to just go with the number of files being generated, and to make sure that each file contains fewer than 1 million line items.

Is it possible to coalesce Spark partitions "evenly"?

Suppose we have a PySpark dataframe with data spread evenly across 2048 partitions, and we want to coalesce to 32 partitions to write the data back to HDFS. Using coalesce is nice for this because it does not require an expensive shuffle.
But one of the downsides of coalesce is that it typically results in an uneven distribution of data across the new partitions. I assume that this is because the original partition IDs are hashed to the new partition ID space, and the number of collisions is random.
However, in principle it should be possible to coalesce evenly, so that the first 64 partitions from the original dataframe are sent to the first partition of the new dataframe, the next 64 are sent to the second partition, and so on, resulting in an even distribution of partitions. The resulting dataframe would often be more suitable for further computations.
Is this possible, while preventing a shuffle?
I can force the relationship I would like between initial and final partitions using a trick like in this question, but Spark doesn't know that everything from each original partition is going to a particular new partition. Thus it can't optimize away the shuffle, and it runs much slower than coalesce.
In your case you can safely coalesce the 2048 partitions into 32 and assume that Spark is going to evenly assign the upstream partitions to the coalesced ones (64 for each in your case).
Here is an extract from the Scaladoc of RDD#coalesce:
This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.
Consider also that the way your partitions are physically spread across the cluster influences how coalescing happens. The following is an extract from CoalescedRDD's ScalaDoc:
If there is no locality information (no preferredLocations) in the parent, then the coalescing is very simple: chunk parents that are close in the Array in chunks.
If there is locality information, it proceeds to pack them with the following four goals:
(1) Balance the groups so they roughly have the same number of parent partitions
(2) Achieve locality per partition, i.e. find one machine which most parent partitions prefer
(3) Be efficient, i.e. O(n) algorithm for n parent partitions (problem is likely NP-hard)
(4) Balance preferred machines, i.e. avoid as much as possible picking the same preferred machine
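To verify how evenly coalesce actually distributed your data, a quick sketch (assuming a PySpark DataFrame df and 32 target partitions, as in your case):

from pyspark.sql.functions import spark_partition_id

coalesced = df.coalesce(32)

# Count the rows that land in each of the 32 coalesced partitions
# (the groupBy here is only for inspection and triggers its own shuffle).
(coalesced
    .groupBy(spark_partition_id().alias("partition_id"))
    .count()
    .orderBy("partition_id")
    .show(32))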

Maximum size of rows in Spark jobs using Avro/Parquet

I am planning to use Spark to process data where each individual element/row in the RDD or DataFrame may occasionally be large (up to several GB).
The data will probably be stored in Avro files in HDFS.
Obviously, each executor must have enough RAM to hold one of these "fat rows" in memory, and some to spare.
But are there other limitations on row size for Spark/HDFS or for the common serialisation formats (Avro, Parquet, Sequence File...)? For example, can individual entries/rows in these formats be much larger than the HDFS block size?
I am aware of published limitations for HBase and Cassandra, but not Spark...
There are currently some fundamental limitations related to block size, both for partitions in use and for shuffle blocks - both are limited to 2GB, which is the maximum size of a ByteBuffer (because it takes an int index, so is limited to Integer.MAX_VALUE bytes).
The maximum size of an individual row will normally need to be much smaller than the maximum block size, because each partition will normally contain many rows, and the largest rows might not be evenly distributed among partitions - if by chance a partition contains an unusually large number of big rows, this may push it over the 2GB limit, crashing the job.
See:
Why does Spark RDD partition has 2GB limit for HDFS?
Related Jira tickets for these Spark issues:
https://issues.apache.org/jira/browse/SPARK-1476
https://issues.apache.org/jira/browse/SPARK-5928
https://issues.apache.org/jira/browse/SPARK-6235
