How to repartition spark dataframe into smaller partitions - apache-spark

I have a dataframe which is partitioned by date.
In normal processing, I am processing a week of data at a time, so this means I have 7 partitions. I would like to increase this number of partitions, but without having to shuffle data or have a mix of dates in the same partition.
I've tried using df.repartition(20, my_date_column), but this just results in 13 empty partitions since the hash partitioner will only get 7 distinct values.
I've also tried using df.repartition(20, my_date_column, unique_id), which does increase the number of partitions to 20, but it means that dates are mixed within the partitions.
Is what I'm trying to do possible?

Perhaps you can increase the number of partitions by setting spark.sql.files.maxPartitionBytes to a smaller value. According to the tuning guide:
Property Name: spark.sql.files.maxPartitionBytes
Default: 134217728 (128 MB)
Meaning: The maximum number of bytes to pack into a single partition when reading files.
This configuration is effective only when using file-based sources
such as Parquet, JSON and ORC.
Since Version: 2.0.0
Alternatively, you can try spark.sql.files.minPartitionNum but it only means to suggest, not guarantee, the number of partitions.

Related

Spark group by Key and partitioning the data

I have a large csv file with data in following format.
cityId1,name,address,.......,zip
cityId2,name,address,.......,zip
cityId1,name,address,.......,zip
........
cityIdN,name,address,.......,zip
I am performing following operation on the above csv file:
Group by cityId as key and list of resources as value
df1.groupBy($"cityId").agg(collect_list(struct(cols.head, cols.tail: _*)) as "resources")
Change it to jsonRDD
val jsonDataRdd2 = df2.toJSON.rdd
Iterate through each Partition and upload to s3 per key
I can not use dataframe partitionby write because of business logic constraints (how other services read from S3 )
My Questions:
What is the default size of a spark partition?
Let's say default size of partition is X MBs and there is one large record present in the dataFrame with key having Y MBs of data (Y > X) , what would happen in this scenario?
Do I need to worry about having the same key in different partitions in that case?
In answer to your questions:
When reading from secondary storage (S3, HDFS) the partitions are equal to block size of file system, 128MB or 256MB; but you can repartition RDDs immediately, not Data Frames. (For JDBC and Spark Structured Streaming the partitions are dynamic in size.)
When applying 'wide transformations' and re-partitioning the number and size of partitions most likely change. The size of a given partition has a maximum value. In Spark 2.4.x the partition size increased to 8GB. So, if any transformation (e.g. collect_list in combination with groupBy) gens more than this maximum size, you will get an error and the program aborts. So you need to partition wisely or in your case have sufficient number of partitions for aggregation - see spark.sql.shuffle.partitions parameter.
The parallel model for processing by Spark relies on 'keys' being allocated via hash, range partitioning, etc. being distributed to one and only one partition - shuffling. So, iterating through a partition foreachPartition, mapPartitions there is no issue.

spark behavior on hive partitioned table

I use Spark 2.
Actually I am not the one executing the queries so I cannot include query plans. I have been asked this question by the data science team.
We are having hive table partitioned into 2000 partitions and stored in parquet format. When this respective table is used in spark, there are exactly 2000 tasks that are executed among the executors. But we have a block size of 256 MB and we are expecting the (total size/256) number of partitions which will be much lesser than 2000 for sure. Is there any internal logic that spark uses physical structure of data to create partitions. Any reference/help would be greatly appreciated.
UPDATE: It is the other way around. Actually our table is very huge like 3 TB having 2000 partitions. 3TB/256MB would actually come to 11720 but we are having exactly same number of partitions as the table is partitioned physically. I just want to understand how the tasks are generated on data volume.
In general Hive partitions are not mapped 1:1 to Spark partitions. 1 Hive partition can be split into multiple Spark partitions, and one Spark partition can hold multiple hive-partitions.
The number of Spark partitions when you load a hive-table depends on the parameters:
spark.files.maxPartitionBytes (default 128MB)
spark.files.openCostInBytes (default 4MB)
You can check the partitions e.g. using
spark.table(yourtable).rdd.partitions
This will give you an Array of FilePartitions which contain the physical path of your files.
Why you got exactly 2000 Spark partitions from your 2000 hive partitions seems a coincidence to me, in my experience this is very unlikely to happen. Note that the situation in spark 1.6 was different, there the number of spark partitions resembled the number of files on the filesystem (1 spark partition for 1 file, unless the file was very large)
I just want to understand how the tasks are generated on data volume.
Tasks are a runtime artifact and their number is exactly the number of partitions.
The number of tasks does not correlate to data volume in any way. It's a Spark developer's responsibility to have enough partitions to hold the data.

spark write to disk with N files less than N partitions

Can we write data to say 100 files, with 10 partitions in each file?
I know we can use repartition or coalesce to reduce number of partition. But I have seen some hadoop generated avro data with much more partitions than number of files.
The number of files that get written out is controlled by the parallelization of your DataFrame or RDD. So if your data is split across 10 Spark partitions you cannot write fewer than 10 files without reducing partitioning (e.g. coalesce or repartition).
Now, having said that when data is read back in it could be split into smaller chunks based on your configured split size but depending on format and/or compression.
If instead you want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduces a maxRecordsPerFile option when you write data out. With this you can limit the number of records that get written per file in each partition. The other option of course would be to repartition.
The following will result in 2 files being written out even though it's only got 1 partition:
val df = spark.range(100).coalesce(1)
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")

Empty Files in output spark

I am writing my dataframe like below
df.write().format("com.databricks.spark.avro").save("path");
However I am getting around 200 files where around 30-40 files are empty.I can understand that it might be due to empty partitions. I then updated my code like
df.coalesce(50).write().format("com.databricks.spark.avro").save("path");
But I feel it might impact performance. Is there any other better approach to limit number of output files and remove empty files
You can remove the empty partitions in your RDD before writing by using repartition method.
The default partition is 200.
The suggested number of partition is number of partitions = number of cores * 4
repartition your dataframe using this method. To eliminate skew and ensure even distribution of data choose column(s) in your dataframe with high cardinality (having unique number of values in the columns) for the partitionExprs argument to ensure even distribution.
As default no. of RDD partitions is 200; you have to do shuffle to remove skewed partitions.
You can either use repartition method on the RDD; or make use of DISTRIBUTE BY clause on dataframe - which will repartition along with distributing data among partitions evenly.
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns dataset instance with proper partitions.
You may use repartitionAndSortWithinPartitions - which can improve compression ratio.

Maximum size of rows in Spark jobs using Avro/Parquet

I am planning to use Spark to process data where each individual element/row in the RDD or DataFrame may occasionally be large (up to several GB).
The data will probably be stored in Avro files in HDFS.
Obviously, each executor must have enough RAM to hold one of these "fat rows" in memory, and some to spare.
But are there other limitations on row size for Spark/HDFS or for the common serialisation formats (Avro, Parquet, Sequence File...)? For example, can individual entries/rows in these formats be much larger than the HDFS block size?
I am aware of published limitations for HBase and Cassandra, but not Spark...
There are currently some fundamental limitations related to block size, both for partitions in use and for shuffle blocks - both are limited to 2GB, which is the maximum size of a ByteBuffer (because it takes an int index, so is limited to Integer.MAX_VALUE bytes).
The maximum size of an individual row will normally need to be much smaller than the maximum block size, because each partition will normally contain many rows, and the largest rows might not be evenly distributed among partitions - if by chance a partition contains an unusually large number of big rows, this may push it over the 2GB limit, crashing the job.
See:
Why does Spark RDD partition has 2GB limit for HDFS?
Related Jira tickets for these Spark issues:
https://issues.apache.org/jira/browse/SPARK-1476
https://issues.apache.org/jira/browse/SPARK-5928
https://issues.apache.org/jira/browse/SPARK-6235

Resources