Why does Spark create fewer partitions than the number of files when reading from S3? - apache-spark

I'm using Spark 2.3.1.
I have a job that reads 5,000 small Parquet files from S3.
When I do a mapPartitions followed by a collect, only 278 tasks are used (I would have expected 5,000). Why?

Spark is grouping multiple files into each partition due to their small size. You should see as much when you print out the partitions.
Example (Scala):
val df = spark.read.parquet("/path/to/files")
df.rdd.partitions.foreach(println)
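To illustrate why the grouping yields far fewer partitions than files, here is a rough Python model of how Spark's file source packs small files into read partitions. It is a simplified sketch, not Spark's actual code. It assumes the default settings spark.sql.files.maxPartitionBytes = 128 MB and spark.sql.files.openCostInBytes = 4 MB, a hypothetical default parallelism of 8, and that no file is large enough to need splitting. The function name pack_files and the 100 KB file size are made up for illustration.

```python
MB = 1024 * 1024


def pack_files(file_sizes, max_partition_bytes=128 * MB,
               open_cost=4 * MB, default_parallelism=8):
    """Greedily pack files into read partitions (simplified model)."""
    # Every file is "charged" an extra open cost when estimating the work.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))

    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition once the next file would overflow it.
        if current and current_size + size > max_split:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost
    if current:
        partitions.append(current)
    return partitions


# 5,000 files of 100 KB each (the file size is an assumption).
parts = pack_files([100 * 1024] * 5000)
print(len(parts), "partitions for 5000 files")
```

The exact count depends on the real file sizes and the cluster's parallelism, so this model will not reproduce the 278 tasks from the question, but it shows the mechanism: many files share one partition until the size budget is reached.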

If you want 5,000 tasks, you can apply a repartition transformation.
Quote from the docs about repartition:
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them. This always shuffles all data
over the network.
I recommend you take a look at the RDD Programming Guide. Remember that shuffle is an expensive operation.

Related

How to control number of files generated while setting large partitions in spark?

Because of the large amount of input data, I set a large number of shuffle partitions in Spark (spark.sql.shuffle.partitions=1000). However, the output is small (~1 GB), yet the job creates lots of small files (3000 files, each smaller than 1 MB). How can I combine these small files into one big file?
Another question: why is the number of output files 3 times the number of shuffle partitions?
As per the Spark docs, the spark.sql.shuffle.partitions parameter "configures the number of partitions to use when shuffling data for joins or aggregations". To control the number of output files, use the repartition() method before writing the output. So something like this:
df
  .filter(...) // some transformations
  .join(...)
  .repartition(1) // move data into a single partition
  .write
  .format(...)
  .save(...)
The snippet above would result in a single output file.
You are not limited to repartitioning your data once. You can repartition as often as you need, but bear in mind that this is a costly operation:
df
  .filter(...) // some transformations
  .repartition(...) // repartition to improve join performance
  .join(...)
  .repartition(1) // move data into a single partition
  .write
  .format(...)
  .save(...)
If you want a good explanation of how repartition works, here is a great answer:
Spark - repartition() vs coalesce()
For more information on how to improve the performance of the joins, refer to the Spark docs:
https://spark.apache.org/docs/latest/sql-performance-tuning.html#join-strategy-hints-for-sql-queries
Since you have a large number of partitions, you may need to call coalesce on your DataFrame. coalesce decreases the number of partitions.
val df_res = df.coalesce(10)
This should decrease the number of output files from 1000 to just 10. Alternatively, you can use coalesce(1) to create one big file.
coalesce merges existing partitions and minimizes shuffled data, so the resulting partitions may differ in size.
The number of output files is equal to the number of partitions at write time. The spark.sql.shuffle.partitions property only sets the number of partitions used when shuffling data for joins or aggregations.
You can call df.repartition() on your DataFrame to increase or decrease the number of partitions.
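One way to choose the argument for coalesce() or repartition() is to derive it from the expected output size and a target file size. The helper below is a minimal sketch of that arithmetic; the name partitions_for_target_size and the 128 MB target are assumptions for illustration, not values from the answers above.

```python
MB = 1024 * 1024


def partitions_for_target_size(total_bytes, target_file_bytes):
    """Number of partitions needed so each output file is about target size."""
    # Integer ceiling division, with a floor of one partition.
    return max(1, -(-total_bytes // target_file_bytes))


# Roughly 1 GB of output, aiming for files of about 128 MB (assumed target).
n = partitions_for_target_size(1024 * MB, 128 * MB)
print(n)  # pass this value to df.coalesce(n) or df.repartition(n)
```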

Why is spark dataframe repartition faster than coalesce when reducing number of partitions?

I have a df with 100 partitions, and before saving to HDFS as .parquet I want to reduce the number of partitions because the parquet files would be too small (<1MB).
I've added coalesce before writing:
df.coalesce(3).write.mode("append").parquet(OUTPUT_LOC)
It works but slows down the process from 2-3s per file to 10-20s per file.
When I try repartition:
df.repartition(3).write.mode("append").parquet(OUTPUT_LOC)
The process does not slow down at all, 2-3s per file.
Why? Shouldn't coalesce always be faster when reducing the number of partitions because it avoids a full shuffle?
Background:
I'm importing files from local storage to spark cluster and saving the resulting dataframes as a parquet file. Each file is approx 100-200MB.
Files are located on the "spark-driver" machine, I'm running spark-submit in client deploy mode.
I'm reading files one by one in driver:
data = read_lines(file_name)
rdd = sc.parallelize(data,100)
rdd2 = rdd.flatMap(lambda j: myfunc(j))
df = rdd2.toDF(mySchema)
df.repartition(3).write.mode("append").parquet(OUTPUT_LOC)
Spark version is 3.1.1
Spark/HDFS cluster has 5 workers with 8CPU,32GB RAM
Each executor has 4cores and 15GB RAM, that makes 10 executors total.
EDIT:
When I use coalesce(1) I get spark.rpc.message.maxSize limit breached error, but not when I use repartition(1). Could that be a clue?
Attaching DAG visualizations. It looks like the WholeStageCodegen part is taking too long in the coalesce DAGs?
This can happen when your data is not evenly distributed. coalesce reduces the partition count by combining existing partitions in order to avoid a full shuffle, so skew can remain in one of the resulting partitions, and that single partition then takes most of the time.
With repartition, the data is distributed almost evenly across all partitions because it does a full shuffle, so all tasks complete in roughly the same time.
You can use the Spark UI to check what happens at the task level when you use coalesce, and whether a single task runs much longer than the others.
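The skew argument can be illustrated with a small Python model. This is a sketch under simplifying assumptions: the real coalesce groups partitions with data locality in mind rather than strictly by adjacency, and the real repartition spreads rows round-robin, which is only approximately even. The function names and the row counts are invented for the example.

```python
def coalesce_sizes(partition_sizes, n):
    """Merge adjacent existing partitions into n groups; rows never move
    between groups, so skew in the input carries over."""
    groups = [0] * n
    count = len(partition_sizes)
    for i, size in enumerate(partition_sizes):
        groups[i * n // count] += size
    return groups


def repartition_sizes(partition_sizes, n):
    """Full shuffle: spread all rows as evenly as possible over n partitions."""
    base, extra = divmod(sum(partition_sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]


# 100 input partitions: 10 heavy ones followed by 90 light ones (assumed data).
sizes = [1000] * 10 + [10] * 90
print("coalesce(3):   ", coalesce_sizes(sizes, 3))
print("repartition(3):", repartition_sizes(sizes, 3))
```

In this model the slowest coalesce task handles almost all the rows, while the repartition tasks each handle about a third.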

Splitting spark data into partitions and writing those partitions to disk in parallel

Problem outline: Say I have 300+ GB of data being processed with spark on an EMR cluster in AWS. This data has three attributes used to partition on the filesystem for use in Hive: date, hour, and (let's say) anotherAttr. I want to write this data to a fs in such a way that minimizes the number of files written.
What I'm doing right now is getting the distinct combinations of date, hour, anotherAttr, and a count of how many rows make up each combination. I collect them into a List on the driver, and iterate over the list, building a new DataFrame for each combination, repartitioning that DataFrame using the number of rows to guesstimate file size, and writing the files to disk with DataFrameWriter, .orc finishing it off.
We aren't using Parquet for organizational reasons.
This method works reasonably well, and solves the problem that downstream teams using Hive instead of Spark don't see performance issues resulting from a high number of files. For example, if I take the whole 300 GB DataFrame, do a repartition with 1000 partitions (in spark) and the relevant columns, and dumped it to disk, it all dumps in parallel, and finishes in ~9 min with the whole thing. But that gets up to 1000 files for the larger partitions, and that destroys Hive performance. Or it destroys some kind of performance, honestly not 100% sure what. I've just been asked to keep the file count as low as possible. With the method I'm using, I can keep the files to whatever size I want (relatively close anyway), but there is no parallelism and it takes ~45 min to run, mostly waiting on file writes.
It seems to me that since there's a 1-to-1 relationship between some source row and some destination row, and that since I can organize the data into non-overlapping "folders" (partitions for Hive), I should be able to organize my code/DataFrames in such a way that I can ask spark to write all the destination files in parallel. Does anyone have suggestions for how to attack this?
Things I've tested that did not work:
Using a scala parallel collection to kick off the writes. Whatever spark was doing with the DataFrames, it didn't separate out the tasks very well and some machines were getting massive garbage collection problems.
DataFrame.map - I tried to map across a DataFrame of the unique combinations, and kickoff writes from inside there, but there's no access to the DataFrame of the data that I actually need from within that map - the DataFrame reference is null on the executor.
DataFrame.mapPartitions - a non-starter, couldn't come up with any ideas for doing what I want from inside mapPartitions
The word 'partition' is also not especially helpful here because it refers both to the concept of spark splitting up the data by some criteria, and to the way that the data will be organized on disk for Hive. I think I was pretty clear in the usages above. So if I'm imagining a perfect solution to this problem, it's that I can create one DataFrame that has 1000 partitions based on the three attributes for fast querying, then from that create another collection of DataFrames, each one having exactly one unique combination of those attributes, repartitioned (in spark, but for Hive) with the number of partitions appropriate to the size of the data it contains. Most of the DataFrames will have 1 partition, a few will have up to 10. The files should be ~3 GB, and our EMR cluster has more RAM than that for each executor, so we shouldn't see a performance hit from these "large" partitions.
Once that list of DataFrames is created and each one is repartitioned, I could ask spark to write them all to disk in parallel.
Is something like this possible in spark?
One thing I'm conceptually unclear on: say I have
val x = spark.sql("select * from source")
and
val y = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr")
and
val z = x.where(s"date=$date and hour=$hour and anotherAttr=$anotherAttr2")
To what extent is y a different DataFrame than z? If I repartition y, what effect does the shuffle have on z, and on x for that matter?
We had (almost) the same problem and we ended up working directly with RDDs (instead of DataFrames) and implementing our own partitioning mechanism (by extending org.apache.spark.Partitioner).
Details: we are reading JSON messages from Kafka. The JSON should be grouped by customerid/date/more fields and written in Hadoop using Parquet format, without creating too many small files.
The steps are (simplified version):
a) Read the messages from Kafka and transform them to a structure of RDD[(GroupBy, Message)]. GroupBy is a case class containing all the fields that are used for grouping.
b) Use reduceByKeyLocally to obtain a map of metrics (number of messages, message sizes, etc.) for each group, e.g. Map[GroupBy, GroupByMetrics].
c) Create a GroupPartitioner that uses the previously collected metrics (and some input parameters such as the desired Parquet size) to compute how many partitions should be created for each GroupBy object. Basically we are extending org.apache.spark.Partitioner and overriding numPartitions and getPartition(key: Any).
d) Partition the RDD from a) using the previously defined partitioner: newPartitionedRdd = rdd.partitionBy(ourCustomGroupByPartitioner)
e) Invoke spark.sparkContext.runJob with two parameters: the first is the RDD partitioned in d), the second is a custom function (func: (TaskContext, Iterator[T]) => U) that writes the messages taken from the Iterator[T] to Hadoop in Parquet format.
Let's say that we have 100 million messages, grouped as follows:
Group1 - 2 mil
Group2 - 80 mil
Group3 - 18 mil
and we decided that we have to use 1.5 mil messages per partition to obtain Parquet files greater than 500MB. We'll end up with 2 partitions for Group1, 54 for Group2, 12 for Group3.
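The partition counts in this example can be reproduced with a small Python model of such a partitioner. This is only a sketch of the idea, not the answer's actual Scala implementation: the class name, the method names, and especially the use of an integer message_id to spread one group's messages over its partitions are assumptions, since the answer does not describe that detail.

```python
class GroupPartitioner:
    """Assign each group a contiguous range of partitions sized by volume."""

    def __init__(self, group_counts, messages_per_partition):
        self.offsets = {}  # first partition index of each group
        self.sizes = {}    # number of partitions reserved for each group
        next_index = 0
        for group, count in group_counts.items():
            # Integer ceiling division, at least one partition per group.
            n = max(1, -(-count // messages_per_partition))
            self.offsets[group] = next_index
            self.sizes[group] = n
            next_index += n
        self.num_partitions = next_index

    def get_partition(self, group, message_id):
        # Hypothetical spreading rule: rotate within the group's own range.
        return self.offsets[group] + message_id % self.sizes[group]


counts = {"Group1": 2_000_000, "Group2": 80_000_000, "Group3": 18_000_000}
partitioner = GroupPartitioner(counts, 1_500_000)
print(partitioner.sizes, partitioner.num_partitions)
```

Because every group owns its own range of partition indices, no output file ever mixes messages from two groups.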
This statement:
I collect them into a List on the driver, and iterate over the list,
building a new DataFrame for each combination, repartitioning that
DataFrame using the number of rows to guestimate file size, and
writing the files to disk with DataFrameWriter, .orc finishing it off.
is completely off-beam where Spark is concerned. Collecting to the driver is never a good approach: large volumes risk OOM errors, and the latency of your approach is high.
Use the following instead to simplify the job and benefit from Spark's parallelism, saving time and money for your boss:
df.repartition(cols...)...write.partitionBy(cols...)...
The shuffle occurs in repartition; partitionBy never shuffles.
It is that simple, with Spark's default parallelism utilized.

DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this:
df.coalesce(1)
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")
I've tested this and it doesn't seem to perform well. This is because there is only one partition to work on in the dataset and all the partitioning, compression and saving of files has to be done by one CPU core.
I could rewrite this to do the partitioning manually (using filter with the distinct partition values for example) before calling coalesce.
But is there a better way to do this using the standard Spark SQL API?
I had the exact same problem and I found a way to do this using DataFrame.repartition(). The problem with using coalesce(1) is that your parallelism drops to 1, and it can be slow at best and error out at worst. Increasing that number doesn't help either -- if you do coalesce(10) you get more parallelism, but end up with 10 files per partition.
To get one file per partition without using coalesce(), use repartition() with the same columns you want the output to be partitioned by. So in your case, do this:
import spark.implicits._
df
  .repartition($"entity", $"year", $"month", $"day", $"status")
  .write
  .partitionBy("entity", "year", "month", "day", "status")
  .mode(SaveMode.Append)
  .parquet(s"$location")
Once I do that I get one parquet file per output partition, instead of multiple files.
I tested this in Python, but I assume in Scala it should be the same.
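Why repartitioning by the partitionBy columns gives one file per output partition can be shown with a small Python model. It assumes, as a simplification, that each task writes exactly one file for every distinct partition value it holds. The names count_output_files, round_robin and by_key, the three keys, and the 10 tasks are invented for the illustration.

```python
import zlib
from collections import defaultdict

NUM_TASKS = 10


def count_output_files(rows, assign_task):
    """Count files per partition value, given how rows are assigned to tasks."""
    seen = {(assign_task(i, key), key) for i, key in enumerate(rows)}
    files = defaultdict(int)
    for _task, key in seen:
        files[key] += 1
    return dict(files)


def round_robin(index, key):
    # Rows land on tasks regardless of their key (no repartition by column).
    return index % NUM_TASKS


def by_key(index, key):
    # Models repartition by the partitionBy columns: same key, same task.
    return zlib.crc32(key.encode()) % NUM_TASKS


rows = ["a", "b", "c"] * 100
print("without repartition:", count_output_files(rows, round_robin))
print("repartition by key: ", count_output_files(rows, by_key))
```

Without the repartition, every task sees every key and writes its own file for it. With it, each key lives in a single task, so each output directory receives a single file.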
By definition:
coalesce(numPartitions: Int): DataFrame
Returns a new DataFrame that has exactly numPartitions partitions.
You can use it to decrease the number of partitions in the RDD/DataFrame with the numPartitions parameter. It's useful for running operations more efficiently after filtering down a large dataset.
Concerning your code, it doesn't perform well because what you are actually doing is:
putting everything into 1 partition, which means a single task on a single executor core has to process and write all the data (this is also not good practice)
although coalesce avoids a full shuffle, collapsing to 1 partition still moves all the data across the network to that one task, which may also result in performance loss.
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
The shuffle concept is very important to manage and understand. It's always preferable to shuffle the minimum possible because it is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
Concerning partitioning parquet, I suggest that you read the answer here about Spark DataFrames with Parquet Partitioning and also this section in the Spark Programming Guide for Performance Tuning.
I hope this helps!
It isn't much on top of @mortada's solution, but here's a little abstraction that ensures you are using the same partitioning to repartition and write, and demonstrates sorting as well:
from datetime import datetime

def one_file_per_partition(df, path, partitions, sort_within_partitions=None, VERBOSE=False):
    start = datetime.now()
    # Repartition by the same columns used in partitionBy below.
    out = df.repartition(*partitions)
    if sort_within_partitions:  # sorting is optional
        out = out.sortWithinPartitions(*sort_within_partitions)
    (out.write.partitionBy(*partitions)
        # TODO: Format of your choosing here
        .mode("append").parquet(path)
        # or, e.g.:
        # .option("compression", "gzip").option("header", "true").mode("overwrite").csv(path)
    )
    if VERBOSE:
        print(f"Wrote data partitioned by {partitions} and sorted by {sort_within_partitions} to:" +
              f"\n {path}\n Time taken: {(datetime.now() - start).total_seconds():,.2f} seconds")
Usage:
one_file_per_partition(df, location, ["entity", "year", "month", "day", "status"])

How to split the input file in Apache Spark

Suppose I have an input file of size 100 MB. It contains a large number of points (lat-long pairs) in CSV format. What should I do to split the input file into ten 10 MB files in Apache Spark, or how do I customize the split?
Note: I want to process a subset of the points in each mapper.
Spark's abstraction doesn't provide an explicit split of the data. However, you can control the parallelism in several ways.
Assuming you use YARN and HDFS, the file is automatically split into HDFS blocks, and those blocks are processed concurrently when a Spark action runs.
Apart from HDFS parallelism, consider using a partitioner with a PairRDD. A PairRDD is an RDD of key-value pairs, and a partitioner manages the mapping from a key to a partition. The default partitioner reads spark.default.parallelism. The partitioner helps control the distribution of data, as well as its locality, in PairRDD-specific operations such as reduceByKey.
Take a look at the following documentation about Spark data parallelism.
http://spark.apache.org/docs/1.2.0/tuning.html
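To make the block-based splitting concrete, here is a minimal Python sketch of how a file's byte range can be cut into fixed-size splits. It is a simplified model rather than Hadoop's or Spark's actual split logic: real input formats also align splits to record (line) boundaries and may merge a small trailing chunk into the previous split. The function name split_ranges is made up for the example.

```python
MB = 1024 * 1024


def split_ranges(file_size, split_size):
    """Return (start, end) byte offsets of consecutive splits of a file."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


# A 100 MB file cut into 10 MB splits, as in the question.
ranges = split_ranges(100 * MB, 10 * MB)
print(len(ranges), "splits; first:", ranges[0], "last:", ranges[-1])
```

In Spark itself you would not compute these ranges by hand. Passing a minimum partition count to textFile, or calling repartition afterwards, is the usual way to influence how many pieces are processed in parallel.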
After searching through the Spark API, I found the partitions method, which returns the list of partitions of the JavaRDD; its size is the number of partitions. When creating the JavaRDD, we repartitioned it to the desired number of partitions, as suggested by @Nick Chammas.
JavaRDD<String> lines = ctx.textFile("/home/hduser/Spark_programs/file.txt").repartition(5);
List<Partition> partitions = lines.partitions();
System.out.println(partitions.size());
