Spark - In this case, when does repartition occur? - apache-spark

I need to write exactly one file per prefix, so the code looks like this:
ds.repartition(1).write.partitionBy("prefix").mode(SaveMode.Overwrite).csv(output)
Before I added repartition, each prefix had thousands of files and the job finished in 2 hours. After adding repartition, each prefix has 1 file, but the job runs for more than 7 hours. At what stage is repartition executed? Is this a reasonable way to use it?

repartition always performs a full shuffle and distributes the data as evenly as possible across the requested number of partitions.
In your case, ds.repartition(1) shuffles all the data into a single partition on one worker node.
When the write runs, only that one executor performs it, after splitting the rows by prefix. Because a single task is doing all the work, it takes a long time.
A few things you could take into consideration:
If there is no real reason to have only one CSV file, try to avoid it.
Instead of repartition(1), use coalesce(1), which merges the existing partitions without the full shuffle that repartition(1) performs.
By saving a single CSV file, you are not using Spark's parallelism.

If you want to use prefix as the partition column, then you need to run
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
You can also use coalesce(1) instead of repartition(1), because in this case coalesce does not shuffle while repartition does. With only one partition, a single task has to process all the data, which is why the job takes 7 hours.
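A pattern that is often suggested for this situation, though neither answer above mentions it, is to repartition by the partition column rather than down to a single partition. The sketch below assumes the Dataset has a column named prefix, as in the question; the object and method names are made up for illustration. Because repartition(col("prefix")) hash-partitions on prefix, all rows for a given prefix should land in the same task, so each prefix directory would normally receive one file while different prefixes are still written in parallel.

```scala
import org.apache.spark.sql.{Dataset, SaveMode}
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of the original question or answers.
object OneFilePerPrefix {
  def write[T](ds: Dataset[T], output: String): Unit = {
    ds.repartition(col("prefix"))   // shuffle so that each prefix lives in exactly one partition
      .write
      .partitionBy("prefix")        // one output directory per prefix value
      .mode(SaveMode.Overwrite)
      .csv(output)
  }
}
```

The number of shuffle partitions comes from spark.sql.shuffle.partitions, and a heavily skewed prefix would still be written by a single task. Settings such as maxRecordsPerFile could also split one prefix into more than one file.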

Related

Spark SQL output multiple small files

We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored in many small files, each 800 KB to 1.5 MB in size. Because of this, the job is split into many tasks and takes a long time to complete.
We have tried Spark tuning options such as broadcast joins, changing the partition size, and changing the maximum records per file, but these methods gave no performance improvement and did not fix the issue. Using coalesce makes the job get stuck at that stage with no progress.
Spark UI metrics screenshot: https://i.stack.imgur.com/FfyYy.png
The Spark UI confirms your report of too many small files. You get one file for every Spark partition, and you have 33,479 partitions in the final stage, where the output is written. 33k partitions was probably the right number for your join, but not the right number for your write.
You need to add another stage to your job after the join. That second stage needs to reduce the number of Spark partitions to a reasonable number (one that produces files of roughly 32 MB to 128 MB).
Something like a coalesce or a repartition would do it. Maybe even a sort :(
You want to target ~350 partitions.
This diagram shows what you want to do manually or automatically (with spark on Databricks)
If you're using Databricks, this is easy: with Delta Lake you can turn on Auto Optimize.
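The ~350 figure can be sanity-checked with rough arithmetic: divide the estimated output size by the desired file size. The following is a minimal sketch in plain Scala; the object and function names are hypothetical, and the average file size of 1150 KB is an assumption taken from the middle of the reported 800 KB to 1.5 MB range.

```scala
object OutputSizing {
  private val KB = 1024L
  private val MB = 1024L * KB

  /** Partitions needed so that each output file is about targetFileBytes (rounded up, at least 1). */
  def targetPartitions(totalBytes: Long, targetFileBytes: Long): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(targetFileBytes > 0, "targetFileBytes must be positive")
    val n = (totalBytes + targetFileBytes - 1) / targetFileBytes // ceiling division
    math.max(1L, n).toInt
  }

  def main(args: Array[String]): Unit = {
    // Assumption: 33,479 files averaging about 1150 KB each.
    val estimatedOutputBytes = 33479L * 1150L * KB
    // Try both ends of the 32 MB to 128 MB range suggested above.
    Seq(32L, 128L).foreach { sizeMb =>
      val partitions = targetPartitions(estimatedOutputBytes, sizeMb * MB)
      println(s"$sizeMb MB files -> $partitions partitions")
    }
  }
}
```

With these assumed inputs the output is roughly 37 GB, which works out to around 300 partitions at 128 MB per file, the same order of magnitude as the ~350 suggested. The resulting number would then be passed to coalesce or repartition before the write.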

Difference between repartition(1) and coalesce(1)

In our project, we use repartition(1) to write data into a table. I would like to know why coalesce(1) cannot be used here, since repartition is a costly operation compared to coalesce.
I know repartition distributes data evenly across partitions, but when the output is a single part file, why can't we use coalesce(1)?
coalesce has an issue where if you're calling it using a number smaller than your current number of executors, the number of executors used to process that step will be limited by the number you passed in to the coalesce function.
The repartition function avoids this issue by shuffling the data. In any scenario where you're reducing the data down to a single partition (or really, less than half your number of executors), you should almost always use repartition over coalesce because of this. The shuffle caused by repartition is a small price to pay compared to the single-threaded operation of a call to coalesce(1)
You state nothing else in terms of logic.
coalesce uses existing partitions to minimize shuffling. In the case of coalesce(1) versus repartition(1) the difference may not be a big deal, but as a guiding principle, repartition creates new partitions and hence does a full shuffle, whereas coalesce minimizes the amount of shuffling.
In my spare time I chanced upon this excellent article: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908 Look for the quote: "Coalesce sounds useful in some cases, but has some problems."

Number of files generated by a Spark Job

I want to monitor the number of files that spark generates, and maybe raise an exception if it is generating a lot of files. Is there any way to see this?
Well, it depends on how you do the write. Assuming you are writing the contents of a DataFrame or RDD, the easiest way is to check the number of partitions in the final DataFrame/RDD. Basically, each partition is written as a separate file.
Assuming you are using Scala, this gives you the number of partitions:
df.rdd.getNumPartitions
Instead of raising an exception and causing the job to fail, I would suggest using the coalesce function to reduce the DataFrame to a partition count that suits your needs. For example, if the output is not too large (1 GB or less), I use coalesce(1) and write only 1 file.
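Both options, failing fast or capping the partition count, can be sketched as small helpers around df.rdd.getNumPartitions. The object name, method names, and the idea of a maxPartitions limit are assumptions made for illustration, not an existing Spark API.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helpers; only getNumPartitions and coalesce are Spark calls.
object FileCountGuard {
  /**
   * Returns df unchanged if its partition count is within the limit,
   * otherwise throws IllegalArgumentException (raised by require).
   */
  def requireAtMostPartitions(df: DataFrame, maxPartitions: Int): DataFrame = {
    val n = df.rdd.getNumPartitions
    require(n <= maxPartitions, s"DataFrame has $n partitions, limit is $maxPartitions")
    df
  }

  /** Alternative from the answer above: reduce the partition count instead of failing. */
  def capPartitions(df: DataFrame, maxPartitions: Int): DataFrame =
    if (df.rdd.getNumPartitions > maxPartitions) df.coalesce(maxPartitions) else df
}
```

A call such as FileCountGuard.capPartitions(df, 200).write.csv(path) would then bound the file count for a plain write. The partition count is only an approximation of the file count: with partitionBy, each task can write one file per partition value it holds, so the total can be higher.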

How to process data in parallel but write results in a single file in Spark

I have a Spark job that:
Reads data from hdfs
Does some intensive transformation without shuffling and aggregation (only map operations)
Writes results back to hdfs
Let's say I have 10 GB of raw data (40 blocks = 40 input partitions), which results in 100 MB of processed data. To avoid generating many small files in HDFS, I use a "coalesce(1)" statement to write a single file with the results.
Doing so, I get only 1 task running (because of "coalesce(1)" and the absence of shuffling), which processes all 10 GB in a single thread.
Is there a way to do the actual intensive processing in 40 parallel tasks, reduce the number of partitions right before writing to disk, and avoid a data shuffle?
I have an idea that might work: cache the DataFrame in memory after all processing (do a count to force Spark to cache the data), then apply "coalesce(1)" and write the DataFrame to disk.
The documentation clearly warns about this behavior and provides the solution:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
So instead of
coalesce(1)
you can try
repartition(1)
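Put together, the suggestion might look like the sketch below. It assumes the map-only work can be expressed as a DataFrame => DataFrame function; the object name, the transform parameter, and the choice of CSV output are placeholders that do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical wrapper illustrating where repartition(1) sits in the pipeline.
object SingleFileWrite {
  def transformThenWrite(input: DataFrame,
                         transform: DataFrame => DataFrame,
                         outputPath: String): Unit = {
    transform(input)      // runs with the input partitioning, e.g. 40 parallel tasks
      .repartition(1)     // shuffle boundary: only the processed rows move to one partition
      .write
      .mode(SaveMode.Overwrite)
      .csv(outputPath)    // the final stage writes a single part file
  }
}
```

The shuffle here moves only the processed output (about 100 MB in the question), not the 10 GB of input, which is why the extra step tends to be cheap compared with running the whole transformation in one task.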

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X number of partitions (numAPartitions), and I union that to DataFrame 'B' (dfB) which has Y number of partitions (numBPartitions), then what will the resultant unioned DataFrame (unionedDF) look like, with respect to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems like it's very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resultant unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
// repartition returns a new DataFrame, so keep the result of the whole chain
val unionedDF : DataFrame = dfA.unionAll(dfB)
  .repartition(optimalNumberOfPartitions)
  .persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells Spark not to evict the DataFrame from memory, so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, partitions are important for Spark.
I am wondering if you could find that out yourself by calling:
unionedDF.rdd.getNumPartitions
Do I have to persist, post union?
In general, you should persist/cache an RDD (no matter if it is the result of a union, or a potato :) ) if you are going to use it multiple times. Doing so prevents Spark from recomputing or re-reading it each time, and can increase the performance of your application by 15% in some cases!
If, on the other hand, you are going to use the resulting RDD just once, it is safe not to persist it.
Do I have to repartition?
Since you don't care about finding the number of partitions, you can read in my "memoryOverhead issue in Spark" about how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data each executor processes at a time.
Recall that a worker can host multiple executors. You can think of the worker as a machine/node of your cluster, and of the executor as a process (executing in a core) that runs on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely about Spark: when you handle big data, you don't want unnecessary things sitting in memory, since that threatens the stability of your application.
A DataFrame can be stored in temporary files that Spark creates for you, and is loaded into the memory of your application only when needed.
For more read: Should I always cache my RDD's and DataFrames?
A union just adds up the number of partitions in DataFrame 1 and DataFrame 2. Both DataFrames must have the same number of columns, in the same order, for the union to work. So don't worry if the two DataFrames are partitioned on different columns: there will be at most m + n partitions.
You don't need to repartition your DataFrame after the union. My suggestion is to use coalesce in place of repartition: coalesce merges small partitions together and avoids or reduces shuffling data between partitions.
Caching or persisting the DataFrame after every union will reduce performance, and cache/persist does not break the lineage. Under heavy memory pressure, cached data can be evicted, and recomputing it adds to the computation time, although possibly only the evicted part has to be recomputed.
Spark transformations are lazy: unionAll is a lazy operation, and so are coalesce and repartition, so they only take effect when the first action runs. Try coalescing the unionAll result at intervals, for example after every 8 unions, to reduce the number of partitions in the resulting DataFrame. Use checkpoints to break the lineage and store the data if your solution has many memory-intensive operations.
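The claim that a union adds up the partition counts can be checked on a small local example. This is a sketch only: the local[4] master, the application name, and the partition counts 3 and 5 are arbitrary choices, and it uses union, of which unionAll is an alias in current Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object UnionPartitions {
  def main(args: Array[String]): Unit = {
    // Local session purely for experimentation.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("union-partitions")
      .getOrCreate()

    // Two DataFrames with explicitly chosen partition counts.
    val dfA = spark.range(0L, 1000L, 1L, 3).toDF("id")
    val dfB = spark.range(0L, 1000L, 1L, 5).toDF("id")

    val unioned = dfA.union(dfB)
    println(s"A: ${dfA.rdd.getNumPartitions}, B: ${dfB.rdd.getNumPartitions}, " +
      s"union: ${unioned.rdd.getNumPartitions}")

    // Reduce the partition count without a full shuffle, as suggested above.
    val reduced = unioned.coalesce(4)
    println(s"after coalesce(4): ${reduced.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

In the common case the union should report the sum of the two inputs, which is why a long chain of unions can accumulate a large number of small partitions unless it is coalesced from time to time.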
