Spark memory usage - apache-spark

I have read the Spark documentation and I would like to be sure I am doing the right thing.
https://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks
Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join,
etc) build a hash table within each task to perform the grouping,
which can often be large.
How does this fit with the "input file split size"? My understanding is that a lot of tasks would create a lot of small files.
Should I repartition the data to a smaller number of partitions after a shuffle operation?

Related

Improve Spark denormalization/partition performance

I have a denormalization use case - one hive avro fact table to join with 14 smaller dimension tables and produce a denormalized parquet output table. Both the input fact table and output table are partitioned in the same way (Category=TEST1, YearMonthId=202101). And I do run historical processing, which means processing and loading several months for a given category at once.
I am using Spark 2.4.0 with PySpark DataFrames, broadcast joins for all the table joins, dynamic partition inserts, and coalesce at the end to control the number of output files. (I see a shuffle at the last stage, probably because of the dynamic partition inserts.)
I would like to know which optimizations are possible with respect to managing partitions, for example keeping the partitioning consistent from the input to the output stage so that no shuffle is involved. I want to leverage the fact that the input and output tables are partitioned by the same columns.
I am also thinking about using static partition writes: determine the partitions up front and write to them in parallel. Would this speed things up or avoid the shuffle?
Appreciate any help that would lead me in the right direction.
A couple of options that I tried and that improved performance (both run time and avoiding small files):
1. Use repartition (instead of coalesce) on the DataFrame before doing the broadcast joins, which minimized the shuffle and hence the shuffle spill:
   repartition(count, *PartitionColumnList, AnyOtherSaltingColumn) (add a salting column if the repartitioning is not even)
2. Make sure that the base tables are properly compacted. This might even eliminate the need for #1 in some cases, and it reduces the number of tasks, which lowers the task-scheduling overhead.
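To illustrate why a salting column can even out a repartition, here is a toy model in plain Python (not Spark code). It assigns rows to partitions by hashing the partition columns, with and without a salt. The function name, the crc32 hash, the salt scheme, and the row counts are all assumptions made for illustration; Spark uses its own hash partitioner.

```python
import zlib
from collections import Counter

def partition_sizes(rows, num_partitions, use_salt, salt_buckets=8):
    """Count rows per partition when hashing on the key columns (plus an optional salt)."""
    sizes = Counter()
    for i, (category, year_month) in enumerate(rows):
        key = f"{category}|{year_month}"
        if use_salt:
            # Hypothetical salt: spreads the rows of one hot key over several buckets.
            key += f"|{i % salt_buckets}"
        sizes[zlib.crc32(key.encode()) % num_partitions] += 1
    return sizes

# One hot partition value (90,000 rows) and one small one (10,000 rows).
rows = [("TEST1", 202101)] * 90_000 + [("TEST1", 202102)] * 10_000

plain = partition_sizes(rows, num_partitions=16, use_salt=False)
salted = partition_sizes(rows, num_partitions=16, use_salt=True)
largest_plain = max(plain.values())
largest_salted = max(salted.values())
print("largest partition without salt:", largest_plain)
print("largest partition with salt:   ", largest_salted)
```

Without the salt, every row of the hot key lands in the same partition; with the salt, the same rows are split over several partitions, at the cost of more (smaller) output files per table partition.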

Is it more efficient to cache a DataFrame in one partition or in more partitions?

I'm persisting a DataFrame, and in the Spark UI I see that this DataFrame is partitioned across my 7 nodes.
My Spark job has transformations with wide dependencies.
Could it be more performant to force the cache into only 1 partition,
to avoid the shuffle?
Thanks
There is a balance to strike between the number of partitions and the concurrency you get from them. Dare I say it, you are a little off-beam here. Meaning:
Too much partitioning makes no sense: it creates too much overhead.
Just one partition would require a coalesce or repartition and would give up the parallel processing that Spark offers to get the job done quicker, e.g. many workers loading supermarket shelves in parallel is faster than just you and me doing it on our own.
The truth is somewhere in between: at scale, the number of partitions needs to be estimated and trialled. Shuffling can rarely be avoided unless you base the partitioning on what you read in from HDFS/a Hadoop source (e.g. Kudu), from S3, or from JDBC.

Why does Spark create fewer partitions than the number of files when reading from S3?

I'm using Spark 2.3.1.
I have a job that reads 5,000 small Parquet files from S3.
When I do a mapPartitions followed by a collect, only 278 tasks are used (I would have expected 5,000). Why?
Spark is grouping multiple files into each partition due to their small size. You should see as much when you print out the partitions.
Example (Scala):
val df = spark.read.parquet("/path/to/files")
df.rdd.partitions.foreach(println)
If you want to use 5,000 tasks, you could apply a repartition transformation.
Quote from the docs about repartition:
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them. This always shuffles all data
over the network.
I recommend you take a look at the RDD Programming Guide. Remember that shuffle is an expensive operation.
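As a rough illustration of why small files end up grouped, here is a simplified plain-Python model of the packing. It assumes the default values of spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB), and it ignores the splitting of large files. The file size (100 KB) and the parallelism (16) are made-up inputs, so the result will not necessarily match the 278 tasks from the question.

```python
def pack_files(file_sizes, default_parallelism,
               max_partition_bytes=128 * 1024 * 1024,
               open_cost_bytes=4 * 1024 * 1024):
    """Group files into read partitions; each file is charged its size plus an open cost."""
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    max_split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition once the next file would not fit.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current_size += size + open_cost_bytes
        current.append(size)
    if current:
        partitions.append(current)
    return partitions

# Assumed input: 5,000 files of 100 KB each, read with a default parallelism of 16.
partitions = pack_files([100 * 1024] * 5000, default_parallelism=16)
print("files:", 5000, "partitions:", len(partitions))
```

In this model, lowering the two size settings gives more, smaller read partitions without the full shuffle that repartition triggers.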

Which is faster in Spark, collect() or toLocalIterator()?

I have a spark application in which I need to get the data from executors to driver and I am using collect(). However, I also came across toLocalIterator(). As far as I have read about toLocalIterator() on Internet, it returns an iterator rather than sending whole RDD instantly, so it has better memory performance, but what about speed? How is the performance between collect() and toLocalIterator() when it comes to execution/computation time?
The answer to this question depends on what you do after calling df.collect() or df.rdd.toLocalIterator(). For example, suppose you are processing a considerably big file of about 7M rows and, after doing all the required transformations, you need to iterate over each record in the DataFrame and make service calls in batches of 100.
With df.collect(), the entire set of records is sent to the driver at once, so the driver needs an enormous amount of memory. toLocalIterator(), by contrast, returns an iterator that fetches one partition at a time, so the driver only needs enough memory to hold the largest partition. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will be expensive, whereas toLocalIterator() will not, and it will be faster and more reliable as well.
On the other hand, if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster.
Also, if your file is so small that Spark's default partitioning logic does not break it down into partitions at all, then df.collect() will be faster.
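The "service calls in batches of 100" pattern can be sketched as follows. The helper consumes any iterator lazily, so with Spark it could be fed df.rdd.toLocalIterator(); here a plain generator stands in for it so the snippet runs without a cluster. The names batches and send_batch are hypothetical, and send_batch is only a placeholder for the real service call.

```python
from itertools import islice

def batches(records, batch_size=100):
    """Yield lists of up to batch_size records without materializing the whole input."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def send_batch(batch):
    # Hypothetical stand-in for the service call; it only reports the batch size.
    return len(batch)

# With Spark this would be df.rdd.toLocalIterator(); a generator stands in here.
records = (i for i in range(250))
sent = [send_batch(batch) for batch in batches(records, batch_size=100)]
print(sent)
```

Only one batch is held in memory by this helper at a time; how much the driver holds overall still depends on the size of the partition being iterated.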
To quote from the documentation on toLocalIterator():
This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
It means that in the worst case scenario (no caching at all) it can be n-partitions times more expensive than collect. Even if data is cached, the overhead of starting multiple Spark jobs can be significant on large datasets. However lower memory footprint can partially compensate that, depending on a particular configuration.
Overall, both methods are inefficient and should be avoided on large datasets.
As for toLocalIterator, it is used to collect the data of an RDD that is scattered around your cluster onto a single node, the one from which the program is running, and to do something with all the data on that node. It is similar to the collect method, but instead of returning a List it returns an Iterator.
So, after applying a function to an RDD using foreach you can call toLocalIterator to get an iterator to all the contents of the RDD and process it. However, bear in mind that if your RDD is very big, you may have memory issues. If you want to transform it to an RDD again after doing the operations you need, use the SparkContext to parallelize it.

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X number of partitions (numAPartitions), and I union that to DataFrame 'B' (dfB) which has Y number of partitions (numBPartitions), then what will the resultant unioned DataFrame (unionedDF) look like, with respect to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems like it's very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resultant unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
val unionedDF : DataFrame = dfA.unionAll(dfB)
unionedDF.repartition(optimalNumberOfPartitions).persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells Spark not to evict the DataFrame from memory, so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, partitions are important for Spark.
I am wondering if you could find that out yourself by calling:
yourResultedRDD.getNumPartitions()
Do I have to persist, post union?
In general, you have to persist/cache an RDD (no matter if it is the result of a union, or a potato :) ) if you are going to use it multiple times. Doing so prevents Spark from recomputing it each time it is used and can increase the performance of your application by 15% in some cases!
For example, if you are going to use the resulting RDD just once, it is safe not to persist it.
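The effect of persisting a result that is used more than once can be shown with a toy model in plain Python. LazyDataset is a hypothetical stand-in for an RDD, not a Spark class: its compute function reruns on every action unless persist() was called first.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputed on every action unless persisted."""

    def __init__(self, compute):
        self._compute = compute
        self._persist = False
        self._cached = None
        self.computations = 0

    def persist(self):
        self._persist = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result

def expensive_union():
    # Placeholder for the work needed to build the unioned data.
    return list(range(5)) + list(range(5, 10))

uncached = LazyDataset(expensive_union)
uncached.collect()
uncached.collect()

cached = LazyDataset(expensive_union).persist()
cached.collect()
cached.collect()

print("computations without persist:", uncached.computations)
print("computations with persist:   ", cached.computations)
```

If the dataset were collected only once, both variants would do the same amount of work, which is why persisting a result that is used a single time buys nothing.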
Do I have to repartition?
Since you don't want to look up the number of partitions yourself, you can read in my question "memoryOverhead issue in Spark"
about how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data every executor will process.
Recall that a worker can host multiple executors: think of the worker as a machine/node of your cluster and of the executor as a process (executing in a core) that runs on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely about Spark: when you handle big data you don't want unnecessary things lying around in memory, since this threatens the stability of your application.
A DataFrame can be stored in temporary files that spark creates for you, and is loaded in the memory of your application only when needed.
For more read: Should I always cache my RDD's and DataFrames?
Union just adds up the number of partitions in DataFrame 1 and DataFrame 2. Both DataFrames must have the same number of columns, in the same order, for the union to work. So no worries: even if the partition columns differ between the two DataFrames, there will be at most m + n partitions.
You don't need to repartition your DataFrame after the union. My suggestion is to use coalesce in place of repartition: coalesce merges existing partitions (for example, several small ones) and so avoids or reduces shuffling data between partitions.
If you cache/persist the DataFrame after each union, you will reduce performance, and cache/persist does not break the lineage. During a memory-intensive operation, cached data may be evicted from memory, and recomputing it increases computation time, although possibly only the evicted part has to be recomputed.
Spark transformations are lazy: unionAll is a lazy operation, and coalesce/repartition are lazy too, so they only take effect at the first action. So try to coalesce the unionAll result at intervals, for example after every 8 unions, to reduce the number of partitions in the resulting DataFrame. Use checkpoints to break the lineage and store the data if there are lots of memory-intensive operations in your solution.
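The partition arithmetic described above can be sketched with a toy model in plain Python, where a "DataFrame" is just a list of partitions. The functions union, coalesce, and make_df are hypothetical stand-ins, not Spark APIs. In particular, Spark's real coalesce also takes data locality into account when choosing which partitions to merge.

```python
def union(a, b):
    """Union concatenates the partition lists, so the counts add up (m + n)."""
    return a + b

def coalesce(partitions, target):
    """Merge neighbouring partitions into at most `target` groups; no row is re-hashed."""
    if target >= len(partitions):
        return partitions
    merged = [[] for _ in range(target)]
    for index, part in enumerate(partitions):
        merged[index * target // len(partitions)].extend(part)
    return merged

def make_df(num_partitions, rows_per_partition=3):
    """Build a toy DataFrame with the given number of equally sized partitions."""
    return [[0] * rows_per_partition for _ in range(num_partitions)]

result = make_df(4)
counts = []
for step in range(1, 17):
    result = union(result, make_df(4))
    if step % 8 == 0:
        # Coalesce at intervals (every 8 unions) instead of after every union.
        result = coalesce(result, 8)
    counts.append(len(result))

print(counts)
```

The partition count grows by 4 with each union and drops back to 8 at every eighth step, while the total number of rows is unchanged by the coalesce.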

Resources