Repartitioning dataframe from Spark limit() function - apache-spark

I need to use the limit function to get n entries/rows from a dataframe. I do know it's not advisable, but this is meant as a pre-processing step which will not be required when actually implementing the code. However, I've read elsewhere that the resulting dataframe from using the limit function has only 1 partition.
I want to measure the processing time for my job, which should not be limited by this. I did try repartitioning, but the performance improvement is minimal (if any at all). I checked the partitioning by printing out df.rdd.getNumPartitions() and it's still 1. Is there some way to force repartitioning to happen?
EDIT: Note that the getNumPartitions() was run after a count action.
EDIT2: Sample code
import pyspark.sql.functions as F
df = random_data.groupBy("col").count().sort(F.desc("count")).limit(100).repartition(10)
df.count()
print("No. of partitions: {0}".format(df.rdd.getNumPartitions())) # Prints 1

Calling cache() then count() worked.
I think Spark's lazy evaluation is not executing the repartition for some reason, but I'm not sure why since count is supposed to be an action.
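For reference, a minimal sketch of the fix described above (random_data is the question's DataFrame; the final partition count follows the edit above):
import pyspark.sql.functions as F

df = (random_data.groupBy("col").count()
      .sort(F.desc("count"))
      .limit(100)
      .repartition(10)
      .cache())
df.count()  # the count action materializes the cached, repartitioned result
print("No. of partitions: {0}".format(df.rdd.getNumPartitions()))  # now reports 10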

Related

How can I make my Spark Accumulator statistics reliable in Azure Databricks?

I am using a Spark accumulator to collect statistics for each pipeline.
In a typical pipeline I would read a DataFrame:
df = spark.read.format("csv").option("header", 'true').load('/mnt/prepared/orders')
df.count() ==> 7 rows
Then I would write it to two different locations:
df.write.format("delta").option("header", 'true').save('/mnt/prepared/orders')
df.write.format("delta").option("header", 'true').save('/mnt/reporting/orders_current/')
Unfortunately, my accumulator statistics get updated on each write operation. It reports 14 rows read, while I have only read the input DataFrame once.
How can I make my accumulator properly reflect the number of rows that I actually read?
I am a newbie in Spark. I have checked several threads around the issue, but did not find my answer.
Statistical accumulator in Python
spark Accumulator reset
When are accumulators truly reliable?
The first rule - accumulators aren't 100% reliable. They could be updated multiple times, for example, if tasks were restarted/retried.
In your case, although you call read once, that doesn't mean the data won't be re-read. The read operation just obtains metadata such as the schema (and may scan data if you use inferSchema for some data type), but it doesn't actually load the data into memory. Each write then triggers its own read of the source, which is why the accumulator is updated twice. You can cache the DataFrame you read, but that only helps for smaller data sets, and even then there is no guarantee that the cached data won't be evicted and need to be re-read.
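A rough sketch of that caching suggestion (paths and options follow the question; whether the accumulator ends up at 7 depends on where it is updated and on the cache not being evicted):
df = (spark.read.format("csv")
      .option("header", 'true')
      .load('/mnt/prepared/orders')
      .cache())
df.count()  # materializes the cache once; the writes below reuse it instead of re-reading the source

df.write.format("delta").option("header", 'true').save('/mnt/prepared/orders')
df.write.format("delta").option("header", 'true').save('/mnt/reporting/orders_current/')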

Spark: rdd.count() and rdd.write() are executing transformations twice

I am using Apache Spark to fetch records from a database and, after some transformations, write them to AWS S3. I also want to count the number of records I am writing to S3, and for that I am doing
rdd.count() and then
rdd.write()
This way, all the transformations execute twice, which causes performance issues.
Is there any way to achieve this without executing the transformations again?
Two actions - the count and the write - mean two passes over the data.
Assuming something like this:
val rdd = sc.parallelize(collectedData, 4)
then by adding .cache:
val rdd = sc.parallelize(collectedData, 4).cache
this will obviate the second set of re-reading in general, but not always. You can also look at persist and its storage levels. Of course, caching has an overhead as well, and it depends on the sizes in play.
The DAG Visualization on the Spark UI will show a green segment or dot, implying caching has been applied.
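The same idea in PySpark, as a hedged sketch (collected_data and the output path are placeholders):
rdd = spark.sparkContext.parallelize(collected_data, 4).cache()

record_count = rdd.count()              # first action: runs the lineage and fills the cache
rdd.saveAsTextFile("s3a://bucket/out")  # second action: reuses cached partitions instead of recomputing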

Create Empty DataFrame from list is x4 times slower than from emptyRDD()

We want to create an empty DataFrame at some point in our code. We have found this weird issue.
When creating it from an empty list, this slows down our program and causes every Spark action (e.g. df.write()) later on in the program to be about 4x slower:
spark.createDataFrame([], schema)
After lots of debugging, I've found this to solve the issue:
spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
I tried to look at the Spark source code, but couldn't reach any conclusion.
I also ran df.explain() on the DataFrames in the program, but the plans are identical.
My only thought is that the first option causes some extra communication with the worker nodes.
Does anyone know why the first option is so much slower than the second?
The spark.sparkContext.emptyRDD() creates an RDD with zero partitions, while spark.createDataFrame([], schema) creates a DataFrame with at least one partition.
The overhead comes from scheduling tasks for those empty partitions on every subsequent action.
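A quick way to see this (exact partition counts can vary with Spark version and defaults):
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("col", StringType())])

df_from_list = spark.createDataFrame([], schema)
df_from_rdd = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

print(df_from_list.rdd.getNumPartitions())  # typically >= 1, so every action still schedules empty tasks
print(df_from_rdd.rdd.getNumPartitions())   # 0, so there is nothing to schedule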

Repartition followed by coalesce is not honored

I would like to spin up a lot of tasks when doing my calculation but coalesce into a smaller set of partitions when writing to the table.
A simple example for a demonstration is given below, where repartition is NOT honored during the execution.
My expected output is that the map operation happens in 100 partitions and finally collect happens in only 10 partitions.
It seems Spark has optimized the execution by ignoring the repartition. It would be helpful if someone can explain how to achieve my expected behavior.
sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).coalesce(10).collect()
Instead of coalesce, using repartition helps to achieve the expected behavior.
sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).cache().repartition(10).collect()
This helps to solve my problem. But, still would appreciate an explanation for this behavior.
"Returns a new Dataset that has exactly numPartitions partitions, when (sic) the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. "
Source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset#coalesce(numPartitions:Int):org.apache.spark.sql.Dataset[T]
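A small check of the quoted behavior (a sketch; partition counts assume the stated semantics):
rdd = sc.parallelize(range(1, 1000), 10)
print(rdd.coalesce(100).getNumPartitions())    # stays at 10: coalesce will not increase partitions
print(rdd.repartition(100).getNumPartitions()) # 100: repartition shuffles to the requested number
Because coalesce avoids a shuffle, Spark can fold the upstream map into the same reduced set of tasks, whereas repartition inserts a shuffle boundary and so keeps the map running at the larger partition count.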

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X number of partitions (numAPartitions), and I union it to DataFrame 'B' (dfB) which has Y number of partitions (numBPartitions), then what will the resulting unioned DataFrame (unionedDF) look like with respect to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems like it's very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resulting unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
val unionedDF : DataFrame = dfA.unionAll(dfB)
unionedDF.repartition(optimalNumberOfPartitions).persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells Spark not to evict the DataFrame from memory, so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, partitions are important for Spark.
I am wondering if you could find that out yourself by calling:
yourResultedRDD.getNumPartitions()
Do I have to persist, post union?
In general, you have to persist/cache an RDD (no matter if it is the result of a union, or a potato :) ) if you are going to use it multiple times. Doing so will prevent Spark from computing it again and can increase the performance of your application, by 15% in some cases!
For example, if you are going to use the resulting RDD just once, it is safe not to persist it.
Do I have to repartition?
Since finding the number of partitions is not the real question here, you can read my memoryOverhead issue in Spark about how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data every executor will process.
Recall that a worker can host multiple executors; you can think of the worker as the machine/node of your cluster and the executor as a process (running on a core) on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely about Spark: when you handle big data you don't want unnecessary things sitting in memory, since that would threaten the safety of your application.
A DataFrame can be stored in temporary files that Spark creates for you, and is loaded into the memory of your application only when needed.
For more read: Should I always cache my RDD's and DataFrames?
Union just adds up the number of partitions of DataFrame 1 and DataFrame 2. Both DataFrames must have the same number of columns, in the same order, for the union operation to work. So no worries: even if the partitioning columns differ between the two DataFrames, there will be at most m + n partitions.
You don't need to repartition your DataFrame after the union; my suggestion is to use coalesce instead of repartition. coalesce merges adjacent or small partitions and avoids/reduces shuffling data between partitions.
If you cache/persist the DataFrame after each union, you will reduce performance, and the lineage is not broken by cache/persist; in that case garbage collection may clean the cache/memory during some heavy memory-intensive operation, and recomputing will increase the computation time, since partial recomputation is then required for the cleared/removed data.
Since Spark transformations are lazy - unionAll is a lazy operation, and coalesce/repartition are also lazy and only take effect at the first action - try to coalesce the unionAll result at intervals (for example, after every 8 unions) to reduce the partitions of the resulting DataFrame, as in the sketch below. Use checkpoints to break the lineage and store data if there are lots of memory-intensive operations in your solution.
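A hedged sketch of the union-then-coalesce suggestion (sizes and partition counts are illustrative):
df_a = spark.range(0, 1000).repartition(8)
df_b = spark.range(1000, 2000).repartition(4)

unioned = df_a.union(df_b)
print(unioned.rdd.getNumPartitions())    # 12: union simply concatenates the partitions (8 + 4)

compacted = unioned.coalesce(6)          # merges partitions without a full shuffle
print(compacted.rdd.getNumPartitions())  # 6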
