Does sortByKey in Spark invoke a shuffle? - apache-spark

I have a huge RDD in which I want to sort individual partitions locally. I looked into sortByKey operation, but it is not clear whether it invokes a shuffle or not. (I want to avoid the shuffle)
In Cloudera blog it is mentioned that sortByKey would involve shuffle but from the javadoc of sortByKey, it looks like there is no shuffle till collect() is invoked.
Question: Does sortByKey() involve shuffling of data ? If yes, then what would be the best way to sort data in each RDD partition ? If no, then how does collect() makes everything globally sorted ?

It involves a shuffle, but of course this happen only when an action is involved, as collect or take, in your graph of execution. This is because when the result of the sort has to be consumed from other transforms, record with same key has to be directed to the same consumer on the cluster.

Basically sortByKey() is an wide type transformation. Since All the transformation operations are lazy in nature, shuffling will happen only when u trigger an action (in your case collect()). In general, transformation are like instructions for an operation. Action will use this instructions for its executions. you can also refer DAG for more clear picture.

Related

Why can coalesce lead to too few nodes for processing?

I am trying to understand spark partitions and in a blog I come across this passage
However, you should understand that you can drastically reduce the parallelism of your data processing — coalesce is often pushed up further in the chain of transformation and can lead to fewer nodes for your processing than you would like. To avoid this, you can pass shuffle = true. This will add a shuffle step, but it also means that the reshuffled partitions will be using full cluster resources if possible.
I understand that coalesce means to take the data on some of the least data containing executors and shuffle them to already existing executors via a hash partitioner. I am not able to understand what the author is trying to say in this para though. Can somebody please explain to me what is being said in this paragraph?
Coalesce has some not so obvious effects due to Spark
Catalyst.
E.g.
Let’s say you had a parallelism of 1000, but you only wanted to write
10 files at the end. You might think you could do:
load().map(…).filter(…).coalesce(10).save()
However, Spark’s will effectively push down the coalesce operation to
as early a point as possible, so this will execute as:
load().coalesce(10).map(…).filter(…).save()
You can read in detail here an excellent article, that I quote from, that I chanced upon some time ago: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908
In summary: Catalyst treatment of coalesce can reduce concurrency early in the pipeline. This I think is what is being alluded to, though of course each case is different and JOIN and aggregating are not subject to such effects in general due to 200 default partitioning that applies for such Spark operations.
As what you have said in your question "coalesce means to take the data on some of the least data containing executors and shuffle them to already existing executors via a hash practitioner". This effectively means the following
The number of partitions have reduced
The main difference between repartition and coalesce is that in coalesce the movement of the data across the partitions is fewer than in repartition thus reducing the level of shuffle thus being more efficient.
Adding the property shuffle=true is just to distribute the data evenly across the nodes which is the same as using repartition(). You can use shuffle=true if you feel that your data might get skewed in the nodes after performing coalesce.
Hope this answers your question

What happens to the previous RDD when it gets transformed into a new RDD?

I am a beginner in Apache Spark. What I have understood so far regarding RDDs is that, once a RDD is generated, it cannot be modified but can be transformed into another RDD. Now I have several doubts here:
If an RDD is transformed to another RDD on applying some transformation on the RDD, then what happens to the previous RDD? Are both the RDDs stored in the memory?
If an RDD is cached, and some transformation is applied on the cached RDD to generate a new RDD, then can there be a scenario that, there is not enough space in RAM to hold the newly generated RDD? In case, such a scenario occurs, how will Spark handle that?
Thanks in advance!
Due to Spark's lazy evaluation, nothing happens when you do transformations on RDDs. Spark only starts computing when you call an action (save, take, collect, ...). So to answer your questions,
The original RDD stays where it is, and the transformed RDD does not exist / has not been computed yet due to lazy evaluation. Only a query plan is generated for it.
Regardless of whether the original RDD is cached, the transformed RDD will not be computed until an action is called. Therefore running out of memory shouldn't happen.
Normally when you run out of memory, either you encounter an OOM error, or the cached RDDs in memory will be spilled onto the disk to free up memory.
In order to understand the answers of the questions you have, you need to know couple of things about spark.
Spark evaluation Model (Lazy Evaluation)
Spark Operations (Transformations and Actions)
Directed Acyclic Graph (DAG)
Answer to your first Question:
You could think of RDD as virtual data structure that does not get filled with values unless there is some action called on it which materializes the rdd/dataframe. When you perform transformations it just creates query plan which shows the lazily evaluation behavior of spark. When action gets called, it perform all the transformation based on the physical plan that gets generated. So, nothing happens to the RDDs. RDD data gets pulled into memory when action gets called.
Answer to your Second Question:
If an RDD is cached and you perform multiple transformations on top of the cache RDD, actually nothing happens to the RDDs as cache is a transformation operation. Also the RDD that you have cached would be in memory when any action would be performed. So, you won't run out of memory.
You could run into memory issues if you are trying to cache each step of the transformation, which should be avoided.(Whether to cache or not Cache a dataframe/RDD is a million dollar question as beginner but you get to understand that as you learn the basics and spark architecture)
Other workflow where you can run out of memory is when you have huge data size and you are caching the rdd after some transformation as you would like to perform multiple actions on it or it is getting used in the workflow multiple times. In this case you need to verify your cluster configuration and need to make sure that it can handle the data that you are intending to cache.

Spark SQL joining multiple tables design

I am developing a Spark SQL analytics solutions using set of tables. Suppose there are 5 tables which i need to building my solution and finally i am creating one output table.
Here is my flow
dataframe1 = table1 join table2
dataframe2 = dataframe1 join table3
dataframe3 = datamframe2 + filter + agg
dataframe4 = dataframe3 join table4 join table 5
// finally
dataframe4.saveAsTable
When I save final dataframe that's when all the above dataframe is evaluated.
Is my approach is good? or
Do i need to cache/persist intermediate dataframes?
This is a very generic question and it is hard to provide a definitive answer.
Depending on the size of tables you would want to do broadcast hint for any of tables that are relatively small.
You can do this via
table_i.join(broadcast(table_j), ....)
This behaviour depends on the value in:
Now broadcast hint will be honoured only if Spark is able to evaluate the value of the table so you might need to cache().
Another option is via Spark checkpoints that can help to truncate local plan for optimisation (also this allows you to resume jobs from checkpoint location, it is similar to writing to HDFS but with some overhead).
In case of broadcasting few houndres of Mb tables, you might need to increase your kryo buffer:
--conf spark.kryoserializer.buffer.max=1g
It also depends which join types you will use.
You would probably want to do filter and aggregagtion as early as possible since it will reduce the join surface.
There are many other considerations to be consider in order to properly optimise this. In case of power law distribution of join keys in any of the joins you would need to do salting and explode smaller table.
In your case, in principle, there is not really a cache or persist required Why?
As there are no reuse paths evident (for other Actions or other Transformations within the same Action), it is all sequential.
Also, lazy evaluation and Catalyst.
Try the .explain and see how Spark will process.
However, due to memory eviction possibilities on the Cluster, there may be the need to re-compute on a Worker. There are various settings that you could apply via .cache and .persist, but Spark handles memory and disk spills without explicit .cache or .persist. See https://sparkbyexamples.com/spark/spark-difference-between-cache-and-persist/
Also, using .cache can affect performance. So use .explain. See here an excellent posting: Spark: Explicit caching can interfere with Catalyst optimizer's ability to optimize some queries?
So, each case is different but yours seems Ok to answer as I have. In summary: An RDD or DF that is not cached, nor check-pointed, is re-evaluated again each time an Action is invoked on that RDD or DF or if re-accessed within the current Action and no skipped stage situation applies. In your case no issue. Doing otherwise would slow your App down in fact.

Transformation vs Action in the context of Laziness

As mentioned in the "Learning Spark: Lightning-Fast Big Data Analysis" book:
Transformations and actions are different because of the way Spark computes RDDs.
After some explanation about laziness, as I found, both transformations and actions are working lazily. Therefore, the question is, what does the quoted sentence mean?
It is not necessarily valid to contrast laziness of RDD actions vs transformations.
The correct statement would be that RDDs are lazily evaluated, from the perspective of an RDD as a collection of data: there's not necessarily "data" in memory when the RDD instance is created.
The question raised by this statement is: when does the RDD's data get loaded in memory? Which can be rephrased as "when does the RDD get evaluated?". It's here that we have the distinction between actions and transformations:
Consider the following sequence of code:
Line #1:
rdd = sc.textFile("text-file-path")
Does the RDD exist? Yes.
Is the data loaded in memory? No.
--> RDD evaluation is lazy
Line #2:
rdd2 = rdd.map(lambda line: list.split())
Does the RDD exist? Yes. In fact, there are 2 RDDs.
Is the data loaded in memory? No.
--> Still, it's lazy, all Spark does is record how to load the data and transform it, remembering the lineage (how to derive RDDs one from another).
Line #3
print(rdd2.collect())
Does the RDD exist? Yes (2 RDDs still).
Is the data loaded in memory? Yes.
What's the difference? collect() forces Spark to return the result of the transformations. Spark now does all that it recorded in steps #1, #2, and #3.
In spark's terminology, #1 and #2 are transformations.
Transformations typically return another RDD instance, and that's a hint for recognizing the lazy part.
#3 has an action, which simply means an operation that causes plans in transformations to be carried out in order to return a result or perform a final action, such as saving results (yes, "such as saving the actual collection of data loaded in memory").
So, in short, I'd say that RDDs are lazily evaluated, but, in my opinion, it's incorrect to label operations (actions or transformations) as lazy or not.
Transformations are lazy, actions are not.
Definitions:
Transformation - A function that mutates the data out on the cluster. These actions will change the data in place when they are executed. Examples of this are map, filter, and aggregate. These are not executed until an action is called.
Action - Any function that results in data being persisted or returned to the driver (also foreach, which doesn't really fall into those two categories).
In order to run an action (like saving the data), all the transformations you have requested up till now have to be run to materialize the data. Spark can implement optimizations if it looks at the total execution plan of the operations you want to run, so it is beneficial not to compute anything until it is required.

Operations and methods to be careful about in Apache Spark?

What operations and/or methods do I need to be careful about in Apache Spark? I've heard you should be careful about:
groupByKey
collectAsMap
Why?
Are there other methods?
There're what you could call 'expensive' operations in Spark: all those that require a shuffle (data reorganization) fall in this category. Checking for the presence of ShuffleRDD on the result of rdd.toDebugString give those away.
If you mean "careful" as "with the potential of causing problems", some operations in Spark will cause memory-related issues when used without care:
groupByKey requires that all values falling under one key to fit in memory in one executor. This means that large datasets grouped with low-cardinality keys have the potential to crash the execution of the job. (think allTweets.keyBy(_.date.dayOfTheWeek).groupByKey -> bumm)
favor the use of aggregateByKey or reduceByKey to apply map-side reduction before collecting values for a key.
collect materializes the RDD (forces computation) and sends the all the data to the driver. (think allTweets.collect -> bumm)
If you want to trigger the computation of an rdd, favor the use of rdd.count
To check the data of your rdd, use bounded operations like rdd.first (first element) or rdd.take(n) for n elements
If you really need to do collect, use rdd.filter or rdd.reduce to reduce its cardinality
collectAsMap is just collect behind the scenes
cartesian: creates the product of one RDD with another, potentially creating a very large RDD. oneKRdd.cartesian(onekRdd).count = 1000000
consider adding keys and join in order to combine 2 rdds.
others?
In general, having an idea of the volume of data flowing through the stages of a Spark job and what each operation will do with it will help you keep mentally sane.

Resources