Why do these two ways of generating Spark RDDs have different data localities? - apache-spark

I am generating RDDs in two different ways on a local machine. The first way is:
rdd = sc.range(0, 100).sortBy(lambda x: x, numPartitions=10)
rdd.collect()
The second way is:
rdd = sc.parallelize(xrange(100), 10)
rdd.collect()
But my Spark UI showed different data localities for the two, and I don't know why. Below is the result from the first way; it shows that the Locality Level (the 5th column) is ANY:
And the result from the second way shows that the Locality Level is PROCESS_LOCAL:
I read at https://spark.apache.org/docs/latest/tuning.html that PROCESS_LOCAL is usually faster than ANY for processing.
Is this because the sortBy operation gives rise to a shuffle, which then influences the data locality? Can someone give me a clearer explanation?

You are correct.
In the first snippet you first create a parallelized collection, meaning your driver tells each worker to create some part of the collection. Then, because sorting requires each worker node to access data on other nodes, the data has to be shuffled around and data locality is lost.
The second code snippet is effectively not even a distributed job.
As Spark uses lazy evaluation, nothing is done until you materialize the results, in this case by calling the collect method. The steps in your second computation are effectively:
Distribute the object of type list from driver to worker nodes
Do nothing on each worker node
Collect distributed objects from workers to create object of type list on driver.
Spark is smart enough to realize that there is no reason to distribute the list even though parallelize is called. Since the data resides and the computation is done on the same single node, data locality is obviously preserved.
EDIT:
Some additional info on how Spark does sort.
Spark operates on the underlying MapReduce model (the programming model, not the Hadoop implementation) and sort is implemented as a single map and a reduce. Conceptually, on each node in the map phase, the part of the collection that a particular node operates on is sorted and written to memory. The reducers then pull relevant data from the mappers, merge the results and create iterators.
So, for your example, let's say you have a mapper that wrote the numbers 21-34 to memory in sorted order. Let's say the same node has a reducer that is responsible for the numbers 31-40. The reducer gets information from the driver about where the relevant data is. The numbers 31-34 are pulled from the same node, so that data only has to travel between threads. The other numbers, however, can be on arbitrary nodes in the cluster and need to be transferred over the network. Once the reducer has pulled all the relevant data from the nodes, the shuffle phase is over. The reducer now merges the results (as in mergesort) and creates an iterator over the sorted part of the collection.
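The map, pull and merge steps described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the function names, the input values and the range boundaries are all made up for the example, and "nodes" are just lists in one process.

```python
import heapq
from bisect import bisect_left

def map_side(partition, boundaries):
    """Sort one input partition and split it into one sorted run per reducer."""
    runs = [[] for _ in range(len(boundaries) + 1)]
    for value in sorted(partition):
        # bisect_left picks the reducer whose key range contains the value
        runs[bisect_left(boundaries, value)].append(value)
    return runs

def reduce_side(reducer_id, all_map_outputs):
    """Pull this reducer's run from every mapper and merge the sorted runs."""
    pulled = [outputs[reducer_id] for outputs in all_map_outputs]
    return list(heapq.merge(*pulled))  # merge step, as in mergesort

# Three hypothetical "nodes", each holding an unsorted slice of the data.
input_partitions = [[34, 21, 7, 15], [31, 2, 40, 18], [33, 9, 25, 38]]
# Hypothetical range boundaries: reducer 0 gets <=10, 1 gets 11-20, 2 gets 21-30, 3 gets >30.
boundaries = [10, 20, 30]

map_outputs = [map_side(p, boundaries) for p in input_partitions]
sorted_partitions = [reduce_side(r, map_outputs) for r in range(len(boundaries) + 1)]
print(sorted_partitions)
```

Note that the last reducer receives values from all three mappers; in a real cluster only the run produced on its own node would be local, which is why the locality level can no longer be PROCESS_LOCAL for every task.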
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

Related

Why are so many partitions required before shuffling data in Apache Spark?

Background
I am a newbie in Spark and want to understand shuffling in Spark.
I have the following two questions about shuffling in Apache Spark.
1) Why is there a change in the number of partitions before shuffling is performed? Spark does this by default, changing the partition count to the value given in spark.sql.shuffle.partitions.
2) Shuffling usually happens when there is a wide transformation. I have read in a book that the data is also saved on disk. Is my understanding correct?
Two questions actually.
Nowhere is it stated that you need to change this parameter. 200 is the default if it is not set. It applies to joins and aggregations. You may have a far bigger set of data that is better served by increasing the number of partitions for more processing capacity, if more Executors are available. If your data volume is huge, more parallelism, where possible, will in general reduce processing time.
Assuming an Action has been called (stated here to avoid the obvious comment otherwise), and assuming we are not talking about a ResultStage or a broadcast join, we are talking about a ShuffleMapStage. We look at an RDD initially:
A DAG dependency involving a shuffle means the creation of a separate Stage.
Map operations are followed by Reduce operations, then a Map, and so forth.
CURRENT STAGE
All the (fused) Map operations are performed intra-Stage.
The next Stage requirement, a Reduce operation - e.g. a reduceByKey, means the output is hashed or sorted by key (K) at end of the Map operations of current Stage.
This grouped data is written to disk on the Worker where the Executor is - or storage tied to that Cloud version. (I would have thought in memory was possible, if data is small, but this is an architectural Spark approach as stated from the docs.)
The ShuffleManager is notified that hashed, mapped data is available for consumption by the next Stage. ShuffleManager keeps track of all keys/locations once all of the map side work is done.
NEXT STAGE
The next Stage, being a reduce, then gets the data from those locations by consulting the Shuffle Manager and using Block Manager.
The Executor may be re-used, or be a new one on another Worker, or another Executor on the same Worker.
Stages mean writing to disk, even if enough memory is present. Given the finite resources of a Worker, it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' style of implementation.
Of course, fault tolerance is aided by this persistence: less re-computation work is needed.
Similar aspects apply to DFs.
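The two stages described above can be sketched in plain Python to make the flow concrete. This is a toy model under stated assumptions, not Spark code: `run_map_task`, `run_reduce_task` and the `registry` dictionary are hypothetical names, the registry only stands in for the bookkeeping that tracks where map output lives, and JSON files in a temporary directory stand in for shuffle files on a Worker's disk.

```python
import json
import os
import tempfile
from collections import defaultdict

def run_map_task(map_id, records, num_partitions, shuffle_dir, registry):
    """Current stage: bucket (key, value) records by hash of key, write each bucket to disk."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    for reduce_id, bucket in buckets.items():
        path = os.path.join(shuffle_dir, f"map{map_id}_reduce{reduce_id}.json")
        with open(path, "w") as f:
            json.dump(bucket, f)
        registry[reduce_id].append(path)  # record where the next stage can fetch it

def run_reduce_task(reduce_id, registry, combine):
    """Next stage: fetch every block registered for this reducer and combine values per key."""
    result = {}
    for path in registry[reduce_id]:
        with open(path) as f:
            for key, value in json.load(f):
                result[key] = combine(result[key], value) if key in result else value
    return result

map_inputs = [[("a", 1), ("b", 2), ("a", 3)], [("b", 4), ("c", 5)]]
num_partitions = 3  # plays the role of the shuffle partition count in this toy model
registry = defaultdict(list)

with tempfile.TemporaryDirectory() as shuffle_dir:
    for map_id, records in enumerate(map_inputs):
        run_map_task(map_id, records, num_partitions, shuffle_dir, registry)
    reduced = [run_reduce_task(r, registry, lambda x, y: x + y) for r in range(num_partitions)]

totals = {key: value for part in reduced for key, value in part.items()}
print(totals)
```

Changing `num_partitions` changes how many reduce tasks there are and which keys land in which task, but not the combined totals, which is the sense in which the partition count is a tuning knob for parallelism rather than a correctness setting.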

Spark - do transformations also involve driver operations

My course notes have the following sentence: "RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset." But I think this is misleading because the transformation reduceByKey is performed locally on the workers and then on the driver as well (although the change does not take place until there's an action to be performed). Could you please correct me if I am wrong.
Here are the concepts:
In Spark, a transformation is an operation in which one RDD generates one or more RDDs. A new RDD is created every time. RDDs are immutable, so any transformation on an RDD generates a new RDD, which is added to the DAG.
An action in Spark is a function that does not generate a new RDD; it generates other data types such as String, int, etc., and the result is returned to the driver or written to another storage system.
Transformations are lazy in nature, and nothing happens until an action is triggered.
reduceByKey - It is a transformation, as it generates an RDD from the input RDD, and it is a WIDE TRANSFORMATION. With reduceByKey nothing happens until an action is triggered. Please see the image below
reduce - It is an action, as it generates a non-RDD type. Please see the image below
As a matter of fact, the driver's first responsibility is managing the job. Moreover, an RDD's objects are not located on the driver, so an action cannot run on them there. So all the results stay on the workers until an action's turn comes. What I mean concerns Spark's lazy execution: at the start of execution the plan is examined up to the first action, and if none is found the whole program produces nothing. Otherwise, the whole program is executed on the input data, which is represented as RDD objects on the worker nodes, until the action is reached. During this period all the data stays on the workers, and only the result, according to the type of the action, is sent to, or at least managed by, the driver.
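The difference between a lazy transformation and an action can be illustrated with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this example, not part of any Spark API; it only models narrow transformations and two actions.

```python
from functools import reduce as fold

class LazyDataset:
    """Toy stand-in for an RDD: records a plan, computes only when an action runs."""

    def __init__(self, partitions, plan=()):
        self.partitions = partitions
        self.plan = plan  # pending per-partition steps; nothing has run yet

    def map(self, fn):
        # Transformation: returns a NEW dataset with a longer plan, runs nothing.
        step = lambda part: [fn(x) for x in part]
        return LazyDataset(self.partitions, self.plan + (step,))

    def _compute(self, part):
        for step in self.plan:
            part = step(part)
        return part

    def collect(self):
        # Action: run the plan on every partition and hand the results to the caller.
        return [x for part in self.partitions for x in self._compute(part)]

    def reduce(self, fn):
        # Action: combine inside each partition first, then combine the partial results.
        partials = [fold(fn, self._compute(p)) for p in self.partitions if p]
        return fold(fn, partials)

calls = []

def double(x):
    calls.append(x)  # lets us observe when the function actually runs
    return 2 * x

ds = LazyDataset([[1, 2], [3, 4]])
doubled = ds.map(double)          # transformation: no call to double yet
calls_before_action = len(calls)
result = doubled.collect()        # action: double runs now, once per element
calls_after_collect = len(calls)
total = doubled.reduce(lambda a, b: a + b)
print(calls_before_action, result, calls_after_collect, total)
```

In this sketch the per-partition work in `reduce` corresponds to what the workers do, and only the final fold over the partial results corresponds to work on the driver side.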

How to broadcast the content of an RDD efficiently

So I need to broadcast some related content from an RDD to all worker nodes, and I am trying to do it more efficiently.
More specifically, an RDD is created dynamically in the middle of the execution. To broadcast some of its content to all the worker nodes, an obvious solution would be to traverse its elements one by one, create a list/vector/hashmap to hold the needed content while traversing, and then broadcast this data structure to the cluster.
This does not seem to be a good solution at all: the RDD can be huge and is already distributed, so traversing it and creating an array/list from the traversal result will be very slow.
So what would be a better solution, or best practice, for this case? Would it be a good idea to run a SQL query on the RDD (after converting it to a DataFrame) to get the needed content, and then broadcast the query result to all the worker nodes?
Thank you in advance for your help!
The following is added after reading Varslavans' answer:
An RDD is created dynamically and has the following content:
[(1,1), (2,5), (3,5), (4,7), (5,1), (6,3), (7,2), (8,2), (9,3), (10,3), (11,3), ...... ]
So this RDD contains key-value pairs. What we want is to collect all the pairs whose value is > 3, so the pairs (2,5), (3,5), (4,7), ... will be collected. Once we have collected all these pairs, we would like to broadcast them so that all the worker nodes have them.
Sounds like we should use collect() on the RDD and then broadcast... at least this is the best solution at this point.
Thanks again!
First of all - you don't need to traverse the RDD to get all the data. There is an API for that - collect().
Second: broadcast is not the same as distributed.
In broadcast - you have all the data on each node
In distributed - you have different parts of a whole on each node
An RDD is distributed by its nature.
Third: To get the needed content you can either use the RDD API or convert it to a DataFrame and use SQL queries. It depends on the data you have. Either way, the result will be an RDD or a DataFrame, and it will also be distributed. So if you need the data locally - you collect() it.
Btw, from your question it's not possible to understand exactly what you want to do, and it looks like you need to read up on Spark basics. That will give you many answers :)
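The filter, collect and broadcast flow discussed in this thread can be sketched in plain Python, using the pairs from the question. This is only a model: the partition layout is made up, lists stand in for partitions on workers, and a read-only mapping stands in for a broadcast variable. In PySpark the corresponding calls would be `rdd.filter(...)`, `collect()` and `sc.broadcast(...)`.

```python
from types import MappingProxyType

# Distributed: each hypothetical worker holds a different part of the whole.
partitions = [
    [(1, 1), (2, 5), (3, 5), (4, 7)],
    [(5, 1), (6, 3), (7, 2), (8, 2)],
    [(9, 3), (10, 3), (11, 3)],
]

def filter_partition(part, threshold):
    """Runs where the data lives, so only matching pairs ever leave a partition."""
    return [(k, v) for k, v in part if v > threshold]

# Step 1: filter on each partition (still distributed).
filtered = [filter_partition(p, 3) for p in partitions]

# Step 2: collect only the survivors into one local structure on the driver.
collected = {k: v for part in filtered for k, v in part}

# Step 3: broadcast, so every worker sees the same full read-only copy.
worker_views = [MappingProxyType(collected) for _ in partitions]

# Each worker can now look up its own keys against the broadcast data.
hits_per_worker = [[k for k, _ in part if k in view]
                   for part, view in zip(partitions, worker_views)]
print(collected, hits_per_worker)
```

The point of filtering before collecting is that the driver only ever holds the pairs that pass the condition, not the whole RDD.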

How does Spark's RDD achieve fault tolerance?

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. But I did not find the internal mechanism by which the RDD achieves fault tolerance. Could somebody describe this mechanism? Thanks.
Let me explain in very simple terms, as I understand it.
Faults in a cluster can happen when one of the nodes processing data crashes. In Spark terms, an RDD is split into partitions, and each node (called an executor) operates on a partition at any point in time. (Theoretically, each executor can be assigned multiple tasks, depending on the number of cores assigned to the job versus the number of partitions present in the RDD.)
By operation, what is really happening is a series of Scala functions (called transformations and actions in Spark terms, depending on whether the function is pure or side-effecting) executing on a partition of the RDD. These operations are composed together, and the Spark execution engine views them as a Directed Acyclic Graph of operations.
Now, suppose a particular node crashes in the middle of an operation Z, which depended on operation Y, which in turn depended on operation X. The cluster manager (YARN/Mesos) finds out that the node is dead and tries to assign another node to continue processing. This node will be told to operate on the particular partition of the RDD and given the series of operations X->Y->Z (called the lineage) that it has to execute, by passing in the Scala closures created from the application code. Now the new node can happily continue processing, and there is effectively no data loss.
Spark also uses this mechanism to guarantee exactly-once processing, with the caveat that any side-effecting operation you perform, such as calling a database in a Spark Action block, can be invoked multiple times. But if you view your transformations as pure functional mappings from one RDD to another, then you can rest assured that the resulting RDD will have the elements from the source RDD processed only once.
The domain of fault tolerance in Spark is vast and needs a much bigger explanation. I am hoping to see others come up with technical details on how this is implemented, etc. Thanks for the great topic though.
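The lineage replay described above can be sketched in plain Python. Everything here is hypothetical and simplified for illustration: workers are dictionaries, a crash is a raised exception, and `schedule` stands in for whatever component reassigns the partition to another node.

```python
class NodeCrashed(Exception):
    pass

def run_on_worker(worker, source_partition, lineage):
    """Apply the operations X -> Y -> Z in order to one partition on one worker."""
    data = list(source_partition)
    for name, op in lineage:
        if name in worker["crashes_on"]:
            raise NodeCrashed(f"{worker['name']} died during {name}")
        data = [op(x) for x in data]
    return data

def schedule(source_partition, lineage, workers):
    """Try workers in turn; after a crash, replay the whole lineage somewhere else."""
    for worker in workers:
        try:
            return worker["name"], run_on_worker(worker, source_partition, lineage)
        except NodeCrashed:
            continue  # intermediate results on the dead node are lost, the lineage is not
    raise RuntimeError("no healthy worker left")

# The lineage: pure functions, so replaying them yields the same result.
lineage = [("X", lambda v: v + 1), ("Y", lambda v: v * 10), ("Z", lambda v: v - 3)]
workers = [
    {"name": "node-1", "crashes_on": {"Z"}},   # dies in the middle of Z
    {"name": "node-2", "crashes_on": set()},
]

finished_on, result = schedule([1, 2, 3], lineage, workers)
print(finished_on, result)
```

Because X and Y had already run on node-1 before it crashed and are run again on node-2, any side effect inside them would happen twice, which is the caveat about side-effecting operations mentioned above.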

Spark cartesian doesn't cause shuffle?

So, I tried to test which Spark operations cause shuffling, based on this stackoverflow post: LINK. However, it doesn't make sense to me that the cartesian operation doesn't cause shuffling in Spark, since it needs to move the partitions across the network in order to put them together locally.
How does Spark actually perform its cartesian and distinct operations behind the scenes?
Shuffle is an operation which is specific to RDDs of key-value pairs (RDD[(T, U)], commonly described as PairRDDs or PairwiseRDDs) and is more or less equivalent to the shuffle phase in Hadoop. The goal of a shuffle is to move data to a specific executor based on the key value and the Partitioner.
There are different types of operations in Spark which require network traffic but don't use the same type of logic as a shuffle and do not always require key-value pairs. Cartesian product is one of these operations. It moves data between machines (in fact it causes much more expensive data movements) but doesn't establish a relationship between keys and executors.
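The contrast with a key-based shuffle can be illustrated in plain Python. This is a sketch of the partition-pairing idea only, not Spark's implementation, and the `cartesian` function and sample data are made up for the example: every output partition is built from one whole left partition and one whole right partition, and no key or Partitioner decides where an element goes.

```python
from itertools import product

def cartesian(left_partitions, right_partitions):
    """Build one output partition per (left partition, right partition) pair."""
    out = []
    for left, right in product(left_partitions, right_partitions):
        # The whole of both input partitions is needed here, so one of them
        # generally has to be fetched from another machine.
        out.append([(a, b) for a in left for b in right])
    return out

left = [[1, 2], [3]]
right = [["a"], ["b", "c"]]
pairs = cartesian(left, right)
print(len(pairs), pairs)
```

With 2 left and 2 right partitions the sketch produces 4 output partitions, and each input partition is read twice, which gives a feel for why this kind of data movement can be more expensive than a shuffle.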
