How does a Spark RDD achieve fault tolerance? - apache-spark

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. However, I could not find the internal mechanism by which an RDD achieves fault tolerance. Could somebody describe this mechanism? Thanks.

Let me explain in very simple terms, as I understand it.
Faults in a cluster can happen when one of the nodes processing data crashes. In Spark terms, an RDD is split into partitions, and each executor (a process running on a worker node) operates on a partition at any point in time. (In practice, each executor can be assigned multiple tasks, depending on the number of cores assigned to the job versus the number of partitions present in the RDD.)
By operation, what is really happening is a series of Scala functions executing on a partition of the RDD. In Spark terms these are called transformations when they lazily describe a new RDD, and actions when they trigger the computation and return a result or write output. These operations are composed together, and the Spark execution engine views them as a Directed Acyclic Graph of operations.
Now suppose a particular node crashes in the middle of an operation Z, which depended on operation Y, which in turn depended on operation X. The cluster manager (YARN/Mesos) finds out the node is dead, and the lost work is assigned to another node. This node is told which partition of the RDD to operate on and the series of operations X->Y->Z (called the lineage) that it has to execute, by passing in the Scala closures created from the application code. The new node recomputes the partition from that lineage, so there is effectively no data loss.
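To make the lineage idea concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's API: the functions op_x, op_y and op_z, the crash simulation and the sample data are all made up for illustration. The point is only that a lost partition can be rebuilt by replaying the same ordered list of functions on the same source data.

```python
# Toy model of lineage-based recovery in plain Python (not Spark's API).
from functools import reduce


def op_x(n):  # hypothetical operation X
    return n + 1


def op_y(n):  # hypothetical operation Y
    return n * 2


def op_z(n):  # hypothetical operation Z
    return n - 3


# The lineage is simply the ordered list of functions to apply.
LINEAGE = [op_x, op_y, op_z]


def compute_partition(source_rows, lineage):
    """Apply every operation in the lineage, in order, to each row."""
    return [reduce(lambda value, op: op(value), lineage, row)
            for row in source_rows]


def run_with_one_failure(source_rows, lineage):
    """The first attempt 'crashes'; the retry replays the whole lineage."""
    attempts = 0
    while True:
        attempts += 1
        try:
            if attempts == 1:
                raise RuntimeError("simulated node crash")
            return compute_partition(source_rows, lineage), attempts
        except RuntimeError:
            pass  # reschedule the same partition with the same lineage


partition = [1, 2, 3]
result, attempts = run_with_one_failure(partition, LINEAGE)
print(result, attempts)
```

Because the three functions are pure, the retried attempt produces exactly what an undisturbed run would have produced.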
Spark also uses this mechanism to guarantee exactly-once processing, with the caveat that any side-effecting operation, such as calling a database inside a Spark action, can be invoked multiple times. But if you view your transformations as pure functional mappings from one RDD to another, then you can rest assured that the resulting RDD will contain the elements of the source RDD processed only once.
The domain of fault tolerance in Spark is vast and needs a much longer explanation. I am hoping to see others come up with technical details on how this is implemented. Thanks for the great topic, though.


Why are so many partitions required before shuffling data in Apache Spark?

I am a newbie in Spark and want to understand shuffling in Spark.
I have the following two questions about shuffling in Apache Spark.
1) Why does the number of partitions change before a shuffle is performed? Spark does this by default, changing the partition count to the value given in spark.sql.shuffle.partitions.
2) Shuffling usually happens when there is a wide transformation. I have read in a book that the data is also saved on disk. Is my understanding correct?
Two questions actually.
Nowhere is it stated that you need to change this parameter. 200 is the default if it is not set, and it applies to joins and aggregations. You may have a far bigger set of data that is better served by increasing the number of partitions for more processing capacity, if more Executors are available. In general, if your data volume is huge, more parallelism (where possible) will speed up processing.
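As a rough illustration of why the partition count matters, here is a plain-Python sketch (not Spark code) that assigns records to shuffle partitions by hashing the key. The helper names and the 1000 integer keys are made up, and Spark's real partitioner uses its own hash function; the sketch only shows that the same data ends up in fewer, larger partitions or in more, smaller ones depending on the configured count.

```python
from collections import Counter


def partition_for(key, num_partitions):
    """Pick a shuffle partition from the key's hash (integer keys assumed)."""
    return hash(key) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records land in each shuffle partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


keys = list(range(1000))            # 1000 records with distinct integer keys
few = partition_sizes(keys, 4)      # 4 large partitions
many = partition_sizes(keys, 200)   # 200 small partitions (the default count)

print(max(few), max(many))
```

With more partitions each task handles less data, which only helps if there are enough cores to run those tasks in parallel.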
Assuming an Action has been called (stated here to pre-empt the obvious comment), and assuming we are not talking about a ResultStage or a broadcast join, we are talking about a ShuffleMapStage. We look at an RDD initially:
DAG dependency involving a shuffle means creation of a separate Stage.
Map operations are followed by Reduce operations and a Map and so forth.
All the (fused) Map operations are performed intra-Stage.
The next Stage requirement, a Reduce operation - e.g. a reduceByKey, means the output is hashed or sorted by key (K) at end of the Map operations of current Stage.
This grouped data is written to disk on the Worker where the Executor is - or storage tied to that Cloud version. (I would have thought in memory was possible, if data is small, but this is an architectural Spark approach as stated from the docs.)
The ShuffleManager is notified that hashed, mapped data is available for consumption by the next Stage. ShuffleManager keeps track of all keys/locations once all of the map side work is done.
The next Stage, being a reduce, then gets the data from those locations by consulting the Shuffle Manager and using Block Manager.
The Executor may be re-used, or be a new one on another Worker, or another Executor on the same Worker.
Stages mean writing to disk, even if enough memory present. Given finite resources of a Worker it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' style of implementation.
Of course, fault tolerance is aided by this persistence, less re-computation work.
Similar aspects apply to DFs.
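The map-side and reduce-side steps described above can be sketched in plain Python. This is a toy model of a reduceByKey-style sum, not Spark's implementation: NUM_REDUCERS, the function names and the sample records are assumptions, and the in-memory buckets merely stand in for the shuffle files that Spark writes to local disk.

```python
from collections import defaultdict

NUM_REDUCERS = 2  # assumed number of reduce-side partitions


def map_stage(partition, num_reducers):
    """Combine values per key locally, then bucket the output by reducer."""
    combined = defaultdict(int)
    for key, value in partition:
        combined[key] += value
    buckets = [dict() for _ in range(num_reducers)]
    for key, value in combined.items():
        buckets[hash(key) % num_reducers][key] = value
    return buckets  # stands in for the shuffle files on local disk


def reduce_stage(all_map_outputs, reducer_id):
    """Fetch this reducer's bucket from every map output and merge by key."""
    merged = defaultdict(int)
    for buckets in all_map_outputs:
        for key, value in buckets[reducer_id].items():
            merged[key] += value
    return dict(merged)


input_partitions = [
    [(1, 10), (2, 20), (1, 5)],
    [(2, 1), (3, 7), (4, 2)],
]
map_outputs = [map_stage(p, NUM_REDUCERS) for p in input_partitions]
reduced = [reduce_stage(map_outputs, r) for r in range(NUM_REDUCERS)]
print(reduced)
```

Every record for a given key ends up with the same reducer, which is why the map output has to be bucketed by key before the next Stage can start.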

Difference between one-pass and multi-pass computations

I'm reading an article on Apache Spark and I came across the following sentence:
"Hadoop as a big data processing technology has been around for 10 years and has proven to be the solution of choice for processing large data sets. MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms." (Full article)
Searching the web yields results about the difference between one-pass and multi-pass compilers (For instance, see This SO question)
However, I'm not really sure if the answer also applies to data processing. Can somebody explain to me what one-pass computation and multi-pass computation are, and why the latter is better and thus used in Spark?
[Figure: MapReduce. Source: https://www.guru99.com/introduction-to-mapreduce.html]
Here you can see, the input file is processed as follows.
first split
goes into mapping phase
In the MapReduce paradigm, the intermediate result is written to disk after every stage. Also, the Mapper and the Reducer are two different processes: first the mapper job runs and writes out the mapping files, then the reducer job starts. At every stage the job requires resource allocation. Therefore, a single MapReduce job requires multiple steps. If you have multiple map phases, the data has to be written out to disk after every map before the next map task starts. This is the multi-step process.
Each step in the data processing workflow has one Map phase and one Reduce phase and you'll need to convert any use case into MapReduce pattern to leverage this solution.
On the other hand, Spark does the resource negotiation only once. Once the negotiation is completed, it spawns all the executors, and they stay for the duration of the job.
During execution, Spark doesn't write the intermediate output of the Map phases to disk but keeps it in memory. Therefore, all the map operations can happen back to back without writing to disk or spawning new executors. This is the single-step process.
Spark allows programmers to develop complex, multi-step data pipelines using directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
A one-pass computation reads the dataset and processes it once, whereas a multi-pass computation performs multiple computations or operations on the same dataset. Apache Spark allows you to read the data from disk once and cache it in memory, and then perform multi-pass computations on it. These computations can be done very quickly because the data is already in the memory of the machine and Spark does not need to read it again from disk, which saves a lot of input/output time. Spark is described as an in-memory processing framework, which means that the data on which the computation is done is held in memory.
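A small plain-Python sketch of the difference (a toy model, not Spark code): CountingSource is a made-up stand-in for a dataset on disk that counts how often it is read, and the two functions run the same multi-pass computation with and without keeping the rows in memory.

```python
class CountingSource:
    """Stands in for a dataset on disk and counts how often it is read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._rows)


def multi_pass_without_cache(source, passes):
    """Re-read the source on every pass, as chained MapReduce jobs would."""
    total = 0
    for _ in range(passes):
        total += sum(source.read())
    return total


def multi_pass_with_cache(source, passes):
    """Read once, keep the rows in memory, and reuse them on every pass."""
    cached = source.read()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


uncached_source = CountingSource([1, 2, 3, 4])
cached_source = CountingSource([1, 2, 3, 4])
uncached_total = multi_pass_without_cache(uncached_source, passes=5)
cached_total = multi_pass_with_cache(cached_source, passes=5)
print(uncached_source.reads, cached_source.reads)
```

Both variants compute the same answer; only the number of reads from the slow source differs, and that is where iterative algorithms gain from in-memory reuse.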

Spark - do transformations also involve driver operations

My course notes have the following sentence: "RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset." But I think this is misleading because the transformation reduceByKey is performed locally on the workers and then on the driver as well (although the change does not take place until there's an action to be performed). Could you please correct me if I am wrong.
Here are the concepts.
In Spark, a transformation is an operation in which one RDD generates one or more new RDDs. RDDs are immutable, so any transformation on an RDD generates a new RDD, which is added to the DAG.
An action in Spark is a function that does not generate a new RDD; it generates other data types such as String or int, and the result is returned to the driver or written to another storage system.
Transformations are lazy in nature, and nothing happens until an action is triggered.
reduceByKey - It is a transformation, as it generates an RDD from the input RDD, and it is a WIDE TRANSFORMATION. With reduceByKey nothing happens until an action is triggered.
reduce - It is an action, as it generates a non-RDD type.
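The lazy behaviour can be mimicked in a few lines of plain Python. LazyDataset below is a made-up toy class, not Spark's RDD: its map and filter only record what should be done, and nothing is computed until collect or reduce is called.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for an RDD: transformations only record work to do."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # recorded transformations, i.e. the plan

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing.
        return LazyDataset(self._rows, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: returns a new dataset, computes nothing.
        return LazyDataset(self._rows, self._steps + (("filter", predicate),))

    def _run(self):
        rows = iter(self._rows)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self):
        # Action: runs the plan and returns a plain list.
        return self._run()

    def reduce(self, fn):
        # Action: runs the plan and returns a single value.
        return fold(fn, self._run())


calls = []


def double(n):
    calls.append(n)  # record that the function really ran
    return n * 2


base = LazyDataset([1, 2, 3, 4])
doubled = base.map(double).filter(lambda n: n > 2)
calls_before_action = len(calls)  # nothing has run yet
total = doubled.reduce(lambda a, b: a + b)
print(calls_before_action, total)
```

Note that base is never modified; each transformation returns a new object with a longer plan, which mirrors the immutability described above.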
As a matter of fact, the driver's first responsibility is managing the job. Moreover, the RDD's data is not located on the driver for an action to operate on it there, so all the results stay on the workers until an action's turn comes. What I mean concerns Spark's lazy execution: at the start of execution the plan is scanned up to the first action, and if none is found the whole program produces nothing. Otherwise, the program is executed on the input data, which is represented as RDD partitions on the worker nodes, until the action is reached. During this period all the data stays on the workers, and only the result, depending on the type of the action, is sent to, or at least managed by, the driver.

Why do these two ways of generating Spark RDDs have different data localities?

I am generating RDDs in two different ways on a local machine. The first way is:
rdd = sc.range(0, 100).sortBy(lambda x: x, numPartitions=10)
The second way is:
rdd = sc.parallelize(xrange(100), 10)
But my Spark UI showed a different data locality for each, and I don't know why. Below is the result from the first way; it shows that the Locality Level (the 5th column) is ANY
And the result from the second way shows the Locality Level is Process_Local:
And I read from https://spark.apache.org/docs/latest/tuning.html , Process_Local Level is usually faster than Any Level for processing.
Is this because the sortBy operation gives rise to a shuffle, which then influences the data locality? Can someone give me a clearer explanation?
You are correct.
In the first snippet you first create a parallelized collection, meaning your driver tells each worker to create some part of the collection. Then, since sorting requires each worker node to access data on other nodes, the data needs to be shuffled around and data locality is lost.
The second code snippet is effectively not even a distributed job.
As Spark uses lazy evaluation, nothing is done until you call an action to materialize the results, in this case the collect method. The steps in your second computation are effectively
Distribute the object of type list from driver to worker nodes
Do nothing on each worker node
Collect distributed objects from workers to create object of type list on driver.
Spark is smart enough to realize that there is no reason to distribute the list even though parallelize is called. Since the data resides and the computation is done on the same single node, data locality is obviously preserved.
Some additional info on how Spark does sort.
Spark operates on the underlying MapReduce model (the programming model, not the Hadoop implementation) and sort is implemented as a single map and a reduce. Conceptually, on each node in the map phase, the part of the collection that a particular node operates on is sorted and written to memory. The reducers then pull relevant data from the mappers, merge the results and create iterators.
So, for your example, let's say you have a mapper that wrote numbers 21-34 to memory in sorted order. Let's say the same node has a reducer that is responsible for numbers 31-40. The reducer gets information from the driver about where the relevant data is. The numbers 31-34 are pulled from the same node, and that data only has to travel between threads. The other numbers, however, can be on arbitrary nodes in the cluster and need to be transferred over the network. Once the reducer has pulled all the relevant data from the nodes, the shuffle phase is over. The reducer now merges the results (like in mergesort) and creates an iterator over the sorted part of the collection.
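Here is a plain-Python sketch of that map/reduce style sort (a toy model, not Spark's implementation). The input partitions and the three key ranges are made up; heapq.merge plays the role of the mergesort-like merge on the reducer side.

```python
import heapq


def map_side_sort(partition):
    """Each mapper sorts only the slice of data it holds."""
    return sorted(partition)


def reduce_side_merge(sorted_runs, low, high):
    """Pull the values in [low, high] from every sorted run and merge them."""
    relevant = [[v for v in run if low <= v <= high] for run in sorted_runs]
    return list(heapq.merge(*relevant))


# Three unsorted input partitions (made-up data).
partitions = [[34, 21, 30, 25], [40, 5, 33, 12], [31, 38, 2, 36]]
sorted_runs = [map_side_sort(p) for p in partitions]

# Hypothetical key ranges, one per reducer.
ranges = [(0, 20), (21, 30), (31, 40)]
reducer_outputs = [reduce_side_merge(sorted_runs, lo, hi) for lo, hi in ranges]

# Concatenating the reducers' outputs in range order gives the full sort.
fully_sorted = [v for out in reducer_outputs for v in out]
print(reducer_outputs)
```

The reducer for 31-40 has to pull values from all three map outputs, which is the network transfer that turns the locality level into ANY for the post-shuffle tasks.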

How to know which piece of code runs on driver or executor?

I am new to Spark. How do I know which piece of code will run on the driver and which will run on the executors?
Do we always have to try to code such that everything runs on the executors? Are there any recommendations/ways to make most of your code run on executors?
Update: As far as I understand, transformations run on executors and actions run on the driver because they need to return a value. So is it fine if the action runs on the driver, or should it also run on an executor? Where does the driver actually run? On the cluster?
Any Spark application consists of a single Driver process and one or more Executor processes. The Driver process typically runs on the Master node of your cluster (the exact location depends on the deploy mode), and the Executor processes run on the Worker nodes. You can increase or decrease the number of Executor processes dynamically depending upon your usage, but the Driver process will exist throughout the lifetime of your application.
The Driver process is responsible for a lot of things including directing the overall control flow of your application, restarting failed stages and the entire high level direction of how your application will process the data.
Coding your application so that more data is processed by Executors falls more under the purview of optimising your application so that it processes data more efficiently/faster making use of all the resources available to it in the cluster.
In practice, you do not really need to worry about making sure that more of your data is being processed by executors.
That being said, there are some Actions which, when triggered, necessarily involve moving data around. If you call the collect action on an RDD, all the data is brought to the Driver process, and if your RDD holds a sufficiently large amount of data, the application will fail with an Out Of Memory error, as the single machine running the Driver process will not be able to hold all the data.
Keeping the above in mind, Transformations are lazy and Actions are not.
Transformations basically transform one RDD into another. But calling a transformation on an RDD does not actually result in any data being processed anywhere, Driver or Executor. All a transformation does is that it adds to the DAG's lineage graph which will be executed when an Action is called.
So the actual processing happens when you call an Action on an RDD. The simplest example is that of calling collect. As soon as an action is called, Spark gets to work and executes the previously saved DAG computations on the specified RDD, returning the result back. Where these computations are executed depends entirely on your application.
There is no simple and straightforward answer here.
As a rule of thumb everything that is executed inside closures of higher order functions like mapPartitions (map, filter, flatMap) or combineByKey should be handled mostly by executor machines. Everything outside these are handled by the driver. But you have to be aware that it is a serious simplification.
Depending on the specific method and language, at least a part of the job can be handled by the driver. For example, when you use combine-like methods (reduce, aggregate), the final merging is applied locally on the driver machine. Complex algorithms (like many ML / MLlib tools) can interleave distributed and local processing when needed.
Moreover, data processing is only a fraction of a whole job. The driver is responsible for bookkeeping, accumulator processing, initial broadcasting and other secondary tasks. It also handles lineage and DAG processing and generates execution plans for higher level APIs (Dataset, SparkSQL).
While the whole picture is relatively complex, in practice your choices are relatively limited. You can:
Avoid collecting data (collect, toLocalIterator) to process locally.
Perform more work on the workers with tree* (treeAggregate, treeReduce) methods.
Avoid unnecessary tasks which increase bookkeeping costs.
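To illustrate the idea behind the tree* methods, here is a plain-Python sketch (a toy model, not the Spark API). The fan_in parameter, the helper names and the partial sums are made up; the sketch counts how many merges the driver would perform when it combines all partial results itself versus when intermediate rounds are done elsewhere first.

```python
def flat_reduce(partials, combine):
    """Merge every partial result in one place; returns (result, merge count)."""
    result = partials[0]
    merges = 0
    for value in partials[1:]:
        result = combine(result, value)
        merges += 1
    return result, merges


def tree_reduce(partials, combine, fan_in=2):
    """Merge partials in rounds of `fan_in` (must be >= 2).

    Only the last round is counted, since in the scenario being modelled
    the earlier rounds would run on executors rather than on the driver.
    """
    level = list(partials)
    while len(level) > fan_in:
        level = [flat_reduce(level[i:i + fan_in], combine)[0]
                 for i in range(0, len(level), fan_in)]
    return flat_reduce(level, combine)


def add(a, b):
    return a + b


partials = [3, 1, 4, 1, 5, 9, 2, 6]  # one made-up partial sum per partition
flat_total, flat_driver_merges = flat_reduce(partials, add)
tree_total, tree_driver_merges = tree_reduce(partials, add)
print(flat_driver_merges, tree_driver_merges)
```

The result is the same either way; what changes is how much merging work and how many partial results land on the single driver machine.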
To this part of your question: "Update: As far as I understand, transformations run on executors and actions run on the driver because they need to return a value."
It is not true that only transformations run on the executors and all actions run on the driver.
If we have to join 2 datasets where there is no aggregate operation that needs to be performed eg :
In this case, as soon as an executor finishes working on its partition, it starts writing the result to HDFS or some other persistent store without waiting for the other executors to complete. This is the reason why we see different part files, which are technically the partitions that each executor processed.
The driver does not wait for all executors to complete their computation.
Where does the driver actually run? on cluster?
Depends on the --deploy-mode chosen.
If --deploy-mode is client, the gateway machine from which you launch your Spark application is your driver machine.
If --deploy-mode is cluster, the cluster manager (YARN/Mesos) chooses a machine that it judges to have sufficient memory to run the driver.
