What algorithm Spark uses to bring the same keys together - apache-spark

What algorithm does Spark use to identify the same keys and push the data to the next stage?
Scenarios include:
When I apply distinct(), I understand a partial distinct is applied in the current stage and then the data is shuffled to the next stage. In this case, all identical keys need to end up in the same partition in the next stage.
When Dataset1 joins with Dataset2 (SortMergeJoin), all identical keys in Dataset1 and Dataset2 need to be in the same partition in the next stage.
There are other scenarios as well, but this is the overall picture.
How does Spark do this efficiently? And is there any time lag between Stage 1 and Stage 2 while the matching keys are identified?

By default, the algorithm Spark uses to partition the data is hash partitioning. Also, stages don't push data to the next stage; the next stage pulls the data from the previous one.
Spark creates a stage boundary whenever a shuffle is needed. The second stage waits until all tasks in the first stage complete and write their output to temporary files. The second stage then starts pulling the data needed for its partitions from across the partitions written by stage 1.
Distinct isn't as simple as it looks. Spark implements distinct by applying aggregates. Shuffling is needed because duplicates can live in multiple partitions. One of the requirements for shuffling is a pair RDD; if your parent RDD isn't one, Spark creates intermediary pair RDDs.
If you look at the logical plan of distinct, it is more or less:
ParentRDD ---> MappedRDD (record as key, null as value) ---> MapPartitionsRDD (partial distinct at the partition level) ---> ShuffledRDD (pulling the needed partitions' data) ---> MapPartitionsRDD (distinct across the shuffled partitions for each key) ---> MappedRDD (keeping only the keys and discarding the null values for the result)
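In RDD terms this is, more or less, the following (a hedged sketch, assuming a SparkContext named sc; the exact operators can differ between Spark versions):

// Conceptually: map each record to (record, null), combine per key so the shuffle
// only moves one copy of each key, then keep the keys and drop the null values.
val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3), 3)

val distinctLike = data
  .map(x => (x, null))          // pair RDD: the record becomes the key
  .reduceByKey((x, _) => x)     // map-side partial combine, then shuffle by key
  .map(_._1)                    // keep only the keys

distinctLike.collect()          // Array(1, 2, 3), in some order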

Spark uses RDD dependencies to determine how data is shuffled to the next stage, and working out which dependency applies is a fairly involved process.
The getDependencies function in RDD.scala describes how an RDD depends on its parents:
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq[Dependency[_]] = deps
Some RDDs, such as a DataSourceRDD, have no parent RDD at all, so their compute does not need to fetch data from a parent.
A ShuffledRowRDD, on the other hand, usually appears in the middle of a chain of computations, so it does have parent data to fetch.
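As a quick illustration (a hedged sketch, assuming a SparkContext named sc), you can look at the dependency types Spark records for a narrow versus a wide transformation:

// A narrow dependency (map) keeps data in place; reduceByKey introduces a
// ShuffleDependency, which is where the next stage pulls data from its parent.
val base    = sc.parallelize(1 to 100, 4)
val mapped  = base.map(x => (x % 10, x))
val reduced = mapped.reduceByKey(_ + _)

println(mapped.dependencies.map(_.getClass.getSimpleName))   // List(OneToOneDependency)
println(reduced.dependencies.map(_.getClass.getSimpleName))  // List(ShuffleDependency)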

Related

What Transformation should I apply on Spark DataFrame

I have 2 Spark dataframes (A & B) having a common column/field in both (which is a primary key in DataFrame A but not in B).
For each record/row in dataframe A, there are multiple records in dataframe B.
Based on that common column value I want to fetch all records from dataframe B against each record in dataframe A.
What kind of transformation should I perform in order to collect the records together without doing much shuffling?
To combine the records from two or more Spark DataFrames, a join is necessary.
If your data is not partitioned/bucketed well, it will lead to a shuffle join, in which every node talks to every other node and they exchange data according to which node holds a certain key or set of keys (the ones you are joining on). These joins are expensive because the network can become congested with traffic.
The shuffle can be avoided if:
Both DataFrames have a known partitioner or are bucketed.
One of the datasets is small enough to fit in memory, in which case we can do a broadcast hash join.
Partitioning
If you partition your data correctly prior to a join, execution can be much more efficient: even if a shuffle is planned, when the matching data from both DataFrames is already located on the same machine, Spark can avoid moving it again.
import org.apache.spark.sql.functions.col

// Repartition both DataFrames by the join key so matching keys are co-located
val df1ById = df1.repartition(col("id"))
val df2ById = df2.repartition(col("id"))
// Optionally specify the number of partitions: df1.repartition(10, col("id"))

// Join the DataFrames on the id column; passing the column name as a string
// avoids duplicate id columns in the output DataFrame.
val joined = df1ById.join(df2ById, "id")
Broadcast Hash join
When one of the datasets is small enough to fit into the memory of a single worker node, we can optimize the join.
Spark will replicate the small DataFrame onto every worker node in the cluster (be it located on one machine or many). Now this sounds expensive. However, what this does is prevent us from performing the all-to-all communication during the join. Instead, the data is sent only once at the beginning, and then each individual worker node performs its work without having to wait for or communicate with any other worker node.
import org.apache.spark.sql.functions.broadcast
// Explicitly give the broadcast hint, although Spark can also choose a broadcast join on its own.
df1.join(broadcast(df2), "id")
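To check which join strategy the planner actually chose (a hedged sketch reusing the df1/df2 names from above), inspect the physical plan with explain():

// Look for BroadcastHashJoin vs. SortMergeJoin (with Exchange nodes) in the output.
df1.join(broadcast(df2), "id").explain()

df1.repartition(col("id"))
   .join(df2.repartition(col("id")), "id")
   .explain()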

How Spark shuffle operation works?

I'm learning Spark for my project and I'm stuck on the shuffle process in Spark. I want to know how this operation works internally. I found some keywords involved in this operation: ShuffleMapStage, ShuffleMapTask, ShuffledRDD, Shuffle Write, Shuffle Read...
My questions are:
1) Why do we need a ShuffleMapStage? When is this stage created and how does it work?
2) When is ShuffledRDD's compute method called?
3) What are Shuffle Read and Shuffle Write?
The shuffle operation redistributes related data across the workers (a repartitioning), using a hash function on the data's key, to address the data locality problem.
This operation involves transferring data over the network to organize it before an action is performed; reducing the number of shuffle operations improves performance.
Shuffle operations are triggered automatically by Spark between two transformations when needed to execute a final action.
Some Spark transformations need a shuffle (like groupBy, join, sortBy).
Some Spark operations don't need a shuffle (like union, map, reduce, filter, count).
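As a hedged sketch (assuming a SparkContext named sc), the lineage shows exactly where a shuffle is introduced:

// reduceByKey adds a ShuffledRDD to the lineage (shuffle write in the first
// stage, shuffle read in the second); mapValues does not.
val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 3)
val mapped  = pairs.mapValues(_ * 10)
val reduced = pairs.reduceByKey(_ + _)

println(mapped.toDebugString)   // no ShuffledRDD: a single stage
println(reduced.toDebugString)  // contains a ShuffledRDD: two stages at runtime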

kafka streaming behaviour for more than one partition

I am consuming from a Kafka topic. This topic has 3 partitions.
I am using foreachRDD to process each batch RDD (using a processData method to process each RDD, and ultimately creating a DataSet from that).
Now, you can see that I have a count variable, and I am incrementing it in the processData method to check how many records I have actually processed. (I understand each RDD is a collection of records from the Kafka topic, and the number depends on the batch interval size.)
Now, the output is something like this:
1 1 1 2 3 2 4 3 5 ....
This makes me think that it's because I might have 3 consumers (as I have 3 partitions), and each of these will call the foreachRDD method separately, so the same count is being printed more than once, as each consumer might have cached its own copy of count.
But the final output DataSet that I get has all the records.
So, does Spark internally union all the data? How does it work out what to union?
I am trying to understand the behaviour so that I can form my logic.
int count = 0;

messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<K, V>>>() {
    public void call(JavaRDD<ConsumerRecord<K, V>> rdd) {
        System.out.println("Number of elements in RDD: " + rdd.count());

        // processData(record) returns a List<Row>; reduce merges the per-record
        // lists into a single local List<Row> on the driver.
        List<Row> rows = rdd.map(record -> processData(record))
                .reduce((rows1, rows2) -> {
                    rows1.addAll(rows2);
                    return rows1;
                });

        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> ds = ss.createDataFrame(rows, schema);
        ds.createOrReplaceTempView("trades");
        ds.show();
    }
});
The assumptions are not completely accurate.
foreachRDD is one of the so-called output operations in Spark Streaming. The role of output operations is to schedule the provided closure at the interval dictated by the batch interval. The code in that closure executes once each batch interval on the Spark driver; it is not distributed across the cluster.
In particular, foreachRDD is a general purpose output operation that provides access to the underlying RDD within the DStream. Operations applied on that RDD will execute on the Spark cluster.
So, coming back to the code of the original question, code in the foreachRDD closure such as System.out.println("Number of elements in RDD: " + rdd.count()); executes on the driver. That's also why we can see the output in the console. Note that rdd.count() in this print triggers a count of the RDD on the cluster, so count is a distributed operation that returns a value to the driver; then, on the driver, the print takes place.
Now comes a transformation of the RDD:
rdd.map(record -> processData(record))
As mentioned, operations applied to the RDD execute on the cluster. That execution follows the Spark execution model: transformations are assembled into stages and applied to each partition of the underlying dataset. Given that we are dealing with 3 Kafka topic partitions, we will have 3 corresponding partitions in Spark. Hence, the map (and with it processData) runs as one task per partition.
So, does Spark internally union all the data? How does it work out what to union?
Just as Spark Streaming has output operations, Spark has actions. Actions potentially apply an operation to the data and bring the results back to the driver. The simplest one is collect, which brings the complete dataset to the driver, with the risk that it might not fit in memory. Another common action, count, returns the number of records in the dataset as a single number to the driver.
In the code above, we are using reduce, which is also an action: it applies the provided function and brings the resulting data to the driver. That action is what "internally unions all the data", as phrased in the question. In the reduce expression, we are effectively collecting all of the distributed data into a single local collection. It would be roughly equivalent to: rdd.map(record -> processData(record)).collect()
If the intention is to create a Dataset, we should avoid "moving" all the data to the driver first.
A better approach would be:
// Keep the processing distributed: flatten the per-record row lists on the
// executors and build the DataFrame directly from the distributed RDD.
JavaRDD<Row> rows = rdd.flatMap(record -> processData(record).iterator());
Dataset<Row> df = ss.createDataFrame(rows, schema);
...
In this case, the data of all partitions will remain local to the executor where they are located.
Note that moving data to the driver should be avoided. It is slow and in cases of large datasets will probably crash the job as the driver cannot typically hold all data available in a cluster.

Why these two Spark RDDs generation ways have different data localities?

I am running two different ways of generating RDDs on a local machine. The first way is:
rdd = sc.range(0, 100).sortBy(lambda x: x, numPartitions=10)
rdd.collect()
The second way is:
rdd = sc.parallelize(xrange(100), 10)
rdd.collect()
But in the Spark UI they show different data locality, and I don't know why. Below is the result from the first way; it shows the Locality Level (the 5th column) is ANY.
The result from the second way shows the Locality Level is PROCESS_LOCAL.
And I read from https://spark.apache.org/docs/latest/tuning.html that the PROCESS_LOCAL level is usually faster to process than the ANY level.
Is this because the sortBy operation gives rise to a shuffle, which then influences the data locality? Can someone give me a clearer explanation?
You are correct.
In the first snippet you first create a parallelized collection, meaning your driver tells each worker to create some part of the collection. Then, for the sort, each worker node needs access to data on other nodes, so data has to be shuffled around and data locality is lost.
The second code snippet is effectively not even a distributed job.
As Spark uses lazy evaluation, nothing is done until you call something to materialize the results, in this case the collect method. The steps in your second computation are effectively:
Distribute the object of type list from driver to worker nodes
Do nothing on each worker node
Collect distributed objects from workers to create object of type list on driver.
Spark is smart enough to realize that there is no reason to distribute the list even though parallelize is called. Since the data resides and the computation is done on the same single node, data locality is obviously preserved.
EDIT:
Some additional info on how Spark does sort.
Spark operates on the underlying MapReduce model (the programming model, not the Hadoop implementation) and sort is implemented as a single map and a reduce. Conceptually, on each node in the map phase, the part of the collection that a particular node operates on is sorted and written to memory. The reducers then pull relevant data from the mappers, merge the results and create iterators.
So, for your example, let's say you have a mapper that wrote numbers 21-34 to memory in sorted order. Let's say the same node has a reducer that is responsible for numbers 31-40. The reducer gets information from driver where the relevant data is. The numbers 31-34 are pulled from the same node and data only has to travel between threads. The other numbers however can be on arbitrary nodes in the cluster and need to be transferred over the network. Once the reducer has pulled all the relevant data from the nodes, the shuffle phase is over. The reducer now merges the results (like in mergesort) and creates an iterator over the sorted part of the collection.
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
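To see the shuffle described above (a hedged Scala sketch mirroring the two PySpark snippets from the question), compare the lineages of the two RDDs:

// The sortBy version contains a ShuffledRDD, which is where the locality level
// drops to ANY; the plain parallelize version has no shuffle in its lineage.
val sorted = sc.range(0, 100).sortBy(x => x, numPartitions = 10)
val plain  = sc.parallelize(0 until 100, 10)

println(sorted.toDebugString)  // ... ShuffledRDD ... ParallelCollectionRDD
println(plain.toDebugString)   // ParallelCollectionRDD only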

What is the result of RDD transformation in Spark?

Can anyone explain what the result of an RDD transformation is? Is it a new set of data (a copy of the data), or is it only a new set of pointers to filtered blocks of the old data?
RDD transformations allow you to create dependencies between RDDs. Dependencies are only steps for producing results (a program). Each RDD in the lineage chain (the string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD. Spark will divide RDD dependencies into stages and tasks and send those to workers for execution.
So if you do this:
val lines = sc.textFile("...")
val words = lines.flatMap(line => line.split(" "))
val localwords = words.collect()
words will be an RDD containing a reference to the lines RDD. When the program is executed, first lines' function will be executed (load the data from a text file), then words' function will be executed on the resulting data (split lines into words). Spark is lazy, so nothing gets executed until you call an action that triggers job creation and execution (collect in this example).
So, an RDD (transformed RDD, too) is not 'a set of data', but a step in a program (might be the only step) telling Spark how to get the data and what to do with it.
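A hedged sketch of that laziness (assuming a SparkContext named sc): the side effect inside map runs only when an action forces evaluation.

val numbers = sc.parallelize(1 to 3)

// Declaring the transformation prints nothing: only the lineage is recorded.
val traced = numbers.map { x => println(s"computing $x"); x * 2 }

// The action triggers the job; in local mode the println output appears now
// (on a cluster it would appear in the executor logs instead).
traced.collect()   // Array(2, 4, 6)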
Transformations create a new RDD based on the existing RDD; RDDs are immutable.
All transformations in Spark are lazy. Data in an RDD is not processed until an action is performed.
Examples of RDD transformations:
map, filter, flatMap, groupByKey, reduceByKey
As others have mentioned, an RDD maintains a list of all the transformations which have been programmatically applied to it. These are lazily evaluated, so though (in the REPL, for example) you may get back a result of a different parameter type (after applying a map, for example), the 'new' RDD doesn't yet contain anything, because nothing has forced the original RDD to evaluate the transformations/filters in its lineage. Methods such as count and the various reduction methods will cause the transformations to be applied. The checkpoint method materializes the result of the transformations as well, yielding an RDD whose lineage is truncated (this can be a performance advantage, especially with iterative applications).
All the answers are perfectly valid; I just want to add a quick summary :-)
Transformations are operations which transform your RDD data from one form to another. When you apply such an operation to an RDD, you get back a new RDD with the transformed data (RDDs in Spark are immutable, remember?). Operations like map, filter, and flatMap are transformations.
An important point: applying a transformation to an RDD does not perform the operation immediately. Spark creates a DAG (Directed Acyclic Graph) from the applied operation, the source RDD and the function used for the transformation, and it keeps building this graph using these references until you apply an action on the last lined-up RDD. That is why transformations in Spark are lazy.
The other answers already give a good explanation. Here are my two cents:
To know what's inside that returned RDD, it's best to check what the RDD abstract class specifies (quoted from the source code):
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
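A hedged sketch of those five properties expressed in code, as a trivial custom RDD that produces a fixed range of numbers (names like SimpleRangeRDD and RangePartition are illustrative, not from the original post):

import org.apache.spark.{Partition, Partitioner, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One Partition object per slice of the range.
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

// A trivial RDD with no parents (Nil dependencies): each partition computes its
// own slice of 0 until n.
class SimpleRangeRDD(sc: SparkContext, n: Int, slices: Int)
  extends RDD[Int](sc, Nil) {

  // 1. A list of partitions
  override protected def getPartitions: Array[Partition] =
    (0 until slices).map { i =>
      new RangePartition(i, i * n / slices, (i + 1) * n / slices)
    }.toArray

  // 2. A function for computing each split
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.start until p.end).iterator
  }

  // 3. Dependencies on other RDDs: none here (the Nil passed to the constructor)
  // 4. Optionally, a Partitioner: none, since this is not a key-value RDD
  override val partitioner: Option[Partitioner] = None

  // 5. Optionally, preferred locations per split: none for in-memory data
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}

// Usage: new SimpleRangeRDD(sc, 100, 4).collect() returns 0 to 99.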
