How does the distinct() function work in Spark?

I'm a newbie to Apache Spark and am learning the basic functionality.
I have a small doubt. Suppose I have an RDD of (key, value) tuples and want to obtain the unique ones out of them, so I use the distinct() function. I'm wondering: on what basis does the function consider tuples distinct? Is it based on the keys, the values, or both?

.distinct() is definitely doing a shuffle across partitions. To see more of what's happening, run a .toDebugString on your RDD.
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val hashPart = new HashPartitioner(<number of partitions>)
val myRDDPreStep = <load some RDD>

val myRDD = myRDDPreStep.distinct.partitionBy(hashPart).setName("myRDD").persist(StorageLevel.MEMORY_AND_DISK_SER)
myRDD.checkpoint
println(myRDD.toDebugString)
which, for an example RDD of mine (myRDDPreStep is already hash-partitioned by key, persisted with StorageLevel.MEMORY_AND_DISK_SER, and checkpointed), returns:
(2568) myRDD ShuffledRDD[11] at partitionBy at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
+-(2568) MapPartitionsRDD[10] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
| ShuffledRDD[9] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
+-(2568) MapPartitionsRDD[8] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
| myRDDPreStep ShuffledRDD[6] at partitionBy at mycode.scala:193 [Disk Memory Serialized 1x Replicated]
| CachedPartitions: 2568; MemorySize: 362.4 GB; TachyonSize: 0.0 B; DiskSize: 0.0 B
| myRDD[7] at count at mycode.scala:214 [Disk Memory Serialized 1x Replicated]
Note that there may be more efficient ways to get distinct that involve fewer shuffles, ESPECIALLY if your RDD is already partitioned in a smart way and the partitions are not overly skewed.
See Is there a way to rewrite Spark RDD distinct to use mapPartitions instead of distinct?
and
Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()?
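As a rough illustration of the first linked idea (a sketch, not a drop-in replacement): if the existing partitioning already guarantees that identical elements land in the same partition, deduplication can be done locally with mapPartitions and the extra shuffle from distinct can be skipped.

// Deduplicate within each partition only; this is a correct global distinct only
// when identical elements are guaranteed to be co-located (e.g. the RDD is
// hash-partitioned on the whole element). No extra shuffle is introduced.
val locallyDistinct = myRDDPreStep.mapPartitions(
  iter => iter.toSet.iterator,
  preservesPartitioning = true
)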

The API docs for RDD.distinct() only provide a one sentence description:
"Return a new RDD containing the distinct elements in this RDD."
From recent experience I can tell you that in a tuple-RDD the tuple as a whole is considered.
If you want distinct keys or distinct values, then depending on exactly what you want to accomplish, you can either:
A. call groupByKey() to transform {(k1,v11),(k1,v12),(k2,v21),(k2,v22)} to {(k1,[v11,v12]), (k2,[v21,v22])} ; or
B. strip out either the keys or values by calling keys() or values() followed by distinct()
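For example, a minimal sketch of option B (the sample data here is made up):

val pairs = sc.parallelize(Seq((1, 20), (1, 21), (1, 20), (2, 20)))

val distinctKeys   = pairs.keys.distinct()    // 1, 2
val distinctValues = pairs.values.distinct()  // 20, 21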
As of this writing (June 2015) UC Berkeley + EdX is running a free online course, Introduction to Big Data and Apache Spark, which would provide hands-on practice with these functions.

Justin Pihony is right: distinct uses the hashCode and equals methods of the objects for this determination. It returns the distinct elements (objects).
val rdd = sc.parallelize(List((1,20), (1,21), (1,20), (2,20), (2,22), (2,20), (3,21), (3,22)))
Distinct
rdd.distinct.collect().foreach(println)
(2,22)
(1,20)
(3,22)
(2,20)
(1,21)
(3,21)
If you want distinct on the key only, then reduceByKey is a better option.
ReduceByKey
val reduceRDD = rdd.map(tup => (tup._1, tup)).reduceByKey { case (a, b) => a }.map(_._2)
reduceRDD.collect().foreach(println)
Output:
(2,20)
(1,20)
(3,21)

distinct uses the hashCode and equals methods of the objects for this determination. Tuples come with equality built in, delegating down to the equality and position of each component, so distinct will work against the entire Tuple2 object. As Paul pointed out, you can call keys or values and then distinct. Or you can compute your own distinct values via aggregateByKey, which would keep the key pairing. Or, if you want the distinct keys, you could use a regular aggregate.
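For instance, a hedged sketch of "distinct values per key" via aggregateByKey, which keeps the key pairing (the sample data is made up):

val pairs = sc.parallelize(Seq((1, 20), (1, 21), (1, 20), (2, 20), (2, 20)))

// Build a Set per key, so duplicate values collapse map-side before the shuffle.
val distinctValuesPerKey = pairs.aggregateByKey(Set.empty[Int])(
  (set, v) => set + v,  // merge a value into a partition-local set
  (s1, s2) => s1 ++ s2  // merge sets coming from different partitions
)
// => (1, Set(20, 21)), (2, Set(20))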

It looks like the distinct will get rid of (key, value) duplicates.
In the below example (1,20) and (2,20) are repeated twice in myRDD, but after a distinct(), the duplicates are removed.
scala> val myRDD = sc.parallelize(List((1,20), (1,21), (1,20), (2,20), (2,22), (2,20), (3,21), (3,22)))
myRDD: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[1274] at parallelize at <console>:22
scala> myRDD.collect().foreach(println _)
(1,20)
(1,21)
(1,20)
(2,20)
(2,22)
(2,20)
(3,21)
(3,22)
scala> myRDD.distinct.collect().foreach(println _)
(2,22)
(1,20)
(3,22)
(2,20)
(1,21)
(3,21)

Related

What is the performance difference between accumulator and collect() in Spark?

Accumulators are basically shared variables in Spark that are updated by executors but read only by the driver.
collect() in Spark brings all the data from the executors into the driver.
So in both cases I ultimately get the data in the driver only. What, then, is the difference in performance between using an accumulator and using collect() to convert a large RDD into a list?
Code to convert a DataFrame to a list using an accumulator:
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
// JSONObject here is assumed to be scala.util.parsing.json.JSONObject
import scala.util.parsing.json.JSONObject

val queryOutput = spark.sql(query)
val acc = spark.sparkContext.collectionAccumulator[Map[String, Any]]("JsonCollector")
queryOutput.foreach(a => acc.add(convertRowToJSON(a)))
acc.value.asScala.toList

def convertRowToJSON(row: Row): Map[String, Any] = {
  val m = row.getValuesMap(row.schema.fieldNames)
  println(m)
  JSONObject(m).obj
}
Code to convert a DataFrame to a list using collect():
val queryOutput = spark.sql(query)
queryOutput.toJSON.collectAsList()
Convert large RDD to LIST
It is not a good idea. collect will move data from all executors into driver memory. If there is not enough memory, it will throw an OutOfMemory (OOM) exception. If your data fits in the memory of a single machine, you probably don't need Spark.
Spark natively supports accumulators of numeric types, and programmers can add support for new types. They can be used to implement counters (as in MapReduce) or sums. The OUT parameter of an accumulator should be a type that can be read atomically (e.g., Int, Long) or thread-safely (e.g., synchronized collections), because it will be read from other threads.
CollectionAccumulator.value returns a java.util.List (an ArrayList under the hood), and it will likewise throw an OOM if its size exceeds driver memory.
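As a minimal sketch of the difference in intent (assuming a SparkSession `spark` and an existing DataFrame `df`): an accumulator is meant for small side-channel aggregates such as counters, while collect() ships the rows themselves to the driver.

// Counter-style use: only a Long travels back to the driver.
val badRows = spark.sparkContext.longAccumulator("badRows")
df.foreach { row => if (row.anyNull) badRows.add(1) }
println(badRows.value)

// collect(): every row is deserialized into driver memory; OOM risk for large data.
val allRows = df.collect()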

Use Spark groupByKey to dedup RDD which causes a lot of shuffle overhead

I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split the original RDD into two RDDs: one that stores elements with unique keys, and another that stores the rest of the elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique keys RDD (stores elements with unique keys; for multiple elements with the same key, any one element is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated keys RDD (stores the remaining elements, which have duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together, so that for each key there is a sequence of elements. However, the performance of groupByKey() is not good, because the element values are very large, which results in a very large shuffle write.
So I was wondering if there is any better solution. Or is there a way to reduce the amount of data being shuffled when using groupByKey()?
EDIT: Given the new information in the edit, I would first create the unique RDD, and then create the duplicate RDD using the unique one and the original one:
val inputRdd: RDD[(K,V)] = ...
val uniqueRdd: RDD[(K,V)] = inputRdd.reduceByKey((x, y) => x) // keep just a single value for each key
val duplicateRdd = inputRdd
  .join(uniqueRdd)
  .filter { case (k, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, v2)) => (k, v1) } // v2 came from the unique rdd
There is some room for optimization as well.
In the solution above there will be 2 shuffles (reduceByKey and join).
If we repartition the inputRdd by key from the start, we won't need any additional shuffles; using this code should give much better performance:
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions = 200))
Original solution:
You can try the following approach:
First count the number of occurrences of each pair, and then split into the two RDDs.
val inputRdd: RDD[(K,V)] = ...

val countRdd: RDD[((K,V), Int)] = inputRdd
  .map((_, 1))
  .reduceByKey(_ + _)
  .cache

val uniqueRdd = countRdd.map(_._1)

val duplicateRdd = countRdd
  .filter(_._2 > 1)
  .flatMap { case (kv, count) =>
    (1 to count - 1).map(_ => kv)
  }
Please use combineByKey, which results in the use of a combiner on the map task and hence reduces the shuffled data.
The combiner logic depends on your business logic (a sketch follows the list below).
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
There are multiple ways to reduce shuffle data:
1. Write less from the map task by using a combiner.
2. Send aggregated, serialized objects from map to reduce.
3. Use combineInputFormats to enhance the efficiency of combiners.
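A hedged sketch of what a map-side combiner could look like here, using combineByKey to keep only the first value seen per key (the actual combiner logic depends on your business logic; the sample data is made up):

// Keep one value per key: only one record per key per partition is written to
// the shuffle, instead of every duplicate value.
val pairs = sc.parallelize(Seq(("k1", "v1"), ("k1", "v2"), ("k2", "v4")))

val onePerKey = pairs.combineByKey(
  (v: String) => v,              // createCombiner: first value seen in a partition
  (c: String, _: String) => c,   // mergeValue: ignore later values in the same partition
  (c1: String, _: String) => c1  // mergeCombiners: keep one value across partitions
)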

Is Dataset.rdd an action or transformation?

One of the ways to evaluate whether a DataFrame is empty or not is to do df.rdd.isEmpty(); however, I see rdd at mycode.scala:123 in the Spark UI executions, which makes me wonder whether this rdd() function is actually an action instead of a transformation.
I know that isEmpty() is an action, but I do see a separate stage with isEmpty() at mycode.scala:234, so I think they are different actions?
rdd is generated to represent a structured query in "RDD terms" so Spark can execute it. It is an RDD of JVM objects of your type T. If used incorrectly, it can cause memory problems since it:
Transfers internally-managed optimized rows that live outside JVM to the memory space in JVM
Transforms the binary rows to your business objects (the JVM "true" representation)
The first increases the JVM memory required for the computation, while the latter is an extra transformation step.
For such a simple calculation where you count the number of rows, you'd rather stick to count, as the optimized and fairly cheap computation (it avoids copying objects and applying the schema).
Internally, a Dataset keeps rows in their internal binary format as InternalRow. That decreases the JVM memory requirement for your Spark application. The RDD (from rdd) is computed to represent the Spark transformations that are going to be executed once a Spark action is executed.
Please note that executing rdd creates an RDD and does require some calculation too.
So, yes, rdd might be considered an action as it "executes" the query (i.e. the physical plan of the Dataset that sits behind), but in the end it just gives an RDD (so it can't be an action by definition since Spark actions return a non-RDD value).
As you can see in the code:
lazy val rdd: RDD[T] = {
  val objectType = exprEnc.deserializer.dataType
  val deserialized = CatalystSerde.deserialize[T](logicalPlan) // <-- HERE, see explanation below
  sparkSession.sessionState.executePlan(deserialized).toRdd.mapPartitions { rows =>
    rows.map(_.get(0, objectType).asInstanceOf[T])
  }
}
rdd is computed lazily and only once.
One of the ways to evaluate whether a DataFrame is empty or not is to do df.rdd.isEmpty()
I wonder where you found that. I'd just count:
count(): Long Returns the number of rows in the Dataset.
toRdd Lazy Value
If you insist on going fairly low-level to check whether your Dataset is empty or not, I'd rather use Dataset.queryExecution.toRdd instead. That's almost like rdd, but without the extra copying and applying of the schema.
df.queryExecution.toRdd.isEmpty
Compare the following RDD lineages and think about which seems better.
val dataset = spark.range(5).withColumn("group", 'id % 2)
scala> dataset.rdd.toDebugString
res1: String =
(8) MapPartitionsRDD[8] at rdd at <console>:26 [] // <-- extra deserialization step
| MapPartitionsRDD[7] at rdd at <console>:26 []
| MapPartitionsRDD[6] at rdd at <console>:26 []
| MapPartitionsRDD[5] at rdd at <console>:26 []
| ParallelCollectionRDD[4] at rdd at <console>:26 []
// Compare with a more memory-optimized alternative
// Avoids copies and has no schema
scala> dataset.queryExecution.toRdd.toDebugString
res2: String =
(8) MapPartitionsRDD[11] at toRdd at <console>:26 []
| MapPartitionsRDD[10] at toRdd at <console>:26 []
| ParallelCollectionRDD[9] at toRdd at <console>:26 []
From a Spark perspective, the transformations are fairly cheap since they don't cause any shuffles, but given that the memory requirements differ between the two computations, I'd use the latter (with toRdd).
rdd Lazy Value
rdd represents the content of the Dataset as a (lazily-created) RDD with rows of the JVM type T.
rdd: RDD[T]
As you can see in the source code (pasted above), requesting rdd in the end will trigger one extra computation just to get the RDD.
It creates a new logical plan to deserialize the Dataset's logical plan, i.e. you get an extra deserialization from the internal binary row format (managed outside the JVM) to its corresponding representation as JVM objects living inside the JVM (think of the GC pressure you should avoid at all costs).
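Putting the three options next to each other (a sketch assuming an existing DataFrame `df`):

// 1. Stays inside the optimized Dataset API.
val empty1 = df.count() == 0

// 2. Deserializes internal rows into JVM objects just to check emptiness.
val empty2 = df.rdd.isEmpty()

// 3. Low-level: operates on the internal binary rows, no extra copy or deserialization.
val empty3 = df.queryExecution.toRdd.isEmpty()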

Use of partitioners in Spark

Hi, I have a question about partitioning in Spark. In the Learning Spark book, the authors say that partitioning can be useful, for example during PageRank (page 66), and they write:
since links is a static dataset, we partition it at the start with
partitionBy(), so that it does not need to be shuffled across the
network
Now I'm focusing on this example, but my questions are general:
Why does a partitioned RDD not need to be shuffled?
partitionBy() is a wide transformation, so it will produce a shuffle anyway, right?
Could someone illustrate a concrete example and what happens on each single node when partitionBy happens?
Thanks in advance
Why does a partitioned RDD not need to be shuffled?
When the author does:
val links = sc.objectFile[(String, Seq[String])]("links")
.partitionBy(new HashPartitioner(100))
.persist()
He's partitioning the data set into 100 partitions where each key will be hashed to a given partition (pageId in the given example). This means that the same key will be stored in a single given partition. Then, when he does the join:
val contributions = links.join(ranks)
All chunks of data with the same pageId should already be located on the same executor, avoiding the need for a shuffle between different nodes in the cluster.
partitionBy() is a wide transformation, so it will produce a shuffle anyway, right?
Yes, partitionBy produces a ShuffledRDD[K, V, V]:
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
  if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  if (self.partitioner == Some(partitioner)) {
    self
  } else {
    new ShuffledRDD[K, V, V](self, partitioner)
  }
}
Could someone illustrate a concrete example and what happens on each single node when partitionBy happens?
Basically, partitionBy will do the following:
It will hash the key modulo the number of partitions (100 in this case), and since it relies on the fact that the same key always produces the same hash code, it will send all data for a given id (in our case, pageId) to the same partition, so that when you join, all the data for that key is already available in that partition, avoiding the need for a shuffle.
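A small sketch in the spirit of the book's PageRank example (the link data here is made up):

import org.apache.spark.HashPartitioner

// Partition the links once and keep them in memory: from now on the same keys
// always live in the same partition.
val links = sc.parallelize(Seq(
  ("pageA", Seq("pageB")),
  ("pageB", Seq("pageA", "pageC")),
  ("pageC", Seq("pageA"))
)).partitionBy(new HashPartitioner(100)).persist()

// mapValues preserves the partitioner, so ranks is co-partitioned with links.
var ranks = links.mapValues(_ => 1.0)

// The join can be computed partition by partition without re-shuffling links.
val contributions = links.join(ranks)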

Apache spark applying map transformation on RDDs

I have a HadoopRDD from which I'm creating a first RDD with a simple map function, and then a second RDD from the first RDD with another simple map function. Something like:
HadoopRDD -> RDD1 -> RDD2.
My question is whether Spark will iterate over the HadoopRDD record by record to generate RDD1 and then iterate over RDD1 record by record to generate RDD2, or whether it iterates over the HadoopRDD and then generates RDD1 and RDD2 in one go.
Short answer: rdd.map(f).map(g) will be executed in one pass.
tl;dr
Spark splits a job into stages. A stage applied to a partition of data is a task.
In a stage, Spark will try to pipeline as many operations as possible. "Possible" is determined by the need to rearrange data: an operation that requires a shuffle will typically break the pipeline and create a new stage.
In practical terms:
Given `rdd.map(...).map(...).filter(...).sort(...).map(...)`,
this will result in two stages:
.map(...).map(..).filter(...)
.sort(...).map(...)
This can be retrieved from an RDD using rdd.toDebugString.
The same job example above will produce this output:
val mapped = rdd.map(identity).map(identity).filter(_>0).sortBy(x=>x).map(identity)
scala> mapped.toDebugString
res0: String =
(6) MappedRDD[9] at map at <console>:14 []
| MappedRDD[8] at sortBy at <console>:14 []
| ShuffledRDD[7] at sortBy at <console>:14 []
+-(8) MappedRDD[4] at sortBy at <console>:14 []
| FilteredRDD[3] at filter at <console>:14 []
| MappedRDD[2] at map at <console>:14 []
| MappedRDD[1] at map at <console>:14 []
| ParallelCollectionRDD[0] at parallelize at <console>:12 []
Now, coming to the key point of your question: pipelining is very efficient. The complete pipeline will be applied to each element of each partition once. This means that rdd.map(f).map(g) will perform as fast as rdd.map(f andThen g) (with some negligible overhead).
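A quick sketch of that equivalence (the functions here are illustrative):

val f = (x: Int) => x + 1
val g = (x: Int) => x * 2
val nums = sc.parallelize(1 to 10)

// Both run in a single pipelined stage; no intermediate collection is materialized.
val chained  = nums.map(f).map(g)
val composed = nums.map(f andThen g)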
Apache Spark will iterate over the HadoopRDD record by record in no specific order (the data will be split and sent to the workers) and "apply" the first transformation to compute RDD1. After that, the second transformation is applied to each element of RDD1 to get RDD2, again in no specific order, and so on for successive transformations. You can see this from the map method signature:
// Return a new RDD by applying a function to all elements of this RDD.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Apache Spark follows a DAG (Directed Acyclic Graph) execution engine. It won't actually trigger any computation until a value is required, so you have to distinguish between transformations and actions.
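For example (a minimal sketch assuming an existing SparkContext `sc`):

val doubled = sc.parallelize(1 to 100).map(_ * 2) // transformation: nothing executes yet
val plusOne = doubled.map(_ + 1)                  // still a transformation: nothing executes
val total   = plusOne.sum()                       // action: triggers the whole pipeline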
EDIT:
In terms of performance, I am not completely aware of the underlying implementation of Spark, but I understand there shouldn't be a significant performance loss other than adding extra (unnecessary) tasks in the related stage. From my experience, you don't normally use successive transformations of the same "nature" (in this case, two successive maps). You should be more concerned about performance when shuffle operations take place, because you are moving data around, and that has a clear impact on your job's performance. Here you can find a common issue regarding that.
