Union with an existing RDD which is a set in pyspark - apache-spark

Given a set U, which is stored in an RDD named rdd.
What is the recommended way to merge any given RDD rdd_not_set with rdd such that the resulting rdd is also a set?
rdd = sc.union([rdd, rdd_not_set])
rdd = rdd.reduceByKey(reduce_func)
Ex: rdd = sc.parallelize([(1,2), (2,3)]) and rdd_not_set = sc.parallelize([(1,4), (3,4)]), and the resulting final_rdd = sc.parallelize([(1,4), (2,3), (3,4)])
The naive solution is to perform a union and then reduceByKey, which would be very inefficient as rdd will be huge in size.
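For reference, a minimal runnable sketch of that naive approach on the example data; it assumes reduce_func behaves like max for duplicate keys, which happens to reproduce the example output (the real merge logic is whatever reduce_func implements):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
rdd = sc.parallelize([(1, 2), (2, 3)])          # the existing "set" U
rdd_not_set = sc.parallelize([(1, 4), (3, 4)])  # new elements to merge in
# Union, then collapse duplicate keys. max stands in for reduce_func here,
# so (1, 4) wins over (1, 2) as in the example.
final_rdd = sc.union([rdd, rdd_not_set]).reduceByKey(max)
print(sorted(final_rdd.collect()))              # [(1, 4), (2, 3), (3, 4)]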

Related

Spark: How to compare two RDDs by key

I want to compare two RDDs by their common keys. So I first filter each RDD by key, then compare the resulting sub-RDDs.
For example,
def compare(rdd1, rdd2):
    do_something()
rdd = sc.textFile(path1)   # each RDD element is dict type
rdd2 = sc.textFile(path2)
pair_rdd = rdd.flatMap(lambda x: x.keys()).zip(rdd.flatMap(lambda x: x.values()))
pair_rdd2 = rdd2.flatMap(lambda x: x.keys()).zip(rdd2.flatMap(lambda x: x.values()))
for feat in set(pair_rdd.keys().distinct().collect()) & \
        set(pair_rdd2.keys().distinct().collect()):
    pair_rdd_filter = pair_rdd.filter(lambda x: x[0] == feat).map(lambda x: x[1])
    pair_rdd_filter2 = pair_rdd2.filter(lambda x: x[0] == feat).map(lambda x: x[1])
    compare(pair_rdd_filter, pair_rdd_filter2)
For convenience, here is an example of the RDDs:
rdd = sc.parallelize([{'f':[1,2,3]},{'f':[1,20],'a':[1]}])
rdd2 = sc.parallelize([{'f':[3,4],'a':[23]},{'f':[2,100,10,2],'a':[3,10,3],'b':[3]}])
But I find that using collect() to get the common keys forces the RDDs to be computed, which costs a lot of time.
How can I make this code run efficiently?
The issue here is that calling .collect() moves all the data to the driver, which then does the set intersection. To utilise distributed execution, use join instead:
pair_rdd.join(pair_rdd2)
This will output an RDD keyed by the common keys, with values as a tuple (pair_rdd value, pair_rdd2 value).
It can also be used to e.g. get common keys:
pair_rdd.join(pair_rdd2).keys().distinct()
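For illustration, a minimal PySpark sketch of the join-based approach on the question's sample data. It assumes the (key, value-list) pairs can be built with dict.items() instead of the zip trick, and compare_values is a hypothetical per-key stand-in for the question's compare(); note that a key appearing in several dicts on one side yields one joined pair per combination:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
rdd = sc.parallelize([{'f': [1, 2, 3]}, {'f': [1, 20], 'a': [1]}])
rdd2 = sc.parallelize([{'f': [3, 4], 'a': [23]}, {'f': [2, 100, 10, 2], 'a': [3, 10, 3], 'b': [3]}])
pair_rdd = rdd.flatMap(lambda d: d.items())    # (key, value-list) pairs
pair_rdd2 = rdd2.flatMap(lambda d: d.items())
# join keeps only the common keys and pairs up the value lists on the executors
joined = pair_rdd.join(pair_rdd2)              # (key, (values_from_rdd, values_from_rdd2))
def compare_values(v1, v2):                    # hypothetical per-key comparison
    return len(set(v1) & set(v2))              # e.g. size of the overlap
print(joined.mapValues(lambda vs: compare_values(*vs)).collect())
print(joined.keys().distinct().collect())      # the common keys themselves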

Use Spark groupByKey to dedup RDD which causes a lot of shuffle overhead

I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split the original RDD into two RDDs: one stores elements with unique keys, and the other stores the remaining elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique-keys RDD (stores one element per key; for multiple elements with the same key, any one of them is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated-keys RDD (stores the remaining elements with duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together. For each key there is then a sequence of elements. However, the performance of groupByKey() is not good because the element values are very large, which causes a very large shuffle write.
So I was wondering if there is any better solution. Or is there a way to reduce the amount of data being shuffled when using groupByKey()?
EDIT: given the new information in the edit, I would first create the unique RDD, and then the duplicate RDD using the unique one and the original one:
val inputRdd: RDD[(K,V)] = ...
val uniqueRdd: RDD[(K,V)] = inputRdd.reduceByKey((x, y) => x) // keep just a single value for each key
val duplicateRdd = inputRdd
  .join(uniqueRdd)
  .filter { case (k, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, v2)) => (k, v1) } // v2 came from the unique RDD
There is also some room for optimization.
In the solution above there will be 2 shuffles (reduceByKey and join).
If we repartition the inputRdd by key from the start, we won't need any additional shuffles; using this code should produce much better performance:
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions = 200))
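A PySpark sketch of the same idea, with placeholder data and the same 200 partitions as in the Scala line above; after the initial partitionBy, both reduceByKey and join reuse the existing partitioning instead of shuffling again, so the whole sequence should incur a single shuffle:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
# Placeholder data standing in for the question's <k, v> pairs.
input_rdd = sc.parallelize([("k1", "v1"), ("k1", "v2"), ("k1", "v3"),
                            ("k2", "v4"), ("k2", "v5"), ("k3", "v6")])
partitions = 200
partitioned = input_rdd.partitionBy(partitions)                # the one up-front shuffle
unique_rdd = partitioned.reduceByKey(lambda x, y: x, numPartitions=partitions)
duplicate_rdd = (partitioned
                 .join(unique_rdd, numPartitions=partitions)
                 .filter(lambda kv: kv[1][0] != kv[1][1])      # drop the copy kept in unique_rdd
                 .map(lambda kv: (kv[0], kv[1][0])))
print(sorted(unique_rdd.collect()))     # one element per key
print(sorted(duplicate_rdd.collect()))  # the remaining elements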
Original Solution:
You can try the following approach:
first count the number of occurrences of each pair, and then split into the 2 RDDs
val inputRdd: RDD[(K,V)] = ...
val countRdd: RDD[((K,V), Int)] = inputRdd
  .map((_, 1))
  .reduceByKey(_ + _)
  .cache
val uniqueRdd = countRdd.map(_._1)
val duplicateRdd = countRdd
  .filter(_._2 > 1)
  .flatMap { case (kv, count) =>
    (1 to count - 1).map(_ => kv)
  }
Use combineByKey, which results in a combiner being applied in the map task and hence reduces the amount of shuffled data. The combiner logic depends on your business logic (see the sketch after the list below).
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
There are multiple ways to reduce shuffle data:
1. Write less data from the map task by using a combiner.
2. Send aggregated, serialized objects from map to reduce.
3. Use combine input formats to enhance the efficiency of combiners.
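As a concrete illustration of point 1, a PySpark sketch with placeholder data: combineByKey lets each map task keep only one value per key before anything is shuffled, which is what the dedup needs. (reduceByKey((x, y) => x) from the answer above follows the same pattern; combineByKey just exposes the combiner logic explicitly.)
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
# Placeholder data standing in for the question's <k, v> pairs.
pairs = sc.parallelize([("k1", "v1"), ("k1", "v2"), ("k1", "v3"),
                        ("k2", "v4"), ("k2", "v5"), ("k3", "v6")])
# The map-side combiner (mergeValue) already discards extra values for a key,
# so only one value per key per partition is written to the shuffle.
unique_rdd = pairs.combineByKey(
    lambda v: v,                # createCombiner: first value seen for a key
    lambda kept, _new: kept,    # mergeValue: map side, keep what we already have
    lambda kept, _other: kept)  # mergeCombiners: reduce side, still keep just one
print(sorted(unique_rdd.collect()))  # one element per key, e.g. [('k1', 'v1'), ('k2', 'v4'), ('k3', 'v6')]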

Apache Spark DAG behaviour cogrouped operation

I would like some clarification about the DAG behaviour, and how exactly the following job is handled:
val rdd = sc.parallelize(List(1 to 10).flatMap(x => x).zipWithIndex, 3)
  .partitionBy(new HashPartitioner(4))
val rdd1 = sc.parallelize(List(1 to 10).flatMap(x => x).zipWithIndex, 2)
  .partitionBy(new HashPartitioner(3))
val rdd2 = rdd.join(rdd1)
rdd2.collect()
This is the related rdd2.toDebugString:
(4) MapPartitionsRDD[6] at join at IntegrationStatusJob.scala:92 []
| MapPartitionsRDD[5] at join at IntegrationStatusJob.scala:92 []
| CoGroupedRDD[4] at join at IntegrationStatusJob.scala:92 []
| ShuffledRDD[1] at partitionBy at IntegrationStatusJob.scala:90 []
+-(3) ParallelCollectionRDD[0] at parallelize at IntegrationStatusJob.scala:90 []
+-(3) ShuffledRDD[3] at partitionBy at IntegrationStatusJob.scala:91 []
+-(2) ParallelCollectionRDD[2] at parallelize at IntegrationStatusJob.scala:91 []
This is the spark UI image:
Looking at the toDebugString and at the Spark UI, if I understood correctly, in order to perform the join the DAG looks at which partitioner should be used, and because both RDDs are hash-partitioned, it chooses the partitioner with the greater number of partitions, i.e. rdd's partitioner.
Now from the Spark UI, it seems that rdd's partitionBy and the join are performed in the same stage, so under these conditions the shuffle needed to perform the join will be done from just one side? By one side, I mean that just rdd1 will be shuffled, not both.
Is my assumption correct?
You're right. If the two RDDs are partitioned using different partitioners, Spark will pick one as a reference and repartition / shuffle only the other one.
If both have the same partitioner, there is no need for a shuffle.
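To illustrate the second point, a small PySpark sketch with hypothetical data and the default hash partitioner: if both sides are partitioned the same way before the join, the join itself should not introduce another shuffle, which can be checked with toDebugString or in the Spark UI:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
left = sc.parallelize([(i, i * 10) for i in range(10)]).partitionBy(4)
right = sc.parallelize([(i, i * 100) for i in range(10)]).partitionBy(4)
# Both sides now share the same partitioner (4 partitions, default hash function),
# so the join can run on the existing partitions without an extra shuffle stage.
joined = left.join(right, numPartitions=4)
# Inspect joined.toDebugString() or the Spark UI to confirm there is no additional ShuffledRDD.
print(sorted(joined.collect()))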

Vertically partition an RDD and write to separate locations

In Spark 1.5+, how can I write each column of an "n"-tuple RDD to a different location?
For example, if I had an RDD[(String, String)], I would like the first column to be written to s3://bucket/first-col and the second to s3://bucket/second-col.
I could do the following
val pairRDD: RDD[(String, String)]
val cachedRDD = pairRDD.cache()
cachedRDD.map(_._1).saveAsTextFile("s3://bucket/first-col")
cachedRDD.map(_._2).saveAsTextFile("s3://bucket/second-col")
But this is far from ideal since it needs two passes over the RDD.
One way you can go about doing this is by converting the tuples into lists, then using map to create a list of RDDs, and performing a save on each as follows:
val fileNames: List[String] = ...
val numCols: Int = ...
val input: RDD[Product] = ... // tuples of any size
val columnIDs = 0 until numCols
val unzippedValues = input.map(_.productIterator.toList).persist() // converts each tuple into a list
val columnRDDs = columnIDs.map(i => unzippedValues.map(_(i)))
columnRDDs.zip(fileNames).foreach { case (colRdd, fName) => colRdd.saveAsTextFile(fName) }
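A PySpark sketch of the same pattern, with hypothetical data and the bucket paths from the question: cache the RDD once, then do one save per column; each save is still a separate pass, but the rows come from the cache rather than being recomputed from the source:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
# Hypothetical n-tuple RDD and its output locations.
pair_rdd = sc.parallelize([("a1", "b1"), ("a2", "b2")])
file_names = ["s3://bucket/first-col", "s3://bucket/second-col"]
cached = pair_rdd.map(lambda t: list(t)).cache()  # turn each tuple into a list
for i, path in enumerate(file_names):
    # i=i pins the current column index inside the lambda (avoids late binding)
    cached.map(lambda row, i=i: row[i]).saveAsTextFile(path)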

adding new elements to batch RDD from DStream RDD

The only way to join / union / cogroup a DStream RDD with a batch RDD is via the "transform" method, which returns another DStream RDD and hence gets discarded at the end of the micro-batch.
Is there any way to e.g. union a DStream RDD with a batch RDD such that it produces a new batch RDD containing the elements of both the DStream RDD and the batch RDD?
And once such a batch RDD is created in the above way, can it be used by other DStream RDDs to e.g. join with, since this time the result can be another DStream RDD?
Effectively, the functionality described above would result in periodic updates (additions) of elements to a batch RDD; the additional elements would keep coming from the DStream RDDs that keep streaming in with every micro-batch.
Newly arriving DStream RDDs would also be able to join with the previously updated batch RDD and produce a result DStream RDD.
Something almost like that can be achieved with updateStateByKey, but is there a way to do it as described here?
Another approach would be to transform the batch input into a DStream and union it with your streaming input. Then you write it out using foreachRDD, and what you write out becomes the batch input for other jobs.
val batch = sc.textFile(...)
val ssc = new StreamingContext(sc, Seconds(30))
val stream = ssc.textFileStream(...)
import scala.collection.mutable
val batchStream = ssc.queueStream(mutable.Queue.empty[RDD[String]], oneAtATime = false, defaultRDD = batch)
val union = ssc.union(Seq(stream, batchStream))
union.print()
union.foreachRDD { rdd =>
  // Delete the previous output first, or use SchemaRDD with .insertInto(..., overwrite = true)
  rdd.saveAsTextFile(...)
}
ssc.start()
ssc.awaitTermination()
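A rough PySpark equivalent of the same queueStream trick, with hypothetical paths and 30-second batches; the batch RDD is served as the default RDD of a queue stream and unioned with the live stream:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, 30)
batch = sc.textFile("hdfs:///path/to/batch")             # hypothetical batch input
stream = ssc.textFileStream("hdfs:///path/to/incoming")  # hypothetical streaming input
# Empty queue plus a default RDD: once the queue is drained, every interval emits the batch RDD.
batch_stream = ssc.queueStream([], oneAtATime=False, default=batch)
union = stream.union(batch_stream)
union.pprint()
# Each interval, write the union back out; remove or rotate the previous output first,
# as in the Scala comment above. That output becomes the batch input of other jobs.
union.foreachRDD(lambda rdd: rdd.saveAsTextFile("hdfs:///path/to/union-output"))
ssc.start()
ssc.awaitTermination()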
