Finding multiple values using map-reduce - apache-spark

Let's say I have an RDD with the following schema:
(ID, VALUE_1, VALUE_2)
What I would like to do is use map-reduce to somehow end up with something like:
(ID, SUM(VALUE_1), SUM(VALUE_2), rdd_size), where SUM(VALUE_1) and SUM(VALUE_2) are the sums of VALUE_1 and VALUE_2 over the whole RDD, and rdd_size is the number of rows in my RDD.
So far, using reduce, I can easily find any one of those three, but I can't seem to end up with the desired output schema. Any ideas?

Please note this is in Scala, but you could do something similar in PySpark as well.
The following code creates the RDD the way you have shown:
scala> val list = List((1,2,3),(1,3,4),(1,10,23),(2,3,5),(2,55,6))
list: List[(Int, Int, Int)] = List((1,2,3), (1,3,4), (1,10,23), (2,3,5), (2,55,6))
scala> val rdd = sc.parallelize(list)
rdd: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:26
Map this RDD to (key, value) pairs, where the key is the first element of the tuple (ID in your case) and the value is a Tuple3 whose first element is hardcoded to 1 and whose remaining two elements are copied from the original RDD (VALUE_1 and VALUE_2 in your example). The collect and println calls are included below for understanding only; they are not advisable when you run this with real data.
scala> val rdd1 = rdd.map(x => (x._1,(1,x._2,x._3)))
rdd1: org.apache.spark.rdd.RDD[(Int, (Int, Int, Int))] = MapPartitionsRDD[8] at map at <console>:25
scala> rdd1.collect.foreach(println)
(1,(1,2,3))
(1,(1,3,4))
(1,(1,10,23))
(2,(1,3,5))
(2,(1,55,6))
groupByKey is not required for any of this; it is shown only to display what the grouped RDD looks like.
scala> rdd1.groupByKey().collect.foreach(println)
(1,CompactBuffer((1,2,3), (1,3,4), (1,10,23)))
(2,CompactBuffer((1,3,5), (1,55,6)))
Run reduceByKey to arrive at the output you are expecting.
You can use the groupByKey output above to sum VALUE_1 and VALUE_2 by hand and confirm that the results of reduceByKey are correct.
scala> rdd1.reduceByKey((a,b) => (a._1+b._1,a._2+b._2,a._3+b._3)).collect.foreach(println)
(1,(3,15,30))
(2,(2,58,11))
In the above output:
Key is the ID in your example.
Value is a Tuple3 whose first element is the number of records in that group, whose second element is SUM(VALUE_1), and whose third element is SUM(VALUE_2).
You can rearrange the tuple if you want the number of records (the size in your example) to be the last element, as shown below.
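For example, a minimal rearrangement building on rdd1 above (the only assumption is that you want the count as the last element):
// Rearrange so the record count comes last: (ID, SUM(VALUE_1), SUM(VALUE_2), size)
val result = rdd1
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
  .map { case (id, (count, sum1, sum2)) => (id, sum1, sum2, count) }

result.collect.foreach(println)
// (1,15,30,3)
// (2,58,11,2)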

Related

Use Spark groupByKey to dedup RDD which causes a lot of shuffle overhead

I have a key-value pair RDD. The RDD contains some elements with duplicate keys, and I want to split the original RDD into two RDDs: one stores elements with unique keys, and the other stores the remaining elements. For example,
Input RDD (6 elements in total):
<k1,v1>, <k1,v2>, <k1,v3>, <k2,v4>, <k2,v5>, <k3,v6>
Result:
Unique-keys RDD (stores one element per key; when multiple elements share a key, any one of them is accepted):
<k1,v1>, <k2, v4>, <k3,v6>
Duplicated-keys RDD (stores the remaining elements with duplicated keys):
<k1,v2>, <k1,v3>, <k2,v5>
In the above example, the unique RDD has 3 elements, and the duplicated RDD has 3 elements too.
I tried groupByKey() to group elements with the same key together. For each key, there is a sequence of elements. However, the performance of groupByKey() is not good because the element values are very large, which causes a very large shuffle write.
So I was wondering if there is any better solution. Or is there a way to reduce the amount of data being shuffled when using groupByKey()?
EDIT: given the new information in the edit, I would first create the unique RDD, and then the duplicate RDD using the unique one and the original one:
val inputRdd: RDD[(K,V)] = ...
val uniqueRdd: RDD[(K,V)] = inputRdd.reduceByKey((x, y) => x) // keep just a single value for each key
val duplicateRdd = inputRdd
  .join(uniqueRdd)
  .filter { case (k, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, v2)) => (k, v1) } // v2 came from the unique RDD
There is also some room for optimization.
In the solution above there will be 2 shuffles (reduceByKey and join).
If we repartition inputRdd by key from the start, we won't need any additional shuffles; using this code should produce much better performance:
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions=200) )
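A rough sketch of how the rest of the pipeline would then look, building on inputRdd2 above (adding cache() is my own assumption, to avoid recomputing the repartition for both downstream uses):
inputRdd2.cache()

// reduceByKey reuses the existing HashPartitioner, so no additional shuffle here
val uniqueRdd2 = inputRdd2.reduceByKey((x, y) => x)

// joining two RDDs that share the same partitioner also avoids a shuffle
val duplicateRdd2 = inputRdd2
  .join(uniqueRdd2)
  .filter { case (k, (v1, v2)) => v1 != v2 }
  .map { case (k, (v1, v2)) => (k, v1) }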
Original Solution:
You can try the following approach:
first count the number of occurrences of each (key, value) pair, and then split into the 2 RDDs:
val inputRdd: RDD[(K,V)] = ...
val countRdd: RDD[((K,V), Int)] = inputRdd
  .map((_, 1))
  .reduceByKey(_ + _)
  .cache

val uniqueRdd = countRdd.map(_._1)

val duplicateRdd = countRdd
  .filter(_._2 > 1)
  .flatMap { case (kv, count) =>
    (1 to count - 1).map(_ => kv)
  }
Please use combineByKey, which results in a combiner being used on the map task and hence reduces the amount of shuffled data.
The combiner logic depends on your business logic.
http://bytepadding.com/big-data/spark/groupby-vs-reducebykey/
There are multiple ways to reduce shuffle data:
1. Write less data from the map task by using a combiner.
2. Send aggregated, serialized objects from map to reduce.
3. Use combined input formats to enhance the efficiency of combiners.
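As an illustration of the combineByKey advice above only (the key/value types String/Int, the sample pairs, and the set-based combiner are all assumptions, since the real combiner depends on your business logic), a minimal sketch that drops duplicate values per key on the map side before the shuffle could look like this:
import org.apache.spark.rdd.RDD

// Assumed types and data for illustration; replace with your real K/V and combiner logic.
val samplePairs: RDD[(String, Int)] = sc.parallelize(Seq(("k1", 1), ("k1", 1), ("k1", 2), ("k2", 3)))

val distinctValuesPerKey: RDD[(String, Set[Int])] = samplePairs.combineByKey(
  (v: Int) => Set(v),                   // createCombiner: start a set with the first value seen in a partition
  (acc: Set[Int], v: Int) => acc + v,   // mergeValue: fold further local values in (dedups map-side)
  (a: Set[Int], b: Set[Int]) => a ++ b  // mergeCombiners: merge sets coming from different partitions
)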

Multiple Split and Map in Spark

I have the below after splitting a file on #:
res64: Array[(String, String)] = Array((1,Animation|Children's|Comedy), (2,Adventure|Children's|Fantasy))
How do I get unique pairs (using distinct) like (1,Animation), (1,Children's), etc. for every key (the movie id here, e.g. 1) in the RDD?
It can be as simple as:
rdd.mapValues(_.split('|'))
  .flatMapValues(x => x)
  .distinct()
  .collect()
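Put together with the sample data from the question, a self-contained sketch (variable names are my own) looks like:
val rdd = sc.parallelize(Seq(
  ("1", "Animation|Children's|Comedy"),
  ("2", "Adventure|Children's|Fantasy")))

val pairs = rdd
  .mapValues(_.split('|'))
  .flatMapValues(x => x)
  .distinct()

pairs.collect().foreach(println)
// e.g. (1,Animation), (1,Children's), (1,Comedy), (2,Adventure), ... (output order is not guaranteed)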

merge spark dStream with variable to saveToCassandra()

I have a DStream[(String, Int)] of word-count pairs, e.g. ("hello" -> 10). I want to write these counts to Cassandra with a step index. The index is initialized as var step = 1 and is incremented with each microbatch processed.
The Cassandra table is created as:
CREATE TABLE wordcounts (
  step int,
  word text,
  count int,
  primary key (step, word)
);
When trying to write the stream to the table...
stream.saveToCassandra("keyspace", "wordcounts", SomeColumns("word", "count"))
... I get java.lang.IllegalArgumentException: Some primary key columns are missing in RDD or have not been selected: step.
How can I prepend the step index to the stream in order to write the three columns together?
I'm using spark 2.0.0, scala 2.11.8, cassandra 3.4.0 and spark-cassandra-connector 2.0.0-M3.
As noted, while the Cassandra table expects something of the form (Int, String, Int), the wordCount DStream is of type DStream[(String, Int)], so for the call to saveToCassandra(...) to work, we need a DStream of type DStream[(Int, String, Int)].
The tricky part in this question is how to bring a local counter, that is by definition only known in the driver, up to the level of the DStream.
To do that, we need to do two things: "lift" the counter to a distributed level (in Spark, we mean "RDD" or "DataFrame") and join that value with the existing DStream data.
Starting from the classic Streaming word count example:
// Split each line into words
val words = lines.flatMap(_.split(" "))
// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
We add a local var to hold the count of the microbatches:
@transient var batchCount = 0
It's declared @transient so that Spark doesn't try to serialize its value when we declare transformations that use it.
Now the tricky bit: within the context of a DStream transformation, we make an RDD out of that single variable and join it with the underlying RDD of the DStream using a cartesian product:
val batchWordCounts = wordCounts.transform { rdd =>
  batchCount = batchCount + 1
  val localCount = sparkContext.parallelize(Seq(batchCount))
  rdd.cartesian(localCount).map { case ((word, count), batch) => (batch, word, count) }
}
(Note that a simple map function would not work, as only the initial value of the variable would be captured and serialized, so it would look like the counter never increased when looking at the DStream data.)
Finally, now that the data is in the right shape, save it to Cassandra:
batchWordCounts.saveToCassandra("keyspace", "wordcounts")
The updateStateByKey function is provided by Spark for global state handling.
For this case, it could look something like the following:
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount: Int = runningCount.getOrElse(0) + 1
  Some(newCount)
}
val step = stream.updateStateByKey(updateFunction _)

stream.join(step)
  .map { case (key, (count, step)) => (step, key, count) }
  .saveToCassandra("keyspace", "wordcounts")
Since you are trying to save the RDD to an existing Cassandra table, you need to include all the primary key column values in the RDD.
Alternatively, you can use the methods below to save the RDD to a new table:
saveAsCassandraTable or saveAsCassandraTableEx
For more info, look into this.

Is there a way to provide a Java Comparator to a Spark ReduceByKey function?

I have a JavaPairRDD<KeyClass, ValueClass> rdd where my KeyClass has several fields.
I would like to reduceByKey based on only a subset of the fields in my KeyClass. I'm doing it by mapping the RDD:
JavaPairRDD<String, Tuple2<KeyClass, ValueClass>> readyForReduce = rdd.mapToPair(addKey());
I know I can pass in a partitioner, but that just determines the partition for the record, not how it is reduced.
Also, I do not want to override the hash method of KeyClass.
You have listed all the possible solutions in the can't-do list, as far as I know. However, using keyBy will lead to code that is closer to what you want to achieve. Note that you will still end up with a pair RDD.
val readyToReduce = rdd.keyBy{case (k, v) => pickKeysYouWant(k)}
Example:
scala> val a = sc.parallelize(List(((1, "adam"), "adams_info"), ((2, "bob"), "bobs_info")))
scala> a.collect.map(println)
scala> val readyToReduce = a.keyBy{case (key, value) => key._2}
scala> readyToReduce.collect.map(println)
(adam,((1,adam),adams_info))
(bob,((2,bob),bobs_info))
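From there you can reduce on the extracted key directly; a minimal continuation of the example above (the merge function here is only a placeholder for your real reduction logic):
val reduced = readyToReduce.reduceByKey((left, right) => left)
// reduced still pairs the extracted key with the full (KeyClass, ValueClass) tuple,
// e.g. (adam,((1,adam),adams_info)) and (bob,((2,bob),bobs_info))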

Vertically partition an RDD and write to separate locations

In spark 1.5+ how can I write each column of an "n"-tuple RDD to different locations?
For example, if I had an RDD[(String, String)] I would like the first column to be written to s3://bucket/first-col and the second to s3://bucket/second-col.
I could do the following:
val pairRDD: RDD[(String, String)]
val cachedRDD = pairRDD.cache()
cachedRDD.map(_._1).saveAsTextFile("s3://bucket/first-col")
cachedRDD.map(_._2).saveAsTextFile("s3://bucket/second-col")
But this is far from ideal, since it requires two passes over the RDD.
One way you could go about doing this is by converting the tuples into lists, then using map to create a list of RDDs, and performing a save on each, as follows:
val fileNames: List[String]
val input: RDD[(String, String)] // could be a tuple of any size
val numCols = 2                  // number of columns in the tuple

val columnIDs = 0 until numCols  // list indices are 0-based
val unzippedValues = input.map(_.productIterator.toList).persist() // converts each tuple into a list
val columnRDDs = columnIDs.map(a => unzippedValues.map(_(a)))
columnRDDs.zip(fileNames).foreach { case (b, fName) => b.saveAsTextFile(fName) }
