I have an RDD of tuples of the form (key, count); however, some keys are equivalent, e.g.
(a,3)
(b,4)
(c,5)
should reduce down to the following, since a and c are (for example) equivalent:
(a,8)
(b,4)
Is there a way to perform this operation in Spark?
I'm thinking of some sort of conditional within the reduce() function.
I don't think there is a way to do this within the reduce operation itself, but you can achieve it with a pre-processing step. One option is to create a Map[K,K] that links equivalent keys.
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap: Map[String,String] = Map[String,String]("a"->"a", "b"->"b", "c"->"a")
val out = in.map{case (k,v) => (keyMap.getOrElse(k,k),v)}.reduceByKey(_+_)
out.take(3).foreach(println)
Edit:
If the Map can't fit on the driver, you can also distribute the lookup:
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap = sc.parallelize(List(("a","a"),("b","b"),("c","a")))
val out = in.join(keyMap).map{case (oldKey, (v, newKey)) => (newKey, v)}.reduceByKey(_+_)
out.take(3).foreach(println)
reduceByKey() does the trick here, as your data is already a paired RDD.
val baseRDD = sc.parallelize(Seq(("a", 3), ("b", 4), ("a", 5)))
baseRDD.reduceByKey((accum, current) => accum + current).foreach(println)
Related
I am trying to group elements of an RDD that I have created. One simple but expensive way is to use groupByKey(), but recently I learned that combineByKey() can do this work more efficiently. My RDD is very simple; it looks like this:
(1,5)
(1,8)
(1,40)
(2,9)
(2,20)
(2,6)
val grouped_elements = first_RDD.groupByKey().mapValues(x => x.toList)
the result is:
(1,List(5,8,40))
(2,List(9,20,6))
I want to group them based on the first element (the key).
Can anyone help me do it with the combineByKey() function? I am really confused by combineByKey().
To begin with, take a look at the API docs:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
So it accepts three functions which I have defined below
scala> val createCombiner = (v:Int) => List(v)
createCombiner: Int => List[Int] = <function1>
scala> val mergeValue = (a:List[Int], b:Int) => a.::(b)
mergeValue: (List[Int], Int) => List[Int] = <function2>
scala> val mergeCombiners = (a:List[Int],b:List[Int]) => a.++(b)
mergeCombiners: (List[Int], List[Int]) => List[Int] = <function2>
Once you define these then you can use it in your combineByKey call as below
scala> val list = List((1,5),(1,8),(1,40),(2,9),(2,20),(2,6))
list: List[(Int, Int)] = List((1,5), (1,8), (1,40), (2,9), (2,20), (2,6))
scala> val temp = sc.parallelize(list)
temp: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:30
scala> temp.combineByKey(createCombiner,mergeValue, mergeCombiners).collect
res27: Array[(Int, List[Int])] = Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))
Please note that I tried this out in the Spark shell, hence you can see the output below each executed command. This should help build your understanding.
This is a quote from jaceklaskowski.gitbooks.io.
Some operations, e.g. map, flatMap, filter, don’t preserve partitioning.
map, flatMap, filter operations apply a function to every partition.
I don't understand why filter does not preserve partitioning. It's just getting a subset of each partition which satisfy a condition so I think partitions can be preserved. Why isn't it like that?
You are of course right. The quote is just incorrect. filter does preserve partitioning (for the reason you've already described), and it is trivial to confirm that
val rdd = sc.range(0, 10).map(x => (x % 3, None)).partitionBy(
new org.apache.spark.HashPartitioner(11)
)
rdd.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
val filteredRDD = rdd.filter(_._1 == 3)
filteredRDD.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
rdd.partitioner == filteredRDD.partitioner
// Boolean = true
This stands in contrast to operations like map, which don't preserve partitioning (the Partitioner is dropped):
rdd.map(identity _).partitioner
// Option[org.apache.spark.Partitioner] = None
Datasets are a bit more subtle, as filters are normally pushed-down, but overall the behavior is similar.
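A small illustrative sketch (the Parquet path and the id column are hypothetical), assuming a SparkSession named spark; for file-based sources the filter typically shows up as a pushed-down predicate in the physical plan:
import spark.implicits._
val ds = spark.read.parquet("/path/to/data")  // hypothetical input
ds.filter($"id" > 5).explain()
// look for something like "PushedFilters: [IsNotNull(id), GreaterThan(id,5)]" in the output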
filter does preserve partitioning, at least as suggested by the source code of filter (preservesPartitioning = true):
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
I have two RDDs: one containing (a, b, a, c, b, c, a) and the other a paired RDD containing ((a, 0), (b, 1), (c, 2)).
I want to replace the a's, b's, and c's in the first RDD with 0, 1, and 2 (the values of the keys a, b, and c in the second RDD). I'd like to preserve the order of the elements in the first RDD.
How to achieve it in Spark?
For example like this:
val rdd1 = sc.parallelize(Seq("a", "b", "a", "c", "b", "c", "a"))
val rdd2 = sc.parallelize(Seq(("a", 0), ("b", 1), ("c", 2)))
rdd1
.map((_, 1)) // Map first to PairwiseRDD with dummy values
.join(rdd2)
.map { case (_, (_, x)) => x } // Drop keys and dummy values
If the mapping RDD is small, it can be faster to broadcast it and map:
val bd = sc.broadcast(rdd2.collectAsMap)
// This assumes all values are present. If not use get / getOrElse
// or map withDefault
rdd1.map(bd.value)
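A minimal sketch of the getOrElse variant mentioned in the comment above; the -1 default is a hypothetical placeholder for keys missing from the mapping:
rdd1.map(s => bd.value.getOrElse(s, -1))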
It will also preserve the order of elements.
In the case of a join you can add increasing identifiers (zipWithIndex / zipWithUniqueId) to be able to restore the initial ordering, but that is substantially more expensive.
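A rough sketch of that more expensive join-based variant, assuming the same rdd1 and rdd2 as above:
val ordered = rdd1
  .zipWithIndex                                    // (element, originalPosition)
  .join(rdd2)                                      // (element, (originalPosition, value))
  .map { case (_, (idx, value)) => (idx, value) }
  .sortByKey()                                     // restore the original order
  .values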
You can do this using join.
First, to simulate your RDDs:
val rdd = sc.parallelize(List("a","b","a","c","b","c","a"))
val mapping = sc.parallelize(List(("a",0),("b",1),("c",2)))
You can only join pair RDDs, so map the original rdd to a pair RDD and then join it with mapping:
rdd.map(s => (s, None)).join(mapping).map{case(_, (_, intValue)) => intValue}
Is there a way (a method) in Spark to find out the partition ID/number?
Take this example:
val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1.reduce { (x, y) =>
  println("Inside partition " + ???)
  x + y
}
I would like to put some code in place of ??? to print the partition ID/number.
You can also use
TaskContext.getPartitionId()
e.g., in lieu of the presently missing foreachPartitionWithIndex()
https://github.com/apache/spark/pull/5927#issuecomment-99697229
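A small sketch of one way to use it, tagging each element with the id of the partition it lives in (the exact output depends on how the data is split across partitions):
import org.apache.spark.TaskContext

val input1 = sc.parallelize(List(8, 9, 10), 3)
input1.map(x => (TaskContext.getPartitionId(), x)).collect()
// e.g. Array((0,8), (1,9), (2,10))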
Indeed, mapPartitionsWithIndex will give you an iterator and the partition index. (This isn't the same as reduce, of course, but you could combine its result with aggregate; a rough sketch follows.)
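A rough sketch of that combination, assuming the goal is the same sum as in the question: compute per-partition sums with mapPartitionsWithIndex, printing the index along the way, then merge the partial results (here with reduce; aggregate would work as well):
val input1 = sc.parallelize(List(8, 9, 10), 3)
val partialSums = input1.mapPartitionsWithIndex { (index, iter) =>
  val sum = iter.sum
  println("Inside partition " + index)  // printed in the executor logs
  Iterator(sum)
}
val total = partialSums.reduce(_ + _)  // 27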
Posting the answer here using mapPartitionsWithIndex, based on the suggestion by @Holden.
I have created an RDD (input) with 3 partitions. Each element of input is tagged with its partition index (index) in the call to mapPartitionsWithIndex:
scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21
scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)
I ran across this old question while looking for the spark_partition_id SQL function for DataFrames:
val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id",spark_partition_id).rdd.collect
res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])
I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select elements from 60 to 80. How do I do that?
I see that RDD has a take(i: int) method, which returns the first i elements. But there is no corresponding method to take the last i elements, or i elements from the middle starting at a certain index.
I don't think there is an efficient method to do this yet. But the easy way is to use filter(). Let's say you have an RDD, pairs, with key-value pairs, and you only want elements with keys from 60 to 80 inclusive; just do the following.
val range60to80 = pairs.filter {
  // a partial function over the (key, value) pairs
  case (k, v) => k >= 60 && k <= 80
}
I think it's possible that this could be done more efficiently in the future, by using sortByKey and saving information about the range of values mapped to each partition. Keep in mind this approach would only save anything if you were planning to query the range multiple times because the sort is obviously expensive.
From looking at the Spark source, it would definitely be possible to do efficient range queries using RangePartitioner:
// An array of upper bounds for the first (partitions - 1) partitions
private val rangeBounds: Array[K] = {
This is a private member of RangePartitioner that knows the upper bounds of all the partitions, so it would be easy to query only the necessary partitions. It looks like this is something Spark users may see in the future: SPARK-911.
UPDATE: A much better answer, based on the pull request I'm writing for SPARK-911. It will run efficiently if the RDD is sorted and you query it multiple times.
val sorted = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey().cache()
val p: RangePartitioner[Int, Int] = sorted.partitioner.get.asInstanceOf[RangePartitioner[Int, Int]]
val (lower, upper) = (10, 20)
val range = p.getPartition(lower) to p.getPartition(upper)
println(range)
val rangeFilter = (i: Int, iter: Iterator[(Int, Int)]) => {
  if (range.contains(i))
    for ((k, v) <- iter if k >= lower && k <= upper) yield (k, v)
  else
    Iterator.empty
}
for((k,v) <- sorted.mapPartitionsWithIndex(rangeFilter, preservesPartitioning = true).collect()) println(s"$k, $v")
If having the whole partition in memory is acceptable, you could even do something like this:
val glommedAndCached = sorted.glom().cache()
glommedAndCached.map(a => a.slice(a.search(lower),a.search(upper)+1)).collect()
By the way, search is not a member of Array; I just made an implicit class that has a binary search function, not shown here.
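The implicit class isn't shown above; here is a minimal sketch of what it might look like, assuming the array is sorted by key (as it is after sortByKey) and that search should return the index of the first element whose key is >= the target:
implicit class SearchableArray(arr: Array[(Int, Int)]) {
  // binary search over the sorted keys
  def search(target: Int): Int = {
    var lo = 0
    var hi = arr.length
    while (lo < hi) {
      val mid = (lo + hi) / 2
      if (arr(mid)._1 < target) lo = mid + 1 else hi = mid
    }
    lo
  }
}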
How big is your data set? You might be able to do what you need with:
data.take(80).drop(59)
This seems inefficient, but for small to medium-sized data it should work.
Is it possible to solve this in another way? What's the case for picking exactly a certain range out of the middle of your data? Would takeSample serve you better?
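For reference, takeSample returns a random subset rather than a contiguous slice; a small sketch, assuming the data RDD referenced above:
val sample = data.takeSample(withReplacement = false, num = 20, seed = 42L)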
The following should be able to get the range. Note that the cache will save you some overhead, because internally zipWithIndex needs to scan the RDD partitions to get the number of elements in each partition.
scala>val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3).cache
scala>val r2 = r1.zipWithIndex
scala>val r3 = r2.filter(x=> {x._2>2 && x._2 < 4}).map(x=>x._1)
scala>r3.foreach(println)
d
For those who stumble on this question looking for a Spark 2.x-compatible answer: you can use filterByRange.
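A small sketch of filterByRange, which is available on pair RDDs with ordered keys and is most efficient when the RDD is range-partitioned (e.g. after sortByKey):
val pairs = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey()
val slice = pairs.filterByRange(60, 80)  // keeps keys 60 through 80, both inclusive
slice.values.collect()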