Why filter does not preserve partitioning? - apache-spark

This is a quote from jaceklaskowski.gitbooks.io.
Some operations, e.g. map, flatMap, filter, don’t preserve partitioning.
map, flatMap, filter operations apply a function to every partition.
I don't understand why filter does not preserve partitioning. It's just taking the subset of each partition that satisfies a condition, so I think the partitioning can be preserved. Why isn't it like that?

You are of course right. The quote is just incorrect. filter does preserve partitioning (for the reason you've already described), and it is trivial to confirm that:
val rdd = sc.range(0, 10).map(x => (x % 3, None)).partitionBy(
  new org.apache.spark.HashPartitioner(11)
)
rdd.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
val filteredRDD = rdd.filter(_._1 == 3)
filteredRDD.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
rdd.partitioner == filteredRDD.partitioner
// Boolean = true
This is in contrast to operations like map, which don't preserve partitioning (the Partitioner):
rdd.map(identity _).partitioner
// Option[org.apache.spark.Partitioner] = None
Datasets are a bit more subtle, as filters are normally pushed down, but overall the behavior is similar.
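For example (a minimal sketch, assuming a SparkSession named spark), you can inspect the physical plan of a filtered Dataset to see where the predicate ends up; for a file-based source such as Parquet, the scan node additionally lists the pushed filters:
import org.apache.spark.sql.functions.col

// Assumes `spark` is an existing SparkSession.
val ds = spark.range(0, 100).toDF("id")
ds.filter(col("id") % 3 === 0).explain()
// For a Parquet or ORC source, the FileScan node would also report PushedFilters.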

filter does preserve partitioning; at least, this is suggested by the source code of filter (preservesPartitioning = true):
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
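Relatedly, if you need a map-like transformation that keeps the partitioner, mapValues and flatMapValues are the partitioner-preserving counterparts of map and flatMap, because they cannot change the keys. A minimal sketch:
val keyed = sc.range(0, 10).map(x => (x % 3, x)).partitionBy(
  new org.apache.spark.HashPartitioner(11))

keyed.map { case (k, v) => (k, v * 2) }.partitioner  // None: map may change keys
keyed.mapValues(_ * 2).partitioner                   // Some(HashPartitioner): keys are untouched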

Related

Spark reduce with comparison

I have an RDD of tuples of the form (key, count); however, some keys are equivalent, i.e.
(a,3)
(b,4)
(c,5)
should reduce down to the following, as a and c are equivalent (for example):
(a,8)
(b,4)
Is there a way to perform this operation in Spark?
I'm thinking some sort of conditional within the reduce() function?
I don't think there is a way to do this within the reduce operation, but you can achieve it using a pre-processing step. One option is to create a Map[K,K] that links your keys.
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap: Map[String,String] = Map[String,String]("a"->"a", "b"->"b", "c"->"a")
val out = in.map{case (k,v) => (keyMap.getOrElse(k,k),v)}.reduceByKey(_+_)
out.take(3).foreach(println)
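If the map fits in memory but you would rather not ship it inside every task closure, a sketch of the same idea using a broadcast variable:
val keyMapBc = sc.broadcast(Map("a" -> "a", "b" -> "b", "c" -> "a"))
val out2 = in.map { case (k, v) => (keyMapBc.value.getOrElse(k, k), v) }.reduceByKey(_ + _)
out2.take(3).foreach(println)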
Edit:
If the Map can't fit on the driver, you can also distribute the lookup:
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap = sc.parallelize(List(("a","a"),("b","b"),("c","a")))
val out = in.join(keyMap).map{case (oldKey, (v, newKey)) => (newKey, v)}.reduceByKey(_+_)
out.take(3).foreach(println)
reduceByKey() does the trick here, as your data is already a paired RDD.
val baseRDD = sc.parallelize(Seq(("a", 3), ("b", 4), ("a", 5)))
baseRDD.reduceByKey((accum, current) => accum + current).foreach(println)

How to check if all records for a given key are in the same partition already?

I'd like to avoid repartitioning data set by key as much as possible and know if all records for a given key are in the same partition already.
Is there a built-in function in Spark that would give me the answer?
Not built-in, but if you assume a specific partitioner it is easy enough to implement your own function:
import org.apache.spark.rdd.RDD
import org.apache.spark.Partitioner
import scala.reflect.ClassTag
def checkDistribution[K : ClassTag, V : ClassTag](
    rdd: RDD[(K, V)], partitioner: Partitioner) =
  // If a partitioner is set we compare partitioners
  rdd.partitioner.map(_ == partitioner).getOrElse {
    // Otherwise check if the number of partitions is correct
    rdd.partitions.size == partitioner.numPartitions &&
    // and check if the distribution matches the partitioner
    rdd.keys.mapPartitionsWithIndex((i, iter) =>
      Iterator(iter.forall(x => partitioner.getPartition(x) == i))
    ).fold(true)(_ && _)
  }
A few tests:
import org.apache.spark.HashPartitioner
val rdd = sc.range(0, 20, 5).map((_, None))
Not partitioned, invalid distribution:
checkDistribution(rdd, new HashPartitioner(10))
// Boolean = false
Partitioned, invalid partitioner:
checkDistribution(
  rdd.partitionBy(new HashPartitioner(5)),
  new HashPartitioner(10)
)
// Boolean = false
Partitioned, valid partitioner:
checkDistribution(
  rdd.partitionBy(new HashPartitioner(10)),
  new HashPartitioner(10)
)
// Boolean = true
Not partitioned, valid distribution:
checkDistribution(
  rdd.partitionBy(new HashPartitioner(10)).map(identity),
  new HashPartitioner(10)
)
// Boolean = true
Without assuming a particular partitioner, the only option that comes to mind requires a shuffle, so it is unlikely to be an improvement.
def checkDistribution[K : ClassTag, V : ClassTag](rdd: RDD[(K, V)]) =
  rdd.keys.mapPartitionsWithIndex((i, iter) => iter.map((_, i)))
    .combineByKey(
      x => Seq(x),
      (x: Seq[Int], y: Int) => x,
      (x: Seq[Int], y: Seq[Int]) => x ++ y)  // Should be more or less OK
    .values
    .mapPartitions(iter => Iterator(iter.forall(_.size == 1)))
    .fold(true)(_ && _)
One possible improvement: you can use the same logic to automatically define a Partitioner for the data. If you collectAsMap before values and check that all Seqs are of size 1, you have a valid key-to-partition mapping, which guarantees no network traffic.
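A rough sketch of that idea (the names MapBackedPartitioner and inferPartitioner are made up here for illustration; they are not Spark API):
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Wraps an explicit key -> partition-index assignment (illustrative helper).
class MapBackedPartitioner[K](assignment: Map[K, Int], val numPartitions: Int)
  extends Partitioner {
  def getPartition(key: Any): Int = assignment(key.asInstanceOf[K])
}

def inferPartitioner[K : ClassTag, V : ClassTag](rdd: RDD[(K, V)]): Option[Partitioner] = {
  // Same pipeline as above, but collected to the driver before .values.
  val assignment = rdd.keys
    .mapPartitionsWithIndex((i, iter) => iter.map((_, i)))
    .combineByKey(
      (x: Int) => Seq(x),
      (x: Seq[Int], y: Int) => x,
      (x: Seq[Int], y: Seq[Int]) => x ++ y)
    .collectAsMap()

  // Only valid if every key already lives in exactly one partition.
  if (assignment.values.forall(_.size == 1))
    Some(new MapBackedPartitioner(assignment.mapValues(_.head).toMap, rdd.partitions.size))
  else
    None
}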
Not 100% what you requested, but you can check this by using spark_partition_id. Basically do:
withColumn("pid", spark_partition_id())
and then do:
df.groupBy(<what you want to check>).agg(max($"pid").as("pidmax"), min($"pid").as("pidmin")).filter($"pidmax" === $"pidmin").count()
With the === filter, the count gives you the number of keys whose records all sit in a single partition (compare it with the total number of distinct keys); flip the comparison to =!= to count the keys that are spread across partitions instead.
Note that this is relatively low cost, being a simple aggregation.
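Put together as a runnable sketch (the DataFrame df and its key column "key" are assumptions here; col is used instead of the $ shorthand to avoid needing spark.implicits._):
import org.apache.spark.sql.functions.{spark_partition_id, max, min, col}

val withPid = df.withColumn("pid", spark_partition_id())
val colocatedKeys = withPid
  .groupBy(col("key"))
  .agg(max(col("pid")).as("pidmax"), min(col("pid")).as("pidmin"))
  .filter(col("pidmax") === col("pidmin"))
  .count()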
I don't believe there is a generic way because if we read from a generic source (e.g. file), we don't necessarily know how the source was originally partitioned.
It would be nice if there were something like "get current partitioner" that would pick up explicit partitioners (e.g. from an explicit repartition command, or from reading Parquet that was written using partitionBy) as an approximation, though.

Reduce Spark RDD to return multiple values

I have the following RDD containing sets of items, which I would like to group by item similarity (items in the same set are considered similar; similarity is transitive, and all the items in sets that have at least one common item are also considered similar).
Input RDD:
Set(w1, w2)
Set(w1, w2, w3, w4)
Set(w5, w2, w6)
Set(w7, w8, w9)
Set(w10, w5, w8) --> The first five sets are all similar, since each of them shares at least one item with another of these sets
Set(w11, w12, w13)
I would like the above RDD to be reduced to
Set(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10)
Set(w11, w12, w13)
Any suggestions on how I could do this? I am unable to do something like the following, where I would skip merging two sets if they don't contain any common elements:
data.reduce((a,b) => if (a.intersect(b).size > 0) a ++ b ***else (a,b)***)
Thanks.
Your reduce approach is actually incorrect. For example, what if one set cannot be merged with the next set, but can still be merged with a different set in the collection?
There are probably better ways, but I came up with a solution by converting this to a graph problem and using GraphX.
import org.apache.spark.graphx.{Edge, Graph}

val data = Array(Set("w1", "w2", "w3"), Set("w5", "w6"), Set("w7"), Set("w2", "w3", "w4"))
val setRdd = sc.parallelize(data).cache
// Generate a unique id for each item to use as the vertex id in the graph
val itemToId = setRdd.flatMap(_.toSeq).distinct.zipWithUniqueId.cache
val idToItem = itemToId.map { case (item, itemId) => (itemId, item) }
// Convert to an RDD of sets of itemIds
val newSetRdd = setRdd.zipWithUniqueId
  .flatMap { case (sets, setId) =>
    sets.map { item => (item, setId) }
  }.join(itemToId).values.groupByKey().values
// Create an RDD containing the edges of the graph
val edgeRdd = newSetRdd.flatMap { set =>
  val seq = set.toSeq
  val head = seq.head
  // Add an edge from the first item to each item in a set,
  // including itself
  seq.map { item => Edge[Long](head, item) }
}
val graph = Graph.fromEdges(edgeRdd, Nil)
// Run the connected components algorithm to check which items are similar.
// Items in the same component are similar
val verticesRDD = graph.connectedComponents().vertices
verticesRDD.join(idToItem).values.groupByKey.values.collect.foreach(println)

Collect RDD to one node in sorted order

I have a large RDD which needs to be written to a single file on disk, one line for each element, the lines sorted in some defined order. So I was thinking of sorting the RDD, collect one partition at a time in the driver, and appending to the output file.
Couple of questions:
After rdd.sortBy(), do I have the guarantee that partition 0 will contain the first elements of the sorted RDD, partition 1 will contain the next elements of the sorted RDD, and so on? (I'm using the default partitioner.)
e.g.
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (p <- sortedRdd.partitions) {
  val index = p.index
  val partitionRdd = sortedRdd mapPartitionsWithIndex { case (i, values) => if (i == index) values else Iterator() }
  val partition = partitionRdd.collect()
  partition foreach { e =>
    // Append element e to file
  }
}
I understand that rdd.toLocalIterator is a more efficient way of fetching all partitions, one at a time. So same question: do I get the elements in the order given by .sortBy()?
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (e <- sortedRdd.toLocalIterator) {
  // Append element e to file
}

How do I select a range of elements in Spark RDD?

I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select elements from 60 to 80. How do I do that?
I see that RDD has a take(i: int) method, which returns the first i elements. But there is no corresponding method to take the last i elements, or i elements from the middle starting at a certain index.
I don't think there is an efficient method to do this yet. But the easy way is to use filter(). Let's say you have an RDD called pairs with key-value pairs, and you only want elements from 60 to 80 inclusive; just do:
val range60to80 = pairs.filter {
  _ match {
    case (k, v) => k >= 60 && k <= 80
    case _ => false // in case of invalid input
  }
}
I think it's possible that this could be done more efficiently in the future, by using sortByKey and saving information about the range of values mapped to each partition. Keep in mind this approach would only save anything if you were planning to query the range multiple times because the sort is obviously expensive.
From looking at the Spark source, it would definitely be possible to do efficient range queries using RangePartitioner:
// An array of upper bounds for the first (partitions - 1) partitions
private val rangeBounds: Array[K] = {
This is a private member of RangePartitioner with the knowledge of all the upper bounds of the partitions, so it would be easy to query only the necessary partitions. It looks like this is something Spark users may see in the future: SPARK-911
UPDATE: A way better answer, based on the pull request I'm writing for SPARK-911. It will run efficiently if the RDD is sorted and you query it multiple times.
import org.apache.spark.RangePartitioner

val sorted = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey().cache()
val p: RangePartitioner[Int, Int] = sorted.partitioner.get.asInstanceOf[RangePartitioner[Int, Int]]
val (lower, upper) = (10, 20)
val range = p.getPartition(lower) to p.getPartition(upper)
println(range)
val rangeFilter = (i: Int, iter: Iterator[(Int, Int)]) => {
  if (range.contains(i))
    for ((k, v) <- iter if k >= lower && k <= upper) yield (k, v)
  else
    Iterator.empty
}
for ((k, v) <- sorted.mapPartitionsWithIndex(rangeFilter, preservesPartitioning = true).collect()) println(s"$k, $v")
If having the whole partition in memory is acceptable you could even do something like this.
val glommedAndCached = sorted.glom().cache()
glommedAndCached.map(a => a.slice(a.search(lower), a.search(upper) + 1)).collect()
search is not a member of Array, BTW; I just made an implicit class that has a binary search function, not shown here.
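A hypothetical version of such an implicit class (not the author's original), assuming each glommed array is sorted by key and the bounds are present:
// Hypothetical helper, shown only to make the snippet above self-contained.
implicit class SearchableArray(arr: Array[(Int, Int)]) {
  // Binary search: index of the first element whose key is >= target.
  def search(target: Int): Int = {
    var lo = 0
    var hi = arr.length
    while (lo < hi) {
      val mid = (lo + hi) / 2
      if (arr(mid)._1 < target) lo = mid + 1 else hi = mid
    }
    lo
  }
}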
How big is your data set? You might be able to do what you need with:
data.take(80).drop(59)
This seems inefficient, but for small to medium-sized data, should work.
Is it possible to solve this in another way? What's the case for picking exactly a certain range out of the middle of your data? Would takeSample serve you better?
The following should be able to get the range. Note that the cache will save you some overhead, because internally zipWithIndex needs to scan the RDD partitions to get the number of elements in each partition.
scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3).cache
scala> val r2 = r1.zipWithIndex
scala> val r3 = r2.filter(x => x._2 > 2 && x._2 < 4).map(x => x._1)
scala> r3.foreach(println)
d
For those who stumble on this question looking for a Spark 2.x-compatible answer, you can use filterByRange.
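For example (filterByRange comes from OrderedRDDFunctions and is most efficient when the RDD is already sorted by key, since it can then prune partitions using the RangePartitioner):
val sorted = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey()
sorted.filterByRange(60, 80).collect()  // the 21 pairs with keys 60 to 80, inclusive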
