How to guarantee dataset is split over unique partitions - apache-spark

I have a data type like this: case class Data(col: String, ...), and a Dataset[Data] ds. Some rows have the column filled with the value 'a', others with 'b', etc.
I want to process separately all data with a 'a', and all data with a 'b'.
But I also need to have all the 'a' in the same partition.
Question 1:
If I do:
ds.repartition(col("col")).mapPartitions(data => ???)
is it guaranteed that I will have all the 'a' rows in a single partition, with no 'b' rows mixed into that partition?
I can also do this to force the number of partitions:
val nbDistinct = ds.select("col").distinct.count
ds.repartition(nbDistinct.toInt, col("col")).mapPartitions(data => ???)
But it adds an action that may be expensive in some cases.
Question 2: Is there a good way to get these guarantees?
Thanks!

All the 'a' will be in the same partition, but 'a' and 'b' may be mixed.
Even using nbDistinct, it is not enough to guarantee that the dataset is split over unique partitions, so the code should rather be:
val nbDistinct = ds.select("col").distinct.count
ds.repartition(nbDistinct.toInt, col("col")).mapPartitions { data =>
  // a partition may still contain several values of col, so split them with a groupBy:
  data.toSeq.groupBy(_.col).iterator.flatMap { case (col, rows) => ??? }
}
Another option would be to use groupBy / groupByKey, as sketched below.
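For completeness, here is a minimal sketch of the groupByKey route (the extra value field and the processing body are placeholder assumptions, not the asker's actual logic); flatMapGroups hands you every row for a given key together, regardless of how the data is physically partitioned:

import org.apache.spark.sql.SparkSession

case class Data(col: String, value: Int) // "value" is an assumed extra field

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val ds = Seq(Data("a", 1), Data("a", 2), Data("b", 3)).toDS()

val processed = ds
  .groupByKey(_.col)
  .flatMapGroups { (key, rows) =>
    // rows is an Iterator[Data] containing every record for this key
    rows.map(r => (key, r.value)) // placeholder processing
  }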

Related

How can you store the results from a forEach in Spark

DataSet#foreach(f) applies the function f to each row in the dataset. In a clustered environment, the data is split across the cluster. How can the results from each of these functions be collected?
For example, say the function would count the number of characters stored in each row. How can you create a DataSet or RDD that contains the results of each of these functions applied to each row?
The definition of foreach looks something like:
final def foreach(f: (A) ⇒ Unit): Unit
f: the function that is applied, for its side effect, to every element.
The result of function f is discarded.
foreach in Scala is generally used for functions that are applied for their side effects, e.g. printing to STDOUT.
If you want to return something by applying a particular function, you'll have to use map:
final def map[B](f: (A) ⇒ B): List[B]
I copied the syntax from the documentation for List but it'll be something similar for RDDs as well.
As you can see, it applies the function f to elements of type A and returns a collection of type B, where A and B can be the same type.
val rdd = sc.parallelize(Array("String1", "String2", "String3"))
scala> rdd.foreach(x => (x, x.length) )
// Nothing happens
rdd.map(x => (x, x.length) ).collect
// Array[(String, Int)] = Array((String1,7), (String2,7), (String3,7))

How do I split an RDD into two or more RDDs?

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1, work.split2;
set work.preSplit;
if (condition1)
output work.split1
else if (condition2)
output work.split2
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split an RDD, you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
This means only a single predicate computation per element, but it requires an additional pass over all the data.
It is important to note that as long as the input RDD is properly cached and there are no additional assumptions regarding data distribution, there is no significant difference in time complexity between repeated filters and a for-loop with nested if-else.
With N elements and M conditions, the number of operations you have to perform is clearly proportional to N times M. In the case of a for-loop it should be closer to (N + MN) / 2, and repeated filters are exactly NM, but at the end of the day it is nothing other than O(NM). You can see my discussion** with Jason Lenderman to read about some pros and cons.
At a very high level you should consider two things:
Spark transformations are lazy; until you execute an action your RDD is not materialized.
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example to compute work.split2 you need to materialize both input data and work.split1.
RDDs provide a declarative API. When you use filter or map, it is completely up to the Spark engine how the operation is performed. As long as the functions passed to transformations are side-effect free, this creates multiple possibilities to optimize the whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map-with-filter pattern is actually used in core Spark. See my answer to How does Spark's RDD.randomSplit actually split the RDD and the relevant part of the randomSplit method.
If the only goal is to achieve a split of the input, it is possible to use the partitionBy clause of DataFrameWriter with the text output format:
def makePairs(row: T): (String, String) = ???
data
.map(makePairs).toDF("key", "value")
.write.partitionBy("key").format("text").save(...)
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
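To illustrate the three shapes from the * footnote above, a quick sketch of my own (assuming an existing SparkContext named sc; these examples are not part of the original answer):

import org.apache.spark.rdd.RDD

val xs: RDD[Int]    = sc.parallelize(1 to 10)
val ys: RDD[String] = sc.parallelize(Seq("a", "b"))

val sameType:  RDD[Int]           = xs.filter(_ % 2 == 0) // RDD[T] => RDD[T]
val otherType: RDD[String]        = xs.map(_.toString)    // RDD[T] => RDD[U]
val combined:  RDD[(Int, String)] = xs.cartesian(ys)      // (RDD[T], RDD[U]) => RDD[W]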
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
The methods are available from the open-source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}

def flatMuxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
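For example, based purely on the signatures shown above, a hypothetical two-way split with flatMuxPartitions might look like this (assuming an RDD[Int] named rdd and that the silex implicits adding these methods to RDD are in scope):

import org.apache.spark.storage.StorageLevel

// f receives the partition id and its iterator and must return one
// collection per output RDD (here n = 2: evens first, odds second)
val Seq(evens, odds) = rdd.flatMuxPartitions(2, (id: Int, itr: Iterator[Int]) => {
  val (e, o) = itr.toVector.partition(_ % 2 == 0)
  Seq(e, o)
}, StorageLevel.MEMORY_ONLY)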
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
mapPartitions can then be used to construct multiple RDDs from the partitioned RDD without reading all of the data.
val filtered = partitioned.mapPartitions { iter =>
  new Iterator[Int]() {
    override def hasNext: Boolean = {
      // only emit elements from the partitions we want to keep
      if (rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) {
        iter.hasNext
      } else {
        false
      }
    }
    override def next(): Int = iter.next()
  }
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD so a coalesce should be used to reduce this down and remove the empty partitions.
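A minimal sketch of the custom-partitioner step referenced above (the SplitPartitioner name, the two-way condition, and the keyBy trick are illustrative assumptions, not the exact code the answer had in mind):

import org.apache.spark.Partitioner

class SplitPartitioner(condition: Int => Boolean) extends Partitioner {
  override def numPartitions: Int = 2
  // partition 0 receives keys matching the condition, partition 1 the rest
  override def getPartition(key: Any): Int =
    if (condition(key.asInstanceOf[Int])) 0 else 1
}

// key each record by itself, route it with the custom partitioner, then drop the keys;
// rangeOfPartitionsToKeep then selects one of the splits in the mapPartitions above
val partitioned = rdd.keyBy(identity).partitionBy(new SplitPartitioner(_ % 2 == 0)).values
val rangeOfPartitionsToKeep = Set(0)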
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)

Spark RDD operation like top returning a smaller RDD

I am looking for a Spark RDD operation like top or takeOrdered, but that returns another RDD, not an Array, that is, does not collect the full result to RAM.
It can be a sequence of operations, but ideally, in no step trying to collect the full result into the memory of a single node.
Let's say you want to have the top 50% of an RDD.
def top50(rdd: RDD[(Double, String)]) = {
val sorted = rdd.sortByKey(ascending = false)
val partitions = sorted.partitions.size
// Throw away the contents of the lower partitions.
sorted.mapPartitionsWithIndex { (pid, it) =>
if (pid <= partitions / 2) it else Iterator.empty
}
}
This is an approximation — you may get more or less than 50%. You could do better but it would cost an extra evaluation of the RDD. For the use cases I have in mind this would not be worth it.
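For reference, a hedged sketch of the "do better" variant mentioned above: count the RDD first, then keep exactly the top half by rank. The extra count and the zipWithIndex are the additional evaluations of the RDD that it would cost.

import org.apache.spark.rdd.RDD

def top50Exact(rdd: RDD[(Double, String)]): RDD[(Double, String)] = {
  val n = rdd.count() // extra pass over the data
  rdd.sortByKey(ascending = false)
    .zipWithIndex()                            // attach a global rank after sorting
    .filter { case (_, rank) => rank < n / 2 } // keep exactly the top half
    .map(_._1)
}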
Take a look at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
val rdd: RDD[(String, Int)] // the String is the key, the Int is the value
val topByKey: RDD[(String, Array[Int])] = rdd.topByKey(n)
Or use aggregate with BoundedPriorityQueue.
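A rough sketch of that aggregate idea; since Spark's internal BoundedPriorityQueue is not part of the public API, this stand-in keeps a small sorted Vector of at most n elements (less efficient, but it shows the shape of the two aggregate functions). Note that, unlike the approaches above, this returns the n results to the driver:

import org.apache.spark.rdd.RDD

def topN(rdd: RDD[(Double, String)], n: Int): Seq[(Double, String)] = {
  // fold one element into a per-partition accumulator, keeping only the n largest keys
  def insert(acc: Vector[(Double, String)], e: (Double, String)) =
    (acc :+ e).sortBy(-_._1).take(n)
  // merge two per-partition accumulators
  def merge(a: Vector[(Double, String)], b: Vector[(Double, String)]) =
    (a ++ b).sortBy(-_._1).take(n)
  rdd.aggregate(Vector.empty[(Double, String)])(insert, merge)
}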

spark scala most efficient way to do partial string count

I have a question about the most efficient way to do a partial string match in a Spark RDD (or Scala Array) of length 10 million. Consider the following:
val set1 = Array("star wars", "ipad") // these are the strings I am looking for
val set2 = sc.parallelize(Seq(
  ("user1", "star wars 7 is coming out"),
  ("user1", "where to watch star wars"),
  ("user2", "star wars"),
  ("user2", "cheap ipad")))
I want to be able to count the number of occurrences of each string that belongs in Set1 that also occurs in Set2. So the result should be something like:
Result = ("star wars", 3),("ipad", 1)
I also want to count the number of users (i.e. distinct users) who have searched for the term, so the result should be:
Result = ("star wars", 2), ("ipad", 1)
I tried two methods: the first involves converting the RDD strings to sets, using flatMapValues and then doing a join operation, but it is memory-consuming. The other method I was considering is a regex approach, since only the count is needed and the exact strings are given, but I don't know how to make it efficient (by defining a function and calling it when I map over the RDD?).
I seem to be able to do this quite easily in pgsql using LIKE, but I am not sure if there is an RDD join that works the same way.
Any help would be greatly appreciated.
As advised by Yijie Shen, you could use regular expressions:
val regex = set1.mkString("(", "|", ")").r
val results = set2.flatMap {
  case (user, str) => regex.findAllIn(str).map(user -> _)
}
val count = results.map(_._2).countByValue()
val byUser = results.distinct().map(_._2).countByValue()

Are multiple reduceByKey on the same RDD compiled into a single scan?

Suppose I have an RDD (50M records/day) which I want to summarize in several different ways.
The RDD records are 4-tuples: (keep, foo, bar, baz).
keep - boolean
foo, bar, baz - 0/1 int
I want to count how many of each of the foo &c are kept and dropped, i.e., I have to do the following for foo (and the same for bar and baz):
# r = (keep, foo, bar, baz)
rdd.filter(lambda r: r[1] == 1) \
   .map(lambda r: (r[0], 1)) \
   .reduceByKey(operator.add)
which would return (after collect) a list like [(True,40000000),(False,10000000)].
The question is: is there an easy way to avoid scanning rdd 3 times (once for each of foo, bar, baz)?
What I mean is not a way to rewrite the above code to handle all 3 fields, but telling spark to process all 3 pipelines in a single pass.
It's possible to execute the three pipelines in parallel by submitting the job with different threads, but this will pass through the RDD three times and require up to 3x more resources on the cluster.
It's possible to get the job done in one pass by rewriting the job to handle all counts at once - the answer regarding aggregate is one option. Splitting the data into pairs (keep, foo), (keep, bar), (keep, baz) would be another.
It's not possible to get the job done in one pass without any code changes, as there would be no way for Spark to know that those jobs relate to the same dataset. At most, the speed of the subsequent jobs after the first one can be improved by caching the initial rdd with rdd.cache before the .filter().map().reduce() steps; this will still pass through the RDD 3 times, but the 2nd and 3rd passes will potentially be a lot faster if all the data fits in the memory of the cluster:
rdd.cache
// first reduceByKey action will trigger the cache and rdd data will be kept in memory
val foo = rdd.filter(fooFilter).map(fooMap).reduceByKey(???)
// subsequent operations will execute faster as the rdd is now available in mem
val bar = rdd.filter(barFilter).map(barMap).reduceByKey(???)
val baz = rdd.filter(bazFilter).map(bazMap).reduceByKey(???)
If I were doing this, I would create pairs of the relevant data and count them in a single pass:
// Split each initial tuple into pairs keyed by the data type ("foo", "bar", "baz") and the keep flag.
// dataPairs will contain data like: (("bar",true),1), (("foo",false),1)
val dataPairs = rdd.flatMap { case (keep, foo, bar, baz) =>
def condPair(name:String, x:Int):Option[((String,Boolean), Int)] = if (x==1) Some(((name,keep),x)) else None
Seq(condPair("foo",foo), condPair("bar",bar), condPair("baz",baz)).flatten
}
val totals = dataPairs.reduceByKey(_ + _)
This is easy and will pass over the data only once, but it requires rewriting the code. I'd say it scores 66.66% in answering the question.
If I'm reading your question correctly, you want RDD.aggregate.
val zeroValue = (0L, 0L, 0L, 0L, 0L, 0L) // tfoo, tbar, tbaz, ffoo, fbar, fbaz
rdd.aggregate(zeroValue)(
(prior, current) => if (current._1) {
(prior._1 + current._2, prior._2 + current._3, prior._3 + current._4,
prior._4, prior._5, prior._6)
} else {
(prior._1, prior._2, prior._3,
prior._4 + current._2, prior._5 + current._3, prior._6 + current._4)
},
(left, right) =>
(left._1 + right._1,
left._2 + right._2,
left._3 + right._3,
left._4 + right._4,
left._5 + right._5,
left._6 + right._6)
)
Aggregate is conceptually like a reduce over a list, but RDDs aren't lists; they're distributed. So you provide two function arguments: one to fold the elements of each partition into an accumulator, and one to combine the results of the individual partitions.
