SPARK - How to force error on sc.parallelize - apache-spark

Question:
This statement always gives the right result, no matter how much parallelization is provided. Why does it always give the correct result?
Reading a big file or using the mapPartitions approach will result in a minor loss of accuracy, so why not here? It must be simple, but I cannot see it.
val rdd = sc.parallelize(Array("A", "B", "C", "D", "E", "F"),5)
rdd.sliding(2).collect()

Reading a big file or using the mapPartitions approach will result in a minor loss of accuracy,
It won't. The result is exact, independent of the source.

From Hortonworks:
sliding() keeps track of the partition index, which in this case corresponds to the ordering of the unigrams.
Compare
rdd.mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect()
and
rdd.sliding(2).mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect()
to help with the intuition.
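In a spark-shell session that comparison looks roughly like this (a sketch: sliding on an RDD comes from the mllib RDDFunctions implicit, and the exact element-to-partition assignment depends on how parallelize slices the collection):

import org.apache.spark.mllib.rdd.RDDFunctions._  // brings sliding into scope on RDDs

val rdd = sc.parallelize(Array("A", "B", "C", "D", "E", "F"), 5)

// which source element lives in which partition
rdd.mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect()

// which window is produced by which partition after sliding
rdd.sliding(2).mapPartitionsWithIndex { (i, p) => p.map { e => (i, e.toList) } }.collect()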

Related

Scala - Executing every element until they all have finished

I cannot figure out why my function invokeAll does not produce the correct output / work properly. Any solutions? (No futures or parallel collections allowed, and the return type needs to be Seq[Int].)
import scala.collection.mutable

def invokeAll(work: Seq[() => Int]): Seq[Int] = {
  // this is what we should return as the output ("return res.toSeq")
  // res cannot be changed!
  val res = new Array[Int](work.length)
  var list = mutable.Set[Int]()
  var n = res.size
  val procedure = (0 until n).map(work =>
    new Runnable {
      def run {
        // add the finished element/Int to list
        list += work
      }
    }
  )
  val threads = procedure.map(new Thread(_))
  threads.foreach(x => x.start())
  threads.foreach(x => x.join())
  res ++ list
  // this should be the final output ("return res.toSeq")
  return res.toSeq
}
OMG, I know a Java programmer when I see one :)
Don't do this, it's not Java!
val results: Future[Seq[Int]] = Future.traverse(work)(w => Future(w()))
This is how you do it in Scala.
This gives you a Future with the results of all executions, that will be satisfied when all work is finished. You can use .map, .flatMap etc. to access and transform those results. For example
val sumOfAll: Future[Int] = results.map(_.sum)
Or (in the worst case, when you want to just give the result back to imperative code), you could block and wait on the future to get ahold of the actual result (don't do this unless you are absolutely desperate): Await.result(results, 1 year)
If you want the results as array, results.map(_.toArray) will do that ... but you really should not: arrays aren't really a good choice for the vast majority of use cases in scala. Just stick with Seq.
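Putting the pieces together, a minimal sketch of that approach (assuming the standard global execution context, and blocking only at the very edge):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

def invokeAll(work: Seq[() => Int]): Seq[Int] = {
  // run every unit of work asynchronously and collect the results in input order
  val results: Future[Seq[Int]] = Future.traverse(work)(w => Future(w()))
  // block only here, at the boundary to imperative code
  Await.result(results, 1.minute)
}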
The main problem in your code is that you are using a fixed-size array and trying to add elements to it with the ++ (concatenation) operator: res ++ list. That expression produces a new Seq, but you don't store it in a val.
You could remove the last line return res.toSeq and see that res ++ list would then be the return value: your work.length array of zeros (res) with the list sequence appended at the end. Try reading more about Scala collections; most of them are immutable, and it is good practice to use immutable data structures. In Scala, an Array does not accumulate values into its left operand via the ++ operator; arrays in Scala are fixed size.
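For reference, a minimal sketch that keeps the original constraints (plain threads, no futures or parallel collections) by having each thread write its result into its own slot of res:

def invokeAll(work: Seq[() => Int]): Seq[Int] = {
  val res = new Array[Int](work.length)
  val threads = work.zipWithIndex.map { case (w, i) =>
    new Thread(new Runnable {
      // each thread writes into its own index, so there is no shared mutable
      // collection and no lost updates
      def run(): Unit = res(i) = w()
    })
  }
  threads.foreach(_.start())
  threads.foreach(_.join())
  res.toSeq
}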

takeRightWhile() method in scala

I might be missing something but recently I came across a task to get last symbols according to some condition. For example I have a string: "this_is_separated_values_5". Now I want to extract 5 as Int.
Note: number of parts separated by _ is not defined.
If I would have a method takeRightWhile(f: Char => Boolean) on a string it would be trivial: takeRightWhile(ch => ch != '_'). Moreover it would be efficient: a straightforward implementation would actually involve finding the last index of _ and taking a substring while the use of this method would save first step and provide better average time complexity.
UPDATE: Guys, all the variations of str.reverse.takeWhile(_!='_').reverse are quite inefficient, as you actually use additional O(n) space. If you want to implement takeRightWhile efficiently you could iterate starting from the right, accumulating the result in a string builder or whatever else, and return the result. I am asking about this kind of method, not the implementation, which was already described and declined in the question itself.
Question: Does this kind of method exist in scala standard library? If no, is there method combination from the standard library to achieve the same in minimum amount of lines?
Thanks in advance.
Possible solution:
str.reverse.takeWhile(_!='_').reverse
Update
You can go from right to left with the following expression using foldRight:
str.toList.foldRight(List.empty[Char]) {
  case (item, acc) => item :: acc
}
Here you need to check the condition and stop adding items once the condition is met. For this you can pass a flag through the accumulated value:
val (_, list) = str.toList.foldRight((false, List.empty[Char])) {
  case (item, (false, list)) if item != '_' => (false, item :: list)
  case (_, (_, list)) => (true, list)
}
val res = list.mkString.toInt
This solution is even more inefficient than the one with the double reverse:
The implementation of foldRight on List uses a combination of reverse and foldLeft.
You cannot break out of foldRight, so you need the flag to skip all items after the condition is met.
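For completeness, a sketch that avoids both the double reverse and the fold, using only standard-library methods (find the last character that fails the predicate and keep everything to its right):

def takeRightWhile(s: String)(p: Char => Boolean): String =
  s.drop(s.lastIndexWhere(c => !p(c)) + 1)

takeRightWhile("this_is_separated_values_5")(_ != '_')  // "5"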
I'd go with this:
val s = "string_with_following_number_42"
s.split("_").reverse.head
// res:String = 42
This is a naive attempt and by no means optimized. It splits the String into an Array of Strings, reverses it and takes the first element. Note that, because the reversing happens after the splitting, the order of the characters within each part is preserved.
I am not exactly sure about the problem you are facing. My understanding is that you have a string of the format xxx_xxx_xx_...._xxx_123 and you want to extract the part at the end as an Int.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val yourInt = yourStr.split('_').last.toInt
// But remember that the above is unsafe so you may want to take it as Option
val yourIntOpt = Try(yourStr.split('_').last.toInt).toOption
Or... let's say your requirement is to collect the right suffix while some boolean condition remains true.
import scala.util.Try
val yourStr = "xxx_xxx_xxx_xx...x_xxxxx_123"
val rightSuffix = yourStr.reverse.takeWhile(c => c != '_').reverse
val yourInt = rightSuffix.toInt
// but the above is unsafe, so
val yourIntOpt = Try(rightSuffix.toInt).toOption
Comment if your requirement is different from this.
You can use StringBuilder and lastIndexWhere.
val str = "this_is_separated_values_5"
val sb = new StringBuilder(str)
val lastIdx = sb.lastIndexWhere(ch => ch != '_')
val lastCh = str.charAt(lastIdx)

How do I split an RDD into two or more RDDs?

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1, work.split2;
    set work.preSplit;
    if (condition1)
        output work.split1
    else if (condition2)
        output work.split2
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split a RDD you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
It means only a single predicate computation but requires an additional pass over all the data.
It is important to note that as long as the input RDD is properly cached and there are no additional assumptions regarding data distribution, there is no significant difference in time complexity between repeated filter and a for-loop with nested if-else.
With N elements and M conditions, the number of operations you have to perform is clearly proportional to N times M. In the case of the for-loop it should be closer to (N + MN) / 2, and repeated filter is exactly NM, but at the end of the day it is nothing other than O(NM). You can see my discussion** with Jason Lenderman to read about some pros and cons.
At a very high level you should consider two things:
1. Spark transformations are lazy; until you execute an action your RDD is not materialized.
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example, to compute work.split2 you need to materialize both the input data and work.split1.
2. RDDs provide a declarative API. When you use filter or map it is completely up to the Spark engine how the operation is performed. As long as the functions passed to transformations are side-effect free, it creates multiple possibilities to optimize the whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map-with-filter pattern is actually used in core Spark. See my answer to How does Sparks RDD.randomSplit actually split the RDD and the relevant part of the randomSplit method.
If the only goal is to achieve a split of the input, it is possible to use the partitionBy clause of DataFrameWriter with the text output format:
def makePairs(row: T): (String, String) = ???

data
  .map(makePairs).toDF("key", "value")
  .write.partitionBy("key").format("text").save(...)
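For instance (a sketch; the column names and output path are made up), writing labelled strings this way produces one subdirectory per key value, such as key=a/, which can later be read back selectively:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val labelled = sc.parallelize(Seq(("a", "foo"), ("b", "bar"), ("a", "baz")))

labelled.toDF("key", "value")
  .write.partitionBy("key").format("text").save("/tmp/split")  // hypothetical output path

// each split can then be read independently
val onlyA = spark.read.text("/tmp/split/key=a")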
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
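For instance, familiar operations fall into these three shapes (an illustrative sketch only):

import org.apache.spark.rdd.RDD

val xs: RDD[Int] = sc.parallelize(1 to 10)
val sameType: RDD[Int]   = xs.filter(_ % 2 == 0)    // RDD[T] => RDD[T]
val newType: RDD[String] = xs.map(_.toString)       // RDD[T] => RDD[U]
val combined: RDD[(Int, String)] = xs.zip(newType)  // (RDD[T], RDD[U]) => RDD[W]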
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
Methods are available from the open-source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}

def flatMuxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
    persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
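A sketch of how flatMuxPartitions (as defined above) could split an RDD of Ints into evens and odds in a single pass; it assumes silex's implicit enrichment is in scope so the method is available on plain RDDs, and the names are illustrative:

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 20)
val Seq(evens, odds) = data.flatMuxPartitions(2, (id: Int, it: Iterator[Int]) => {
  val elems = it.toVector
  Seq(elems.filter(_ % 2 == 0), elems.filter(_ % 2 != 0))
}, StorageLevel.MEMORY_ONLY)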
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
mapPartitions can then be used to construct multiple RDDs from the partitioned RDD without reading all the data.
import org.apache.spark.TaskContext

val filtered = partitioned.mapPartitions { iter =>
  new Iterator[Int] {
    // only emit elements when this task's partition is one we want to keep
    override def hasNext: Boolean =
      if (rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) iter.hasNext
      else false
    override def next(): Int = iter.next()
  }
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD, so a coalesce should be used to reduce this down and remove the empty partitions.
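For example (a sketch, reusing the hypothetical names from above):

val compacted = filtered.coalesce(rangeOfPartitionsToKeep.size)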
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)

Apply ZipWithIndex on a RDD of key-value pairs

I have an RDD of key-value pairs like RDD[(Int, String)] and I want to apply zipWithIndex for each key. Is there a way to do this?
For example, if I have an RDD of kvPairs like
(0,10),(0,20),(0,30),(0,40),(1,11),(1,21),(1,31),(1,41) ...
I want the output to be like
((0,10),1),((0,20),2),((0,30),3),((0,40),4),
((1,11),1),((1,21),2),((1,31),3),((1,41),4) ...
The RDD can have any key-value combination. First I want to apply sortByKey and then zipWithIndex as above.
Thanks !!
Something like this?
rdd
  .groupByKey
  .flatMap { case (k, vs) =>
    vs
      .toList
      .sortBy(_.toInt) // assuming this is the expected order
      .zipWithIndex
      .map { case (v, i) => ((k, v), i + 1) }
  }
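A quick check with the example data (a sketch; note that groupByKey materializes all the values of a key in memory, so it assumes each group is reasonably small):

val kvPairs = sc.parallelize(Seq(
  (0, "10"), (0, "20"), (0, "30"), (0, "40"),
  (1, "11"), (1, "21"), (1, "31"), (1, "41")))

kvPairs
  .groupByKey
  .flatMap { case (k, vs) =>
    vs.toList.sortBy(_.toInt).zipWithIndex.map { case (v, i) => ((k, v), i + 1) }
  }
  .collect()
// e.g. ((0,10),1), ((0,20),2), ((0,30),3), ((0,40),4), ((1,11),1), ((1,21),2), ...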

Reading and learning Spark API?

I am learning Spark by example, but I don't know a good way to understand the API. For instance, the very classic word count example:
val input = sc.textFile("README.md")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
When I read the reduceByKey API, I see:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
The API states: Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
In the programming guide: When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
Ok, through the example I know (x, y) is (V, V), and that should be the value part of the map. I give a function to compute the V and I get RDD[(K, V)]. My question is: in this example, in reduceByKey(func: (V, V) ⇒ V), why two Vs? Are the 1st and 2nd V in (V, V) the same or not?
I guess I am asking this question (and using this question title) because I still don't know how to correctly read the API, or I am just missing some basic Spark concept?!
In the code below:
reduceByKey((x, y) => x + y)
you could read it, for more clarity, as something like this:
reduceByKey((sum, addend) => sum + addend)
So, for every key, that function is applied repeatedly over every element with that key.
Basically, (func: (V, V) ⇒ V) means that you have a function with two inputs of a certain type (let's say Int) which returns a single output of the same type.
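Concretely, for a key whose values are 2, 20 and 2, the engine keeps combining two Vs into a single V until one value is left per key (which is also why the function must be associative):

// the per-key behaviour is equivalent to a plain reduce over that key's values
Seq(2, 20, 2).reduce((x, y) => x + y)  // f(f(2, 20), 2) = 24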
Usually the data sets will be of the form ("key1",val11),("key2",val21),("key1",val12),("key2",val22), ... and so on.
The same key will appear with multiple values in the RDD[(K,V)].
When you use reduceByKey, the function is applied across the values of each key.
For example consider the following program
val data = Array(("key1",2),("key1",20),("key2",21),("key1",2),("key2",10),("key2",33))
val rdd = sc.parallelize(data)
val res = rdd.reduceByKey((x,y) => x+y)
res.foreach(println)
You will get the output as
(key2,64)
(key1,24)
Here the sequence of values is passed to the function, pairwise. For key1 -> (2, 20, 2).
In the end, you will have a single value for each key.
You could use spark shell to try out the APIs.
