I am trying to understand mapPartitionsWithIndex in Spark. I found that the following two examples produce vastly different output:
parallel = sc.parallelize(range(1, 10), 2)

def show(index, iterator):
    yield 'index: ' + str(index) + " values: " + str(list(iterator))

parallel.mapPartitionsWithIndex(show).collect()

parallel = sc.parallelize(range(1, 10), 2)

def show(index, iterator):
    return 'index: ' + str(index) + " values: " + str(list(iterator))

parallel.mapPartitionsWithIndex(show).collect()
The only difference is whether show uses yield (making it a generator function) or return (handing back a plain string).
I guess I do not understand how mapPartitionsWithIndex combines the results from the individual partitions.
Can you please explain to me how this behavior occurs?
mapPartitionsWithIndex(self, f, preservesPartitioning=False)
The parameter f must return an iterable object.
In general, Spark should raise an error if f does not return an iterable.
But in your second case, the returned string is itself iterable, so iterator = iter(iterator) in the source code (pyspark/serializers.py, line 266) silently turns it into an iterator over its individual characters.
If you insist on using return, just wrap the string in a list: return ["I'm String"].
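To see the two behaviours side by side, here is a minimal sketch (it assumes a running SparkContext named sc, as in the question); the outputs in the comments are what you would typically see with 9 elements split over 2 partitions:
parallel = sc.parallelize(range(1, 10), 2)

def show_yield(index, iterator):
    # generator function: yields exactly one summary string per partition
    yield 'index: ' + str(index) + " values: " + str(list(iterator))

def show_return(index, iterator):
    # plain function: wrap the string in a list so Spark receives an iterable
    # with one element instead of an iterable of characters
    return ['index: ' + str(index) + " values: " + str(list(iterator))]

parallel.mapPartitionsWithIndex(show_yield).collect()
# e.g. ['index: 0 values: [1, 2, 3, 4]', 'index: 1 values: [5, 6, 7, 8, 9]']
parallel.mapPartitionsWithIndex(show_return).collect()
# same result as the yield version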
I cannot figure out why my invokeAll function does not give the correct output / work properly. Any solutions? (No futures or parallel collections allowed, and the return type needs to be Seq[Int].)
def invokeAll(work: Seq[() => Int]): Seq[Int] = {
  // this is what we should return as an output: "return res.toSeq"
  // res cannot be changed!
  val res = new Array[Int](work.length)
  var list = mutable.Set[Int]()
  var n = res.size
  val procedure = (0 until n).map(work =>
    new Runnable {
      def run {
        // add the finished element/Int to list
        list += work
      }
    }
  )
  val threads = procedure.map(new Thread(_))
  threads.foreach(x => x.start())
  threads.foreach(x => (x.join()))
  res ++ list
  // this should be the final output ("return res.toSeq")
  return res.toSeq
}
OMG, I know a Java programmer when I see one :)
Don't do this, it's not Java!
val results: Future[Seq[Int]] = Future.traverse(work)(w => Future(w()))
This is how you do it in Scala (given an implicit ExecutionContext in scope).
This gives you a Future holding the results of all the executions, which will be completed when all the work is finished. You can use .map, .flatMap, etc. to access and transform those results. For example:
val sumOfAll: Future[Int] = results.map(_.sum)
Or (in the worst case, when you just want to hand the result back to imperative code), you can block and wait on the future to get hold of the actual result (don't do this unless you are absolutely desperate): Await.result(results, 1 year)
If you want the results as an array, results.map(_.toArray) will do that ... but you really shouldn't: arrays aren't a good choice for the vast majority of use cases in Scala. Just stick with Seq.
The main problem in your code is that you are using a fixed-size array and trying to add elements to it with the ++ (concatenation) operator: res ++ list. That expression produces a new Seq, but you never store it in a val.
You could remove the last line, return res.toSeq, and see that res ++ list would then be the return value: your work.length array of zeros with the contents of list appended at the end. Try reading more about Scala collections; most of them are immutable, and it is good practice to use immutable data structures. In Scala, an Array does not accumulate values via the ++ operator on its left operand: arrays are fixed size.
I don't understand the difference between hinting Iterable and Sequence.
What is the main difference between those two and when to use which?
I think set is an Iterable but not a Sequence; is there any built-in data type that is a Sequence but not an Iterable?
def foo(baz: Sequence[float]):
    ...

# What is the difference?

def bar(baz: Iterable[float]):
    ...
The Sequence and Iterable abstract base classes (which can also be used as type annotations) follow Python's definitions of sequence and iterable. To be specific:
Iterable is any object that defines __iter__ (Python's glossary definition also admits objects that only define __getitem__, but the Iterable ABC and static type checkers only recognize __iter__).
Sequence is any object that defines __getitem__ (with integer indices) and __len__. By definition, every sequence is an iterable. The Sequence ABC also provides other methods, such as __contains__ and __reversed__, implemented in terms of those two required methods.
Some examples:
list, tuple, str are the most common sequences.
Some built-in iterables are not sequences. For example, reversed returns a reversed object (or list_reverseiterator for lists) that cannot be subscripted.
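A quick check with the collections.abc ABCs (which back the typing aliases) makes the split concrete:
from collections.abc import Iterable, Sequence

isinstance([1, 2, 3], Sequence)            # True  - a list is a sequence
isinstance((1, 2, 3), Sequence)            # True  - so is a tuple
isinstance({1, 2, 3}, Sequence)            # False - a set has no index-based __getitem__
isinstance({1, 2, 3}, Iterable)            # True  - but it is iterable
isinstance(reversed([1, 2, 3]), Sequence)  # False - just an iterator
isinstance(reversed([1, 2, 3]), Iterable)  # True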
When writing a function or method with an items argument, I often prefer Iterable over Sequence.
Here is why; I hope it helps clarify the difference.
Say my_func_1 is:
from typing import Iterable

def my_func_1(items: Iterable[int]) -> None:
    for item in items:
        ...
        if condition:
            break
    return
Iterable offers the maximum possibilities to the caller. Correct calls include:
my_func_1((1, 2, 3))        # tuple is a Sequence, Collection, Iterable
my_func_1([1, 2, 3])        # list is a MutableSequence, Sequence, Collection, Iterable
my_func_1({1, 2, 3})        # set is a Collection, Iterable
my_func_1(my_dict)          # dict is a Mapping, Collection, Iterable
my_func_1(my_dict.keys())   # dict.keys() is a KeysView, Set, Collection, Iterable
my_func_1(range(10))        # range is a Sequence, Collection, Iterable
my_func_1(x**2 for x in range(100))  # a generator: a 'strict' Iterator, i.e. neither a Collection nor a Sequence
...
... because all of them are Iterable.
The implicit message to the caller is: pass the data as-is, don't bother transforming it.
If the caller does not already hold the data as a Sequence (e.g. tuple, list) or as a non-Sequence Collection (e.g. set), it is also more efficient to pass a 'strict' Iterator, because the iteration breaks before StopIteration and unneeded items are never produced.
However, if the function's algorithm (say my_func_2) requires more than one pass, then Iterable will fail when the caller provides a 'strict' Iterator, because the first pass exhausts it. Hence use a Collection:
from typing import Collection

def my_func_2(items: Collection[int]) -> None:
    for item in items:
        ...
    for item in items:
        ...
    return
If the function's algorithm (my_func_3) has to access specific items by index, then both Iterable and Collection will fail when the caller provides a set, a Mapping or a 'strict' Iterator.
Hence use a Sequence:
from typing import Sequence

def my_func_3(items: Sequence[int]) -> int:
    return items[5]
Conclusion: the strategy is: "use the most generic type that the function can handle". Don't forget that all of this is only about typing, i.e. helping a static type checker report incorrect calls (e.g. passing a set when a Sequence is required). It is then the caller's responsibility to transform the data when necessary, such as:
my_func_3(tuple(x**2 for x in range(100)))
In practice, all of this really matters for performance once the length of items scales up.
Prefer the laziest option that works (an Iterable signature, an Iterator on the caller's side) whenever possible: performance should be handled as a daily habit, not as a firefighting exercise.
Working in that direction, you will probably face the situation where a function only handles the empty case itself and delegates everything else, and you don't want to transform items into a Collection or a Sequence. Then do something like this:
from more_itertools import spy

def my_func_4(items: Iterable[int]) -> None:
    (first, items) = spy(items)
    if not first:  # i.e. items is empty
        ...
    else:
        my_func_1(items)  # here 'items' is always a 'strict' Iterator
    return
num_of_words = (doc_title, num)                # number of words in a document
lines = (doc_title, word, num_of_occurrences)  # number of occurrences of a specific word in a document
When I called lines.join(num_of_words), I was expecting to get something like:
(doc_title,(word,num_of_occurrences,num))
but I got instead:
(doc_title,(word,num))
and num_of_occurrences was omitted. What did I do wrong here? How am I supposed to join these two RDDs to get the result I'm expecting?
From the Spark API docs for the join method:
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.
So the join method can only be used on RDDs of key/value pairs (or at least it will only return a result of the form described above).
A way to overcome this is to use tuples of the form (doc_title, (word, num_occurrences)) instead of (doc_title, word, num_occurrences).
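If lines currently holds flat 3-tuples, a small map can reshape it before the join. This is only a sketch, assuming the field order shown in the question:
# hypothetical reshape of the 3-tuple RDD into (key, value) pairs
lines_kv = lines.map(lambda t: (t[0], (t[1], t[2])))
result = lines_kv.join(num_of_words)
# each element is now (doc_title, ((word, num_of_occurrences), num))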
Working example:
num_of_words = sc.parallelize([("harry potter", 4242)])
lines = sc.parallelize([("harry potter", ("wand", 100))])
result = lines.join(num_of_words)
print result.collect()
# [('harry potter', (('wand', 100), 4242))]
(Note that sc.parallelize only turns a local python collection into a Spark RDD, and that collect() does the exact opposite)
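If you want exactly the shape from the question, (doc_title, (word, num_of_occurrences, num)), one option is to flatten the nested value afterwards with mapValues; a small sketch on top of the example above:
flat = result.mapValues(lambda v: (v[0][0], v[0][1], v[1]))
# [('harry potter', ('wand', 100, 4242))]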
I'm surprised at this output from fold; I can't imagine what it's doing.
I would expect that something.fold(0, lambda a,b: a+1) would return the number of elements in something, since the fold starts at 0 and adds 1 for each element.
sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
8
I'm coming from Scala, where fold works as the way I've described. So how is fold supposed to work in pyspark? Thanks for your thoughts.
To understand what's going on here, let's look at the definition of Spark's fold operation. Since you're using PySpark, I'm going to show the Python version of the code, but the Scala version exhibits the exact same behavior (you can also browse the source on GitHub):
def fold(self, zeroValue, op):
    """
    Aggregate the elements of each partition, and then the results for all
    the partitions, using a given associative function and a neutral "zero
    value."

    The function C{op(t1, t2)} is allowed to modify C{t1} and return it
    as its result value to avoid object allocation; however, it should not
    modify C{t2}.

    >>> from operator import add
    >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
    15
    """
    def func(iterator):
        acc = zeroValue
        for obj in iterator:
            acc = op(obj, acc)
        yield acc
    vals = self.mapPartitions(func).collect()
    return reduce(op, vals, zeroValue)
(For comparison, see the Scala implementation of RDD.fold).
Spark's fold operates by first folding each partition and then folding the results. The problem is that an empty partition gets folded down to the zero element, so the final driver-side fold ends up folding one value for every partition rather than one value for each non-empty partition. This means that the result of fold is sensitive to the number of partitions:
>>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
100
>>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
50
>>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
1
In this last case, what's happening is that the single partition is being folded down to the correct value, then that value is folded with the zero-value at the driver to yield 1.
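To make the mechanics concrete, here is a plain-Python sketch (no SparkContext needed) that mimics the quoted implementation; the nested partitions list is just a stand-in for however the RDD happens to be partitioned:
from functools import reduce

def fold_like_spark(partitions, zeroValue, op):
    # fold inside each partition ...
    vals = []
    for part in partitions:
        acc = zeroValue
        for obj in part:
            acc = op(obj, acc)
        vals.append(acc)
    # ... then fold the per-partition results on the driver
    return reduce(op, vals, zeroValue)

count = lambda a, b: a + 1
print(fold_like_spark([[1, 25], [8, 4, 2]], 0, count))            # 2   (2 partitions)
print(fold_like_spark([[1, 25, 8, 4, 2]] + [[]] * 99, 0, count))  # 100 (100 partitions, 99 of them empty)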
It seems that Spark's fold() operation actually requires the fold function to be commutative in addition to associative. There are actually other places in Spark that impose this requirement, such as the fact that the ordering of elements within a shuffled partition can be non-deterministic across runs (see SPARK-5750).
I've opened a Spark JIRA ticket to investigate this issue: https://issues.apache.org/jira/browse/SPARK-6416.
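As a footnote to the original question: if the goal was to count elements with fold, one partition-safe variant is to map every element to 1 and then fold with ordinary addition, since a zero value of 0 is genuinely neutral for that operation (a sketch, assuming the sc from above):
sc.parallelize([1, 25, 8, 4, 2], 100).map(lambda _: 1).fold(0, lambda a, b: a + b)
# 5, regardless of the number of partitions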
Let me try to give simple examples to explain Spark's fold method. I will be using PySpark here.
rdd1 = sc.parallelize(list([]), 1)
The line above creates an empty RDD with one partition.
rdd1.fold(10, lambda x,y:x+y)
This yields 20 as output.
rdd2 = sc.parallelize(list([1,2,3,4,5]),2)
The line above creates an RDD with the values 1 to 5, spread over a total of 2 partitions.
rdd2.fold(10, lambda x,y:x+y)
This yields 45 as output.
So, in the case above, for the sake of simplicity, what is happening is that the zero element is 10. The sum you would otherwise get of all the numbers in the RDD is therefore increased by 10 (zero element + all other elements => 10 + 1 + 2 + 3 + 4 + 5 = 25). On top of that, we now have two partitions, each of which also starts from the zero element (number of partitions * zero element => 2 * 10 = 20).
The final output that fold emits is 25 + 20 = 45.
Following the same reasoning, it is clear why the fold operation on rdd1 yielded 20 as output.
reduce fails when we have an empty RDD, e.g. rdd1.reduce(lambda x, y: x + y)
ValueError: Can not reduce() empty RDD
fold can be used when the RDD might be empty:
rdd1.fold(0, lambda x,y:x+y)
As expected, this yields 0 as output.
Is there a way that I can write a predicate function that will compare two strings and see which one is greater? Right now I have
def helper1(x, y):
    return x > y
However, I'm trying to use the function in this way,
new_tuple = divide((helper1(some_value, l[0]),l[1:])
Please note that the above function call is probably wrong because my helper1 is incomplete. But the gist is that I'm trying to compare items to see whether one is greater than the other; specifically, the items in l[1:] against l[0].
divide is a function that, given a predicate and a list, splits that list into a tuple of two lists based on what the predicate returns. divide is very long, so I don't think I should post it here.
So, given that a predicate should only take one parameter, how should I write mine so that it takes one parameter?
You should write a closure.
def helper(x):
    def cmp(y):
        return x > y
    return cmp
...
new_tuple = divide(helper(l[0]), l[1:])
...
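To see the closure in action end to end, here is a self-contained sketch; divide below is only a hypothetical stand-in for your (unposted) function, with the behaviour you described:
def divide(pred, items):
    # hypothetical stand-in: split items into (matches, non-matches) by the predicate
    matches = [y for y in items if pred(y)]
    rest = [y for y in items if not pred(y)]
    return (matches, rest)

l = ["m", "a", "z", "b"]
new_tuple = divide(helper(l[0]), l[1:])
# helper(l[0]) is a one-argument predicate checking l[0] > y,
# so new_tuple is (['a', 'b'], ['z']) here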