Spark reduce function: understand how it works

I am taking this course.
It says that the reduce operation on an RDD is done one machine at a time. That means if your data is split across 2 computers, the function below will work on the data in the first computer, find the result for that data, then take a single value from the second machine, run the function, and continue that way until it finishes with all the values from machine 2. Is this correct?
I thought the function would start operating on both machines at the same time and then, once it had results from the 2 machines, it would run the function one last time.
rdd1=rdd.reduce(lambda x,y: x+y)
Update 1 --------------------------------------------
Will the steps below give a faster answer than the reduce function?
collData = sc.parallelize([3, 5, 4, 7, 4])
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
collData.aggregate(0, seqOp, combOp)
Update 2 -----------------------------------
Should both pieces of code below execute in the same amount of time? I checked, and it seems that both take the same time.
import datetime
data=range(1,1000000000)
distData = sc.parallelize(data,4)
print(datetime.datetime.now())
a=distData.reduce(lambda x,y:x+y)
print(a)
print(datetime.datetime.now())
seqOp = (lambda x, y: x+y)
combOp = (lambda x, y: x+y)
print(datetime.datetime.now())
b=distData.aggregate(0, seqOp, combOp)
print(b)
print(datetime.datetime.now())

reduce behavior differs a little bit between native (Scala) and guest languages (Python) but simplifying things a little:
1. each partition is processed sequentially, element by element
2. multiple partitions can be processed at the same time, either by a single worker (multiple executor threads) or by different workers
3. partial results are fetched to the driver, where the final reduction is applied (this is the part that has a different implementation in PySpark and Scala)
Since it looks like you're using Python, let's take a look at the code:
reduce creates a simple wrapper for a user provided function:
def func(iterator):
...
This wrapper is passed to mapPartitions:
vals = self.mapPartitions(func).collect()
It should be obvious that this code is embarrassingly parallel and doesn't care how the results are utilized.
Collected vals are reduced sequentially on the driver using standard Python reduce:
reduce(f, vals)
where f is the function passed to RDD.reduce.
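Putting those pieces together, a simplified sketch of the whole flow might look like this (rdd_reduce and py_reduce are illustrative names, not the actual PySpark source):
from functools import reduce as py_reduce

def rdd_reduce(rdd, f):
    # roughly what RDD.reduce does in PySpark, simplified
    def func(iterator):
        it = iter(iterator)
        try:
            initial = next(it)            # empty partitions contribute nothing
        except StopIteration:
            return
        yield py_reduce(f, it, initial)   # reduce one partition locally

    vals = rdd.mapPartitions(func).collect()  # one partial result per partition
    return py_reduce(f, vals)                 # final merge, sequentially, on the driver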
In comparison Scala will merge partial results asynchronously as they come from the workers.
In the case of treeReduce, step 3 can be performed in a distributed manner as well. See Understanding treeReduce() in Spark.
To summarize: reduce, excluding driver-side processing, uses exactly the same mechanisms (mapPartitions) as basic transformations like map or filter, and provides the same level of parallelism (once again excluding driver code). If you have a large number of partitions or f is expensive, you can parallelize / distribute the final merging using the tree* family of methods, as in the sketch below.
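For example, on the RDD from Update 2 (reusing the seqOp and combOp defined there), switching to a tree-based reduction is a one-line change; depth controls how many levels of partial aggregation happen on the executors before the driver sees anything, and 2 is the default:
a = distData.treeReduce(lambda x, y: x + y, depth=2)
b = distData.treeAggregate(0, seqOp, combOp, depth=2)
Both are available on plain RDDs and accept the same functions as reduce / aggregate.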

Related

Pyspark - Loop n times - Each loop gets gradually slower

So basically I want to loop n times through my DataFrame and apply a function in each loop
(perform a join).
My test DataFrame has about 1000 rows, and in each iteration exactly one column is added.
The first three loops perform instantly, and from then on it gets really, really slow.
The 10th loop, e.g., needs more than 10 minutes.
I don't understand why this happens, because my DataFrame won't grow larger in terms of rows.
If I call my functions with n=20, e.g., the join performs instantly.
But when I loop iteratively 20 times, it gets stuck soon.
Do you have any idea what can potentially cause this problem?
Example code from Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller:
import time
from pyspark import SparkContext

sc = SparkContext()

def push_and_pop(rdd):
    # two transformations: moves the head element to the tail
    first = rdd.first()
    return rdd.filter(
        lambda obj: obj != first
    ).union(
        sc.parallelize([first])
    )

def serialize_and_deserialize(rdd):
    # perform a collect() action to evaluate the rdd and create a new instance
    return sc.parallelize(rdd.collect())

def do_test(serialize=False):
    rdd = sc.parallelize(range(1000))
    for i in xrange(25):
        t0 = time.time()
        rdd = push_and_pop(rdd)
        if serialize:
            rdd = serialize_and_deserialize(rdd)
        print "%.3f" % (time.time() - t0)

do_test()
I have fixed this issue by converting the DataFrame to an RDD and back to a DataFrame every n iterations.
The code runs fast now. But I don't understand what exactly causes this. The explain plan seems to grow very quickly during the iterations if I don't do the conversion.
This fix is also described in the book "High Performance Spark", which recommends the same workaround:
While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple workaround for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration.
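A minimal sketch of that workaround, assuming a DataFrame df that gets one join per iteration (df, other_df, spark and n are placeholders for your own objects):
for i in range(n):
    df = df.join(other_df, on="id", how="left")    # the per-iteration work
    # round-trip through an RDD to truncate the accumulated Catalyst plan
    df = spark.createDataFrame(df.rdd, df.schema)
Doing the round trip only every few iterations, as described above, keeps the plan small while avoiding the conversion cost on every single pass.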

Spark short-circuiting, sorting, and lazy maps

I'm working on an optimization problem that involves minimizing an expensive map operation over a collection of objects.
The naive solution would be something like
rdd.map(expensive).min()
However, the map function returns values that are guaranteed to be >= 0. So, if any single result is 0, I can take that as the answer and don't need to compute the rest of the map operations.
Is there an idiomatic way to do this using Spark?
Is there an idiomatic way to do this using Spark?
No. If you're concerned with low level optimizations like this one, then Spark is not the best option. It doesn't mean it is completely impossible.
You can, for example, try something like this:
rdd.cache()
(min_value, ) = rdd.filter(lambda x: x == 0).take(1) or [rdd.min()]
rdd.unpersist()
Or short-circuit the partitions:
def min_part(xs):
    min_ = None
    for x in xs:
        min_ = min(x, min_) if min_ is not None else x
        if x == 0:
            return [0]
    return [min_] if min_ is not None else []

rdd.mapPartitions(min_part).min()
Both will usually execute more work than strictly required, each giving a slightly different performance profile, but they can skip evaluating some records. With rare zeros the first one might be better.
You can even listen to accumulator updates and use sc.cancelJobGroup once a 0 is seen. Here is one example of a similar approach: Is there a way to stream results to driver without waiting for all partitions to complete execution?
If "expensive" is really expensive, maybe you can write the result of "expensive" to, say, SQL (Or any other storage available to all the workers).
Then in the beginning of "expensive" check the number currently stored, if it is zero return zero from "expensive" without performing the expensive part.
You can also do this localy for each worker which will save you a lot of time but won't be as "global".
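A hedged sketch of that idea, where expensive is the original costly function and zero_seen() / mark_zero_seen() are hypothetical helpers backed by whatever storage all the workers can reach (a database table, a key-value store, a file on shared storage):
def expensive_or_skip(x):
    if zero_seen():            # cheap lookup before doing the real work
        return 0
    result = expensive(x)      # the original expensive computation
    if result == 0:
        mark_zero_seen()       # tell the other workers they can stop trying
    return result

min_value = rdd.map(expensive_or_skip).min()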

When to use mapPartitions and mapPartitionsWithIndex?

The PySpark documentation describes two functions:
mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD.
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> def f(iterator): yield sum(iterator)
>>> rdd.mapPartitions(f).collect()
[3, 7]
And ...
mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD,
while tracking the index of the original partition.
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> def f(splitIndex, iterator): yield splitIndex
>>> rdd.mapPartitionsWithIndex(f).sum()
6
What use cases do these functions attempt to solve? I can't see why they would be required.
To answer this question we need to compare map with mapPartitions/mapPartitionsWithIndex (mapPartitions and mapPartitionsWithIndex pretty much do the same thing except with mapPartitionsWithIndex you can track which partition is being processed).
Now, mapPartitions and mapPartitionsWithIndex are used to optimize the performance of your application. Just for the sake of understanding, let's say all the elements in your RDD are XML elements and you need a parser to process each of them. So you need an instance of a good parser class to move ahead. You could do it in two ways:
map + foreach: In this case, an instance of the parser class will be created for each element, the element will be processed, and then the instance will be destroyed in time, but it will not be reused for other elements. So if you are working with an RDD of 12 elements distributed among 4 partitions, the parser instance will be created 12 times. And as you know, creating an instance is a very expensive operation, so it will take time.
mapPartitions/mapPartitionsWithIndex: These two methods address the above situation. They work on partitions, not on individual elements (please don't get me wrong, all elements will still be processed). These methods create the parser instance once per partition. And as you have only 4 partitions, the parser instance will be created 4 times (8 fewer instances than with map in this example). But the function you pass to these methods must take an Iterator object (to receive all the elements of a partition at once). So in the case of mapPartitions and mapPartitionsWithIndex, the parser instance is created, all the elements of the current partition are processed, and then the instance is destroyed later by GC. You will notice that they can improve the performance of your application significantly.
So the bottom line is: whenever you see that some operation is common to all elements, and you could in principle do it once and process all of them, it's better to go with mapPartitions/mapPartitionsWithIndex, as in the sketch below.
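A minimal sketch of the XML-parser example described above, where Parser is a hypothetical, expensive-to-construct class:
def parse_partition(records):
    parser = Parser()               # created once per partition, not per element
    for record in records:
        yield parser.parse(record)  # the same instance handles every element

parsed = xml_rdd.mapPartitions(parse_partition)

def parse_partition_with_index(split_index, records):
    parser = Parser()               # again, one instance per partition
    for record in records:
        yield (split_index, parser.parse(record))  # also keep the partition index

parsed_with_index = xml_rdd.mapPartitionsWithIndex(parse_partition_with_index)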
Please find the below two links for explanations with code example:
https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/
http://apachesparkbook.blogspot.in/2015/11/mappartition-example.html

Lazy foreach on a Spark RDD

I have a big RDD of Strings (obtained through a union of several sc.textFile(...)).
I now want to search for a given string in that RDD, and I want the search to stop when a "good enough" match has been found.
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD, regardless of whether the match has been reached.
Is there a way to short-circuit this process and avoid iterating through the whole RDD?
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD
Actually, you're wrong. The Spark engine is smart enough to optimize computations if you limit the results (using take or first):
from __future__ import print_function

import numpy as np

np.random.seed(323)
acc = sc.accumulator(0)

def good_enough(x, threshold=7000):
    global acc
    acc += 1
    return x > threshold

rdd = sc.parallelize(np.random.randint(0, 10000) for i in xrange(1000000))
x = rdd.filter(good_enough).first()
Now let's check the accumulator:
>>> print("Checked {0} items, found {1}".format(acc.value, x))
Checked 6 items, found 7109
and just to be sure everything works as expected:
acc = sc.accumulator(0)
rdd.filter(lambda x: good_enough(x, 100000)).take(1)
assert acc.value == rdd.count()
The same thing could be done, probably in a more efficient manner, using DataFrames and a UDF.
Note: In some cases it is even possible to use an infinite sequence in Spark and still get a result. You can check my answer to Spark FlatMap function for huge lists for an example.
Not really. There is no find method, as in the Scala collections that inspired the Spark APIs, which would stop looking once an element is found that satisfies a predicate. Probably your best bet is to use a data source that will minimize excess scanning, like Cassandra, where the driver pushes down some query parameters. You might also look at the more experimental Berkeley project called BlinkDB.
Bottom line, Spark is designed more for scanning data sets, like MapReduce before it, rather than traditional database-like queries.

Spark: getting cumulative frequency from frequency values

My question is rather simple to answer in a single-node environment, but I don't know how to do the same thing in a distributed Spark environment. What I have now is a "frequency plot", in which for each item I have the number of times it occurs. For instance, it may be something like this: (1, 2), (2, 3), (3, 1), which means that 1 occurred 2 times, 2 occurred 3 times, and so on.
What I would like to get is the cumulative frequency for each item, so the result I would need from the instance data above is: (1, 2), (2, 3+2=5), (3, 1+3+2=6).
So far, I have tried to do this by using mapPartitions, which gives the correct result if there is only one partition... otherwise obviously not.
How can I do that?
Thanks.
Marco
I don't think what you want is possible as a distributed transformation in Spark unless your data is small enough to be aggregated into a single partition. Spark functions work by distributing jobs to remote processes, and the only way to communicate back is using an action which returns some value, or using an accumulator. Unfortunately, accumulators can't be read by the distributed jobs, they're write-only.
If your data is small enough to fit in memory on a single partition/process, you can coalesce(1), and then your existing code will work. If not, but a single partition will fit in memory, then you might use a local iterator:
var total = 0L
rdd.sortBy(_._1).toLocalIterator.foreach(tuple => {
  total = total + tuple._2;
  println((tuple._1, total)) // or write to local file
})
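Going back to the coalesce(1) route, a hedged PySpark sketch (freq_rdd stands for the (item, count) RDD from the question):
def cumulative(pairs):
    total = 0
    for item, count in sorted(pairs):   # sort by item within the single partition
        total += count
        yield (item, total)

result = freq_rdd.coalesce(1).mapPartitions(cumulative).collect()
# e.g. [(1, 2), (2, 5), (3, 6)] for the sample data above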
If I understood your question correctly, it really looks like a fit for one of the combiner functions – take a look at the different versions of the aggregateByKey or reduceByKey functions, both available on pair RDDs.
