Spark short-circuiting, sorting, and lazy maps - apache-spark

I'm working on an optimization problem that involves minimizing an expensive map operation over a collection of objects.
The naive solution would be something like
rdd.map(expensive).min()
However, the map function returns values that are guaranteed to be >= 0. So, if any single result is 0, I can take that as the answer and do not need to compute the rest of the map operations.
Is there an idiomatic way to do this using Spark?

Is there an idiomatic way to do this using Spark?
No. If you're concerned with low-level optimizations like this one, Spark is not the best option. That doesn't mean it is completely impossible, though.
You can, for example, try something like this:
rdd.cache()
# probe for a zero first; fall back to a full min() only if none is found
(min_value, ) = rdd.filter(lambda x: x == 0).take(1) or [rdd.min()]
rdd.unpersist()
or short-circuit individual partitions:
def min_part(xs):
    min_ = None
    for x in xs:
        min_ = min(x, min_) if min_ is not None else x
        if x == 0:
            # a zero cannot be beaten, so stop scanning this partition
            return [0]
    return [min_] if min_ is not None else []

rdd.mapPartitions(min_part).min()
Both will usually do more work than strictly required, and each gives a slightly different performance profile, but they can skip evaluating some records. With rare zeros the first one might be better.
You can even listen to accumulator updates and use sc.cancelJobGroup once a 0 is seen. Here is one example of a similar approach: Is there a way to stream results to driver without waiting for all partitions to complete execution?
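A rough sketch of that accumulator-plus-cancelJobGroup idea in PySpark follows. Everything here (the job group name, the polling interval, the expensive function itself) is illustrative, and how job-group properties propagate across Python threads can vary between Spark versions, so treat it as a starting point rather than a recipe:

import threading
import time

zero_seen = sc.accumulator(0)
result_holder = []

def expensive_with_flag(x):
    result = expensive(x)          # the costly map function from the question
    if result == 0:
        zero_seen.add(1)           # the update reaches the driver as tasks finish
    return result

def run_job():
    sc.setJobGroup("min-search", "minimize expensive map")
    try:
        result_holder.append(rdd.map(expensive_with_flag).min())
    except Exception:
        pass                       # the job may have been cancelled below

worker = threading.Thread(target=run_job)
worker.start()
while worker.is_alive():
    if zero_seen.value > 0:        # a zero was produced somewhere
        sc.cancelJobGroup("min-search")
        result_holder.append(0)
        break
    time.sleep(1)
worker.join()

min_value = result_holder[0] if result_holder else None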

If "expensive" is really expensive, maybe you can write the result of "expensive" to, say, SQL (Or any other storage available to all the workers).
Then in the beginning of "expensive" check the number currently stored, if it is zero return zero from "expensive" without performing the expensive part.
You can also do this localy for each worker which will save you a lot of time but won't be as "global".
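Very roughly, the idea looks like the sketch below; zero_already_found and record_result are hypothetical helpers standing in for whatever shared storage you have available (a SQL table, a Redis key, etc.):

def expensive_with_check(x):
    if zero_already_found():       # cheap lookup against the shared storage
        return 0                   # skip the expensive part entirely
    result = expensive(x)          # the real, costly computation
    record_result(result)          # make the result visible to other workers
    return result

rdd.map(expensive_with_check).min()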

Related

Pyspark - Loop n times - Each loop gets gradually slower

So basically I want to loop n times through my dataframe and apply a function in each loop
(perform a join).
My test DataFrame has about 1000 rows, and in each iteration exactly one column is added.
The first three loops perform instantly, and from then on it gets really, really slow.
The 10th loop, e.g., needs more than 10 minutes.
I don't understand why this happens, because my DataFrame won't grow larger in terms of rows.
If I call my functions with n=20, e.g., the join performs instantly.
But when I loop iteratively 20 times, it gets stuck soon.
Do you have any idea what could potentially cause this problem?
Example code from Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller:
import time
from pyspark import SparkContext

sc = SparkContext()

def push_and_pop(rdd):
    # two transformations: moves the head element to the tail
    first = rdd.first()
    return rdd.filter(
        lambda obj: obj != first
    ).union(
        sc.parallelize([first])
    )

def serialize_and_deserialize(rdd):
    # perform a collect() action to evaluate the rdd and create a new instance
    return sc.parallelize(rdd.collect())

def do_test(serialize=False):
    rdd = sc.parallelize(range(1000))
    for i in xrange(25):
        t0 = time.time()
        rdd = push_and_pop(rdd)
        if serialize:
            rdd = serialize_and_deserialize(rdd)
        print "%.3f" % (time.time() - t0)

do_test()
I have fixed this issue by converting the df to an RDD and back to a df every n iterations.
The code runs fast now, but I don't understand exactly why. The explain plan seems to grow very quickly during the iterations if I don't do the conversion.
This workaround is also mentioned in the book "High Performance Spark":
While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the result of iterative algorithms, like graph algorithms or machine learning algorithms. One simple workaround for this is converting the data to an RDD and back to DataFrame/Dataset at the end of each iteration.
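In code, and purely as an illustration (df, other_df, the join key, and n are placeholders, and spark is assumed to be an existing SparkSession), the workaround amounts to:

# reset the lineage every iteration so the logical plan does not keep growing
for i in range(n):
    df = df.join(other_df, on="id")                   # whatever work each iteration does
    df = spark.createDataFrame(df.rdd, df.schema)     # RDD round-trip truncates the plan

In recent Spark versions, df.checkpoint() (after configuring a checkpoint directory with sc.setCheckpointDir) has a similar plan-truncating effect.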

Why the operation on accumulator doesn't work without collect()?

I am learning Spark and following a tutorial. In an exercise I am trying to do some analysis on a data set. This data set has data in each line like:
userid | age | gender | ...
I have the following piece of code:
....
under_age = sc.accumulator(0)
over_age = sc.accumulator(0)

def count_outliers(data):
    global under_age, over_age
    if data[1] == '0-10':
        under_age += 1
    if data[1] == '80+':
        over_age += 1
    return data

data_set.map(count_outliers).collect()
print('Kids: {}, Seniors: {}'.format(under_age, over_age))
I found that I must use the method ".collect()" to make this code work. That is, without calling this method, the code won't count the two accumulators. But in my understanding, ".collect()" is used to bring the whole dataset into memory. Why is it necessary here? Is it related to lazy evaluation? Please advise.
Yes, it is due to lazy evaluation.
Spark doesn't calculate anything until you execute an action such as collect, and the accumulators are only updated as a side-effect of that calculation.
Transformations such as map define what work needs to be done, but it's only executed once an action is triggered to "pull" the data through the transformations.
This is described in the documentation:
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map().
It's also important to note that:
In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
so your accumulators will not necessarily give correct answers; they may overstate the totals.
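As a sketch of a cheaper way to trigger the computation (assuming the same data_set as in the question), you can use an action that does not ship all the rows back to the driver, such as foreach; using Accumulator.add() also avoids the need for the global statement:

under_age = sc.accumulator(0)
over_age = sc.accumulator(0)

def count_outliers(data):
    if data[1] == '0-10':
        under_age.add(1)
    if data[1] == '80+':
        over_age.add(1)

data_set.foreach(count_outliers)   # an action, so the accumulators get updated
print('Kids: {}, Seniors: {}'.format(under_age.value, over_age.value))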

Exit a slow Spark map on a timeout, but preserve results so far

I am mapping over a Spark RDD, with a very expensive function (potentially tens of seconds per row).
It is possible that the job will take too long, and I will need to abort it to make way for other jobs in our data flow.
However, the results computed so far would still be useful to me, so I don't want to discard them, especially as they may have already taken hours to compute.
Is there a way to exit the transformation early, on a timeout, but preserve the partial results computed so far?
There are at least a couple of ways to do this, by switching from map to a related transformation:
mapPartitions
mapPartitions gives us access to an iterator over each partition, so we can simply pretend there were no items in it, if the timeout has expired:
val data = sc.parallelize(1 to 100)
val timeout = 10000
val start = System.currentTimeMillis
data.repartition(10).mapPartitions { iter =>
  if (System.currentTimeMillis - start > timeout) Iterator.empty
  else iter.map(x => { Thread.sleep(500); x + 1 })
}.count
Depending on your environment, you may need to tweak the timeouts in this spark-shell example, but it should produce different numbers of results depending exactly when you run the transformation relative to setting start.
Note that there must be a significantly higher number of partitions than the total number of executor cores, otherwise all partitions will start immediately, and there's nothing to skip. Therefore, for this example, we explicitly re-partition the data before we start the mapPartitions. You might or might not need to do this, depending on the size of your data and the number of cores provisioned.
flatMap
A finer-grained approach is to use flatMap, which lets us process or skip each individual row conditionally, via a function that returns an Option (and just returns None if the timeout has expired):
// setup as before
data.flatMap { x =>
  if (System.currentTimeMillis - start > timeout) None
  else Some({ Thread.sleep(500); x + 1 })
}.count
This approach doesn't require partitioning, but will scan over all the remaining partitions even after the timeout (but not perform the expensive processing on any of the rows).
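For reference, a rough PySpark sketch of the same flatMap idea; the timeout and the time.sleep standing in for the expensive per-row work are placeholders, and driver and executor clocks are assumed to be roughly in sync:

import time

start = time.time()
timeout = 10.0  # seconds

def process_or_skip(x):
    if time.time() - start > timeout:
        return []                  # skip the row once the deadline has passed
    time.sleep(0.5)                # stand-in for the expensive computation
    return [x + 1]

data = sc.parallelize(range(1, 101))
data.flatMap(process_or_skip).count()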

spark reduce function: understand how it works

I am taking this course.
It says that the reduce operation on an RDD is done one machine at a time. That means if your data is split across 2 computers, then the function below will work on the data in the first computer, find the result for that data, then take a single value from the second machine, run the function, and continue that way until it finishes with all values from machine 2. Is this correct?
I thought that the function would start operating on both machines at the same time and then, once it has the results from the 2 machines, run the function one last time.
rdd1=rdd.reduce(lambda x,y: x+y)
update 1--------------------------------------------
Will the steps below give a faster answer compared to the reduce function?
Rdd=[3,5,4,7,4]
seqOp = (lambda x, y: x+y)
combOp = (lambda x, y: x+y)
collData.aggregate(0, seqOp, combOp)
Update 2-----------------------------------
Should both sets of code below execute in the same amount of time? I checked and it seems that both take the same time.
import datetime
data=range(1,1000000000)
distData = sc.parallelize(data,4)
print(datetime.datetime.now())
a=distData.reduce(lambda x,y:x+y)
print(a)
print(datetime.datetime.now())
seqOp = (lambda x, y: x+y)
combOp = (lambda x, y: x+y)
print(datetime.datetime.now())
b=distData.aggregate(0, seqOp, combOp)
print(b)
print(datetime.datetime.now())
reduce behaves a little differently between the native (Scala) API and guest languages (Python), but simplifying things a little:
each partition is processed sequentially, element by element
multiple partitions can be processed at the same time, either by a single worker (multiple executor threads) or by different workers
partial results are fetched to the driver, where the final reduction is applied (this is the part that has a different implementation in PySpark and Scala)
Since it looks like you're using Python, let's take a look at the code:
reduce creates a simple wrapper for a user-provided function:
def func(iterator):
    ...
This wrapper is then passed to mapPartitions:
vals = self.mapPartitions(func).collect()
It should be obvious that this code is embarrassingly parallel and doesn't care how the results are utilized.
The collected vals are reduced sequentially on the driver using the standard Python reduce:
reduce(f, vals)
where f is the function passed to RDD.reduce.
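Putting those pieces together, a simplified sketch of the mechanism (not the actual PySpark source, just the shape of it):

from functools import reduce as py_reduce

def my_reduce(rdd, f):
    def reduce_partition(iterator):
        acc, has_value = None, False
        for x in iterator:
            acc = x if not has_value else f(acc, x)
            has_value = True
        return [acc] if has_value else []   # one partial result per non-empty partition
    partials = rdd.mapPartitions(reduce_partition).collect()
    return py_reduce(f, partials)           # final merge happens on the driver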
By comparison, Scala will merge partial results asynchronously as they come in from the workers.
In the case of treeReduce, step 3 can be performed in a distributed manner as well. See Understanding treeReduce() in Spark.
To summarize: reduce, excluding driver-side processing, uses exactly the same mechanism (mapPartitions) as basic transformations like map or filter, and provides the same level of parallelism (once again excluding driver code). If you have a large number of partitions or f is expensive, you can parallelize / distribute the final merging using the tree* family of methods.
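As a small illustration of that last point (the numbers and the partition count are arbitrary), treeReduce takes the same function but merges partial results in stages on the cluster instead of all at once on the driver:

rdd = sc.parallelize(range(1, 1001), 100)

total = rdd.reduce(lambda x, y: x + y)                      # partials merged on the driver
tree_total = rdd.treeReduce(lambda x, y: x + y, depth=2)    # partials merged in stages on the cluster

assert total == tree_total == 500500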

Lazy foreach on a Spark RDD

I have a big RDD of Strings (obtained through a union of several sc.textFile(...)).
I now want to search for a given string in that RDD, and I want the search to stop when a "good enough" match has been found.
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD, regardless of whether the match has been reached.
Is there a way to short-circuit this process and avoid iterating through the whole RDD?
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD
Actually, you're wrong. The Spark engine is smart enough to optimize computations if you limit the results (using take or first):
from __future__ import print_function

import numpy as np

np.random.seed(323)

acc = sc.accumulator(0)

def good_enough(x, threshold=7000):
    global acc
    acc += 1
    return x > threshold

rdd = sc.parallelize(np.random.randint(0, 10000) for i in xrange(1000000))
x = rdd.filter(good_enough).first()
Now let's check the accumulator:
>>> print("Checked {0} items, found {1}".format(acc.value, x))
Checked 6 items, found 7109
and just to be sure everything works as expected:
acc = sc.accumulator(0)
rdd.filter(lambda x: good_enough(x, 100000)).take(1)
assert acc.value == rdd.count()
The same thing could be done, probably in a more efficient manner, using DataFrames and a udf.
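For instance, here is a hedged sketch of that DataFrame variant (spark is assumed to be an existing SparkSession, and the random column stands in for real data); limit(1) plays the same short-circuiting role as first():

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

good_enough_udf = F.udf(lambda x: x > 7000, BooleanType())

df = spark.range(0, 1000000).withColumn(
    "value", (F.rand(seed=323) * 10000).cast("int"))
first_match = df.filter(good_enough_udf("value")).limit(1).collect()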
Note: In some cases it is even possible to use an infinite sequence in Spark and still get a result. You can check my answer to Spark FlatMap function for huge lists for an example.
Not really. There is no find method, as in the Scala collections that inspired the Spark APIs, which would stop looking once an element is found that satisfies a predicate. Probably your best bet is to use a data source that will minimize excess scanning, like Cassandra, where the driver pushes down some query parameters. You might also look at the more experimental Berkeley project called BlinkDB.
Bottom line, Spark is designed more for scanning data sets, like MapReduce before it, rather than traditional database-like queries.
