When to use mapPartitions and mapPartitionsWithIndex? - apache-spark

The PySpark documentation describes two functions:
mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD.
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> def f(iterator): yield sum(iterator)
>>> rdd.mapPartitions(f).collect()
[3, 7]
And ...
mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD,
while tracking the index of the original partition.
>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>> def f(splitIndex, iterator): yield splitIndex
>>> rdd.mapPartitionsWithIndex(f).sum()
6
What use cases do these functions attempt to solve? I can't see why they would be required.

To answer this question we need to compare map with mapPartitions/mapPartitionsWithIndex (mapPartitions and mapPartitionsWithIndex do pretty much the same thing, except that mapPartitionsWithIndex also tells you which partition is being processed).
mapPartitions and mapPartitionsWithIndex are used to optimize the performance of your application. For the sake of understanding, let's say all the elements in your RDD are XML elements and you need a parser to process each of them, so you have to create an instance of a (costly) parser class before you can start. You could do it in two ways:
map / foreach: In this case an instance of the parser class is created for each element, the element is processed, and the instance is then destroyed, never to be reused for other elements. So if you are working with an RDD of 12 elements distributed among 4 partitions, the parser instance will be created 12 times. And since creating an instance is a very expensive operation, this costs time.
mapPartitions/mapPartitionsWithIndex: These two methods address the above situation. They work on partitions rather than on individual elements (don't get me wrong, all elements will still be processed), so the parser instance is created only once per partition. Since you have only 4 partitions, the parser instance is created 4 times (8 fewer instantiations than with map in this example). The function you pass to these methods must accept an iterator, which hands it all the elements of a partition at once. So with mapPartitions and mapPartitionsWithIndex the parser instance is created, all elements of the current partition are processed with it, and the instance is later destroyed by the GC. You will notice that this can improve the performance of your application significantly.
So the bottom line is: whenever you see that some setup work is common to all elements and you could, in principle, do it once and then process all of them, it's better to go with mapPartitions/mapPartitionsWithIndex.
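For illustration, here is a minimal PySpark sketch of that pattern; XmlParser, its parse() method, and xml_rdd are hypothetical stand-ins for your expensive-to-build parser and your RDD of XML strings:
def parse_partition(records):
    parser = XmlParser()            # hypothetical parser, built once per partition
    for record in records:
        yield parser.parse(record)  # hypothetical parse() call, applied to every element

parsed = xml_rdd.mapPartitions(parse_partition)

def parse_partition_with_index(index, records):
    parser = XmlParser()            # same idea, but you also know which partition you are in
    for record in records:
        yield (index, parser.parse(record))

parsed_with_index = xml_rdd.mapPartitionsWithIndex(parse_partition_with_index)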
See the two links below for explanations with code examples:
https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/
http://apachesparkbook.blogspot.in/2015/11/mappartition-example.html

Related

How many Iterators are there in Spark mapInPandas?

I am trying to understand how "mapInPandas" works in Spark.
The example quoted on the Databricks blog is:
from typing import Iterator
import pandas as pd
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
def pandas_filter(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        yield pdf[pdf.id == 1]

df.mapInPandas(pandas_filter, schema=df.schema).show()
The question is: how many "pdf" DataFrames are going to come through the iterator?
I guessed that there would be as many as the number of partitions, but when I tested the code further it seemed like there were far too many (on a different dataset with ~100 m records).
So, is there a way to know how the number of iterations is determined, and is there a way to make it equal to the number of partitions?
You can find that in the documentation:
Data partitions in Spark are converted into Arrow record batches, which can temporarily lead to high memory usage in the JVM. To avoid possible out of memory exceptions, the size of the Arrow record batches can be adjusted by setting the conf “spark.sql.execution.arrow.maxRecordsPerBatch” to an integer that will determine the maximum number of rows for each batch. The default value is 10,000 records per batch. If the number of columns is large, the value should be adjusted accordingly. Using this limit, each data partition will be made into 1 or more record batches for processing.
So with ~100M records and the default of 10,000 records per batch, you will end up with roughly 10,000 batches, i.e. roughly 10,000 pandas DataFrames handed to your function (one iterator per partition, one DataFrame per Arrow record batch).
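If you want the number of pandas DataFrames to be closer to the number of partitions, you can raise that limit so each partition fits into a single Arrow batch. A minimal sketch (the larger value is just an example; bigger batches mean higher memory usage per executor):
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 10000)    # the default: 10,000 rows per batch
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 1000000)  # fewer, larger batches per partition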

Spark short-circuiting, sorting, and lazy maps

I'm working on an optimization problem that involves minimizing an expensive map operation over a collection of objects.
The naive solution would be something like
rdd.map(expensive).min()
However, the map function returns values that are guaranteed to be >= 0. So, if any single result is 0, I can take that as the answer and do not need to compute the rest of the map operations.
Is there an idiomatic way to do this using Spark?
Is there an idiomatic way to do this using Spark?
No. If you're concerned with low-level optimizations like this one, then Spark is not the best option. That doesn't mean it is completely impossible, though.
You can, for example, try something like this:
rdd.cache()
(min_value, ) = rdd.filter(lambda x: x == 0).take(1) or [rdd.min()]
rdd.unpersist()
Or short-circuit within the partitions:
def min_part(xs):
    min_ = None
    for x in xs:
        min_ = min(x, min_) if min_ is not None else x
        if x == 0:
            return [0]
    return [min_] if min_ is not None else []

rdd.mapPartitions(min_part).min()
Both will usually do more work than strictly required, and each gives a slightly different performance profile, but they can skip evaluating some records. With rare zeros the first one might be better.
You can even listen to accumulator updates and use sc.cancelJobGroup once a 0 is seen. Here is one example of a similar approach: Is there a way to stream results to driver without waiting for all partitions to complete execution?
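A rough sketch of that idea, assuming expensive and rdd from above; how job groups interact with Python threads can vary between PySpark versions, so treat this as a starting point rather than a recipe:
import threading
import time

zero_seen = sc.accumulator(0)

def tracked(x):
    r = expensive(x)      # `expensive` is your costly function
    if r == 0:
        zero_seen.add(1)
    return r

result = []

def run_job():
    sc.setJobGroup("min-search", "short-circuit once a zero is seen")
    try:
        result.append(rdd.map(tracked).min())
    except Exception:
        pass              # the job may have been cancelled from the main thread

t = threading.Thread(target=run_job)
t.start()
while t.is_alive():
    if zero_seen.value > 0:            # accumulator updates arrive as tasks finish
        sc.cancelJobGroup("min-search")
        break
    time.sleep(0.1)
t.join()

answer = 0 if zero_seen.value > 0 else result[0]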
If "expensive" is really expensive, maybe you can write the result of "expensive" to, say, SQL (Or any other storage available to all the workers).
Then in the beginning of "expensive" check the number currently stored, if it is zero return zero from "expensive" without performing the expensive part.
You can also do this localy for each worker which will save you a lot of time but won't be as "global".
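As a sketch of that per-worker variant (expensive is again your costly function); note that the flag is only guaranteed to persist within a single task, since the closure is shipped to the executors per task:
found_zero = False   # per-process flag; at minimum it survives for the elements of one task

def expensive_with_shortcircuit(x):
    global found_zero
    if found_zero:              # skip the costly work once a zero has been seen locally
        return 0
    result = expensive(x)
    if result == 0:
        found_zero = True
    return result

rdd.map(expensive_with_shortcircuit).min()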

spark reduce function: understand how it works

I am taking this course.
It says that the reduce operation on an RDD is done one machine at a time. That means that if your data is split across 2 computers, the function below will first work on the data on the first computer and find the result for that data, then take a single value from the second machine, run the function, and continue that way until it finishes with all values from machine 2. Is this correct?
I thought that the function would start operating on both machines at the same time and then, once it has the results from the 2 machines, run the function one last time.
rdd1=rdd.reduce(lambda x,y: x+y)
update 1--------------------------------------------
Will the steps below give a faster answer than the reduce function?
collData = sc.parallelize([3, 5, 4, 7, 4])
seqOp = (lambda x, y: x + y)
combOp = (lambda x, y: x + y)
collData.aggregate(0, seqOp, combOp)
Update 2-----------------------------------
Should both snippets below execute in the same amount of time? I checked, and it seems that both take the same time.
import datetime
data=range(1,1000000000)
distData = sc.parallelize(data,4)
print(datetime.datetime.now())
a=distData.reduce(lambda x,y:x+y)
print(a)
print(datetime.datetime.now())
seqOp = (lambda x, y: x+y)
combOp = (lambda x, y: x+y)
print(datetime.datetime.now())
b=distData.aggregate(0, seqOp, combOp)
print(b)
print(datetime.datetime.now())
reduce behaves a little differently between the native (Scala) API and guest languages (Python), but simplifying things a little:
each partition is processed sequentially element by element
multiple partitions can be processed at the same time either by a single worker (multiple executor threads) or different workers
partial results are fetched to the driver where the final reduction is applied (this is a part which has different implementation in PySpark and Scala)
Since it looks like you're using Python, let's take a look at the code:
reduce creates a simple wrapper for a user-provided function:
def func(iterator):
    ...
This wrapper is used with mapPartitions:
vals = self.mapPartitions(func).collect()
It should be obvious that this code is embarrassingly parallel and doesn't care how the results are utilized.
The collected vals are then reduced sequentially on the driver using the standard Python reduce:
reduce(f, vals)
where f is the function passed to RDD.reduce.
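Putting those three steps together, a simplified sketch of the whole mechanism (not the actual PySpark source) looks roughly like this:
from functools import reduce as py_reduce

def simple_reduce(rdd, f):
    # reduce each partition locally; an empty partition contributes nothing
    def func(iterator):
        iterator = iter(iterator)
        try:
            acc = next(iterator)
        except StopIteration:
            return
        for x in iterator:
            acc = f(acc, x)
        yield acc

    # collect the per-partition results and finish the reduction on the driver
    vals = rdd.mapPartitions(func).collect()
    return py_reduce(f, vals)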
In comparison, Scala merges the partial results asynchronously as they come in from the workers.
In the case of treeReduce, step 3 can be performed in a distributed manner as well. See Understanding treeReduce() in Spark.
To summarize: reduce, excluding the driver-side processing, uses exactly the same mechanism (mapPartitions) as basic transformations like map or filter, and provides the same level of parallelism (once again excluding the driver code). If you have a large number of partitions or f is expensive, you can parallelize / distribute the final merging with the tree* family of methods.
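For example (a minimal sketch; depth controls how many rounds of partial merging happen on the executors before the driver sees anything):
rdd = sc.parallelize(range(1, 1001), 100)

total = rdd.reduce(lambda x, y: x + y)                     # all 100 partial results merged on the driver
total_tree = rdd.treeReduce(lambda x, y: x + y, depth=2)   # partial results merged in a tree on the executors first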

Lazy foreach on a Spark RDD

I have a big RDD of Strings (obtained through a union of several sc.textFile(...)).
I now want to search for a given string in that RDD, and I want the search to stop when a "good enough" match has been found.
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD, regardless of whether the match has been reached.
Is there a way to short-circuit this process and avoid iterating through the whole RDD?
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD
Actually, you're wrong. The Spark engine is smart enough to optimize computations if you limit the results (using take or first):
from __future__ import print_function

import numpy as np

np.random.seed(323)

acc = sc.accumulator(0)

def good_enough(x, threshold=7000):
    global acc
    acc += 1
    return x > threshold

rdd = sc.parallelize(np.random.randint(0, 10000) for i in xrange(1000000))
x = rdd.filter(good_enough).first()
Now let's check the accumulator:
>>> print("Checked {0} items, found {1}".format(acc.value, x))
Checked 6 items, found 7109
And just to be sure that everything works as expected:
acc = sc.accumulator(0)
rdd.filter(lambda x: good_enough(x, 100000)).take(1)
assert acc.value == rdd.count()
The same thing could probably be done in a more efficient manner using DataFrames and a UDF.
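A minimal DataFrame sketch of the early-stopping idea, reusing the rdd from above (it uses a plain column filter rather than a UDF; the column name and threshold are just for illustration):
from pyspark.sql import functions as F

df = rdd.map(lambda x: (int(x),)).toDF(["value"])
hit = df.filter(F.col("value") > 7000).first()   # first() only needs one matching row, so Spark can stop scanning early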
Note: In some cases it is even possible to use an infinite sequence in Spark and still get a result. You can check my answer to Spark FlatMap function for huge lists for an example.
Not really. There is no find method, as in the Scala collections that inspired the Spark APIs, which would stop looking once an element is found that satisfies a predicate. Probably your best bet is to use a data source that will minimize excess scanning, like Cassandra, where the driver pushes down some query parameters. You might also look at the more experimental Berkeley project called BlinkDB.
Bottom line, Spark is designed more for scanning data sets, like MapReduce before it, rather than traditional database-like queries.

Spark: getting cumulative frequency from frequency values

My question is rather simple to answer in a single-node environment, but I don't know how to do the same thing in a distributed Spark environment. What I have now is a "frequency plot", in which for each item I have the number of times it occurs. For instance, it may be something like this: (1, 2), (2, 3), (3, 1), which means that 1 occurred 2 times, 2 occurred 3 times, and so on.
What I would like to get is the cumulative frequency for each item, so the result I would need from the example data above is: (1, 2), (2, 3+2=5), (3, 1+3+2=6).
So far, I have tried to do this by using mapPartitions, which gives the correct result if there is only one partition... otherwise obviously not.
How can I do that?
Thanks.
Marco
I don't think what you want is possible as a distributed transformation in Spark unless your data is small enough to be aggregated into a single partition. Spark functions work by distributing jobs to remote processes, and the only way to communicate back is using an action which returns some value, or using an accumulator. Unfortunately, accumulators can't be read by the distributed jobs; they're write-only.
If your data is small enough to fit in memory on a single partition/process, you can coalesce(1), and then your existing code will work. If not, but a single partition will fit in memory, then you might use a local iterator:
var total = 0L
rdd.sortBy(_._1).toLocalIterator.foreach(tuple => {
total = total + tuple._2;
println((tuple._1, total)) // or write to local file
})
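In PySpark the same local-iterator approach would look roughly like this (assuming an RDD of (item, count) pairs named rdd):
total = 0
for item, count in rdd.sortBy(lambda kv: kv[0]).toLocalIterator():
    total += count
    print((item, total))   # or write to a local file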
If I understood your question correctly, it really looks like a fit for one of the combiner functions – take a look at different versions of aggregateByKey or reduceByKey functions, both located here.
