How to avoid large intermediate result before reduce? - apache-spark

I'm getting an error in a spark job that's surprising me:
Total size of serialized results of 102 tasks (1029.6 MB) is
bigger than spark.driver.maxResultSize (1024.0 MB)
My job is like this:
def add(a,b): return a+b
sums = rdd.mapPartitions(func).reduce(add)
rdd has ~500 partitions, and func takes the rows in a partition and returns a large array (a numpy array of 1.3M doubles, or ~10 MB).
I'd like to sum all these results and return their sum.
Spark seems to be holding the total result of mapPartitions(func) in memory (about 5 GB) instead of processing it incrementally, which would require only about 30 MB.
Instead of increasing spark.driver.maxResultSize, is there a way to perform the reduce more incrementally?
Update: Actually I'm kinda surprised that more than two results are ever held in memory.

When you use reduce, Spark applies the final reduction on the driver. If func returns a single object per partition, this is effectively equivalent to:
reduce(add, rdd.collect())
You may use treeReduce:
import math

# Keep the maximum possible depth so merging happens on the workers
rdd.treeReduce(add, depth=int(math.ceil(math.log2(rdd.getNumPartitions()))))
or toLocalIterator:
sum(rdd.toLocalIterator())
The former will recursively merge partitions on the workers at the cost of increased network exchange; you can use the depth parameter to tune the performance.
The latter will collect only a single partition at a time, but it might require re-evaluation of the rdd, and a significant part of the job will be performed by the driver.
Depending on the exact logic used in func, you can also improve the work distribution by splitting the matrix into blocks and performing the addition block by block, for example using BlockMatrix.
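As a hedged illustration of the block-wise idea (without going all the way to BlockMatrix): emit each partition's array as keyed chunks and let reduceByKey sum matching blocks on the executors, so the driver only ever receives the final ~10 MB result. The block size and func_array_for_partition are assumptions standing in for the original func:
import numpy as np

BLOCK = 100_000  # elements per block (assumed)

def func_blocks(rows):
    # stand-in for the original func: build this partition's 1.3M-double array
    arr = func_array_for_partition(rows)
    for i in range(0, len(arr), BLOCK):
        yield (i // BLOCK, arr[i:i + BLOCK])

block_sums = (rdd.mapPartitions(func_blocks)
                 .reduceByKey(np.add)      # block-wise sums stay on the executors
                 .collect())               # only ~10 MB total comes back to the driver
total = np.concatenate([block for _, block in sorted(block_sums)])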

Related

How to handle Spark tasks that require much more memory than the inputs and outputs combined?

I want to perform a computation that can be written as a simple Python UDF. This particular computation consumes much more memory while generating intermediate results than is needed to store the inputs and outputs combined.
Here's the rough structure of the computational task:
import pyspark.sql.functions as fx

@fx.udf("double")
def my_computation(reference):
    large_object = load_large_object(reference)
    result = summarize_large_object(large_object)
    return result

df_input = spark.read.parquet("list_of_references.parquet")
df_result = df_input.withColumn("result", my_computation(fx.col("reference")))
pdf_result = df_result.toPandas()
The idea is that load_large_object takes a small input (reference) and generates a much larger object, whereas summarize_large_object takes that large object and summarizes it down to a much smaller result. So, while the inputs and outputs (reference and result) can be quite small, the intermediate value large_object requires much more memory.
I haven't found any way to reliably run computations like this without severely limiting the amount of parallelism that I can achieve on upstream and downstream computations in the same Spark session.
Naively running a computation like the one above (without any changes to the default Spark configuration) often leads to worker nodes running out of memory. For example, if large_object consumes 2 GB of memory and the relevant Spark stage is running as 8 parallel tasks on 8 cores of a machine with less than 16 GB of RAM, the worker node will run out of memory (or at least start swapping to disk and slow down significantly).
IMO, the best solution would be to temporarily limit the number of parallel tasks that can run simultaneously. The closest configuration parameter that I'm aware of is spark.task.cpus, but this affects all upstream and downstream computations within the same Spark session.
Ideally, there would be some way to provide a hint to Spark that effectively says "for this step, make sure to allocate X amount of extra memory per task" (and then Spark wouldn't schedule such a task on any worker node that isn't expected to have that amount of extra memory available). Upstream and downstream jobs/stages could remain unaffected by the constraints imposed by this sort of hint.
Does such a mechanism exist?
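For reference, a minimal sketch of the session-wide spark.task.cpus workaround mentioned above (all values are assumptions); because it applies to the whole session, it has exactly the limitation described:
from pyspark.sql import SparkSession

# With 8 executor cores and spark.task.cpus=4, at most 2 tasks run concurrently
# per executor, leaving headroom for two ~2 GB large_objects at a time.
spark = (
    SparkSession.builder
    .appName("memory-heavy-udf")
    .config("spark.executor.cores", "8")
    .config("spark.task.cpus", "4")
    .getOrCreate()
)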

Spark map-side aggregation: Per partition only?

I have been reading about map-side reduce/aggregation, and there is one thing I can't seem to understand clearly. Does it happen per partition only, or is it broader in scope? I mean, does it also reduce across partitions if the same key appears in multiple partitions processed by the same Executor?
Now I have a few more questions depending on whether the answer is "per partition only" or not.
Assuming it's per partition:
What are good ways to deal with a situation where I know my dataset lends itself well to further reduction across local partitions before a shuffle? E.g. I process 10 partitions per Executor and I know they all include many overlapping keys, so it could potentially be reduced to just 1/10th. Basically I'm looking for a local reduce() (like so many others are). coalesce()ing them comes to mind; are there any common methods to deal with this?
Assuming it reduces across partitions:
Does it happen per Executor? How about Executors assigned to the same Worker node: do they have the ability to reduce across each other's partitions, recognizing that they are co-located?
Does it happen per core (thread) within the Executor? The reason I'm asking is that some of the diagrams I looked at seem to show a Mapper per core/thread of the executor; it looks like the results of all tasks coming out of that core go to a single Mapper instance (which does the shuffle writes, if I am not mistaken).
Is it deterministic? E.g. if I have a record, let's say A=1, in 10 partitions processed by the same Executor, can I expect to see A=10 for the task reading the shuffle output? Or is it best-effort, e.g. it still reduces but there are some constraints (buffer size etc.), so the shuffle read may encounter A=4 and A=6?
Map-side aggregation is similar to the Hadoop combiner approach: reducing locally means less shuffling, and it works per partition, as you state.
When you apply a reducing operation, e.g. a groupBy and sum, shuffling occurs first so that the same keys end up in the same partition, and then the above can happen (with DataFrames this is automatic). But a simple count, say, will also reduce locally, and the overall count is then computed by bringing the intermediate results back to the driver.
So results from the Executors are combined on the Driver, depending on what is actually requested, e.g. a collect or printing a count. But if you write out after an aggregation of some kind, the reducing is limited to the Executors on the Workers.
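A small illustration of map-side combining, assuming an existing SparkContext sc and toy data: reduceByKey (unlike groupByKey) merges values inside each map task before the shuffle, so repeated keys within a partition are already combined locally, and the final merge across partitions happens after the shuffle.
pairs = sc.parallelize([("A", 1)] * 10, 2)               # A=1 repeated across 2 partitions

print(pairs.reduceByKey(lambda a, b: a + b).collect())   # [('A', 10)], with map-side partial sums
print(pairs.groupByKey().mapValues(sum).collect())       # same result, but no map-side combine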

"java.lang.OutOfMemoryError: Requested array size exceeds VM limit" during pyspark collect_list() execution

I have a large dataset of 50 million rows with about 40 columns of floats.
For custom transformation reasons, I am trying to collect all float values per column using the collect_list() function of PySpark, with the following pseudocode:
for column in columns:
    set_values(column, df.select(collect_list(column)).first()[0])
For each column, it executes the collect_list() function and sets the values into some other internal structure.
I am running the above on a standalone cluster with 2 hosts of 8 cores and 64 GB RAM each, allocating at most 30 GB and 6 cores to 1 executor per host, and I am getting the following exception during execution, which I suspect has to do with the size of the collected array.
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
I have tried multiple configurations in spark-defaults.conf, including allocating more memory, partition number, parallelism, even Java options, but still no luck.
So is my assumption correct that collect_list() is deeply tied to executor/driver resources on larger datasets, or does it have nothing to do with them?
Are there any settings I could use to help me eliminate this issue, or do I otherwise have to use the collect() function?
collect_list is no better than just calling collect in your case. Both are an incredibly bad idea for large datasets and have very little practical application.
Both require an amount of memory proportional to the number of records, and collect_list just adds the overhead of a shuffle.
In other words, if you don't have a choice and you need a local structure, use select and collect and increase the driver memory. It won't make things any worse:
df.select(column).rdd.map(lambda x: x[0]).collect()
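A hedged sketch of how that suggestion plugs into the original per-column loop (set_values and columns as in the question); spark.driver.memory and spark.driver.maxResultSize still have to be raised at launch time, e.g. in spark-defaults.conf, to hold ~50M floats per collected column:
for column in columns:
    # plain collect of one column at a time; no collect_list, so no extra shuffle
    values = df.select(column).rdd.map(lambda x: x[0]).collect()
    set_values(column, values)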

What is the proper way to do element-wise operations on two vectors in the Spark model?

I'm using PySpark with YARN and have two RDDs, A and B, that I wish to perform a chain of various element-wise calculations on. Both RDDs are vectors of the same length, broken into the same number of partitions. My current approach is as follows:
# C = A + B
C = A.zip(B).map(lambda x: x[0] + x[1])
When I call collect on an RDD that is the result of a chain of 5-8 of these types of operations, I start to lose executors. If I keep collecting further down the chain, enough executors are lost to cause the calculation to fail. Increasing the amount of memory per executor allows the calculation to finish. This prompts the following questions:
Is A.zip(B).map(operation(A, B)) the intended method of operating on two RDDs?
Are there hidden pitfalls I'm running into with this method? I read that executors may be lost due to excessive memory allocation on them. This doesn't make sense to me, as my interpretation of an RDD was that it is simply a set of instructions for producing an intermediate data set, not the actual distributed data set itself.

Spark map is only one task while it should be parallel (PySpark)

I have an RDD with around 7M entries, each containing 10 normalized coordinates. I also have a number of centers, and I'm trying to map every entry to the closest (Euclidean distance) center. The problem is that this only generates one task, which means it is not parallelizing. This is the form:
def doSomething(point, centers):
    for center in centers.value:
        if distance(point, center) < 1:
            return center
    return None

preppedData.map(lambda x: doSomething(x, centers)).take(5)
The preppedData RDD is cached and already evaluated; the doSomething function is shown in a much simpler form than it actually is, but it's the same principle. centers is a list that has been broadcast. Why does this map run as only one task?
Similar pieces of code in other projects map to roughly 100 tasks and run on all the executors; this one is 1 task on 1 executor. My job has 8 executors with 8 GB and 2 cores available per executor.
This could be due to the conservative nature of the take() method.
See the code in RDD.scala.
What it does is first take the first partition of your RDD (if the RDD doesn't require a shuffle, this needs only one task), and if there are enough results in that one partition, it returns them. If there is not enough data in that partition, it grows the number of partitions it tries until it gets the required number of elements.
Since your RDD is already cached and your operation is only a map function, as long as the first partition has at least 5 rows, this will only ever require one task. More tasks would be unnecessary.
This code exists to avoid overloading the driver with too much data by fetching from all partitions at once for a small take.
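A small illustration reusing the names from the question: take() evaluates only as many partitions as it needs, while a full action touches every partition.
mapped = preppedData.map(lambda x: doSomething(x, centers))

first_five = mapped.take(5)   # usually scans only the first partition -> 1 task
n = mapped.count()            # forces every partition -> one task per partition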

Resources