PySpark PandasUDF on GCP - Memory Allocation - apache-spark

I am using a pandas udf to train many ML models on GCP in Dataproc (Spark). The main idea is that I have a grouping variable that represents the various sets of data in my data frame and I run something like this:
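from pyspark.sql.functions import pandas_udf, PandasUDFType
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def test_train(grp_df):
    # train model on grp_df
    # evaluate model
    # return metrics as a pandas DataFrame matching `schema`
    return metrics
result = df.groupBy('group_id').apply(test_train)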
This works fine except when I use the non-sampled data, where errors are returned that appear to be related to memory issues. The messages are cryptic (to me), but if I sample down the data it runs, and if I don't, it fails. Error messages are things like:
OSError: Read out of bounds (offset = 631044336, size = 69873416) in
file of size 573373864
or
Container killed by YARN for exceeding memory limits. 24.5 GB of 24
GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
My Question is how to set memory in the cluster to get this to work?
I understand that each group of data, and the process being run on it, needs to fit entirely in the memory of the executor. I currently have a 4-worker cluster with the following:
If I think the maximum size of data in the largest group_id requires 150GB of memory, it seems I really need each machine to operate on one group_id at a time. At least I get 4 times the speed compared to having a single worker or VM.
If I do the following, is this in fact creating 1 executor per machine that has access to all the cores minus 1 and 180 GB of memory? So that if in theory the largest group of data would work on a single VM with this much RAM, this process should work?
spark = SparkSession.builder \
    .appName('test') \
    .config('spark.executor.memory', '180g') \
    .config('spark.executor.cores', '63') \
    .config('spark.executor.instances', '1') \
    .getOrCreate()

Let's break the answer into 3 parts:
Number of executors
The GroupBy operation
Your executor memory
Number of executors
Straight from the Spark docs:
spark.dynamicAllocation.initialExecutors
Initial number of executors to run if dynamic allocation is enabled.
If `--num-executors` (or `spark.executor.instances`) is set and larger
than this value, it will be used as the initial number of executors.
So, no. You get only a single executor, and it won't scale up unless dynamic allocation is enabled.
You can increase the number of executors manually by configuring spark.executor.instances, or set up automatic scaling based on workload by enabling dynamic executor allocation.
To enable dynamic allocation, you have to also enable the shuffle service which allows you to safely remove executors. This can be done by setting two configs:
spark.shuffle.service.enabled to true. Default is false.
spark.dynamicAllocation.enabled to true. Default is false.
GroupBy
I have observed group_by being done using hash aggregates in Spark, which means that given x partitions and more than x unique group_by values, multiple group_by values will land in the same partition.
For example, say two unique values in group_by column are a1 and a2 having total rows' size 100GiB and 150GiB respectively.
If they fall into separate partitions, your application will run fine since each partition will fit into the executor memory (180GiB), which is required for in-memory processing and the remaining will be spilled to disk if they do not fit into the remaining memory. However, if they fall into same partition, your partition will not fit into the executor memory (180GiB < 250GiB) and you will get an OOM.
In such instances, it's useful to spread your data over a reasonably large number of partitions (spark.default.parallelism for RDD operations, spark.sql.shuffle.partitions for DataFrame shuffles such as groupBy), or to apply salting or other techniques to remove data skew.
If your data is not too skewed, you are correct to say that as long as your executor can handle the largest groupby value, it should work since your data will be evenly partitioned and chances of the above happening will be rare.
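The collision scenario described above can be sketched in plain Python. This is a toy model rather than Spark's actual partitioner: it uses `group_id % num_partitions` as a stand-in for hash partitioning, and the group ids (1 and 5) and the function name `partition_loads` are made up for illustration.

```python
from collections import defaultdict

def partition_loads(group_sizes_gib, num_partitions):
    """Sum the group sizes that land in each partition.

    Toy model: partition index = group_id % num_partitions. Spark uses its
    own hash function, so real assignments will differ.
    """
    loads = defaultdict(int)
    for group_id, size_gib in group_sizes_gib.items():
        loads[group_id % num_partitions] += size_gib
    return dict(loads)

# Hypothetical groups: ids 1 and 5, holding 100 GiB and 150 GiB.
groups = {1: 100, 5: 150}
executor_memory_gib = 180

few = partition_loads(groups, num_partitions=4)   # both ids map to partition 1
many = partition_loads(groups, num_partitions=8)  # ids map to partitions 1 and 5

print(max(few.values()) > executor_memory_gib)   # True: 250 GiB in one partition
print(max(many.values()) > executor_memory_gib)  # False: largest partition is 150 GiB
```

In this toy setup, raising the partition count separates the two groups, which is the effect the partition-count settings and salting aim for.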
Another point to note is that since you are using group_by, which requires a data shuffle, you should also turn on the shuffle service. Without the shuffle service, each executor has to serve the shuffle requests along with doing its own work.
Executor memory
The total executor memory (actual executor container size) in Spark is determined by adding the executor memory allotted for the container to the allotted memoryOverhead. The memoryOverhead accounts for things like VM overheads, interned strings, other native overheads, etc. So,
Total executor memory = (spark.executor.memory + spark.executor.memoryOverhead)
spark.executor.memoryOverhead = max(executorMemory*0.10, 384 MiB)
Based on this, you can configure your executors to have an appropriate size as per your data.
So, when you set the spark.executor.memory to 180GiB, the actual executor launched should be of around 198GiB.
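As a quick check of the arithmetic, here is a minimal sketch of the overhead formula quoted above. The function name and the use of MiB as the unit are my own choices; the 10% factor and the 384 MiB floor are taken from the formula in this answer.

```python
def executor_container_mib(executor_memory_mib, overhead_percent=10, min_overhead_mib=384):
    """Container size = executor memory + max(10% of executor memory, 384 MiB)."""
    overhead_mib = max(executor_memory_mib * overhead_percent // 100, min_overhead_mib)
    return executor_memory_mib + overhead_mib

GIB = 1024  # MiB per GiB

print(executor_container_mib(180 * GIB) / GIB)  # 198.0
print(executor_container_mib(2 * GIB))          # 2432 (the 384 MiB floor applies)
```

For small executors the 384 MiB floor dominates; for large ones such as 180 GiB the 10% term does.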

To resolve the YARN overhead issue, you can increase the YARN overhead memory by adding .config('spark.yarn.executor.memoryOverhead','30g'). For maximum parallelism it is recommended to keep the number of cores per executor at 5, whereas you can increase the number of executors.
spark = SparkSession.builder \
    .appName('test') \
    .config('spark.executor.memory', '18g') \
    .config('spark.executor.cores', '5') \
    .config('spark.executor.instances', '12') \
    .getOrCreate()
# or use dynamic resource allocation; refer to the config below
spark = SparkSession.builder \
    .appName('test') \
    .config('spark.shuffle.service.enabled', 'true') \
    .config('spark.dynamicAllocation.enabled', 'true') \
    .getOrCreate()

I solved OSError: Read out of bounds by making the number of groups larger, so that each group holds less data:
result=df.groupBy('group_id').apply(test_train)

Related

Spark configuration based on my data size

I know there's a way to configure a Spark application based on your cluster resources ("executor memory", "number of executors" and "executor cores"). I'm wondering whether there is a way to do it based on the input data size?
What would happen if data input size does not fit into all partitions?
Example:
Data input size = 200GB
Number of partitions in cluster = 100
Size of partitions = 128MB
Total size that partitions could handle = 100 * 128MB = 12.8GB
What about the rest of the data (about 187GB)?
I guess Spark will wait for resources to be freed, since it is designed to process data in batches. Is this a correct assumption?
Thanks in advance
For best performance, I recommend not setting spark.executor.cores: you want one executor per worker. Also, use ~70% of the executor memory in spark.executor.memory. Finally, if you want real-time application statistics to influence the number of partitions, use Spark 3, which comes with Adaptive Query Execution (AQE). With AQE, Spark will dynamically coalesce shuffle partitions, so you can set an arbitrarily large number of partitions, such as:
spark.sql.shuffle.partitions=<number of cores * 50>
Then just let AQE do its thing. You can read more about it here:
https://www.databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
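A minimal sketch of how those settings could be assembled, assuming a cluster with 64 cores in total (an example value, not from the question). The helper name `aqe_shuffle_settings` is hypothetical; the three configuration keys are standard Spark SQL properties.

```python
def aqe_shuffle_settings(total_cores, partitions_per_core=50):
    """Return Spark SQL settings for AQE with a deliberately large shuffle partition count."""
    return {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.shuffle.partitions": str(total_cores * partitions_per_core),
    }

settings = aqe_shuffle_settings(total_cores=64)  # 64 cores is an assumed example
for key, value in settings.items():
    # Print each setting in the form accepted by spark-submit.
    print(f"--conf {key}={value}")
```

Each key/value pair can also be passed to SparkSession.builder.config(key, value) instead of the command line.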
There are 2 aspects to your question. The first is regarding storage of this data, & the second is regarding data execution.
With regards to storage, when you say Size of partitions = 128MB, I assume you use HDFS to store this data & 128M is your default block size. HDFS itself internally decides how to split this 200GB file & store in chunks not exceeding 128M. And your HDFS cluster should have more than 200GB * replication factor of combined storage to persist this data.
Coming to the Spark execution part of the question, once you define spark.default.parallelism=100, Spark will use this value as the default level of parallelism while performing certain operations (like join etc). Please note that the amount of data being processed by each executor is not affected by the block size (128M) in any way. This means each executor task will work on 200G/100 = 2G of data (provided the executor memory is sufficient for the operation being performed). If there isn't enough capacity in the Spark cluster to run all 100 tasks in parallel, it will run as many as it can and launch the rest in batches as and when resources become available.

PySpark OOM for multiple data files

I want to process several independent csv files of similar sizes (100 MB) in parallel with PySpark.
I'm running PySpark on a single machine:
spark.driver.memory 20g
spark.executor.memory 2g
local[1]
File content:
type (has the same value within each csv), timestamp, price
First I tested it on one csv (note I used 35 different window functions):
logData = spark.read.csv("TypeA.csv", header=False, schema=schema)
# Compute moving avg. I used 35 different moving averages.
w = (Window.partitionBy("type").orderBy(f.col("timestamp").cast("long")).rangeBetween(-24*7*3600 * i, 0))
logData = logData.withColumn("moving_avg", f.avg("price").over(w))
# Some other simple operations... No Agg, no sort
logData.write.parquet("res.pr")
This works great. However, I had two issues with scaling this job:
When I increased the number of window functions to 50, the job OOMs. I'm not sure why PySpark doesn't spill to disk in this case, since the window functions are independent of each other.
When I ran the job for 2 CSV files, it also OOMs. It is also not clear why it doesn't spill to disk, since the window functions are basically partitioned by CSV file, so they are independent.
The question is why PySpark doesn't spill to disk in these two cases to prevent OOM, or how I can hint Spark to do it.
If your machine cannot run all of these at once, you can process them in sequence and write the data of each batch of files before loading the next batch.
I'm not sure if this is what you mean, but you can try hinting Spark to write some of the data to disk instead of keeping it in RAM with:
df.persist(StorageLevel.MEMORY_AND_DISK)
Update if it helps
In theory, you could process all these 600 files on one single machine. Spark should spill to disk when memory is not enough. But there are some points to consider:
The logic involves window aggregation, which results in a heavy shuffle operation. You need to check whether the OOM happened in the map or the reduce phase. The map phase processes each partition of a file, then writes its shuffle output to a file. The reduce phase then needs to fetch the shuffle output from all map tasks. It's obvious that in your case you can't keep all map tasks running.
So it's highly likely that the OOM happened in the map phase. If this is the case, it means the memory per core can't process one single partition of a file. Please be aware that Spark makes a rough estimate of memory usage and spills when it thinks it should. As the estimate is not accurate, an OOM is still possible. You can tune the partition size with the config below:
spark.sql.files.maxPartitionBytes (default 128MB)
Usually, 128M of input needs 2GB of heap with 4G of total executor memory, as
executor JVM heap execution memory (about 0.5 of total executor memory) =
(total executor memory - executor.memoryOverhead (default 0.1)) * spark.memory.fraction (default 0.6)
You can post all your configs in Spark UI for further investigation.
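The rule of thumb above can be checked with a few lines of Python. This is a simplified sketch of the answer's approximation, not Spark's exact accounting: the function name is hypothetical, 0.1 is the default overhead factor, 0.6 is the default of spark.memory.fraction, and the fixed amount of heap that Spark reserves before applying that fraction is ignored.

```python
def approx_execution_memory_mib(total_executor_mib, overhead_fraction=0.1, memory_fraction=0.6):
    """Approximate usable executor memory using the rule of thumb from the answer.

    Simplification: Spark also sets aside a fixed amount of reserved heap
    before applying spark.memory.fraction; that is ignored here.
    """
    heap_mib = total_executor_mib * (1 - overhead_fraction)
    return heap_mib * memory_fraction

usable_mib = approx_execution_memory_mib(4096)  # 4 GiB total executor memory
print(round(usable_mib / 1024, 2))  # roughly 2.16 GiB, about half of the total
```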

SPARK - assign multiple cores to one task in RDD.map in pyspark

I am new to SPARK and I'm trying to use RDD.map in pyspark to parallelize running of a method named function in the SPARK framework (72 cores in total in a Standalone SPARK cluster - one driver with 100G RAM and 3 workers each with 24 cores and 100G RAM).
My goal is to run function 200 times and average over the results. The output of the function is a numpy.array of size 12 by num_of_samples (which is a huge variable in terms of memory).
My first attempt was to create an RDD of size 200, then use RDD.map and reduce at the end:
sum_data = sc.parallelize(range(0,200)).map(function).reduce(lambda x,y:x+y)
Despite the fact that I set the spark driver-memory to maximum, it runs out of memory at the reduce level (I guess due to the huge numpy.array output of the function). I figured out that the maximum number of elements that I can put into my RDD in order to avoid a memory error is about 40:
sum_data = sc.parallelize(range(0,40)).map(function).reduce(lambda x,y:x+y)
Now when I try this, I see that SPARK creates 40 tasks and assigns exactly one core to each of them (using only 40 out of the 72 available cores in the cluster). So the other 32 cores are idle and not being used, resulting in a much slower run-time. I was wondering if this approach is correct, and how I can make RDD.map consume all the available cores instead of using one core for each mapping?
I think this can be achieved by specifying the number of partitions that Spark should divide your RDD into.
The simplest way to do this is to pass the optional numSlices parameter to the parallelize method call. This ensures that Spark splits your data into numSlices partitions, and I think it will then use all the cores.
Please refer to the official documentation for more information.

Spark: executor memory exceeds physical limit

My input dataset is about 150G.
I am setting
--conf spark.cores.max=100
--conf spark.executor.instances=20
--conf spark.executor.memory=8G
--conf spark.executor.cores=5
--conf spark.driver.memory=4G
but since data is not evenly distributed across executors, I kept getting
Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used
here are my questions:
1. Did I not set up enough memory in the first place? I think 20 * 8G > 150G, but it's hard to make perfect distribution, so some executors will suffer
2. I am thinking about repartitioning the input DataFrame, so how can I determine how many partitions to set? The higher the better, or?
3. The error says "9 GB physical memory used", but i only set 8G to executor memory, where does the extra 1G come from?
Thank you!
When using yarn, there is another setting that figures into how big to make the yarn container request for your executors:
spark.yarn.executor.memoryOverhead
It defaults to 0.1 * your executor memory setting. It defines how much extra overhead memory to ask for in addition to what you specify as your executor memory. Try increasing this number first.
Also, a yarn container won't give you memory of an arbitrary size. It will only return containers allocated with a memory size that is a multiple of its minimum allocation size, which is controlled by this setting:
yarn.scheduler.minimum-allocation-mb
Setting that to a smaller number will reduce the risk of you "overshooting" the amount you asked for.
I also typically set the below key to a value larger than my desired container size to ensure that the spark request is controlling how big my executors are, instead of yarn stomping on them. This is the maximum container size yarn will give out.
yarn.nodemanager.resource.memory-mb
The 9GB is composed of the 8GB executor memory which you pass as a parameter, plus the overhead. spark.yarn.executor.memoryOverhead defaults to 0.1 of the executor memory, so the total memory of the container is spark.executor.memory + (0.1 * spark.executor.memory), which is 8GB + (.1 * 8GB) ≈ 9GB.
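To see where the 9GB figure comes from, here is a rough sketch that combines the overhead with rounding up to a multiple of yarn.scheduler.minimum-allocation-mb, as described in the previous answer. The function name is hypothetical, and the 1024 MiB minimum allocation is an assumed value for illustration; check your own cluster's setting.

```python
def yarn_container_mib(executor_memory_mib, overhead_percent=10,
                       min_overhead_mib=384, min_allocation_mib=1024):
    """Executor memory plus overhead, rounded up to a multiple of the YARN minimum allocation."""
    overhead_mib = max(executor_memory_mib * overhead_percent // 100, min_overhead_mib)
    requested_mib = executor_memory_mib + overhead_mib
    # Ceiling division without floats: round the request up to the next multiple.
    multiples = -(-requested_mib // min_allocation_mib)
    return multiples * min_allocation_mib

print(yarn_container_mib(8 * 1024))                           # 9216 MiB, i.e. 9 GiB
print(yarn_container_mib(8 * 1024, min_allocation_mib=2048))  # 10240 MiB: a larger step overshoots more
```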
You could run the entire process using a single executor, but this would take ages. To understand this you need to know the notion of partitions and tasks. The number of partitions is defined by your input and the actions. For example, if you read a 150gb csv from hdfs and your hdfs blocksize is 128mb, you will end up with 150 * 1024 / 128 = 1200 partitions, which maps directly to 1200 tasks in the Spark UI.
Every single task will be picked up by an executor. You don't need to hold all the 150gb in memory ever. For example, when you have a single executor, you obviously won't benefit from the parallel capabilities of Spark, but it will just start at the first task, process the data, save it back to the dfs, and start working on the next task.
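The partition and task arithmetic can be sketched as follows. This is a simplified model that assumes a splittable input and one task per core at a time; the helper names are hypothetical, and the 20 executors with 5 cores each are taken from the configuration in the question.

```python
import math

def input_partitions(input_mib, block_mib=128):
    """Number of input partitions when a splittable file is read block by block."""
    return math.ceil(input_mib / block_mib)

def task_waves(num_tasks, num_executors, cores_per_executor):
    """How many rounds are needed if each core runs one task at a time."""
    slots = num_executors * cores_per_executor
    return math.ceil(num_tasks / slots)

tasks = input_partitions(150 * 1024)  # 150 GB input, 128 MB blocks
print(tasks)                                                      # 1200
print(task_waves(tasks, num_executors=20, cores_per_executor=5))  # 12 waves on 100 slots
print(task_waves(tasks, num_executors=1, cores_per_executor=1))   # 1200: one task after another
```

Only the tasks of the current wave need their data in memory at the same time, which is why the full 150 GB never has to fit at once.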
What you should check:
How big are the input partitions? Is the input file splittable at all? If a single executor has to load a massive amount of data into memory, it will run out of memory for sure.
What kind of actions are you performing? For example, if you do a join on a key with very low cardinality, you end up with massive partitions, because all the rows with a specific value end up in the same partition.
Very expensive or inefficient actions performed? Any cartesian product etc.
Hope this helps. Happy sparking!

spark scalability: what am I doing wrong?

I am processing data with spark and it works with a day worth of data (40G) but fails with OOM on a week worth of data:
import pyspark
import datetime
import operator
sc = pyspark.SparkContext()
sqc = pyspark.sql.SQLContext(sc)
sc.union([sqc.parquetFile(hour.strftime('.....'))
          .map(lambda row: (row.id, row.foo))
          for hour in myrange(beg, end, datetime.timedelta(0, 3600))]) \
  .reduceByKey(operator.add).saveAsTextFile("myoutput")
The number of different IDs is less than 10k.
Each ID is a smallish int.
The job fails because too many executors fail with OOM.
When the job succeeds (on small inputs), "myoutput" is about 100k.
what am I doing wrong?
I tried replacing saveAsTextFile with collect (because I actually want to do some slicing and dicing in python before saving), but there was no difference in behavior: same failure. Is this to be expected?
I used to have reduce(lambda x,y: x.union(y), [sqc.parquetFile(...)...]) instead of sc.union - which is better? Does it make any difference?
The cluster has 25 nodes with 825GB RAM and 224 cores among them.
Invocation is spark-submit --master yarn --num-executors 50 --executor-memory 5G.
A single RDD has ~140 columns and covers one hour of data, so a week is a union of 168(=7*24) RDDs.
Spark very often suffers from Out-Of-Memory errors when scaling. In these cases, fine tuning should be done by the programmer. Or recheck your code to make sure that you aren't doing anything excessive, such as collecting all the big data in the driver, which is very likely to exceed the memoryOverhead limit, no matter how big you set it.
To understand what is happening you should realize when yarn decides to kill a container for exceeding memory limits. That will happen when the container goes beyond the memoryOverhead limit.
In the Scheduler you can check the Event Timeline to see what happened with the containers. If Yarn has killed a container, it will appear red, and when you hover/click over it, you will see a message like:
Container killed by YARN for exceeding memory limits. 16.9 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
So in that case, what you want to focus on is these configuration properties (values are examples on my cluster):
# More executor memory overhead
spark.yarn.executor.memoryOverhead 4096
# More driver memory overhead
spark.yarn.driver.memoryOverhead 8192
# Max on my nodes
#spark.executor.cores 8
#spark.executor.memory 12G
# For the executors
spark.executor.cores 6
spark.executor.memory 8G
# For the driver
spark.driver.cores 6
spark.driver.memory 8G
The first thing to do is to increase the memoryOverhead.
In the driver or in the executors?
When you are overviewing your cluster from the UI, you can click on the Attempt ID and check the Diagnostics Info which should mention the ID of the container that was killed. If it is the same as with your AM Container, then it's the driver, else the executor(s).
That didn't resolve the issue, now what?
You have to fine tune the number of cores and the heap memory you are providing. You see, pyspark will do most of the work in off-heap memory, so you don't want to give too much space to the heap, since that would be wasted. You don't want to give too little either, because the Garbage Collector will have issues then. Recall that these are JVMs.
As described here, a worker can host multiple executors, thus the number of cores used affects how much memory every executor has, so decreasing the #cores might help.
I have written this up in more detail in memoryOverhead issue in Spark and Spark – Container exited with a non-zero exit code 143, mostly so that I won't forget! Another option, that I haven't tried, would be spark.default.parallelism and/or spark.storage.memoryFraction, which, based on my experience, didn't help.
You can pass configurations flags as sds mentioned, or like this:
spark-submit --properties-file my_properties
where "my_properties" is something like the attributes I list above.
For non numerical values, you could do this:
spark-submit --conf spark.executor.memory='4G'
It turned out that the problem was not with spark, but with yarn.
The solution is to run spark with
spark-submit --conf spark.yarn.executor.memoryOverhead=1000
(or modify yarn config).

Resources