Explain the difference between Spark configurations - apache-spark

I have to set the number of executors in my Spark application to 20. Looking at the official documentation, I'm confused about which is the better config to set:
spark.dynamicAllocation.initialExecutors = 20
spark.executor.instances=20
I have the following config enabled
spark.dynamicAllocation.enabled = true
In which use case would I use each one?

As per the Spark documentation:
spark.dynamicAllocation.initialExecutors
Initial number of executors to run if dynamic allocation is enabled.
If --num-executors (or spark.executor.instances) is set and larger
than this value, it will be used as the initial number of executors.
As you can see in the highlighted text, it can be overridden by --num-executors when that is set to a higher value than spark.dynamicAllocation.initialExecutors.
Basically, when dynamic allocation is enabled, your application starts with spark.dynamicAllocation.initialExecutors executors and then scales up gradually towards spark.dynamicAllocation.maxExecutors as needed.
spark.executor.instances
number of executors for static allocation.
In layman's terms, it is like saying either:
"I want x resources (spark.executor.instances) to finish a job"
(OR)
"I want min(x resources), max(y resources) and initially(z resources) to finish a job..."
The condition x <= z <= y must always hold (min <= initial <= max), and actual resource usage is decided by what is needed while your job is running.
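Mapped onto concrete properties, a rough sketch (in spark-defaults.conf style; the numbers are only illustrative, and the two blocks are alternatives rather than something to combine):

# static allocation: exactly x executors for the whole job
spark.executor.instances                  20

# dynamic allocation: at least x, at most y, starting with z executors
spark.dynamicAllocation.enabled           true
spark.dynamicAllocation.minExecutors      5
spark.dynamicAllocation.initialExecutors  10
spark.dynamicAllocation.maxExecutors      20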
When to use dynamic allocation?
When you have multiple streaming applications running on your cluster, or on-demand Spark SQL jobs. Most of the time your jobs need only a few resources and remain almost idle; only during big chunks of streamed data (peak hours) might a job need more resources to process the data. Otherwise cluster resources should be freed and used for other purposes.
Note: make sure to enable the external shuffle service (spark.shuffle.service.enabled=true) when dynamic allocation is enabled.
The purpose of the external shuffle service is to allow executors to
be removed without deleting shuffle files written by them (more
detail). The way to set up this service varies across cluster
managers
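For example, a hedged spark-submit sketch on YARN (assuming the YARN-side shuffle service, i.e. the spark_shuffle auxiliary service, is already configured on the NodeManagers; the class and jar names are placeholders):

spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.initialExecutors=20 \
  --conf spark.dynamicAllocation.maxExecutors=40 \
  --class com.example.MyApp \
  my-app.jar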
References:
https://dzone.com/articles/spark-dynamic-allocation

Related

Why shouldn't I always use dynamic allocation in spark?

Spark's dynamic allocation enables a more efficient use of resources. It is described here: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
It is not a default, and must be set for each job.
What are the downsides of setting every job to use dynamic allocation by default? What effects might be seen if I change this setting for all my running jobs?
To understand it, let's take a look at the documentation.
There you can find this:
spark.dynamicAllocation.enabled (default: false)
Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload. For more detail, see the description here.
This requires spark.shuffle.service.enabled or spark.dynamicAllocation.shuffleTracking.enabled to be set. The following configurations are also relevant: spark.dynamicAllocation.minExecutors, spark.dynamicAllocation.maxExecutors, spark.dynamicAllocation.initialExecutors and spark.dynamicAllocation.executorAllocationRatio.
Default values for the relevant parameters are:
spark.dynamicAllocation.initialExecutors = minExecutors
spark.dynamicAllocation.minExecutors = 0
spark.dynamicAllocation.maxExecutors = infinity
spark.dynamicAllocation.executorAllocationRatio = 1
Let's take a look at the executorAllocationRatio description:
By default, the dynamic allocation will request enough executors to maximize the parallelism according to the number of tasks to process. While this minimizes the latency of the job, with small tasks this setting can waste a lot of resources due to executor allocation overhead, as some executors might not even do any work. This setting allows you to set a ratio that will be used to reduce the number of executors w.r.t. full parallelism. Defaults to 1.0 to give maximum parallelism. 0.5 will divide the target number of executors by 2. The target number of executors computed by the dynamic allocation can still be overridden by the spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors settings.
So what may happen when you just turn dynamic allocation on?
Let's say you have a job which at some stage does repartition(2000), and the number of cores available per executor is set to 2. What is Spark going to do with dynamic allocation enabled and default parameters? It will try to get as many executors as it needs for maximum parallelism. In this case that is 2000 / 2 (number of tasks at the given stage / number of cores per executor; executorAllocationRatio is set to 1, so I am skipping it) = 1000 executors.
I have seen real scenarios in which some jobs were taking a lot of resources to work on really small inputs, just because dynamic allocation was turned on.
IMO, if you want to use it you should also tune the other parameters and definitely limit maxExecutors, especially if you are not alone on your cluster and you don't want to waste time and resources. Sometimes the overhead of creating a new executor is just not worth it.
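As a hedged sketch of such a cap (arbitrary numbers): with repartition(2000) and 2 cores per executor, the settings below would bring the request down from 1000 executors to min(2000 / 2 * 0.5, 50) = 50.

spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.maxExecutors             50
spark.dynamicAllocation.executorAllocationRatio  0.5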

Why does caching small Spark RDDs take a big memory allocation in YARN?

The RDDs that are cached (8 in total) are not big, only around 30 GB. However, the Hadoop UI shows that the Spark application is taking a lot of memory (with no active jobs running), i.e. 1.4 TB. Why so much?
Why does it show around 100 executors (here, i.e. vCores) even when there are no active jobs running?
Also, if the cached RDDs are stored across 100 executors, are those executors preserved so that no other Spark apps can use them for running tasks any more? To rephrase the question: will reserving a little memory (.cache) in executors prevent other Spark apps from leveraging their idle computing resources?
Is there any potential Spark config / zeppelin config that can cause this phenomenon?
UPDATE 1
After checking the Spark conf (Zeppelin), it seems there is a default setting (configured by the administrator) of spark.executor.memory=10G, which is probably the reason.
However, here's a new question: is it possible to keep only the memory needed for the cached RDDs in each executor and release the rest, instead of always holding the initially set spark.executor.memory=10G?
Perhaps you can try to repartition(n) your RDD to a fewer n < 100 partitions before caching. A ~30GB RDD would probably fit into storage memory of ten 10GB executors. A good overview of Spark memory management can be found here. This way, only those executors that hold cached blocks will be "pinned" to your application, while the rest can be reclaimed by YARN via Spark dynamic allocation after spark.dynamicAllocation.executorIdleTimeout (default 60s).
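A minimal Scala sketch of that suggestion, assuming rdd stands in for the ~30 GB RDD from the question:

import org.apache.spark.storage.StorageLevel

// Concentrate the cached blocks on fewer executors by using fewer, larger partitions
val compacted = rdd
  .repartition(10)
  .persist(StorageLevel.MEMORY_AND_DISK) // spills to disk if a partition does not fit in memory

compacted.count() // an action to actually materialize the cached blocks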
Q: Is it possible to keep only the memory needed for the cached RDDs in each executors and release the rest, instead of holding always the initially set memory spark.executor.memory=10G?
When Spark uses YARN as its execution engine, YARN allocates the containers of a specified (by application) size -- at least spark.executor.memory+spark.executor.memoryOverhead, but may be even bigger in case of pyspark -- for all the executors. How much memory Spark actually uses inside a container becomes irrelevant, since the resources allocated to a container will be considered off-limits to other YARN applications.
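A rough, hedged back-of-the-envelope for the numbers in the question (using the default overhead of max(384 MB, 10% of executor memory)):

spark.executor.memory         = 10 GB
spark.executor.memoryOverhead = max(384 MB, 0.10 * 10 GB) = 1 GB
container size per executor   ≈ 10 GB + 1 GB = 11 GB
~100 executors * 11 GB        ≈ 1.1 TB reserved, however little is actually cached

YARN also rounds each container up to a multiple of yarn.scheduler.minimum-allocation-mb, which can push the total further towards the 1.4 TB observed.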
Spark assumes that your data is equally distributed across all the executors and tasks; that's why the same amount of memory is assigned to every task. So, to make Spark consume less memory, your data has to be evenly distributed:
If you are reading from Parquet or CSV files, make sure that they have similar sizes. Running repartition() causes shuffling, which, if the data is heavily skewed, may cause other problems if the executors don't have enough resources.
Caching won't help release memory on the executors, because it doesn't redistribute the data.
Check how big the green bars are under "Event Timeline" on the Stages tab. Normally that is tied to the data distribution, so it's a way to see how much data is loaded (proportionally) onto every task and how much work each one is doing. As all tasks have the same memory assigned, you can see graphically whether resources are wasted (mostly tiny bars and only a few big bars).
There are different ways to create evenly distributed files for your process. I'll mention some possibilities, but there are certainly more:
Using Hive and the DISTRIBUTE BY clause: you need to use a field that is evenly balanced in order to create as many files (and of the proper size) as expected.
If the process creating those files is a Spark process reading from a DB, try to create as many connections as the files you need, and use a proper field to populate the Spark partitions. That is achieved, as explained here and here, with the partitionColumn, lowerBound, upperBound and numPartitions properties (see the sketch after this list).
repartition() may work, but check whether coalesce() also makes sense in your process, or in the previous one that generates the files you are reading from.
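A hedged Scala sketch of that partitioned JDBC read, assuming an existing SparkSession named spark; the URL, table and column names are hypothetical placeholders:

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb") // placeholder connection string
  .option("dbtable", "public.events")                   // placeholder table
  .option("user", "spark_reader")
  .option("password", sys.env("DB_PASSWORD"))
  .option("partitionColumn", "event_id") // a numeric, roughly evenly distributed column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "100")        // 100 parallel connections, 100 Spark partitions
  .load()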

Spark jobs seem to only be using a small amount of resources

Please bear with me because I am still quite new to Spark.
I have a GCP DataProc cluster which I am using to run a large number of Spark jobs, 5 at a time.
Cluster is 1 + 16, 8 cores / 40gb mem / 1TB storage per node.
Now I might be misunderstanding something or not doing something correctly, but I currently have 5 jobs running at once, and the Spark UI shows that only 34/128 vcores are in use, and they do not appear to be evenly distributed (the jobs were executed simultaneously, but the distribution is 2/7/7/11/7). There is only one core allocated per running container.
I have used the flags --executor-cores 4 and --num-executors 6 which doesn't seem to have made any difference.
Can anyone offer some insight/resources as to how I can fine tune these jobs to use all available resources?
I have managed to solve the issue - I had no cap on the memory usage so it looked as though all memory was allocated to just 2 cores per node.
I added the property spark.executor.memory=4G and re-ran the job; it instantly allocated 92 cores.
Hope this helps someone else!
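For reference, a hedged sketch of how such properties might be passed when submitting to Dataproc (cluster, region, bucket, class and jar names are placeholders):

gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/my-job.jar \
  --properties=spark.executor.cores=4,spark.executor.memory=4g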
The Dataproc default configurations should take care of the number of executors. Dataproc also enables dynamic allocation, so executors will only be allocated if needed (according to Spark).
Spark cannot parallelize beyond the number of partitions in a Dataset/RDD. You may need to set the following properties to get good cluster utilization:
spark.default.parallelism: the default number of output partitions from transformations on RDDs (when not explicitly set)
spark.sql.shuffle.partitions: the number of output partitions from aggregations using the SQL API
Depending on your use case, it may make sense to explicitly set partition counts for each operation.
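For instance, a hedged starting point for the 16-worker x 8-core cluster above (128 vcores in total; the right values depend on the workload, and a common rule of thumb is 2-3 tasks per core):

spark.default.parallelism     128
spark.sql.shuffle.partitions  128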

How to rebalance RDD during processing time for unbalanced executor workloads

Suppose I have an RDD with 1,000 elements and 10 executors. Right now I parallelize the RDD with 10 partitions and process 100 elements by each executor (assume 1 task per executor).
My difficulty is that some of these partitioned tasks may take much longer than others, so say 8 executors will be done quickly while the remaining 2 will be stuck doing something for longer. The master process will then be waiting for those 2 to finish before moving on, while the other 8 are idling.
What would be a way to make the idling executors 'take' some work from the busy ones? Unfortunately I can't anticipate ahead of time which ones will end up 'busier' than others, so can't balance the RDD properly ahead of time.
Can I somehow make executors communicate with each other programmatically? I was thinking of sharing a DataFrame with the executors, but based on what I see I cannot manipulate a DataFrame inside an executor?
I am using Spark 2.2.1 and Java.
Try using Spark dynamic resource allocation, which scales the number of executors registered with the application up and down based on the workload.
You can enable it with the properties below:
spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true
You can consider configuring the properties below as well:
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.minExecutors
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
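A hedged sketch of enabling this programmatically, shown in Scala although the question uses Java (the app name is a placeholder; these settings must be in place before the SparkContext is created, and the external shuffle service must be set up on the cluster manager):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rebalance-example") // placeholder name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()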

Is there a way to specify all three resource properties (executor instances, cores and memory) in Spark on YARN (Dataproc)

I'm trying to set up a small Dataproc Spark cluster of 3 workers (2 regular and one preemptible), but I'm running into problems.
Specifically, I've been struggling to find a way to let Spark application submitters have the freedom to specify the number of executors while also being able to specify how many cores should be assigned to them.
The Dataproc image of YARN and Spark has the following defaults:
Spark dynamic allocation enabled
Yarn Capacity Scheduler configured with DefaultResourceCalculator
With these defaults the number of cores is not taken into account (the ratio container-vcores is always 1:1), as DefaultResourceCalculator only cares about memory. In any case, when configured this way, the number of executors is honored (by means of setting spark.dynamicAllocation.enabled = false and spark.executor.instances = <num> as properties in gcloud submit)
So I changed it to DominantResourceCalculator, and now it takes the requested cores into account, but I'm no longer able to specify the number of executors, regardless of whether I disable Spark dynamic allocation or not.
It might also be of interest to know that the default YARN queue is limited to 70 % of capacity by configuration (in capacity-scheduler.xml) and that there is also another non-default queue configured (but not used yet). My understanding is that both Capacity and Fair schedulers do not limit the resource allocation in case of uncontended job submission as long as the max capacity is kept at 100. In any case, for the sake of clarity, these are the properties setup during the cluster creation:
capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.queues=default,online
capacity-scheduler:yarn.scheduler.capacity.root.default.capacity=30
capacity-scheduler:yarn.scheduler.capacity.root.online.capacity=70
capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=1
capacity-scheduler:yarn.scheduler.capacity.root.online.maximum-capacity=100
capacity-scheduler:yarn.scheduler.capacity.root.online.state=RUNNING
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_submit_applications=*
capacity-scheduler:yarn.scheduler.capacity.root.online.acl_administer_queue=*
The job submission is done by means of gcloud tool and the queue used is the default.
E.g., the following properties are set when executing gcloud dataproc submit:
--properties spark.dynamicAllocation.enabled=false,spark.executor.memory=5g,spark.executor.instances=3
end up in an assignment that does not honor both settings.
Is there a way to configure YARN so that it accepts both?
EDITED to specify queue setup
You may try setting yarn.scheduler.capacity.root.online.user-limit-factor to a higher value, such as 2, in place of the current value of 1. This setting lets a single user consume up to twice the queue's configured capacity; your maximum-capacity setting of 100% allows for that doubling.
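If the value is set at cluster creation time, a hedged sketch of how it might look with gcloud (the cluster name is a placeholder; the other capacity-scheduler properties from the question would be passed alongside it):

gcloud dataproc clusters create my-cluster \
  --properties 'capacity-scheduler:yarn.scheduler.capacity.root.online.user-limit-factor=2'

On a running cluster, the same value can be changed in capacity-scheduler.xml and applied with yarn rmadmin -refreshQueues.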
