Changing Number of Executors in Spark during run time - apache-spark

Can Spark change number of executors during runtime?
For example, in an action (job), Stage 1 runs with 4 executors * 5 partitions per executor = 20 partitions in parallel.
If I repartition with .repartition(100), which becomes Stage 2 (because of the repartition shuffle), can Spark in any case increase from 4 executors to 5 executors (or more)?
If I cache some data in Stage 1, which was running with 4 executors * 5 partitions per executor = 20 partitions, then the cached data must be in the RAM of those 4 machines. If I then repartition with .repartition(2), at most 2 executor machines will be involved. Will Spark shuffle my cached data to the active tasks?
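
For reference, here is a minimal sketch (in Scala; the input path and the executor bounds are illustrative, not from the question) of the setup under which Spark can change the executor count mid-application: dynamic allocation has to be enabled, otherwise the executor count chosen at submit time stays fixed for the whole application.

import org.apache.spark.sql.SparkSession

// With dynamic allocation enabled, Spark may request extra executors when a
// stage (e.g. the one created by .repartition(100)) has more pending tasks
// than free task slots; with static allocation the count never changes.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  // Spark 3.0+; older versions need the external shuffle service instead.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "4")
  .config("spark.dynamicAllocation.maxExecutors", "8")
  .getOrCreate()

val df = spark.read.parquet("/path/to/input")  // illustrative path
df.repartition(100).count()                    // the shuffle starts a new stage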

Related

Running more than one Spark application in a cluster: not all applications run optimally, as some complete sooner than others

I am running 20 Spark applications on an EMR cluster of 2 worker nodes and 1 master node with c5.24xlarge instances, so I have 192 cores and 1,024 GB RAM in total.
Each application processes around 1.5 GB of data.
Dynamic allocation is enabled, and the other Spark configuration is as follows:
spark.executor.memory = 9000M,
spark.executor.memoryOverhead = 1000M,
spark.executor.cores = 5,
spark.sql.shuffle.partitions = 40,
spark.dynamicAllocation.initialExecutors = 2,
spark.dynamicAllocation.maxExecutors = 10,
spark.dynamicAllocation.minExecutors = 10,
spark.dynamicAllocation.executorIdleTimeout = 60s,
spark.dynamicAllocation.schedulerBacklogTimeout = 120s,
spark.driver.memory = 9000M,
spark.driver.memoryOverhead = 1000M,
spark.driver.cores = 9
And spark.default.parallelism = 384 (this is decided by EMR; I am not sure how).
These are set at the cluster level, which means the same Spark configuration properties apply to all 20 applications.
With these configuration settings, I can see that only a few applications complete in around 20 minutes, while around 10 applications keep running for more than 2 hours with only one task running (and for some, around 10 tasks).
Questions:
1. Why are the other applications not completing like the ones that finished?
2. I set the shuffle partition count to 40, so why are around 350 tasks being added for each application (is it because of spark.default.parallelism?)
3. The data per task is shown as about 1.4 GB for task 0, 1.3 GB for task 1, and so on. Is that just how data is shown in the task pane, or is the data not being divided properly? (Although I can see in the event metrics that the partitions seem to be the same size, execution-time-wise.)
4. The data per executor in the executor summary tab shows more than 1.4 GB, which is the sum of the inputs of all its tasks, but the application processes only 1.5 GB (from the Spark SQL tab). Does that mean this is just how data is shown in the executor pane?
Thanks!
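
As an aside, a Scala sketch of how the settings above could be applied per application through the SparkSession builder instead of at the cluster level; the values simply mirror the question and are not a recommendation.

import org.apache.spark.sql.SparkSession

// Values are the ones from the question, using the canonical case-sensitive
// property names. minExecutors = maxExecutors = 10 effectively pins the
// executor count even with dynamic allocation enabled. Driver settings are
// omitted because they normally have to be supplied at submit time, not from
// code running inside the already-started driver JVM.
val spark = SparkSession.builder()
  .appName("per-application-config-sketch")
  .config("spark.executor.memory", "9000m")
  .config("spark.executor.memoryOverhead", "1000m")
  .config("spark.executor.cores", "5")
  .config("spark.sql.shuffle.partitions", "40")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.initialExecutors", "2")
  .config("spark.dynamicAllocation.minExecutors", "10")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "120s")
  .getOrCreate()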

Can reduced parallelism lead to no shuffle spill?

Consider an example:
I have a cluster with 5 nodes and each node has 64 cores with 244 GB memory.
I decide to run 3 executors on each node, with executor-cores set to 21 and executor memory of 80 GB, so that each executor can execute 21 tasks in parallel. Now consider 315 (63 * 5) partitions of data, of which 314 partitions are 3 GB each but one is 30 GB (due to data skew).
All of the executors that received only 3 GB partitions have 63 GB occupied (21 * 3, since each executor can run 21 tasks in parallel and each task takes 3 GB of memory).
But the one executor that received the 30 GB partition will need 90 GB (20 * 3 + 30) of memory. So will this executor first execute the 20 tasks of 3 GB and then load the 30 GB task, or will it just try to load 21 tasks and find that for one task it has to spill to disk? If I set executor-cores to just 15, then the executor that receives the 30 GB partition will only need 14 * 3 + 30 = 72 GB and hence won't spill to disk.
So in this case will reduced parallelism lead to no shuffle spill?
@Venkat Dabri, could you please format the question with appropriate line breaks/spaces?
Here are a few pointers.
Spark (shuffle) map stage: the size of each partition depends on the filesystem's block size. E.g. if data is read from HDFS, each partition will try to hold data as close to 128 MB as possible, so for the input data, number of partitions = floor(number of files * block size / 128 MB) (actually 122.07 MB, since mebibytes are used).
The scenario you are describing is for shuffled data in the reducer (result stage).
Here the blocks processed by the reducer tasks are called shuffle blocks, and by default Spark (for the SQL/Core APIs) launches 200 reducer tasks.
An important thing to remember: Spark can hold at most 2 GB per shuffle block, so if you have too few partitions and one of them triggers a remote fetch of a shuffle block > 2 GB, you will see an error like Size exceeds Integer.MAX_VALUE.
To mitigate that, within the default limit Spark employs many optimizations (compression, tungsten sort shuffle, etc.), but as developers we can repartition skewed data intelligently and tune the default parallelism, as in the sketch below.
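
A minimal Scala sketch of those two mitigations: raising the reducer-side partition count and salting a skewed key before repartitioning. The input/output paths, the column name skewed_key, the salt factor of 16 and the partition count of 1000 are illustrative placeholders, not values from the question.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, rand}

val spark = SparkSession.builder().appName("skew-mitigation-sketch").getOrCreate()

// 1. Raise reducer-side parallelism above the default of 200 so each
//    shuffle block stays well under the 2 GB limit.
spark.conf.set("spark.sql.shuffle.partitions", "1000")

val df = spark.read.parquet("/path/to/skewed/input")

// 2. Salt a heavily skewed key so one logical key is spread across
//    several smaller shuffle partitions.
val salted = df.withColumn("salt", (rand() * 16).cast("int"))
val spread = salted.repartition(col("skewed_key"), col("salt"))
spread.write.parquet("/path/to/output")  // illustrative sink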

How does the size of a DataFrame cached in memory affect processing time in Apache Spark?

I was experimenting to find the maximum amount of raw data I can cache without affecting the overall processing time of a job in Spark.
Spark cluster: 2 machines, 12 cores, 96 GB RAM. I have created 12 workers, each with 1 core and 8 GB RAM.
I cached a Parquet file of ~2.4 GB, which creates a memory footprint of around 5.4 GB in RAM. A simple task (Count -> GroupBy -> Collect) takes ~8 seconds.
I then cached 6 similar files, each a Parquet file of ~2.4 GB, creating a total memory footprint of around 30 GB. Performing the same task (Count -> GroupBy -> Collect) on the loaded 5.4 GB cached DataFrame now takes ~12 seconds.
The data is a mix of date, timestamp, string and double fields, with ~300 columns in each file.
Already tried:
Case 1 - total executors 4, cores per executor 3, memory per executor 24 GB
Case 2 - total executors 6, cores per executor 2, memory per executor 16 GB
Case 3 - total executors 12, cores per executor 1, memory per executor 8 GB
Case 3 gives me the best results.
Is this correct behaviour for Spark?
Spark v2.0.2
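
For context, a Scala sketch of the kind of experiment described above; the file path and the grouping column some_column are placeholders, not details from the question.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-size-sketch").getOrCreate()

val df = spark.read.parquet("/data/file1.parquet")  // ~2.4 GB Parquet file
df.cache()
df.count()  // materialises the cache; its in-memory size appears on the Storage tab of the UI

// The Count -> GroupBy -> Collect task from the question
val result = df.groupBy("some_column").count().collect()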

How to handle Spark executors when the number of partitions does not match the number of executors?

Let's say I have 3 executors and 4 partitions, and assume these numbers cannot be changed.
This is not an efficient setup, because we have to read in 2 passes: in the first pass we read 3 partitions, and in the second pass we read 1 partition.
Is there a way in Spark to improve the efficiency without changing the number of executors and partitions?
In your scenario you need to update the number of cores.
In Spark, each partition is taken up for execution by one task. Since you have 3 executors and 4 partitions, if we assume you have 3 cores in total (i.e. one core per executor), then 3 partitions of data will run in parallel, and the remaining partition will be picked up once a core on one of the executors becomes free. To remove this latency, we need to set spark.executor.cores=2, i.e. each executor can run 2 threads, and therefore 2 tasks, at a time (see the sketch below).
Then all your partitions will be executed in parallel, but there is no guarantee whether 1 executor will run 2 tasks and the other 2 executors one task each, or 2 executors will run 2 tasks each on 2 partitions with the third executor idle.
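
A minimal sketch of that setting, assuming the executor count is fixed at 3 as in the question; with 2 cores per executor there are 6 task slots, enough for all 4 partitions in a single wave.

import org.apache.spark.sql.SparkSession

// 3 executors x 2 cores = 6 task slots, so 4 partitions fit in one wave.
// Which executor picks up which partition is left to the scheduler.
val spark = SparkSession.builder()
  .appName("executor-cores-sketch")
  .config("spark.executor.instances", "3")
  .config("spark.executor.cores", "2")
  .getOrCreate()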

Losing Spark Executors with many tasks outstanding

Whether I use dynamic allocation or explicitly specify executors (16) and executor cores (8), I have been losing executors even though the tasks outstanding are well beyond the current number of executors.
For example, I have a Spark SQL job running with over 27,000 tasks, 14,000 of which were complete, but the executors "decayed" from 128 down to as few as 16 with thousands of tasks still outstanding. The log doesn't note any errors/exceptions precipitating these lost executors.
It is a Cloudera CDH 5.10 cluster running on AWS EC2 instances with 136 CPU cores and Spark 2.1.0 (from Cloudera).
17/05/23 18:54:17 INFO yarn.YarnAllocator: Driver requested a total number of 91 executor(s).
17/05/23 18:54:17 INFO yarn.YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 91 executors.
It's a slow decay where every minute or so more executors are removed.
Some potentially relevant configuration options:
spark.dynamicAllocation.maxExecutors = 136
spark.dynamicAllocation.minExecutors = 1
spark.dynamicAllocation.initialExecutors = 1
yarn.nodemanager.resource.cpu-vcores = 8
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.increment-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 8
Why are the executors decaying away and how can I prevent it?
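
For reference, a Scala sketch collecting the dynamic-allocation settings quoted above in one place, together with spark.dynamicAllocation.executorIdleTimeout, which controls how long an executor with no running tasks is kept before being released (60s by default). Raising it to 300s here is an illustrative assumption, not a confirmed diagnosis of the decay.

import org.apache.spark.SparkConf

// min/initial/max values are the ones from the question; the idle timeout
// value of 300s is only an example of the knob, not a recommended setting.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.initialExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "136")
  .set("spark.dynamicAllocation.executorIdleTimeout", "300s")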
