Is there a Spark config parameter we can pass while submitting jobs through spark-submit that will kill/fail the job if it does not get its containers within a given time?
For example, if a job requests 8 YARN containers and they cannot be allocated within 2 hours, the job should kill itself.
EDIT: We have scripts launching Spark and MR jobs on a cluster. This issue is not a major one for MR jobs, since they can start even if only 1 container is available. MR jobs also need much less memory, so their containers can be smaller, which leaves more containers available in the cluster.
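For what it's worth, I'm not aware of a built-in Spark property that does exactly this. One workaround is a watchdog around the YARN CLI -- a minimal sketch, assuming you capture the application ID at submit time and want a 2-hour deadline (it only catches jobs still stuck in the ACCEPTED state; a partially allocated job would need a check against the Spark REST API instead):

    #!/usr/bin/env bash
    # Watchdog: kill a YARN application that still has no containers
    # after a deadline. APP_ID and the timeout are placeholders.
    APP_ID="$1"                    # e.g. application_1234567890123_0001
    TIMEOUT_SECS=$((2 * 60 * 60))  # 2 hours

    sleep "$TIMEOUT_SECS"
    STATE=$(yarn application -status "$APP_ID" 2>/dev/null \
        | grep -E '^[[:space:]]*State :' | awk '{print $NF}')
    if [ "$STATE" = "ACCEPTED" ]; then
        echo "$APP_ID got no containers after ${TIMEOUT_SECS}s; killing it."
        yarn application -kill "$APP_ID"
    fi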
My main intent is to get Spark (3.3) running on k8s with HDFS.
I went through the Spark website and got the Spark Pi example running on a k8s cluster via a spark-submit command. I read that if we submit multiple jobs to a k8s cluster, k8s may end up starving all the pods -- meaning there is no queueing in place and no scheduler (like YARN) that keeps a check on resources and arranges tasks across the nodes.
So, my question is: what is the simplest way to set up a scheduler on k8s? I read about Volcano -- but it's not yet GA. I read about gang scheduling and YuniKorn -- but I don't see much community support for them.
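For what it's worth, Spark 3.3 added built-in support for plugging a custom pod scheduler such as Volcano into spark-submit. A hedged sketch, assuming a Spark distribution built with -Pvolcano and Volcano installed on the cluster (the API server address, image name, and jar path are placeholders):

    # Run Spark Pi with Volcano as the pod scheduler (Spark 3.3+)
    spark-submit \
      --master k8s://https://<k8s-apiserver>:6443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.kubernetes.container.image=<your-spark-image> \
      --conf spark.kubernetes.scheduler.name=volcano \
      --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
      --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
      local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000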
I have a Dataproc cluster:
master: 6 cores, 32g
worker{0-7}: 6 cores, 32g
Maximum allocation: memory 24576 MB, vCores 6
I have two Spark streaming jobs to submit, one after the other.
First, I tried submitting with the default configuration, spark.dynamicAllocation.enabled=true.
In about 30% of cases, the first job grabbed almost all the available memory and the second sat queued, waiting for resources for ages. (This is a streaming job which took only a small portion of resources every batch.)
My second try was to disable dynamic allocation. I submitted the same two jobs with identical configurations:
spark.dynamicAllocation.enabled=false
spark.executor.memory=12g
spark.executor.cores=3
spark.executor.instances=6
spark.driver.memory=8g
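(For reference, the same attempt expressed as a submit command -- the class and jar names are placeholders:)

    spark-submit \
      --conf spark.dynamicAllocation.enabled=false \
      --conf spark.executor.memory=12g \
      --conf spark.executor.cores=3 \
      --conf spark.executor.instances=6 \
      --conf spark.driver.memory=8g \
      --class com.example.StreamingJob \
      streaming-job.jar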
Surprisingly, in the YARN UI I saw:
7 running containers with 84g memory allocation for the first job.
3 running containers with 36g memory allocation and 72g reserved memory for the second job.
In the Spark UI there were 6 executors plus the driver for the first job, and 2 executors plus the driver for the second job.
After retrying (deleting the previous jobs and resubmitting the same ones) without dynamic allocation and with the same configurations, I got a totally different result:
5 containers with 59g memory allocation for each job, plus 71g reserved memory for the second job. In the Spark UI I saw 4 executors plus the driver in both cases.
I have a couple of questions:
1. If dynamicAllocation=false, why does the number of YARN containers differ from the number of executors? (At first I thought the additional YARN container was the driver, but its memory is different.)
2. If dynamicAllocation=false, why doesn't YARN create containers to my exact requirements -- 6 containers (Spark executors) for each job? And why do two attempts with the same configuration lead to different results?
3. If dynamicAllocation=true, how can a Spark job with low memory consumption take control of all the YARN resources?
Thanks
Spark and YARN scheduling are pretty confusing. I'm going to answer the questions in reverse order:
3) You should not be using dynamic allocation in Spark streaming jobs.
The issue is that Spark continuously asks YARN for more executors as long as there's a backlog of tasks to run. Once a Spark job gets an executor, it keeps it until the executor has been idle for 1 minute (configurable via spark.dynamicAllocation.executorIdleTimeout). In batch jobs, this is okay because there's generally a large, continuous backlog of tasks.
However, in streaming jobs, there's a spike of tasks at the start of every micro-batch, but executors are actually idle most of the time. So a streaming job will grab a lot of executors that it doesn't need.
To fix this, the old streaming API (DStreams) has its own version of dynamic allocation: https://issues.apache.org/jira/browse/SPARK-12133. This JIRA has more background on why Spark's batch dynamic allocation algorithm isn't a good fit for streaming.
However, Spark Structured Streaming (likely what you're using) does not support dynamic allocation: https://issues.apache.org/jira/browse/SPARK-24815.
tl;dr Spark requests executors based on its task backlog, not based on memory used.
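If you are on the DStream API, the streaming-specific dynamic allocation from that JIRA can be switched on with properties along these lines. A sketch only: the spark.streaming.dynamicAllocation.* properties are not in the main config reference, so verify them against your Spark version, and core dynamic allocation must be disabled for them to take effect (class and jar names are placeholders):

    spark-submit \
      --conf spark.dynamicAllocation.enabled=false \
      --conf spark.streaming.dynamicAllocation.enabled=true \
      --conf spark.streaming.dynamicAllocation.minExecutors=1 \
      --conf spark.streaming.dynamicAllocation.maxExecutors=6 \
      --class com.example.DStreamJob \
      dstream-job.jar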
1 & 2) #Vamshi T is right. Every YARN application has an "Application Master", which is responsible for requesting containers for the application. Each of your Spark jobs has an app master that proxies requests for containers from the driver.
Your configuration doesn't seem to match what you're seeing in YARN, so I'm not sure what's going on there. You have 8 workers with 24g given to YARN. With 12g executors, you should have 2 executors per node, for a total of 16 "slots". An app master + 6 executors should be 7 containers per application, so both applications should fit within the 16 slots.
We configure the app master to have less memory; that's why the total memory for an application isn't a clean multiple of 12g.
If you want both applications to schedule all their executors concurrently, you should set spark.executor.instances=5.
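Making the arithmetic explicit (a comment-only sketch; it assumes the app master container is smaller than an executor):

    # 8 workers x 24g for YARN = 192g; 12g executors -> 2 per node = 16 slots
    # as requested: 2 apps x (1 AM + 6 executors) = 14 containers
    #   -> 12 executor-sized slots + 2 smaller AM containers, should fit in 16
    # with spark.executor.instances=5:
    #   2 apps x (1 AM + 5 executors) = 12 containers, leaving clear headroom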
Assuming you're using structured streaming, you could also just run both streaming jobs in the same Spark application (submitting them from different threads on the driver).
Useful references:
Running multiple jobs in one application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Dynamic allocation: https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Spark-on-YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
I have noticed similar behavior in my experience as well, and here is what I observed. First, the resource allocation by YARN depends on the resources available on the cluster when the job is submitted. When both jobs are submitted at almost the same time with the same config, YARN distributes the available resources equally between the jobs. Once you throw dynamic allocation into the mix, things get a little confusing/complex. Now, in your case:
7 running containers with 84g memory allocation for the first job.
-- You got 7 containers because you requested 6 executors: one container per executor, plus one extra container for the Application Master.
3 running containers with 36g memory allocation and 72g reserved memory for the second job.
-- Since the second job was submitted some time later, YARN allocated the remaining resources: one container for each of its 2 executors, plus the extra one for its Application Master.
The container count will never match the number of executors you requested; it will always be one more, because one container is needed to run your Application Master.
Hope that answers part of your question.
I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where the application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less; if it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job -- ideally there would be some kind of spark / yarn / emr setting I could use.
Note: I've tried using Spark speculation to unblock the stuck job, but that doesn't help.
EMR has a Bootstrap Actions feature that lets you run scripts when the cluster is initialized. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates itself after a certain time.
I use a script based on this one for the bootstrap action: https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, write a script that checks /proc/uptime to see how long the EC2 machine has been online; once uptime surpasses your time limit, send a shutdown command to the cluster.
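A hedged sketch of such a bootstrap action, in the spirit of the linked terminate_idle_cluster.sh (the 6-hour limit is a placeholder; the loop is backgrounded so the bootstrap action itself returns and lets the cluster finish provisioning, and shutting down the master node terminates an EMR cluster):

    #!/usr/bin/env bash
    # Self-terminating watchdog: shut the node down once uptime
    # exceeds a hard limit.
    MAX_UPTIME_SECS=$((6 * 60 * 60))

    (
      while true; do
        UPTIME_SECS=$(cut -d. -f1 /proc/uptime)   # whole seconds since boot
        if [ "$UPTIME_SECS" -gt "$MAX_UPTIME_SECS" ]; then
          sudo shutdown -h now
        fi
        sleep 60
      done
    ) &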
How does Spark choose the nodes to run executors on? (Spark on YARN)
We use Spark in YARN mode, with a cluster of 120 nodes.
Yesterday one Spark job created 200 executors: 11 executors on node1, 10 executors on node2,
and the remaining executors distributed evenly across the other nodes.
Because there were so many executors on node1 and node2, the job ran slowly.
How does Spark select the nodes to run executors on?
According to the YARN ResourceManager?
Since you mentioned Spark on YARN:
YARN chooses executor nodes for a Spark job based on the availability of cluster resources. Please look into YARN's queue system and dynamic allocation. The best documentation: https://blog.cloudera.com/blog/2016/01/untangling-apache-hadoop-yarn-part-3/
The cluster manager allocates resources across the applications.
I think the issue is a poorly optimized configuration. You should configure Spark for dynamic allocation; in that case Spark will analyze the cluster's resources and adjust itself to optimize the work.
You can find all the information about Spark resource allocation and how to configure it here: http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
Do all 120 nodes have identical capacity?
Moreover, jobs are assigned to a suitable NodeManager based on each NodeManager's health and resource availability.
To optimize a Spark job, you can use dynamic resource allocation, where you do not need to define the number of executors required to run the job. By default the application starts with the configured minimum CPU and memory, then acquires resources from the cluster for executing tasks. It releases resources back to the cluster manager once the job has completed or has been idle up to the configured idle timeout, and reclaims resources from the cluster when work starts again (see the sketch below).
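A sketch of the relevant properties, assuming Spark on YARN (the bounds, timeout, class, and jar names are placeholders; the external shuffle service must be enabled so executors can be released safely):

    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=2 \
      --conf spark.dynamicAllocation.maxExecutors=50 \
      --conf spark.dynamicAllocation.executorIdleTimeout=60s \
      --class com.example.YourJob \
      your-job.jar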
With Spark on YARN I don't see a way to prevent concurrent jobs from being scheduled. My architecture is set up for purely batch processing.
I need this for the following reasons:
Resource constraints
The usercache for Spark grows really quickly; having multiple jobs run causes an explosion of cache space.
Ideally I'd love to see if there is a config that would ensure only one job runs on YARN at any time.
You can create a queue which can host only one Application Master and run all Spark jobs in that queue. Thus, if a Spark job is running, other jobs will be accepted, but they won't be scheduled and won't run until the running execution has finished...
Finally found the solution -- it was in the YARN docs: yarn.scheduler.capacity.maximum-applications has to be set to 1 instead of the default 10000.
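A hedged sketch of what that can look like in capacity-scheduler.xml, written as property = value pairs (the queue name "batch" and the capacity split are placeholders; apply the change with yarn rmadmin -refreshQueues). Note that maximum-applications rejects submissions over the limit, whereas capping the AM resources, as in the other answer, leaves extra jobs ACCEPTED and waiting:

    yarn.scheduler.capacity.root.queues=default,batch
    yarn.scheduler.capacity.root.default.capacity=20
    yarn.scheduler.capacity.root.batch.capacity=80
    # at most one application running or pending in the batch queue
    yarn.scheduler.capacity.root.batch.maximum-applications=1
    # alternative: leave room for only one AM, so extra jobs queue up instead
    yarn.scheduler.capacity.root.batch.maximum-am-resource-percent=0.1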