Can we kill one of the jobs (a time-consuming one) of a running Spark application and move on to the next job?
Let us say there are 50 jobs in a Spark application and one of them is taking much longer (maybe it requires more memory than what we have configured). Can we kill that job and move on to the next job?
We could then run that job (the action which triggers it) later with a higher memory configuration.
If this is not possible, how should these situations be handled?
You can kill a running job by:
opening the Spark application UI.
going to the Jobs tab.
finding the job among the running jobs.
clicking the kill link and confirming.
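If you want to do this programmatically instead of through the UI, Spark also exposes a job-group API on the driver. Below is a minimal PySpark sketch of that idea; the group name, the dummy workload, and the 10-minute limit are illustrative assumptions, not anything from the question.

```python
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cancellable-jobs").getOrCreate()
sc = spark.sparkContext

def run_expensive_job():
    # Tag everything triggered from this thread with a job group so it can be cancelled later.
    sc.setJobGroup("expensive-job", "job we may want to kill", interruptOnCancel=True)
    return sc.parallelize(range(10 ** 8)).map(lambda x: x * x).count()

worker = threading.Thread(target=run_expensive_job, daemon=True)
worker.start()

# If the job has not finished after 10 minutes, cancel it and move on to the next action.
worker.join(timeout=600)
if worker.is_alive():
    sc.cancelJobGroup("expensive-job")  # cancels only the jobs tagged with this group
```

The rest of the application keeps running after the cancellation, so the job that was killed could be retried later with a different memory configuration.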
Related
I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it will finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job -- ideally there would be some kind of spark / yarn / emr setting I could use.
Note: I've tried using spark speculation to unblock the stuck spark job, but that doesn't help.
EMR has a Bootstrap Actions feature where you can run scripts when the cluster is initialized. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates the cluster after a certain time.
I use a script based off this one for the bootstrap action. https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, write a script that checks /proc/uptime to see how long the EC2 machine has been online; once the uptime surpasses your time limit, you can send a shutdown command to the cluster.
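For illustration, here is a minimal Python sketch of that self-termination check; the six-hour limit and the use of sudo shutdown on the master node are assumptions, and the linked bash script remains the original approach.

```python
import subprocess
import time

MAX_UPTIME_SECONDS = 6 * 60 * 60  # assumed limit: the job normally finishes in under 5 hours

def uptime_seconds():
    # The first field of /proc/uptime is the number of seconds since the machine booted.
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

while True:
    if uptime_seconds() > MAX_UPTIME_SECONDS:
        # Shutting down the master node ends an ephemeral, single-step EMR cluster.
        subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
        break
    time.sleep(300)  # re-check every 5 minutes
```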
I run a job on a Spark cluster, and it takes most of the server's resources. Now I want to suspend the job so that another program can run faster, and then resume the job later, but I don't know how to do it.
I don't think you can suspend a job and resume it later.
Sorry, but I can't find the configuration option I need. I schedule Spark applications, and sometimes they have not succeeded after 1 hour; in this case I want to automatically kill the task (because I am sure it will never succeed, and another scheduled run may start).
I found a timeout configuration, but as I understand it, this is used to delay the start of a workflow.
So is there a kind of 'living' timeout?
Oozie cannot kill a workflow that it triggered. However, you can ensure that only a single workflow instance runs at the same time by setting Concurrency = 1 in the Coordinator.
Also you can have a second Oozie workflow monitoring the status of the Spark job.
Anyway, you should investigate the root cause of the Spark job not succeeding or being blocked.
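If a second monitoring workflow feels too heavy, one hedged alternative is a small watchdog that kills the YARN application once it runs past the hour using the standard yarn application -kill command. A sketch in Python; the application id is hypothetical and would come from your scheduler, and the one-hour limit matches the question.

```python
import subprocess
import time

APP_ID = "application_1234567890123_0042"  # hypothetical YARN application id
TIMEOUT_SECONDS = 60 * 60                  # give up after 1 hour

deadline = time.time() + TIMEOUT_SECONDS
while time.time() < deadline:
    # `yarn application -status <id>` reports the current state of the application.
    status = subprocess.run(
        ["yarn", "application", "-status", APP_ID],
        capture_output=True, text=True, check=False,
    ).stdout
    if any(s in status for s in ("State : FINISHED", "State : FAILED", "State : KILLED")):
        break  # the application reached a terminal state on its own
    time.sleep(60)
else:
    # Still running after the deadline: kill it so the next scheduled run can start.
    subprocess.run(["yarn", "application", "-kill", APP_ID], check=False)
```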
I'm experiencing a strange behavior while streaming from Kafka using spark 2.1.0/2.0.2 on AWS EMR.
"spark.streaming.concurrentJobs" was set to 1 explicitly to the streaming job but after running for a while, the job tab showed more than 1 active jobs running and such "active" jobs keep increasing.
Inside such jobs, some stages remaining not executed for ever (status is --). However all the tasks are shown as SUCCEED under those jobs.
What could be wrong here? A more strange thing is that, such behavior seems not occurring unless I open the Spark UI page to check the current status frequently.
Jobs tab - http://ibb.co/j6XkXk
Stages - http://ibb.co/budg55
It was only Job 12109 at the beginning. Things piled up when I switched tabs a couple of times.
Regards,
Alex
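For context, this is roughly how the property mentioned above is usually applied when building the streaming context. A sketch only; the app name, Spark session setup, and 10-second batch interval are illustrative, and spark.streaming.concurrentJobs itself is an undocumented setting.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Explicitly limit concurrent streaming output jobs to one.
conf = (SparkConf()
        .setAppName("kafka-streaming")
        .set("spark.streaming.concurrentJobs", "1"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=10)
```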
With Spark on YARN, I don't see a way to prevent concurrent jobs from being scheduled. My architecture is set up for purely batch processing.
I need this for the following reasons:
Resource Constraints
The UserCache for Spark grows really quickly. Having multiple jobs run causes an explosion of cache space.
Ideally I'd love to see if there is a config that would ensure only one job runs at any time on YARN.
You can create a queue that can host only one application master and run all Spark jobs on that queue. Thus, if a Spark job is running, the others will be accepted but won't be scheduled and run until the running execution has finished...
Finally found the solution; it was in the YARN documentation: yarn.scheduler.capacity.max-applications has to be set to 1 instead of 10000.
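As a follow-up illustration, here is a hedged sketch of how a batch job could then be pointed at such a single-application queue. The queue name "batch" and the script path are assumptions; the queue's application limit itself would be configured in capacity-scheduler.xml as described above.

```python
import subprocess

# Submit every batch job to a dedicated queue so YARN only admits one application at a time
# (the queue's application limit is set in capacity-scheduler.xml as described above).
subprocess.run([
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--queue", "batch",     # --queue is the spark-submit option for targeting a YARN queue
    "my_batch_job.py",      # hypothetical application script
], check=True)
```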