All the jobs of my Spark application have finished, but the SparkContext cannot be stopped.
The following is the driver log:
This doesn't happen every time; roughly one in ten runs fails to end the application.
I use Executors.newFixedThreadPool().submit() to process about 500 SQL statements.
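For context, here is a rough PySpark sketch of that setup; the question uses Java's Executors.newFixedThreadPool, so concurrent.futures.ThreadPoolExecutor stands in for it here, and the pool size and statements are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor, wait
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-sql").getOrCreate()

# Placeholders standing in for the ~500 SQL statements.
statements = ["SELECT 1 AS a", "SELECT 2 AS b"]

def run_statement(stmt):
    # collect() forces execution; spark.sql() on its own is lazy.
    return spark.sql(stmt).collect()

# Rough Python analogue of Executors.newFixedThreadPool(8).submit(...).
pool = ThreadPoolExecutor(max_workers=8)
futures = [pool.submit(run_statement, s) for s in statements]

# Wait for every statement and shut the pool down *before* stopping Spark;
# a pool that is never shut down can leave worker threads alive and keep
# the driver from exiting cleanly.
wait(futures)
pool.shutdown(wait=True)
spark.stop()
```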
Related
I have a Spark Structured Streaming application where every batch normally gets processed in a few seconds. Right now, the current batch is stuck with all tasks in RUNNING status for more than an hour.
How can I specify a timeout at the task level to tell Spark that it should retry a task if it is not completed within a defined time?
I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where the application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it will finish in 5 hours or less; if it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up an extra service to monitor the job; some kind of Spark / YARN / EMR setting I could use would be preferable.
Note: I've tried using Spark speculation to unblock the stuck job, but that doesn't help.
EMR has a Bootstrap Actions feature that lets you run scripts when the cluster is being initialized. I've used this feature with a startup script that monitors how long the cluster has been online and terminates it after a certain time.
I use a script based on this one for the bootstrap action: https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, write a script that checks /proc/uptime to see how long the EC2 machine has been online; once the uptime surpasses your time limit, send a shutdown command to the cluster.
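A minimal Python sketch of that idea (the linked script itself is a shell script; the time limit and the exact shutdown command below are illustrative assumptions):

```python
import subprocess
import time

MAX_UPTIME_SECONDS = 6 * 60 * 60  # assumed limit; set it above your longest expected run

def uptime_seconds():
    # The first field of /proc/uptime is seconds since the machine booted.
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

while uptime_seconds() < MAX_UPTIME_SECONDS:
    time.sleep(60)

# Shutting down the master node brings the whole EMR cluster down with it
# (assuming this monitor was started on the master by the bootstrap action).
subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
```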
I have a PySpark job that I submit to a standalone Spark cluster. It is an auto-scaling cluster on EC2 boxes, so when jobs are submitted and not enough nodes are available, a few more boxes spin up and become available after a few minutes.
We have a @timeout decorator on the main part of the Spark job that raises an error when a certain time threshold is exceeded (put in place because some jobs were hanging). The issue is that sometimes a job may not have actually started yet because it is waiting on resources, yet the @timeout function is evaluated and the job errors out as a result.
So I'm wondering if there's any way to tell, from within the application itself, with code, whether the job is waiting for resources?
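For reference, the kind of decorator described above typically looks something like this (a sketch only; the signal-based approach and the one-hour threshold are assumptions, not the asker's actual code):

```python
import signal
from functools import wraps

def timeout(seconds):
    """Raise TimeoutError if the wrapped function runs longer than `seconds`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            def handler(signum, frame):
                raise TimeoutError(f"{func.__name__} exceeded {seconds}s")
            previous = signal.signal(signal.SIGALRM, handler)
            signal.alarm(seconds)
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)                        # cancel any pending alarm
                signal.signal(signal.SIGALRM, previous)
        return wrapper
    return decorator

@timeout(3600)  # illustrative one-hour limit on the main Spark work
def run_job():
    ...
```

The clock starts as soon as the decorated function is called, which is exactly why the timer keeps ticking while the job is still queued waiting for resources.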
To know the status of the application, you need to access the Spark History Server, from which you can get the current status of the job.
You can solve your problem as follows:
Get the application ID of your job through sc.applicationId.
Then use this application ID with the Spark History Server REST APIs to get the status of the submitted job.
You can find the Spark History Server REST APIs at link.
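A minimal sketch of those two steps (the History Server host and port are assumptions; the same /api/v1 endpoints are also exposed by the live application UI, on port 4040 by default):

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
app_id = spark.sparkContext.applicationId   # step 1: the application ID

# Step 2: ask the REST API for the status of this application's jobs.
base = "http://history-server-host:18080/api/v1"    # assumed host and port
resp = requests.get(f"{base}/applications/{app_id}/jobs", timeout=10)
for job in resp.json():
    print(job["jobId"], job["status"])   # RUNNING, SUCCEEDED, FAILED, ...
```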
Can we kill one of the jobs (a time-consuming one) of a running Spark application and move on to the next job?
Let us say there are 50 jobs in a Spark application and one of them is taking more time (maybe it requires more memory than what we have configured). Can we kill that job and move on to the next job?
Then we could run that job (the action which triggers it) later with a higher memory configuration.
If this is not possible, how should these conditions be handled?
You can kill a running job by:
opening the Spark application UI.
going to the Jobs tab.
finding the job among the running jobs.
clicking on the kill link and confirming (a programmatic alternative is sketched below).
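If you would rather do this from code than from the UI, here is a sketch using job groups (the group name, the dummy workload, and the 60-second cutoff are illustrative; it assumes a PySpark version where job-group properties are per-thread):

```python
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def expensive_action():
    # Every job triggered from this thread after setJobGroup belongs to
    # the "slow-job" group and can be cancelled as a unit.
    sc.setJobGroup("slow-job", "long-running aggregation", interruptOnCancel=True)
    spark.range(10**9).selectExpr("sum(id)").collect()

worker = threading.Thread(target=expensive_action)
worker.start()

# If the job is still running after the cutoff, kill just that job group;
# the rest of the application keeps running and can move on to the next job.
worker.join(timeout=60)
if worker.is_alive():
    sc.cancelJobGroup("slow-job")
```

The cancelled action could then be resubmitted later, for example with a higher memory configuration, which matches the fallback described in the question.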
I created two jobs in Heritrix 3.2.0 and launched both after building them. Both started running, but after 15 to 20 seconds one job stopped while the other continued. When a job stops, the status in the job's log is as follows:
2015-05-12T06:40:33.715Z INFO EMPTY 20150512063923
So I could not run the jobs in parallel. How can I fix this?
No, it just means that this job is done (its queue is empty). If no pages were downloaded, it probably means that your decide rules are too strict and don't allow anything to be downloaded.