Job inside spark application is complete but I still see the status as running, why? - apache-spark

I am running a Spark application that has completed all its jobs, but the status of the application in the YARN cluster portal is still RUNNING (for more than 30 minutes). Please let me know why this is happening.
Spark UI showing my jobs are completed
Spark application status is still running

I had the same problem with Spark 2.4.8 running on Kubernetes (K8s). I didn't understand why, but I solved it by stopping the context manually:
spark.sparkContext.stop()
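
For context, here is a minimal PySpark sketch of that idea (the app name and the example job are placeholders, not from the original post): call stop() explicitly once your work is done, so the cluster manager marks the application as finished instead of leaving it RUNNING.

# Minimal sketch, assuming a PySpark application submitted to YARN or K8s.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-job").getOrCreate()  # "my-job" is a placeholder name

try:
    # ... run your jobs here: reads, transformations, actions ...
    spark.range(10).count()
finally:
    # spark.stop() also stops the underlying SparkContext, which signals
    # the cluster manager (YARN/K8s) that the application has completed.
    spark.stop()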

Related

Why is a Spark process running on my cluster even before I have started my SparkSession?

I have set up a Dataproc cluster to run my Spark jobs on. I have just set up the cluster and have not started any Spark session yet. Still, I see Spark, MapReduce, and YARN processes in the output of my top command. What is that about? Shouldn't the Spark processes start only after I have started the SparkSession with the configuration of my choice?
These are background daemons that run and monitor the Hadoop and Spark ecosystem, waiting for you to submit a request or program that can be run. They need to be up and running before you can run a Spark app. This is perfectly normal on Linux.

Kill Spark Job or terminate EMR Cluster if job takes longer than expected

I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but the cluster is really stuck. I know that if my job doesn't get stuck, it will finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job; ideally there would be some kind of Spark/YARN/EMR setting I could use.
Note: I've tried using Spark speculation to unblock the stuck Spark job, but that doesn't help.
EMR has a Bootstrap Actions feature that lets you run scripts when the cluster is initializing. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates it after a certain time.
I use a script for the bootstrap action based on this one: https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, make a script that checks /proc/uptime to see how long the EC2 machine has been online, and once the uptime surpasses your time limit, send a shutdown command to the cluster.
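
As a hedged illustration only (the linked bootstrap script is a shell script; this is a hypothetical Python sketch of the same idea, and the 6-hour limit, file layout, and shutdown command are assumptions): read /proc/uptime periodically and shut the master node down once the limit is exceeded. On an ephemeral, single-step EMR cluster, shutting down the master effectively ends the run. A bootstrap action would need to launch this in the background (e.g. with nohup ... &) so the bootstrap step itself can finish.

#!/usr/bin/env python3
# Hypothetical sketch of a self-terminating watchdog for an EMR master node.
import subprocess
import time

MAX_UPTIME_SECONDS = 6 * 60 * 60  # assumed hard limit; the job normally finishes in <= 5 hours

def uptime_seconds():
    # /proc/uptime: the first field is seconds since the machine booted
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

def main():
    while True:
        if uptime_seconds() > MAX_UPTIME_SECONDS:
            # Shut the node down; on an ephemeral single-step cluster this
            # terminates the whole run.
            subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
            return
        time.sleep(60)

if __name__ == "__main__":
    main()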

Spark 2.3.1 custom non-Ambari installation on HDP performance issue

I have a custom, non-Ambari installation of Spark 2.3.1 on HDP 2.6.2 running on a cluster. I have made all the necessary configurations as per the Spark and non-Ambari installation guides.
Now when I submit the Spark job in YARN cluster mode, I see a huge gap of 10-12 minutes between the jobs, and I do not see any errors or any operations being performed between the jobs. The attached screenshot shows a delay of close to 10 minutes between the jobs, and this is leading to unnecessary delay in completing the Spark job.
Spark 2.3.1 job submitted in Yarn Cluster mode
I have checked the Yarn logs and Spark UI and I do not see any errors or any operations logged with the timestamp between the jobs.
Looking through the event timeline, I see a gap of 10+ minutes between the jobs.
Event timeline gap between the jobs
Any pointers on how to fix this issue and improve the performance of the job would be appreciated.
Regards,
Vish

Why can't I see my spark job on port 4040 even though I see it running in the yarn UI?

It was my understanding that we can always see Spark jobs at port 4040 on the master node.
I am running Spark on YARN, and I'm running a job through spark-submit. When I go into the YARN UI, I can see that the job is RUNNING, and when I go to the Spark master at <master ip>:18081 I can see that the Spark job has been picked up by the history server.
(And of course I do not see it running at <master ip>:18080, because I'm not running in standalone mode.)
However, I am not able to see the job running at port 4040.
The job does not fail - it just hangs for forever. And I can see that my workers are alive in the spark master UI.
This is the resource utilization shown by YARN while it is running:
How do I fix this to show that my spark job is running, or at least give me some feedback on whether or not my job is progressing?

How to tell if your spark job is waiting for resources

I have a PySpark job that I submit to a standalone Spark cluster. This is an auto-scaling cluster on EC2 boxes, so when jobs are submitted and not enough nodes are available, a few more boxes spin up and become available after a few minutes.
We have a @timeout decorator on the main part of the Spark job that times out and errors when a certain time threshold is exceeded (put in place because some jobs were hanging). The issue is that sometimes a job may not have actually started because it's waiting on resources, yet the @timeout is still evaluated and the job errors out as a result.
So I'm wondering if there's any way to tell from within the application itself, with code, whether the job is waiting for resources?
To know the status of the application, you need to access the Spark History Server, from which you can get the current status of the job.
You can solve your problem as follows:
Get the application ID of your job through sc.applicationId.
Then use this application ID with the Spark History Server REST APIs to get the status of the submitted job.
You can find the Spark History Server REST APIs at link.
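
A minimal sketch of that approach, assuming the history server is reachable at spark-history:18080 (the host, port, and use of the requests package are assumptions); depending on your setup, the same /api/v1 endpoints are also served by the running application's own UI on port 4040:

# Hedged sketch: look up the current job statuses for this application
# via the Spark monitoring REST API.
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
app_id = spark.sparkContext.applicationId  # the same value as sc.applicationId

HISTORY_SERVER = "http://spark-history:18080"  # assumed host:port, adjust to your deployment

# Each entry returned by /jobs has a "status" field
# (RUNNING, SUCCEEDED, FAILED, UNKNOWN).
resp = requests.get(f"{HISTORY_SERVER}/api/v1/applications/{app_id}/jobs")
resp.raise_for_status()
for job in resp.json():
    print(job["jobId"], job["status"])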
