Zeppelin persists job in YARN - apache-spark

When I run a Spark job from Zeppelin, the job finishes successfully, but it stays in the RUNNING state in YARN.
The problem is that the job keeps holding resources in YARN. I think Zeppelin keeps the job alive in YARN.
How can I resolve this problem?
Thank you

There are two solutions.
The quick one is to use the "restart interpreter" functionality, which is misnamed, since it merely stops the interpreter, and with it the Spark job in YARN.
The elegant one is to configure Zeppelin to use dynamic allocation with Spark. In that case the YARN application master will continue running, and with it the Spark driver, but all executors (which are the real resource hogs) can be freed by YARN when they are not in use.
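As a rough sketch, dynamic allocation is switched on through standard Spark properties, set either in Zeppelin's Spark interpreter settings or in spark-defaults.conf; the minimum and timeout below are only example values:

# let YARN grow and shrink the executor set
spark.dynamicAllocation.enabled true
# required by dynamic allocation; needs the external shuffle service on the nodes
spark.shuffle.service.enabled true
# allow the application to drop to zero executors when idle
spark.dynamicAllocation.minExecutors 0
# release an executor after it has been idle this long
spark.dynamicAllocation.executorIdleTimeout 60s

With these in place the idle executors are handed back to YARN after the timeout, while the application master and the driver keep running.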

The easiest and most straightforward solution is to restart the Spark interpreter.
But as Rick mentioned, if you use Spark dynamic allocation, an additional step of enabling the Spark external shuffle service on all agent nodes is required (it is disabled by default).
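For reference, on a plain YARN setup the shuffle service is enabled on every NodeManager with something like the following in yarn-site.xml (the exact procedure depends on your distribution, and the Spark YARN shuffle jar also has to be on the NodeManager classpath):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

The NodeManagers have to be restarted afterwards.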

Just close your Spark context so that the Spark job gets the FINISHED status.
Your memory should then be released.
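In a Zeppelin note this is a one-line paragraph, assuming the interpreter exposes the SparkContext as sc (the default):

// stopping the context ends the YARN application and releases its containers
sc.stop()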

Related

What's the most elegant/right way to stop a spark job running on a Kubernetes cluster?

I'm new to Apache Spark and I'm trying to run a Spark job using spark-submit on my Kubernetes cluster. I was wondering if there's a right way to stop Spark jobs once the driver and executor pods are spawned? Would deleting the pods themselves be enough?
Thanks!
If you delete an executor pod, it will be recreated and the Spark application will keep running. However, if you delete the driver pod, it will stop the application.
So killing the driver pod is actually the way to stop a Spark application during execution.
As you are new to Spark and you want to run it on Kubernetes, you should check this tutorial.
At present the only way to stop a Spark job running on Kubernetes is to delete the driver pod (unless you have an app controlling the Spark context which is able to manipulate it). Since all other job-related resources are linked to the Spark driver pod via so-called ownerReferences, they will be removed automatically by Kubernetes.
It should clean things up automatically when the job completes.
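As a sketch, assuming the job was submitted through Spark's native Kubernetes support (which labels the driver pod with spark-role=driver), finding and deleting the driver could look like this; the pod name is just a placeholder:

# list driver pods created by spark-submit
kubectl get pods -l spark-role=driver
# deleting the driver also removes the executors via ownerReferences
kubectl delete pod <driver-pod-name>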

Apache Spark running on YARN with fix allocation

What's happening right now is that YARN simply takes a number of executors from one Spark job and gives them to another Spark job. As a result, the first Spark job encounters an error and dies.
Is there a way, or an existing configuration, to give a certain Spark job running on YARN a fixed resource allocation?
Fixed resource allocation is an old concept and doesn't give the benefit of proper resource utilization. Dynamic resource allocation is an advanced/expected feature of YARN, so I recommend that you look at what is actually happening. If a job is already running, YARN doesn't take its resources away and give them to others. If resources are not available, the 2nd job will be queued; resources will not be pulled abruptly from the 1st job. The reason is that containers are a combination of memory and CPU: if the memory were allocated to another job, it would basically mean that the JVM of the 1st job is lost forever. YARN doesn't do what you have described.
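That said, if you do want a fixed footprint for a particular job, the usual approach is to disable dynamic allocation and pin the executor count at submit time. A sketch (the jar name and the numbers are only placeholders):

spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_job.jar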

Submitting Spark Job On Scheduler Pool

I am running a Spark Streaming job in cluster mode; I have created a pool with 200 GB of memory (CDH). I wanted to run my Spark Streaming job in that pool, so I tried setting
sc.setLocalProperty("spark.scheduler.pool", "pool")
in code, but it's not working. I also tried the spark.scheduler.pool property, but it does not seem to work in Spark Streaming: whenever I run the job it goes to the default pool. What could be the possible issue? Is there any configuration I can add while submitting the job?
In YARN we can add
--conf spark.yarn.queue="que_name"
to the spark-submit command. Then it will use that particular queue and its resources only.
I ran into this same issue with Spark 2.4. In my case, the problem was resolved by removing the default "spark.scheduler.pool" option in my Spark config.
I traced the issue to a bug in Spark - https://issues.apache.org/jira/browse/SPARK-26988. The problem is that if you set the config property "spark.scheduler.pool" in the base configuration, you can't then override it using setLocalProperty. Removing it from the base configuration made it work correctly. See the bug description for more detail.
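To illustrate the working setup: the sketch below assumes fair scheduling is enabled (spark.scheduler.mode=FAIR), that a pool named "pool" is defined in your fairscheduler.xml, and that spark.scheduler.pool is not set in the base configuration:

// set the pool as a thread-local property before triggering the streaming job's actions
sc.setLocalProperty("spark.scheduler.pool", "pool")

// ... define and start the streaming computation from this thread ...

// passing null resets jobs submitted from this thread to the default pool
sc.setLocalProperty("spark.scheduler.pool", null)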

Spark on Yarn: How to prevent multiple spark jobs being scheduled

With Spark on YARN I don't see a way to prevent concurrent jobs from being scheduled. My architecture is set up for purely batch processing.
I need this for the following reasons:
Resource constraints.
The UserCache for Spark grows really quickly; having multiple jobs running causes an explosion of cache space.
Ideally I'd love to know if there is a config that would ensure only one job runs on YARN at any time.
You can create a queue which can host only one application master and run all Spark jobs in that queue. Thus, if a Spark job is running, the others will be accepted but they won't be scheduled and run until the current execution has finished...
Finally found the solution - it was in the YARN documentation: yarn.scheduler.capacity.maximum-applications has to be set to 1 instead of 10000.
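As a sketch in capacity-scheduler.xml, the limit can also be applied per queue instead of globally; "root.batch" below is only a hypothetical queue name:

<!-- cap how many applications this queue accepts (pending + running) at a time -->
<property>
  <name>yarn.scheduler.capacity.root.batch.maximum-applications</name>
  <value>1</value>
</property>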

Spark job scheduler without YARN/MESOS

I want to schedule some Spark jobs at specified time intervals. Every scheduler that I found works only with YARN/Mesos (e.g. Oozie, Luigi, Azkaban, Airflow). I'm running DataStax and it doesn't have the option of running with YARN or Mesos. I saw somewhere that maybe Oozie can work with DataStax but couldn't find any help for that. Is there any solution to this problem, or is the only option to write a scheduler myself?
