Configure (no) retries for Spark job in Livy submit - apache-spark

When submitting a long-running Spark batch job through Livy, the job is defaulting to five retries, which takes forever to finally fail. How can I change this so the job fails immediately?
My environment is Spark 1.6, running on Azure HDInsight (HDP).
Thanks!

This is a YARN configuration, not a Livy one.
Go to YARN's configuration page under "Advanced yarn-site" and change "yarn.resourcemanager.am.max-attempts" from 5 to 1 if you want no retries.
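If you don't want to change the cluster-wide YARN maximum, Spark on YARN also honors spark.yarn.maxAppAttempts, which you can pass per job in the Livy batch request (it must not exceed the YARN maximum). A minimal sketch, where the Livy URL, JAR path, and class name are placeholders for your environment:

```python
# Sketch: submit a Livy batch with a per-job attempt cap, so only this
# application fails immediately instead of retrying five times.
import json

LIVY_URL = "http://<cluster>:8998/batches"  # placeholder endpoint

payload = {
    "file": "wasb:///example/jars/my-job.jar",  # placeholder JAR
    "className": "com.example.MyJob",           # placeholder class
    "conf": {
        # Cap YARN application attempts for this job only; must not
        # exceed yarn.resourcemanager.am.max-attempts on the cluster.
        "spark.yarn.maxAppAttempts": "1",
    },
}

if __name__ == "__main__":
    import requests  # assumes the requests library is available
    resp = requests.post(LIVY_URL, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    print(resp.json())
```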

Related

Job inside spark application is complete but I still see the status as running, why?

I am running a Spark application that has completed all its jobs, but the status of the job in the YARN cluster portal is still RUNNING (for more than 30 minutes). Please let me know why this is happening.
Spark UI showing my jobs are completed
Spark application status is still running
I had the same problem with Spark 2.4.8 running on K8S. I didn't understand why, but I solved it by stopping the context manually:
spark.sparkContext.stop()
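A slightly more defensive version of the same idea (a generic sketch, not specific to K8S or YARN) is to stop the context in a finally block, so the cluster manager sees the application finish even when the job raises:

```python
# Sketch: run a Spark job and always stop the context afterwards, so the
# cluster manager marks the application FINISHED instead of leaving it
# RUNNING. `spark` is any object exposing stop(), e.g. a SparkSession.
def run_with_cleanup(spark, job):
    try:
        return job(spark)
    finally:
        spark.stop()  # releases the application's resources
```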

Spark job submission using Airflow by submitting batch POST method on Livy and tracking job

I want to use Airflow for orchestration of jobs that includes running some pig scripts, shell scripts and spark jobs.
Mainly for the Spark jobs, I want to use Apache Livy, but I am not sure whether that is a good idea or whether I should run spark-submit directly.
What is the best way to track the Spark job from Airflow once I have submitted it?
My assumption is you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for remote spark-submit when evaluated against the other possibilities:
Specifying remote master IP: Requires modifying global configurations / environment variables
Using SSHOperator: SSH connection might break
Using EmrAddStepsOperator: Dependent on EMR
Regarding tracking
Livy only reports state and not progress (% completion of stages)
If you're OK with that, you can just poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the web UI (View Logs)
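A minimal polling sketch (the Livy host and batch id are placeholders, and the exact terminal-state set is an assumption to check against your Livy version's REST docs):

```python
# Sketch: poll a Livy batch until it reaches a terminal state, printing
# each state so it shows up in the Airflow task logs.
import time

TERMINAL_STATES = {"success", "dead", "killed", "error"}  # assumed set

def is_terminal(state):
    return state.lower() in TERMINAL_STATES

def poll_batch(livy_url, batch_id, interval=30):
    import requests  # assumes the requests library is available
    while True:
        state = requests.get(f"{livy_url}/batches/{batch_id}/state").json()["state"]
        print(f"batch {batch_id}: {state}")
        if is_terminal(state):
            return state
        time.sleep(interval)
```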
Other considerations
Livy doesn't support reusing a SparkSession across POST/batches requests
If that's imperative, you'll have to write your application code in PySpark and use POST/session requests
References
How to submit Spark jobs to EMR cluster from Airflow?
livy/examples/pi_app
rssanders3/livy_spark_operator_python_example
Remote spark-submit to YARN running on EMR

Spark 2.3.1 custom non-Ambari installation on HDP performance issue

I have a custom, non-Ambari installation of Spark 2.3.1 on HDP 2.6.2 running on a cluster. I have made all the necessary configuration changes per the Spark and non-Ambari installation guides.
Now when I submit a Spark job in YARN cluster mode, I see a huge gap of 10-12 minutes between jobs, and I do not see any errors or operations being performed in between. The attached screenshot shows a delay of close to 10 minutes between jobs, which leads to unnecessary delay in completing the Spark job.
Spark 2.3.1 job submitted in Yarn Cluster mode
I have checked the YARN logs and the Spark UI, and I do not see any errors or operations logged with timestamps between the jobs.
Looking through the event timeline, I see the 10+ minute gap between the jobs.
Event timeline gap between the jobs
Any pointers on how to diagnose and fix this issue, and improve the performance of the job, would be appreciated.
Regards,
Vish

Zeppelin persists job in YARN

When I run a Spark job from Zeppelin, the job finishes successfully, but it stays in the RUNNING state in YARN.
The problem is that the job keeps holding resources in YARN. I think Zeppelin is keeping the job alive in YARN.
How can I resolve this problem?
Thank you
There are two solutions.
The quick one is to use the "restart interpreter" functionality, which is misnamed, since it merely stops the interpreter (in this case, the Spark job in YARN).
The elegant one is to configure Zeppelin to use dynamic allocation with Spark. In that case the YARN application master will continue running, and with it the Spark driver, but all executors (which are the real resource hogs) can be freed by YARN when they are not in use.
The easiest and most straightforward solution is to restart the Spark interpreter.
But as Rick mentioned, if you use Spark dynamic allocation, an additional step of enabling the Spark shuffle service on all agent nodes is required (it is disabled by default).
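For reference, the dynamic allocation route roughly amounts to settings like the following in the Zeppelin Spark interpreter configuration (a sketch: these are standard Spark property names, but verify them against your Spark version, and remember the shuffle service must also be enabled on the YARN NodeManagers):

```
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         0
spark.dynamicAllocation.executorIdleTimeout  60s
```

With minExecutors at 0 and a short idle timeout, an idle Zeppelin notebook releases all executors back to YARN while the application master and driver stay up.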
Just close your spark context so that the spark job will get the status FINISHED.
Your memory should be released.

How to make a Spark streaming job run perpetually on HDInsight (YARN)?

I am developing a Spark application running on an HDInsight cluster (YARN-based) with IntelliJ. Currently, I submit jobs through the Azure HDInsight plug-in directly from IntelliJ. This, in turn, uses the Livy API to submit the job remotely.
When I am done with developing the code, I would like the streaming job to be run perpetually. Currently, if the job fails five times, the program stops and doesn't restart itself. Is there any way to change this behavior? Or what solution do most people use to make spark restart after failing?
Restarts of Spark jobs on YARN are controlled by YARN settings, so you need to increase the number of restarts allowed for the Spark application (the YARN application master). I believe the setting is yarn.resourcemanager.am.max-attempts.
In HDInsight, go to the Ambari UI and change this setting under Yarn -> Configs -> Advanced yarn-site.
To submit production jobs, you can use the Livy APIs directly, as described here: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-eventhub-streaming#run-the-application-remotely-on-a-spark-cluster-using-livy
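If you go the Livy route, you can also pass the attempt settings per job. A sketch (the conf keys are standard Spark-on-YARN properties; spark.yarn.am.attemptFailuresValidityInterval, available since Spark 1.6, makes old failures expire over time, which is usually what you want for a perpetual streaming job; the URL, JAR, and class are placeholders):

```python
# Sketch: Livy batch submission for a long-running streaming job. With a
# failure-validity interval, only recent failures count toward the
# attempt cap, so the job is not permanently killed after N failures
# spread over its whole lifetime.
import json

LIVY_URL = "http://<cluster>:8998/batches"  # placeholder endpoint

payload = {
    "file": "wasb:///example/jars/streaming-job.jar",  # placeholder
    "className": "com.example.StreamingJob",           # placeholder
    "conf": {
        "spark.yarn.maxAppAttempts": "4",
        # Only failures within this window count toward maxAppAttempts.
        "spark.yarn.am.attemptFailuresValidityInterval": "1h",
    },
}

if __name__ == "__main__":
    import requests  # assumes the requests library is available
    print(requests.post(LIVY_URL, json=payload).json())
```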
