How to make spark streaming job run perpetually on HD Insights (YARN)? - apache-spark

I am developing a spark application running in HD Insights Cluster (YARN based) with IntelliJ. Currently, I submit jobs through the Azure HD Insights plug-in directly from IntelliJ. This, in turns, use the Livy API to submit the job remotely.
When I am done with developing the code, I would like the streaming job to be run perpetually. Currently, if the job fails five times, the program stops and doesn't restart itself. Is there any way to change this behavior? Or what solution do most people use to make spark restart after failing?

Restart of Yarn Spark jobs is controlled by Yarn settings. So you need to increase number of restarts for the spark application (yarn application master) in yarn. I believe it's: yarn.resourcemanager.am.max-attempts.
In HDInsight go to Ambari UI and change this setting in Yarn -> Config -> Advanced Yarn-site.
In order to submit production job you can use livy APIs directly as described here: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-eventhub-streaming#run-the-application-remotely-on-a-spark-cluster-using-livy

Related

Determine where spark program is failing?

Is there anyway to debug a Spark application that is running in a cluster mode? I have a program that has been running successfully for a while, which processes a couple hundred GB at a time. Recently I had some data cause the run to fail due to executors being disconnected. From what I have read, this is likely a memory issue. I'm trying to determine what function/action is causing the memory issue to trigger. I am using Spark on an EMR cluster(which uses YARN), what would be the best way to debug this issue?
For cluster mode you can go to the YARN Resource Manager UI and select the Tracking UI for your specific running application (which points to the spark driver running on the Application Master within the YARN Node Manager) to open up the Spark UI which is the core developer interface for debugging spark apps.
For client mode you can also go to the YARN RM UI like previously mentioned as well as hit the Spark UI via this address => http://[driverHostname]:4040 where driverHostName is the Master Node in EMR and 4040 is the default port (this can be changed).
Additionally you can access submitted and completed spark apps via the Spark History Server via this default address => http://master-public-dns-name:18080/
These are the essential resources with the Spark UI being the main toolkit for your request.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-webui.html

Spark jobs not showing up in Hadoop UI in Google Cloud

I created a cluster in Google Cloud and submitted a spark job. Then I connected to the UI following these instructions: I created an ssh tunnel and used it to open the Hadoop web interface. But the job is not showing up.
Some extra information:
If I connect to the master node of the cluster via ssh and run spark-shell, this "job" does show up in the hadoop web interface.
I'm pretty sure I did this before and I could see my jobs (both running and already finished). I don't know what happened in between for them to stop appearing.
The problem was that I was running my jobs in local mode. My code had a .master("local[*]") that was causing this. After removing it, the jobs showed up in the Hadoop UI as before.

How to make sure Spark master node is using the worker nodes? (Google cluster)

I just created a Google Cloud cluster (1 master and 6 workers) and by default Spark is configured.
I have a pure python code that uses NLTK to build the dependency tree for each line from a text file. When I run this code on the master spark-submit run.py I get the same execution time when I run it using my machine.
How to make sure that the master is using the workers in order to reduce the execution time ?
You can check the spark UI. If its running on top of yarn, please open the yarn UI and click on your application id which will open the spark UI. Check under the executors tab it will have the node ip address also.
could you please share your spark submit config.
Your command 'spark-submit run.py' doesn't seem to send your job to YARN. To do such thing, you need to add the --master parameter. For example, a valid command to execute a job in YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, this execution will be straightforward. Anyway, check this link for another parameter that spark-submit accept.
For a Dataproc cluster (Hadoop Google cluster) you have two options to check the job history including the ones that are running:
By command line from the master: yarn application -list, this option sometimes needs additional configuration. If you have troubles, this link will be useful.
By UI. Dataproc enables you to access the Spark Web UI, it improves monitoring tasks. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create a tunnel and configure your browser to use socks proxy.
Hope the information above help you.

How to start, stop and re-start cluster manually in Bluemix Spark?

What is the simplest way to start and stop Spark clusters manually in Bluemix? I would basically want to run the same things that
sbin/start-all.sh
and
sbin/stop-all.sh
do on a standalone Spark installment.
You can't. The Apache Spark service in Bluemix runs clusters that are shared by many users. No user is allowed to shut down or start these clusters.

Running spark streaming forever on production

I am developing a spark streaming application which basically reads data off kafka and saves it periodically to HDFS.
I am running pyspark on YARN.
My question is more for production purpose. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this spark streaming application (in python) to a client, what would you do in order to keep it running forever? You wouldn't just give this file and say "Run this on the terminal". It's too unprofessional.
What I want to do , is to submit the job to the cluster (or processors in local) and never have to see logs on the console, or use a solution like linux screen to run it in the background (because it seems too unprofessional).
What is the most professional and efficient way to permanently submit a spark-streaming job to the cluster ?
I hope I was unambiguous. Thanks!
You could use spark-jobserver which provides rest interface for uploading your jar and running it . You can find the documentation here spark-jobserver .

Resources