Azure Spark: how to configure the application retry limit

I am currently running an Azure HDInsight Spark cluster, and by default, if an application fails, it is retried 5 times. I would like to change this and set it to 2. Do you know how to do this?

You can pass this as a configuration option to spark-submit:
spark-submit ... --conf spark.yarn.maxAppAttempts=2
You can also set an override at the cluster level in Ambari: under Spark -> Configs, add the override to Custom spark-defaults via the Ambari UI.
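If you prefer to set it programmatically, the same property can also go into the session builder, as long as it is in place before the application is submitted to YARN; a minimal PySpark sketch (the app name is hypothetical, and whether this takes effect in cluster deploy mode is worth verifying):

from pyspark.sql import SparkSession

# Sketch: spark.yarn.maxAppAttempts must be set before the SparkContext
# is created, because that is when the YARN application is submitted.
spark = (SparkSession.builder
         .appName("retry-limit-demo")  # hypothetical app name
         .config("spark.yarn.maxAppAttempts", "2")
         .getOrCreate())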

Related

Cluster-wide Spark configuration in standalone mode

We are running a Spark standalone cluster in a Docker environment. How do I set cluster-wide configurations that every application submitted to the cluster uses? As far as I understand it, the local spark-defaults from the host submitting the application get used, even when the deploy mode is cluster. Can that be changed?
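One way to see which defaults a submitted application actually picked up is to dump its effective configuration from inside the job; a minimal PySpark sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Print every configuration entry the application is actually running with,
# which shows whether the submitting host's spark-defaults were applied.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)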

Set Cloudera application tags for Spark application

I have set spark.yarn.tags in my Spark application, and it is visible in my config when printed.
But Cloudera Manager is unable to detect it in the application_tags field of the YARN application.
Does application_tags map to spark.yarn.tags for Spark applications?
I think I found the solution.
When spark.yarn.tags is set while calling spark-submit, Cloudera Manager detects it. So I believe it is something YARN requires before the Spark context is created; hence it has to be passed as a conf while submitting.
This is how it can be passed to spark-submit:
--conf spark.yarn.tags=tag-name
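To double-check that the tag reached the application, you can read it back from the running job's configuration; a small PySpark sketch (the fallback string is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Returns the tags the application was submitted with, or the fallback if unset.
print(spark.sparkContext.getConf().get("spark.yarn.tags", "not set"))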

Dataproc adding extra field when adding keys using --properties?

I am trying to update or add new fields to the Hive config in a Dataproc cluster using the --properties flag. I am running the Dataproc cluster command from Cloud Shell. What I am seeing is that Dataproc adds the new key with a final element, and I am unable to find what it means:
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
  <final>false</final>
  <source>Dataproc Cluster Properties</source>
</property>
Also, when does Dataproc apply these changes to hive-site.xml? After the Hive service starts running on the cluster, or before?
Also, I am unable to find any documentation on how to restart Hive and Spark after making changes to the Hive config after cluster creation.
1) If a property is marked final, it cannot be overridden by users on a per-job basis (e.g. using command-line parameters or by setting properties in SparkConf/Configuration). We have explicitly made cluster-wide properties overridable. https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html
2) Dataproc applies --properties to the XML files before starting any services.
3) If you manually change properties, you can restart the relevant services by SSHing into the master node of the cluster and running sudo systemctl restart <service>. For Hive, those are hive-metastore and hive-server2. For Spark, that's spark-history-server. Several initialization actions do this.
4) Consider deleting and recreating your cluster if you want to change properties; that's a bit easier than figuring out which services to restart.
5) Remember that you can still set configs on a per-job basis. If you're using gcloud, that's something like gcloud dataproc jobs submit spark --properties spark.executor.cores=4 ...other args...; with spark-submit you can use --conf, and with Hive you can use set prop=value. A programmatic variant is sketched below.
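For completeness, the per-job idea from (5) in programmatic form; a minimal PySpark sketch (the app name is hypothetical, and the value 4 is just the example from above):

from pyspark.sql import SparkSession

# Sketch of per-job configuration: these settings apply only to this
# application, provided the cluster-wide properties are not marked final.
spark = (SparkSession.builder
         .appName("per-job-config-demo")  # hypothetical name
         .config("spark.executor.cores", "4")
         .getOrCreate())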

How to make sure Spark master node is using the worker nodes? (Google cluster)

I just created a Google Cloud cluster (1 master and 6 workers), and Spark is configured by default.
I have pure Python code that uses NLTK to build the dependency tree for each line of a text file. When I run this code on the master with spark-submit run.py, I get the same execution time as when I run it on my own machine.
How can I make sure that the master is using the workers in order to reduce the execution time?
You can check the Spark UI. If it's running on top of YARN, open the YARN UI and click on your application ID, which will open the Spark UI. Check under the Executors tab; it will show the node IP addresses as well.
Could you please share your spark-submit config?
Your command spark-submit run.py doesn't seem to send your job to YARN. To do that, you need to add the --master parameter. For example, a valid command to execute a job on YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, this execution will be straightforward. In any case, check this link for the other parameters that spark-submit accepts.
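One caveat worth adding: spark-submit runs a plain Python script as a single local process unless the script itself uses Spark's APIs, so adding --master yarn alone will not spread the NLTK work across the workers. A minimal PySpark sketch of distributing the per-line work (the path and parse function are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nltk-demo").getOrCreate()

def parse(line):
    # Placeholder for the per-line NLTK dependency-tree work.
    return len(line.split())

# Each partition of the file is processed by an executor on a worker node.
lines = spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
print(lines.map(parse).count())
spark.stop()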
For a Dataproc cluster (Google's managed Hadoop cluster) you have two options to check the job history, including jobs that are still running:
By command line from the master: yarn application -list. This option sometimes needs additional configuration; if you have trouble, this link will be useful.
By UI: Dataproc enables you to access the Spark web UI, which improves monitoring. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create a tunnel and configure your browser to use a SOCKS proxy.
Hope the information above helps.

Configuring master node in spark cluster

Apologies in advance, as I am new to Spark. I have created a Spark cluster in standalone mode with 4 workers, and after successfully configuring the worker properties, I wanted to know how to configure the master properties.
I am writing an application and connecting it to the cluster using SparkSession.builder (I do not want to submit it using spark-submit).
I know that the workers can be configured in the conf/spark-env.sh file, which has parameters such as SPARK_WORKER_MEMORY and SPARK_WORKER_CORES.
My question is: how do I configure the properties for the master? There is no SPARK_MASTER_CORES or SPARK_MASTER_MEMORY in this file.
I thought about setting this in the spark-defaults.conf file; however, it seems that this is only used with spark-submit.
I thought about setting it in the application using SparkConf().set("spark.driver.cores", "XX"); however, this only specifies the number of cores for this application to use.
Any help would be greatly appreciated.
Thanks.
There are three ways of setting the configuration of the Spark master node (the driver) and the Spark worker nodes. I will show examples of setting the memory of the master node; other settings can be found here.
1- Programmatically, through the SparkConf class (a PySpark equivalent is sketched after this list).
Example:
new SparkConf().set("spark.driver.memory","8g")
2- Using spark-submit: make sure not to set the same configuration both in your code (programmatically, as in 1) and in spark-submit. If you have already configured a setting programmatically, every overlapping job configuration passed to spark-submit will be ignored.
Example:
spark-submit --driver-memory 8g
3- Through spark-defaults.conf:
If none of the above is set, these settings will be the defaults.
Example:
spark.driver.memory 8g
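Following up on (1), the same programmatic setting in PySpark; a minimal sketch (note that driver settings generally must be in place before the driver JVM starts, which is the case when the session is built from a fresh Python process, as the asker describes):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch: driver properties must be set before the SparkContext (and its
# JVM) exists; setting them after the context is created has no effect.
conf = SparkConf().set("spark.driver.memory", "8g")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.getConf().get("spark.driver.memory"))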
