Once the cluster is created, is there any way to modify properties such as the Spark config under the configuration tab?
I am wondering: if I change the settings in spark-env.sh on the worker nodes, will they take effect for the executors on those nodes? I know SPARK_WORKER_INSTANCES is applied for each executor (I can see that in the web UI), but what about the rest?
I have set the following on one worker node, but I am not able to verify whether these settings are actually being applied on that node. I am using Spark standalone in cluster mode.
SPARK_EXECUTOR_CORES=4
SPARK_EXECUTOR_MEMORY=6G
Is there any way of verifying the settings for each executor node? If not, is there a way I can apply different settings to different executors?
Thanks and Regards,
Sudip
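One way to check what an application actually picked up is to read the runtime configuration from the driver; the same values appear on the application's Environment tab in the web UI. A minimal PySpark sketch (not from the original thread), assuming the settings surface as spark.executor.cores and spark.executor.memory:

from pyspark.sql import SparkSession

# Read back the effective executor settings for this application; keys that
# were never set fall through to the "not set" default.
spark = SparkSession.builder.appName("conf-check").getOrCreate()
for key in ("spark.executor.cores", "spark.executor.memory"):
    print(key, "=", spark.conf.get(key, "not set"))

This only shows what the application received, not what each individual worker's spark-env.sh contains; per-machine differences still have to be checked in each node's conf/spark-env.sh.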
I am trying to execute a Spark jar on Dataproc using Airflow's DataProcSparkOperator. The jar is located on GCS, and I am creating the Dataproc cluster on the fly and then executing this jar on the newly created cluster.
I am able to execute this with Airflow's DataProcSparkOperator using the default settings, but I am not able to configure Spark job properties (e.g. --master, --deploy-mode, --driver-memory, etc.).
I didn't get any help from the Airflow documentation, and the things I tried didn't work out.
Help is appreciated.
To configure the Spark job through DataProcSparkOperator you need to use the dataproc_spark_properties parameter.
For example, you can set deployMode like this:
DataProcSparkOperator(
    dataproc_spark_properties={'spark.submit.deployMode': 'cluster'})
In this answer you can find more details.
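For context, here is a fuller sketch of how that parameter might sit in a DAG; this assumes the older airflow.contrib import path, and the bucket, jar, and cluster names are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

dag = DAG('dataproc_spark_example', start_date=datetime(2019, 1, 1), schedule_interval=None)

submit_spark_job = DataProcSparkOperator(
    task_id='submit_spark_job',
    main_jar='gs://your-bucket/your-job.jar',  # jar already uploaded to GCS
    cluster_name='your-cluster',               # cluster created earlier in the DAG
    dataproc_spark_properties={
        'spark.submit.deployMode': 'cluster',  # the property from the answer above
        'spark.driver.memory': '4g',           # other spark.* properties go here too
    },
    dag=dag)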
I am trying to update or add new fields to the Hive config on a Dataproc cluster using the --properties flag. I am running the Dataproc cluster creation command from Cloud Shell. What I am seeing is that Dataproc adds a new final key to the property, and I cannot find out what it means:
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
  <final>false</final>
  <source>Dataproc Cluster Properties</source>
</property>
Also, when does Dataproc apply these changes to hive-site.xml: after the Hive service starts running on the cluster, or before?
Also, I am unable to find any documentation on how to restart Hive and Spark after making changes to the Hive config once the cluster has been created.
1) If a property is marked final, it cannot be overridden by users on a per-job basis (e.g. using command-line parameters or setting properties in SparkConf/Configuration). We have explicitly made cluster-wide properties overridable. https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html
2) Dataproc applies --properties to the XML config files before starting any services.
3) If you manually change properties, you can restart the relevant services by ssh'ing into the master node of the cluster and running sudo systemctl restart <service>. For Hive, that's hive-metastore and hive-server2. For Spark, that's spark-history-server. Several initialization actions do this.
4) Consider deleting and recreating your cluster if you want to change properties -- that's a bit easier than figuring out what services to restart.
5) Remember that you can still set configs on a per-job basis. If you're using gcloud, that's something like gcloud dataproc jobs submit spark --properties spark.executor.cores=4 ...other args..., with spark-submit you can use --conf, and with Hive, you can use set prop=value.
Apologies in advance, as I am new to Spark. I have created a Spark cluster in standalone mode with 4 workers, and after successfully configuring the worker properties, I want to know how to configure the master properties.
I am writing an application and connecting it to the cluster using SparkSession.builder (I do not want to submit it using spark-submit.)
I know that the workers can be configured in the conf/spark-env.sh file, which has parameters that can be set such as 'SPARK_WORKER_MEMORY' and 'SPARK_WORKER_CORES'.
My question is: how do I configure the properties for the master? There is no 'SPARK_MASTER_CORES' or 'SPARK_MASTER_MEMORY' in this file.
I thought about setting this in the spark-defaults.conf file; however, it seems that this is only used by spark-submit.
I thought about setting it in the application using SparkConf().set("spark.driver.cores", "XX"); however, this only specifies the number of cores for this application to use.
Any help would be greatly appreciated.
Thanks.
There are three ways of setting the configuration of the Spark master node (driver) and the Spark worker nodes. I will show examples of setting the memory of the master node; other settings can be found here.
1- Programmatically, through the SparkConf class (see also the fuller sketch after this list).
Example:
new SparkConf().set("spark.driver.memory","8g")
2- Using spark-submit: make sure not to set the same configuration both in your code (programmatically, as in 1) and when doing spark-submit. If you have already configured a setting programmatically, any job configuration passed to spark-submit that overlaps with (1) will be ignored.
Example:
spark-submit --driver-memory 8g
3- Through spark-defaults.conf:
If none of the above is set, these settings will be used as the defaults.
Example:
spark.driver.memory 8g
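Since the question connects to the cluster with SparkSession.builder rather than spark-submit, here is a minimal PySpark sketch of option 1; the master URL, app name, and values are placeholders. Note that, per the Spark docs, spark.driver.memory cannot be changed through SparkConf once the driver JVM is already running, so for a self-launched application it is more reliable to set it in spark-defaults.conf or on the JVM launch command.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://master-host:7077")     # standalone master URL (placeholder)
    .appName("standalone-config-example")
    .config("spark.executor.memory", "6g")  # per-executor memory for this application
    .config("spark.executor.cores", "4")    # cores per executor for this application
    .config("spark.cores.max", "8")         # cap on total cores this application may take
    .getOrCreate()
)

# Verify what the application actually got.
print(spark.sparkContext.getConf().get("spark.executor.memory"))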
I modified the configuration on the driver of the Spark cluster, namely both spark-defaults.conf and spark-env.sh. Do we need to do the same on the workers? It seems not, but I am not sure.
Spark Properties (spark-defaults.conf):
No. Properties are application-specific, not cluster-wide, so they only have to be set in your Spark directory.
Environment variables:
Yes, if you need custom settings. Environment variables are machine-specific and do not depend on the application.