Spark configuration priority - apache-spark

Is there any difference in priority between specifying Spark application configuration in the code:
SparkConf().setMaster("yarn")
and specifying it on the command line:
spark-submit --master yarn

Yes, the highest priority is given to the configuration set in the user's code with the set() function. After that come the flags passed to spark-submit.
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source: the Spark configuration documentation.

There are 4 precedence levels (1 to 4, 1 being the highest priority):
SparkConf set in the application
Properties passed with spark-submit
Properties given in a property file, which can be passed as an argument at submission time
Default values

Priority aside, specifying the master on the command line allows you to run the same code on different cluster managers without modification: the same application can be run on local[n], yarn, mesos, or a Spark standalone cluster.
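
As a minimal sketch of that precedence (shown in PySpark; the property names are the same in Scala), a master hard-coded via SparkConf wins over whatever --master was passed on the command line:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Hard-coding the master here takes precedence over
# "spark-submit --master yarn ..." for this application.
conf = SparkConf().setMaster("local[2]").setAppName("precedence-demo")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Prints local[2] even if the job was submitted with --master yarn.
print(spark.sparkContext.master)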

Related

What is the difference between SparkSession.config() and spark.conf.set()?

I tried to use both ways to set spark.dynamicAllocation.minExecutors, but it seems that only the first way works:
spark2 = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.dynamicAllocation.minExecutors", 15) \
    .getOrCreate()
vs.
spark2.conf.set("spark.dynamicAllocation.minExecutors", 15)
It is not so much about the difference between the methods as about the difference in the context in which they are executed.
pyspark.sql.session.SparkSession.Builder options can be set before the Spark application has been started. This means that, if there is no active SparkSession to be retrieved, some cluster-specific options can still be set.
If the session was already initialized, setting new config options might not work. See, for example, Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI.
pyspark.sql.conf.RuntimeConfig can be retrieved only from an existing session, therefore its set method is called once the cluster is running. At this point the majority of cluster-specific options are frozen and cannot be modified.
In general, RuntimeConfig.set is used to modify spark.sql.* configuration parameters, which can normally be changed at runtime.
Please note that, depending on the deployment mode, some options (most notably spark.*.extraJavaOptions) cannot be set using either of these methods and can be modified only through spark-submit arguments or configuration files.
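A short PySpark sketch of the two contexts (spark.sql.shuffle.partitions is used here only as an example of a runtime-changeable option):

from pyspark.sql import SparkSession

# Builder options are applied before the session exists, so cluster-level
# settings such as dynamic allocation belong here.
spark = SparkSession.builder \
    .appName("config-context-demo") \
    .config("spark.dynamicAllocation.minExecutors", "15") \
    .getOrCreate()

# RuntimeConfig is only available on an existing session; at that point most
# cluster-level options are frozen, but spark.sql.* options can usually still
# be changed.
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 64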
I think what you rather wanted to ask is why certain configurations (e.g. spark.dynamicAllocation.minExecutors) cannot be set using spark2.conf.set, as opposed to SparkSession.config.
spark.dynamicAllocation.minExecutors controls how Spark jobs are executed, most importantly the number of executors, and as such should not be set within a Spark application. I'm even surprised to hear that it worked at all. It really should not, IMHO.
The reason why this and some other configurations should not be set within a Spark application is that they control the execution environment for the underlying Spark runtime (which works behind the scenes of Spark SQL) and as such should be changed using spark-submit, which is more for application deployers or admins than for developers themselves. Whether dynamic allocation (of executors) is used or not has no impact on the business use of Spark and is a decision to be made after the application is developed.
With that said, let me answer your question directly: some configurations have to be set before a SparkSession instance is created, because they control how that instance is going to be instantiated. Once you have created the instance, by the time you call spark2.conf the instance is already configured, and some configurations can never be changed. It seems that spark.dynamicAllocation.minExecutors is among the configurations that cannot be changed after an instance of SparkSession has been created. Given what I said earlier, I'm happy to hear that this is the case (but unfortunately not in all cases).
Some config properties need to be set before the SparkSession starts in order to take effect, because the SparkSession uses them at initialization time. If you set spark.dynamicAllocation.minExecutors after the SparkSession has been created, the value for that property will still change in the SparkConf object, and you can verify that by printing the property, but it does not affect the session, which took the value that was present at initialization time.
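A rough sketch of that verification (property name as in the question; whether the printed value changes can depend on your Spark version and deployment):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("late-set-demo") \
    .config("spark.dynamicAllocation.minExecutors", "3") \
    .getOrCreate()

# Changing the value after the session exists updates the stored property...
spark.conf.set("spark.dynamicAllocation.minExecutors", "15")
print(spark.conf.get("spark.dynamicAllocation.minExecutors"))  # prints 15

# ...but the running session keeps the value it was initialized with, so the
# minimum executor count is not actually changed.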

List of spark-submit options

There are a ton of tunable settings mentioned on the Spark configuration page. However, as noted here, the SparkSubmitOptionParser attribute name for a Spark property can be different from that property's name.
For instance, spark.executor.cores is passed as --executor-cores in spark-submit.
Where can I find an exhaustive list of all tuning parameters of Spark (along with their SparkSubmitOptionParser attribute names) that can be passed with the spark-submit command?
While #suj1th's valuable inputs did solve my problem, I'm answering my own question to directly address my query.
You need not look up the SparkSubmitOptionParser attribute name for a given Spark property (configuration setting); both will do just fine. However, do note that there's a subtle difference in their usage, as shown below:
spark-submit --executor-cores 2
spark-submit --conf spark.executor.cores=2
Both commands shown above have the same effect. The second method takes configurations in the format --conf <key>=<value>.
Enclosing values in quotes (correct me if this is incorrect / incomplete)
(i) Values need not be enclosed in quotes (single '' or double "") of any kind (you still can if you want).
(ii) If the value has a space character, enclose the entire thing in double quotes "", like "<key>=<value>", as shown here.
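For instance (the JVM flags below are purely illustrative), a value without spaces needs no quotes, while a value containing spaces should be wrapped in double quotes:

spark-submit --conf spark.executor.cores=2 ...
spark-submit --conf "spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" ...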
For a comprehensive list of all configurations that can be passed with spark-submit, just run spark-submit --help
In this link provided by #suj1th, they say that:
configuration values explicitly set on a SparkConf take the highest
precedence, then flags passed to spark-submit, then values in the
defaults file.
If you are ever unclear where configuration options are coming from,
you can print out fine-grained debugging information by running
spark-submit with the --verbose option.
Following two links from Spark docs list a lot of configurations:
Spark Configuration
Running Spark on YARN
In your case, you should actually load your configurations from a file, as mentioned in this document, instead of passing them as flags to spark-submit. This relieves the overhead of mapping SparkSubmitArguments to Spark configuration parameters. To quote from the above document:
Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
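For example (the file name, application class, and jar below are placeholders), a properties file with whitespace-separated key-value pairs can replace several flags:

# my-app.conf
spark.master            yarn
spark.executor.cores    2
spark.executor.memory   4g

spark-submit --properties-file my-app.conf --class com.example.Main my-app.jar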

What is the difference between configuration and env variables of Spark?

There are some configurations I am confused about, like:
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.minExecutors = 3
spark.eventLog.dir=/home/rabindra/etl/logs
SPARK_WORKER_DIR=/home/knoldus/work/sparkdata
For which of these Spark variables should I use spark-env.sh, and for which spark-defaults.conf?
What configuration can we do in a Spark standalone cluster?
The first three go in spark-defaults.conf. The last goes into spark-env.sh, as shown in this Knoldus example, maybe the one you're using.
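To make that concrete, a sketch of where the four settings from the question would go (paths taken from the question):

# spark-defaults.conf  (per-application Spark properties, whitespace-separated)
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   3
spark.eventLog.dir                     /home/rabindra/etl/logs

# spark-env.sh  (per-machine environment variables, sourced as a shell script)
SPARK_WORKER_DIR=/home/knoldus/work/sparkdata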
I suppose an analogy might be the difference between JVM arguments and environment variables. As shown in the documentation, the configurations you want to apply to a SparkConf, like the application name, the URI of the master, or memory allocation, are on a per-application basis.
Meanwhile, environment variables, whether related to Spark or anything else, apply on a per-machine basis. Of course sometimes the machine-specific settings you would specify with an environment variable belong instead in your resource manager like YARN.
The list of configuration parameters is large. See the documentation linked above for more.

Standalone Cluster Mode: how does spark allocate spark.executor.cores?

I'm searching for how and where Spark allocates cores per executor in the source code.
Is it possible to programmatically control allocated cores in standalone cluster mode?
Regards,
Matteo
Spark allows for configuration options to be passed through the .set method on the SparkConf class.
Here's some Scala code that sets up a new Spark configuration:
new SparkConf()
  .setAppName("App Name")
  .setMaster("local[2]")
  .set("spark.executor.cores", "2")
Documentation about the different configuration options:
http://spark.apache.org/docs/1.6.1/configuration.html#execution-behavior
I haven't looked through the source code exhaustively, but I think this is the spot in the source code where the executor cores are defined prior to allocation:
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorData.scala
In standalone mode, you have the following options:
a. While starting the cluster, you can specify how many CPU cores to allot for Spark applications. This can be set either as the environment variable SPARK_WORKER_CORES or passed as an argument to the start script (-c or --cores).
b. Care should be taken (if other applications also share resources like cores) not to allow Spark to take all the cores. This can be limited using the spark.cores.max parameter (see the sketch below).
c. You can also pass --total-executor-cores <numCores> to the Spark shell.
For more info, you can look here
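To control this programmatically, here is a rough PySpark equivalent of the Scala snippet above, additionally capping the total cores on a standalone cluster (the master URL is a placeholder):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setAppName("App Name")
        .setMaster("spark://master-host:7077")  # placeholder standalone master URL
        .set("spark.executor.cores", "2")       # cores per executor
        .set("spark.cores.max", "8"))           # cap on total cores for this application

spark = SparkSession.builder.config(conf=conf).getOrCreate()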

Which configuration options are preferred in spark?

I wanted to enquire which configuration option is given priority in Spark: is it the configuration file or the options we manually specify when running spark-submit?
What if I have one value for executor memory in my configuration file and I specify a different value while running spark-submit?
The Spark (1.5.0) configuration page clearly states what the priorities are:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
So this is the priority order (from highest to lowest):
Properties set on the SparkConf (in program).
Flags passed to spark-submit or spark-shell.
Options set in the spark-defaults.conf file.
