Which configuration options are preferred in Spark?

I wanted to ask which configuration option is given priority in Spark. Is it the configuration file, or the options we specify manually when running spark-submit?
What if I have one value for executor memory in my configuration file and I specify a different value while running spark-submit?

The Spark (1.5.0) configuration page clearly states what the priorities are:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
So this is the priority order (from highest to lowest):
Properties set on the SparkConf (in program).
Flags passed to spark-submit or spark-shell.
Options set in the spark-defaults.conf file.
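As a quick illustration (a hypothetical sketch, not from the original answer): suppose spark-defaults.conf sets spark.executor.memory to 1g and the job is launched with --executor-memory 2g; the value set on the SparkConf in the program still wins.

import org.apache.spark.{SparkConf, SparkContext}

// Assume spark-defaults.conf contains: spark.executor.memory 1g
// and the job is submitted with:       --executor-memory 2g
val conf = new SparkConf()
  .setAppName("precedence-demo")          // hypothetical app name
  .set("spark.executor.memory", "4g")     // highest precedence: set on SparkConf
val sc = new SparkContext(conf)

println(sc.getConf.get("spark.executor.memory"))   // prints 4g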

List of spark-submit options

There are a ton of tunable settings mentioned on the Spark configuration page. However, as noted here, the SparkSubmitOptionParser attribute name for a Spark property can be different from that property's name.
For instance, spark.executor.cores is passed as --executor-cores in spark-submit.
Where can I find an exhaustive list of all tuning parameters of Spark (along with their SparkSubmitOptionParser attribute names) that can be passed with the spark-submit command?
While @suj1th's valuable inputs did solve my problem, I'm answering my own question to directly address my query.
You need not look up the SparkSubmitOptionParser attribute name for a given Spark property (configuration setting); either will work just fine. However, do note that there's a subtle difference in their usage, as shown below:
spark-submit --executor-cores 2
spark-submit --conf spark.executor.cores=2
Both commands shown above have the same effect. The second method takes configurations in the format --conf <key>=<value>.
Enclosing values in quotes (correct me if this is incorrect / incomplete)
(i) Values need not be enclosed in quotes of any kind (single '' or double ""), though you still can if you want.
(ii) If the value has a space character, enclose the entire thing in double quotes "" like "<key>=<value>" as shown here.
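For instance, a value containing spaces (the JVM flags below are purely illustrative) would be passed like this:
spark-submit --conf "spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" ...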
For a comprehensive list of all configurations that can be passed with spark-submit, just run spark-submit --help.
In this link provided by @suj1th, they say that:
configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.
The following two links from the Spark docs list a lot of configurations:
Spark Configuration
Running Spark on YARN
In your case, you should actually load your configurations from a file, as mentioned in this document, instead of passing them as flags to spark-submit. This removes the overhead of mapping SparkSubmitArguments to Spark configuration parameters. To quote from the above document:
Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
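A minimal sketch of that approach (the file name, class, and values are hypothetical):
# my-job.conf
spark.master              yarn-client
spark.executor.cores      2
spark.executor.memory     2g

spark-submit --properties-file my-job.conf --class com.example.MyApp my-app.jar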

Apache Spark: setting executor instances

I run my Spark application on YARN with parameters:
in spark-defaults.conf:
spark.master yarn-client
spark.driver.cores 1
spark.driver.memory 1g
spark.executor.instances 6
spark.executor.memory 1g
in yarn-site.xml:
yarn.nodemanager.resource.memory-mb 10240
All other parameters are set to default.
I have a 6-node cluster and the Spark Client component is installed on each node.
Every time I run the application, only 2 executors and 1 driver are visible in the Spark UI. The executors appear on different nodes.
Why can't Spark create more executors? Why are only 2 instead of 6?
I found a very similar question, Apache Spark: setting executor instances does not change the executors, but increasing the memory-mb parameter didn't help in my case.
The configuration looks OK at first glance.
Make sure that you have modified the proper spark-defaults.conf file.
Execute echo $SPARK_HOME for the current user and verify that the modified spark-defaults file is in the $SPARK_HOME/conf/ directory; otherwise Spark cannot see your changes.
It turned out I had modified the wrong spark-defaults.conf file. There were two users on my system, and each user had a different $SPARK_HOME directory set (I didn't know that before). That's why I couldn't see any effect of my settings for one of the users.
You can run spark-shell or spark-submit with the argument --num-executors 6 (if you want to have 6 executors). If Spark then creates more executors than before, you will know that it's not a memory issue but rather that your configuration isn't being read.
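For example (the class and jar names are placeholders):
spark-submit --num-executors 6 --class com.example.MyApp my-app.jar
On YARN, --num-executors is the command-line equivalent of spark.executor.instances.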

Spark configuration priority

Is there any difference or priority between specifying Spark application configuration in the code:
SparkConf().setMaster("yarn")
and specifying them in command line
spark-submit --master yarn
Yes, the highest priority is given to configuration set in the user's code with the set() function. After that come the flags passed to spark-submit.
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source
There are 4 precedence levels (1 to 4, 1 being the highest priority):
SparkConf set in the application
Properties given with the spark-submit
Properties given in a property file, which can be passed as an argument at submission time
Default values
Other than the priority, specifying the master on the command line allows you to run on different cluster managers without modifying code. The same application can be run on local[n], yarn, mesos, or a Spark standalone cluster.
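A small sketch of that pattern (the app and class names are assumptions, not from the original answer): leave the master out of the code and supply it at submit time.

import org.apache.spark.{SparkConf, SparkContext}

// No setMaster(...) here, so the same jar can run under any cluster manager.
val conf = new SparkConf().setAppName("portable-app")
val sc = new SparkContext(conf)

Then pick the cluster manager when submitting:
spark-submit --master local[4] --class com.example.PortableApp portable-app.jar
spark-submit --master yarn --class com.example.PortableApp portable-app.jar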

Spark - config file that sets spark.storage.memoryFraction

I have come to learn that spark.storage.memoryFraction and spark.storage.safetyFraction are multiplied by the executor memory supplied in the SparkContext. Also, I have learned that it is desirable to lower the memoryFraction for better performance.
The question is where do I set the spark.storage.memoryFraction? Is there a config file?
The default file in which Spark searches for such configurations is conf/spark-defaults.conf.
If you want to move the conf directory to a custom location, set SPARK_CONF_DIR in conf/spark-env.sh.
I recommend setting this on a per-job basis instead of updating spark-defaults.conf.
You can create a config file per job, say spark.properties, and pass it to spark-submit:
--properties-file /spark.properties
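A possible spark.properties for this question (the values are only illustrative):
spark.storage.memoryFraction   0.4
spark.executor.memory          2g
Note that values set on the SparkConf in code still take precedence over anything in this file.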

Separate logs from Apache spark

I would like to have separate log files for workers, masters, and jobs (executors, submits, not sure what to call them). I tried a configuration in log4j.properties like
log4j.appender.myAppender.File=/some/log/dir/${log4j.myAppender.FileName}
and then passing log4j.myAppender.FileName in SPARK_MASTER_OPTS, SPARK_WORKER_OPTS, spark.executor.extraJavaOptions and spark.driver.extraJavaOptions.
It works perfectly well for workers and masters but fails for executors and drivers. Here is an example of how I use these:
./spark-submit ... --conf "\"spark.executor.extraJavaOptions=log4j.myAppender.FileName=myFileName some.other.option=foo\"" ...
I also tried putting log4j.myAppender.FileName with some default value in spark-defaults.conf, but that doesn't work either.
Is there some way to achieve what I want?
Logging for executors and drivers can be configured in conf/spark-defaults.conf by adding these entries (from my Windows config):
spark.driver.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-driver.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-executor.properties
Note that each entry above references a different log4j.properties file so you can configure them independently.
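A minimal sketch of what log4j-driver.properties might contain (the path and pattern are just examples; log4j-executor.properties would be analogous with a different File):
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/some/log/dir/driver.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n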

Resources