Spark - config file that sets spark.storage.memoryFraction - apache-spark

I have come to learn that spark.storage.memoryFraction and spark.storage.safetyFraction are multiplied by the executor memory supplied to the SparkContext. Also, I have learned that it is desirable to lower the memoryFraction for better performance.
The question is where do I set the spark.storage.memoryFraction? Is there a config file?

The default file Spark searches for such configuration is conf/spark-defaults.conf
If you want to move the conf directory to a custom location, set SPARK_CONF_DIR in conf/spark-env.sh

I recommend keeping it on a per-job basis instead of updating spark-defaults.conf.
You can create a config file per job, say spark.properties, and pass it to spark-submit:
--properties-file /spark.properties
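For example, a per-job properties file might look like the following sketch (the property values here are illustrative, not recommendations):

```properties
# spark.properties -- per-job overrides (values are illustrative)
spark.storage.memoryFraction   0.4
spark.executor.memory          4g
```

You would then submit with something like `spark-submit --properties-file /path/to/spark.properties ...`; anything not set in the file falls back to spark-defaults.conf and then to Spark's built-in defaults.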

Related

Spark Worker /tmp directory

I'm using spark-2.1.1-bin-hadoop-2.7 in standalone mode (cluster of 4 workers, 120g memory, 32 cores total).
Although I set the spark.local.dir conf param to write to /opt, the Spark workers keep writing to the /tmp dir, for example /tmp/spark-e071ae1b-1970-47b2-bfec-19ca66693768
Is there a way to tell the Spark workers not to write to the /tmp dir?
As per the Spark documentation, a few environment variables will override the property spark.local.dir; please check whether any of these environment variables are set.
Quoting from the documentation:
spark.local.dir
Directory to use for "scratch" space in Spark, including map output
files and RDDs that get stored on disk. This should be on a fast,
local disk in your system. It can also be a comma-separated list of
multiple directories on different disks. Note: This will be overridden
by SPARK_LOCAL_DIRS (Standalone), MESOS_SANDBOX (Mesos) or LOCAL_DIRS
(YARN) environment variables set by the cluster manager.
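In standalone mode, then, the worker-side override would be SPARK_LOCAL_DIRS. A minimal sketch for conf/spark-env.sh on each worker (the /opt path mirrors the question; adjust to your layout, and restart the workers afterwards):

```shell
# conf/spark-env.sh -- standalone mode
# Overrides spark.local.dir for scratch space on this machine
export SPARK_LOCAL_DIRS=/opt/spark-scratch
```

If SPARK_LOCAL_DIRS is unset and spark.local.dir still seems ignored, check that the property is actually reaching the workers and not just the driver.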

What is the difference between configuration and environment variables in Spark?

There are some configurations that confuse me, like
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.minExecutors = 3
spark.eventLog.dir=/home/rabindra/etl/logs
SPARK_WORKER_DIR=/home/knoldus/work/sparkdata
Which of these Spark variables go in spark-env.sh and which in spark-defaults.conf?
What configuration can we do in a Spark standalone cluster?
The first three go in spark-defaults.conf. The last goes into spark-env.sh, as shown in this Knoldus example (maybe the one you're using).
I suppose an analogy might be the difference between JVM arguments and environment variables. As shown in the documentation, the configurations you want to apply to a SparkConf, like the application name, the URI of the master, or memory allocation, are on a per-application basis.
Meanwhile, environment variables, whether related to Spark or anything else, apply on a per-machine basis. Of course sometimes the machine-specific settings you would specify with an environment variable belong instead in your resource manager like YARN.
The list of configuration parameters is large. See the documentation linked above for more.
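Concretely, the four settings from the question would be split like this (values and paths copied from the question itself):

```properties
# conf/spark-defaults.conf -- per-application Spark properties
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   3
spark.eventLog.dir                     /home/rabindra/etl/logs
```

```shell
# conf/spark-env.sh -- per-machine environment variables
export SPARK_WORKER_DIR=/home/knoldus/work/sparkdata
```

The rule of thumb: names of the form spark.* are properties (spark-defaults.conf or SparkConf), while ALL_CAPS names are environment variables (spark-env.sh).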

Spark configuration priority

Is there any difference or priority between specifying Spark application configuration in code:
SparkConf().setMaster("yarn")
and specifying it on the command line:
spark-submit --master yarn
Yes, the highest priority is given to the configuration set in the user's code with the set() function. After that come the flags passed to spark-submit.
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source
There are 4 precedence levels (1 to 4, 1 being the highest priority):
1. SparkConf set in the application
2. Properties given with spark-submit
3. Properties given in a property file, which can be passed as an argument at submission time
4. Default values
Other than the priority, specifying the master on the command line allows you to run on different cluster managers without modifying code. The same application can run on local[n], yarn, mesos, or a Spark standalone cluster.
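Leaving the master out of the code, the same artifact can then be pointed at different cluster managers purely from the command line (the jar name and master URL below are hypothetical):

```shell
# Same application jar, different cluster managers -- no code changes
spark-submit --master "local[4]"                myapp.jar   # local, 4 threads
spark-submit --master yarn                      myapp.jar   # YARN
spark-submit --master spark://master-host:7077  myapp.jar   # standalone
```

Note that if the code hard-codes setMaster(...) on the SparkConf, that value wins over the --master flag because of the precedence order above.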

Which configuration options are preferred in spark?

I wanted to enquire which configuration option is given priority in Spark: the configuration file, or the options we manually specify when running the spark-submit shell?
What if I have different options for executor memory in my configuration file and I specify a different value while running the spark-submit shell?
The Spark (1.5.0) configuration page clearly states what the priorities are:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
So this is the priority order (from highest to lowest):
Properties set on the SparkConf (in program).
Flags passed to spark-submit or spark-shell.
Options set in the spark-defaults.conf file.

Separate logs from Apache spark

I would like to have separate log files for workers, masters, and jobs (executors, submits; I'm not sure what to call them). I tried a configuration in log4j.properties like
log4j.appender.myAppender.File=/some/log/dir/${log4j.myAppender.FileName}
and then passing log4j.myAppender.FileName in SPARK_MASTER_OPTS, SPARK_WORKER_OPTS, spark.executor.extraJavaOptions and spark.driver.extraJavaOptions.
It works perfectly well for workers and masters but fails for executors and drivers. Here is an example of how I use these:
./spark-submit ... --conf "\"spark.executor.extraJavaOptions=log4j.myAppender.FileName=myFileName some.other.option=foo\"" ...
I also tried putting log4j.myAppender.FileName with some default value in spark-defaults.conf, but that doesn't work either.
Is there some way to achieve what I want?
Logging for executors and drivers can be configured in conf/spark-defaults.conf by adding these entries (from my Windows config):
spark.driver.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-driver.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-executor.properties
Note that each entry above references a different log4j.properties file so you can configure them independently.
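A sketch of what one of those per-role files might contain, using standard log4j 1.x appender syntax (the file name and log path here are assumptions, not taken from the answer):

```properties
# conf/log4j-executor.properties -- logging config applied only to executors
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/some/log/dir/executor.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
```

A log4j-driver.properties file would have the same shape with a different File path, which is what makes the two streams independently configurable.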
