What is the difference between configuration and environment variables of Spark?

There are some configurations that confuse me, like
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.minExecutors = 3
spark.eventLog.dir=/home/rabindra/etl/logs
SPARK_WORKER_DIR=/home/knoldus/work/sparkdata
Which of these Spark settings go in spark-env.sh and which in spark-defaults.conf?
What configurations can we set in a Spark standalone cluster?

The first three go in spark-defaults.conf. The last goes into spark-env.sh as shown in this Knoldus example--maybe the one you're using.
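To illustrate, a minimal sketch of how the settings from the question could be split between the two files (the values and paths are simply the ones from the question):
# conf/spark-defaults.conf (per-application defaults, read at submit time)
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   3
spark.eventLog.dir                     /home/rabindra/etl/logs
# conf/spark-env.sh (per-machine environment, sourced when Spark daemons start)
export SPARK_WORKER_DIR=/home/knoldus/work/sparkdata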
I suppose an analogy might be the difference between JVM arguments and environment variables. As shown in the documentation, the configurations you want to apply to a SparkConf, like the application name, the URI of the master, or memory allocation, are on a per-application basis.
Meanwhile, environment variables, whether related to Spark or anything else, apply on a per-machine basis. Of course sometimes the machine-specific settings you would specify with an environment variable belong instead in your resource manager like YARN.
The list of configuration parameters is large. See the documentation linked above for more.

Related

Explain the difference between Spark configurations

I have to set the number of executors in my Spark application to 20. Looking at the official documentation, I'm confused about which is the better config to set:
spark.dynamicAllocation.initialExecutors = 20
spark.executor.instances=20
I have the following config enabled
spark.dynamicAllocation.enabled = true
In what use case scenario will I use either?
As per the Spark documentation:
spark.dynamicAllocation.initialExecutors
Initial number of executors to run if dynamic allocation is enabled.
If --num-executors (or spark.executor.instances) is set and larger
than this value, it will be used as the initial number of executors.
As you can see in the highlighted text, it can be overridden by --num-executors when that is set to a higher value than spark.dynamicAllocation.initialExecutors.
Basically, when your application starts it will launch spark.dynamicAllocation.initialExecutors executors and then scale up to spark.dynamicAllocation.maxExecutors as needed, when dynamic allocation is enabled.
spark.executor.instances
number of executors for static allocation.
In layman's terms, it is like saying either:
I want x resources (spark.executor.instances) to finish a job
(OR)
I want a minimum of x resources, a maximum of y resources, and initially z resources to finish a job.
The condition x <= z <= y should always hold, and your actual resource usage is decided by what is needed while your job is running.
When to use dynamic allocation?
When you have multiple streaming applications running on your cluster, or on-demand spark-sql jobs. Most of the time your jobs need few resources and remain almost idle; only during big data stream chunks (peak hours) does a job need more resources to process the data. Otherwise the cluster resources should be freed up and used for other purposes.
Note: make sure to enable the external shuffle service (spark.shuffle.service.enabled=true) when dynamic allocation is enabled.
The purpose of the external shuffle service is to allow executors to be removed without deleting shuffle files written by them (more detail). The way to set up this service varies across cluster managers.
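As a rough sketch, the min/initial/max idea above could be expressed at submit time like this (the numbers and my_app.py are just placeholders):
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.initialExecutors=5 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my_app.py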
References:
https://dzone.com/articles/spark-dynamic-allocation

What is the difference between SparkSession.config() and spark.conf.set()

I tried both ways to set spark.dynamicAllocation.minExecutors, but it seems that only the first way works:
spark2 = SparkSession \
.builder \
.appName("test") \
.config("spark.dynamicAllocation.minExecutors", 15) \
.getOrCreate()
vs.
spark2.conf.set("spark.dynamicAllocation.minExecutors", 15)
It is not so much about the difference between the methods, as the difference in the context in which these are executed.
pyspark.sql.session.SparkSession.Builder options can be executed before the Spark application has been started. This means that, if there is no active SparkSession to be retrieved, some cluster-specific options can still be set.
If the session was already initialized setting new config options might not work. See for example Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI
pyspark.sql.conf.RuntimeConfig can be retrieved only from an existing session, therefore its set method is called once the cluster is running. At this point the majority of cluster-specific options are frozen and cannot be modified.
In general RuntimeConfig.set is used to modify spark.sql.* configuration parameters, which normally can be changed at runtime.
Please note, that depending on the deployment mode, some options (most notably spark.*.extraJavaOptions) cannot be set using any of these methods, and can be modified only through spark-submit arguments or using configuration files.
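To make the distinction concrete, a minimal PySpark sketch (the app name and the value 400 are arbitrary placeholders):
from pyspark.sql import SparkSession

# Cluster-level options: set on the builder, before the session exists.
spark = SparkSession.builder \
    .appName("config-demo") \
    .config("spark.dynamicAllocation.minExecutors", "15") \
    .getOrCreate()

# Runtime options: spark.sql.* parameters can normally be changed on a live session.
spark.conf.set("spark.sql.shuffle.partitions", "400")
print(spark.conf.get("spark.sql.shuffle.partitions"))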
I think what you rather wanted to ask is why certain configurations (e.g. spark.dynamicAllocation.minExecutors) cannot be set using spark2.conf.set, only with SparkSession.config?
spark.dynamicAllocation.minExecutors controls how Spark jobs are executed, most importantly the number of executors, and as such should not be set within a Spark application. I'm even surprised to hear that it worked at all. It really should not, IMHO.
The reason why this and some other configurations should not be set within a Spark application is that they control the execution environment for the underlying Spark runtime (which works behind the scenes of Spark SQL) and as such should be changed using spark-submit, which is more for application deployers or admins than for developers themselves. Whether dynamic allocation (of executors) is used or not has no impact on the business use of Spark and is a decision to be made after the application is developed.
With that said, let me answer your question directly: some configurations have to be set before a SparkSession instance is created, because they control how that instance is going to be instantiated. Once you have created the instance, by the time you call spark2.conf the instance is already configured and some configurations can never be changed. It seems that spark.dynamicAllocation.minExecutors is among the configurations that cannot be changed after an instance of SparkSession has been created. And given what I said earlier, I'm happy to hear that this is the case (but unfortunately not in all cases).
Some config properties need to be set before the SparkSession starts for them to take effect; the SparkSession uses them at the time of initialization. If you set spark.dynamicAllocation.minExecutors after the creation of the SparkSession, the value of that property will still change in the SparkConf object, and you can verify that by printing the property, but it does not affect the SparkSession, which took the value present at the time of initialization.
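A small sketch of the behaviour described above, reusing the code from the question:
from pyspark.sql import SparkSession

spark2 = SparkSession.builder \
    .appName("test") \
    .config("spark.dynamicAllocation.minExecutors", 15) \
    .getOrCreate()

# The runtime config now reports the new value...
spark2.conf.set("spark.dynamicAllocation.minExecutors", 20)
print(spark2.conf.get("spark.dynamicAllocation.minExecutors"))  # prints 20
# ...but the session was already initialized with 15, so allocation is unaffected.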

List of spark-submit options

There are a ton of tunable settings mentioned on the Spark configuration page. However, as told here, the SparkSubmitOptionParser attribute name for a Spark property can be different from that property's name.
For instance, spark.executor.cores is passed as --executor-cores in spark-submit.
Where can I find an exhaustive list of all the tuning parameters of Spark (along with their SparkSubmitOptionParser attribute names) that can be passed with the spark-submit command?
While #suj1th's valuable inputs did solve my problem, I'm answering my own question to directly address my query.
You need not look up the SparkSubmitOptionParser attribute name for a given Spark property (configuration setting). Both will do just fine. However, do note that there's a subtle difference between their usage, as shown below:
spark-submit --executor-cores 2
spark-submit --conf spark.executor.cores=2
Both commands shown above have the same effect. The second method takes configurations in the format --conf <key>=<value>.
Enclosing values in quotes (correct me if this is incorrect / incomplete)
(i) Values need not be enclosed in quotes (single '' or double "") of any kind (you still can if you want).
(ii) If the value contains a space character, enclose the entire thing in double quotes, like "<key>=<value>", as shown here and in the example below.
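For example, loosely following the illustration in the Spark docs (my_app.jar is a placeholder):
spark-submit --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" my_app.jar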
For a comprehensive list of all configurations that can be passed with spark-submit, just run spark-submit --help
In this link provided by #suj1th, they say that:
configuration values explicitly set on a SparkConf take the highest
precedence, then flags passed to spark-submit, then values in the
defaults file.
If you are ever unclear where configuration options are coming from,
you can print out fine-grained debugging information by running
spark-submit with the --verbose option.
The following two links from the Spark docs list a lot of configurations:
Spark Configuration
Running Spark on YARN
In your case, you should actually load your configurations from a file, as mentioned in this document, instead of passing them as flags to spark-submit. This relieves the overhead of mapping SparkSubmitArguments to Spark configuration parameters. To quote from the above document:
Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
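For instance, assuming a defaults file like the following exists on the submitting machine (the contents are only illustrative), the --master flag can be dropped:
# conf/spark-defaults.conf
spark.master            yarn
spark.executor.cores    2
spark.executor.memory   4g

spark-submit my_app.py   # no --master flag; it comes from the defaults file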

Spark configuration priority

Is there any difference or priority between specifying Spark application configuration in the code:
SparkConf().setMaster("yarn")
and specifying it on the command line:
spark-submit --master yarn
Yes, the highest priority is given to the configuration set in the user's code with the set() function. After that come the flags passed to spark-submit.
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source
There are 4 precedence levels (1 to 4, 1 being the highest priority):
1. SparkConf set in the application
2. Properties given with spark-submit
3. Properties given in a property file, which can be passed as an argument at submission
4. Default values
Other than the priority, specifying it on the command line allows you to run on different cluster managers without modifying code. The same application can be run on local[n], yarn, mesos, or a Spark standalone cluster.
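A minimal PySpark sketch of that trade-off (the app name is a placeholder):
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Hard-coding the master in code: it takes precedence over any --master flag,
# but ties the application to one cluster manager.
conf = SparkConf().setMaster("yarn").setAppName("priority-demo")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Leaving setMaster() out lets spark-submit --master (or spark-defaults.conf) decide,
# so the same code can run on local[n], yarn, mesos, or standalone.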

Spark - config file that sets spark.storage.memoryFraction

I have come to learn that spark.storage.memoryFraction and spark.storage.safetyFraction are multiplied by the executor memory supplied in the SparkContext. Also, I have learned that it is desirable to lower the memoryFraction for better performance.
The question is where do I set the spark.storage.memoryFraction? Is there a config file?
The default file that Spark searches for such configurations is conf/spark-defaults.conf.
If you want to change the conf directory to a custom location, set SPARK_CONF_DIR in conf/spark-env.sh.
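For example, a sketch with a hypothetical custom conf directory, exported before submitting:
export SPARK_CONF_DIR=/opt/etl/spark-conf   # directory holding spark-defaults.conf and spark-env.sh
spark-submit my_app.py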
I recommend you keep it on a per-job basis instead of updating spark-defaults.conf.
You can create a config file per job, say spark.properties, and pass it to spark-submit:
--properties-file /spark.properties
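A sketch of such a per-job file and the matching submit command (the path and values are only examples):
# /path/to/spark.properties
spark.storage.memoryFraction   0.4
spark.executor.memory          4g

spark-submit --properties-file /path/to/spark.properties my_app.py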
