List of spark-submit options - apache-spark

There are a ton of tunable settings mentioned on the Spark configurations page. However, as told here, the SparkSubmitOptionParser attribute name for a Spark property can be different from that property's name.
For instance, spark.executor.cores is passed as --executor-cores in spark-submit.
Where can I find an exhaustive list of all tuning parameters of Spark (along-with their SparkSubmitOptionParser property name) that can be passed with spark-submit command?

While @suj1th's valuable inputs did solve my problem, I'm answering my own question to directly address my query.
You need not look up the SparkSubmitOptionParser attribute name for a given Spark property (configuration setting); both will do just fine. However, do note that there's a subtle difference in their usage, as shown below:
spark-submit --executor-cores 2
spark-submit --conf spark.executor.cores=2
Both commands shown above have the same effect. The second method takes configurations in the format --conf <key>=<value>.
Enclosing values in quotes (correct me if this is incorrect / incomplete)
(i) Values need not be enclosed in quotes (single '' or double "") of any kind (you still can if you want).
(ii) If the value has a space character, enclose the entire thing in double quotes "" like "<key>=<value>" as shown here.
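For example, a value that contains a space (here, two illustrative driver JVM options) has to be quoted as a whole:
spark-submit --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties -XX:+UseG1GC"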
For a comprehensive list of all configurations that can be passed with spark-submit, just run spark-submit --help
In this link provided by @suj1th, they say that:
configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.
Following two links from Spark docs list a lot of configurations:
Spark Configuration
Running Spark on YARN

In your case, you should actually load your configurations from a file, as mentioned in this document, instead of passing them as flags to spark-submit. This saves you the overhead of mapping SparkSubmitArguments to Spark configuration parameters. To quote from the above document:
Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
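As a minimal sketch (the file name, class, and jar below are made up for illustration), you could collect such settings in a properties file and point spark-submit at it with --properties-file:
# my-spark.conf
spark.master            yarn
spark.executor.cores    2
spark.executor.memory   4g

spark-submit --properties-file my-spark.conf --class com.example.MyApp my-app.jar
If no --properties-file is given, spark-submit falls back to conf/spark-defaults.conf.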

Related

How to understand the relationship and use of spark --jars, extraClassPath and extraLibraryPath?

First of all, I have seen similar problems (another problem link), but I think the answer there is not very clear.
Some of my questions are as follows:
(1) Is the --jars parameter the same as the spark.executor.extraClassPath parameter? If they are different, what is the difference?
I have checked --help for --jars in the spark-submit command line, which explains it as follows:
Comma-separated list of local jars to include on the driver
and executor classpaths.
However, I did not find an explanation of spark.executor.extraClassPath in the spark-submit command-line help. Finally, I found the following explanation of spark.executor.extraClassPath on the official Spark website:
Extra classpath entries to prepend to the classpath of executors.
From these two descriptions, they seem to have the same effect?
But I see the following paragraph from another question link:
--jars vs SparkContext.addJar: These are identical, only one is set through spark submit and one via code. Choose the one which suits you better. One important thing to note is that using either of these options does not add the JAR to your driver/executor classpath, you'll need to explicitly add them using the extraClassPath config on both.
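If I understand that correctly, to actually get a jar on the driver/executor classpath I would need something like this (the jar name is made up by me):
spark-submit --jars mylib.jar --conf spark.driver.extraClassPath=mylib.jar --conf spark.executor.extraClassPath=mylib.jar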
Why is this the case?
(2) spark.executor.extraClassPath and spark.executor.extraLibraryPath (and likewise the corresponding spark.driver.extraXXXXPath settings with the driver prefix).
What is the difference between extraClassPath and extraLibraryPath?
Here is the explanation of spark.executor.extraLibraryPath from the official Spark website:
Set a special library path to use when launching executor JVM's.
I don't understand: what is the difference between this and --jars / spark.executor.extraClassPath?
I look forward to your explanation and answer. Thank you.

What is the difference between configuration and env variables of Spark?

There are some configurations that confuse me, like:
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.minExecutors = 3
spark.eventLog.dir=/home/rabindra/etl/logs
SPARK_WORKER_DIR=/home/knoldus/work/sparkdata
Which of these Spark variables should go in spark-env.sh and which in spark-defaults.conf?
What configurations can we set in a Spark standalone cluster?
The first three go in spark-defaults.conf. The last goes into spark-env.sh as shown in this Knoldus example--maybe the one you're using.
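In other words, a minimal split of the settings from your question would look like this:
# spark-defaults.conf
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   3
spark.eventLog.dir                     /home/rabindra/etl/logs

# spark-env.sh
SPARK_WORKER_DIR=/home/knoldus/work/sparkdata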
I suppose an analogy might be the difference between JVM arguments and environment variables. As shown in the documentation, the configurations you want to apply to a SparkConf, like the application name, the URI of the master, or memory allocation, are on a per-application basis.
Meanwhile, environment variables, whether related to Spark or anything else, apply on a per-machine basis. Of course sometimes the machine-specific settings you would specify with an environment variable belong instead in your resource manager like YARN.
The list of configuration parameters is large. See the documentation linked above for more.

Spark configuration priority

Is there any difference or priority between specifying Spark application configuration in the code:
new SparkConf().setMaster("yarn")
and specifying it on the command line:
spark-submit --master yarn
Yes, the highest priority is given to the configuration set in the user's code with the set() functions. After that come the flags passed to spark-submit.
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source
There are 4 precedence levels (1 to 4, 1 being the highest priority):
SparkConf set in the application
Properties given with the spark-submit
Properties given in a properties file, which can be passed as an argument at submission
Default values
Other than the priority, specifying the master on the command line allows you to run on different cluster managers without modifying code: the same application can be run on local[n], yarn, mesos, or a Spark standalone cluster.
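As a minimal sketch (the app name is hypothetical), a master hard-coded on the SparkConf wins over the --master flag:
import org.apache.spark.{SparkConf, SparkContext}

object PrecedenceDemo {
  def main(args: Array[String]): Unit = {
    // Highest precedence: set directly on the SparkConf in code
    val conf = new SparkConf().setAppName("PrecedenceDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Even if launched with: spark-submit --master yarn ...
    // the application runs with master local[2], because code-level settings win.
    println(sc.master)
    sc.stop()
  }
}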

spark streaming application and kafka log4j appender issue

I am testing my spark streaming application, and I have multiple functions in my code:
- some of them operate on a DStream[RDD[XXX]], some of them on RDD[XXX] (after I do DStream.foreachRDD).
I use the Kafka log4j appender to log business cases that occur within my functions, which operate both on the DStream and on the RDDs themselves.
But data gets appended to Kafka only from the functions that operate on RDDs; it doesn't work when I want to append data to Kafka from my functions that operate on the DStream.
Does anyone know the reason for this behaviour?
I am working on a single virtual machine, where I have Spark & Kafka. I submit applications using spark-submit.
EDITED
Actually, I have figured out part of the problem. Data gets appended to Kafka only from the part of the code that is in my main function. All the code that is outside of my main doesn't write data to Kafka.
In main I declared the logger like this:
val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
While outside of my main, I had to declare it like:
@transient lazy val kafkaLogger = org.apache.log4j.LogManager.getLogger("kafkaLogger")
in order to avoid serialization issues.
The reason might lie in JVM serialization, or simply in the workers not seeing the log4j configuration file (but my log4j file is in my source code, in the resources folder).
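For example, outside of main the declaration ends up in a helper like this (class and method names here are just illustrative):
import org.apache.log4j.LogManager

// The logger is @transient (not serialized with the instance) and lazy
// (re-created on first use in whichever JVM, driver or executor, runs the closure).
class BusinessEventProcessor extends Serializable {
  @transient lazy val kafkaLogger = LogManager.getLogger("kafkaLogger")

  def process(record: String): Unit = {
    kafkaLogger.info(s"processed: $record")
  }
}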
Edited 2
I have tried many ways to ship the log4j file to the executors, but none of them work. I tried:
sending the log4j file with the --files option of spark-submit
setting --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/home/vagrant/log4j.properties" in spark-submit
setting the log4j.properties file on the --driver-class-path of spark-submit...
None of these options worked.
Does anyone have a solution? I do not see any errors in my error log.
Thank you
I think you are close. First, you want to make sure all the files are exported to the WORKING DIRECTORY (not the CLASSPATH) on all nodes using the --files flag. Then you want to reference these files in the extraClassPath option of the executors and the driver. I have attached the following command; hope it helps. The key is to understand that once the files are exported, they can all be accessed on the node using just the file name relative to the working directory (and not a URL path).
Note: Putting the log4j file in the resources folder will not work (at least when I tried it, it didn't).
sudo -u hdfs spark-submit --class "SampleAppMain" --master yarn --deploy-mode cluster --verbose \
  --files file:///path/to/custom-log4j.properties,hdfs:///path/to/jar/kafka-log4j-appender-0.9.0.0.jar \
  --conf "spark.driver.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.executor.extraClassPath=kafka-log4j-appender-0.9.0.0.jar" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties" \
  /path/to/your/jar/SampleApp-assembly-1.0.jar
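For reference, a custom-log4j.properties wired to the Kafka appender could look roughly like the sketch below; the broker address and topic are placeholders, and you should double-check the appender class and property names against the kafka-log4j-appender version you ship:
log4j.rootLogger=INFO, console
log4j.logger.kafkaLogger=INFO, KAFKA

# Kafka appender (verify class/property names for your kafka-log4j-appender version)
log4j.appender.KAFKA=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.appender.KAFKA.brokerList=localhost:9092
log4j.appender.KAFKA.topic=app-logs
log4j.appender.KAFKA.layout=org.apache.log4j.PatternLayout
log4j.appender.KAFKA.layout.ConversionPattern=%d [%t] %-5p %c - %m%n

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d [%t] %-5p %c - %m%n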

Which configuration options are preferred in spark?

I wanted to enquire which configuration option is given priority in Spark: is it the configuration file, or the options we manually specify when running spark-submit?
What if I have different options for executor memory in my configuration file and I specify a different value while running the spark-submit shell?
The Spark (1.5.0) configuration page clearly states what the priorities are:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
So this is the priority order (from highest to lowest):
Properties set on the SparkConf (in program).
Flags passed to spark-submit or spark-shell.
Options set in the spark-defaults.conf file.
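Applied to your executor-memory example (the values below are made up), the spark-submit flag overrides the defaults file, and a SparkConf setting in the program would override both:
# spark-defaults.conf
spark.executor.memory   2g

# command line
spark-submit --executor-memory 4g --class com.example.MyApp my-app.jar
# each executor gets 4g: the flag beats spark-defaults.conf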
