How to use --num-executors option with spark-submit? - apache-spark

I am trying to override spark properties such as num-executors while submitting the application by spark-submit as below :
spark-submit --class WC.WordCount \
--num-executors 8 \
--executor-cores 5 \
--executor-memory 3584M \
...../<myjar>.jar \
/public/blahblahblah /user/blahblah
However its running with default number of executors which is 2. But I am able to override properties if I add
--master yarn
Can someone explain why it is so ? Interestingly , in my application code I am setting master as yarn-client:
val conf = new SparkConf()
.setAppName("wordcount")
.setMaster("yarn-client")
.set("spark.ui.port","56487")
val sc = new SparkContext(conf)
Can someone throw some light as to how the option --master works

I am trying to override spark properties such as num-executors while submitting the application by spark-submit as below
It will not work (unless you override spark.master in conf/spark-defaults.conf file or similar so you don't have to specify it explicitly on the command line).
The reason is that the default Spark master is local[*] and the number of executors is exactly one, i.e. the driver. That's just the local deployment environment. See Master URLs.
As a matter of fact, num-executors is very YARN-dependent as you can see in the help:
$ ./bin/spark-submit --help
...
YARN-only:
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
That explains why it worked when you switched to YARN. It is supposed to work with YARN (regardless of the deploy mode, i.e. client or cluster which is about the driver alone not executors).
You may be wondering why it did not work with the master defined in your code then. The reason is that it is too late since the master has already been assigned on launch time when you started the application using spark-submit. That's exactly the reason why you should not specify deployment environment-specific properties in the code as:
It may not always work (see the case with master)
It requires that a code has to be recompiled every configuration change (and makes it a bit unwieldy)
That's why you should be always using spark-submit to submit your Spark applications (unless you've got reasons not to, but then you'd know why and could explain it with ease).

If you’d like to run the same application with different masters or different amounts of memory. Spark allows you to do that with an default SparkConf. As you are mentioning properties to SparkConf, those takes highest precedence for application, Check the properties precedence at the end.
Example:
val sc = new SparkContext(new SparkConf())
Then, you can supply configuration values at runtime:
./bin/spark-submit \
--name "My app" \
--deploy-mode "client" \
--conf spark.ui.port=56487 \
--conf spark.master=yarn \ #alternate to --master
--conf spark.executor.memory=4g \ #alternate to --executor-memory
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--class WC.WordCount \
/<myjar>.jar \
/public/blahblahblah \
/user/blahblah
Properties precedence order (top one is more)
Properties set directly on the SparkConf(in the code) take highest
precedence.
Any values specified as flags or in the properties file will be passed
on to the application and merged with those specified through
SparkConf.
then flags passed to spark-submit or spark-shell like --master etc
then options in the spark-defaults.conf file.
A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Source: Dynamically Loading Spark Properties

Related

hadoop multi node with spark sample job

I have just configured spark on my Hadoop cluster and i want to run the spark sample job.
before that I want to understand what, this below job code stands for.
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
You can see all possible parameters for submitting a spark job on here. I summarized the ones in your submit script as below:
spark-submit
--deploy-mode client # client/cluster. default value client. Whether to deploy your driver on the worker nodes or locally
--class org.apache.spark.examples.SparkPi # The entry point for your application
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10 #jar file path and expected arguments
--master is another parameter usually defined in submit scripts. For my HDP cluster default value of master is yarn. You can see all possible values for master in spark documentation again.

df.show() prints empty result while in hdfs it is not empty

I have a pyspark application which is submitted to yarn with multiple nodes and it also reads parquet from hdfs
in my code, i have a dataframe which is read directly from hdfs:
df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")
when i use df.show(n=2) directly in my code after the above code, it outputs:
+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+
But when i manually go to the hdfs path, data is not empty.
What i have tried?
1- at first i thought that i may have used few cores and memory for my executor and driver, so i doubled them and nothing changed.
2- then i thought that the path may be wrong, so i gave it an wrong hdfs path and it throwed error that this path does not exist
What i am assuming?
1- i think this may have something to do with drivers and executors
2- it may i have something to do with yarn
3- configs provided when using spark-submit
current config:
spark-submit \
--master yarn \
--queue my_queue_name \
--deploy-mode cluster \
--jars some_jars \
--conf spark.yarn.dist.files some_files \
--conf spark.sql.catalogImplementation=in-memory \
--properties-file some_zip_file \
--py-files some_py_files \
main.py
What i am sure
data is not empty. the same hdfs path is provided in another project which is working fine.
So the problem was with the jar files i was providing
The hadoop version was 2.7.2 and i changed it to 3.2.0 and it's working fine

Setting Environment variables in Spark Cluster Mode

I was going through this Apache Spark documentation, and it mentions that:
When running Spark on YARN in cluster mode, environment variables
need to be set using the
spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your
conf/spark-defaults.conf file.
I am running my EMR cluster on AWS data pipeline. I wanted to know that where do I have to edit this conf file. Also, if I create my own custom conf file, and specify it as part of --configurations (in the spark-submit), will it solve my use-case?
One way to do it, is the following: (The tricky part is that you might need to setup the environment variables on both executor and driver parameters)
spark-submit \
--driver-memory 2g \
--executor-memory 4g \
--conf spark.executor.instances=4 \
--conf spark.driver.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--conf spark.executor.extraJavaOptions="-DENV_KEY=ENV_VALUE" \
--master yarn \
--deploy-mode cluster\
--class com.industry.class.name \
assembly-jar.jar
I have tested it in EMR and client mode but should work on cluster mode as well.
For future reference you could directly pass the environment variable when creating the EMR cluster using the Configurations parameter as described in the docs here.
Specifically, the spark-defaults file can be modified by passing a configuration JSON as follows:
{
'Classification': 'spark-defaults',
'Properties': {
'spark.yarn.appMasterEnv.[EnvironmentVariableName]' = 'some_value',
'spark.executorEnv.[EnvironmentVariableName]': 'some_other_value'
}
},
Where spark.yarn.appMasterEnv.[EnvironmentVariableName] would be used to pass a variable in cluster mode using YARN (here). And spark.executorEnv.[EnvironmentVariableName] to pass a variable to the executor process (here).

spark-submit in deploy mode client not reading all the jars

I'm trying to submit an application to my spark cluster (standalone mode) through the spark-submit command. I'm following the
official spark documentation, as well as relying on this other one. Now the problem is that I get strange behaviors. My setup is the following:
I have a directory where all the dependency jars for my application are located, that is /home/myuser/jars
The jar of my application is in the same directory (/home/myuser/jars), and is called dat-test.jar
The entry point class in dat-test.jar is at the package path my.package.path.Test
Spark master is at spark://master:7077
Now, I submit the application directly on the master node, thus using the client deploy mode, running the command
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 /home/myuser/jars/*
and I received an error as
java.lang.ClassNotFoundException: my.package.path.Test
If I activate the verbose mode, what I see is that the primaryResource selected as jar containing the entry point is the first jar by alphabetical order in /home/myuser/jars/ (that is not dat-test.jar), leading (I supppose) to the ClassNotFoundException. All the jars in the same directory are anyway loaded as arguments.
Of course if I run
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 /home/myuser/jars/dat-test.jar
it finds the Test class, but it doesn't find other classes contained in other jars. Finally, if I use the --jars flag and run
./spark-submit --class my.package.path.Test --master spark://master:7077 --executor-memory 5G --total-executor-cores 10 --jars /home/myuser/jars/* /home/myuser/jars/dat-test.jar
I obtain the same result as the first option. First jar in /home/myuser/jars/ is loaded as primaryResource, leading to ClassNotFoundException for my.package.path.Test. Same if I add --jars /home/myuser/jars/*.jar.
Important points are:
I do not want to have a single jar with all the dependencies for development reasons
The jars in /home/myuser/jars/ are many. I'd like to know if there's a way to include them all instead of using the comma separated syntax
If I try to run the same commands with --deploy-cluster on the master node, I don't get the error, but the computation fails for some other reasons (but this is another problem).
Which is then the correct way of running a spark-submit in client mode?
Thanks
There is no way to include all jars using the --jars option, you will have to create a small script to enumerate them. This part is a bit sub-optimal.

How to get SparkContext if Spark runs on Yarn?

We have a program based on Spark standalone, and in this program we use SparkContext and SqlContext to do lots of queries.
Now we want to deploy the system on a Spark which runs on Yarn. But when we modify the spark.master to yarn-cluster, the application throws an exception says this works with spark-submit type only. When we switch to yarn-client, although it no longer throws exceptions, it doesn't work properly.
It seems that if runs on Yarn, we can no longer use SparkContext to work, instead we should use something like yarn.Client, but in this way we don't know how to change our code to achieve what we have done before using SparkContext and SqlContext.
Is there a good way to solve this? Can we get SparkContext from yarn.Client or we should change our code to utilize new interfaces of yarn.Client?
Thank you!
When you run on cluster , you need to do a spark-submit like this
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
--master will be yarn
--deploy-mode will be cluster
In your application if you have something like setMaster("local[]") , you can remove it and build the code. when you do spark-submit with --Master yarn, yarn will launch containers for you instead of spark-standalone scheduler.
Your app code can look like this without any setting for Master
val conf = new SparkConf().setAppName("App Name")
val sc = new SparkContext(conf)
yarn deploy mode client is use when you want to launch driver on same machine from code is running. On a cluster the deploy mode should be cluster, this will make sure driver is launched on one of the worker node by yarn.

Resources