How does spark-submit.sh work with different modes and different cluster managers? - apache-spark

In Apache Spark, how does spark-submit.sh work with different modes and different cluster managers? Specifically:
In local deployment mode:
Does spark-submit.sh skip calling any cluster manager?
Is it correct that there is no need to install a cluster manager on the local machine?
In client or cluster deployment mode:
Does spark-submit.sh work with different cluster managers (Spark standalone, YARN, Mesos, Kubernetes)? Do different cluster managers have different interfaces, so that spark-submit.sh has to invoke them in different ways?
Does spark-submit.sh present the same interface to programmers apart from --master, the option used to specify the cluster manager?
Thanks.

To make things clear: there is absolutely no need to specify a cluster manager to run Spark in any mode (client, cluster, or local). The cluster manager is only there to make resource allocation easier and independent of your application, but it is always your choice whether to use one.
The spark-submit command does not need a cluster manager present in order to run.
The different ways in which you can use the command are:
1) local mode:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
2) client mode without a resource manager (also known as spark standalone mode):
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
3) cluster mode with spark standalone mode:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
4) Client/Cluster mode with a resource manager:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \ # can be client for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
As you can see above, spark-submit behaves in the same way whether or not a cluster manager is present. Likewise, if you want to use a resource manager such as YARN or Mesos, the behaviour of spark-submit stays the same; only the --master URL changes.
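To make that concrete, here is a minimal sketch of the same SparkPi submission in which only --master varies; the addresses below are the placeholder values reused from the examples above.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
/path/to/examples.jar \
100
# Everything else stays the same; only the --master URL differs per cluster manager:
#   local[8]                            local mode, no cluster manager
#   spark://207.184.161.138:7077        Spark standalone
#   mesos://207.184.161.138:5050        Mesos
#   k8s://https://207.184.161.138:443   Kubernetes API server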
You can read more about spark-submit in the Spark documentation on submitting applications.

Related

Can we have multiple executors in Spark master local[*] deployment code client

I have a 1-node Hadoop cluster, and I am submitting a Spark job like this:
spark-submit \
--class com.compq.scriptRunning \
--master local[*] \
--deploy-mode client \
--num-executors 3 \
--executor-cores 4 \
--executor-memory 21g \
--driver-cores 2 \
--driver-memory 5g \
--conf "spark.local.dir=/data/spark_tmp" \
--conf "spark.sql.shuffle.partitions=2000" \
--conf "spark.sql.inMemoryColumnarStorage.compressed=true" \
--conf "spark.sql.autoBroadcastJoinThreshold=200000" \
--conf "spark.speculation=false" \
--conf "spark.hadoop.mapreduce.map.speculative=false" \
--conf "spark.hadoop.mapreduce.reduce.speculative=false" \
--conf "spark.ui.port=8099" \
.....
Though I define 3 executors, I see only 1 executor in the Spark UI running all the time. Can we have multiple executors running in parallel with
--master local[*] \
--deploy-mode client \
It's an on-prem, plain open-source Hadoop flavor installed in the cluster.
I tried changing the master from local to local[*] and playing around with deployment modes; still, I could see only 1 executor running in the Spark UI.

Apache Spark application is not seen in Spark Web UI (Java)

I am trying to run an ETL job using Apache Spark (Java) in a Kubernetes cluster. The application is running, and data is getting inserted into the database (MySQL). But the application is not seen in the Spark Web UI.
The command I used for submitting the application is:
./spark-submit --class com.xxxx.etl.EtlApplication \
--name MyETL \
--master k8s://XXXXXXXXXX.xxx.us-west-2.eks.amazonaws.com:443 \
--conf "spark.kubernetes.container.image=YYYYYY.yyy.ecr.us-west-2.amazonaws.com/spark-poc:32" \
--conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077" \
--conf "spark.kubernetes.authenticate.driver.serviceAccountName=my-spark" \
--conf "spark.kubernetes.driver.request.cores=256m" \
--conf "spark.kubernetes.driver.limit.cores=512m" \
--conf "spark.kubernetes.executor.request.cores=256m" \
--conf "spark.kubernetes.executor.limit.cores=512m" \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/EtlApplication-with-dependencies.jar 1000
I use a Jenkins job to build my code and move the jar to the /opt/bitnami/spark/examples/jars folder in the container inside the cluster.
The job is seen running in the pod when I check with kubectl get pods, and the UI is reachable at localhost:4040 after mapping the port to localhost using kubectl port-forward pod/myetl-df26f5843cb88da7-driver 4040:4040
I tried the same spark-submit command with the Spark example jar (which came along with the Spark installation in the container):
./spark-submit --class org.apache.spark.examples.SparkPi \
--conf "spark.kubernetes.container.image=YYYYYY.yyy.ecr.us-west-2.amazonaws.com/spark-poc:5" \
--master k8s://XXXXXXXXXX.xxx.us-west-2.eks.amazonaws.com:443 \
--conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077" \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=my-spark \
--deploy-mode cluster \
local:///opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000
This time the application is listed in the Spark Web UI. I tried several options, and on removing the line --conf "spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://my-spark-master-headless.default.svc.cluster.local:7077", the SparkPi example application is also not displayed in the Spark Web UI.
Am I missing something? Do I need to change my Java code to accept spark.kubernetes.driverEnv.SPARK_MASTER_URL? I tried several options but nothing works.
Thanks in advance.

Pass arguments from a file to multiple spark jobs

Is it possible to have one master file that stores a list of arguments that can be referenced from a spark-submit command?
Example of the properties file, configurations.txt (does not have to be .txt):
school_library = "central"
school_canteen = "Nothernwall"
Expected requirement:
Calling it in one spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_library
Calling it in another spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_canteen
Calling both in another spark-submit:
spark-submit --master yarn \
--deploy-mode cluster \
--jars sample.jar \
/home/user/helloworld.py configurations.school_library configurations.school_canteen
Yes.
You can do that with the --files option.
For example, you are submitting a Spark job with a config file, /data/config.conf:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
--files /data/config.conf \
/path/to/examples.jar
This file will be uploaded and placed in the working directory of the driver, so you have to access it by its name.
Ex:
new FileInputStream("config.conf")
The spark-submit parameter --properties-file can also be used.
Property names have to start with the "spark." prefix, for example:
spark.mykey=myvalue
Values in this case are read from the configuration (SparkConf).
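A minimal sketch of that approach, assuming the file is named /data/job.properties and uses spark.-prefixed keys such as spark.school_library (both names are illustrative, not from the original post):
# /data/job.properties contains lines such as:
#   spark.school_library=central
#   spark.school_canteen=Nothernwall
spark-submit --master yarn \
--deploy-mode cluster \
--properties-file /data/job.properties \
--jars sample.jar \
/home/user/helloworld.py
# Inside the job, read the values back from the configuration,
# e.g. spark.conf.get("spark.school_library") in the application code.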

Submit job to DCOS Spark with multiple instances?

I have two instances of Spark in my DCOS cluster. When I submit my job via the CLI
dcos spark run --submit-args="\
--driver-cores 8 \
--driver-memory 16384M \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=hdfs://hdfs/history \
--class com.CalcPi \
<url to job -spark-test-assembly-0.0.5-SNAPSHOT.jar> 99000000"
the job is forever stuck in the queue. But when I have only one instance, everything works fine. I have already tried
--deploy-mode cluster --supervise
The following config options are hopefully the answer you are looking for:
dcos config set spark.app_id spark-one
dcos spark run ...
dcos config set spark.app_id spark-two
dcos spark run ...

How to execute Spark programs with Dynamic Resource Allocation?

I am using the spark-submit command for executing Spark jobs with parameters such as:
spark-submit --master yarn-cluster --driver-cores 2 \
--driver-memory 2G --num-executors 10 \
--executor-cores 5 --executor-memory 2G \
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
Now I want to execute the same program using Spark's dynamic resource allocation. Could you please help with the usage of dynamic resource allocation in executing Spark programs?
For dynamic allocation, spark.dynamicAllocation.enabled needs to be set to true because it is false by default.
This also requires spark.shuffle.service.enabled to be set to true, since the Spark application is running on YARN. See the Spark documentation for how to start the shuffle service on each NodeManager in YARN.
The following configurations are also relevant:
spark.dynamicAllocation.minExecutors,
spark.dynamicAllocation.maxExecutors, and
spark.dynamicAllocation.initialExecutors
These options can be configured for a Spark application in three ways:
1. From Spark submit with --conf <prop_name>=<prop_value>
spark-submit --master yarn-cluster \
--driver-cores 2 \
--driver-memory 2G \
--num-executors 10 \
--executor-cores 5 \
--executor-memory 2G \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=5 \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.initialExecutors=10 \ # same as --num-executors 10
--class com.spark.sql.jdbc.SparkDFtoOracle2 \
Spark-hive-sql-Dataframe-0.0.1-SNAPSHOT-jar-with-dependencies.jar
2. Inside Spark program with SparkConf
Set the properties in SparkConf, then create the SparkSession or SparkContext with it:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf: SparkConf = new SparkConf()
conf.set("spark.dynamicAllocation.minExecutors", "5")
conf.set("spark.dynamicAllocation.maxExecutors", "30")
conf.set("spark.dynamicAllocation.initialExecutors", "10")
// ... then build the session (or context) from this conf
val spark = SparkSession.builder().config(conf).getOrCreate()
3. spark-defaults.conf, usually located in $SPARK_HOME/conf/
Place the same configurations in spark-defaults.conf; they apply to all Spark applications when no configuration is passed from the command line or in code.
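For illustration, the relevant entries in spark-defaults.conf would look like the following sketch (same values as in the spark-submit example above):
spark.dynamicAllocation.enabled          true
spark.shuffle.service.enabled            true
spark.dynamicAllocation.minExecutors     5
spark.dynamicAllocation.maxExecutors     30
spark.dynamicAllocation.initialExecutors 10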
See also the Spark documentation on the dynamic allocation configuration properties.
I just did a small demo with Spark's dynamic resource allocation. The code is on my GitHub; specifically, the demo is in this release.
