I use Spark 1.3.0 in a cluster of 5 worker nodes with 36 cores and 58GB of memory each. I'd like to configure Spark's Standalone cluster with many executors per worker.
I have seen the merged SPARK-1706, however it is not immediately clear how to actually configure multiple executors.
Here is the latest configuration of the cluster:
spark.executor.cores = "15"
spark.executor.instances = "10"
spark.executor.memory = "10g"
These settings are set on a SparkContext when the Spark application is submitted to the cluster.
First configure your Spark standalone cluster, then set the amount of resources needed for each individual Spark application you want to run.
To configure the cluster, you can try this:
In conf/spark-env.sh:
Set SPARK_WORKER_INSTANCES=10, which determines the number of Worker instances per node (the default is 1). In Spark 1.3 an application gets at most one executor per Worker, so running more Worker instances is how you get more executors per node.
Set SPARK_WORKER_CORES=15 # number of cores that one Worker instance can use (default: all cores, 36 in your case)
Set SPARK_WORKER_MEMORY=55g # amount of memory that one Worker instance can hand out to Spark programs
Both limits apply to each Worker instance, so with several instances per node you will want to scale them down so that instances * cores stays within the 36 cores and instances * memory stays within the 58GB of a node.
Copy this configuration file to the same folder on all Worker nodes.
Start your cluster by running the scripts in sbin (sbin/start-all.sh, ...).
As you have 5 worker nodes, with the above configuration you should see 5 (nodes) * 10 (Worker instances per node) = 50 alive workers on the master's web interface (http://localhost:8080 by default), and therefore up to 50 executors for one application.
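As a rough sketch of the spark-env.sh part: the core and memory limits are read by each Worker process, so 10 instances with 15 cores each would advertise 150 cores on a 36-core node. The split below (2 instances of 18 cores and 27g each) is only an assumed example that stays within the 36 cores and 58GB from the question, not a recommendation.

```shell
# conf/spark-env.sh (sketch; the 2 x 18 split is an assumed example)
export SPARK_WORKER_INSTANCES=2   # Worker processes per node
export SPARK_WORKER_CORES=18      # cores offered by EACH Worker process
export SPARK_WORKER_MEMORY=27g    # memory offered by EACH Worker process

# Sanity check: cores advertised per node should not exceed the 36 physical cores.
total_cores=$((SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES))
echo "cores advertised per node: ${total_cores}"
```

With these values the check prints 36, and 2 * 27g leaves a few GB of the 58GB for the OS and the Worker daemons.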
When you run an application in standalone mode, by default it will acquire all available cores in the cluster. You need to explicitly set the amount of resources for running this application:
E.g.:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster(...)                      // master URL, spark://host:port
  .setAppName(...)
  .set("spark.executor.memory", "2g")  // memory per executor
  .set("spark.cores.max", "10")        // total cores the application may take across the cluster
Starting in Spark 1.4 it should be possible to configure this:
Setting: spark.executor.cores
Default: 1 in YARN mode, all the available cores on the worker in standalone mode.
Description: The number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
http://spark.apache.org/docs/1.4.0/configuration.html#execution-behavior
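To get a feel for what spark.executor.cores implies on the cluster from the question, here is a back-of-the-envelope sketch. It assumes each worker offers all 36 cores and 57GB of memory (both assumed values, and the variable names are made up for illustration). It only gives an upper bound, since spark.cores.max can cap the application further.

```shell
# Assumed worker capacity and the per-executor settings from the question.
worker_cores=36
worker_mem_gb=57
workers=5
executor_cores=15
executor_mem_gb=10

# An executor needs both enough cores and enough memory on the worker.
by_cores=$((worker_cores / executor_cores))
by_mem=$((worker_mem_gb / executor_mem_gb))
per_worker=$by_cores
if [ "$by_mem" -lt "$per_worker" ]; then per_worker=$by_mem; fi

echo "executors per worker: ${per_worker}, cluster-wide: $((per_worker * workers))"
```

Here cores are the limiting factor: 36 / 15 leaves room for 2 executors per worker, so at most 10 across the 5 workers.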
As of Apache Spark 2.2, the standalone cluster mode still has no direct setting for the number of executors per worker. There is an alternative, though, which is to launch Spark executors manually:
[usr@lcl ~spark/bin]# ./spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@DRIVER-URL:PORT --executor-id val --hostname localhost-val --cores 41 --app-id app-20170914105902-0000-just-exemple --worker-url spark://Worker@localhost-exemple:34117
I hope that helps!
In standalone mode, by default, all the resources on the cluster are acquired as you launch an application. You control the number of executors indirectly with the --executor-cores and --total-executor-cores options: the application gets about total-executor-cores / executor-cores executors, as long as the workers have room for them.
For example, suppose your cluster has 1 worker (1 worker == 1 machine in your cluster; it's good practice to have only 1 worker per machine) with 3 cores and 3G available in its pool (this is specified in spark-env.sh). When you submit an application with --executor-cores 1 --total-executor-cores 2 --executor-memory 1g, two executors are launched for the application, each with 1 core and 1g. Hope this helps!
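The arithmetic behind that example can be sketched as follows. This is a simplified model for a single worker, not the scheduler's actual code, and the variable names are made up for illustration.

```shell
# Worker pool from the example (set in spark-env.sh).
worker_cores=3
worker_mem_gb=3
# Values passed to spark-submit in the example.
executor_cores=1
total_executor_cores=2
executor_mem_gb=1

# Requested executors, then capped by what the worker can actually host.
executors=$((total_executor_cores / executor_cores))
fit_by_cores=$((worker_cores / executor_cores))
fit_by_mem=$((worker_mem_gb / executor_mem_gb))
if [ "$fit_by_cores" -lt "$executors" ]; then executors=$fit_by_cores; fi
if [ "$fit_by_mem" -lt "$executors" ]; then executors=$fit_by_mem; fi

echo "executors launched: ${executors}"
```

The worker could host 3 such executors, but the application only asked for 2 cores in total, so 2 executors are launched.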
What is the relationship between a Spark executor and a YARN container when using Spark on YARN?
For example, when I set executor-memory = 20G and yarn container memory = 10G, does 1 executor contain 2 containers?
A Spark executor runs within a YARN container. YARN containers are provided by the ResourceManager on demand. Each container hosts exactly one Spark executor.
Spark executors are the processes that run the tasks.
A Spark executor is started on a worker node (a node running a NodeManager, often also a DataNode).
In your case, setting executor-memory = 20G means you are asking for a container of roughly 20GB (plus memory overhead) for each executor. An executor is never split across containers, so the request has to fit within the largest container YARN is allowed to allocate.
So, for example, if your job runs one such executor on each node of an 8-node cluster, it uses about 8 * 20 GB of memory in total.
Below are the 3 config options available in yarn-site.xml with which you can play around and see the differences.
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.resource.memory-mb
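As a rough check of the numbers in the question, the sketch below compares the container an executor would request against the YARN limit. It assumes that "yarn container memory = 10G" means yarn.scheduler.maximum-allocation-mb, and that the memory overhead is max(384MB, 10% of executor memory); the overhead rule differs between Spark versions, so treat it as an assumption.

```shell
executor_mem_mb=20480       # --executor-memory 20G
yarn_max_alloc_mb=10240     # assumed: yarn.scheduler.maximum-allocation-mb = 10G

# Assumed overhead rule: 10% of executor memory, but at least 384MB.
overhead_mb=$((executor_mem_mb / 10))
if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi

container_mb=$((executor_mem_mb + overhead_mb))
if [ "$container_mb" -le "$yarn_max_alloc_mb" ]; then fits=yes; else fits=no; fi

echo "container request: ${container_mb} MB, fits: ${fits}"
```

Under these assumptions the request is 22528 MB, which is above the 10G limit, so the executor would not be granted a container rather than being spread over two.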
When running Spark on YARN, each Spark executor runs as a YARN container. This means the number of containers will always be the same as the number of executors created by the Spark application, e.g. via the --num-executors parameter in spark-submit.
https://stackoverflow.com/a/38348175/9605741
In YARN mode, each executor runs in one container. The number of executors is the same as the number of containers allocated from YARN (except in cluster mode, which allocates one more container to run the driver).
I was running an application on AWS EMR with Spark. Here is the spark-submit job:
Arguments : spark-submit --deploy-mode cluster --class com.amazon.JavaSparkPi s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08
AWS EMR uses YARN for resource management. I had a couple of doubts about this while observing the CloudWatch metrics:
1) What does "container allocated" imply here? I am using 1 master and 3 slave/executor nodes (all 4 have 8-core CPUs).
2) I changed my command to:
spark-submit --deploy-mode cluster --executor-cores 4 --class com.amazon.JavaSparkPi s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08
Here the number of cores running is 3. Should it not be 3 (number of executors) * 4 (number of cores) = 12?
1) "Container allocated" here basically represents the number of Spark executors. Spark executor-cores are more like "executor tasks", meaning that you could configure your app to run one executor per physical CPU and still ask for 3 executor-cores per CPU (think hyper-threading).
By default on EMR, when you don't specify the number of Spark executors, dynamic allocation is assumed and Spark only asks YARN for what it thinks it needs in terms of resources. I tried explicitly setting the number of executors to 10, and the containers allocated went up to 6 (the maximum number of partitions of the data). Also, under the "Application history" tab, you can get a detailed view of YARN/Spark executors.
2) "Cores" here refers to EMR core nodes, which are not the same as Spark executor cores. The same goes for "task", which in the monitoring tab refers to EMR task nodes. That is consistent with my setup, as I have 3 EMR slave nodes.
I'm trying to set up a standalone Spark 2.0 server to process an analytics function in parallel. To do this I want to have a single worker with multiple executors.
I'm using :
Standalone Spark 2.0
8 Cores
24GB RAM
Windows Server 2008
pyspark (although this appears unrelated)
This is purely a proof of concept, but I want to have 8 executors, one per core.
I've tried to follow the other threads on this topic, but for some reason it's not working for me, e.g.:
Spark Standalone Number Executors/Cores Control
My configuration is as follows:
conf\spark-defaults.conf
spark.cores.max = 8
spark.executor.cores = 1
I have also tried changing my spark-env.sh file, to no avail. Instead, my 1 worker only has 1 executor on it: it still shows the standalone worker with a single executor that has all 8 cores.
I believe you mixed up local and standalone modes:
Local mode is a development tool where all processes are executed inside a single JVM. An application is started in local mode by setting the master to local, local[*] or local[n]. spark.executor.cores and spark.cores.max are not applicable in local mode because there is only one embedded executor.
Standalone mode requires a standalone Spark cluster. It requires a master node (which can be started using the SPARK_HOME/sbin/start-master.sh script) and at least one worker node (which can be started using the SPARK_HOME/sbin/start-slave.sh script).
The SparkConf should be created with the master node's address (spark://host:port).
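A quick way to tell which mode an application will run in is to look at its master URL. The helper below is a hypothetical illustration (the function name mode_for_master is made up, and it only covers the cases discussed here), not part of Spark.

```shell
# Classify a Spark master URL into the modes discussed above.
mode_for_master() {
  case "$1" in
    local|local\[*\]) echo "local" ;;       # local, local[*], local[n]
    spark://*)        echo "standalone" ;;  # spark://host:port
    *)                echo "other" ;;       # e.g. yarn
  esac
}

mode_for_master 'local[*]'
mode_for_master 'spark://host:7077'
```

If the first form is what your application uses, executor settings have no effect, which would explain seeing a single executor with all 8 cores.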
I'm running Spark on an EMR cluster, and I notice that the computation resources are not fully used. Right now there is only 1 worker on each node (m3.xlarge), and only 1 executor on each worker.
I checked the Spark doc at http://spark.apache.org/docs/latest/spark-standalone.html and there is a SPARK_WORKER_INSTANCES setting with which I can set the number of workers per node, but I can't find a setting for the number of executors.
Maybe for YARN I can set --num-executors, but I'm not using YARN.
Does anyone know how to do this?