Spark standalone configuration having multiple executors - apache-spark

I'm trying to set up a standalone Spark 2.0 server to process an analytics function in parallel. To do this I want to have a single worker with multiple executors.
I'm using:
Standalone Spark 2.0
8 Cores
24gig RAM
Windows Server 2008
pyspark (although this appears unrelated)
This is purely a proof of concept, but I want to have 8 executors, one per core.
I've tried to follow the other threads on this topic, but for some reason it's not working for me, for example:
Spark Standalone Number Executors/Cores Control
My configuration is as follows:
conf\spark-defaults.conf
spark.cores.max = 8
spark.executor.cores = 1
I have also tried changing my spark-env.sh file, to no avail. What happens instead is that my single worker shows only 1 executor. As you can see below, the standalone UI still shows 1 executor with all 8 cores assigned to it.

I believe you mixed up local and standalone modes:
Local mode is a development tool where all processes are executed inside a single JVM. An application is started in local mode by setting master to local, local[*] or local[n]. spark.executor.cores and spark.executor.memory are not applicable in local mode because there is only one embedded executor.
Standalone mode requires a standalone Spark cluster. It requires a master node (which can be started using the SPARK_HOME/sbin/start-master.sh script) and at least one worker node (which can be started using the SPARK_HOME/sbin/start-slave.sh script).
The SparkConf should be created with the master node's address (spark://host:port).
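As a minimal sketch (assuming default ports, a placeholder host name my-master-host and a placeholder application file my_app.py, none of which come from the question), this is roughly what bringing up such a cluster and pointing an application at it looks like:
# start the standalone master (its web UI defaults to http://localhost:8080)
$SPARK_HOME/sbin/start-master.sh
# start one worker and register it with the master running on my-master-host
$SPARK_HOME/sbin/start-slave.sh spark://my-master-host:7077
# submit the application against that master instead of local[*]
$SPARK_HOME/bin/spark-submit --master spark://my-master-host:7077 my_app.py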

You first need to configure your spark standalone cluster, then set the amount of resources needed for each individual spark application you want to run.
In order to configure the cluster, you can try this:
In conf/spark-env.sh:
Set SPARK_WORKER_INSTANCES=10, which determines the number of Worker instances (#Executors) per node (its default value is only 1)
Set SPARK_WORKER_CORES=15, the number of cores that one Worker can use (default: all cores; 36 in your case)
Set SPARK_WORKER_MEMORY=55g, the total amount of memory that can be used on one machine (Worker node) for running Spark programs
Copy this configuration file to all worker nodes, into the same folder
Start your cluster by running the scripts in sbin (sbin/start-all.sh, ...)
As you have 5 workers, with the above configuration you should see 5 (workers) * 10 (executors per worker) = 50 alive executors on the master's web interface (http://localhost:8080 by default)
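For the single 8-core, 24GB machine in the question, a hedged adaptation of the same idea might look like the sketch below (values are illustrative; on Windows the equivalent lines would go into conf\spark-env.cmd using set instead of export):
export SPARK_WORKER_INSTANCES=8   # 8 worker JVMs => up to 8 executors per application
export SPARK_WORKER_CORES=1       # each worker offers a single core
export SPARK_WORKER_MEMORY=2g     # 8 x 2g = 16g, leaving headroom for the OS and the driver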
When you run an application in standalone mode, by default, it will acquire all available executors in the cluster. You need to explicitly set the amount of resources for running this application, e.g.:
val conf = new SparkConf()
  .setMaster(...)
  .setAppName(...)
  .set("spark.executor.memory", "2g")
  .set("spark.cores.max", "10")

Related

Should the number of executor core for Apache Spark be set to 1 in YARN mode?

My question: Is it true that when running Apache Spark applications on YARN, with deploy-mode set to either client or cluster, executor-cores should always be set to 1?
I am running an application that processes millions of records on a cluster with 200 data nodes, each having 14 cores. It runs perfectly when I use 2 executor-cores and 150 executors on YARN, but one of the cluster admins is asking me to use 1 executor-core. He is adamant that Spark on YARN should be used with 1 executor-core, because otherwise it will steal resources from other users. He points me to the page in the Apache docs where it says the default value for executor-cores is 1 on YARN.
https://spark.apache.org/docs/latest/configuration.html
So, is it true we should use only 1 for executor-cores?
If the executors use 1 core, aren't they single threaded?
Kind regards,
When we run a Spark application using a cluster manager like YARN, several daemons run in the background, such as the NameNode, Secondary NameNode, DataNode, and the YARN ResourceManager and NodeManagers. So, while specifying num-executors, we need to make sure that we leave aside enough cores (~1 core per node) for these daemons to run smoothly.
The ApplicationMaster is responsible for negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the containers and their resource consumption. If we are running Spark on YARN, then we also need to budget for the resources that the AM needs.
Example
Cluster config:
200 Nodes
14 cores per Node
Leave 1 core per node for Hadoop/Yarn daemons => Num cores available per node = 14-1 = 13
So, Total available of cores in cluster = 13 x 200 = 2600
Let's assign 5 cores per executor => --executor-cores = 5 (for good HDFS throughput)
Number of available executors = (total cores / cores per executor) = 2600 / 5 = 520
Leaving 1 executor for the ApplicationMaster => --num-executors = 519
Please note: this is just a sample recommended configuration; you may wish to revise it based upon the performance of your application. A better practice is also to monitor node resources while you execute your job, as this gives a better picture of the resource utilisation in your cluster.
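For reference, a spark-submit invocation matching the numbers above could look like the following sketch (class name and jar are placeholders; executor memory is left at its default because the example does not derive it):
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 519 --executor-cores 5 \
  --class com.example.MyApp my-app.jar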

How do I run multiple spark applications in parallel in standalone master

Using the Spark (1.6.1) standalone master, I need to run multiple applications on the same Spark master.
Every application submitted after the first one stays in 'WAITING' state. I also observed that the running application holds the sum of all the workers' cores.
I already tried limiting it by using SPARK_EXECUTOR_CORES, but that is a YARN setting, while I am running a standalone master. I tried running many workers on the same master, but every time the first submitted application consumes all the workers.
I was having the same problem on a Spark standalone cluster.
What I found is that somehow one single job was utilising all the resources. We need to cap the resources per application so that there is room to run other jobs as well.
Below is the command I am using to submit a Spark job.
bin/spark-submit --class classname --master spark://hjvm1:6066 --deploy-mode cluster --driver-memory 500M --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test.jar
A crucial parameter for running multiple jobs in parallel on a Spark standalone cluster is spark.cores.max. Note that spark.executor.instances, num-executors and spark.executor.cores alone won't allow you to achieve this on Spark standalone; all your jobs except a single active one will be stuck in WAITING status.
Spark standalone resource scheduling:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time. You can cap the number of cores by setting spark.cores.max ...
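As an illustration of the point above, capping each application is what lets a second one get resources while the first is still running. A sketch reusing the command from this answer (the second class name and jar are hypothetical):
# first application, capped at 1 core as above
bin/spark-submit --class classname --master spark://hjvm1:6066 --deploy-mode cluster \
  --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test.jar
# a second application submitted the same way can now take the remaining cores
bin/spark-submit --class classname2 --master spark://hjvm1:6066 --deploy-mode cluster \
  --conf spark.executor.memory=1g --conf spark.cores.max=1 /data/test2.jar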
I am assuming you run all the workers on one server and are trying to simulate a cluster. The reason for this assumption is that otherwise you could simply use one master and one worker to run a standalone Spark cluster.
The executor cores are something completely different from the normal cores. To set the number of executors you will need YARN, as you said earlier. The executor cores are the number of concurrent tasks an executor can run (when using HDFS it is advisable to keep this below 5) [1].
The cores you want to limit to make the workers run are the "CPU cores". These are specified in the configuration of Spark 1.6.1 [2]. In Spark there is the option to set the number of CPU cores when starting a slave [3]. This is done with -c CORES, --cores CORES, which defines the total CPU cores to allow Spark applications to use on the machine (default: all available); this option applies only to the worker.
The command to start a worker with a core limit would then be something like this (using your master's URL):
./sbin/start-slave.sh spark://<master-host>:7077 --cores 2
Hope this helps
In the configuration settings, add this line to the ./conf/spark-env.sh file:
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
The default maximum number of cores per application is now limited to 1 by the master.
If multiple Spark applications are running, each of them will then take only one core by default. Then define the number of workers and give the workers the same setting:
export SPARK_WORKER_OPTS="-Dspark.deploy.defaultCores=1"
Each worker then has one core as well. Remember this has to be set in the configuration settings for every worker.
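Note that spark.deploy.defaultCores only supplies a default; an individual application can still request more cores by setting spark.cores.max explicitly at submit time, for example (master URL, class name and jar are placeholders):
bin/spark-submit --master spark://master-host:7077 \
  --conf spark.cores.max=4 \
  --class com.example.MyApp my-app.jar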

Spark Dynamic Resource Allocation in a standalone cluster

I have a question/problem regarding dynamic resource allocation.
I am using Spark 1.6.2 with the standalone cluster manager.
I have one worker with 2 cores.
I set the following properties in the spark-defaults.conf file on all my nodes:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.deploy.defaultCores 1
I run a sample application with many tasks.
I open the driver UI on port 4040 and I can verify that the above configuration is in effect.
My problem is that no matter what I do, my application only gets 1 core even though the other core is available.
Is this normal or do i have a problem in my configuration?
The behaviour I want is this:
I have many users working with the same Spark cluster.
I want each application to get a fixed number of cores unless the rest of the cluster is idle.
In that case I want the running applications to get the total number of cores until a new application arrives...
Do I have to switch to Mesos for this?
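A related note on the worker side: the dynamic allocation documentation asks for the external shuffle service to be enabled on the standalone workers themselves; one way to pass that flag to the worker JVMs is via SPARK_WORKER_OPTS in conf/spark-env.sh, as in this sketch:
export SPARK_WORKER_OPTS="-Dspark.shuffle.service.enabled=true"   # run the shuffle service inside each worker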

Multiple executors on same worker spark standalone [duplicate]

I use Spark 1.3.0 in a cluster of 5 worker nodes with 36 cores and 58GB of memory each. I'd like to configure Spark's Standalone cluster with many executors per worker.
I have seen the merged SPARK-1706, however it is not immediately clear how to actually configure multiple executors.
Here is the latest configuration of the cluster:
spark.executor.cores = "15"
spark.executor.instances = "10"
spark.executor.memory = "10g"
These settings are set on a SparkContext when the Spark application is submitted to the cluster.
You first need to configure your spark standalone cluster, then set the amount of resources needed for each individual spark application you want to run.
In order to configure the cluster, you can try this:
In conf/spark-env.sh:
Set SPARK_WORKER_INSTANCES=10, which determines the number of Worker instances (#Executors) per node (its default value is only 1)
Set SPARK_WORKER_CORES=15 # number of cores that one Worker can use (default: all cores; in your case 36)
Set SPARK_WORKER_MEMORY=55g # total amount of memory that can be used on one machine (Worker node) for running Spark programs
Copy this configuration file to all worker nodes, into the same folder
Start your cluster by running the scripts in sbin (sbin/start-all.sh, ...)
As you have 5 workers, with the above configuration you should see 5 (workers) * 10 (executors per worker) = 50 alive executors on the master's web interface (http://localhost:8080 by default)
When you run an application in standalone mode, by default, it will acquire all available Executors in the cluster. You need to explicitly set the amount of resources for running this application:
Eg:
val conf = new SparkConf()
  .setMaster(...)
  .setAppName(...)
  .set("spark.executor.memory", "2g")
  .set("spark.cores.max", "10")
Starting in Spark 1.4 it should be possible to configure this:
Setting: spark.executor.cores
Default: 1 in YARN mode, all the available cores on the worker in standalone mode.
Description: The number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
http://spark.apache.org/docs/1.4.0/configuration.html#execution-behavior
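Applied to the cluster in this question (once on Spark 1.4 or later), that could translate into a submission like the sketch below, where the master URL, class name and jar are placeholders; with 36 cores per worker and 15 cores per executor, at most two executors fit on each worker, so 150 total executor cores spread 10 executors over the 5 workers:
bin/spark-submit --master spark://master-host:7077 \
  --conf spark.executor.cores=15 \
  --executor-memory 10g \
  --total-executor-cores 150 \
  --class com.example.MyApp my-app.jar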
Even today, Apache Spark 2.2 standalone cluster mode deployment does not resolve the issue of the number of executors per worker... but there is an alternative, which is to launch Spark executors manually:
[usr@lcl ~spark/bin]# ./spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@DRIVER-URL:PORT --executor-id val --hostname localhost-val --cores 41 --app-id app-20170914105902-0000-just-exemple --worker-url spark://Worker@localhost-exemple:34117
I hope that helps you!
In standalone mode, by default, all the resources in the cluster are acquired as you launch an application. You need to specify the number of executors you need using the --executor-cores and the --total-executor-cores configs.
For example, if there is 1 worker (1 worker == 1 machine in your cluster, it's a good practice to have only 1 worker per machine) in your cluster which has 3 cores and 3G available in its pool (this is specified in spark-env.sh), when you submit an application with --executor-cores 1 --total-executor-cores 2 --executor-memory 1g, two executors are launched for the application with 1 core and 1g each. Hope this helps!
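The submission described above would look roughly like this, with a placeholder master URL, class name and jar:
bin/spark-submit --master spark://master-host:7077 \
  --executor-cores 1 --total-executor-cores 2 --executor-memory 1g \
  --class com.example.MyApp my-app.jar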


Resources