AWS EMR Spark - CloudWatch - apache-spark

I was running an application on AWS EMR with Spark. Here is the spark-submit job:
Arguments : spark-submit --deploy-mode cluster --class com.amazon.JavaSparkPi s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08
AWS EMR uses YARN for resource management. I had a couple of questions about this while observing the CloudWatch metrics:
1) What does "container allocated" mean here? I am using 1 master and 3 slave/executor nodes (all 4 have 8-core CPUs).
2) I changed my command to:
spark-submit --deploy-mode cluster --executor-cores 4 --class com.amazon.JavaSparkPi s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar s3://spark-config-test/2017-08-08
Here the number of cores running is 3. Should it not be 3 (number of executors) * 4 (number of cores) = 12?

1) "Containers allocated" here essentially represents the number of Spark executors. Spark executor-cores are more like executor tasks, meaning that you could configure your app to run one executor per physical CPU and still ask for 3 executor-cores per CPU (think hyper-threading).
By default on EMR, when you don't specify the number of Spark executors, dynamic allocation is assumed and Spark only asks YARN for the resources it thinks it needs. I tried explicitly setting the number of executors to 10, and the containers allocated went up to 6 (the maximum number of data partitions). Also, under the "Application history" tab, you can get a detailed view of YARN/Spark executors.
2) "Cores" here refers to EMR core nodes, not Spark executor cores. The same goes for "task", which in the monitoring tab refers to EMR task nodes. That is consistent with my setup, as I have 3 EMR slave nodes.
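To make the arithmetic from the question concrete: if the executor count is pinned instead of left to dynamic allocation, the number of parallel task slots Spark asks for is executors times cores per executor. The sketch below only prints the command rather than running it, so it works without a cluster. The variable names and the choice of 3 executors are assumptions for illustration; the flags (--num-executors, --executor-cores, spark.dynamicAllocation.enabled) are standard spark-submit options.

```shell
#!/bin/sh
# Hypothetical sizing values for illustration (not from the original job).
NUM_EXECUTORS=3
EXECUTOR_CORES=4

# Task slots requested from YARN = executors * cores per executor.
task_slots() {
  echo $(( $1 * $2 ))
}

echo "Requested task slots: $(task_slots "$NUM_EXECUTORS" "$EXECUTOR_CORES")"

# Print (rather than run) a submit command that pins the executor count
# instead of relying on EMR's default dynamic allocation.
echo spark-submit --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors "$NUM_EXECUTORS" \
  --executor-cores "$EXECUTOR_CORES" \
  --class com.amazon.JavaSparkPi \
  s3://spark-config-test/SWALiveOrderModelSpark-1.0.assembly.jar \
  s3://spark-config-test/2017-08-08
```

Even with such a command, the CloudWatch "cores" metric would be expected to stay at 3, since it counts EMR core nodes rather than executor cores.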

Related

How to restrict the maximum memory consumption of spark job in EMR cluster?

I ran several streaming and batch Spark jobs in the same EMR cluster. Recently, one batch Spark job was written incorrectly and consumed a lot of memory. It caused the master node to stop responding and all other Spark jobs to get stuck, which means the whole EMR cluster was basically down.
Is there some way to restrict the maximum memory that a Spark job can consume? If a Spark job consumes too much memory, it is fine for that job to fail. However, we do not want the whole EMR cluster to go down.
The Spark jobs run in client mode, with a spark-submit command like the one below.
spark-submit --driver-memory 2G --num-executors 1 --executor-memory 2G --executor-cores 1 --class test.class s3://test-repo/mysparkjob.jar
'Classification': 'yarn-site',
'Properties': {
    'yarn.nodemanager.disk-health-checker.enable': 'true',
    'yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage': '95.0',
    'yarn.nodemanager.localizer.cache.cleanup.interval-ms': '100000',
    'yarn.nodemanager.localizer.cache.target-size-mb': '1024',
    'yarn.nodemanager.pmem-check-enabled': 'false',
    'yarn.nodemanager.vmem-check-enabled': 'false',
    'yarn.log-aggregation.retain-seconds': '12000',
    'yarn.log-aggregation-enable': 'true',
    'yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds': '3600',
    'yarn.resourcemanager.scheduler.class': 'org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler'
Thanks!
You can use yarn.nodemanager.resource.memory-mb, the total amount of memory that YARN can use on a given node.
Example: if your machine has 16 GB of RAM and you set this property to 12 GB, at most 6 executors or drivers will be launched on it (since you are using 2 GB per executor/driver; slightly fewer in practice, because each YARN container also carries some memory overhead), and 4 GB will stay free for background processes.
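The arithmetic in this example can be sketched as follows. The function names are hypothetical. The overhead rule used here, max(384 MiB, 10% of executor memory), is Spark's documented default for spark.executor.memoryOverhead. YARN may additionally round each request up to its allocation increment, so treat the result as an upper bound rather than an exact count.

```shell
#!/bin/sh
# How many containers of a given size fit into the memory YARN may use on one node.
containers_per_node() {
  yarn_mb=$1       # yarn.nodemanager.resource.memory-mb
  container_mb=$2  # memory requested per container
  echo $(( yarn_mb / container_mb ))
}

# Executor heap plus Spark's default overhead: max(384 MiB, 10% of the heap).
with_overhead() {
  heap_mb=$1
  overhead=$(( heap_mb / 10 ))
  if [ "$overhead" -lt 384 ]; then
    overhead=384
  fi
  echo $(( heap_mb + overhead ))
}

YARN_MB=$(( 12 * 1024 ))
echo "Ignoring overhead: $(containers_per_node "$YARN_MB" 2048) containers"
echo "With overhead:     $(containers_per_node "$YARN_MB" "$(with_overhead 2048)") containers"
```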
Option 1:
You can run spark-submit in cluster mode instead of client mode. That way your master node stays free to do other work. You can choose a smaller master instance if you want to save cost.
Advantage: as the Spark driver will be created on a CORE node, you can add auto-scaling to it, and you will be able to use 100% of the cluster's resources. Read more here: Spark yarn cluster vs client - how to choose which one to use?
Option 2:
You can create YARN queues and submit memory-heavy jobs to a separate queue.
Say you configure two queues, Q1 and Q2, and you configure Q1 to take at most 80% of total resources. You submit normal jobs to Q2, as it has no maximum limit, and memory-heavy jobs to Q1.
Cloudera Blog
AWS Blog
Given your requirement, I think Option 1 suits you better. It is easy to implement and needs no infrastructure change.
With Option 2, when we did it on emr-5.26.0, we faced many challenges configuring the YARN queues.

How can we limit the usage of VCores during spark-submit

I am writing a Spark Structured Streaming application in which data processed with Spark needs to be written to an S3 bucket.
This is my development environment:
Hadoop 2.6.0-cdh5.16.1
Spark version 2.3.0.cloudera4
I want to limit the usage of VCores.
So far I have used spark2-submit with the option --conf spark.cores.max=4. However, after submitting the job I observed that it occupied the maximum available VCores in the cluster (my cluster has 12 VCores).
The next job does not start because no VCores are available.
What is the best way to limit the usage of VCores per job?
For now I am using a workaround: I created a resource pool in the cluster and assigned it some resources:
Min Resources : 4 Virtual Cores and 8 GB memory
I used this pool to assign a Spark job and limit its usage of VCores.
e.g. spark2-submit --class org.apache.spark.SparkProgram.rt_app --master yarn --deploy-mode cluster --queue rt_pool_r1 /usr/local/abc/rt_app_2.11-1.0.jar
I want to limit the usage of VCores without any workaround.
I also tried with
spark2-shell --num-executors 1 --executor-cores 1 --jars /tmp/elasticsearch-hadoop-7.1.1.jar
and below is the observation.
You can use the --executor-cores option; it sets the number of cores assigned to each of your executors.
You can refer to 1 and 2.
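One possible reason the cap had no effect: spark.cores.max is documented for standalone and Mesos coarse-grained deployments, not for YARN. On YARN, the vcores a job asks for are roughly executors times cores per executor, plus the driver's cores in cluster mode. If dynamic allocation is enabled, spark.dynamicAllocation.maxExecutors bounds the executor count. The sketch below works through that arithmetic and prints a candidate command without running it. The limits are assumptions chosen for illustration; the class and jar are taken from the question.

```shell
#!/bin/sh
# Upper bound on vcores requested from YARN in cluster mode:
# executors * cores per executor + driver cores.
max_vcores() {
  echo $(( $1 * $2 + $3 ))
}

# Hypothetical limits chosen for illustration.
MAX_EXECUTORS=3
EXECUTOR_CORES=1
DRIVER_CORES=1

echo "Upper bound on vcores: $(max_vcores "$MAX_EXECUTORS" "$EXECUTOR_CORES" "$DRIVER_CORES")"

# Print (rather than run) the submit command with the executor count capped.
echo spark2-submit --master yarn --deploy-mode cluster \
  --conf spark.dynamicAllocation.maxExecutors="$MAX_EXECUTORS" \
  --executor-cores "$EXECUTOR_CORES" \
  --driver-cores "$DRIVER_CORES" \
  --class org.apache.spark.SparkProgram.rt_app \
  /usr/local/abc/rt_app_2.11-1.0.jar
```

With these values the job should stay within 4 of the cluster's 12 VCores, which would leave room for the next job to start.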

Forcing driver to run on specific slave in spark standalone cluster running with "--deploy-mode cluster"

I am running a small spark cluster, with two EC2 instances (m4.xlarge).
So far I have been running the spark master on one node, and a single spark slave (4 cores, 16g memory) on the other, then deploying my spark (streaming) app in client deploy-mode on the master. Summary of settings is:
--executor-memory 16g
--executor-cores 4
--driver-memory 8g
--driver-cores 2
--deploy-mode client
This results in a single executor on my single slave running with 4 cores and 16Gb memory. The driver runs "outside" of the cluster on the master-node (i.e. it is not allocated its resources by the master).
Ideally I'd like to use cluster deploy-mode so that I can take advantage of the supervise option. I have started a second slave on the master node giving it 2 cores and 8g memory (smaller allocated resources so as to leave space for the master daemon).
When I run my Spark job in cluster deploy-mode (using the same settings as above but with --deploy-mode cluster), around 50% of the time I get the desired deployment: the driver runs through the slave on the master node (which has the right resources of 2 cores and 8 GB), leaving the original slave node free to allocate an executor of 4 cores and 16 GB. The other 50% of the time, however, the master runs the driver on the non-master slave node. I then get a driver on that node with 2 cores and 8 GB of memory, which leaves no node with sufficient resources to start an executor (which requires 4 cores and 16 GB).
Is there any way to force the spark master to use a specific worker / slave for my driver? Given spark knows that there are two slave nodes, one with 2 cores and the other with 4 cores, and that my driver needs 2 cores, and my executor needs 4 cores it would ideally work out the right optimal placement, but this doesn't seem to be the case.
Any ideas / suggestions gratefully received!
Thanks!
I can see that this is an old question, but let me answer it anyway; someone might find it useful.
Add the option --driver-java-options="-Dspark.driver.host=<HOST>" to the spark-submit command when submitting the application, and Spark should deploy the driver to the specified host.

Multiple executors on same worker spark standalone

I use Spark 1.3.0 in a cluster of 5 worker nodes with 36 cores and 58GB of memory each. I'd like to configure Spark's Standalone cluster with many executors per worker.
I have seen the merged SPARK-1706, however it is not immediately clear how to actually configure multiple executors.
Here is the latest configuration of the cluster:
spark.executor.cores = "15"
spark.executor.instances = "10"
spark.executor.memory = "10g"
These settings are set on a SparkContext when the Spark application is submitted to the cluster.
You first need to configure your Spark standalone cluster, then set the amount of resources needed for each individual Spark application you want to run.
To configure the cluster, you can try this:
In conf/spark-env.sh:
Set SPARK_WORKER_INSTANCES=10, which determines the number of Worker instances (#Executors) per node (the default is 1)
Set SPARK_WORKER_CORES=15 # number of cores that one Worker can use (default: all cores, in your case 36)
Set SPARK_WORKER_MEMORY=55g # total amount of memory that can be used on one machine (Worker node) for running Spark programs
Copy this configuration file to all Worker nodes, in the same folder
Start your cluster by running the scripts in sbin (sbin/start-all.sh, ...)
As you have 5 worker nodes, with the above configuration you should see 5 (worker nodes) * 10 (Worker instances per node) = 50 alive Workers on the master's web interface (http://localhost:8080 by default)
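A quick sanity check on this sizing may be worthwhile. Each Worker instance advertises its own SPARK_WORKER_CORES and SPARK_WORKER_MEMORY, so, assuming that reading is right, the per-machine totals are the instance count times the per-Worker values. The sketch below uses a hypothetical helper, check_layout, to compare those totals with the 36-core, 58 GB machines from the question. The second layout is an invented example, not a recommendation.

```shell
#!/bin/sh
# Report whether a per-machine worker layout fits the hardware.
# Args: worker instances, cores per worker, GB per worker, machine cores, machine GB.
check_layout() {
  total_cores=$(( $1 * $2 ))
  total_gb=$(( $1 * $3 ))
  if [ "$total_cores" -le "$4" ] && [ "$total_gb" -le "$5" ]; then
    echo "fits: ${total_cores} cores, ${total_gb} GB"
  else
    echo "oversubscribed: ${total_cores} cores, ${total_gb} GB"
  fi
}

# Values from the steps above on a 36-core, 58 GB machine.
check_layout 10 15 55 36 58
# A hypothetical layout that stays within the machine: 2 workers x 18 cores x 27 GB.
check_layout 2 18 27 36 58
```

If the first call reports oversubscription, lowering the per-Worker cores and memory so that the totals fit the machine is one way to keep the Workers from competing for the same resources.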
When you run an application in standalone mode, by default, it will acquire all available Executors in the cluster. You need to explicitly set the amount of resources for running this application:
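E.g.:
import org.apache.spark.SparkConf

// Cap what this application takes from the standalone cluster.
val conf = new SparkConf()
  .setMaster(...)
  .setAppName(...)
  .set("spark.executor.memory", "2g") // memory per executor
  .set("spark.cores.max", "10")       // total cores this application may use across the cluster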
Starting in Spark 1.4 it should be possible to configure this:
Setting: spark.executor.cores
Default: 1 in YARN mode, all the available cores on the worker in standalone mode.
Description: The number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
http://spark.apache.org/docs/1.4.0/configuration.html#execution-behavior
As of Apache Spark 2.2, standalone cluster mode deployment still does not resolve the issue of the number of executors per worker, but there is an alternative: launch the Spark executors manually:
[usr@lcl ~spark/bin]# ./spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@DRIVER-URL:PORT --executor-id val --hostname localhost-val --cores 41 --app-id app-20170914105902-0000-just-exemple --worker-url spark://Worker@localhost-exemple:34117
I hope that helps!
In standalone mode, by default, all the resources on the cluster are acquired when you launch an application. You need to specify the number of executors you need using the --executor-cores and --total-executor-cores options.
For example, suppose there is 1 worker in your cluster (1 worker == 1 machine; it's good practice to have only 1 worker per machine) with 3 cores and 3G available in its pool (this is specified in spark-env.sh). When you submit an application with --executor-cores 1 --total-executor-cores 2 --executor-memory 1g, two executors are launched for the application, with 1 core and 1g each. Hope this helps!
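The example above can be worked through as a small calculation. This is a simplified sketch for a single worker, with a hypothetical function name: the executor count is taken as the smaller of what the core budget allows and what the worker's memory pool allows.

```shell
#!/bin/sh
# Executors launched for one application on a single standalone worker,
# limited by both the core budget and the memory available in the worker pool.
executors_launched() {
  total_cores=$1     # --total-executor-cores
  executor_cores=$2  # --executor-cores
  pool_gb=$3         # memory available in the worker's pool
  executor_gb=$4     # --executor-memory
  by_cores=$(( total_cores / executor_cores ))
  by_mem=$(( pool_gb / executor_gb ))
  if [ "$by_cores" -lt "$by_mem" ]; then
    echo "$by_cores"
  else
    echo "$by_mem"
  fi
}

# The example above: 3 GB pool, 1 core and 1 GB per executor, 2 cores in total.
echo "Executors: $(executors_launched 2 1 3 1)"
```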
