Apache Spark: setting executor instances - apache-spark

I run my Spark application on YARN with parameters:
in spark-defaults.conf:
spark.master yarn-client
spark.driver.cores 1
spark.driver.memory 1g
spark.executor.instances 6
spark.executor.memory 1g
in yarn-site.xml:
yarn.nodemanager.resource.memory-mb 10240
All other parameters are set to default.
I have a 6-node cluster and the Spark Client component is installed on each node.
Every time I run the application, only 2 executors and 1 driver are visible in the Spark UI. The executors appear on different nodes.
Why can't Spark create more executors? Why are there only 2 instead of 6?
I found a very similar question: Apache Spark: setting executor instances does not change the executors, but increasing the memory-mb parameter didn't help in my case.

The configuration looks OK at first glance.
Make sure that you have modified the proper spark-defaults.conf file.
Execute echo $SPARK_HOME for the current user and verify that the modified spark-defaults.conf file is in the $SPARK_HOME/conf/ directory. Otherwise Spark cannot see your changes.
In my case I had modified the wrong spark-defaults.conf file: there were two users on my system and each had a different $SPARK_HOME directory set (I didn't know that before). That's why the settings had no effect for one of the users.
You can also run spark-shell or spark-submit with the argument --num-executors 6 (if you want 6 executors). If Spark then creates more executors than before, you can be sure the problem is not memory but a configuration file that is not being read.
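For example, a quick sanity check might look like this (the paths and the application jar below are placeholders; the numbers and the yarn-client master just mirror the settings from the question):
echo $SPARK_HOME
ls $SPARK_HOME/conf/spark-defaults.conf
spark-submit --master yarn-client --num-executors 6 --executor-memory 1g /path/to/your-app.jar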

Related

Spark on EMR-5.32.0 not spawning requested executors

I am running into some problems in (Py)Spark on EMR (release 5.32.0). Approximately a year ago I ran the same program on an EMR cluster (I think the release must have been 5.29.0). Then I was able to configure my PySpark program using spark-submit arguments properly. However, now I am running the same/similar code, but the spark-submit arguments do not seem to have any effect.
My cluster configuration:
master node: 8 vCore, 32 GiB memory, EBS only storage EBS Storage:128 GiB
slave nodes: 10 x 16 vCore, 64 GiB memory, EBS only storage EBS Storage:256 GiB
I run the program with the following spark-submit arguments:
spark-submit --master yarn --conf "spark.executor.cores=3" --conf "spark.executor.instances=40" --conf "spark.executor.memory=8g" --conf "spark.driver.memory=8g" --conf "spark.driver.maxResultSize=8g" --conf "spark.dynamicAllocation.enabled=false" --conf "spark.default.parallelism=480" update_from_text_context.py
I did not change anything in the default configurations on the cluster.
Below is a screenshot of the Spark UI, which shows only 10 executors, whereas I expect 40 executors to be available...
I tried different spark-submit arguments to make sure the error was unrelated to Apache Spark: setting executor instances does not change the executors. I tried a lot of things, and nothing seems to help.
I am a little lost here, could someone help?
UPDATE:
I ran the same code on EMR release label 5.29.0, and there the conf settings in the spark-submit arguments do seem to take effect.
Why is this happening?
Sorry for the confusion, but this is intentional. On emr-5.32.0, Spark+YARN will coalesce multiple executor requests that land on the same node into a larger executor container. Note how even though you had fewer executors than you expected, each of them had more memory and cores than you had specified. (There's one asterisk here, though, that I'll explain below.)
This feature is intended to provide better performance by default in most cases. If you would really prefer to keep the previous behavior, you may disable this new feature by setting spark.yarn.heterogeneousExecutors.enabled=false, though we (I am on the EMR team) would like to hear from you about why the previous behavior is preferable.
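For example, disabling the feature on the command line might look like this (the other arguments are just taken from your spark-submit above):
spark-submit --master yarn \
  --conf "spark.yarn.heterogeneousExecutors.enabled=false" \
  --conf "spark.executor.instances=40" \
  --conf "spark.executor.cores=3" \
  --conf "spark.executor.memory=8g" \
  update_from_text_context.py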
One thing that doesn't make sense to me, though, is that you should be ending up with the same total number of executor cores that you would have had without this feature, but that doesn't seem to have occurred in the example you shared. You asked for 40 executors with 3 cores each but got 10 executors with 15 cores each (150 cores instead of 120), which is a bit more in total. This may have to do with the way your requested spark.executor.memory of 8g divides into the memory available on your chosen instance type, which I'm guessing is probably m5.4xlarge. One thing that may help is to remove all of your overrides for spark.executor.memory/cores/instances and just use the defaults. Our hope is that the defaults will give the best performance in most cases. If not, like I said above, please let us know so that we can improve further!
OK, in case someone is facing the same problem: as a workaround you can simply revert to a previous version of EMR. In my case I went back to EMR release label 5.29.0, which solved all my problems, and suddenly I was able to configure the Spark job again!
Still I am not sure why it doesn't work in EMR release label 5.32.0. So if someone has suggestions, please let me know!

How to increase the number of executors in Spark Standalone mode if spark.executor.instances and spark.cores.max aren't working

I looked everywhere but couldn't find the answer I need. I'm running Spark 1.5.2 in standalone mode with SPARK_WORKER_INSTANCES=1, because I only want 1 executor per worker per host. What I would like is to increase the number of hosts for my job and hence the number of executors. I've tried changing spark.executor.instances and spark.cores.max in spark-defaults.conf, but I still see the same number of executors. People suggest changing --num-executors; is that not the same as spark.executor.instances?
This Cloudera blog post
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ says "The --num-executors command-line flag or spark.executor.instances configuration property control the number of executors requested. Starting in CDH 5.4/Spark 1.3, you will be able to avoid setting this property by turning on dynamic allocation with the spark.dynamicAllocation.enabled property"
but I"m not sure if spark.dynamicAllocation.enabled only works for YARN.
Any suggestion on how to do this for Spark 1.5.2 is great appreciated!
I don't believe you need to set SPARK_WORKER_INSTANCES at all. If you do want to use it, you also need to set the SPARK_WORKER_CORES environment variable; otherwise you will end up with one worker consuming all the cores, and the other workers can't be launched correctly.
I haven't seen spark.executor.instances used outside of a YARN configuration with Spark.
That said, I would definitely suggest using --num-executors if your cluster has multiple workers.
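As a rough sketch of the standalone-mode knobs (the host name and numbers below are illustrative assumptions, not recommendations): limit the cores each worker offers with SPARK_WORKER_CORES in conf/spark-env.sh, and cap what a single application may take with --total-executor-cores (spark.cores.max). With one executor per worker, this indirectly spreads the job across more hosts.
# conf/spark-env.sh on every worker host (illustrative values)
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_CORES=8
# cap the total cores this application may use across the cluster
spark-submit --master spark://master-host:7077 --total-executor-cores 32 /path/to/your-app.jar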

Default number of executors and cores for spark-shell

If I run a spark program in spark shell, is it possible that the program can hog the entire hadoop cluster for hours?
Usually there are settings called num-executors and executor-cores, for example:
spark-shell --driver-memory 10G --executor-memory 15G --executor-cores 8
But if they are not specified and I just run "spark-shell", will it consume the entire cluster, or are there reasonable defaults?
The default values for most configuration properties can be found in the Spark Configuration documentation. For the configuration properties on your example, the defaults are:
spark.driver.memory = 1g
spark.executor.memory = 1g
spark.executor.cores = 1 in YARN mode, all the available cores on the worker in standalone mode.
Additionally, you can override these defaults by creating the file $SPARK_HOME/conf/spark-defaults.conf with the properties you want (as described here). Then, if the file exists with the desired values, you don't need to pass them as arguments to the spark-shell command.
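For instance, a minimal spark-defaults.conf mirroring the values from the spark-shell example above might look like this:
spark.driver.memory    10g
spark.executor.memory  15g
spark.executor.cores   8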

Why does vcore always equal the number of nodes in Spark on YARN?

I have a Hadoop cluster with 5 nodes, each of which has 12 cores with 32GB memory. I use YARN as MapReduce framework, so I have the following settings with YARN:
yarn.nodemanager.resource.cpu-vcores=10
yarn.nodemanager.resource.memory-mb=26100
Then the cluster metrics shown on my YARN cluster page (http://myhost:8088/cluster/apps) displayed that VCores Total is 40. This is pretty fine!
Then I installed Spark on top of it and used spark-shell in yarn-client mode.
I ran one Spark job with the following configuration:
--driver-memory 20480m
--executor-memory 20000m
--num-executors 4
--executor-cores 10
--conf spark.yarn.am.cores=2
--conf spark.yarn.executor.memoryOverhead=5600
I set --executor-cores to 10 and --num-executors to 4, so logically there should be 40 VCores Used in total. However, when I check the same YARN cluster page after the Spark job has started running, there are only 4 VCores Used and 4 VCores Total.
I also found that there is a parameter in capacity-scheduler.xml - called yarn.scheduler.capacity.resource-calculator:
"The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc."
I then changed that value to DominantResourceCalculator.
But when I restarted YARN and ran the same Spark application, I still got the same result: the cluster metrics still showed that VCores Used is 4. I also checked the CPU and memory usage on each node with htop and found that none of the nodes had all 10 CPU cores fully occupied. What can be the reason?
I also tried running the same Spark job in a fine-grained way, i.e. with --num-executors 40 --executor-cores 1; when I checked the CPU status on each worker node in that case, all CPU cores were fully occupied.
I was wondering the same thing, but changing the resource calculator worked for me. This is how I set the property:
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Check in the YARN UI how many containers and vcores are assigned to the application. With the change, the number of containers should be executors + 1, and the vcores should be (executor-cores * num-executors) + 1.
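Plugging in the numbers from the question above (--num-executors 4, --executor-cores 10, spark.yarn.am.cores=2) as a rough check:
containers = 4 executors + 1 ApplicationMaster = 5
vcores     = 4 * 10 executor cores + 2 AM cores = 42
(The formula's +1 assumes a 1-core AM; with spark.yarn.am.cores=2 as in the question it becomes +2, hence 42.)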
Without setting the YARN scheduler to FairScheduler, I saw the same thing. The Spark UI showed the right number of tasks, though, suggesting nothing was actually wrong, and my cluster showed close to 100% CPU usage, which confirmed this.
After setting FairScheduler, the YARN resource numbers looked correct.
Your executors take 10 cores each, plus 2 cores for the Application Master: 4 * 10 + 2 = 42 cores requested, while you only have 40 vCores in total.
Reduce the executor cores to 8 and make sure to restart each NodeManager.
Also modify yarn-site.xml and set these properties:
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.minimum-allocation-vcores
yarn.scheduler.maximum-allocation-vcores
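A sketch of how these might look in yarn-site.xml (the values are illustrative; the maximums simply match the per-node resources configured earlier in the question, 26100 MB and 10 vcores):
<!-- illustrative values; cap maximums at the per-NodeManager resources -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>26100</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>10</value>
</property>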

Separate logs from Apache spark

I would like to have separate log files for workers, masters and jobs (executors and driver/submitted applications; I'm not sure what to call them). I tried a configuration in log4j.properties like
log4j.appender.myAppender.File=/some/log/dir/${log4j.myAppender.FileName}
and then passing log4j.myAppender.FileName in SPARK_MASTER_OPTS, SPARK_WORKER_OPTS, spark.executor.extraJavaOptions and spark.driver.extraJavaOptions.
It works perfectly well for workers and masters but fails for executors and drivers. Here is an example of how I use these:
./spark-submit ... --conf "\"spark.executor.extraJavaOptions=log4j.myAppender.FileName=myFileName some.other.option=foo\"" ...
I also tried putting log4j.myAppender.FileName with some default value in spark-defaults.conf, but that doesn't work either.
Is there some way to achieve what I want?
Logging for executors and drivers can be configured via conf/spark-defaults.conf by adding these entries (taken from my Windows config):
spark.driver.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-driver.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:C:/dev/programs/spark-1.2.0/conf/log4j-executor.properties
Note that each entry above references a different log4j.properties file so you can configure them independently.
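For reference, a minimal log4j-executor.properties along those lines might look like the following (the log path is a placeholder; the syntax is plain log4j 1.x, which Spark 1.2 uses):
# Send executor output to its own file
log4j.rootLogger=INFO, executorFile
log4j.appender.executorFile=org.apache.log4j.FileAppender
log4j.appender.executorFile.File=/some/log/dir/executor.log
log4j.appender.executorFile.layout=org.apache.log4j.PatternLayout
log4j.appender.executorFile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n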
