In our application, we have submitted a Spark job with the following configuration values:
'--num-executors' (or) 'spark.executor.instances' - not set
'spark.executor.cores' - not set
'spark.dynamicAllocation.enabled' - 'true'
'spark.executor.memory' - '1g'
(Number of worker nodes available: 3, with 4 vCores each)
On the 'Environment' page of the Spark Web UI, the following values are observed:
'spark.executor.instances' - '3'
'spark.executor.cores' - '4'
Can we assume that the values shown above for 'spark.executor.instances' (3) and
'spark.executor.cores' (4) are only the initial values?
The reason for this assumption is:
From the 'Executors' page it can be observed that a total of '14' executors are used.
From the 'Event Timeline' it can be observed that, at one moment, a maximum of '8' executors are running. Since only '12' cores (3 x 4) are available in total, it looks like the number of cores used per executor will also not be constant at runtime, i.e. it starts at '4' initially but decreases as the number of executors increases!
Your post covers 2 questions:
Are the initial values of spark.executor.instances and spark.executor.cores 3 and 4, respectively? It depends on which mode you are using. Based on the configuration you provided, where spark.dynamicAllocation.enabled is set to true, and given that you have 3 nodes with 4 cores each, Spark will scale the number of executors based on the workload. Also, if you're running your Spark application on YARN, the default value of spark.executor.cores is 1. As you didn't mention your mode or the number of Spark applications running at the same time, I assume you're only running a single Spark job and you're not running in YARN mode. You can look up the configuration options you pass in here: https://spark.apache.org/docs/latest/configuration.html
Will the number of cores and executors differ from what you configured in spark-submit when the number of executors increases? No, once you have submitted your Spark application and created the SparkContext, the number of executors and cores will not change unless you create a new one.
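If you want the executor count to stay within a predictable range even with dynamic allocation on, you can set the allocation bounds explicitly. Below is a minimal PySpark sketch, not the poster's actual job; the application name is made up, and depending on your cluster manager and Spark version, dynamic allocation may also require the external shuffle service or shuffle tracking:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-sketch")                       # hypothetical name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.initialExecutors", "3")    # start with 3
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "3")        # never exceed 3
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "1g")
    # On Spark 3.0+ without an external shuffle service, shuffle tracking
    # is one way to satisfy dynamic allocation's requirements.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)

# These are the same values the 'Environment' page of the web UI reports.
print(spark.sparkContext.getConf().get("spark.executor.cores"))

With the bounds pinned like this, the 'Executors' page should never report more than maxExecutors running at once.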
Related
Currently, I am running a Spark job through EMR and working on Spark tuning. I read about the number of executors per instance, memory calculation, etc., and I got confused by the current setup.
Currently it uses spark.dynamicAllocation.enabled set to true by default by EMR, and spark.executor.cores is 4 (not set by me, so I assume it is the default). It also uses one r6.xlarge instance (32 GiB of memory, 4 vCPUs) for the master node and two for the core nodes.
In this case, based on the formula Number of executors per instance = (total number of virtual cores per instance - 1) / spark.executor.cores, I get (4 - 1) / 4 = 0. Would that be correct?
When I check the Spark UI, it has added many executors. What information did I miss here?
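For reference, here is the sizing formula from the question written out in Python; the inputs simply restate the question's own numbers (4 vCores per instance, spark.executor.cores = 4) and say nothing about what EMR's dynamic allocation actually does at runtime:

import math

vcores_per_instance = 4   # r6.xlarge core node, as in the question
executor_cores = 4        # spark.executor.cores reported in the Spark UI

# This rule of thumb reserves one vCore per instance for YARN/OS overhead.
executors_per_instance = math.floor((vcores_per_instance - 1) / executor_cores)
print(executors_per_instance)   # 0 -- which is exactly why the result looks wrong here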
Say I have a total of 4 cores.
What happens if I define the number of executors as 8?
Can we share a single core between 2 executors?
Can the number of cores for an executor be a fraction?
What is the impact on performance with this kind of configuration?
This is what I observed in Spark standalone mode:
The total number of cores on my system is 4.
If I execute the spark-shell command with spark.executor.cores=2,
then 2 executors will be created, with 2 cores each.
But if I configure the number of executors to be more than the available cores,
then only one executor will be created, with the maximum number of cores the system has.
The number of cores will never be a fractional value.
If you assign a fractional value in the configuration, you will end up with an exception.
Feel free to edit/correct the post if anything is wrong.
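As an addendum, here is a rough sketch of the same experiment from PySpark instead of spark-shell (the standalone master URL below is a placeholder, and the resulting executor layout depends on your worker's cores and memory):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://myurl:7077")            # placeholder standalone master URL
    .config("spark.executor.cores", "2")     # must be an integer; fractions are rejected
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)

# In standalone mode the scheduler keeps carving executors of
# spark.executor.cores cores out of each worker until it runs out of cores
# or memory, so a 4-core worker should end up with 2 executors of 2 cores each.
print(spark.sparkContext.defaultParallelism)   # total cores granted to this application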
I'm running a Spark job on a Google Dataproc cluster (3 nodes of type n1-highmem-4, so 4 cores and 26GB each, with the same type for the master).
I have a few questions about the information displayed on the Hadoop and Spark UIs:
1)
When I check the Hadoop UI I get this:
My question here is: my total RAM is supposed to be 78GB (3 x 26), so why is only 60GB displayed here? Is 18GB used for something else?
2)
This is the screen showing currently launched executors.
My questions are:
Why are only 10 cores used? Shouldn't we be able to launch a 6th executor using the 2 remaining cores, since we have 12 and 2 seem to be used per executor?
Why 2 cores per executor? Does it change anything if we run 12 executors with 1 core each instead?
What is the "Input" column? The total volume of data each executor received to analyze?
3)
This is a screenshot of the "Storage" panel. I see the DataFrame I'm working on.
I don't understand the "Size in Memory" column. Is it the total RAM used to cache the DataFrame? It seems very low compared to the size of the raw files I load into the DataFrame (500GB+). Is my interpretation wrong?
Thanks to anyone who reads this!
If you take a look at this answer, it mostly covers your questions 1 and 2.
To sum up, the total memory is less because some memory is reserved to run the OS and system daemons, as well as the Hadoop daemons themselves, e.g. the NameNode and NodeManager.
It is similar for cores: in your case there are 3 nodes, each node runs 2 executors, and each executor uses 2 cores, except on the node where the application master lives. That node gets only one executor, and the remaining cores are given to the application master. That's why you see only 5 executors and 10 cores.
For your 3rd question, that number should be the memory used by the cached partitions of that RDD, which is approximately equal to the memory allocated to each executor, in your case ~13GB.
Note that Spark doesn't load your 500GB of data at once; instead it loads the data in partitions, and the number of partitions loaded concurrently depends on the number of cores you have available.
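To see this yourself, you can check how many partitions the DataFrame has and what actually ends up cached; a small sketch, assuming a hypothetical GCS path and an already-created SparkSession called spark (as in the Dataproc pyspark shell):

# Hypothetical input path; replace with your own data.
df = spark.read.parquet("gs://my-bucket/raw-files/")

# Spark plans the read in partitions rather than pulling in all 500GB at once.
print(df.rdd.getNumPartitions())

# Caching materializes partitions in a compressed, columnar in-memory format;
# the "Size in Memory" column on the Storage tab reflects these cached blocks,
# not the size of the raw files on disk.
df.cache()
print(df.count())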
Why does the Spark UI show only 6 cores available per worker (not the number of cores used), while I have 16 on each of my 3 machines (8 sockets * 2 cores/socket), or even 32 if we take into account the number of threads per core (2)? I tried to set SPARK_WORKER_CORES in the spark-env.sh file but it changes nothing (I made the change on all 3 workers). I also commented the line out to see if it changes anything: the number of cores available is always stuck at 6.
I'm using Spark 2.2.0 in a standalone cluster:
pyspark --master spark://myurl:7077
Result of the lscpu command:
I've found that I simply had to stop the master and the workers and then restart them so that the SPARK_WORKER_CORES parameter was picked up.
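One way to confirm the new value took effect after the restart is to check, from inside a session, how many cores the application was actually granted; a minimal sketch, reusing the placeholder master URL from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("spark://myurl:7077").getOrCreate()
sc = spark.sparkContext

# In standalone mode, defaultParallelism equals the total number of cores
# granted across all executors, so it should increase once the restarted
# workers advertise the new SPARK_WORKER_CORES value.
print(sc.defaultParallelism)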
TL;DR
The Spark UI shows a different number of cores and a different amount of memory than what I ask for when using spark-submit.
More details:
I'm running Spark 1.6 in standalone mode.
When I run spark-submit I pass it 1 executor instance with 1 core for the executor and also 1 core for the driver.
What I would expect to happen is that my application would be run with 2 cores in total.
When I check the Environment tab on the UI, I see that it received the correct parameters I gave it; however, it still seems like it's using a different number of cores. You can see it here:
This is my spark-defaults.conf that I'm using:
spark.executor.memory 5g
spark.executor.cores 1
spark.executor.instances 1
spark.driver.cores 1
Checking the Environment tab on the Spark UI shows that these are indeed the received parameters, but the UI still shows something else.
Does anyone have any idea what might cause Spark to use a different number of cores than the number I pass it? I obviously tried googling it but didn't find anything useful on the topic.
Thanks in advance
TL;DR
Use spark.cores.max instead to define the total number of cores available, and thus limit the number of executors.
In standalone mode, a greedy strategy is used: as many executors will be created as the cores and memory available on your workers allow.
In your case, you specified 1 core and 5GB of memory per executor.
Spark will therefore calculate the following:
As there are 8 cores available, it will try to create 8 executors.
However, as there is only 30GB of memory available, it can only create 6 executors: each executor will have 5GB of memory, which adds up to 30GB.
Therefore, 6 executors will be created, and a total of 6 cores will be used with 30GB of memory.
Spark basically fulfilled what you asked for. In order to achieve what you want, you can make use of the spark.cores.max option documented here and specify the exact number of cores you need.
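A minimal sketch of that approach in PySpark, assuming the same standalone setup (the master URL is a placeholder; the equivalent spark-submit flag would be --conf spark.cores.max=2):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://myurl:7077")          # placeholder standalone master URL
    .config("spark.cores.max", "2")        # hard cap on the total cores this app may take
    .config("spark.executor.cores", "1")
    .config("spark.executor.memory", "5g")
    .config("spark.driver.cores", "1")
    .getOrCreate()
)

With spark.cores.max set, the standalone scheduler stops handing out executors once the cap is reached, regardless of how much memory remains on the workers.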
A few side notes:
spark.executor.instances is a YARN-only configuration
spark.driver.cores already defaults to 1
I am also working on making the number of executors in standalone mode easier to reason about; this might get integrated into a future release of Spark and will hopefully help you figure out exactly how many executors you are going to have, without having to calculate it on the fly.