I am running spark-submit like this:
spark-submit --class com.mine.myclass --master yarn-cluster --num-executors 3 --executor-memory 4G spark-examples_2.10-1.0.jar
In the web UI I can indeed see 3 executor nodes, but each shows 2G of memory. When I set --executor-memory 2G, the UI shows 1G per node.
Why does it cut my setting roughly in half?
The Executors page of the web UI shows the amount of storage memory, which by default is 54% of the Java heap (spark.storage.safetyFraction 0.9 * spark.storage.memoryFraction 0.6).
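For example, with those defaults the numbers reported above line up roughly as expected:
4 GB heap * 0.9 * 0.6 ≈ 2.16 GB, displayed as ~2 GB
2 GB heap * 0.9 * 0.6 ≈ 1.08 GB, displayed as ~1 GB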
Using something as simple as:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --executor-memory 3g --executor-cores 1 --num-executors 1 examples/jars/spark-examples_2.11-2.4.7-amzn-0.jar 1000
I can see on the Hadoop dashboard that "vCores Used" is greater than "vCores Total".
How could this happen?
This is because the default resource calculator uses memory as the only unit and ignores the cores setting.
See this:
When you use the default resource calculator (DefaultResourceCalculator) resources are allocated based on memory alone.
...and this:
The DefaultResourceCalculator only takes memory into account when doing its calculations. This is why CPU requirements are ignored when carrying out allocations in the CapacityScheduler by default. All the math of allocations is reduced to just examining the memory required by resource-requests and the memory available on the node that is being looked at during a specific scheduling-cycle.
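If you want CPU to be taken into account as well, one option (assuming you are on the CapacityScheduler) is to switch the resource calculator in capacity-scheduler.xml and restart the ResourceManager:
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>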
I have a data size of 5 TB, and I am considering r4.8xlarge EC2 machines, which have 244 GB of memory and 32 vCPUs each.
How do I now decide the number of cores and executors to use?
I tried a few combinations of cores and executors, but the Spark job fails with a heap space issue. I have listed the configuration below, excluding other parameters:
--master yarn
--conf spark.yarn.executor.memoryOverhead=4000
--driver-memory 25G
--executor-memory 240G
--executor-cores 26
--num-executors 13
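A quick back-of-the-envelope check with the numbers above shows why this combination cannot fit on a 244 GB node:
240 GB (--executor-memory) + ~4 GB (spark.yarn.executor.memoryOverhead) ≈ 244 GB requested per YARN container
which is more than YARN can actually hand out once the OS and Hadoop daemons have taken their share of the 244 GB.
As a rough starting point (illustrative only, not a tuned recommendation), several smaller executors per node usually work better, e.g. --executor-cores 5 with the executor memory chosen so that memory plus overhead fits inside yarn.nodemanager.resource.memory-mb.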
So I have a Spark standalone server with 16 cores and 64 GB of RAM. I have both the master and a worker running on the server. I don't have dynamic allocation enabled, and I am on Spark 2.0.
What I don't understand is that when I submit my job and specify:
--num-executors 2
--executor-cores 2
Only 4 cores should be taken up. Yet when the job is submitted, it takes all 16 cores and spins up 8 executors regardless, bypassing the num-executors parameter. But if I change executor-cores to 4, it adjusts accordingly and 4 executors spin up.
Disclaimer: I really don't know if --num-executors should work or not in standalone mode. I haven't seen it used outside YARN.
Note: As pointed out by Marco, --num-executors is no longer in use on YARN.
You can effectively control the number of executors in standalone mode with static allocation (this works on Mesos as well) by combining spark.cores.max and spark.executor.cores, where the number of executors is determined as:
floor(spark.cores.max / spark.executor.cores)
For example:
--conf "spark.cores.max=4" --conf "spark.executor.cores=2"
I have a 10-node cluster: 8 DNs (256 GB, 48 cores each) and 2 NNs. I am submitting a Spark SQL job to the YARN cluster. Below are the parameters I used for spark-submit:
--num-executors 8 \
--executor-cores 50 \
--driver-memory 20G \
--executor-memory 60G \
As can be seen above, executor-memory is 60 GB, but when I check the Spark UI it shows 31 GB.
1) Can anyone explain why it is showing 31 GB instead of 60 GB?
2) Also, please help with setting optimal values for the parameters mentioned above.
I think the allocated memory gets divided into two parts:
1. Storage (caching DataFrames/tables)
2. Processing (the one you can see)
31 GB is the memory available for processing.
Play around with the spark.memory.fraction property to increase/decrease the memory available for processing.
I would suggest reducing the executor cores to about 8-10.
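For example (a sketch only; the right fraction and core count depend on your job and how much you cache, and the class/jar names are placeholders), both can be set at submit time:
spark-submit --master yarn --num-executors 8 --executor-cores 10 --executor-memory 60G --conf spark.memory.fraction=0.6 --class YourSqlJob your-sql-job.jar
0.6 is the default in recent Spark versions; raise or lower it and measure the effect on your job.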
My configuration:
spark-shell --executor-memory 40g --executor-cores 8 --num-executors 100 --conf spark.memory.fraction=0.2
I am using Apache Spark in YARN client mode.
I have 4 worker PCs with 8 vcpus each and 30 GB of RAM in my Spark cluster.
I set my executor memory to 2G and the number of instances to 33.
My job takes 10 hours to run and all machines are about 80% idle.
I don't understand the correlation between executor memory and executor instances. Should I have an instance per vcpu? Should I set the executor memory to the machine's memory divided by the number of executors per machine?
I believe that you have to use the following command:
spark-submit --num-executors 4 --executor-memory 7G --driver-memory 2G --executor-cores 8 --class "YourClassName" --master yarn-client
The number of executors should be 4, since you have 4 workers. The executor memory should be close to the maximum memory that each YARN node has allocated to containers, roughly 5-6 GB (I assume you have 30 GB of total RAM).
You should take a look at the spark-submit parameters and make sure you fully understand them.
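As a rough check on the core math (using the 4 workers with 8 vcpus each from the question): 4 executors * 8 cores = 32 cores, i.e. one executor per worker using all of its vcpus, instead of 33 small 2G executors competing for the same CPUs.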
We were using Cassandra as our data source for Spark. The problem was that there were not enough partitions; we needed to split the data up more. Our mapping from Cassandra partitions to Spark partitions was not fine-grained enough, and we would only generate 10 or 20 tasks instead of hundreds of tasks.
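As an illustration of one way to get more tasks (the property name depends on the Spark Cassandra Connector version, and the numbers here are made up), lowering the connector's input split size makes it create more Spark partitions:
spark-submit --master yarn --conf spark.cassandra.input.split.size_in_mb=32 --class YourCassandraJob your-cassandra-job.jar
Alternatively, calling repartition() on the DataFrame/RDD right after loading it also forces a higher task count.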