Very few executors are running on the cluster - apache-spark

I have a CDH5 lab environment with 6 nodes, node[1-6], and node7 as the NameNode.
node[1-5]: 8 GB RAM, 2 cores
node[6]: 32 GB RAM, 8 cores
I am new to Spark and I am simply trying to count the number of lines in our data. I have uploaded the data to HDFS (5.3 GB).
When I submit my Spark job, it only runs 2 executors, and I can see it splits the work into 161 tasks (there are 161 files in the directory).
In the code, I am reading all the files and doing the count on them.
data_raw = sc.textFile(path)
print data_raw.count()
On CLI: spark-submit --master yarn-client file_name.py --num-executors 6 --executor-cores 1
It should run with 6 executors, each running 1 task at a time, but I only see 2 executors running. I am not able to figure out the cause.
Any help would be greatly appreciated.

The correct way to submit the job is:
spark-submit --num-executors 6 --executor-cores 1 --master yarn-client file_name.py
spark-submit only parses options that appear before the application file; everything after file_name.py is passed to the script as its own arguments, so in the original command --num-executors and --executor-cores were silently ignored. With the options moved in front of the script, all the executors now show up.
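For completeness, the same settings can also be baked into the application itself instead of the command line. A minimal sketch, assuming PySpark in YARN client mode; the app name and HDFS path are placeholders:

from pyspark import SparkConf, SparkContext

# Programmatic equivalent of --num-executors 6 --executor-cores 1 --master yarn-client
conf = (SparkConf()
        .setAppName("line-count")                 # placeholder app name
        .setMaster("yarn-client")                 # same master as the CLI example above
        .set("spark.executor.instances", "6")     # number of executors
        .set("spark.executor.cores", "1"))        # concurrent tasks per executor

sc = SparkContext(conf=conf)
data_raw = sc.textFile("hdfs:///path/to/data")    # placeholder path
print(data_raw.count())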

I suspect only 2 nodes are running Spark. Go to Cloudera Manager -> Clusters -> Spark -> Instances to confirm.

Related

Spark on Yarn Number of Cores in EMR Cluster

I have an EMR cluster for Spark with 2 instances of the below configuration:
r4.2xlarge
8 vCores
So my total is 16 vCores, and the same is reflected in the YARN VCores count.
I have submitted a Spark Streaming job with the parameters --num-executors 2 --executor-cores 5. So I was assuming it would use 2*5 = 10 vCores in total for the executors, but it is only using 2 cores in total from the cluster (+1 for the driver).
And in Spark, the job is still running 10 parallel tasks (2*5). It seems like it's just running 5 threads within each executor.
I have read in different questions and in the documentation that --executor-cores uses actual vCores, but here it is only running tasks as threads.
Is my understanding correct here?
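As a quick sanity check (a sketch, assuming a live SparkContext named sc inside the streaming application), the parallelism Spark itself sees should match num-executors * executor-cores, independent of what the YARN UI reports:

# With --num-executors 2 --executor-cores 5, this typically prints 10 once both executors have registered.
print(sc.defaultParallelism)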

Spark-submit command options --num-executors issue [duplicate]

So I have a Spark standalone server with 16 cores and 64 GB of RAM. I have both the master and the worker running on the server. I don't have dynamic allocation enabled. I am on Spark 2.0.
What I don't understand is that when I submit my job and specify:
--num-executors 2
--executor-cores 2
Only 4 cores should be taken up. Yet when the job is submitted, it takes all 16 cores and spins up 8 executors regardless, bypassing the num-executors parameter. But if I change the executor-cores parameter to 4, it will adjust accordingly and 4 executors will spin up.
Disclaimer: I really don't know if --num-executors should work or not in standalone mode. I haven't seen it used outside YARN.
Note: As pointed out by Marco --num-executors is no longer in use on YARN.
You can effectively control the number of executors in standalone mode with static allocation (this works on Mesos as well) by combining spark.cores.max and spark.executor.cores, where the number of executors is determined as:
floor(spark.cores.max / spark.executor.cores)
For example:
--conf "spark.cores.max=4" --conf "spark.executor.cores=2"
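The following sketch sets the same static allocation programmatically; the standalone master URL is a placeholder, and with these values you should see floor(4 / 2) = 2 executors:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("executor-count-demo")        # placeholder app name
        .setMaster("spark://master-host:7077")    # placeholder standalone master URL
        .set("spark.cores.max", "4")              # cap on total cores for the application
        .set("spark.executor.cores", "2"))        # cores per executor

sc = SparkContext(conf=conf)                      # expect floor(4 / 2) = 2 executors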


Spark tuning job

I have a problem with tuning Spark jobs running on a YARN cluster. I have a feeling that I'm not getting the most out of my cluster, and additionally my jobs fail (executors get removed all the time).
I have the following setup:
4 machines
each machine has 10 GB of RAM
each machine has 8 cores
8 GB of RAM is allocated for YARN jobs
14 (of 16) virtual cores are allocated for YARN jobs
I have run my Spark job (actually connected to a Jupyter notebook) using different setups, e.g.:
pyspark --master yarn --num-executors 7 --executor-cores 4 --executor-memory 3G
pyspark --master yarn --num-executors 7 --executor-cores 7 --executor-memory 2G
pyspark --master yarn --num-executors 11 --executor-cores 4 --executor-memory 1G
I've tried different combinations and none of them seems to work, as my executors keep getting destroyed. Additionally, I've read somewhere that increasing spark.yarn.executor.memoryOverhead to 600 MB is a good way not to lose executors (and I did that), but it doesn't seem to help. How should I set up my job?
Additionally, it confuses me that when I look at the ResourceManager UI it says for my job: VCores Used = 8, VCores Total = 56. It seems that I'm using a single core per executor, but I don't understand why.
One more thing: when I set up my job, how many partitions should I specify when reading data from HDFS to get maximal performance?
Donald Knuth said premature optimisation is the root of all evil. I am sure a faster-running program which fails is of no use. Start by giving all the memory to one executor, say 7 GB of the 8 GB, and just 1 core. This is a complete waste of cores, but if it works, it proves your application can possibly run on this hardware. If even this doesn't work, you should try getting bigger machines. Assuming it works, try increasing the number of cores for as long as it still works.
The gist of the argument is: your application requires a certain amount of memory per task, but the number of tasks running per executor depends on the number of cores. First find the worst-case memory per core for your application, and then you can set the executor memory and cores to some multiple of this number.
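To make that rule concrete, here is a back-of-the-envelope sketch for this 8 GB-per-node YARN setup; the per-task memory figure is an assumption you would measure with the single-core experiment described above:

# Assumed worst-case memory per task, measured with a 1-core, large-memory executor (assumption, not from the question).
worst_case_mem_per_task_gb = 1.5
cores_per_executor = 4                  # tasks that run concurrently in one executor
memory_overhead_gb = 0.6                # spark.yarn.executor.memoryOverhead, as set in the question

executor_memory_gb = worst_case_mem_per_task_gb * cores_per_executor     # 6.0 GB heap
total_yarn_request_gb = executor_memory_gb + memory_overhead_gb          # 6.6 GB, fits in the 8 GB per node
print(executor_memory_gb, total_yarn_request_gb)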

Using all resources in Apache Spark with Yarn

I am using Apache Spark with the YARN client.
I have 4 worker PCs in my Spark cluster, each with 8 vCPUs and 30 GB of RAM.
I set my executor memory to 2G and the number of instances to 33.
My job is taking 10 hours to run and all machines are about 80% idle.
I don't understand the correlation between executor memory and executor instances. Should I have an instance per vCPU? Should I set the executor memory to the machine's memory divided by the number of executors per machine?
I believe that you have to use the following command:
spark-submit --num-executors 4 --executor-memory 7G --driver-memory 2G --executor-cores 8 --class "YourClassName" --master yarn-client
The number of executors should be 4, since you have 4 workers. The executor memory should be close to the maximum memory that each YARN node has allocated, roughly ~5-6 GB (I assume you have 30 GB total RAM).
You should take a look at the spark-submit parameters and fully understand them.
We were using Cassandra as our data source for Spark. The problem was that there were not enough partitions; we needed to split up the data more. Our mapping of Cassandra partitions to Spark partitions was not small enough, and we would only generate 10 or 20 tasks instead of hundreds of tasks.
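If the source hands you too few partitions, you can force more splits at or after read time. A minimal sketch using an HDFS text read as the illustration (the path and partition counts are placeholders; the original poster's source was Cassandra, but the idea is the same):

# Hint textFile to create more input splits up front.
data_raw = sc.textFile("hdfs:///path/to/data", minPartitions=200)
print(data_raw.getNumPartitions())

# Or explicitly repartition an existing RDD so hundreds of tasks can run in parallel.
data_repart = data_raw.repartition(200)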
