DataProc YARN UI - wrong number of Vcores - apache-spark

I'm running a Spark application on a Dataproc cluster (n1-standard-16) with 4 machines (3 primary and 1 secondary).
In the idle scenario I can see 16 vcores available, which is expected.
But when my Spark application is running, the count goes above 16, i.e. to 32, like below. Any idea why this is happening? Is it because of threading?
If it is because of threading, how do I control it? And how can I make maximum use of it?
Note: I have already changed the scheduler type shown in the YARN UI to the fair scheduler.
My spark-submit properties:
--properties=spark.submit.deployMode=cluster,spark.hadoop.hive.exec.dynamic.partition=true,spark.sql.hive.convertMetastoreOrc=true,spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict,spark.shuffle.service.enabled=true,spark.dynamicAllocation.enabled=true,spark.dynamicAllocation.minExecutors=30,spark.dynamicAllocation.maxExecutors=180,spark.dynamicAllocation.executorIdleTimeout=60s,spark.executor.instances=70,spark.executor.cores=3,spark.serializer=org.apache.spark.serializer.KryoSerializer,spark.sql.shuffle.partitions=220,spark.executor.memory=3g,spark.driver.memory=2g,spark.yarn.executor.memoryOverhead=1g,
Thanks in advance.

Related

What is the benefit of using more than 1 driver core in Spark YARN cluster mode?

What is the difference between using 1 vs. 2 driver cores in Spark YARN cluster mode? If I use 2 driver cores in YARN cluster mode, will the Spark driver be relaunched in case of failure? If so, how many retries would it do before failing?
I would appreciate it if anyone could share an article on this.
When you launch an application in YARN cluster mode, YARN creates a container for your driver.
This container, depending on your application, might need multiple cores and multiple gigabytes of memory. It all depends on how many sessions connect to your Spark application at the same time and on the complexity of your queries.
If it looks like your queries compile slowly, or your Spark Web UI/app hangs, it might be worth increasing the core count.
From YARN's point of view, there is still only one driver container.
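For illustration, a minimal sketch of setting driver cores in YARN cluster mode (the class name, JAR, and resource values are placeholders, not taken from the question):
# In cluster mode the driver runs inside the application master container,
# so --driver-cores controls that container's vcores.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-cores 2 \
  --driver-memory 4g \
  --class com.example.MyApp \
  my-app.jar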

Get number of available executors

I'm spinning up an EMR 5.4.0 cluster with Spark installed. I have a job whose performance really degrades if it's scheduled on executors which aren't available (e.g. on a cluster with 2 m3.xlarge core nodes there are about 16 executors available).
Is there any way for my app to discover this number? I can discover the hosts by doing this:
sc.range(1, 100, 1, 100).pipe("hostname").distinct().count()
but I'm hoping there's a better way of getting an understanding of the cluster that Spark is running on.
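One possibility, as a sketch only (it assumes a Spark 2.x SparkContext in scope as sc; with dynamic allocation the number can change over the job's lifetime):
// Executors currently registered with the driver; getExecutorInfos
// also includes the driver itself, so subtract one.
val numExecutors = sc.statusTracker.getExecutorInfos.length - 1

// Alternative: the keys of getExecutorMemoryStatus are "host:port" strings,
// again including the driver.
val numExecutorsAlt = sc.getExecutorMemoryStatus.size - 1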

How to configure Yarn to use all vcores?

We are running a Spark streaming job using YARN as the cluster manager. I have dedicated 7 vcores per node via yarn-site.xml, as shown in the pic below.
When the job is running, it's only using 2 vcores and 5 vcores are left idle, and the job is slow with a lot of batches queued up.
How can we make it use all 7 vcores that are available to it (this is the usage when running), so that it speeds up our job?
Would greatly appreciate it if any of the experts in the community could help out, as we are new to YARN and Spark.
I searched many answers for this question. Finally, it worked after changing a YARN config file: capacity-scheduler.xml
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Don't forget to restart YARN afterwards.
At the Spark level, you can control the YARN application master's cores using the parameter spark.yarn.am.cores.
For Spark executors, you need to pass --executor-cores to spark-submit.
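For example, a hedged sketch (the executor count and memory values are illustrative, not taken from the question):
# --executor-cores sets cores per executor; spark.yarn.am.cores sets the
# application master's cores (relevant in client mode, where the AM is not the driver).
spark-submit \
  --master yarn \
  --executor-cores 7 \
  --executor-memory 4g \
  --num-executors 2 \
  --conf spark.yarn.am.cores=1 \
  my-streaming-app.jar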
However, from Spark you cannot control what (vcores/memory) YARN chooses to allocate to the containers it spawns, which is right, as you are running Spark on YARN.
In order to control that, you will need to change YARN vcore parameters like yarn.nodemanager.resource.cpu-vcores and yarn.scheduler.minimum-allocation-vcores. You can find more here: https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html#configuring_in_cm
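As an illustration only, a minimal yarn-site.xml sketch (the values are examples matching the 7-vcore nodes described above, not a recommendation):
<!-- vcores each NodeManager advertises to YARN -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>7</value>
</property>
<!-- smallest and largest vcore allocation a single container may request -->
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>7</value>
</property>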

What is the minimum hardware infrastructure required for Spark to run in standalone cluster mode?

I am running Spark standalone cluster mode on my local computer. This is the hardware information about my computer:
Intel Core i5
Number of Processors: 1
Total Number of Cores: 2
Memory: 4 GB.
I am trying to run a Spark program from Eclipse on the Spark standalone cluster. This is part of my code:
String logFile = "/Users/BigDinosaur/Downloads/spark-2.0.1-bin-hadoop2.7 2/README.md";
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://BigDinosaur.local:7077");
After running the program in Eclipse, I get the following warning message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resource
This is a screenshot of my web UI:
After going through other people's answers to similar problems, it seems like a hardware resource mismatch is the root cause.
I want to get more information on:
What is the minimum hardware infrastructure required for a Spark standalone cluster to run an application on it?
It started running after I ran the following command:
./start-slave.sh spark://localhost:7077 --cores 1 --memory 1g
I gave it 1 core and 1 GB of memory.
As far as I know, Spark allocates memory from whatever memory is available when the Spark job starts.
You may want to try explicitly providing cores and executor memory when starting the job.
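For example, a sketch of submitting with explicit limits on a standalone master (the class and JAR names are placeholders, and the values simply mirror the worker started above):
# Cap the application at 1 core in total and 1g per executor so it fits the single worker
spark-submit \
  --master spark://BigDinosaur.local:7077 \
  --total-executor-cores 1 \
  --executor-memory 1g \
  --class SimpleApp \
  simple-app.jar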

Number of Executors is less than what is assigned for a Spark job

I have two Hadoop clusters containing 15 (big) and 3 (small) nodes respectively. Both are managed by Cloudera Manager. I am running a Spark job on YARN with --num-executors set to 6. The Spark UI of the big cluster shows the 6 executors, but the Spark UI of the small cluster shows only 3 executors. What are the probable reasons for this? And how can I overcome the issue?
Thanks in advance.
