Spark 2 on YARN is automatically using more cluster resources - apache-spark

I am on CDH 5.7.0 and I am seeing a strange issue with Spark 2 running on a YARN cluster. Below is my job submit command:
spark2-submit --master yarn --deploy-mode cluster --conf "spark.executor.instances=8" --conf "spark.executor.cores=4" --conf "spark.executor.memory=8g" --conf "spark.driver.cores=4" --conf "spark.driver.memory=8g" --class com.learning.Trigger learning-1.0.jar
Even though I have limited the cluster resources my job can use, I can see that the resource utilization is more than the allocated amount.
The job starts with a basic memory consumption of around 8 GB and then eats up the whole cluster.
I do not have dynamic allocation set to true.
I am just triggering an INSERT OVERWRITE query on top of SparkSession.
Any pointers would be very helpful.
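A likely culprit, assuming the CDH defaults are in effect: Cloudera enables spark.dynamicAllocation.enabled at the cluster level, so even if you never set it to true yourself it may be on, and since Spark 2.0 an explicit spark.executor.instances only sets the initial executor count, so the job can still grow beyond it. Disabling dynamic allocation per job rules this out:
spark2-submit --master yarn --deploy-mode cluster \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.executor.instances=8" --conf "spark.executor.cores=4" \
  --conf "spark.executor.memory=8g" --conf "spark.driver.cores=4" \
  --conf "spark.driver.memory=8g" \
  --class com.learning.Trigger learning-1.0.jar
With dynamic allocation off, the job stays capped at 8 executors x 4 cores x 8 GB plus the driver, and should no longer creep up to the full cluster.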

I created a resource pool in the cluster and assigned it some resources:
Min Resources : 4 Virtual Cores and 8 GB memory
I then used this pool when submitting the Spark job to limit its resource usage (VCores and memory).
e.g. spark2-submit --class org.apache.spark.SparkProgram.rt_app --master yarn --deploy-mode cluster --queue rt_pool_r1 /usr/local/abc/rt_app_2.11-1.0.jar
If anyone has better options to achieve the same, please let us know.

Related

Spark Cluster resources ensure system memory/cpu

New to the Spark world! I am trying to have a cluster that can be called upon to process applications of various sizes, using spark-submit to submit them and YARN to schedule and handle resource management.
The following is how I am submitting the applications, and I believe it says that for this application I am requesting 3 executors with 4 GB of memory and 5 cores each. Is it correct that this is per application?
spark-submit --master yarn --driver-memory 4G --executor-memory 4G --driver-cores 5 --num-executors 3 --executor-cores 5 some.py
How do I ensure that YARN leaves enough memory and cores for GC, for YARN itself (NodeManager, ....), and that the system has resources? Using yarn top I have seen 0 cores and 0 GB of memory available. There must be settings for this, isn't there?
To summarize:
Are core and memory requests on a spark-submit for an individual application run?
Is there any config to ensure that YARN and the system have resources? I feel like I need to reserve memory and cores for this.
TIA
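To both questions: the resource flags on a spark-submit are per application, and the headroom for the OS and YARN daemons comes from how the NodeManagers are sized. A rough sketch, assuming Spark 2.x defaults (the overhead value below is illustrative): each executor container costs --executor-memory plus spark.yarn.executor.memoryOverhead (default max(384 MB, 10% of executor memory)), and YARN never hands out more per node than yarn.nodemanager.resource.memory-mb / yarn.nodemanager.resource.cpu-vcores, so set those below the node's physical capacity to keep memory and cores in reserve.
# Each executor container requested below costs roughly 4 GB + 512 MB overhead and 5 vcores;
# 3 executors plus the driver/AM must fit inside the per-node limits configured in
# yarn-site.xml (yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores).
spark-submit --master yarn \
  --driver-memory 4G --driver-cores 5 \
  --num-executors 3 --executor-memory 4G --executor-cores 5 \
  --conf spark.yarn.executor.memoryOverhead=512 \
  some.py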

Spark reads double sized data

I have a Spark cluster in my private network, and I have a job containing only one line of code:
spark.read.parquet('/path/to/data').count()
I tried to run the job on the same data on an EMR Spark cluster and on my private Spark cluster, both with the same parameters:
spark-submit --driver-memory 1G --driver-cores 1 --num-executors 1 --executor-memory 8G --executor-cores 2 --conf spark.dynamicAllocation.enabled=false dummy_job.py
On the Spark monitoring web page, I saw that EMR read only 3 GiB of data while my private cluster read 6.1 GiB.
Both Spark clusters prune the read size a lot, but EMR reads much less data; this may indicate that our gzidc Spark cluster has an incorrect configuration related to Parquet I/O.
Any ideas? Thanks
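One quick sanity check, assuming the path lives on HDFS in both clusters: compare the on-disk size of the dataset with what each UI reports as input, so you know which of the two numbers is the anomaly:
hdfs dfs -du -s -h /path/to/data
If the dataset really is around 6 GiB on disk, then EMR is pruning more (for example reading fewer columns or row groups), rather than your cluster reading it twice.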

How can we limit the usage of VCores during spark-submit

I am writing a Spark Structured Streaming application in which the data processed with Spark needs to be sinked to an S3 bucket.
This is my development environment.
Hadoop 2.6.0-cdh5.16.1
Spark version 2.3.0.cloudera4
I want to limit the usage of VCores.
As of now I have used spark2-submit with the option --conf spark.cores.max=4. However, after submitting the job I observed that it occupied the maximum available VCores from the cluster (my cluster has 12 VCores).
The next job does not get started because no VCores are available.
What is the best way to limit the usage of VCores per job?
As of now I am using a workaround: I created a resource pool in the cluster and assigned it some resources:
Min Resources : 4 Virtual Cores and 8 GB memory
I then used this pool when submitting the Spark job to limit its VCore usage.
e.g. spark2-submit --class org.apache.spark.SparkProgram.rt_app --master yarn --deploy-mode cluster --queue rt_pool_r1 /usr/local/abc/rt_app_2.11-1.0.jar
I want to limit the usage of VCores without any workaround.
I also tried with
spark2-shell --num-executors 1 --executor-cores 1 --jars /tmp/elasticsearch-hadoop-7.1.1.jar
and below is the observation.
You can use the --executor-cores option; it will assign the specified number of cores to each of your executors.
You can refer to 1 and 2.
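Note that spark.cores.max only applies to standalone and Mesos masters; on YARN the per-job ceiling is effectively --executor-cores times the number of executors (plus one vcore for the application master), so cap both, and cap spark.dynamicAllocation.maxExecutors if dynamic allocation stays on. A sketch using the class and jar from the workaround above:
spark2-submit --master yarn --deploy-mode cluster \
  --num-executors 2 --executor-cores 2 \
  --conf spark.dynamicAllocation.enabled=false \
  --class org.apache.spark.SparkProgram.rt_app /usr/local/abc/rt_app_2.11-1.0.jar
This keeps the executors at 2 x 2 = 4 VCores on a 12-VCore cluster, leaving room for the next job.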

Multiple executors for a Spark application

Can one worker have multiple executors for the same Spark application in standalone and YARN mode? If not, what is the reason for that (for both standalone and YARN mode)?
Yes, you can specify resources which Spark will use. For example, you can use these properties for configuration:
--num-executors 3
--driver-memory 4g
--executor-memory 2g
--executor-cores 2
If your node has enough resources, the cluster assigns more than one executor to the same node.
You can read more about Spark resource configuration here.
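As an illustration only (the class and jar names below are placeholders), those flags combine into a single submit; on YARN, if one NodeManager has roughly 5 GB of memory and 4 vcores free (2 GB plus overhead per executor), it may end up hosting two of these executors:
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 3 --driver-memory 4g \
  --executor-memory 2g --executor-cores 2 \
  --class com.example.MyApp my-app.jar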

With spark-submit, do I need a worker running on my laptop using client mode?

I'm at a loss right now. I'm running a Spark standalone cluster with a worker on the same box. I'm doing this as a proof of concept, so it's not ready for production yet.
Screenshot of console
I spun up a large box in EC2 with 4 cores and 16 GB of RAM. I then submit a job from a different node in the same subnet (the security group confirms all ports are open between the nodes).
/bin/spark-submit --class SimpleApp --total-executor-cores 4 \
--executor-memory 10G --master spark://spark-test:7077 \
--deploy-mode client src/main/scala/target/scala-2.11/simple-project_2.11-1.0.jar
I get the following error message.
17/05/03 20:06:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
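That warning usually means one of two things: no worker can satisfy the per-executor request, or the executors start but cannot connect back to the driver, which in client mode runs on the submitting machine. One way to separate the two, with illustrative numbers only, is to retry with a much smaller request and watch the master UI and worker logs:
./bin/spark-submit --class SimpleApp --total-executor-cores 2 \
  --executor-memory 2G --master spark://spark-test:7077 \
  --deploy-mode client src/main/scala/target/scala-2.11/simple-project_2.11-1.0.jar
If the small request still hangs with the same warning, the issue is more likely connectivity from the worker back to the driver than a lack of resources.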
