Hadoop-2.7.2: How to manage resources

I use a server with 16 cores, 64 GB RAM and a 2.5 TB disk, and I want to execute a Giraph program. I have installed hadoop-2.7.2, and I don't know how to configure Hadoop to use only part of the server's resources, because the server is used by many users.
Requirements: Hadoop must use at most 12 cores (i.e., 4 cores for the NameNode, DataNode, JobTracker and TaskTracker, and at most 8 for tasks) and at most 28 GB of RAM (i.e., 4*3 GB + 8*2 GB).
My yarn-site.xml resource configuration:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>28672</value>
  <description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>12</value>
  <description>Number of CPU cores that can be allocated for containers.</description>
</property>
When I try to execute the Giraph program, the YARN application state shown at http://localhost:8088 is: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
I think some configurations are missing from my yarn-site.xml in order to meet the above requirements.

Before assigning resources to the services, take a look at the YARN tuning guide from Cloudera; it gives you an idea of how much of the resources should be allocated to the OS, the Hadoop daemons, etc.
As you mentioned:
Yarn Application state is: ACCEPTED: waiting for AM container to be allocated, launched and register with RM
If there are no resources available for a job, it stays in the ACCEPTED state until it gets resources. So in your case, check how many jobs are being submitted at the same time and check the resource utilisation of those jobs.
If you do not want your jobs to wait, you should consider creating scheduler queues; a sketch follows.
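As a rough illustration only (not part of the original answer): with the CapacityScheduler, queues are declared in capacity-scheduler.xml. The queue names (default, giraph) and capacity percentages below are made-up example values that would need to be adapted to the 12-core / 28 GB budget above.
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,giraph</value>
  <description>Child queues of the root queue (example names).</description>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>60</value>
  <description>Percentage of the cluster guaranteed to the default queue (example value).</description>
</property>
<property>
  <name>yarn.scheduler.capacity.root.giraph.capacity</name>
  <value>40</value>
  <description>Percentage of the cluster guaranteed to the giraph queue (example value).</description>
</property>
Each queue then gets a guaranteed share, so a long-running job in one queue does not leave an application master in another queue waiting indefinitely.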

Related

Memory configurations

I have a question about memory configuration. I am running a 3-node cluster with Spark, Cassandra, Hadoop, Thrift and YARN. I want to store my files in HDFS, so I loaded Hadoop. I am finding that I run out of memory when running my queries. I was able to figure out how to restrict Cassandra to less than 4 GB. Is there such a setting for Hadoop? How about YARN? As I only use Hadoop to load my flat files, I think setting it to 1 or 2 GB should be fine. My boxes have 32 GB of RAM and 16 cores each.
It is hard to say without the error message you are facing, but if you want to control the memory allocated on your workers you can set these two configurations in your yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value> <!-- memory available to containers on this node, 40 GB here -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- max RAM per container -->
</property>
You can see more details here in this question.
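To the original question of capping Hadoop/YARN at 1-2 GB on a 32 GB box, a minimal yarn-site.xml sketch could look like the following; the values are illustrative assumptions, not recommendations from the answer above.
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value> <!-- cap the memory YARN containers may use on this node at 2 GB -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>1024</value> <!-- no single container may request more than 1 GB -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value> <!-- leave the remaining cores to Spark and Cassandra -->
</property>
Note that this only bounds YARN containers; the heaps of the HDFS and YARN daemons themselves are configured separately (e.g. via hadoop-env.sh).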

emr-5.4.0 (Spark executors memory allocation issue)

I created a Spark cluster (learning, so I did not create a high memory/CPU cluster) with 1 master node and 2 core nodes to run executors, using the config below:
Master: Running 1 m4.large (2 cores, 8 GB)
Core: Running 2 c4.large (2 cores, 3.5 GB)
Hive 2.1.1, Pig 0.16.0, Hue 3.11.0, Spark 2.1.0, Sqoop 1.4.6, HBase 1.3.0
When pyspark is run, I get the error below:
Required executor memory (1024+384 MB) is above the max threshold (896 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
Before trying to increase the yarn-site.xml config, I am curious to understand why EMR takes just 896 MB as the limit when the master has 8 GB and each worker node has 3.5 GB.
Also, the Resource Manager URL (for the master, http://master-public-dns-name:8088/) shows 1.75 GB, whereas the memory of the VM is 8 GB. Is HBase or other software taking up too much memory?
If anyone has encountered a similar issue, please share your insight on why EMR sets such low defaults. Thanks!
Before trying to increase yarn-site.xml config, curious to understand why EMR is taking just 896MB as limit when master has 8GB and worker node has 3.5GB each.
If you run Spark jobs in yarn-cluster mode (which you probably were), the executors run on the core nodes and the master's memory is not used.
Now, although your core EC2 instance (c4.large) has 3.75 GB to use, EMR configures YARN not to use all of this memory for running YARN containers or Spark executors. This is because you have to leave enough memory for the other permanent daemons (like HDFS's DataNode, YARN's NodeManager, EMR's own daemons, etc., depending on the applications you provision).
EMR publishes the default YARN configuration it sets for each instance type on this page: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
c4.large defaults:
Configuration Option                   Default Value
mapreduce.map.java.opts                -Xmx717m
mapreduce.map.memory.mb                896
yarn.scheduler.maximum-allocation-mb   1792
yarn.nodemanager.resource.memory-mb    1792
So yarn.nodemanager.resource.memory-mb = 1792 means that 1792 MB is the physical memory that will be allocated to YARN containers on a core node that has 3.75 GB of actual memory. Also check spark-defaults.conf, where EMR sets some defaults for Spark executor memory. These are defaults, and of course you can change them before starting the cluster using EMR's configurations API. But keep in mind that if you over-provision memory for YARN containers, you might starve some other processes.
Given that, it is important to understand the YARN configs and how Spark interacts with YARN:
https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
http://spark.apache.org/docs/latest/running-on-yarn.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
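If you do decide to raise the limits on the core nodes, the two properties named in the error message are the ones to change. The values below are only an illustrative sketch for a c4.large that leaves some headroom for the daemons, not EMR-recommended numbers, and on EMR they would normally be applied through the configurations API (yarn-site classification) rather than by editing the file by hand.
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2560</value> <!-- memory made available to containers on this node (illustrative) -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2560</value> <!-- largest single container the scheduler will grant (illustrative) -->
</property>
With these values, an executor requesting 1024 MB plus the 384 MB overhead from the error message would fit into a single container.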
It's not really a property of EMR but rather of YARN, which is the resource manager running on EMR.
My personal take on YARN is that it is really built for managing long-running clusters that continuously take in a variety of jobs which they have to run simultaneously. In these cases it makes sense for YARN to assign only a small part of the available memory to each job.
Unfortunately, when it comes to special-purpose clusters (like "I will just spin up a cluster, run my job and terminate the cluster again"), these YARN defaults are simply annoying, and you have to configure a bunch of stuff to make YARN utilise your resources optimally. But running on EMR, that is what we are stuck with these days, so one has to live with it...

How does Spark limit the usage of CPU cores and memory?

How does Spark limit the usage of CPU cores and memory? Does it use cgroups? How about YARN?
In a standalone cluster, Spark only manages the application's predefined resource configuration against the resource pool it is given. The resource pool is made up of the executors on the workers that were added as slaves to the cluster.
YARN uses containers, and the resource limitation is applied through the container configuration, which defines the minimum and maximum core and memory allocation.
In YARN, the NodeManager monitors the Spark executors' memory usage and kills them if they go above spark.executor.memory.
In the case of CPU, spark.executor.cores is the number of concurrent tasks an executor can run. More information is in the Spark Configuration documentation.
You can also enable cgroups in YARN and limit the CPU usage of YARN containers (Spark executors); a sketch follows.
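A minimal yarn-site.xml sketch for cgroups-based CPU enforcement, assuming a Linux node where the LinuxContainerExecutor can be used; the cgroup hierarchy below is the usual default and may differ per distribution.
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value> <!-- cgroup hierarchy under which container cgroups are created -->
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value> <!-- hard-limit each container to its allocated vcores -->
</property>
The LinuxContainerExecutor additionally requires yarn.nodemanager.linux-container-executor.group and the container-executor binary to be set up on each node.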

YARN shows more resources than the cluster has

I started an EMR cluster with 3 m3.xlarge instances (1 master & 2 slaves) and I have some trouble.
According to the AWS documentation, an m3.xlarge instance has 4 vCPUs (https://aws.amazon.com/ec2/instance-types/). What does that mean: 4 threads, or 4 cores with 2 threads each? I ask because when I open the Hadoop UI (port 8088) there appear to be 8 available vcores per instance, but from what I experienced, the cluster behaves like 2 instances with 4 vcores each. Am I wrong? Or is it a bug in Amazon or YARN?
The value of 8 vcores comes from the default YARN property:
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
  <description>Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of physical cores used by YARN containers.</description>
</property>
Even though it is set to a value higher than the actual number of vcores in the instance, containers will be created based on the number of vcores actually available per NodeManager instance.
Modify the value of this property in yarn-site.xml to match the instance's vcores.
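For the m3.xlarge nodes in question (4 vCPUs each), a sketch would be the following; on EMR this would typically be applied via the yarn-site configuration classification.
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value> <!-- match the 4 vCPUs of an m3.xlarge -->
</property>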

Spark drivers taking all the resources on a YARN cluster

We are submitting multiple Spark jobs in yarn-cluster mode to a YARN queue simultaneously. The problem we are currently facing is that drivers get initialized for all the jobs and no resources are left for the executors' initialization.
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
  <description>
    Maximum percent of resources in the cluster which can be used to run
    application masters, i.e. controls the number of concurrently running
    applications.
  </description>
</property>
According to this property, only 50% of the resources should be available for application masters, but in our case this is not working.
Any suggestions to tackle this problem?
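For what it is worth, the same limit can also be set per queue in capacity-scheduler.xml; the queue name below is a made-up example, and this is only a sketch of the property's per-queue form, not a confirmed fix for the problem above.
<property>
  <name>yarn.scheduler.capacity.root.sparkjobs.maximum-am-resource-percent</name>
  <value>0.5</value> <!-- cap AM resources within the hypothetical "sparkjobs" queue -->
</property>
Changes to capacity-scheduler.xml take effect after refreshing the queues (yarn rmadmin -refreshQueues).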
