EMR does not detect all the memory - apache-spark

I use EMR 5.18 to run Spark tasks. Here is the setup:
For some reason, EMR does not detect all the memory available on the worker nodes. I added nothing to the EMR Configuration part; it's all default settings.
Any idea what is causing this? Thanks.
Edit: regarding the value of yarn.nodemanager.resource.memory-mb: in the UI it says 28672, but in yarn-site.xml it's 352768.
And this is the list of Application installed:
Hive 2.3.3, Pig 0.17.0, Hue 4.2.0, Spark 2.3.2, Ganglia 3.7.2, Presto 0.210, Livy 0.5.0, Zeppelin 0.8.0, Oozie 5.0.0
Edit 2: it seems the reason is that I have HBase installed, but the question now is how to reallocate that memory back to YARN.

From the ResourceManager screen, click each node's HTTP Address link to open that Node Manager's web UI.
There, click Tools > Configuration and find the yarn.nodemanager.resource.memory-mb setting. It indicates how much memory is allocated to the YARN NodeManager on that node.
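If you prefer the shell, the same value can be checked from the master or a worker node; the paths and commands below assume EMR's default layout, so treat them as a sketch:

# read the configured NodeManager memory limit from the local config (default EMR path)
grep -A1 'yarn.nodemanager.resource.memory-mb' /etc/hadoop/conf/yarn-site.xml

# or ask YARN what each node actually reports
yarn node -list                  # get the node IDs
yarn node -status <node-id>      # prints Memory-Capacity for that node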
EMR sets defaults that depend on the EC2 instance type and on whether HBase is installed. They are listed in Amazon's online documentation:
You can set configuration variables to tune the performance of your MapReduce jobs... Default values vary based on the EC2 instance type of the node used in the cluster. HBase is available when using Amazon EMR release version 4.6.0 and later. Different defaults are used when HBase is installed.
Another page provides several alternative ways of changing the default values on EMR clusters specifically.
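If you decide the HBase reservation is not needed, the HBase-adjusted default can be overridden at cluster creation with a yarn-site classification; the number below is only a placeholder, so leave enough headroom for the OS and whatever HBase actually needs:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.resource.memory-mb": "24576"
    }
  }
]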

Spark's memory on EMR is allocated through YARN, but an EMR cluster is not only for YARN applications; it can also run applications that do not use YARN. So, by default, EMR does not hand the whole instance memory to YARN; it gives it roughly 75% of each instance's memory. See THIS and THIS.
On the second link, one of the supported options is:
Application: Spark
Release label classification: spark
Valid properties: maximizeResourceAllocation
When to use: Configure executors to utilize the maximum resources of each node.
which is what you want. With this option, executors are configured to use the maximum resources available on each node. Set it when you create the EMR cluster like this:
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
The effect is also noted by AWS:
Sets the maximizeResourceAllocation property to true or false. When true, Amazon EMR automatically configures spark-default properties based on cluster hardware configuration.
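For reference, one way to pass that classification is at cluster creation time with the AWS CLI; the instance details, key name, and file name here are placeholders:

aws emr create-cluster \
  --release-label emr-5.18.0 \
  --applications Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=my-key \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --configurations file://spark-config.json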

Related

Spark Standalone vs YARN

What features of YARN make it better than Spark Standalone mode for a multi-tenant cluster running only Spark applications? Maybe besides authentication.
There are a lot of answers on Google, and most of them sound wrong to me, so I'm not sure where the truth lies.
For example:
DZone, Deep Dive Into Spark Cluster Management
Standalone is good for small Spark clusters, but it is not good for bigger clusters (there is an overhead of running Spark daemons — master + slave — in cluster nodes)
But other cluster managers also require running agents on cluster nodes. YARN's slaves, for example, are called NodeManagers, and they may consume even more memory than Spark's workers (the Spark default is 1 GB).
This answer
The Spark standalone mode requires each application to run an executor on every node in the cluster; whereas with YARN, you choose the number of executors to use
against Spark Standalone # executor/cores control, which shows how you can limit the resources an application consumes in standalone mode.
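For what it's worth, a minimal spark-submit sketch of those standalone caps (the host, numbers, and script name are made up):

# cap the application at 8 cores total and 2 GB per executor on a standalone master
spark-submit \
  --master spark://master-host:7077 \
  --total-executor-cores 8 \
  --executor-memory 2g \
  my_job.py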
Spark Standalone Mode documentation
The standalone cluster mode currently only supports a simple FIFO scheduler across applications.
This ignores the fact that standalone mode can use dynamic allocation, and you can specify spark.dynamicAllocation.minExecutors & spark.dynamicAllocation.maxExecutors. I also haven't found any note saying that standalone mode doesn't support the FairScheduler.
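For completeness, a minimal spark-defaults.conf sketch for dynamic allocation in standalone mode; the executor bounds are arbitrary, and the external shuffle service has to be enabled on the workers for this to work:

spark.dynamicAllocation.enabled         true
spark.dynamicAllocation.minExecutors    1
spark.dynamicAllocation.maxExecutors    10
spark.shuffle.service.enabled           true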
This answer
YARN directly handles rack and machine locality
How would YARN know anything about data locality in my job? Suppose I'm storing file locations in AWS Glue (used by EMR as the Hive metastore), and inside the Spark job I'm querying some-db.some-table. How would YARN know which executor is better for the job assignment?
UPD: found another mention of YARN and data locality: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-data-locality.html. It still doesn't matter in the case of S3, for example.

How to setup YARN with Spark in cluster mode

I need to set up a Spark cluster (1 master and 2 slave nodes) on CentOS 7 with YARN as the resource manager. I am new to all this and still exploring. Can somebody share detailed steps for setting up Spark with YARN in cluster mode?
Afterwards I have to integrate Livy too (an open source REST interface for using Spark from anywhere).
Inputs are welcome. Thanks.
YARN is part of Hadoop. So, a Hadoop installation is necessary to run Spark on YARN.
Check out the Hadoop Cluster Setup page.
Then you can use this documentation to learn about Spark on YARN.
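To give an idea of the end state: once HADOOP_CONF_DIR points at the cluster configuration, a cluster-mode submission looks roughly like this (the class and jar path are placeholders):

export HADOOP_CONF_DIR=/etc/hadoop/conf
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.example.MyApp \
  /path/to/my-app.jar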
Another method to quickly learn about Hadoop, YARN and Spark is to utilize Cloudera Distribution of Hadoop (CDH). Read the CDH 5 Quick Start Guide.
We are currently using a similar setup in AWS. AWS EMR is costly, so we set up our own cluster on EC2 machines with the help of the Hadoop Cookbook. The cookbook supports multiple distributions; we chose HDP.
The setup included the following.
Master Setup
Spark (Along with History server)
Yarn Resource Manager
HDFS Name Node
Livy server
Slave Setup
Yarn Node Manager
HDFS Data Node
More information on manual installation can be found in the HDP documentation.
You can see part of that automation here.
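Since the question also mentions Livy: the Livy side mostly comes down to pointing it at YARN in livy.conf, roughly as below (key names per Livy 0.x, so verify them against your release):

# livy.conf (assumes SPARK_HOME and HADOOP_CONF_DIR are exported in livy-env.sh)
livy.spark.master = yarn
livy.spark.deploy-mode = cluster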

Datastax Spark Sql Thriftserver with Spark Application

I have an analytics node running with the Spark SQL Thrift server on it. Now I can't run another Spark application with spark-submit; it says it doesn't have resources. How do I configure the DSE node to be able to run both?
The SparkSqlThriftServer is a Spark application like any other. This means it requests and reserves all resources in the cluster by default.
There are two options if you want to run multiple applications at the same time:
Allocate only part of your resources to each application.
This is done by setting spark.cores.max to a smaller value than the max resources in your cluster.
See Spark Docs
Use dynamic allocation, which allows applications to change the amount of resources they use depending on how much work they are trying to do.
See Spark Docs
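As a sketch of the first option, capping the Thrift server so an ad-hoc job still fits (the numbers are illustrative, and the assumption that DSE's wrappers pass --conf through to spark-submit is worth verifying for your DSE version):

# start the Thrift server with a core cap
dse spark-sql-thriftserver start --conf spark.cores.max=4

# submit the second application with its own cap
dse spark-submit --conf spark.cores.max=4 my_app.py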

Is it worth deploying Spark on YARN if I have no other cluster software?

I have a Spark cluster running in standalone mode. I am currently executing code using a Jupyter notebook that calls pyspark. Is there a benefit to using YARN as the cluster manager, assuming the machines are not doing anything else?
Would I get better performance using YARN? If so, why?
Many thanks,
John
I'd say yes, considering these points.
Why Run on YARN?
Using YARN as Spark’s cluster manager confers a few benefits over Spark standalone:
You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads (see the queue sketch after this list).
Spark standalone mode also requires a worker process on each slave, and those workers cannot run non-Spark applications; with YARN, work is isolated in containers, so adopting another compute framework becomes a code change rather than an infrastructure-plus-code change. The cluster can therefore be shared among different frameworks.
YARN is the only cluster manager for Spark that supports security. With YARN, Spark can run against Kerberized Hadoop clusters and uses secure authentication between its processes.
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN. You can throw your entire cluster at a MapReduce job, then use some of it on an Impala query and the rest on a Spark application, without any changes in configuration.
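As an illustration of the first point, each workload can be submitted to its own YARN queue; the queue name below is a placeholder that would have to exist in your scheduler configuration:

# route this application to a dedicated capacity/fair scheduler queue
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue analytics \
  my_app.py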
I would say points 1, 2, and 3 are applicable to the scenario you mention, but not point 4, since we assumed no other frameworks are going to use the cluster.
source

YARN vs Spark processing engine based on real time application?

I understand YARN and Spark, but I want to know when I need to use YARN versus the Spark processing engine. What case studies would let me identify the difference between YARN and Spark?
You cannot compare YARN and Spark directly per se. YARN is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. It just happens that Hadoop MapReduce is a feature that ships with YARN, while Spark is not.
If you mean comparing Map Reduce and Spark, I suggest reading this other answer.
Apache Spark can run on YARN, Mesos, or in standalone mode.
Spark in standalone mode - all the resource management and job scheduling are handled by Spark itself.
Spark on YARN - YARN is the resource manager introduced in MRv2; it supports not only native Hadoop applications but also Spark, Kafka, Elasticsearch, and other custom applications.
Spark on Mesos - Spark also supports Mesos, which is one more type of resource manager.
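To make the distinction concrete, the same submission only changes its master URL across the three modes (hosts, ports, and the script name below are placeholders):

spark-submit --master spark://master-host:7077 my_app.py    # standalone
spark-submit --master yarn my_app.py                        # YARN, reads HADOOP_CONF_DIR
spark-submit --master mesos://mesos-host:5050 my_app.py     # Mesos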
Advantages of Spark on YARN
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN schedulers can be used for Spark jobs, and only with YARN can Spark run against Kerberized Hadoop clusters and use secure authentication between its processes.
Links to more documentation on YARN and Spark.
We can conclude by saying this: if you want to build a small and simple cluster independent of everything else, go for standalone. If you want to use an existing Hadoop cluster, go for YARN or Mesos.
