I am using the Cloudera distribution of Hadoop, and the Spark2 version in use is 2.2. I was looking for the memory management parameters listed in the "Memory Management" section of the link below:
https://spark.apache.org/docs/2.2.0/configuration.html
However, I don't see these configuration parameters under Spark2 > Configuration in Cloudera Manager.
I think there is a gap in my understanding. Please suggest where to look if these parameters need to be changed manually.
Hi,
I have added the memory tuning parameters directly in the Spark2 configuration in Cloudera Manager. When you save them, the spark-defaults.conf file on the server gets updated, and after a restart of the Spark2 service the configuration change takes effect.
Hope it helps.
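For illustration, entries like the following could be added (the property names come from the "Memory Management" section of the Spark 2.2 docs; the values shown are just examples, and the exact Cloudera Manager field, typically an advanced configuration snippet / safety valve for spark-defaults.conf, may differ by CM version):
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5
spark.memory.offHeap.enabled false
After saving and restarting the Spark2 service as described above, these should show up in spark-defaults.conf on the hosts.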
As part of performance tuning, one suggestion in the Spark documentation is to make pointers 4 bytes instead of 8, as shown in the figure.
I'm working on Azure databricks. Now where do I add this config?
I tried adding the following parameter in the cluster's advanced options, under Spark config:
jvm -XX:+UseCompressedOops
Am I adding this config in the right location? If not, where should I add it?
Edit:
Document link
https://spark.apache.org/docs/latest/tuning.html
Edit the Databricks cluster. Go to Advanced Options and then to Environment Variables. Add there:
JAVA_OPTS="$JAVA_OPTS -XX:+UseCompressedOops"
Note that an example is shown there when you start to create a brand new cluster.
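As an alternative sketch (not verified on Databricks here, but spark.driver.extraJavaOptions and spark.executor.extraJavaOptions are standard Spark properties), the same flag can usually be set in the Spark config box of the same Advanced Options section:
spark.driver.extraJavaOptions -XX:+UseCompressedOops
spark.executor.extraJavaOptions -XX:+UseCompressedOops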
I am trying to detect drive failures on DataNodes in a Hadoop cluster. The Cloudera Manager API doesn't have a specific endpoint for that; its APIs only cover the NameNode or restarting services. Are there any suggestions here? Thanks a lot!
If you have access to the NameNode UI, the JMX page will give you this information. If you hit the JMX endpoint directly, it returns a JSON-formatted page, which can be parsed easily.
We use Hortonworks primarily and haven't touched Cloudera in a long time, but I assume the same page can be made available there somehow.
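As a rough sketch (assuming the NameNode web UI is reachable on the default port 50070 and that your Hadoop version reports per-DataNode volume failures in the NameNodeInfo bean; attribute names such as volfails can vary between releases), the JMX output could be parsed like this:
import json
import urllib.request

# Hypothetical NameNode host; the JMX servlet usually lives at /jmx
url = "http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"

with urllib.request.urlopen(url) as resp:
    bean = json.load(resp)["beans"][0]

# LiveNodes is itself a JSON string with one entry per DataNode
live_nodes = json.loads(bean["LiveNodes"])
for host, info in live_nodes.items():
    # volfails (where present) is the count of failed volumes on that DataNode
    print(host, "failed volumes:", info.get("volfails", "n/a"))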
I use EMR 5.18 to run Spark tasks. Here is the setup:
For some reason, EMR does not make all of the memory on the worker nodes available. I added nothing to the EMR Configuration section; it's all default settings.
Any idea what is causing this? Thanks.
Edit: Regarding the value of yarn.nodemanager.resource.memory-mb. In the UI it says 28672 but in the yarn-site.xml it's 352768
And this is the list of Application installed:
Hive 2.3.3, Pig 0.17.0, Hue 4.2.0, Spark 2.3.2, Ganglia 3.7.2, Presto 0.210, Livy 0.5.0, Zeppelin 0.8.0, Oozie 5.0.0
Edit 2: It seems the reason is that I have HBase installed, but the question now is how to re-allocate that memory back.
From the ResourceManager screen, click each node's HTTP Address link to go to that NodeManager's web UI.
There, click Tools > Configuration and find the yarn.nodemanager.resource.memory-mb setting. This shows how much memory is allocated to the YARN NodeManager on that node.
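If the nodes are reachable over HTTP, the same configuration can also be dumped from the command line (assuming the default NodeManager web port 8042; replace the hostname with one of your nodes) and searched for yarn.nodemanager.resource.memory-mb:
curl -s http://<node-hostname>:8042/conf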
EMR sets defaults that depend on the EC2 instance type and on whether HBase is installed. They are listed in Amazon's online documentation:
You can set configuration variables to tune the performance of your MapReduce jobs... Default values vary based on the EC2 instance type of the node used in the cluster. HBase is available when using Amazon EMR release version 4.6.0 and later. Different defaults are used when HBase is installed.
Another page provides several alternative ways of changing the default values on EMR clusters specifically.
Spark memory on EMR is allocated by YARN, but an EMR cluster is not only for YARN applications; it can also run applications that do not use YARN. So, by default, EMR does not hand all of the instance memory to YARN, only around 75% of it. See THIS and THIS.
On the second link, one of the supported options is:
Application: Spark
Release label classification: spark
Valid properties: maximizeResourceAllocation
When to use: Configure executors to utilize the maximum resources of each node.
This is what you want. With this option, executors are configured to use the maximum resources available on each node. Set it when you create the EMR cluster, like this:
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
The effect is also noted by AWS:
Sets the maximizeResourceAllocation property to true or false. When true, Amazon EMR automatically configures spark-default properties based on cluster hardware configuration.
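For reference, a hedged example of passing this classification at cluster creation with the AWS CLI (instance type, count, and the JSON file name are illustrative; the JSON above is assumed to be saved as spark-config.json):
aws emr create-cluster \
  --release-label emr-5.18.0 \
  --applications Name=Spark Name=Hive \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://spark-config.json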
Although there are some examples (and questions) on how to submit Spark jobs via the YARN REST API, there are none that address the specific changes required to make it work with Spark2. I'm currently basing my work on this example and the accompanying documentation, but one thing is already quite clear: as far as I can tell, Spark2 no longer requires a Spark assembly jar on HDFS. Instead, on HDP there is a spark2-hdp-yarn-archive.tar.gz deployed in HDFS.
Now, I wonder how I would have to configure the local-resources in am-container-spec to make the container Spark2 compatible:
How would I build the classpath for REST compatibility (in particular __spark.jar__)?
Can I reduce the overall amount of duplicate configuration (as suggested here, for the Java YARN-API)?
I've used the Windows version of HDInsight before, and that has a tab where you can set the number of cores and RAM per worker node for Zeppelin.
I followed this tutorial to get Zeppelin working:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-zeppelin-notebook/
The Linux version of HDInsight uses Ambari to manage the resources, but I can't seem to find a way to change the settings for Zeppelin.
Zeppelin is not selectable as a separate service in the list of services on the left. It also doesn't appear to be available when I choose 'Add Service' from Actions.
I tried editing the general spark configs in Ambari by using override, then adding the worker nodes to my new config group and increasing the number of cores and RAM in custom spark-defaults. (Then clicked save and restarted all affected services.)
I tried editing the spark settings using
vi /etc/spark/conf/spark-defaults.conf
on the headnode, but that wasn't picked up by Ambari.
The performance in Zeppelin seems to stay the same for a query that takes about 1000-1100 seconds every time.
Zeppelin is not a service, so it won't show up in Ambari. If you are committed to managing it that way, you may be able to get this to work:
https://github.com/tzolov/zeppelin-ambari-plugin
To edit it via SSH, you'll need to edit the zeppelin-env.sh file. First, give yourself write permission:
sudo chmod u+w /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
and then edit the Zeppelin configs using
vi /usr/hdp/current/incubator-zeppelin/conf/zeppelin-env.sh
Here you can configure the ZEPPELIN_JAVA_OPTS variable, adding:
-Dspark.executor.memory=1024m -Dspark.executor.cores=16
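For example, the full line in zeppelin-env.sh might end up looking like this (the memory and core values are just the ones from above; adjust them to your nodes):
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=1024m -Dspark.executor.cores=16"
Restart Zeppelin afterwards so the new options are picked up.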
All that being said... any reason you can't just use a Jupyter notebook instead?