Enabling dynamic allocation for Spark in YARN mode - apache-spark

This question is similar to an earlier one, but that question received no answer.
I am trying to enable dynamic allocation for Spark in YARN mode. I have an 11-node cluster with 1 master node and 10 worker nodes. I am following the links below for instructions:
For setup in YARN:
http://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
Config variables that need to be set in spark-defaults.conf: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
I have also taken reference from the link below and a few other resources:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html#spark.dynamicAllocation.testing
Here are the steps I am following:
Setting up the config variables in spark-defaults.conf.
My spark-defaults.conf entries related to dynamic allocation and the shuffle service are:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
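For reference, the configuration page linked above also lists executor bounds and an idle timeout that are commonly set alongside these two flags; a minimal sketch of such a block (the numbers are illustrative values, not taken from this cluster):
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=10
spark.dynamicAllocation.executorIdleTimeout=60s
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337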
Making the following changes in yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value> $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/client/*,$HADOOP_MAPRED_HOME/share/hadoop/client/lib/*,/home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar </value>
</property>
All these steps are replicated on all worker nodes, i.e. spark-defaults.conf has the above-mentioned values and yarn-site.xml has these properties. I have made sure that /home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar exists on all worker nodes.
Then I run $SPARK_HOME/sbin/start-shuffle-service.sh on the worker nodes and on the master node. On the master node, I restart YARN using stop-yarn.sh and then start-yarn.sh.
Then I run yarn node -list -all to see the worker nodes, but no nodes are listed.
When I remove the property
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
I can see all the worker nodes as normal, so it seems the shuffle service is not properly configured.
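When a NodeManager drops out after an aux-service is added, its own log usually records the reason; a hedged diagnostic sketch (the log location and the jar destination are assumptions that vary by installation):
# on a worker node, look for aux-service / shuffle-service errors in the NodeManager log
grep -iE "spark_shuffle|YarnShuffleService|aux-service" $HADOOP_HOME/logs/yarn-*-nodemanager-*.log
# the Spark "running on YARN" guide says the yarn-shuffle jar must be on the NodeManager classpath,
# e.g. by copying it into the Hadoop YARN lib directory on every worker
cp /home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar $HADOOP_HOME/share/hadoop/yarn/lib/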

Related

failed to connect hadoop spark

I am a beginner with Spark jobs and Spark configuration.
I try to submit a Spark job; after a few minutes (the job is accepted and runs for a few minutes) it fails with a connection-refused error.
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 2
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to my.domain.com/myIp:portNumber
I also get this error with jobs that succeed:
ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
On my computer the job runs fine in IntelliJ IDEA, so this is not a code mistake.
I have tried several times to change the configuration in yarn-site.xml and mapred-site.xml.
This is a Hadoop HDFS cluster with 3 nodes, 2 cores per node and 8 GB RAM per node. I try to submit with this command line:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.3 --class MyClass --master yarn --deploy-mode cluster myJar.jar
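A first diagnostic step for a failure like this (a general suggestion, not something from the original post) is to pull the aggregated YARN container logs for the failed application and read the executor stderr around the FetchFailedException:
# <applicationId> is the id shown by the ResourceManager, e.g. application_XXXXXXXXXXXXX_XXXX
yarn logs -applicationId <applicationId> | less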
mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2000</value>
</property>
yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ipadress</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4000</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2000</value>
</property>
spark-defaults.conf
spark.master yarn
spark.driver.memory 1g
spark.history.fs.update.interval 30s
spark.history.ui.port port
spark.core.connection.ack.wait.timeout 600s
spark.default.parallelism 2
spark.executor.memory 2g
spark.cores.max 2
spark.executor.cores 2
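A FetchFailedException with "Failed to connect to host:port" during a ShuffleMapStage is often a symptom of the serving executor having died (for example, killed by YARN) or of shuffle connections timing out, rather than of the job code itself. A hedged sketch of settings people commonly experiment with in spark-defaults.conf while investigating (illustrative values; whether they help depends on the real cause):
spark.network.timeout 300s
spark.shuffle.io.maxRetries 10
spark.shuffle.io.retryWait 30s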

Yarn nodemanager not starting up. Getting no errors

I have Hadoop 2.7.4 installed on Ubuntu 16.04. I'm trying to run it in pseudo-distributed mode.
I have a '/hadoop' partition mounted for all my Hadoop files, including the NameNode and DataNode files.
My core-site.xml is:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
My hdfs-site.xml is:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadoop/nodes/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/nodes/datanode</value>
</property>
</configuration>
My mapred-site.xml is:
<configuration>
<property>
<name>Map-Reduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
My yarn-site.xml is:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>Map-Reduce_shuffle</value>
</property>
</configuration>
After running
$ start-dfs.sh
$ start-yarn.sh
$ jps
I get the following daemons running.
2800 ResourceManager
2290 NameNode
4242 Jps
2440 DataNode
2634 SecondaryNameNode
start-yarn.sh gives me:
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /hadoop/hadoop-2.7.4/logs/yarn-abdy-resourcemanager-abdy-hadoop.out
localhost: starting nodemanager, logging to /hadoop/hadoop-2.7.4/logs/yarn-abdy-nodemanager-abdy-hadoop.out
The NodeManager daemon does not seem to start at all.
I've been trying for 2 days to fix this issue but cannot find a fix. Can someone please guide me?
If you are starting the Hadoop daemons for the first time, you first have to format your NameNode.
Before formatting the NameNode, make sure you delete the existing
/hadoop/nodes/namenode and /hadoop/nodes/datanode folders.
Then execute:
hadoop namenode -format
Once the formatting of the NameNode is done, execute the following commands:
start-dfs.sh
start-yarn.sh
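Putting that answer together as a single shell sequence, a sketch for this pseudo-distributed setup (the directory paths come from the question's hdfs-site.xml; note that formatting wipes existing HDFS data):
# stop anything that is currently running
stop-yarn.sh
stop-dfs.sh
# remove the old NameNode/DataNode directories listed in hdfs-site.xml
rm -rf /hadoop/nodes/namenode /hadoop/nodes/datanode
# re-format the NameNode (this erases HDFS metadata)
hadoop namenode -format
# start HDFS and YARN again, then check the daemons
start-dfs.sh
start-yarn.sh
jps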

oozie spark action - how to specify spark-opts

I am running a Spark job in yarn-client mode via the Oozie spark action. I need to specify driver and application-master related settings. I tried configuring spark-opts as documented by Oozie, but it's not working.
Here is the example from the Oozie docs:
Example:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
...
<action name="myfirstsparkjob">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<prepare>
<delete path="${jobOutput}"/>
</prepare>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<master>local[*]</master>
<mode>client</mode>
<name>Spark Example</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
<spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
<arg>inputpath=hdfs://localhost/input/file.txt</arg>
<arg>value=2</arg>
</spark>
<ok to="myotherjob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
In the above, spark-opts is specified as --executor-memory 20G --num-executors 50,
while on the same page the description says:
"The spark-opts element if present, contains a list of spark options that can be passed to spark driver. Spark configuration options can be passed by specifying '--conf key=value' here"
So according to the document it should be --conf executor-memory=20G.
Which one is right here, then? I tried both but neither seems to work. I am running in yarn-client mode, so I mainly want to set up driver-related settings. I think this is the only place where I can set driver settings.
<spark-opts>--driver-memory 10g --driver-java-options "-XX:+UseCompressedOops -verbose:gc" --conf spark.driver.memory=10g --conf spark.yarn.am.memory=2g --conf spark.driver.maxResultSize=10g</spark-opts>
<spark-opts>--driver-memory 10g</spark-opts>
None of the above driver-related settings get set in the actual driver JVM; I verified this from the Linux process info.
reference: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
I found what the issue is. In yarn-client mode you can't specify driver-related parameters using <spark-opts>--driver-memory 10g</spark-opts>, because your driver (the Oozie launcher job) has already been launched by that point. The Oozie launcher (which is a MapReduce job) launches your actual Spark job (or any other job), and spark-opts is only relevant for that launched job. To set driver parameters in yarn-client mode you instead need to configure the launcher in the Oozie workflow:
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx6000m</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.cpu.vcores</name>
<value>24</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>default</value>
</property>
</configuration>
I haven't tried yarn-cluster mode, but spark-opts may work for driver settings there. My question, however, was about yarn-client mode.
<spark-opts>--executor-memory 20G</spark-opts> should ideally work.
Also, try using:
<master>yarn-cluster</master>
<mode>cluster</mode>
"Spark configuration options can be passed by specifying '--conf key=value' here" is probably referring to the configuration tag.
For example,
--conf mapred.compress.map.output=true would translate to:
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
Try changing <master>local[*]</master> to <master>yarn</master>.
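Combining the suggestions above, a hedged sketch of how the relevant part of the spark action might look for running on YARN in cluster mode (the class, jar and memory values are reused from the examples earlier on this page, not from a tested workflow):
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>Spark Example</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
<spark-opts>--driver-memory 10g --executor-memory 20G --num-executors 50</spark-opts>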

SparkR-submit core not allocated

Hi, I'm working with SparkR in YARN mode.
When I submit an application in this way:
./spark-submit --master yarn-client --packages com.databricks:spark-csv_2.10:1.0.3 --driver-memory 6g --num-executors 8 --executor-memory 6g --total-executor-cores 32 --executor-cores 8 /home/sentiment/Scrivania/test3.R
One node starts as the AM (I think it is chosen randomly) and takes 1 GB of memory and 1 vcore.
After that, ALL nodes have 7 GB of memory and 1 vcore each (except the node that runs the AM, which has 8 GB and 2 cores).
Why do the nodes not acquire 4 cores, as the configuration and the spark-submit options say?
spark-defaults.conf
spark.master spark://server1:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 6g
spark.executor.cores 4
spark.akka.frameSize 1000
spark.yarn.am.cores 4
spark.kryoserializer.buffer.max 700m
spark.kryoserializer.buffer 100m
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>server1:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>server1:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>server1:8050</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>server1:8088</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>4</value>
</property>
</configuration>
Update 1:
I read in an old post that I needed to change the value of the property below from the default to DominantResourceCalculator in capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Added to spark-env.sh:
SPARK_EXECUTOR_CORES=4
Nothing changed.
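One thing worth verifying after editing capacity-scheduler.xml (an assumption here, since the post doesn't say whether YARN was restarted) is that the ResourceManager actually reloaded the change:
# restart YARN so the scheduler re-reads capacity-scheduler.xml
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh
# or ask the ResourceManager to reload the capacity-scheduler queue configuration
yarn rmadmin -refreshQueues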
Update 2:
I read this on the official Spark configuration page, so is 1 core per executor the maximum in YARN mode?
spark.executor.cores: The number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
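For what it's worth, with the default capacity-scheduler resource calculator YARN allocates by memory only and its UI typically reports 1 vcore per container regardless of what was requested, so the Executors tab of the Spark UI is usually a better place to check how many cores each executor really got. A sketch of passing the core settings explicitly on the command line (same script and package as above; the values are illustrative):
./spark-submit --master yarn-client --packages com.databricks:spark-csv_2.10:1.0.3 --driver-memory 6g --num-executors 8 --executor-memory 6g --executor-cores 4 /home/sentiment/Scrivania/test3.R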

Error in Configuring Spark/Shark on DSE

I have installed:
1) scala-2.10.3
2) spark-1.0.0
Changed spark-env.sh with the variables below:
export SCALA_HOME=$HOME/scala-2.10.3
export SPARK_WORKER_MEMORY=16g
I can see the Spark master.
3) shark-0.9.1-bin-hadoop1
Changed shark-env.sh with the variables below:
export SHARK_MASTER_MEM=1g
SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export HIVE_HOME=/usr/share/dse/hive
export HIVE_CONF_DIR="/etc/dse/hive"
export SPARK_HOME=/home/ubuntu/spark-1.0.0
export SPARK_MEM=16g
source $SPARK_HOME/conf/spark-env.sh
4) In DSE, the Hive version is Hive 0.11.
The existing hive-site.xml is:
<configuration>
<!-- Hive Execution Parameters -->
<property>
<name>hive.exec.mode.local.auto</name>
<value>false</value>
<description>Let hive determine whether to run in local mode automatically</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>cfs:///user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>
<property>
<name>hive.hwi.war.file</name>
<value>lib/hive-hwi.war</value>
<description>This sets the path to the HWI war file, relative to ${HIVE_HOME}</description>
</property>
<property>
<name>hive.metastore.rawstore.impl</name>
<value>com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore</value>
<description>Use the Apache Cassandra Hive RawStore implementation</description>
</property>
<property>
<name>hadoop.bin.path</name>
<value>${dse.bin}/dse hadoop</value>
</property>
<!-- Set this to true to enable auto-creation of Cassandra keyspaces as Hive Databases -->
<property>
<name>cassandra.autoCreateHiveSchema</name>
<value>true</value>
</property>
</configuration>
5) While running the Shark shell I get the error:
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
And
6) While running the Shark shell with -skipRddReload, I am able to get a Shark shell but not able to connect to Hive or execute any commands:
shark> DESCRIBE mykeyspace;
gives the error message:
FAILED: Error in metastore: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Please provide details on how to configure Spark/Shark on DataStax Enterprise (Cassandra).
