Spark on Yarn Failed to send RPC and Slave lost - apache-spark

I want to deploy Spark 2.3.2 on YARN (Hadoop 2.7.3).
But when I run:
spark-shell
it always raises errors like:
ERROR TransportClient:233 - Failed to send RPC 4858956348523471318 to /10.20.42.194:54288: java.nio.channels.ClosedChannelException
...
ERROR YarnScheduler:70 - Lost executor 1 on dc002: Slave lost
Both dc002 and dc003 raise the "Failed to send RPC" and "Slave lost" errors.
I have one master node and two slave nodes. All of them run:
CentOS Linux release 7.5.1804 (Core) with 40 CPUs, 62.6 GB memory and 31.4 GB swap.
My HADOOP_CONF_DIR:
export HADOOP_CONF_DIR=/home/spark-test/hadoop-2.7.3/etc/hadoop
My /etc/hosts:
10.20.51.154 dc001
10.20.42.194 dc002
10.20.42.177 dc003
In the Hadoop and YARN web UIs I can see both the dc002 and dc003 nodes, and I can run a simple MapReduce job on YARN.
But when I run spark-shell or the SparkPi example program with
./spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi spark-2.3.2-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.2.jar 10
the errors above are always raised.
I really want to know why those errors happen.

I fixed this problem by changing the yarn-site.xml conf file:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
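The change has to be visible to every NodeManager, so (assuming the paths from the question and passwordless ssh to the slaves) applying it would look roughly like this:
# copy the edited config to both slave nodes
scp /home/spark-test/hadoop-2.7.3/etc/hadoop/yarn-site.xml dc002:/home/spark-test/hadoop-2.7.3/etc/hadoop/
scp /home/spark-test/hadoop-2.7.3/etc/hadoop/yarn-site.xml dc003:/home/spark-test/hadoop-2.7.3/etc/hadoop/
# restart YARN so the NodeManagers pick up the new memory-check settings
/home/spark-test/hadoop-2.7.3/sbin/stop-yarn.sh
/home/spark-test/hadoop-2.7.3/sbin/start-yarn.sh
The usual reason disabling these checks helps is that YARN was killing executor containers that exceeded their (virtual) memory allowance, which shows up on the driver side as the ClosedChannelException and "Slave lost" errors.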

Try this parameter in your code:
spark.conf.set("spark.dynamicAllocation.enabled", "false")
Secondly, while doing spark-submit, define parameters like --executor-memory and --num-executors explicitly.
Sample:
spark2-submit --executor-memory 20g --num-executors 15 --class com.executor mapping.jar
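If you would rather not hard-code the setting, the same thing can be passed to spark-submit directly; a sketch with placeholder values (adjust them to your cluster):
./spark-submit --deploy-mode client \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 2 --executor-memory 4g --executor-cores 4 \
  --class org.apache.spark.examples.SparkPi \
  spark-2.3.2-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.2.jar 10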

Related

Spark Remote execution to Cluster fails - HDFS connection Refused at 8020

I am having issues submitting a spark-submit remote job from a machine outside of the Spark cluster running on YARN.
Exception in thread "main" java.net.ConnectionException: Call from remote.dev.local/192.168.10.65 to target.dev.local:8020 failed on connection exception: java.net.ConnectionException: Connection Refused
In my core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://target.dev.local:8020</value>
</property>
Also, in my hdfs-site.xml in the cluster I have disabled permission checking for HDFS:
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
Also, when I telnet from the machine outside the cluster:
telnet target.dev.local 8020
I am getting
telnet: connect to address 192.168.10.186: Connection Refused
But, when I
telnet target.dev.local 9000
it says Connected.
Also when I ping target.dev.local it works.
My spark-submit script from the remote machine is:
export HADOOP_CONF_DIR=/<path_to_conf_dir_copied_from_cluster>/
spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 5g \
--executor-memory 50g \
--executor-cores 5 \
--queue default \
<path to jar>.jar \
10
What am I missing here?
Turns out I had to change
<property>
<name>fs.defaultFS</name>
<value>hdfs://target.dev.local:8020</value>
</property>
to
<property>
<name>fs.defaultFS</name>
<value>hdfs://0.0.0.0:8020</value>
</property>
to allow connections from the outside, since target.dev.local sits on a private network switch.
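A quick way to verify which address the NameNode RPC port is bound to (assuming you can run these on the NameNode host) is:
# print the configured default filesystem
hdfs getconf -confKey fs.defaultFS
# show which interface port 8020 is listening on (0.0.0.0 means all interfaces)
netstat -tlnp | grep 8020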

Setting yarn shuffle for spark makes spark-shell not start

I have a cluster of 4 Ubuntu 14.04 machines where I am setting up Spark 2.1.0 (prebuilt for Hadoop 2.7) to run on top of Hadoop 2.7.3, configured to work with YARN. Running jps on each node I get:
node-1
22546 Master
22260 ResourceManager
22916 Jps
21829 NameNode
22091 SecondaryNameNode
node-2
12321 Worker
12485 Jps
11978 DataNode
node-3
15938 Jps
15764 Worker
15431 DataNode
node-4
12251 Jps
12075 Worker
11742 DataNode
Without the yarn shuffle configuration,
./bin/spark-shell --master yarn --deploy-mode client
starts just fine when called on node-1.
In order to configure an external shuffle service, I read this: http://spark.apache.org/docs/2.1.0/running-on-yarn.html#configuring-the-external-shuffle-service
And what I have done is:
Added the following properties to yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/spark/spark-2.1.0-bin-hadoop2.7/yarn/spark-2.1.0-yarn-shuffle.jar</value>
</property>
I do have other properties in this file. Leaving these 3 properties out, as I said, lets spark-shell --master yarn --deploy-mode client start normally.
My spark-defaults.conf is:
spark.master spark://singapura:7077
spark.executor.memory 4g
spark.driver.memory 2g
spark.eventLog.enabled true
spark.eventLog.dir hdfs://singapura:8020/spark/logs
spark.history.fs.logDirectory hdfs://singapura:8020/spark/logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.scheduler.mode FAIR
spark.yarn.stagingDir hdfs://singapura:8020/spark
spark.yarn.jars=hdfs://singapura:8020/spark/jars/*.jar
spark.yarn.am.memory 2g
spark.yarn.am.cores 4
All nodes have the same paths. singapura is my node-1. It's already set in my /etc/hosts and nslookup gets the correct ip. The machine name is not the issue here.
So, what happens is: when I add these 3 properties to my yarn-site.xml and start spark-shell, it gets stuck without much output.
localuser#singapura:~$ /usr/local/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-shell --master yarn --deploy-mode client
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
I wait, wait and wait and nothing more is printed out. I have to kill it and erase the staging directory (if I don't erase it, I get WARN yarn.Client: Failed to cleanup staging dir the next time I call it).
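One step from the linked docs that does not appear in the question is putting the shuffle jar on each NodeManager's own classpath (yarn.application.classpath is only handed to application containers, not to the NodeManager JVM) and restarting the NodeManagers. A sketch, assuming Hadoop lives under /usr/local/hadoop:
# on every node that runs a NodeManager (node-2, node-3, node-4)
cp /usr/local/spark/spark-2.1.0-bin-hadoop2.7/yarn/spark-2.1.0-yarn-shuffle.jar \
   /usr/local/hadoop/share/hadoop/yarn/lib/
# restart the NodeManager so it can load org.apache.spark.network.yarn.YarnShuffleService
/usr/local/hadoop/sbin/yarn-daemon.sh stop nodemanager
/usr/local/hadoop/sbin/yarn-daemon.sh start nodemanager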

Error spark-assembly-1.4.1-hadoop2.6.0.jar does not exist

I'm trying to submit a Spark app from my local machine's terminal to my cluster. I'm using --master yarn-cluster. I need to run the driver program on my cluster too, not on the machine I submit the application from, i.e. my local machine.
I'm using
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
and getting error
Diagnostics: java.io.FileNotFoundException: File
file:/Users/nish1013/Dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar
does not exist
I can see in my service list:
YARN + MapReduce2 2.7.1.2.3 Apache Hadoop NextGen MapReduce (YARN)
Spark 1.4.1.2.3 Apache Spark is a fast and general engine for
large-scale data processing.
already installed.
My spark-env.sh on the local machine has:
export HADOOP_CONF_DIR=/Users/nish1013/Dev/hadoop-2.7.1/etc/hadoop
Has anyone encountered something similar before?
I think the right command to call is like the following:
bin/spark-submit
--class com.my.application.XApp
--master yarn-cluster --executor-memory 100m
--num-executors 50 --conf spark.yarn.jars=hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
1000
or you can add
spark.yarn.jars hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar
in your spark-defaults.conf file.
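Note that the property name depends on the Spark version: 1.x releases such as the 1.4.1 used here document spark.yarn.jar (singular), while spark.yarn.jars only appeared in Spark 2.0, so for this setup the spark-defaults.conf line would presumably be:
spark.yarn.jar hdfs://name.node.server:8020/user/root/x-service-1.0.0-201512141101-assembly.jar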

Cannot submit Spark app to cluster, stuck on "UNDEFINED"

I use this command to submit a Spark application to the YARN cluster:
export YARN_CONF_DIR=conf
bin/spark-submit --class "Mining"
--master yarn-cluster
--executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar
In the Web UI, it is stuck on UNDEFINED.
In the console, it is stuck at:
14/11/12 16:37:55 INFO yarn.Client: Application report from ASM:
application identifier: application_1415704754709_0017
appId: 17
clientToAMToken: null
appDiagnostics:
appMasterHost: example.com
appQueue: default
appMasterRpcPort: 0
appStartTime: 1415784586000
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://example.com:8088/proxy/application_1415704754709_0017/
appUser: rain
Update:
Diving into the logs for the container in the web UI (http://example.com:8042/node/containerlogs/container_1415704754709_0017_01_000001/rain/stderr/?start=0), I found this:
14/11/12 02:11:47 WARN YarnClusterScheduler: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered
and have sufficient memory
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain sending #24418
14/11/12 02:11:47 DEBUG Client: IPC Client (1211012646) connection to
spark.mvs.vn/192.168.64.142:8030 from rain got value #24418
I found that this problem has a solution here: http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/
The Hadoop cluster must have sufficient memory for the request.
For example, submitting the following job with 1GB memory allocated for
executor and Spark driver fails with the above error in the HDP 2.1 Sandbox.
Reduce the memory asked for the executor and the Spark driver to 512m and
re-start the cluster.
I'm trying this solution and hopefully it will work.
Solution
Finally I found that it was caused by a memory problem.
It worked when I changed yarn.nodemanager.resource.memory-mb to 3072 (its value was 2048) in the web UI and restarted the cluster.
I'm very happy to see this.
With 3 GB in the YARN NodeManager, my submit command is:
bin/spark-submit
--class "Mining"
--master yarn-cluster
--executor-memory 512m
--driver-memory 512m
--num-executors 2
--executor-cores 1
./target/scala-2.10/mining-assembly-0.1.jar
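For reference, the equivalent change made directly in yarn-site.xml (instead of through the web UI), followed by a cluster restart, would look like this:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3072</value>
</property>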

how to : spark yarn cluster

I have set up a Hadoop cluster with 3 machines, one master and 2 slaves.
On the master I have installed Spark:
SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true sbt/sbt clean assembly
Added HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop to spark-env.sh.
Then I ran SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar
I checked localhost:8088 and I saw the application SparkPi running.
Is it just this, or should I install Spark on the 2 slave machines?
How can I get all the machines started?
Is there any help doc out there? I feel like I am missing something.
In Spark standalone mode we start the master and worker:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
I also wanted to know how to get more than one worker running in this case as well,
and I know we can configure slaves in conf/slaves, but can anyone share an example?
Please help, I am stuck.
Assuming you're using Spark 1.1.0, as it says in the documentation (http://spark.apache.org/docs/1.1.0/submitting-applications.html#master-urls), for the master parameter you can use the values yarn-cluster or yarn-client. You do not need to use the --deploy-mode parameter in that case.
You do not have to install Spark on all the YARN nodes. That is what YARN is for: to distribute your application (in this case Spark) over a Hadoop cluster.
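So, with Spark 1.1.0 the submit command from the question could drop --deploy-mode and use the yarn-cluster master value, along these lines (same jar and resources as above):
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar \
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop \
./bin/spark-submit --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 \
  examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar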
