Setting YARN shuffle for Spark makes spark-shell not start

I have a cluster of four Ubuntu 14.04 machines where I am setting up Spark 2.1.0 (prebuilt for Hadoop 2.7) to run on top of Hadoop 2.7.3, and I am configuring it to work with YARN. Running jps on each node, I get:
node-1
22546 Master
22260 ResourceManager
22916 Jps
21829 NameNode
22091 SecondaryNameNode
node-2
12321 Worker
12485 Jps
11978 DataNode
node-3
15938 Jps
15764 Worker
15431 DataNode
node-4
12251 Jps
12075 Worker
11742 DataNode
Without the YARN shuffle configuration,
./bin/spark-shell --master yarn --deploy-mode client
starts just fine when run on node-1.
To configure the external shuffle service, I read this: http://spark.apache.org/docs/2.1.0/running-on-yarn.html#configuring-the-external-shuffle-service
This is what I did:
Added the following properties to yarn-site.xml:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
  <name>yarn.application.classpath</name>
  <value>/usr/local/spark/spark-2.1.0-bin-hadoop2.7/yarn/spark-2.1.0-yarn-shuffle.jar</value>
</property>
I do have other properties in this file. As I said, leaving these three properties out lets spark-shell --master yarn --deploy-mode client start normally.
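Before restarting the NodeManagers, it can help to confirm that the shuffle-service entries in yarn-site.xml read the way the Spark documentation describes. The following is a minimal sketch, not part of the original setup: it parses an inline copy of the two aux-service properties shown above (the YARN_SITE string and both helper function names are made up for illustration) and reports anything missing. It only checks the file contents; it does not verify that the shuffle jar is actually on the NodeManager classpath.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline copy of the relevant part of yarn-site.xml.
YARN_SITE = """<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
</configuration>"""

SHUFFLE_CLASS = "org.apache.spark.network.yarn.YarnShuffleService"


def load_properties(xml_text):
    """Return the <property> entries of a Hadoop-style config as a dict."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): prop.findtext("value", "").strip()
        for prop in root.iter("property")
    }


def shuffle_service_problems(props):
    """List problems with the shuffle-service entries (empty list if none)."""
    problems = []
    services = [
        s.strip()
        for s in props.get("yarn.nodemanager.aux-services", "").split(",")
    ]
    if "spark_shuffle" not in services:
        problems.append("spark_shuffle missing from yarn.nodemanager.aux-services")
    if props.get("yarn.nodemanager.aux-services.spark_shuffle.class") != SHUFFLE_CLASS:
        problems.append("spark_shuffle class is not " + SHUFFLE_CLASS)
    return problems


print(shuffle_service_problems(load_properties(YARN_SITE)))
```

To check a real file, the inline string could be replaced by the contents of the yarn-site.xml on each node.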
My spark-defaults.conf is:
spark.master spark://singapura:7077
spark.executor.memory 4g
spark.driver.memory 2g
spark.eventLog.enabled true
spark.eventLog.dir hdfs://singapura:8020/spark/logs
spark.history.fs.logDirectory hdfs://singapura:8020/spark/logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.scheduler.mode FAIR
spark.yarn.stagingDir hdfs://singapura:8020/spark
spark.yarn.jars=hdfs://singapura:8020/spark/jars/*.jar
spark.yarn.am.memory 2g
spark.yarn.am.cores 4
All nodes have the same paths. singapura is my node-1. It's already set in my /etc/hosts, and nslookup returns the correct IP. The machine name is not the issue here.
What happens is this: when I add these three properties to my yarn-site.xml and start spark-shell, it hangs without much output.
localuser@singapura:~$ /usr/local/spark/spark-2.1.0-bin-hadoop2.7/bin/spark-shell --master yarn --deploy-mode client
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
I wait, wait and wait and nothing more is printed out. I have to kill it and erase the staging directory (if I don't erase it, I get WARN yarn.Client: Failed to cleanup staging dir the next time I call it).

Related

Spark config, org.apache.spark.shuffle.FetchFailedException Failed to connect

I installed Hadoop 3.1.0 and Spark 2.4.7 on four virtual machines. In total I have 32 cores and 128 GB of memory. I have been running this spark-shell test:
[hadoop@hadoop1 bin]$ hadoop fs -mkdir -p /user/hadoop/testdata
[hadoop@hadoop1 bin]$ hadoop fs -put /app/hadoop/hadoop-2.2.0/etc/hadoop/core-site.xml /user/hadoop/testdata
[hadoop@hadoop1 bin]$ spark-shell --master spark://hadoop1:7077
scala> val rdd = sc.textFile("hdfs://hadoop1:9000/user/hadoop/testdata/core-site.xml")
scala> rdd.cache()
scala> val wordcount = rdd.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
scala> wordcount.take(10)
scala> val wordsort = wordcount.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))
scala> wordsort.take(10)
I have been playing with the following parameters:
spark.core.connection.ack.wait.timeout 600s
spark.default.parallelism 4
spark.driver.memory 6g
spark.executor.memory 6g
spark.cores.max 21
spark.executor.cores 3
and ran into either org.apache.spark.shuffle.FetchFailedException Failed to connect 192.168.0.XXX
or WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Is there a general guide to fine-tune these and any other parameters?
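As a rough starting point for reasoning about these settings, the arithmetic below sketches how many executors the posted values would request in standalone mode and how much memory they would need in total. It assumes the application gets spark.cores.max cores split into executors of spark.executor.cores each. How the 128 GB is divided among the four workers is not stated in the question, so the per-worker figure here is an assumption, and the function name is hypothetical.

```python
def standalone_executor_plan(cores_max, executor_cores, executor_memory_gb):
    """Estimate executor count and total executor memory for one application."""
    executors = cores_max // executor_cores
    return executors, executors * executor_memory_gb


# Values taken from the parameters listed in the question.
executors, total_gb = standalone_executor_plan(
    cores_max=21, executor_cores=3, executor_memory_gb=6
)
print(executors, "executors needing", total_gb, "GB in total")

# Assumption: the 128 GB is spread evenly, i.e. 32 GB per worker.
assumed_worker_memory_gb = 128 // 4
print("executors that fit on one worker by memory:", assumed_worker_memory_gb // 6)
```

If a single executor asks for more memory or cores than any one worker offers, no executor can be launched, which is one common way to end up with the "Initial job has not accepted any resources" warning.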

Spark on Yarn Failed to send RPC and Slave lost

I want to deploy Spark 2.3.2 on YARN with Hadoop 2.7.3.
But when I run:
spark-shell
it always raises these errors:
ERROR TransportClient:233 - Failed to send RPC 4858956348523471318 to /10.20.42.194:54288: java.nio.channels.ClosedChannelException
...
ERROR YarnScheduler:70 - Lost executor 1 on dc002: Slave lost
Both dc002 and dc003 raise the Failed to send RPC and Slave lost errors.
I have one master node and two slave nodes. Each of them runs:
CentOS Linux release 7.5.1804 (Core) with 40 CPUs, 62.6 GB of memory, and 31.4 GB of swap.
My HADOOP_CONF_DIR:
export HADOOP_CONF_DIR=/home/spark-test/hadoop-2.7.3/etc/hadoop
My /etc/hosts:
10.20.51.154 dc001
10.20.42.194 dc002
10.20.42.177 dc003
In the Hadoop and YARN web UIs, I can see both the dc002 and dc003 nodes, and I can run a simple MapReduce task on YARN.
But when I run spark-shell or the SparkPi example program with
./spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi spark-2.3.2-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.2.jar 10
the errors always occur.
I really want to know why these errors happen.
I fixed this problem by changing the yarn-site.xml conf file:
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
Try setting this parameter in your code:
spark.conf.set("spark.dynamicAllocation.enabled", "false")
Second, when running spark-submit, specify parameters such as --executor-memory and --num-executors.
For example:
spark2-submit --executor-memory 20g --num-executors 15 --class com.executor mapping.jar
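The pmem and vmem checks disabled above are the NodeManager limits on a container's physical and virtual memory, so it may help to see how large a container an executor actually requests. The following is a minimal sketch of that arithmetic, assuming the Spark 2.x defaults on YARN (an overhead of 10% of executor memory with a 384 MiB floor) and the default yarn.nodemanager.vmem-pmem-ratio of 2.1. Rounding up to the scheduler's allocation increment is left out, and the function names are made up for illustration.

```python
MIN_OVERHEAD_MB = 384          # floor for the default memory overhead
OVERHEAD_PERCENT = 10          # default overhead as a percentage of executor memory
DEFAULT_VMEM_PMEM_RATIO = 2.1  # default yarn.nodemanager.vmem-pmem-ratio


def executor_container_mb(executor_memory_mb):
    """Physical memory requested for one executor container, in MiB."""
    overhead = max(MIN_OVERHEAD_MB, executor_memory_mb * OVERHEAD_PERCENT // 100)
    return executor_memory_mb + overhead


def vmem_limit_mb(container_mb, ratio=DEFAULT_VMEM_PMEM_RATIO):
    """Virtual memory limit the NodeManager enforces for that container."""
    return container_mb * ratio


# --executor-memory 20g, as in the sample command above.
container = executor_container_mb(20 * 1024)
print("container request (MiB):", container)
print("virtual memory limit (MiB):", vmem_limit_mb(container))
```

Comparing these numbers with the memory each NodeManager advertises is an alternative to switching the checks off entirely.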

Apache Spark 2.4 is not working with hadoop 2.8.3?

I have installed Hadoop 2.8.3 in my Windows 10 environment and it's working fine. Now I am trying to install Apache Spark 2.4.0 with YARN as the cluster manager, and it's not working. When I submit a Spark job with spark-submit for testing, it appears under the ACCEPTED tab in the YARN UI and then fails.
I have attached the image of yarn container logs.
This is my spark-defaults.conf
spark.master yarn
spark.driver.memory 512m
spark.yarn.am.memory 512m
spark.executor.memory 2G
spark.executor.cores 1
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://localhost:9000/spark-logs
spark.history.fs.update.interval 10s
spark.history.ui.port 18080

Spark ignores SPARK_WORKER_MEMORY?

I'm using Spark 1.5.2 in standalone cluster mode.
Even though I'm setting SPARK_WORKER_MEMORY in spark-env.sh, it looks like this setting is ignored.
I can't find any indication in the scripts under bin/ or sbin/ that -Xms/-Xmx are set.
If I run ps on the worker PID, it looks like the memory is set to 1G:
[hadoop@sl-env1-hadoop1 spark-1.5.2-bin-hadoop2.6]$ ps -ef | grep 20232
hadoop 20232 1 0 02:01 ? 00:00:22 /usr/java/latest//bin/java
-cp /workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/sbin/../conf/:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/workspace/
3rd-party/hadoop/2.6.3//etc/hadoop/ -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker
--webui-port 8081 spark://10.52.39.92:7077
spark-defaults.conf:
spark.master spark://10.52.39.92:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 2g
spark.executor.cores 1
spark-env.sh:
export SPARK_MASTER_IP=10.52.39.92
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=12g
Am I missing something?
Thanks.
When using spark-shell or spark-submit, use the --executor-memory option.
When configuring it for a standalone jar, set the system property programmatically before creating the spark context.
System.setProperty("spark.executor.memory", executorMemory)
You are using the wrong setting in cluster mode.
SPARK_EXECUTOR_MEMORY is the right option to set executor memory in cluster mode.
SPARK_WORKER_MEMORY works only in standalone deploy mode.
Another way to set executor memory from the command line: -Dspark.executor.memory=2g
Have a look at one more related SE question regarding these settings:
Spark configuration, what is the difference of SPARK_DRIVER_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_WORKER_MEMORY?
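To make the distinction concrete: in standalone mode SPARK_WORKER_MEMORY is the total memory a worker may hand out to executors, while the heap of the worker daemon itself is controlled separately by SPARK_DAEMON_MEMORY (default 1g), which would be consistent with the -Xms1g -Xmx1g seen in the ps output. The sketch below uses the values from the question and a hypothetical helper name to estimate how many executors one worker could host by memory alone; core limits are ignored.

```python
def executors_per_worker(worker_memory_gb, executor_memory_gb):
    """Upper bound on executors per worker, considering memory only."""
    return worker_memory_gb // executor_memory_gb


# SPARK_WORKER_MEMORY=12g and spark.executor.memory 2g, as posted in the question.
print(executors_per_worker(worker_memory_gb=12, executor_memory_gb=2))  # 6
```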
This is my configuration for cluster mode, in spark-defaults.conf:
spark.driver.memory 5g
spark.executor.memory 6g
spark.executor.cores 4
Do you have something like this?
If you don't add these settings (with your own values), each Spark executor gets 1 GB of RAM by default.
Otherwise you can pass these options to ./spark-submit like this:
# Run on a YARN cluster (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
While an application is running, check the master web UI at (IP or name of master):8080 to see whether resources have been allocated correctly.
I've encountered the same problem as yours. The reason is that, in standalone mode, spark.executor.memory is actually ignored. What has an effect is spark.driver.memory, because the executor lives inside the driver.
So what you can do is set spark.driver.memory as high as you need.
This is where I've found the explanation:
How to set Apache Spark Executor memory

Running a simple Spark script on Mesos with Zookeeper

I want to run a simple Spark program, but I am blocked by some errors.
My environment is:
CentOS:6.6
Java: 1.7.0_51
Scala: 2.10.4
Spark: spark-1.4.0-bin-hadoop2.6
Mesos: 0.22.1
Everything is installed and the nodes are up. I have one Mesos master and one Mesos slave node. My Spark properties are below:
spark.app.id 20150624-185838-2885789888-5050-1291-0005
spark.app.name Spark shell
spark.driver.host 192.168.1.172
spark.driver.memory 512m
spark.driver.port 46428
spark.executor.id driver
spark.executor.memory 512m
spark.executor.uri http://192.168.1.172:8080/spark-1.4.0-bin-hadoop2.6.tgz
spark.externalBlockStore.folderName spark-91aafe3b-01a8-4c86-ac3b-999e278807c5
spark.fileserver.uri http://192.168.1.172:51240
spark.jars
spark.master mesos://zk://192.168.1.172:2181/mesos
spark.mesos.coarse true
spark.repl.class.uri http://192.168.1.172:51600
spark.scheduler.mode FIFO
Now when I start Spark, I get to the Scala prompt (scala>).
After that I get the following error: mesos task 1 is now TASK_FAILED, blacklisting mesos slave value due to too many failures; is Spark installed on it?
How can I resolve this?
With only 900MB and spark.driver.memory = 512m, you will be able to launch the scheduler/REPL, but you won't have enough memory for spark.executor.memory = 512m, so any tasks will fail. Either increasing your VM memory size or reducing the driver/executor memory requirements will help you get around these memory limits.
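The arithmetic behind this can be sketched as follows. The 900 MB figure and the two 512m settings come from the discussion above; the function name is made up, and any extra per-process overhead is ignored, so the real shortfall would be somewhat larger.

```python
def memory_shortfall_mb(available_mb, driver_mb, executor_mb):
    """Return how many MB are missing (0 if driver and executor both fit)."""
    return max(0, driver_mb + executor_mb - available_mb)


print(memory_shortfall_mb(available_mb=900, driver_mb=512, executor_mb=512))  # 124
```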
Could you check the Mesos slave logs and task information for more output on why the task failed? You could have a look at the web UI on port 5050.
Probably an unrelated question: do you actually have ZooKeeper running, given this setting
spark.master mesos://zk://192.168.1.172:2181/mesos
(as you mentioned, you only have one master)?

Resources