SparkR-submit core not allocated - apache-spark

Hi I'm working on SparkR on yarn mode.
When I submit an application in this way:
./spark-submit --master yarn-client --packages com.databricks:spark-
csv_2.10:1.0.3 --driver-memory 6g --num-executors 8 --executor-memory 6g
--total-executor-cores 32 --executor-cores 8 /home/sentiment/Scrivania/test3.R
One node start as AM (I think is chosen randomly) and take 1gb of Memory and 1 Vcore.
After that ALL nodes has 7Gb of Memory and 1 Vcore for each one. (Except for the node who starts AM that has 8gb and 2core)
Why nodes do not acquire 4 cores as configuration/spark submit says?
spark-default
spark.master spark://server1:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 6g
spark.executor.cores 4
spark.akka.frameSize 1000
spark.yarn.am.cores 4
spark.kryoserializer.buffer.max 700m
spark.kryoserializer.buffer 100m
Yarn-manager
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>server1:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>server1:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>server1:8050</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>server1:8088</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>4</value>
</property>
</configuration>
Update1:
Read from old post that I needed to change the value of this property below from Default to Dominant at capacity-scheduler.xml
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Added at Spark-env
SPARK_EXECUTOR_CORES=4
Nothing changed.
Update2:
I read this from spark official page, so 1 core for each executor in Yarn mode is the maximum value?
spark.executor.cores The number of cores to use on each executor. For
YARN and standalone mode only. In standalone mode, setting this
parameter allows an application to run multiple executors on the same
worker, provided that there are enough cores on that worker.
Otherwise, only one executor per application will run on each worker.

Related

failed to connect hadoop spark

I am a beginner in spark job and spark configuration
I try to submit a spark job, after a few minutes (job is accepted and running during few minutes) the job fails with a connection refused.
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 2
Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failed to connect to my.domain.com/myIp:portNumber
I also have this error with jobs success
ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
On my computer, with intellij Idea my job turn, this is not a code mistake
I try several time to change configuration in yarn-site.xml and mapred-site.xml
This is a hadoop hdfs cluster, 3 nodes, 2 cores on each node, 8GB RAM on each node, I try to submit with this command line :
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.3 --class MyClass --master yarn --deploy-mode cluster myJar.jar
mapred-site.xml :
<property>
<value>yarn</value>
<name>mapreduce.framework.name</name>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1000</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2000</value>
</property>
yarn-site.xml
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ipadress</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4000</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2000</value>
</property>
spark-default.conf
spark.master yarn
spark.driver.memory 1g
spark.history.fs.update.interval 30s
spark.history.ui.port port
spark.core.connection.ack.wait.timeout 600s
spark.default.parallelism 2
spark.executor.memory 2g
spark.cores.max 2
spark.executor.cores 2

Spark on Yarn Failed to send RPC and Slave lost

I want to deploy spark2.3.2 on Yarn, Hadoop2.7.3.
But when I run:
spark-shell
Always raise ERROR:
ERROR TransportClient:233 - Failed to send RPC 4858956348523471318 to /10.20.42.194:54288: java.nio.channels.ClosedChannelException
...
ERROR YarnScheduler:70 - Lost executor 1 on dc002: Slave lost
Both dc002 and dc003 will raise ERRORs Failed to send RPC and Slave lost.
I have one master node and two slave node server. They all are:
CentOS Linux release 7.5.1804 (Core) with 40 cpu and 62.6GB memory and 31.4 GB swap.
My HADOOP_CONF_DIR:
export HADOOP_CONF_DIR=/home/spark-test/hadoop-2.7.3/etc/hadoop
My /etc/hosts:
10.20.51.154 dc001
10.20.42.194 dc002
10.20.42.177 dc003
In Hadoop and Yarn Web UI, I can see both dc002 and dc003 node, and I can run simple mapreduce task on yarn in hadoop.
But when I run spark-shell or SparkPi example program by
./spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi spark-2.3.2-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.2.jar 10
, ERRORs always raise.
I really want to why those errors happened.
I fixed this problem by changing the yarn-site.xml conf file:
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Try this parameter in you code-
spark.conf.set("spark.dynamicAllocation.enabled", "false")
Secondly while doing spark submit, define parameters like --executor-memory and --num-executors
sample:
spark2-submit --executor-memory 20g --num-executors 15 --class com.executor mapping.jar

How to configure spark thrift user and password

i am trying to connect spark thrift server by beeline, and i started spark thrift as below:
start-thriftserver.sh --master yarn-client --num-executors 2 --conf spark.driver.memory=2g --executor-memory 3g
and the spark conf/hive-site.xml as below:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:node001:3306/hive?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
</property>
<property>
<name>hive.server2.thrift.client.user</name>
<value>root</value>
</property>
<property>
<name>hive.server2.thrift.client.password</name>
<value>123456</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10001</value>
</property>
<property>
<name>hive.security.authorization.enabled</name>
<value>true</value>
<description>enableor disable the hive clientauthorization</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>
when i use beeline cli to access spark thrift server, it hint me input username and passwd, but i didn't input anything, just click enter(no input the username and password configured in hive-site.xml), i also can access spark. How can make the configuration work well.
Thanks :)
beeline> !connect jdbc:hive2://node001:10001
Connecting to jdbc:hive2://node001:10001
Enter username for jdbc:hive2://node001:10001:
Enter password for jdbc:hive2://node001:10001:
18/12/06 16:13:44 INFO Utils: Supplied authorities: node001:10001
18/12/06 16:13:44 INFO Utils: Resolved authority: node001:10001
18/12/06 16:13:45 INFO HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://node001:10001
Connected to: Spark SQL (version 2.4.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://node001:10001> show databases;
+-------------------------+--+
| databaseName |
+-------------------------+-- |
| default |
| test |
+-------------------------+--+
7 rows selected (0.142 seconds)
0: jdbc:hive2://node001:10001>

Enabling dynamic allocation on spark on YARN mode

This question is similar to this but there was no answer.
I am trying to enable dynamic allocation in Spark in YARN mode. I have 11 node cluster with 1 master node and 10 worker nodes. I am following below link for instructions:
For setup in YARN:
http://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
Config variables needs to be set in spark-defaults.conf: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
I have also taken reference from below link and few other resources:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html#spark.dynamicAllocation.testing
Here are the steps I am doing:
Setting up config variables in spark-defaults.conf.
My spark-defaults.conf related to dynamic allocation and shuffle service is as:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
Making changes in yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value> $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/client/*,$HADOOP_MAPRED_HOME/share/hadoop/client/lib/*,/home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar </value>
</property>
All these steps are replicated in all worker nodes i.e spark-defaults.conf has the above mentioned values and yarn-site.xml has these properties. I have made sure that /home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar exists in all worker nodes.
Then I am running $SPARK_HOME/sbin/start-shuffle-service.sh in worker nodes and master node. In master node, I am restarting the YARN using stop-yarn.sh and then start-yarn.sh
Then I am doing YARN node -list -all to see the worker nodes but I am not able to see any node
When I am disabling the property
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
I can see all the worker nodes as normal so it seems like shuffle service is not properly configured.

oozie spark action - how to specify spark-opts

I am running spark job in yarn-client mode via oozie spark action. I need to specify driver and application master related settings. I tried configuring spark-opts as documented by oozie but its not working.
Here's from oozie doc:
Example:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
...
<action name="myfirstsparkjob">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<prepare>
<delete path="${jobOutput}"/>
</prepare>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<master>local[*]</master>
<mode>client<mode>
<name>Spark Example</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
<spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
<arg>inputpath=hdfs://localhost/input/file.txt</arg>
<arg>value=2</arg>
</spark>
<ok to="myotherjob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
In above spark-opts are specified as --executor-memory 20G --num-executors 50
while on the same page in description it says:
"The spark-opts element if present, contains a list of spark options that can be passed to spark driver. Spark configuration options can be passed by specifying '--conf key=value' here"
so according to document it should be --conf executor-memory=20G
which one is right here then? I tried both but it's not seem working. I am running on yarn-client mode so mainly want to setup driver related settings. I think this is the only place I can setup driver settings.
<spark-opts>--driver-memory 10g --driver-java-options "-XX:+UseCompressedOops -verbose:gc" --conf spark.driver.memory=10g --conf spark.yarn.am.memory=2g --conf spark.driver.maxResultSize=10g</spark-opts>
<spark-opts>--driver-memory 10g</spark-opts>
None of the above driver related settings getting set in actual driver jvm. I verified it on linux process info.
reference: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
I did found what's the issue. In yarn-client mode you can't specify driver related parameters using <spark-opts>--driver-memory 10g</spark-opts> because your driver (oozie launcher job) is already launched before that point. It's a oozie launcher (which is a mapreduce job) launches your actual spark and any other job and for that job spark-opts is relevant. But to set driver parameters in yarn-client mode you need to basically configure configuration in oozie workflow:
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx6000m</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.cpu.vcores</name>
<value>24</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>default</value>
</property>
</configuration>
I haven't tried yarn-cluster mode but spark-opts may work for driver setting there. But my question was regarding yarn-client mode.
<spark-opts>--executor-memory 20G</spark-opts> should work ideally.
Also, try using:
<master>yarn-cluster</master>
<mode>cluster</mode>
"Spark configuration options can be passed by specifying '--conf key=value' here " is probably referring the configuration tag.
For Ex:
--conf mapred.compress.map.output=true would translate to:
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
try changing <master>local[*]</master> to <master>yarn</master>

Resources