How can the startup time of a Spark application on a YARN cluster be shortened? - apache-spark

I'm managing a system that submits a Spark application to a YARN cluster in client mode. It takes about 30 seconds for a submitted Spark application to become ready on the YARN cluster, which seems a bit slow to me.
Are there any ways to shorten the startup time of a Spark application on a YARN cluster in client mode?
Here are the specs of my environment:
- Spark version: 2.1.1 (the actual version is 2.1.1.2.6.2.0-205 which is provided by Hortonworks)
- YARN version: 2.7.3 (the actual version is 2.7.3.2.6.2.0-205 which is provided by Hortonworks)
- Number of ResourceManagers (RM): 2 (HA)
- Number of NodeManagers (NM): 300
Here are the sample command and its output that I use to check the startup time of a Spark application:
$ /usr/hdp/current/spark2-client/bin/spark-shell -v --master yarn --deploy-mode client --driver-memory 4g --conf spark.default.parallelism=180 --conf spark.executor.cores=6 --conf spark.executor.instances=30 --conf spark.executor.memory=6g --conf spark.yarn.am.cores=4 --conf spark.yarn.containerLauncherMaxThreads=30 --conf spark.yarn.dist.archives=/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr --conf spark.yarn.dist.files=/etc/spark2/conf/hive-site.xml --proxy-user xxxx
Using properties file: /usr/hdp/current/spark2-client/conf/spark-defaults.conf
Adding default property: spark.history.kerberos.keytab=/etc/security/keytabs/spark.headless.keytab
Adding default property: spark.history.fs.logDirectory=hdfs:///spark2-history/
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
Adding default property: spark.yarn.queue=default
Adding default property: spark.yarn.historyServer.address=xxxx
Adding default property: spark.history.kerberos.principal=xxxx@xxxx
Adding default property: spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
Adding default property: spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
Adding default property: spark.eventLog.dir=hdfs:///spark2-history/
Adding default property: spark.history.ui.port=18081
Adding default property: spark.history.kerberos.enabled=true
Parsed arguments:
master yarn
deployMode client
executorMemory 6g
executorCores 6
totalExecutorCores null
propertiesFile /usr/hdp/current/spark2-client/conf/spark-defaults.conf
driverMemory 4g
driverCores null
driverExtraClassPath null
driverExtraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
driverExtraJavaOptions null
supervise false
queue null
numExecutors 30
files null
pyFiles null
archives null
mainClass org.apache.spark.repl.Main
primaryResource spark-shell
name Spark shell
childArgs []
jars null
packages null
packagesExclusions null
repositories null
verbose true
Spark properties used, including those specified through
--conf and those from the properties file /usr/hdp/current/spark2-client/conf/spark-defaults.conf:
(spark.history.kerberos.enabled,true)
(spark.yarn.queue,default)
(spark.default.parallelism,180)
(spark.executor.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.history.kerberos.principal,xxxx@xxxx)
(spark.executor.memory,6g)
(spark.driver.memory,4g)
(spark.driver.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.executor.instances,30)
(spark.yarn.historyServer.address,xxxx)
(spark.eventLog.enabled,true)
(spark.yarn.dist.files,/etc/spark2/conf/hive-site.xml)
(spark.history.ui.port,18081)
(spark.history.provider,org.apache.spark.deploy.history.FsHistoryProvider)
(spark.history.fs.logDirectory,hdfs:///spark2-history/)
(spark.yarn.am.cores,4)
(spark.yarn.containerLauncherMaxThreads,30)
(spark.history.kerberos.keytab,/etc/security/keytabs/spark.headless.keytab)
(spark.yarn.dist.archives,/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr)
(spark.eventLog.dir,hdfs:///spark2-history/)
(spark.executor.cores,6)
Main class:
org.apache.spark.repl.Main
Arguments:
System properties:
(spark.yarn.queue,default)
(spark.history.kerberos.enabled,true)
(spark.default.parallelism,180)
(spark.executor.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.history.kerberos.principal,xxxx@xxxx)
(spark.driver.memory,4g)
(spark.executor.memory,6g)
(spark.executor.instances,30)
(spark.driver.extraLibraryPath,/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64)
(spark.yarn.historyServer.address,xxxx)
(spark.eventLog.enabled,true)
(spark.yarn.dist.files,file:/etc/spark2/conf/hive-site.xml)
(spark.history.ui.port,18081)
(SPARK_SUBMIT,true)
(spark.history.provider,org.apache.spark.deploy.history.FsHistoryProvider)
(spark.app.name,Spark shell)
(spark.history.fs.logDirectory,hdfs:///spark2-history/)
(spark.yarn.am.cores,4)
(spark.yarn.containerLauncherMaxThreads,30)
(spark.jars,)
(spark.history.kerberos.keytab,/etc/security/keytabs/spark.headless.keytab)
(spark.submit.deployMode,client)
(spark.yarn.dist.archives,file:/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr)
(spark.eventLog.dir,hdfs:///spark2-history/)
(spark.master,yarn)
(spark.executor.cores,6)
Classpath elements:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xxxx:4040
Spark context available as 'sc' (master = yarn, app id = application_1519898793082_6925).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.2.0-205
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Thanks.

Related

Can't create spark session using yarn inside kubernetes pod

I have a Kubernetes pod with the Spark client installed.
bash-4.2# spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.2.0-205
      /_/
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
Branch HEAD
Compiled by user jenkins on 2017-08-26T09:32:23Z
Revision a2efc34efde0fd268a9f83ea1861bd2548a8c188
Url git@github.com:hortonworks/spark2.git
Type --help for more information.
bash-4.2#
I can submit a spark job successfully under client and cluster mode using these commands:
${SPARK_HOME}/bin/spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYTHONPATH:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.4-src.zip --master yarn --deploy-mode client --num-executors 50 --executor-cores 4 --executor-memory 3G --driver-memory 6G my_python_script.py --config=configurations/sandbox.yaml --startdate='2019-01-01' --enddate='2019-08-01'
${SPARK_HOME}/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 ${SPARK_HOME}/lib/spark-examples*.jar 10
But whenever I start a session using any of these:
spark-shell --master yarn
pyspark --master yarn
It hangs and times out with this error:
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
We have another Python script that needs to create a Spark session. The code in that script is:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
conf.setAll(configs.items())
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
Not sure where else to check. This is the first time we are initiating a Spark connection from inside a Kubernetes cluster. Getting a Spark session inside a normal virtual machine works fine. I'm not sure what the difference is in terms of network connectivity. It also puzzles me that I was able to submit a Spark job as shown above but am unable to create a Spark session.
Any thoughts and ideas are highly appreciated. Thanks in advance.
In client mode the Spark driver process runs on your machine and the executors run on YARN nodes (spark-shell and pyspark start client-mode sessions). For the driver and executor processes to communicate, they must be able to reach each other over the network in both directions.
Since submitting jobs in cluster mode works for you and you can reach the YARN master from the Kubernetes pod network, that route is fine.
Most probably you don't have network access from the YARN cluster network to the pod, which most likely lives within the Kubernetes private network unless exposed explicitly. This is the first thing I would recommend checking, along with the YARN logs (see the command sketch below).
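The YARN-side logs for the failed attempt can be pulled with the standard YARN CLI; a minimal sketch, where the application id is a placeholder for whichever id YARN reports:
yarn logs -applicationId application_XXXXXXXXXXXXX_NNNN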
After you expose the pod so it is reachable from the YARN cluster network, you may want to use the following Spark configs to set up the bindings (see the sketch after this list):
- spark.driver.host
- spark.driver.port
- spark.driver.bindAddress
- spark.blockManager.port
You can find their descriptions in the Spark configuration docs.
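As a minimal sketch of how those configs could be passed, the address and ports below are placeholders chosen for illustration, not values from this question:
spark-shell --master yarn --deploy-mode client \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.host=<routable-pod-address> \
  --conf spark.driver.port=40000 \
  --conf spark.blockManager.port=40001
With these set, the YARN containers connect back to the advertised driver address on fixed ports, which you can then expose from the pod (for example via a Kubernetes service).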

Apache Spark installation error.

I am able to install Apache Spark with the following set of commands on Ubuntu 16:
dpkg -i scala-2.12.1.deb
mkdir /opt/spark
tar -xvf spark-2.0.2-bin-hadoop2.7.tgz
cp -rv spark-2.0.2-bin-hadoop2.7/* /opt/spark
cd /opt/spark
Executing the Spark shell worked well:
./bin/spark-shell --master local[2]
It returned this output on the shell:
jai#jaiPC:/opt/spark$ ./bin/spark-shell --master local[2]
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/05/15 19:00:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/15 19:00:55 WARN Utils: Your hostname, jaiPC resolves to a loopback address: 127.0.1.1; using 172.16.16.46 instead (on interface enp4s0)
18/05/15 19:00:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/05/15 19:00:55 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.16.16.46:4040
Spark context available as 'sc' (master = local[2], app id = local-1526391055793).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
But when I tried to access the Spark context Web UI at http://172.16.16.46:4040, it shows:
The page cannot be displayed
How can I resolve this problem?
Please help.
Thanks and regards

I want to run spark shell in client mode?

Spark context available as 'sc' (master = yarn, app id = application_1519491124804_0002).
I need master = yarn-client
error:
Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/24 22:27:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/24 22:27:29 WARN Utils: Your hostname, suraj resolves to a loopback address: 127.0.1.1; using 192.168.43.193 instead (on interface wlan0)
18/02/24 22:27:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/02/24 22:27:32 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://192.168.43.193:4040
Spark context available as 'sc' (master = yarn, app id = application_1519491124804_0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.
I need master = yarn-client
In Spark 2.x, master = yarn-client is deprecated.
spark-shell --master yarn --deploy-mode client is the correct way to run the shell.
The default deploy mode is client, so the flag can even be omitted; if you prefer configuration over command-line flags, see the sketch below.
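A minimal sketch of setting the same defaults in conf/spark-defaults.conf instead of on the command line, using the standard spark.master and spark.submit.deployMode properties:
spark.master             yarn
spark.submit.deployMode  client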

spark-shell, dependency jars and class not found exception

I'm trying to run my Spark app in spark-shell. Here is what I tried (and many more variants after hours of reading on this error), but none seem to work.
spark-shell --class my_home.myhome.RecommendMatch —jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar,/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
What I get instead is:
java.lang.ClassNotFoundException: my_home.myhome.RecommendMatch
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Any ideas please? Thanks!
UPDATE:
I found that the jars must be colon (:) separated and not comma (,) separated, as described in several articles/docs.
spark-shell --class my_home.myhome.RecommendMatch —jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
However, now the errors have changed. Note that ls -la finds the paths, although the following lines complain that they don't exist. Bizarre...
Warning: Local jar /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar does not exist, skipping.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
java.lang.SecurityException: Invalid signature file digest for Manifest main attributes
at sun.security.util.SignatureFileVerifier.processImpl(SignatureFileVerifier.java:314)
at sun.security.util.SignatureFileVerifier.process(SignatureFileVerifier.java:268)
UPDATE 2:
spark-shell —class my_home.myhome.RecommendMatch —-jars “/Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar:/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar”
The above command yields the following on spark-shell.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/05/16 01:19:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/05/16 01:19:13 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.0.101:4040
Spark context available as 'sc' (master = local[*], app id = local-1494877749685).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :load my_home.myhome.RecommendMatch
That file does not exist
scala> :load RecommendMatch
That file does not exist
scala> :load my_home.myhome.RecommendMatch.scala
That file does not exist
scala> :load RecommendMatch.scala
That file does not exist
The jars don't seem to be loaded :( based on what I see at http://localhost:4040/environment/
The URLs supplied to --jars must be separated by commas. Your first command is correct.
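For example, a sketch of that first command with a plain double-dash --jars flag and comma-separated paths (assuming my_home.myhome.RecommendMatch is packaged in the first jar):
spark-shell --jars /Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar,/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar
Once the shell is up, the class should be importable with import my_home.myhome.RecommendMatch, and the jars should show up at http://localhost:4040/environment/.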
You also have to pass the application jar as the last parameter to spark-submit. Let's say my_home.myhome.RecommendMatch is part of the myhome-0.0.1-SNAPSHOT.jar file.
spark-submit --class my_home.myhome.RecommendMatch \
--jars "/Users/anon/Documents/Works/sparkworkspace/myhome/target/original-myhome-0.0.1-SNAPSHOT.jar" \
/Users/anon/Documents/Works/sparkworkspace/myhome/target/myhome-0.0.1-SNAPSHOT.jar

NoSuchMethodError using Databricks Spark-Avro 3.2.0

I have a Spark master and worker running in Docker containers with Spark 2.0.2 and Hadoop 2.7. I'm trying to submit a job from pyspark in a different container (same network) by running:
df = spark.read.json("/data/test.json")
df.write.format("com.databricks.spark.avro").save("/data/test.avro")
But I'm getting this error:
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
It makes no difference if I try interactively or with spark-submit. These are my loaded packages in spark:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.7 from central in [default]
org.apache.avro#avro;1.8.1 from central in [default]
org.apache.commons#commons-compress;1.8.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
org.tukaani#xz;1.5 from central in [default]
org.xerial.snappy#snappy-java;1.1.1.3 from central in [default]
spark-submit --version output:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/
Branch
Compiled by user jenkins on 2016-11-08T01:39:48Z
Revision
Url
Type --help for more information.
scala version is 2.11.8
My pyspark command:
PYSPARK_PYTHON=ipython /usr/spark-2.0.2/bin/pyspark --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
My spark-submit command:
spark-submit script.py --master spark://master:7077 --packages com.databricks:spark-avro_2.11:3.2.0,org.apache.avro:avro:1.8.1
I've read that this can be caused by "an older version of avro being used", so I tried using 1.8.1, but I keep getting the same error. Reading Avro works fine. Any help?
The cause of this error is that Apache Avro version 1.7.4 is included in Hadoop by default, and if the SPARK_DIST_CLASSPATH env variable includes the Hadoop common libs ($HADOOP_HOME/share/hadoop/common/lib/) before the ivy2 jars, the wrong version can get used instead of the version required by spark-avro (>= 1.7.6) and installed in ivy2.
To check if this is the case, open a spark-shell and run
sc.getClass().getResource("/org/apache/avro/generic/GenericData.class")
This should tell you the location of the class like so:
java.net.URL = jar:file:/lib/ivy/jars/org.apache.avro_avro-1.7.6.jar!/org/apache/avro/generic/GenericData.class
If that class is pointing to $HADOOP_HOME/share/hadoop/common/lib/, then you simply need to include your ivy2 jars before the Hadoop common libs in the SPARK_DIST_CLASSPATH env variable.
For example, in a Dockerfile:
ENV SPARK_DIST_CLASSPATH="/home/root/.ivy2/*:$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
Note: /home/root/.ivy2 is the default location for ivy2 jars. You can change that by setting spark.jars.ivy in your spark-defaults.conf, which is probably a good idea.
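A one-line sketch of that setting in spark-defaults.conf, using a hypothetical directory of my choosing:
spark.jars.ivy    /opt/ivy2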
I have encountered a similar problem before.
Try using the --jars {path to spark-avro_2.11-3.2.0.jar} option with spark-submit.
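For example, a sketch assuming the jar has been downloaded to a hypothetical local path /jars/spark-avro_2.11-3.2.0.jar:
spark-submit --master spark://master:7077 --jars /jars/spark-avro_2.11-3.2.0.jar script.py
This ships the jar directly with the application instead of resolving it through --packages.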
