Can't create spark session using yarn inside kubernetes pod

I have a kubernetes pod with spark client installed.
bash-4.2# spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.2.0-205
      /_/
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
Branch HEAD
Compiled by user jenkins on 2017-08-26T09:32:23Z
Revision a2efc34efde0fd268a9f83ea1861bd2548a8c188
Url git@github.com:hortonworks/spark2.git
Type --help for more information.
bash-4.2#
I can submit a spark job successfully under client and cluster mode using these commands:
${SPARK_HOME}/bin/spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYTHONPATH:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.4-src.zip --master yarn --deploy-mode client --num-executors 50 --executor-cores 4 --executor-memory 3G --driver-memory 6G my_python_script.py --config=configurations/sandbox.yaml --startdate='2019-01-01' --enddate='2019-08-01'
${SPARK_HOME}/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 ${SPARK_HOME}/lib/spark-examples*.jar 10
But whenever I start a session using any of these:
spark-shell --master yarn
pyspark --master yarn
It hangs and times out with this error:
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
We have another python script that needs to create a spark session. The code on that script is:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# configs is assumed to be a dict of Spark settings defined elsewhere in the script (not shown)
conf = SparkConf()
conf.setAll(configs.items())
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
I'm not sure where else to check. This is the first time we are initiating a Spark connection from inside a Kubernetes cluster; getting a Spark session inside a normal virtual machine works fine, and I'm not sure what the difference is in terms of network connectivity. It also puzzles me that I was able to submit the Spark jobs above but am unable to create a Spark session.
Any thoughts and ideas are highly appreciated. Thanks in advance.

In client mode the Spark driver process runs on your machine and the executors run on YARN nodes (spark-shell and pyspark start client-mode sessions). To communicate, the driver and executor processes must be able to connect to each other over the network in both directions.
Since submitting jobs in cluster mode works for you, you can reach the YARN master from the Kubernetes Pod network, so that route is fine.
Most probably you don't have network access from the YARN cluster network to the Pod, which likely lives inside the Kubernetes private network unless it has been exposed explicitly. This is the first thing I would recommend checking, along with the YARN logs.
After you expose the Pod so that it is reachable from the YARN cluster network, you may want to refer to the following Spark configs to set up the bindings:
- spark.driver.host
- spark.driver.port
- spark.driver.bindAddress
- spark.blockManager.port
Their descriptions are in the Spark configuration docs.
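As a rough sketch of how those four settings might be assembled for the session code in the question: the helper below only builds the key/value pairs. The host name and port numbers are placeholders (assumptions, not values from the question), and the advertised address has to be one the YARN nodes can actually route to.

```python
# Sketch only: collects the driver-side networking settings for a client-mode
# session started from inside a pod. Host name and ports are hypothetical.

def driver_network_settings(advertised_host, driver_port, block_manager_port,
                            bind_address="0.0.0.0"):
    """Return the Spark settings that control how executors reach the driver.

    advertised_host: address the YARN nodes use to reach the pod (assumption:
                     the pod has already been exposed under this name or IP).
    bind_address:    local interface the driver listens on inside the pod.
    """
    return {
        "spark.driver.host": advertised_host,
        "spark.driver.bindAddress": bind_address,
        "spark.driver.port": str(driver_port),
        "spark.blockManager.port": str(block_manager_port),
    }


settings = driver_network_settings("pod-gateway.example.internal", 40000, 40001)

# Print the pairs in the form spark-shell / pyspark accept on the command line.
for key, value in sorted(settings.items()):
    print(f"--conf {key}={value}")

# With PySpark installed, the same pairs could be merged into the existing code:
#   conf = SparkConf()
#   conf.setAll(list(configs.items()) + list(settings.items()))
```

Pinning the two ports instead of letting Spark pick random ones means only those ports need to be opened from the YARN network to the pod.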

Related

dse spark-submit to specific work pool instead of "default"

I am able to successfully build the example project from https://github.com/datastax/SparkBuildExamples/tree/master/scala/sbt/dse/src/main/scala/com/datastax/spark/example
I can also submit it successfully with dse spark-submit. The program runs fine and the results are as expected:
dse spark-submit --class com.datastax.spark.example.WriteRead target/writeRead-0.1.jar
I now wish to submit the above job to an existing pool as configured in dse.yaml:
resource_manager_options:
    worker_options:
        cores_total: 6
        memory_total: 32G
        workpools:
            - name: alwayson_sql
              cores: 2
              memory: 4G
            - name: pool_1
              cores: 2
              memory: 16G
I am unable to determine what changes I should make, in the code or in the spark-submit call, to submit the application to the pool "pool_1".
The application is always submitted to the default pool.
Please help.

After some additional research I figured out the correct way to make dse spark-submit use the pool "pool_1":
bin/dse spark-submit \
--master dse://?workpool=pool_1 \
--conf spark.network.timeout=500 \
--class com.datastax.spark.example.WriteRead target/writeRead-0.1.jar
(Per input from Alex) See the DSE documentation.
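If the pool name has to be chosen programmatically, the command above could be assembled along these lines. This is only a sketch: the helper name and its parameters are made up for illustration, and the only DSE-specific piece taken from the answer is the dse://?workpool=... master URL.

```python
import shlex
from urllib.parse import urlencode


def dse_submit_command(workpool, main_class, jar, extra_conf=None):
    """Build the argument list for `dse spark-submit` targeting one work pool.

    Hypothetical helper: it only mirrors the command shown in the answer.
    """
    # The work pool is selected through the query string of the master URL.
    master = "dse://?" + urlencode({"workpool": workpool})
    cmd = ["dse", "spark-submit", "--master", master]
    for key, value in (extra_conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd += ["--class", main_class, jar]
    return cmd


cmd = dse_submit_command(
    "pool_1",
    "com.datastax.spark.example.WriteRead",
    "target/writeRead-0.1.jar",
    extra_conf={"spark.network.timeout": "500"},
)

# Shell-quoted form, suitable for pasting into a terminal.
print(" ".join(shlex.quote(part) for part in cmd))
```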

What is happening when starting a Spark application on Kubernetes

I read this: Running Spark on Kubernetes.
I want to know more details about the interaction between the Kubernetes controller/scheduler and the Spark runtime when launching a Spark job on K8s.
Specifically, assume we launch a Spark app with:
bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--..............
My question is: K8s may not be able to allocate all 5 executors (i.e. containers/pods) immediately, because cluster resources may be unavailable at the moment the Spark app is launched. Which approach does Spark take? (1) Spark starts running tasks as soon as at least one executor has been allocated. (2) Spark won't launch any tasks until all 5 executors have been allocated.
If you know Hadoop YARN, it would be great if you could also answer the question for a Spark app running on Hadoop YARN (dynamic allocation disabled) and point out the difference.

Spark YARN on EMR - JavaSparkContext - IllegalStateException: Library directory does not exist

I have a Java Spark job that works on a manually deployed Spark 1.6.0 in standalone mode on an EC2 instance.
I am spark-submitting this job to an EMR 5.3.0 cluster on the master using YARN, but it fails.
The spark-submit line is:
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
The "yarn-client" at the end is the first argument to the x.jar application and is passed to the SparkConf via setMaster:
conf.setMaster(args[0]);
When I submit it, it starts out running fine, until I initialize the JavaSparkContext from a SparkConf,
JavaSparkContext sc = new JavaSparkContext(conf);
... and then Spark crashes.
In the YARN log, I can see the following,
yarn logs -applicationId application_1487325147456_0051
...
17/02/17 16:27:13 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/17 16:27:13 INFO Client: Deleted staging directory hdfs://ip-172-31-8-237.eu-west-1.compute.internal:8020/user/ec2-user/.sparkStaging/application_1487325147456_0052
17/02/17 16:27:13 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Library directory '/mnt/yarn/usercache/ec2-user/appcache/application_1487325147456_0051/container_1487325147456_0051_01_000001/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
...
Noting the WARN about spark.yarn.jars not being set, I found a spark-yarn JAR file in
/usr/lib/spark/jars/
... and uploaded it to HDFS per Cloudera's guide on how to run YARN applications on Spark and tried to add that conf, so this became my spark-submit line,
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --conf spark.yarn.jars=hdfs://`hostname -f`:8020/sparkyarnlibs/spark-yarn_2.11-2.1.0.jar --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
But that did not work and gave this:
Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I am really puzzled as to what causes that library error and how to proceed from here.

You have specified "--deploy-mode cluster" and yet are calling conf.setMaster("yarn-client") from the code. Using a master URL of "yarn-client" means "use YARN as the master, and use client mode (not cluster mode)", so I wouldn't be surprised if this is somehow confusing Spark because on one hand you're telling it to use cluster mode and on the other you're telling it to use client mode.
By the way, using a master URL like "yarn-client" or "yarn-cluster" is actually deprecated because the "-client" or "-cluster" part is not really part of the Master but rather is the deploy mode. That is, "--master yarn-client" is really more of a shortcut/alias for "--master yarn --deploy-mode client", and similarly "--master yarn-cluster" just means "--master yarn --deploy-mode cluster".
My recommendation would be to not call conf.setMaster() from your code, since the master is already set to "yarn" automatically in /etc/spark/conf/spark-defaults.conf. For this reason, you also don't need to pass "--master yarn" to spark-submit.
Lastly, it sounds like you need to decide whether you really want to use client deploy mode or cluster deploy mode. With client deploy mode, the driver runs on the master instance, and with cluster deploy mode, the driver runs in a YARN container on one of the core/task instances. See https://spark.apache.org/docs/latest/running-on-yarn.html for more information.
If you want to use client deploy mode, you don't need to pass anything extra because it's already the default. If you want to use cluster deploy mode, pass "--deploy-mode cluster".
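To make the alias relationship concrete, here is a small illustrative sketch. It is not how Spark itself resolves these values; the function name is made up, and raising an error on a mismatch is just one way to surface the kind of contradiction present in the question.

```python
# Illustration only: maps the deprecated master URLs onto the
# (master, deploy mode) pair they stand for.
LEGACY_MASTERS = {"yarn-client": "client", "yarn-cluster": "cluster"}


def split_legacy_master(master, deploy_mode=None):
    """Return (master, deploy_mode) with any legacy suffix split off.

    Raises ValueError when the suffix contradicts an explicit deploy mode.
    """
    if master in LEGACY_MASTERS:
        implied = LEGACY_MASTERS[master]
        if deploy_mode is not None and deploy_mode != implied:
            raise ValueError(
                f"master {master!r} implies deploy mode {implied!r}, "
                f"but {deploy_mode!r} was requested"
            )
        return "yarn", implied
    # Client is the default deploy mode when none is given.
    return master, deploy_mode or "client"


print(split_legacy_master("yarn-client"))
print(split_legacy_master("yarn", "cluster"))

# The combination from the question: --deploy-mode cluster plus setMaster("yarn-client").
try:
    split_legacy_master("yarn-client", "cluster")
except ValueError as err:
    print("conflict:", err)
```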

Spark-submit in Spark stand alone - all memory gone to the drivers

I have set up a Spark standalone cluster, where I can submit jobs with spark-submit:
spark-submit \
--class blah.blah.MyClass \
--master spark://myaddress:6066 \
--executor-memory 8G \
--deploy-mode cluster \
--total-executor-cores 12 \
/path/to/jar/myjar.jar
The problem is that when I send multiple jobs at the same time, say over 20 in one go, the first few finish successfully, but all the others get stuck waiting for resources. I noticed that all the available memory has gone to the drivers: in the drivers section they are all running, but in the running applications section they are all in the WAITING state.
How can I tell Spark standalone to allocate memory to the WAITING executors first, instead of to the SUBMITTED drivers?
Thank you.
Below is an extract of my spark-defaults.conf:
spark.master spark://address:7077
spark.eventLog.enabled true
spark.eventLog.dir /path/tmp/sparkEventLog
spark.driver.memory 5g
spark.local.dir /path/tmp
spark.ui.port xxx

how to : spark yarn cluster

I have set up a Hadoop cluster with 3 machines: one master and 2 slaves.
On the master I have installed Spark:
SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true sbt/sbt clean assembly
I added HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop to spark-env.sh.
Then I ran:
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop ./bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar
I checked localhost:8088 and saw the SparkPi application running.
Is that all, or should I also install Spark on the 2 slave machines?
How can I get all the machines started?
Is there any help doc out there? I feel like I am missing something.
In Spark standalone mode we start the master and the workers:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
I would also like to know how to get more than one worker running in this case.
I know we can configure slaves in conf/slaves, but can anyone share an example?
Please help, I am stuck.

Assuming you're using Spark 1.1.0: as the documentation says (http://spark.apache.org/docs/1.1.0/submitting-applications.html#master-urls), you can use the values yarn-cluster or yarn-client for the master parameter. You do not need the deploy-mode parameter in that case.
You do not have to install Spark on all the YARN nodes. That is what YARN is for: to distribute your application (in this case Spark) over a Hadoop cluster.
