spark-submit error: failed to initialize SparkContext on VMs other than the one hosting the driver program - apache-spark

Cluster specifications: Apache Spark on top of Mesos with 5 VMs and HDFS as storage.
spark-env.sh
export SPARK_LOCAL_IP=192.168.xx.xxx #to set the IP address Spark binds to on this node
export MESOS_NATIVE_JAVA_LIBRARY="/home/xyz/tools/mesos-1.0.0/build/src/.libs/libmesos-1.0.0.so" #to point to your libmesos.so if you use Mesos
export SPARK_EXECUTOR_URI="hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz"
HADOOP_CONF_DIR="/usr/local/tools/hadoop" #To point Spark towards Hadoop configuration files
spark-defaults.conf
spark.executor.uri hdfs://vm8:9000/spark-2.0.0-bin-hadoop2.7.tgz
spark.driver.host 192.168.xx.xxx
spark.rpc netty
spark.rpc.numRetries 5
spark.ui.port 48888
spark.driver.port 48889
spark.port.maxRetries 32
I experimented with submitting a word-count Scala application in cluster mode and observed that it executes successfully only when the driver program (containing the main method) is found on the VM it was submitted from. As far as I know, scheduling of resources (VMs) is handled by Mesos. For example, if I submit my application from vm12 and Mesos coincidentally also schedules vm12 to execute the application, it executes successfully. In contrast, it fails if the Mesos scheduler decides to allocate, say, vm15. I checked the stderr logs in the Mesos UI and found this error:
16/09/27 11:15:49 ERROR SparkContext: Error initializing SparkContext.
I also looked through the Spark configuration documentation at http://spark.apache.org/docs/latest/configuration.html. I tried setting the rpc options, since it seemed necessary to keep the driver program close to the worker nodes on the LAN, but I couldn't get much insight.
I also tried uploading my application to HDFS and submitting the application jar from HDFS, with the same result.
While connecting Apache Spark with Mesos according to the documentation at http://spark.apache.org/docs/latest/running-on-mesos.html, I also tried configuring spark-defaults.conf and spark-env.sh on other VMs to check whether it runs successfully from at least two VMs. That didn't work either.
Am I missing something conceptually here?
How can I make my application run successfully regardless of which VM I submit it from?

Related

Does Spark 2.4.4 support forwarding Delegation Tokens when master is k8s?

I'm currently in the process of setting up a Kerberized environment for submitting Spark Jobs using Livy in Kubernetes.
What I've achieved so far:
Running Kerberized HDFS Cluster
Livy using SPNEGO
Livy submitting Jobs to k8s and spawning Spark executors
KNIME is able to interact with Namenode and Datanodes from outside the k8s Cluster
To achieve this I used the following Versions for the involved components:
Spark 2.4.4
Livy 0.5.0 (currently the only version supported by KNIME)
Namenode and Datanode 2.8.1
Kubernetes 1.14.3
What I'm currently struggling with:
Accessing HDFS from the Spark executors
The error message I'm currently getting, when trying to access HDFS from the executor is the following:
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "livy-session-0-1575455179568-exec-1/10.42.3.242"; destination host is: "hdfs-namenode-0.hdfs-namenode.hdfs.svc.cluster.local":8020;
The following is the current state:
KNIME connects to HDFS after having successfully challenged against the KDC (using Keytab + Principal) --> Working
KNIME puts staging jars to HDFS --> Working
KNIME requests new Session from Livy (SPNEGO challenge) --> Working
Livy submits Spark Job with k8s master / spawns executors --> Working
KNIME submits tasks to Livy which should be executed by the executors --> Basically working
When trying to access HDFS to read a file the error mentioned before occurs --> The problem
Since KNIME is placing jar files on HDFS which have to be included in the dependencies for the Spark Jobs it is important to be able to access HDFS. (KNIME requires this to be able to retrieve preview data from DataSets for example)
I tried to find a solution to this but unfortunately, haven't found any useful resources yet.
I had a look at the code and checked UserGroupInformation.getCurrentUser().getTokens().
But that collection seems to be empty. That's why I assume that there are no delegation tokens available.
Has anybody ever achieved running something like this and can help me with this?
Thank you all in advance!
For everybody struggling with this:
It took a while to find the reason why this is not working, but basically it is related to Spark's Kubernetes implementation as of 2.4.4.
There is no override defined for CoarseGrainedSchedulerBackend's fetchHadoopDelegationTokens in KubernetesClusterSchedulerBackend.
There is a pull request that solves this by passing secrets containing the delegation tokens to the executors.
It was already pulled into master and is available in Spark 3.0.0-preview but is not, at least not yet, available in the Spark 2.4 branch.

How does Spark prepare executors on Hadoop YARN?

I'm trying to understand the details of how Spark prepares the executors. In order to do this I tried to debug org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked
Thread.currentThread().getContextClassLoader.getResource("")
It points to the following directory:
/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/
Looking at the directory I found the following files:
default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__
The question is who delivers the files to each executor and then just runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all YARN-autogenerated?
I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful inside.
Ouch...you're asking for quite a lot of details on how Spark communicates with cluster managers while requesting resources. Let me give you some information. Keep asking if you want more...
You are using Hadoop YARN as the cluster manager for Spark applications. Let's focus on this particular cluster manager only (as there are others that Spark supports like Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes that have their own ways to deal with Spark deployments).
By default, while submitting a Spark application using spark-submit, the Spark application (i.e. the SparkContext it uses actually) requests three YARN containers. One container is for that Spark application's ApplicationMaster that knows how to talk to YARN and request two other YARN containers for two Spark executors.
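As a rough illustration of that container arithmetic, here is a minimal sketch. It assumes static allocation, where the executor count comes from spark.executor.instances (default 2). The function name is made up for this example and is not part of any Spark API.

```python
def yarn_containers_requested(num_executors=2):
    """Containers a Spark application asks YARN for: one ApplicationMaster plus executors.

    Assumes static allocation; with dynamic allocation the executor count
    changes at runtime, so this simple sum would not hold.
    """
    if num_executors < 0:
        raise ValueError("num_executors must be non-negative")
    application_master = 1
    return application_master + num_executors


print(yarn_containers_requested())    # default: 1 ApplicationMaster + 2 executors
print(yarn_containers_requested(10))  # 1 ApplicationMaster + 10 executors
```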
You could review the YARN official documentation's Apache Hadoop YARN and Hadoop: Writing YARN Applications to dig deeper into the YARN internals.
While submitting the Spark application, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol" that requires that the request for the very first YARN container (container 0) uses ContainerLaunchContext that holds all the necessary launch details (see Client.createContainerLaunchContext).
"who delivers the files to each executor"
That's how YARN gets told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for an ApplicationMaster container, YARN downloads the necessary files, which you found in the container's working space.
That's very internal to how any YARN application works on YARN and has (almost) nothing to do with Spark.
The code that's responsible for the communication is in Spark's Client, esp. Client.submitApplication.
"and then just runs CoarseGrainedExecutorBackend with the appropriate classpath."
Quoting Mastering Apache Spark 2 gitbook:
CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.
ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
"What are the scripts? Are they all YARN-autogenerated?"
Kind of.
Some are prepared by Spark as part of a Spark application submission while others are YARN-specific.
Enable DEBUG logging level in your Spark application and you'll see the file transfer.
You can find more information in the Spark official documentation's Running Spark on YARN and the Mastering Apache Spark 2 gitbook of mine.

Submitting Spark Jobs to Spark Cluster

I am a complete novice in Spark and have just started exploring it. I have chosen the longer path: instead of installing Hadoop from a CDH distribution, I installed Hadoop from the Apache website and set up the config files myself to better understand the basics.
I have set up a 3-node cluster (all nodes are VMs created on an ESX server).
I have set up High Availability for both the NameNode and the ResourceManager using ZooKeeper. All three nodes are also used as DataNodes.
The following daemons are running across the three nodes:
Daemon in Namenode 1          | Daemon in Namenode 2          | Daemon in Datanode
8724 QuorumPeerMain           | 22896 QuorumPeerMain          | 7379 DataNode
13652 Jps                     | 23780 ResourceManager         | 7299 JournalNode
9045 DFSZKFailoverController  | 23220 DataNode                | 7556 NodeManager
9175 DataNode                 | 23141 NameNode                | 7246 QuorumPeerMain
9447 NodeManager              | 27034 Jps                     | 9705 Jps
8922 NameNode                 | 23595 NodeManager             |
8811 JournalNode              | 22955 JournalNode             |
9324 ResourceManager          | 23055 DFSZKFailoverController |
I have set up HA for NN and RM on NameNode 1 and 2.
The nodes have a very minimal hardware configuration (4 GB RAM and 20 GB disk space each), but they are only for testing purposes, so I guess that's OK.
I have installed Spark (a version compatible with my installed Hadoop 2.7) on NameNode 1. I am able to start spark-shell locally and run basic Scala commands to create an RDD and perform some actions on it. I also managed to test-run the SparkPi example in both yarn-cluster and yarn-client deployment modes. All works well.
Now my problem is this: in a real-world scenario, we are going to write (Java, Scala or Python) code on our local machine (not on the nodes that form the Hadoop cluster). Let's say I have another machine on the same network as my HA cluster. How do I submit my job (let's say I want to try submitting the SparkPi example) from a host outside the HA cluster to the YARN RM?
I believe Spark has to be installed on the machine where I am writing my code (is my assumption correct?) and that Spark does not need to be installed on the HA cluster. I also want to get the output of the submitted job back to the host from which it was submitted. I have no idea what needs to be done to make this work.
I have heard of Spark JobServer. Is this what I need to get all this up and running? I hope you can help me out with this confusion. I could not find any document that clearly specifies the steps to follow to get this done. Can I submit a job from a Windows-based machine to my HA cluster set up in a Unix environment?
Spark JobServer provides a REST interface that covers your requirement, along with other features.
See https://github.com/spark-jobserver/spark-jobserver for more information.
In order to submit Spark jobs to the cluster, your machine has to become a "gateway node". That basically means you have the Hadoop binaries/libraries/configs installed on that machine, but no Hadoop daemons are running on it.
Once you have it set up, you should be able to run hdfs commands against your cluster from that machine (like hdfs dfs -ls /) and submit YARN applications to the cluster (yarn jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar pi 3 100).
After that step you can install Spark on your gateway machine and start submitting Spark jobs. If you are going to use Spark on YARN, this is the only machine Spark needs to be installed on.
You (your code) are responsible for getting the output of the job. You could choose to save the result in HDFS (the most common choice), print it to the console, etc. Spark's History Server is for debugging purposes.
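A quick way to sanity-check a would-be gateway machine is to confirm that the Hadoop client configuration is actually present. The sketch below is only an illustration: the helper name is hypothetical, and the list of files (core-site.xml, hdfs-site.xml, yarn-site.xml) is an assumption about what a typical YARN client needs, so adjust it to your setup.

```python
import tempfile
from pathlib import Path

# Assumed set of client-side config files; adjust to your cluster.
CLIENT_CONFIG_FILES = ("core-site.xml", "hdfs-site.xml", "yarn-site.xml")


def missing_client_configs(conf_dir):
    """Return the expected Hadoop config files that are absent from conf_dir."""
    conf = Path(conf_dir)
    return [name for name in CLIENT_CONFIG_FILES if not (conf / name).is_file()]


# Demo against a throwaway directory that holds only core-site.xml.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "core-site.xml").write_text("<configuration/>")
    demo_missing = missing_client_configs(tmp)

print(demo_missing)
```

On a real gateway you would point the helper at the directory named by HADOOP_CONF_DIR and expect an empty list.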

Force the Spark Master to distribute code (not the submitter)

I'm trying to submit a spark job to a remote master from my notebook. I've got a local spark installation, so I can run
./bin/spark-submit --class "a.b.C" --master spark://198.51.100.1:7077 app.jar (...)
Due to firewall policy, NAT, etc., I can reach the Spark master (198.51.100.1) from my notebook (192.168.0.1), but not the other way around.
The problem is that my local spark installation tries to distribute code to the workers
SparkContext: Added JAR file:/path/to/app.jar at http://192.168.0.1:52605/jars/app.jar with timestamp 1439369933876
which must fail, because the workers have no route to my notebook
WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@192.168.0.1:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters.
So, how can I submit my application to the master and force the master to distribute my code to the workers?
Or did I get this all wrong and there's another reason for my problem here?
You can upload your app.jar to a location that is visible inside your cluster (e.g. HDFS) and use cluster deploy mode when launching your app:
./bin/spark-submit --deploy-mode cluster .... hdfs://path/to.jar
See Submitting Applications for more details.
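To make the shape of that command concrete, here is a small sketch that assembles the argument list for a cluster-mode submission. The master URL and class name are taken from the question. The helper name and the hdfs:///apps/app.jar path are placeholders invented for illustration, and the sketch only prints the command rather than running it.

```python
import shlex


def build_cluster_submit(master_url, main_class, jar_uri, app_args=()):
    """Return the spark-submit argv for cluster deploy mode.

    jar_uri must be reachable from inside the cluster (for example an
    hdfs:// URI), because in cluster mode the driver is started there.
    """
    return [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", master_url,
        "--deploy-mode", "cluster",
        jar_uri,
        *app_args,
    ]


# hdfs:///apps/app.jar is a placeholder path, not one from the question.
cmd = build_cluster_submit("spark://198.51.100.1:7077", "a.b.C", "hdfs:///apps/app.jar")
print(shlex.join(cmd))
```

The point of the design is that the notebook no longer has to serve the jar: in cluster mode the driver is started inside the cluster and fetches the jar from the shared location.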

Unable to add a new service with Cloudera Manager within Cloudera Quickstart VM 5.3.0

I'm using Cloudera Quickstart VM 5.3.0 (running in Virtual Box 4.3 on Windows 7) and I wanted to learn Spark (on YARN).
I started Cloudera Manager. In the sidebar I can see all the services; Spark is there, but in standalone mode. So I click on "Add a new service" and select "Spark". Then I have to select the set of dependencies for this service; I have no choice and must pick HDFS/YARN/ZooKeeper.
In the next step I have to choose a History Server and a Gateway. I run the VM in local mode, so I can only choose localhost.
I click on "Continue" and this error occurs (+ 69 traces):
A server error as occurred. Send the following information to
Cloudera.
Path : http://localhost:7180/cmf/clusters/1/add-service/reviewConfig
Version: Cloudera Express 5.3.0 (#155 built by jenkins on
20141216-1458 git: e9aae1d1d1ce2982d812b22bd1c29ff7af355226)
org.springframework.web.bind.MissingServletRequestParameterException:Required
long parameter 'serviceId' is not present at
AnnotationMethodHandlerAdapter.java line 738 in
org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter$ServletHandlerMethodInvoker
raiseMissingParameterException()
I don't know whether an internet connection is needed, but I should point out that the VM can't connect to the internet. (EDIT: Even with an internet connection I get the same error)
I have no idea how to add this service. I tried with and without a gateway and with many network options, but it never worked. I checked the known issues; nothing...
Does anyone know how I can solve this error or work around it? Thanks for any help.
Julien,
Before I answer your question I'd like to make some general notes about Spark in Cloudera Distribution of Hadoop 5 (CDH5):
Spark runs in three different modes: (1) local, (2) Spark's own stand-alone manager, and (3) other cluster resource managers like Hadoop YARN, Apache Mesos, and Amazon EC2.
Spark works out-of-the-box with CDH 5 for (1) and (2). You can initiate a local interactive Spark session in Scala using the spark-shell command, or pyspark for Python, without passing any arguments. I find the interactive Scala and Python interpreters helpful for learning to program with Resilient Distributed Datasets (RDDs).
I was able to recreate your error on my CDH 5.3.x distribution. I didn't mean to take credit for the bug you discovered, but I posted to the Cloudera developer community for feedback.
In order to use Spark in the QuickStart pseudo-distributed environment, see if all of the Spark daemons are running using the following command (you can do this inside the Cloudera Manager (CM) UI):
[cloudera@quickstart simplesparkapp]$ sudo service --status-all | grep -i spark
Spark history-server is not running [FAILED]
Spark master is not running [FAILED]
Spark worker is not running [FAILED]
I've manually stopped all of the stand-alone Spark services so we can try to submit the Spark job within Yarn.
In order to run Spark inside a Yarn container on the quick start cluster, we have to do the following:
Set HADOOP_CONF_DIR to the directory containing the yarn-site.xml configuration file. This is typically /etc/hadoop/conf in CDH5. You can set this variable using the command export HADOOP_CONF_DIR="/etc/hadoop/conf".
Submit the job using spark-submit and specify that you are using Hadoop YARN:
spark-submit --class CLASS_PATH --master yarn JAR_DIR ARGS
Check the job status in Hue and compare it to the Spark History Server. Hue should show the job placed in a generic YARN container, and the Spark History Server should not have a record of the submitted job.
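The first two steps above can be sketched as a small script that prepares the environment and the spark-submit arguments together. This is only an illustration under stated assumptions: the helper name, the class name com.example.SimpleApp, the jar path, and the input.txt argument are all made up, and the sketch builds the invocation without launching it.

```python
import os


def yarn_submit_invocation(main_class, jar_path, app_args=(), conf_dir="/etc/hadoop/conf"):
    """Return (argv, env) for a spark-submit run against YARN.

    env is a copy of the current environment with HADOOP_CONF_DIR set, so
    spark-submit can locate yarn-site.xml and the other client configs.
    """
    env = dict(os.environ)
    env["HADOOP_CONF_DIR"] = conf_dir
    argv = ["spark-submit", "--class", main_class, "--master", "yarn", jar_path, *app_args]
    return argv, env


# Hypothetical class, jar, and argument, used only to show the call shape.
argv, env = yarn_submit_invocation("com.example.SimpleApp", "target/simple-app.jar", ["input.txt"])
print(argv)
print(env["HADOOP_CONF_DIR"])

# To actually launch it on a machine with Spark installed:
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```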
References used:
Learning Spark, Chapter 7
Sandy Ryza's Blog Post on Spark and CDH5
Spark Documentation for Running on Yarn
