How to spark-submit to ZooKeeper-managed Mesos cluster (gives java.net.UnknownHostException: zk for mesos://zk:// master URL)? - apache-spark

I'm running Spark 2.0.2 and Mesos 0.28.2.
I'm attempting to submit an application to Spark, using a ZooKeeper-managed Mesos cluster as the master:
$SPARK_HOME/bin/spark-submit --verbose \
--conf spark.mesos.executor.docker.image=$DOCKER_IMAGE \
--conf spark.mesos.executor.home=$SPARK_HOME \
--conf spark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so \
--deploy-mode cluster \
--master mesos://zk://<ip 1>:2181,<ip 2>:2181,<ip 3>:2181/mesos \
--class $APP_MAIN_CLASS \
file://$APP_JAR_PATH
(<ip 1>, <ip 2>, and <ip 3> are IPv4 addresses in the 10.0.0.0/8 block)
According to the documentation, I seem to have the right format for the master:
The Master URLs for Mesos are in the form mesos://host:5050 for a single-master Mesos cluster, or mesos://zk://host1:2181,host2:2181,host3:2181/mesos for a multi-master Mesos cluster using ZooKeeper.
However, it appears that Spark is reading the mesos://zk://... string then attempting to connect to zk:
17/04/07 20:10:06 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://zk://<ip 1>:2181,<ip 2>:2181,<ip 3>:2181/mesos.
Exception in thread "main" java.net.UnknownHostException: zk
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1316)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1291)
at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:214)
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:89)
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:85)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:85)
at org.apache.spark.deploy.rest.RestSubmissionClient$.run(RestSubmissionClient.scala:417)
at org.apache.spark.deploy.rest.RestSubmissionClient$.main(RestSubmissionClient.scala:430)
at org.apache.spark.deploy.rest.RestSubmissionClient.main(RestSubmissionClient.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
How do I get Spark to recognize that it should be using the three ZooKeeper nodes rather than trying to connect to a non-existent zk host?

tl;dr It won't work unless you either change --deploy-mode to client or use a master URL with a single Mesos host, e.g. mesos://host:port.
The following line gives the hint where to find the relevant code.
17/04/07 20:10:06 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://zk://:2181,:2181,:2181/mesos.
It looks that the message is only printed out for --deploy-mode cluster with Spark Standalone and Apache Mesos. Change it to the default client and the deployment path will change and hopefully accept the master URL.
See yourself the code that's responsible for the cluster deployment -- RestSubmissionClient.
Here RestSubmissionClient says:
private val supportedMasterPrefixes = Seq("spark://", "mesos://")
which proves mesos:// URLs are covered, but here you see the following:
private val masters: Array[String] = if (master.startsWith("spark://")) {
Utils.parseStandaloneMasterUrls(master)
} else {
Array(master)
}
that is printed out here as the above INFO message that shows the URL can only be a single Mesos master.

Related

Starting up Spark History Server to write to minIO

I'm trying to get Spark History Server to run on my cluster that is running on Kubernetes, and I'd like the logs to get written to minIO. I'm also using minIO as storage of the input and output of my spark-submit jobs, which is working already.
Currectly working spark-submit jobs
My working spark-submit job looks something like the following:
spark-submit \
--conf spark.hadoop.fs.s3a.access.key=XXXX \
--conf spark.hadoop.fs.s3a.secret.key=XXXX \
--conf spark.hadoop.fs.s3a.endpoint=https://someIpv4 \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=true \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.default.name="s3a:///" \
--conf spark.driver.extraJavaOptions="-Djavax.net.ssl.trustStore=XXXX -Djavax.net.ssl.trustStorePassword=XXXX \
--conf spark.executor.extraJavaOptions="-Djavax.net.ssl.trustStore=XXXX -Djavax.net.ssl.trustStorePassword=XXXX \
...
As you can see, I'm using SSL to connect to minIO and to read/write files.
What am I trying
I'm trying to spin up the history server with minIO as storage without using SSL.
To start up the history server, I'm using the already present start-history-server.sh script with some configs to define the log storage location with the ./start-history-server.sh --properties-file my_conf_file command. my_conf_file looks like this:
spark.eventLog.enabled=true
spark.eventLog.dir=s3a://myBucket/spark-events
spark.history.fs.logDirectory=s3a://myBucket/spark-events
spark.hadoop.fs.s3a.access.key=XXXX
spark.hadoop.fs.s3a.secret.key=XXXX
spark.hadoop.fs.s3a.endpoint=http://someIpv4
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=false
So you see I'm not adding any SSL parameters. But when I run ./start-history-server.sh --properties-file my_conf_file, I'm getting this error:
INFO AmazonHttpClient: Unable to execute HTTP request: Connection refused (Connection refused)
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:121)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:296)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
What have I tried/found on the internet
This person had a very similar problem to mine, but it seems like they solved it using spark.hadoop.fs.s3a.path.style.access, which I'm already using
I was able to spin up History server using the local filesystem, so that seems to be working correctly
I have seen people, like in this post, using the spark.hadoop.fs.s3a.impl key with org.apache.hadoop.fs.s3a.S3AFileSystem as value. When I do this, however, It seems like this class doesn't exist within my AWS jars.
I have the following AWS jars at my disposal: aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar
Since my spark-submit jobs are running fine, reading/writing away files to minIO, and I'm not supplying that spark.hadoop.fs.s3a.impl parameter in them I would think that that parameter is not needed?
Does anyone have an idea of where I should be looking/what I'm doing wrong?
My problem was that actually my minIO did not accept http requests. My already working spark submit job was using https using SSL, so I added the needed parameters to $SPARK_DAEMON_JAVA_OPTS and it was working.

Submit Spark Application on Kubernetes in Cluster mode : Configured service account doesn't have access

I try to submit a Spark application to a Kubernetes cluster (Minikube).
When running my spark submit in client mode, everything goes well. 3 executors are created in 3 pods, and the code is executed. Here is my submit command :
[MY_PATH]/bin/spark-submit \
--master k8s://https://[API_SERVER_IP]:8443 \
--deploy-mode client \
--name [Name] \
--class [MyClass] \
--conf spark.kubernetes.container.image=spark:2.4.0 \
--conf spark.executor.instances=3 \
[PATH/TO/MY/JAR].jar
Now, I adapted it to run in cluster mode :
[MY_PATH]/bin/spark-submit \
--master k8s://https://[API_SERVER_IP]:8443 \
--deploy-mode cluster \
--name [Name] \
--class [MyClass] \
--conf spark.kubernetes.container.image=spark:2.4.0 \
--conf spark.executor.instances=3 \
local://[PATH/TO/MY/JAR].jar
This time, a driver pod is created as well as a driver service, and then the driver pod fail. On the Kubernetes I can see the following error :
MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "sparkpi-1555314081444-driver-conf-map" not found
And in the pod logs I have the error :
Forbidden!Configured service account doesn't have access.
Service account may have been revoked.
pods "sparkpi-1555314081444-driver" is forbidden: User "system:serviceaccount:default:default" cannot get resource "pods" in API group "" in the namespace "default".
Here is the full stacktrace :
org.apache.spark.SparkException: External scheduler cannot be instantiated
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/default/pods/sparkpi-1555314081444-driver. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "sparkpi-1555314081444-driver" is forbidden: User "system:serviceaccount:default:default" cannot get resource "pods" in API group "" in the namespace "default".
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:407)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:312)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:295)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:783)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:217)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
at scala.Option.map(Option.scala:146)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
... 20 more
What should I do to make it work ?
You have to create an authorized service account: https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
And then pass it as a parameter to the submit
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark

Spark Streaming - java.io.IOException: Lease timeout of 0 seconds expired

I have spark streaming application using checkpoint writing on HDFS.
Has anyone know the solution?
Previously we were using the kinit to specify principal and keytab and got the suggestion to specify these via spark-submit command instead kinit but still this error and cause spark streaming application down.
spark-submit --principal sparkuser#HADOOP.ABC.COM --keytab /home/sparkuser/keytab/sparkuser.keytab --name MyStreamingApp --master yarn-cluster --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC --conf "spark.eventLog.enabled=true" --conf "spark.streaming.backpressure.enabled=true" --conf "spark.streaming.stopGracefullyOnShutdown=true" --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC --class com.abc.DataProcessor myapp.jar
I see multiple occurrences of following exception in logs and finally SIGTERM 15 that kills the executor and driver. We are using CDH 5.5.2
2016-10-02 23:59:50 ERROR SparkListenerBus LiveListenerBus:96 -
Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:148)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:148)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:148)
at org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:184)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:50)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1135)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
Caused by: java.io.IOException: Lease timeout of 0 seconds expired.
at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:2370)
at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:964)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:932)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
at java.lang.Thread.run(Thread.java:745)

How to give dependent jars to spark submit in cluster mode

I am running spark using cluster mode for deployment . Below is the command
JARS=$JARS_HOME/amqp-client-3.5.3.jar,$JARS_HOME/nscala-time_2.10-2.0.0.jar,\
$JARS_HOME/rabbitmq-0.1.0-RELEASE.jar,\
$JARS_HOME/kafka_2.10-0.8.2.1.jar,$JARS_HOME/kafka-clients-0.8.2.1.jar,\
$JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,\
$JARS_HOME/zkclient-0.3.jar,$JARS_HOME/protobuf-java-2.4.0a.jar
dse spark-submit -v --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--executor-memory 512M \
--total-executor-cores 3 \
--deploy-mode "cluster" \
--master spark://$MASTER:7077 \
--jars=$JARS \
--supervise \
--class "com.testclass" $APP_JAR input.json \
--files "/home/test/input.json"
The above command is working fine in client mode. But when I use it in cluster mode I get class not found exception
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$
In client mode the dependent jars are getting copied to the /var/lib/spark/work directory whereas in cluster mode it is not. Please help me in getting this solved.
EDIT:
I am using nfs and I have mounted the same directory on all the spark nodes under same name. Still I get the error. How it is able to pick the application jar which is also under same directory but not the dependent jars ?
In client mode the dependent jars are getting copied to the
/var/lib/spark/work directory whereas in cluster mode it is not.
In Cluster mode, driver pragram is running in the cluster not in local(compared to client mode) and dependent jars should be accessible in cluster, otherwise driver program and executor will throw "java.lang.NoClassDefFoundError" exception.
Actually When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster.
Your extra jars could be added to --jars, they will be copied to cluster automatically.
please refer to "Advanced Dependency Management" section in below link:
http://spark.apache.org/docs/latest/submitting-applications.html
As spark documentation says,
Keep all jars and dependencies in same local path in all nodes in cluster or
Keep the jar is distributed files system where all nodes have access to.
Spark properties

JDBC Driver not found - On submitting to YARN from Spark

Trying to read all rows from a DB table and write the same to another empty target table. So when I issue the following command at the main node, it works as expected -
$./bin/spark-submit --class cs.TestJob_publisherstarget --driver-class-path ./lib/mysql-connector-java-5.1.35-bin.jar --jars ./lib/mysql-connector-java-5.1.35-bin.jar,./lib/univocity-parsers-1.5.6.jar,./lib/commons-csv-1.1.1-SNAPSHOT.jar ./lib/uber-ski-spark-job-0.0.1-SNAPSHOT.jar
(Where: uber-ski-spark-job-0.0.1-SNAPSHOT.jar is the packaged jar in ../spark/lib folder and cs.TestJob_publisherstarget is the class)
The above command works perfectly for the code and reads all rows from a table in MySQL and dumps all roes to target table, using the JDBC driver mentioned with --jars option.
Here is the issue:
Everything remaining the same as above, when I submit the same job to YARN, it fails with en exception indicating - can't find the driver
$./bin/spark-submit --verbose --class cs.TestJob_publisherstarget --master yarn-cluster --driver-class-path ./lib/mysql-connector-java-5.1.35-bin.jar --jars ./lib/mysql-connector-java-5.1.35-bin.jar ./lib/uber-ski-spark-job-0.0.1-SNAPSHOT.jar
Exception in YARN Console:
Error: application failed with exception
org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:625)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:650)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:577)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:174)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
EXCEPTION AT LOG:
5/10/12 20:38:59 ERROR yarn.ApplicationMaster: User class threw exception: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root
java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:96)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at com.cambridgesemantics.application.sdi.compiler.spark.DataSource.getDataFrame(DataSource.scala:20)
at cs.TestJob_publisherstarget$.main(TestJob_publisherstarget.scala:29)
at cs.TestJob_publisherstarget.main(TestJob_publisherstarget.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:484)
15/10/12 20:38:59 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root)
Anyway: Where am I supposed to put the JDBC driver jar file? I have copied it over to the lib of each child node, still no luck!
I was having the same issue, it was working in local mode but not in yarn-client.
I added to spark submit:
--conf "spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.34.jar
and that worked for me
For Spark 1.6, I have the issue to store DataFrame to Oracle by using org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable
In yarn-cluster mode, I put these options in the submit script:
--conf "spark.driver.extraClassPath=$HOME/jdbc-11.2.0.3.0.jar" \
--conf "spark.executor.extraClassPath=$HOME/jdbc-11.2.0.3.0.jar" \
I also have to put Class.forName("..") like below before the saving line:
try {
Class.forName("oracle.jdbc.OracleDriver");
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, "RD_SPARK_DTL_INCL_HY ", p);
} catch (Exception e) {....
Of course, you have to copy the lib to each node. Not pretty, but it works. Hope someone can come up better solution later.
I do strongly recommend to use this API -- amazingly convenient and fast.

Resources