How to give dependent jars to spark submit in cluster mode

How to give dependent jars to spark submit in cluster mode - apache-spark

I am running spark using cluster mode for deployment . Below is the command
JARS=$JARS_HOME/amqp-client-3.5.3.jar,$JARS_HOME/nscala-time_2.10-2.0.0.jar,\
$JARS_HOME/rabbitmq-0.1.0-RELEASE.jar,\
$JARS_HOME/kafka_2.10-0.8.2.1.jar,$JARS_HOME/kafka-clients-0.8.2.1.jar,\
$JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,\
$JARS_HOME/zkclient-0.3.jar,$JARS_HOME/protobuf-java-2.4.0a.jar
dse spark-submit -v --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--executor-memory 512M \
--total-executor-cores 3 \
--deploy-mode "cluster" \
--master spark://$MASTER:7077 \
--jars=$JARS \
--supervise \
--class "com.testclass" $APP_JAR input.json \
--files "/home/test/input.json"
The above command is working fine in client mode. But when I use it in cluster mode I get class not found exception
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$
In client mode the dependent jars are getting copied to the /var/lib/spark/work directory whereas in cluster mode it is not. Please help me in getting this solved.
EDIT:
I am using nfs and I have mounted the same directory on all the spark nodes under same name. Still I get the error. How it is able to pick the application jar which is also under same directory but not the dependent jars ?

In client mode the dependent jars are getting copied to the
/var/lib/spark/work directory whereas in cluster mode it is not.
In Cluster mode, driver pragram is running in the cluster not in local(compared to client mode) and dependent jars should be accessible in cluster, otherwise driver program and executor will throw "java.lang.NoClassDefFoundError" exception.
Actually When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster.
Your extra jars could be added to --jars, they will be copied to cluster automatically.
please refer to "Advanced Dependency Management" section in below link:
http://spark.apache.org/docs/latest/submitting-applications.html

As spark documentation says,
Keep all jars and dependencies in same local path in all nodes in cluster or
Keep the jar is distributed files system where all nodes have access to.
Spark properties

Related

spark-submit does not work with my jar located in hdfs

Here is my situation:
Apache spark version 2.4.4
Hadoop version 2.7.4
My application jar is located in hdfs.
My spark-submit looks like this:
/software/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
--class com.me.MyClass --master spark://host2.local:7077 \
--deploy-mode cluster \
hdfs://host2.local:9000/apps/myapps.jar
I get this error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.tracing.SpanReceiverHost.get(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String;)Lorg/apache/hadoop/tracing/SpanReceiverHost;
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:634)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:144)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:139)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:139)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveAndDownloadJars$1.apply(DependencyUtils.scala:61)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveAndDownloadJars$1.apply(DependencyUtils.scala:64)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.DependencyUtils$.resolveAndDownloadJars(DependencyUtils.scala:60)
at org.apache.spark.deploy.worker.DriverWrapper$.setupDependencies(DriverWrapper.scala:96)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:60)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Any pointer how to solve this, please?
Thank you.

There is no need to transfer the jar into cluster, you can run your jar from your local id itself with executable permission.
Once your application is build transfer the .jar to your unix user account and give it executable permissions. Have a look at the below spark submit:-
spark-submit --master yarn --deploy-mode cluster --queue default
--files "full path of your properties file" --driver-memory 4G
--num-executors 8 --executor-cores 1 --executor-memory 4G
--class "main class name"
"full path of the jar which you have transferred to your local unix id"
You can use other spark submit configuration parameters if you want. Please note that in some version you have to use spark2-submit instead of spark-submit if there are multiple spark version involved.

--deploy-mode cluster will help in this case. taking the jars to cluster will be taken care by yarn cluster.

Spark's example throws FileNotFoundException in client mode

I have: Ubuntu 14.04, Hadoop 2.7.7, Spark 2.2.0.
I just installed everything.
When I try to run the the Spark's example:
bin/spark-submit --deploy-mode client \
--class org.apache.spark.examples.SparkPi \
examples/jars/spark-examples_2.11-2.2.0.jar 10
I get the following error:
INFO yarn.Client:
client token: N/A
diagnostics: Application application_1552490646290_0007 failed 2 times due to AM Container for
appattempt_1552490646290_0007_000002 exited with exitCode: -1000 For
more detailed output, check application tracking
page:http://ip-123-45-67-89:8088/cluster/app/application_1552490646290_0007 Then,
click on links to logs of each attempt. Diagnostics: File
file:/tmp/spark-f5879f52-6777-481a-8ecf-bbb55e376901/__spark_libs__6948713644593068670.zip
does not exist java.io.FileNotFoundException: File
file:/tmp/spark-f5879f52-6777-481a-8ecf-bbb55e376901/__spark_libs__6948713644593068670.zip
does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:421)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
I get the same error both in client mode and cluster mode.

It seems that it fails by loading the spark libs. As Daniel points out, it could be related to your read rights. Besides, it could be related to running out of disk space.
However, in our case to avoid transfer latencies to the master and read/write rights in the local machine, we put the spark-libs in the HDFS of the Yarn cluster and then, we point them in the spark.yarn.archive property.
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
export HADOOP_USER_NAME=hadoop
hadoop fs -mkdir -p /apps/spark/
hadoop fs -put -f ${SPARK_HOME}/spark-libs.jar /apps/spark/
# spark-defaults.conf
spark.yarn.archive hdfs:///apps/spark/spark-libs.jar

First, Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
Second , If you run on YARN mode you have point your master to yarn submitting-applications, and put your jar file in hdfs
# Run on a YARN cluster
# Connect to a YARN cluster in client or cluster mode depending on the value
# of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR
# or YARN_CONF_DIR variable.
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \ # can be client for client mode
hdfs://path/to/spark-examples.jar
1000

spark-submit classpath issue with --repositories --packages options

I'm running Spark in a standalone cluster where spark master, worker and submit each run in there own Docker container.
When spark-submit my Java App with the --repositories and --packages options I can see that it successfully downloads the apps required dependencies. However the stderr logs on the spark workers web ui reports a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit. But doesn't look like it's available on the worker classpath??
16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
at com.my.spark.app.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
The spark-submit call:
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic

I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet but just some observations based on experimentation and reading around for solutions. I am noting them down here just in case it helps some one in their investigation. I will update this answer if I find more information later.
The --repositories option is required only if some custom repository has to be referenced
By default the maven central repository is used if the --repositories option is not provided
When --packages option is specified, the submit operation tries to look for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, ~/.m2/repository directories.
If they are not found, then they are downloaded from maven central using ivy and stored under the ~/.ivy2 directory.
In my case I had observed that
spark-shell worked perfectly with the --packages option
spark-submit would fail to do the same. It would download the dependencies correctly but fail to pass on the jars to the driver and worker nodes
spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster.
This would run the driver locally in the command shell where I ran the spark-submit command but the worker would run on the cluster with the appropriate dependency jars
I found the following discussion useful but I still have to nail down this problem.
https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455
Most people just use an UBER jar to avoid running into this problem and even to avoid the problem of conflicting jar versions where a different version of the same dependency jar is provided by the platform.
But I don't like that idea beyond a stop gap arrangement and am still looking for a solution.

Spark SQL Thrift Server on CDH 5.3.0

I am trying to use CDH 5.3.0 to run Spark's Thrift Server. I'm trying to follow the Spark SQL instructions, but I can't even get the --help option to run successfully. In the output below, it dies because it can't find the HiveServer2 class.
$ /usr/lib/spark/sbin/start-thriftserver.sh --help
Usage./sbin/start-thriftserver [options] [thrift server options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--help, -h Show this help message and exit
--verbose, -v Print additional debug output
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
--supervise If given, restarts the driver on failure.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
YARN-only:
--executor-cores NUM Number of cores per executor (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
Thrift server options:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hive/service/server/HiveServer2
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Caused by: java.lang.ClassNotFoundException: org.apache.hive.service.server.HiveServer2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more

As indicated by the error, the class is not in the classpath. Unfortunately, setting the CLASSPATH environment variable won't work. The only solution that I could find was to edit /usr/lib/spark/bin/compute-classpath.sh and add this line (it can go just about anywhere, but put it one line from the end to make it clear that it's an addition):
CLASSPATH="$CLASSPATH:/usr/lib/hive/lib/*"
Cloudera's release notes for 5.3.0 explicitly state "Spark SQL remains an experimental and unsupported feature in CDH", so it's not surprising that tweaks like this may be needed. Also, this response to a similar problem in CDH 5.2 suggests that the Hive jars are deliberately excluded by Cloudera for size reasons.

I have faced the same problem but I solved it in another way.
The cloudera CDH version was not 5.3.0 it was some version prior to that version so you will find the paths little different.
Simply the solution was to replace the spark-assembly-**.jar file that shipped with cloudera CDH by another version.
I downloaded spark from its official download page. The version I have downloaded was built for hadoop 2.4 and later. Extracting the downloaded file and look for spark-assembly-**.jar.
In the cloudera installation, I looked for the same file and I found it under that path /usr/lib/spark/libe/spark-assembly--.jar
The previous path actually was a symlink to the actual file. I uploaded the jar from spark download to the same path and make the symlink point to the new jar (ln -f -s target link).
Every thing works fine with me.

/usr/lib/spark/bin/compute-classpath.sh sets CLASSPATH="$SPARK_CLASSPATH". On CDH using parcels you can add the hive jars to SPARK_CLASSPATH like this:
SPARK_CLASSPATH=$(ls -1 /opt/cloudera/parcels/CDH/lib/hive/lib/*.jar | sed -e :a -e 'N;s/\n/:/;ta') /opt/cloudera/parcels/CDH/lib/spark/sbin/start-thriftserver.sh --help

Instructions from Cloudera Community forum
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-5-does-not-have-Spark-Thrift-Server/m-p/41849#M1758 :
git clone https://github.com/cloudera/spark.git
cd spark
./make-distribution.sh -DskipTests \
-Dhadoop.version=2.6.0-cdh5.7.0 \
-Phadoop-2.6 \
-Pyarn \
-Phive -Phive-thriftserver \
-Pflume-provided \
-Phadoop-provided \
-Phbase-provided \
-Phive-provided \
-Pparquet-provided
-Phive and -Phive-thriftserver are the key pieces there.
There is a request to add Spark Thrift Server
https://issues.cloudera.org/browse/DISTRO-817
please vote up if you want to see that in CDH.

Spark throws ClassNotFoundException when using --jars option

I was trying to follow the Spark standalone application example described here
https://spark.apache.org/docs/latest/quick-start.html#standalone-applications
The example ran fine with the following invocation:
spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
However, when I tried to introduce some third-party libraries via --jars, it throws ClassNotFoundException.
$ spark-submit --jars /home/linpengt/workspace/scala-learn/spark-analysis/target/pack/lib/* \
--class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.ClassNotFoundException: SimpleApp
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:300)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Removing the --jars option and the program runs again (I didn't actually start using those libraries yet). What's the problem here? How should I add the external jars?

According to spark-submit's --help, the --jars option expects a comma-separated list of local jars to include on the driver and executor classpaths.
I think that what's happening here is that /home/linpengt/workspace/scala-learn/spark-analysis/target/pack/lib/* is expanding into a space-separated list of jars and the second JAR in the list is being treated as the application jar.
One solution is to use your shell to build a comma-separated list of jars; here's a quick way of doing it in bash, based on this answer on StackOverflow (see that answer for more complex approaches that handle filenames that contain spaces):
spark-submit --jars $(echo /dir/of/jars/*.jar | tr ' ' ',') \
--class "SimpleApp" --master local[4] path/to/myApp.jar

Is your SimpleApp class in any specific package? It seems that you need to include the full package name in the command line. So, if the SimpleApp class is located in com.yourcompany.yourpackage, you'd have to submit the Spark job with --class "com.yourcompany.yourpackage.SimpleApp" instead of --class "SimpleApp". I had the same problem and changing the name to the full package and class name fixed it. Hope that helps!

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string