spark-submit classpath issue with --repositories --packages options - apache-spark

I'm running Spark in a standalone cluster where spark master, worker and submit each run in there own Docker container.
When spark-submit my Java App with the --repositories and --packages options I can see that it successfully downloads the apps required dependencies. However the stderr logs on the spark workers web ui reports a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit. But doesn't look like it's available on the worker classpath??
16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
at com.my.spark.app.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
The spark-submit call:
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic

I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet but just some observations based on experimentation and reading around for solutions. I am noting them down here just in case it helps some one in their investigation. I will update this answer if I find more information later.
The --repositories option is required only if some custom repository has to be referenced
By default the maven central repository is used if the --repositories option is not provided
When --packages option is specified, the submit operation tries to look for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, ~/.m2/repository directories.
If they are not found, then they are downloaded from maven central using ivy and stored under the ~/.ivy2 directory.
In my case I had observed that
spark-shell worked perfectly with the --packages option
spark-submit would fail to do the same. It would download the dependencies correctly but fail to pass on the jars to the driver and worker nodes
spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster.
This would run the driver locally in the command shell where I ran the spark-submit command but the worker would run on the cluster with the appropriate dependency jars
I found the following discussion useful but I still have to nail down this problem.
https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455
Most people just use an UBER jar to avoid running into this problem and even to avoid the problem of conflicting jar versions where a different version of the same dependency jar is provided by the platform.
But I don't like that idea beyond a stop gap arrangement and am still looking for a solution.

Related

Pyspark - spark-submit to an AWS EMR

I have created an EMR cluster (emr-5.36.0) in AWS with the default sparks components (Spark 2.4.8, Hive 2.3.9).
I have installed Pyspark (3.3.0) on an EC2, in an python virtual environment.
From there, I would like to run "spark-submit" commands to the EMR cluster.
To test the command, I am using python the code at the bottom of this page
To configured the YARN_CONF_DIR environment variable on the EC2, I copied the yarn-site.xml file from /etc/hadoop/conf.empty/ on the EMR's master node to a folder on the EC2.
But now, on the EC2, when I try to run spark-submit, I get:
$ export YARN_CONF_DIR=/home/me/spark/
$ spark-submit --master yarn --deploy-mode cluster spark_test.py
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException
at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 13 more 22/07/18 18:36:25 INFO ShutdownHookManager: Shutdown hook called
And from here I am basically lost. I tried to google the error but I am still not clear what the error is about. Did I miss a step? An environment variable maybe?
Ultimately, I want to use the SparkSubmitOperator in Airflow, but I figured I should get the "native" command to work first before using the operator (which is just a wrapper).
If you do YARN_CONF_DIR=/etc/hadoop_files/ locally, the content of the folder hadoop_files needs to be the content of the EMR's /etc/hadoop/ folder, not /etc/hadoop/conf.empty/.

spark2-submit throwing error with multiple packages (--packages)

I'm trying to submit following Spark2 job on CDH 5.16 cluster and it's only taking first parameter of --packages option and throwing error for second parameter
spark2-submit --packages com.databricks:spark-xml_2.11:0.4.1, com.databricks:spark-csv_2.11:1.5.0 /path/to/python-script
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR com.databricks:spark-csv_2.11:1.5.0 with URI com.databricks. Please specify a class through --class.
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:224)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$1.<init>(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:911)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:81)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Running this job in CDH5.16 cluster and spark installed with Spark2 CSD
Thanks in advance.
Don't give space between packages
spark2-submit --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-csv_2.11:1.5.0 /path/to/python-script

How to give dependent jars to spark submit in cluster mode

I am running spark using cluster mode for deployment . Below is the command
JARS=$JARS_HOME/amqp-client-3.5.3.jar,$JARS_HOME/nscala-time_2.10-2.0.0.jar,\
$JARS_HOME/rabbitmq-0.1.0-RELEASE.jar,\
$JARS_HOME/kafka_2.10-0.8.2.1.jar,$JARS_HOME/kafka-clients-0.8.2.1.jar,\
$JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,\
$JARS_HOME/zkclient-0.3.jar,$JARS_HOME/protobuf-java-2.4.0a.jar
dse spark-submit -v --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--executor-memory 512M \
--total-executor-cores 3 \
--deploy-mode "cluster" \
--master spark://$MASTER:7077 \
--jars=$JARS \
--supervise \
--class "com.testclass" $APP_JAR input.json \
--files "/home/test/input.json"
The above command is working fine in client mode. But when I use it in cluster mode I get class not found exception
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$
In client mode the dependent jars are getting copied to the /var/lib/spark/work directory whereas in cluster mode it is not. Please help me in getting this solved.
EDIT:
I am using nfs and I have mounted the same directory on all the spark nodes under same name. Still I get the error. How it is able to pick the application jar which is also under same directory but not the dependent jars ?
In client mode the dependent jars are getting copied to the
/var/lib/spark/work directory whereas in cluster mode it is not.
In Cluster mode, driver pragram is running in the cluster not in local(compared to client mode) and dependent jars should be accessible in cluster, otherwise driver program and executor will throw "java.lang.NoClassDefFoundError" exception.
Actually When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster.
Your extra jars could be added to --jars, they will be copied to cluster automatically.
please refer to "Advanced Dependency Management" section in below link:
http://spark.apache.org/docs/latest/submitting-applications.html
As spark documentation says,
Keep all jars and dependencies in same local path in all nodes in cluster or
Keep the jar is distributed files system where all nodes have access to.
Spark properties

Spark SQL Thrift Server on CDH 5.3.0

I am trying to use CDH 5.3.0 to run Spark's Thrift Server. I'm trying to follow the Spark SQL instructions, but I can't even get the --help option to run successfully. In the output below, it dies because it can't find the HiveServer2 class.
$ /usr/lib/spark/sbin/start-thriftserver.sh --help
Usage./sbin/start-thriftserver [options] [thrift server options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--help, -h Show this help message and exit
--verbose, -v Print additional debug output
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
--supervise If given, restarts the driver on failure.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
YARN-only:
--executor-cores NUM Number of cores per executor (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
Thrift server options:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hive/service/server/HiveServer2
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Caused by: java.lang.ClassNotFoundException: org.apache.hive.service.server.HiveServer2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more
As indicated by the error, the class is not in the classpath. Unfortunately, setting the CLASSPATH environment variable won't work. The only solution that I could find was to edit /usr/lib/spark/bin/compute-classpath.sh and add this line (it can go just about anywhere, but put it one line from the end to make it clear that it's an addition):
CLASSPATH="$CLASSPATH:/usr/lib/hive/lib/*"
Cloudera's release notes for 5.3.0 explicitly state "Spark SQL remains an experimental and unsupported feature in CDH", so it's not surprising that tweaks like this may be needed. Also, this response to a similar problem in CDH 5.2 suggests that the Hive jars are deliberately excluded by Cloudera for size reasons.
I have faced the same problem but I solved it in another way.
The cloudera CDH version was not 5.3.0 it was some version prior to that version so you will find the paths little different.
Simply the solution was to replace the spark-assembly-**.jar file that shipped with cloudera CDH by another version.
I downloaded spark from its official download page. The version I have downloaded was built for hadoop 2.4 and later. Extracting the downloaded file and look for spark-assembly-**.jar.
In the cloudera installation, I looked for the same file and I found it under that path /usr/lib/spark/libe/spark-assembly--.jar
The previous path actually was a symlink to the actual file. I uploaded the jar from spark download to the same path and make the symlink point to the new jar (ln -f -s target link).
Every thing works fine with me.
/usr/lib/spark/bin/compute-classpath.sh sets CLASSPATH="$SPARK_CLASSPATH". On CDH using parcels you can add the hive jars to SPARK_CLASSPATH like this:
SPARK_CLASSPATH=$(ls -1 /opt/cloudera/parcels/CDH/lib/hive/lib/*.jar | sed -e :a -e 'N;s/\n/:/;ta') /opt/cloudera/parcels/CDH/lib/spark/sbin/start-thriftserver.sh --help
Instructions from Cloudera Community forum
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-5-does-not-have-Spark-Thrift-Server/m-p/41849#M1758 :
git clone https://github.com/cloudera/spark.git
cd spark
./make-distribution.sh -DskipTests \
-Dhadoop.version=2.6.0-cdh5.7.0 \
-Phadoop-2.6 \
-Pyarn \
-Phive -Phive-thriftserver \
-Pflume-provided \
-Phadoop-provided \
-Phbase-provided \
-Phive-provided \
-Pparquet-provided
-Phive and -Phive-thriftserver are the key pieces there.
There is a request to add Spark Thrift Server
https://issues.cloudera.org/browse/DISTRO-817
please vote up if you want to see that in CDH.

Spark throws ClassNotFoundException when using --jars option

I was trying to follow the Spark standalone application example described here
https://spark.apache.org/docs/latest/quick-start.html#standalone-applications
The example ran fine with the following invocation:
spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
However, when I tried to introduce some third-party libraries via --jars, it throws ClassNotFoundException.
$ spark-submit --jars /home/linpengt/workspace/scala-learn/spark-analysis/target/pack/lib/* \
--class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.ClassNotFoundException: SimpleApp
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:300)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Removing the --jars option and the program runs again (I didn't actually start using those libraries yet). What's the problem here? How should I add the external jars?
According to spark-submit's --help, the --jars option expects a comma-separated list of local jars to include on the driver and executor classpaths.
I think that what's happening here is that /home/linpengt/workspace/scala-learn/spark-analysis/target/pack/lib/* is expanding into a space-separated list of jars and the second JAR in the list is being treated as the application jar.
One solution is to use your shell to build a comma-separated list of jars; here's a quick way of doing it in bash, based on this answer on StackOverflow (see that answer for more complex approaches that handle filenames that contain spaces):
spark-submit --jars $(echo /dir/of/jars/*.jar | tr ' ' ',') \
--class "SimpleApp" --master local[4] path/to/myApp.jar
Is your SimpleApp class in any specific package? It seems that you need to include the full package name in the command line. So, if the SimpleApp class is located in com.yourcompany.yourpackage, you'd have to submit the Spark job with --class "com.yourcompany.yourpackage.SimpleApp" instead of --class "SimpleApp". I had the same problem and changing the name to the full package and class name fixed it. Hope that helps!

Resources