JDBC driver not found when submitting to YARN from Spark - apache-spark

I am trying to read all rows from a DB table and write them to another, empty target table. When I issue the following command on the main node, it works as expected:
$./bin/spark-submit --class cs.TestJob_publisherstarget --driver-class-path ./lib/mysql-connector-java-5.1.35-bin.jar --jars ./lib/mysql-connector-java-5.1.35-bin.jar,./lib/univocity-parsers-1.5.6.jar,./lib/commons-csv-1.1.1-SNAPSHOT.jar ./lib/uber-ski-spark-job-0.0.1-SNAPSHOT.jar
(Here uber-ski-spark-job-0.0.1-SNAPSHOT.jar is the packaged jar in the ../spark/lib folder and cs.TestJob_publisherstarget is the main class.)
The above command works perfectly: it reads all rows from the MySQL table and dumps all rows into the target table, using the JDBC driver passed with the --jars option.
Here is the issue:
With everything else the same as above, when I submit the same job to YARN it fails with an exception indicating that the driver can't be found:
$./bin/spark-submit --verbose --class cs.TestJob_publisherstarget --master yarn-cluster --driver-class-path ./lib/mysql-connector-java-5.1.35-bin.jar --jars ./lib/mysql-connector-java-5.1.35-bin.jar ./lib/uber-ski-spark-job-0.0.1-SNAPSHOT.jar
Exception in YARN Console:
Error: application failed with exception
org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:625)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:650)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:577)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:174)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
EXCEPTION IN THE LOG:
15/10/12 20:38:59 ERROR yarn.ApplicationMaster: User class threw exception: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root
java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:96)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at com.cambridgesemantics.application.sdi.compiler.spark.DataSource.getDataFrame(DataSource.scala:20)
at cs.TestJob_publisherstarget$.main(TestJob_publisherstarget.scala:29)
at cs.TestJob_publisherstarget.main(TestJob_publisherstarget.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:484)
15/10/12 20:38:59 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root)
Anyway: where am I supposed to put the JDBC driver jar file? I have copied it to the lib folder of each child node, but still no luck!

I was having the same issue: it was working in local mode but not in yarn-client mode.
I added this to spark-submit:
--conf "spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.34.jar"
and that worked for me.
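In addition to putting the connector on the executor classpath, you can tell Spark's JDBC data source which driver class to use so it does not rely on DriverManager auto-discovery. A minimal sketch (not from the original post), assuming Spark 1.4+ where sqlContext.read is available, using the URL from the log above and a placeholder table name:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("JdbcReadSketch"))
val sqlContext = new SQLContext(sc)

// "source_table" is a placeholder for the real table name.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/pubs?user=root&password=root")
  .option("dbtable", "source_table")
  .option("driver", "com.mysql.jdbc.Driver") // avoids "No suitable driver found"
  .load()
df.show()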

With Spark 1.6, I had this issue when storing a DataFrame to Oracle using org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable.
In yarn-cluster mode, I put these options in the submit script:
--conf "spark.driver.extraClassPath=$HOME/jdbc-11.2.0.3.0.jar" \
--conf "spark.executor.extraClassPath=$HOME/jdbc-11.2.0.3.0.jar" \
I also had to add a Class.forName(...) call, as shown below, before the save call:
try {
    // Register the Oracle JDBC driver explicitly before saving
    Class.forName("oracle.jdbc.OracleDriver");
    org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, "RD_SPARK_DTL_INCL_HY ", p);
} catch (Exception e) {
    // handle or log the failure
}
Of course, you have to copy the jar to each node. Not pretty, but it works. Hopefully someone can come up with a better solution later.
I strongly recommend using this API; it is amazingly convenient and fast.
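A note not in the original answer: JdbcUtils.saveTable is an internal Spark class, so its signature can change between releases. A hedged alternative sketch using the public DataFrameWriter.jdbc API (available since Spark 1.4), assuming the same ds DataFrame, JDBC url, and java.util.Properties p as above:
import java.util.Properties

// Naming the driver class here is a belt-and-braces addition; depending on the
// Spark version, the Class.forName call shown above may still be needed.
p.setProperty("driver", "oracle.jdbc.OracleDriver")
ds.write
  .mode("append") // or "overwrite", depending on the target table
  .jdbc(url, "RD_SPARK_DTL_INCL_HY", p)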

Related

Spark submit in cluster mode, exception in thread "main"

My spark-submit syntax is:
spark-submit --queue regular --deploy-mode cluster --conf spark.locality.wait=5000000ms --num-executors 100 --executor-memory 40G job.py
The exception happened after the job had run successfully for a while:
Application diagnostics message: User application exited with status 1
Exception in thread "main" org.apache.spark.SparkException: Application application_1635856758535_5228470 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1150)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1530)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
More importantly, the job.py script does no local file reads or writes, only Parquet reads and writes.
I cannot even check the application UI, because the application was closed due to the exception. If anyone else has encountered a similar issue and has any ideas, please advise. Thanks!

Spark job not running when jar is in HDFS

I am trying to run a Spark job in standalone mode, but the command is not picking up the jar from HDFS. The jar is present in the HDFS location, and it works fine when I run it in local mode.
Below is the command I am using:
spark-submit --deploy-mode client --master yarn --class com.main.WordCount /spark/wc.jar
Below is my program:
val conf = new SparkConf().setAppName("WordCount").setMaster("yarn")
val spark = new SparkContext(conf)
val file = spark.textFile(args(0))
val count = file.flatMap(f=>f.split(" ")).map(word=>(word,1)).reduceByKey(_+_).collect
count.foreach(println)
And I am getting the below error:
Warning: Local jar /spark/wc.jar does not exist, skipping.
java.lang.ClassNotFoundException: com.main.WordCount
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:693)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
But if I use deploy mode cluster, I get the below error:
Exception in thread "main" java.io.FileNotFoundException: File file:/spark/wc.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:340)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:530)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:529)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:529)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1119)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1178)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Could you please clarify what you mean by local mode? There are only two deploy modes, client and cluster; the only difference is that in client mode the driver program runs on the submitting machine, while in cluster mode the driver program runs on an arbitrary node in the cluster.
Regarding the spark-submit command:
When you execute the spark-submit command, Spark uploads all the local resources/files defined with the --files and --py-files arguments, as well as the main application jar, to a temporary HDFS location/directory created for that particular Spark application under the application name. When you give an HDFS location instead, it fails to locate the jar on the local machine. It is mandatory to keep the jar on the local filesystem.

AWS EMR using spark steps in cluster mode. Application application_ finished with failed status

I'm trying to launch a cluster using the AWS CLI. I use the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
The cluster is created successfully. Then I add a step with this command:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
After some time, the step failed. This is the log file:
17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31-31-190.us-west-2.compute.internal/172.31.31.190:8032
17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/02/22 11:00:08 INFO Client: Verifying our application has not requested
Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
Command exiting with ret '1'
Locally (on the Hortonworks HDP 2.5 sandbox) I run:
./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3
and everything works fine.
I've already read something related to my problem, but I can't figure it out.
UPDATE
Checking the Application Master logs, I get this error:
17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory))
I pass the S3 path mentioned above, "s3://tracceale/params/configS3.txt", to the 'fromFile' function like this:
for(line <- scala.io.Source.fromFile(logFile).getLines())
How could I solve it? Thanks in advance.
Because you are using cluster deploy mode, the logs you have included are not useful at all. They just say that the application failed but not why it failed. To figure out why it failed, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode, and it will probably give a better hint as to why the application failed.
Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)
What you have there is a URL to an object store, reachable through the Hadoop filesystem APIs, and a stack trace coming from java.io.File, which can't read it because it doesn't refer to anything on the local disk.
Use SparkContext.hadoopRDD() (or one of its convenience wrappers) as the operation to convert the path into an RDD.
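For a small text config file like this one, the simplest wrapper is SparkContext.textFile, which goes through the Hadoop filesystem layer and therefore understands s3:// URLs on EMR. A minimal sketch, assuming an existing SparkContext named sc:
// Read the config through Hadoop's filesystem API rather than java.io.File;
// collect() brings the (small) file back to the driver as lines, mirroring
// the original Source.fromFile(logFile).getLines() loop.
val configLines = sc.textFile("s3://tracceale/params/configS3.txt").collect()
for (line <- configLines) {
  // ... parse each config line as before ...
}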
The file may also simply be missing from the expected location. You may be able to see it after SSHing into the EMR cluster, but the step command still cannot find it by itself and throws that file-not-found exception.
In this scenario, what I did was:
Step 1: Checked that the file exists in the project directory we copied to EMR.
For example, mine was in `//usr/local/project_folder/`
Step 2: Copied the script I wanted to run on EMR.
For example, I copied `//usr/local/project_folder/script_name.sh` to `/home/hadoop/`
Step 3: Executed the script from /home/hadoop/ by passing the absolute path to command-runner.jar:
command-runner.jar bash /home/hadoop/script_name.sh
That got my script running. Hope this is helpful to someone.

AWS EMR no host: hdfs:///var/log/spark/apps

I am trying to use AWS EMR (emr-4.3.0) with Spark 1.6.0 and Hadoop 2.7.0.
I created an EMR cluster and added a step (in the AWS EMR web console) with my sample jar.
It's a Spring Boot application written in Java 8 (I installed JDK 8 on the box).
It runs with the following command:
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class org.springframework.boot.loader.JarLauncher s3://my-test/SparkForSpring-S1.2014.jar
I create the SparkContext with the following code:
SparkConf conf = new SparkConf().setAppName("SparkForSpring");
return new JavaSparkContext(conf);
But it fails with the following error. I feel like it's not related to my application, though I am new to Spark and YARN.
Caused by: org.springframework.beans.factory.BeanDefinitionStoreException: Factory method [public org.apache.spark.api.java.JavaSparkContext com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext()] threw exception; nested exception is java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:188)
at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:586)
... 49 more
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1650)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:66)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:547)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext(SparkConfig.java:35)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.CGLIB$javaSparkContext$0(<generated>)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b$$FastClassBySpringCGLIB$$10b15a77.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:312)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.javaSparkContext(<generated>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:166)
... 50 more
I read some documentation but am not quite sure what I should do to fix this error. A hint would be greatly helpful.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
I solved this problem by not using Spring Boot's executable jar; instead I used the Maven Shade plugin to package only the Spring-related jar files into one jar, using the system classloader. Here is the full pom.xml.
I got a hint from this question's answer:
apache-spark 1.3.0 and yarn integration and spring-boot as a container

spark-submit classpath issue with --repositories --packages options

I'm running Spark in a standalone cluster where the Spark master, worker, and submit processes each run in their own Docker container.
When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr log on the Spark worker's web UI reports a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't seem to be available on the worker classpath.
16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
at com.my.spark.app.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
The spark-submit call:
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic
I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, just some observations based on experimentation and reading around for solutions. I am noting them down here in case they help someone in their investigation. I will update this answer if I find more information later.
The --repositories option is required only if some custom repository has to be referenced.
By default, Maven Central is used if the --repositories option is not provided.
When the --packages option is specified, the submit operation tries to look for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, and ~/.m2/repository directories.
If they are not found, they are downloaded from Maven Central using Ivy and stored under the ~/.ivy2 directory.
In my case I observed that:
spark-shell worked perfectly with the --packages option.
spark-submit would fail to do the same: it would download the dependencies correctly but fail to pass the jars on to the driver and worker nodes.
spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster.
This runs the driver locally in the command shell where the spark-submit command was issued, while the workers run on the cluster with the appropriate dependency jars.
I found the following discussion useful but I still have to nail down this problem.
https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455
Most people just use an uber jar to avoid running into this problem, and also to avoid conflicting jar versions where the platform provides a different version of the same dependency jar.
But I don't like that idea as anything more than a stopgap, and I am still looking for a solution.
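For reference, a minimal sketch of what the uber-jar route looks like with sbt-assembly (assuming an sbt build; the Maven Shade plugin is the Maven equivalent). The plugin version, artifact versions, and merge strategy below are illustrative conventions, not taken from this question:
// project/plugins.sbt:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt
name := "spark-app"
scalaVersion := "2.10.6"

// Spark core is provided by the cluster, so keep it out of the fat jar;
// the Kafka integration is bundled because the workers need it.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0"
)

// Resolve duplicate files pulled in by transitive dependencies.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}
Running sbt assembly then produces a single jar that can be passed to spark-submit without --repositories or --packages.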
