I am trying to get an Apache Beam pipeline to run on Azure HDInsight with the SparkRunner.
I first tried a cluster based on Spark 2.3.0/Hadoop 2.7 (HDI 3.6) and then one based on Spark 2.3.1/Hadoop 3.0 (HDI 4.0 Preview).
I tried Apache Beam 2.2.0 first and then 2.10.0-SNAPSHOT.
The spark-submit command is (for Beam 2.10.0):
JARS="wasbs:///dependency/hadoop-azure-3.1.1.3.0.2.0-50.jar,wasbs:///dependency/azure-storage-7.0.0.jar,wasbs:///dependency/beam-model-fn-execution-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-model-job-management-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-model-pipeline-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-core-construction-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-core-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-direct-java-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-runners-spark-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-core-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-fn-execution-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-sdks-java-io-hadoop-file-system-2.10.0-SNAPSHOT.jar,wasbs:///dependency/beam-vendor-grpc-1_13_1-0.1.jar"
spark-submit --conf spark.yarn.maxAppAttempts=1 --deploy-mode cluster --master yarn --jars $JARS --class example.MinimalWordCountJava8 wasbs:///mavenproject1-1.0-SNAPSHOT.jar --runner=SparkRunner
(Initially --jars did not include the hadoop-azure and azure-storage jars, but adding them did not make any difference.)
The main() looks like this:
public static void main(String[] args) {
JavaSparkContext ct = new JavaSparkContext();
Configuration config = ct.hadoopConfiguration();
config.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasb");
config.set("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasbs");
config.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
config.set("fs.azure.account.key." + account + ".blob.core.windows.net", key);
config.set("fs.defaultFS", "wasb://" + container + "#" + account + ".blob.core.windows.net");
System.out.println("### hello.txt content:");
JavaRDD<String> content = ct.textFile("wasbs:///hello.txt");
System.out.println(content.toString());
System.out.println("### MinimalWordCountJava8");
PipelineOptions options = PipelineOptionsFactory.create();
SparkContextOptions sparkContextOptions = options.as(SparkContextOptions.class);
sparkContextOptions.setUsesProvidedSparkContext(true);
sparkContextOptions.setProvidedSparkContext(ct);
sparkContextOptions.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(sparkContextOptions);
p.apply(TextIO.read().from("hello.txt"))
.apply(FlatMapElements
.into(TypeDescriptors.strings())
.via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
.apply(Filter.by((String word) -> !word.isEmpty()))
.apply(Count.<String>perElement())
.apply(MapElements
.into(TypeDescriptors.strings())
.via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " + wordCount.getValue()))
// Write the formatted counts to text files with the prefix "output".
.apply(TextIO.write().to("output"));
p.run().waitUntilFinish();
}
It fails in the call to Pipeline.create(sparkContextOptions) with this exception trace:
18/12/09 14:47:10 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml
java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml
at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:59)
at org.apache.beam.sdk.io.FileSystems.verifySchemesAreUnique(FileSystems.java:489)
at org.apache.beam.sdk.io.FileSystems.setDefaultPipelineOptions(FileSystems.java:479)
at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:47)
at org.apache.beam.sdk.Pipeline.create(Pipeline.java:145)
at io.aptly.mavenproject1.MinimalWordCountJava8.main(MinimalWordCountJava8.java:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "wasbs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3332)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:3377)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:530)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:542)
at org.apache.beam.sdk.io.hdfs.HadoopFileSystem.<init>(HadoopFileSystem.java:82)
at org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar.fromOptions(HadoopFileSystemRegistrar.java:56)
... 10 more
18/12/09 14:47:10 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.IllegalArgumentException: Failed to construct Hadoop filesystem with configuration Configuration: /usr/hdp/3.0.2.0-50/hadoop/conf/core-site.xml, /usr/hdp/3.0.2.0-50/hadoop/conf/hdfs-site.xml)
The submit itself works (the wasbs:// path is recognised) and reading the small wasbs:///hello.txt does not fail, which indicates that wasbs:// works fine up to that point.
It seems to fail early inside Beam.
Because of this I passed the provided JavaSparkContext in through the PipelineOptions (with the dynamic Hadoop configuration suggested by other SO questions/answers), but this did not make any difference for me.
Can anyone suggest how to get around this issue?
From quickly digging through code and bug trackers, it looks like Azure is supported as a Hadoop filesystem starting with Hadoop 3.2.0 (code, Jira), while Beam is currently pinned to Hadoop 2.7.3. This would explain the failure in Beam's HadoopFileSystem.
It may be that spark-submit succeeded because wasbs:// is supported there through a mechanism other than Hadoop's libraries, or through a bundled, newer version of Hadoop.
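For reference (and not as a confirmed fix, given the version pin above), Beam's Hadoop filesystem support is normally configured through HadoopFileSystemOptions rather than through the provided SparkContext: HadoopFileSystemRegistrar builds its filesystems from the Hadoop Configuration objects carried in those options. A minimal sketch, assuming the hadoop-azure and azure-storage classes are on Beam's classpath, with <account>, <key> and <container> as placeholders rather than values from the original post:

import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class BeamWasbsSketch {
    public static void main(String[] args) {
        // Hadoop configuration carrying the wasb/wasbs bindings; Beam's
        // HadoopFileSystemRegistrar reads this list when it constructs filesystems.
        Configuration conf = new Configuration();
        conf.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        // wasbs is usually backed by the Secure variant of the same class.
        conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure");
        conf.set("fs.azure.account.key.<account>.blob.core.windows.net", "<key>");
        conf.set("fs.defaultFS", "wasbs://<container>@<account>.blob.core.windows.net");

        HadoopFileSystemOptions options =
            PipelineOptionsFactory.create().as(HadoopFileSystemOptions.class);
        options.setHdfsConfiguration(Collections.singletonList(conf));

        Pipeline p = Pipeline.create(options);
        // ... build the pipeline and call p.run().waitUntilFinish() as usual
    }
}

Whether this helps at all still depends on the Hadoop version Beam links against, as described above.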
Related
I'm trying to launch a cluster using the AWS CLI. I use the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
The cluster is created successfully. Then I add this command:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Name=SparkSubmit,Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/scalaProgram.jar,s3://tracceale/params/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
After some time, the step failed. This is the log file:
17/02/22 11:00:07 INFO RMProxy: Connecting to ResourceManager at ip-172-31-31-190.us-west-2.compute.internal/172.31.31.190:8032
17/02/22 11:00:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/02/22 11:00:08 INFO Client: Verifying our application has not requested
Exception in thread "main" org.apache.spark.SparkException: Application application_1487760984275_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1132)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1175)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/02/22 11:01:02 INFO ShutdownHookManager: Shutdown hook called
17/02/22 11:01:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-27baeaa9-8b3a-4ae6-97d0-abc1d3762c86
Command exiting with ret '1'
Locally (on the Hortonworks HDP 2.5 Sandbox) I run:
./spark-submit --class Traccia2014 --master local[*] --executor-memory 2G /usr/hdp/current/spark2-client/ScalaProjects/ScripRapportoBatch2.1/target/scala-2.11/traccia-22-ottobre_2.11-1.0.jar "/home/tracce/configHDFS.txt" 30 300 3
and everything works fine.
I've already read something related to my problem, but I can't figure it out.
UPDATE
Looking into the Application Master logs, I get this error:
17/02/22 15:29:54 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:91)
at scala.io.Source$.fromFile(Source.scala:76)
at scala.io.Source$.fromFile(Source.scala:54)
at Traccia2014$.main(Rapporto.scala:40)
at Traccia2014.main(Rapporto.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
17/02/22 15:29:55 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.io.FileNotFoundException: s3:/tracceale/params/configS3.txt (No such file or directory))
I pass the S3 path "s3://tracceale/params/configS3.txt" to the function 'fromFile' like this:
for(line <- scala.io.Source.fromFile(logFile).getLines())
How could I solve it? Thanks in advance.
Because you are using cluster deploy mode, the logs you have included are not useful: they only say that the application failed, not why. To find out why, you at least need to look at the Application Master logs, since that is where the Spark driver runs in cluster deploy mode; they will probably give a better hint as to what went wrong.
Since you have configured your cluster with a --log-uri, you will find the logs for the Application Master underneath s3://aws-logs-813591802533-us-west-2/elasticmapreduce/<CLUSTER ID>/containers/<YARN Application ID>/ where the YARN Application ID is (based on the logs you included above) application_1487760984275_0001, and the container ID should be something like container_1487760984275_0001_01_000001. (The first container for an application is the Application Master.)
What you have there is a URL to an object store, reachable through the Hadoop filesystem APIs, and a stack trace coming from java.io, which can't read it because the path doesn't refer to anything on the local disk.
Use SparkContext.hadoopRDD() (or a convenience wrapper such as textFile()) to turn the path into an RDD instead.
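As an illustrative sketch of that suggestion (the original snippet is Scala; this is a rough Java equivalent showing the idea, with the S3 path taken from the question):

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadConfigFromS3 {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReadConfigFromS3"));
        // textFile() goes through the Hadoop FileSystem API (hadoopFile/HadoopRDD underneath),
        // so s3:// and s3a:// URIs are fetched from the object store instead of the local disk.
        List<String> lines = sc.textFile("s3://tracceale/params/configS3.txt").collect();
        for (String line : lines) {
            System.out.println(line);
        }
        sc.stop();
    }
}

collect() pulls the file's lines back to the driver, which is fine for a small configuration file but not for large inputs.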
The file may be missing from that location; you might be able to see it after SSHing into the EMR cluster, but the step command still cannot resolve it by itself and throws that file-not-found exception.
In this scenario, what I did was:
Step 1: Check that the file exists in the project directory that was copied to EMR.
For example, mine was in `//usr/local/project_folder/`.
Step 2: Copy the script that you expect to run on the EMR cluster.
For example, I copied it from `//usr/local/project_folder/script_name.sh` to `/home/hadoop/`.
Step 3: Execute the script from /home/hadoop/ by passing its absolute path to command-runner.jar:
command-runner.jar bash /home/hadoop/script_name.sh
With that, my script ran. Hope this helps someone.
When I load a MySQL JDBC driver by first copying it to the driver node and then including it via --jars /path/to/jdbc/driver.jar, referencing that JDBC driver and loading data into a DataFrame succeeds.
$ pyspark --jars /path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})
But if I point --jars at the publicly available https-hosted version of that exact jar file, it fails.
$ pyspark --jars https://s3/path/to/jdbc/driver.jar
>>> rdd = sqlContext.read.jdbc(url="jdbc:mysql://someAWSDatabase.us-west-2.rds.amazonaws.com:3306?user=root&password=somepassword", table="spark.test", properties={"driver":"com.mysql.jdbc.Driver"})
py4j.protocol.Py4JJavaError: An error occurred while calling o37.jdbc.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
...
According to the docs, you can submit jars from various locations, from local to http/https, etc. Why does this cause different behavior?
Update: I also tried running two spark-submit jobs, one with each variant of the path to the JDBC jar. The https jar submission threw the same error as above.
I am trying to use AWS EMR (emr-4.3.0) with Spark 1.6.0 and Hadoop 2.7.0.
I created an EMR cluster and added a step (in the AWS EMR web console) with my sample jar.
It is a Spring Boot application written in Java 8 (I installed JDK 8 on the box).
It is run with the following command:
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class org.springframework.boot.loader.JarLauncher s3://my-test/SparkForSpring-S1.2014.jar
I create the SparkContext with the following code:
SparkConf conf = new SparkConf().setAppName("SparkForSpring");
return new JavaSparkContext(conf);
But it fails with the following error. I feel it is not really related to my application itself; I am new to Spark and YARN, though.
Caused by: org.springframework.beans.factory.BeanDefinitionStoreException: Factory method [public org.apache.spark.api.java.JavaSparkContext com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext()] threw exception; nested exception is java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:188)
at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:586)
... 49 more
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1650)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:66)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:547)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext(SparkConfig.java:35)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.CGLIB$javaSparkContext$0(<generated>)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b$$FastClassBySpringCGLIB$$10b15a77.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:312)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.javaSparkContext(<generated>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:166)
... 50 more
I read some documentation but am not quite sure what I should do to fix this error. A hint would be greatly helpful.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
I solved this problem by not using Spring Boot's executable jar; instead I used the Maven Shade plugin to package only the Spring-related jar files into one jar, using the system classloader (see the full pom.xml for details).
I got a hint from this question's answer:
apache-spark 1.3.0 and yarn integration and spring-boot as a container
I am trying to read all rows from a DB table and write them to another, empty target table. When I issue the following command at the main node, it works as expected:
$./bin/spark-submit --class cs.TestJob_publisherstarget --driver-class-path ./lib/mysql-connector-java-5.1.35-bin.jar --jars ./lib/mysql-connector-java-5.1.35-bin.jar,./lib/univocity-parsers-1.5.6.jar,./lib/commons-csv-1.1.1-SNAPSHOT.jar ./lib/uber-ski-spark-job-0.0.1-SNAPSHOT.jar
(where uber-ski-spark-job-0.0.1-SNAPSHOT.jar is the packaged jar in the ../spark/lib folder and cs.TestJob_publisherstarget is the class)
The above command works perfectly: it reads all rows from a table in MySQL and dumps all rows into the target table, using the JDBC driver passed with the --jars option.
Here is the issue:
With everything else the same as above, when I submit the same job to YARN, it fails with an exception indicating it can't find the driver:
$./bin/spark-submit --verbose --class cs.TestJob_publisherstarget --master yarn-cluster --driver-class-path ./lib/mysql-connector-java-5.1.35-bin.jar --jars ./lib/mysql-connector-java-5.1.35-bin.jar ./lib/uber-ski-spark-job-0.0.1-SNAPSHOT.jar
Exception in YARN Console:
Error: application failed with exception
org.apache.spark.SparkException: Application finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:625)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:650)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:577)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:174)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
EXCEPTION AT LOG:
15/10/12 20:38:59 ERROR yarn.ApplicationMaster: User class threw exception: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root
java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:96)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at com.cambridgesemantics.application.sdi.compiler.spark.DataSource.getDataFrame(DataSource.scala:20)
at cs.TestJob_publisherstarget$.main(TestJob_publisherstarget.scala:29)
at cs.TestJob_publisherstarget.main(TestJob_publisherstarget.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:484)
15/10/12 20:38:59 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: No suitable driver found for jdbc:mysql://localhost:3306/pubs?user=root&password=root)
Anyway: where am I supposed to put the JDBC driver jar file? I have copied it into the lib directory of each child node, and still no luck!
I was having the same issue: it was working in local mode but not in yarn-client mode.
I added this to spark-submit:
--conf "spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.34.jar
and that worked for me
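For context, a complete submit command with that flag might look like the following; it only combines options already shown in this thread, so the exact paths are placeholders rather than a verified command:
spark-submit --master yarn-client --driver-class-path /path/to/mysql-connector-java-5.1.34.jar --conf "spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.34.jar" --jars /path/to/mysql-connector-java-5.1.34.jar --class cs.TestJob_publisherstarget ./lib/uber-ski-spark-job-0.0.1-SNAPSHOT.jar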
For Spark 1.6, I had an issue storing a DataFrame to Oracle using org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable.
In yarn-cluster mode, I put these options in the submit script:
--conf "spark.driver.extraClassPath=$HOME/jdbc-11.2.0.3.0.jar" \
--conf "spark.executor.extraClassPath=$HOME/jdbc-11.2.0.3.0.jar" \
I also had to add Class.forName(..) before the save call, like below:
try {
    // register the Oracle JDBC driver explicitly before saving
    Class.forName("oracle.jdbc.OracleDriver");
    org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(ds, url, "RD_SPARK_DTL_INCL_HY ", p);
} catch (Exception e) {
    // handle or log the exception as appropriate
}
Of course, you have to copy the jar to each node. Not pretty, but it works. Hopefully someone comes up with a better solution later.
I do strongly recommend using this API -- it is amazingly convenient and fast.
I am using the Java code below to submit a job to a yarn-cluster.
public ApplicationId submitQuery(String requestId, String query,String fileLocations) {
String driverJar = getDriverJar();
String driverClass = propertyService.getAppPropertyValue(TypeString.QUERY_DRIVER_CLASS);
String driverAppName = propertyService.getAppPropertyValue(TypeString.DRIVER_APP_NAME);
String extraJarsNeeded = propertyService.getAppPropertyValue(TypeString.DRIVER_EXTRA_JARS_NEEDED);
String[] args = new String[] {
// the name of your application
"--name",
driverAppName,
// memory for driver (optional)
"--driver-memory",
"1000M",
// path to your application's JAR file
// required in yarn-cluster mode
"--jar",
"local:/home/ankit/Repository/Personalization/rtis/Cust360QueryDriver/target/SnapdealCustomer360QueryDriver-jar-with-selective-dependencies.jar",
"--addJars",
"local:/home/ankit/Downloads/lib/spark-assembly-1.3.1-hadoop2.4.0.jar,local:/home/ankit/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar,local:/home/ankit/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar",
// name of your application's main class (required)
"--class",
driverClass,
"--arg",
requestId,
"--arg",
query,
"--arg",
fileLocations,
"--arg",
"yarn-client"
};
System.setProperty("HADOOP_CONF_DIR", "/home/hduser/hadoop-2.7.0/etc/hadoop");
Configuration config = new Configuration();
config.set("yarn.resourcemanager.address", propertyService.getAppPropertyValue(TypeString.RESOURCE_MANGER_URL));
config.set("fs.default.name", propertyService.getAppPropertyValue(TypeString.FS_DEFAULT_NAME));
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkConf = new SparkConf();
ClientArguments cArgs = new ClientArguments(args, sparkConf);
// create an instance of yarn Client client
Client client = new Client(cArgs, config, sparkConf);
ApplicationId id = client.submitApplication();
return id;
}
The job gets submitted to the YARN cluster and I am able to retrieve the application id, but I get the exception below while the job runs on the Spark cluster:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more
The mentioned class is present in /home/ankit/Downloads/lib/spark-assembly-1.3.1-hadoop2.4.0.jar, so it looks like the jar passed via --addJars is not being added to the driver's Spark context.
Am I doing something wrong? Any help would be appreciated.
Are you deploying on Cloudera's distribution? spark.yarn.jar in the CDH 5.4 config has a 'local:' prefix for local files, but Spark versions >= 1.5 do not like this; you should just use the full path name of your Spark assembly. See also here.
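For illustration, that means pointing spark.yarn.jar at a plain filesystem path instead of a local: URI; the path below is a typical CDH parcel location and is an assumption, not taken from the original answer:
--conf spark.yarn.jar=/opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar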
Try building the JAR without the Spark dependency and pass the dependent jars with --jars in spark-submit. Most of the time a ClassNotFoundException occurs because Spark and the application itself depend on the same jar.
Suggested solutions:
- Package without the dependency and add dependent jars with --jars during spark-submit.
- Modify the application to use the same version of the third-party library that Spark has.
- Use shading in your build tool.