Sqoop error while running via Spark - apache-spark

When I run this import via the sqoop command line, it works:
sqoop import --connect "jdbc:sqlserver://myhost:port;databaseName=DBNAME" \
--username MYUSER -P \
--compress --compression-codec snappy \
--as-parquetfile \
--table MYTABLE \
--warehouse-dir /user/myuser/test1/ \
--m 1
Then I created the Spark Scala code below. But when I execute the project using spark-submit, it does not work:
// conf is the Hadoop Configuration for the cluster, created elsewhere in the application
val sqoop_options: SqoopOptions = new SqoopOptions()
sqoop_options.setConnectString("jdbc:sqlserver://myhost:port;databaseName=DBNAME")
sqoop_options.setTableName("MYTABLE")
sqoop_options.setUsername("MYUSER")
sqoop_options.setPassword("password")
sqoop_options.setNumMappers(1)
sqoop_options.setTargetDir("/user/myuser/test1/")
sqoop_options.setFileLayout(FileLayout.ParquetFile)
sqoop_options.setCompressionCodec("org.apache.hadoop.io.compress.SnappyCodec")
val importTool = new ImportTool
val sqoop = new Sqoop(importTool, conf, sqoop_options)
val retCode = ToolRunner.run(sqoop, null)
It returns a driver-not-found error, even though I run it on the same cluster.
I already put the appropriate driver library in the /var/lib/sqoop directory, which is why the sqoop command runs fine. But does it refer to a different library path when I run it via spark-submit?
Detailed error log:
/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/spark/conf/spark-env.sh: line 75: spark.driver.extraClassPath=.:/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar://opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/guava-12.0.1.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/zookeeper.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/protobuf-java-2.5.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-hbase-handler.jar: No such file or directory
/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/spark/conf/spark-env.sh: line 77: spark.executor.extraClassPath=.:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar://opt/cloudera/parcels/CDH/lib/hbase/hbase-server.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/guava-12.0.1.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/zookeeper.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/protobuf-java-2.5.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop2-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-hadoop-compat.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hive/lib/hive-hbase-handler.jar: No such file or directory
2018-03-09 13:59:37,332 INFO [main] security.UserGroupInformation: Login successful for user myuser using keytab file myuser.keytab
2018-03-09 13:59:37,371 INFO [main] sqoop.Sqoop: Running Sqoop version: 1.4.6
2018-03-09 13:59:37,426 WARN [main] sqoop.ConnFactory: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
2018-03-09 13:59:37,478 INFO [main] manager.SqlManager: Using default fetchSize of 1000
2018-03-09 13:59:37,479 INFO [main] tool.CodeGenTool: Beginning code generation
2018-03-09 13:59:37,479 INFO [main] tool.CodeGenTool: Will generate java class as codegen_MYTABLE
Exception in thread "main" java.lang.RuntimeException: Could not load db driver class: com.microsoft.sqlserver.jdbc.SQLServerDriver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:227)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:295)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1833)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1645)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:107)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:478)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.test.spark.sqoop.SqoopExample$.importSQLToHDFS(SqoopExample.scala:56)
at com.test.spark.sqoop.SqoopExample$.main(SqoopExample.scala:18)
at com.test.spark.sqoop.SqoopExample.main(SqoopExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Now, after adding the SQL Server JDBC driver to the driver classpath in the spark-submit command below, my error is:
spark-submit --files kafka-jaas.conf,ampuser.keytab --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka-jaas.conf" --driver-java-options "-Djava.security.auth.login.config=kafka-jaas.conf" --conf spark.driver.extraClassPath=/var/lib/sqoop/sqljdbc4.jar:/opt/cloudera/parcels/CDH/lib/sqoop/lib/*,/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/* --class com.danamon.spark.sqoop.SqoopExample --deploy-mode client --master yarn kafka-streaming-0.0.1-SNAPSHOT-jar-with-dependencies.jar
18/03/13 20:54:51 INFO security.UserGroupInformation: Login successful for user ampuser using keytab file ampuser.keytab
18/03/13 20:54:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
18/03/13 20:54:51 WARN sqoop.ConnFactory: $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
18/03/13 20:54:51 INFO manager.SqlManager: Using default fetchSize of 1000
18/03/13 20:54:51 INFO tool.CodeGenTool: Beginning code generation
18/03/13 20:54:51 INFO tool.CodeGenTool: Will generate java class as codegen_BD_AC_ACCT_PREFERENCES
18/03/13 20:54:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [BD_AC_ACCT_PREFERENCES] AS t WHERE 1=0
18/03/13 20:54:52 INFO orm.CompilationManager: $HADOOP_MAPRED_HOME is not set
Note: /tmp/sqoop-ampuser/compile/95e3ef854d67b50d8ef72955151dc846/codegen_BD_AC_ACCT_PREFERENCES.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/03/13 20:54:54 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-ampuser/compile/95e3ef854d67b50d8ef72955151dc846/codegen_BD_AC_ACCT_PREFERENCES.jar
18/03/13 20:54:54 INFO mapreduce.ImportJobBase: Beginning import of BD_AC_ACCT_PREFERENCES
18/03/13 20:54:54 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
Exception in thread "main" java.lang.NoClassDefFoundError: org/kitesdk/data/mapreduce/DatasetKeyOutputFormat
at org.apache.sqoop.mapreduce.DataDrivenImportJob.getOutputFormatClass(DataDrivenImportJob.java:190)
at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:94)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:259)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
at org.apache.sqoop.manager.SQLServerManager.importTable(SQLServerManager.java:163)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.danamon.spark.sqoop.SqoopExample$.importSQLToHDFS(SqoopExample.scala:57)
at com.danamon.spark.sqoop.SqoopExample$.main(SqoopExample.scala:18)
at com.danamon.spark.sqoop.SqoopExample.main(SqoopExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.kitesdk.data.mapreduce.DatasetKeyOutputFormat
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 22 more
Is this caused by my Cloudera installation not being configured properly? Or maybe HADOOP_HOME, MAPRED_HOME, etc. are not set properly?
Should I create a new question for this?

You need to set HADOOP_MAPRED_HOME to $HADOOP_HOME in your ~/.bashrc:
sudo nano ~/.bashrc
Then add this line:
export HADOOP_MAPRED_HOME=$HADOOP_HOME
Save the file and then run:
source ~/.bashrc
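For the remaining NoClassDefFoundError on the Kite SDK class, the Sqoop dependency jars also need to be visible to the Spark driver and executors. A minimal, hedged spark-submit sketch, assuming the Kite jar sits under the CDH sqoop lib directory already referenced in the question (the exact jar name and version depend on your parcel, so adjust the paths):
# sketch only: jar names/paths below are assumptions, not verified against your cluster
spark-submit \
  --master yarn --deploy-mode client \
  --class com.danamon.spark.sqoop.SqoopExample \
  --driver-class-path /var/lib/sqoop/sqljdbc4.jar \
  --jars /var/lib/sqoop/sqljdbc4.jar,/opt/cloudera/parcels/CDH/lib/sqoop/lib/kite-data-mapreduce.jar \
  kafka-streaming-0.0.1-SNAPSHOT-jar-with-dependencies.jar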

Related

spark-submit on local Hadoop-Yarn setup, fails with Stdout path must be absolute error

I have installed the latest Hadoop and Spark versions on my Windows machine.
I am trying to launch one of the provided examples but it fails and I have no idea what the diagnostic means. It seems it's related to the stdout but I can't figure out the root cause.
I launch the following command:
spark-submit --master yarn --class org.apache.spark.examples.JavaSparkPi C:\spark-3.0.1-bin-hadoop3.2\examples\jars\spark-examples_2.12-3.0.1.jar 100
And the exception I have is:
21/01/25 10:53:53 WARN MetricsSystem: Stopping a MetricsSystem that is not running
21/01/25 10:53:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/01/25 10:53:53 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: Application application_1611568137841_0002 failed 2 times due to AM Container for appattempt_1611568137841_0002_000002 exited with exitCode: -1
Failing this attempt.Diagnostics:
[2021-01-25 10:53:53.381] Stdout path must be absolute
For more detailed output, check the application tracking page: http://xxxx-PC:8088/cluster/app/application_1611568137841_0002 Then click on links to logs of each attempt.
. Failing the application.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:201)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:555)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2574)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:934)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:928)
at org.apache.spark.examples.JavaSparkPi.main(JavaSparkPi.java:37)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/01/25 10:53:53 INFO ShutdownHookManager: Shutdown hook called
21/01/25 10:53:53 INFO ShutdownHookManager: Deleting directory C:\Users\xxx\AppData\Local\Temp\spark-b28ecb32-5e3f-4d6a-973a-c03a7aae0da9
21/01/25 10:53:53 INFO ShutdownHookManager: Deleting directory C:\Users/xxx\AppData\Local\Temp\spark-3665ba77-d2aa-424a-9f75-e772bb5b9104
As for the diagnostics:
Diagnostics:
Application application_1611562870926_0004 failed 2 times due to AM Container for appattempt_1611562870926_0004_000002 exited with exitCode: -1
Failing this attempt.Diagnostics: [2021-01-25 10:29:19.734]Stdout path must be absolute
For more detailed output, check the application tracking page: http://****-PC:8088/cluster/app/application_1611562870926_0004 Then click on links to logs of each attempt.
. Failing the application.
Thank you !
So I am not sure of the root cause yet; it is probably because I run under Windows and some default property was wrong for YARN.
When I added the two following properties to yarn-site.xml, it worked fine:
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/tmp</value>
</property>
<property>
  <name>yarn.log.dir</name>
  <value>/tmp</value>
</property>
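One usage note, assuming a plain Hadoop install: the new properties are only picked up after the YARN daemons are restarted, for example with the bundled sbin scripts:
# restart YARN so yarn-site.xml is re-read (use the .sh variants on Linux)
stop-yarn.cmd
start-yarn.cmd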
Hope it helps someone in the future !

Spark job not running when jar is in HDFS

I am trying to run a Spark job in standalone mode, but the command is not picking up the jar from HDFS. The jar is present in the HDFS location, and it works fine when I run it in local mode.
Below is the command I am using
spark-submit --deploy-mode client --master yarn --class com.main.WordCount /spark/wc.jar
Below is my program:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("WordCount").setMaster("yarn")
val spark = new SparkContext(conf)
val file = spark.textFile(args(0))
val count = file.flatMap(f => f.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect
count.foreach(println)
And I am getting below error:
Warning: Local jar /spark/wc.jar does not exist, skipping.
java.lang.ClassNotFoundException: com.main.WordCount
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:693)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
But if I use deploy mode cluster, I get the error below:
Exception in thread "main" java.io.FileNotFoundException: File file:/spark/wc.jar does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:340)
at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:530)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:529)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:529)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1119)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1178)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Could you please clarify what local mode is? There are only two deploy modes, client and cluster; the only difference is that in client mode the driver program runs on the machine you submit from, while in cluster mode the driver program runs on a random node in the cluster.
For the spark-submit command:
When you execute spark-submit, Spark pulls all the local resources/files defined with the --files and --py-files arguments, as well as the Spark main jar, into a temporary HDFS location/directory created by that particular Spark application under the application name. When you give an HDFS location, it fails to locate the jar on the local machine. It is mandatory to keep the jar local.
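As a minimal sketch of that workaround (the /tmp path is just an illustrative local staging location, and <hdfs-input-path> is whatever args(0) should point to):
# copy the jar out of HDFS onto the local filesystem, then submit from there
hdfs dfs -get /spark/wc.jar /tmp/wc.jar
spark-submit --deploy-mode client --master yarn --class com.main.WordCount /tmp/wc.jar <hdfs-input-path>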

Spark streaming - class not found - HDFS file streaming - java.lang.ClassNotFoundException: com.pepperdata.spark.metrics.PepperdataSparkListener

I have submitted the spark streaming job with the yarn cluster mode.
But I am getting the following error.
SparkSubmit Command:
export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar:/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar
spark-submit --master yarn-cluster --keytab /etc/security/keytabs/srvc_egsc_hdpuser.service.keytab --principal srvc_egsc_hdpuser#EAPKDC.HOUSTON.HP.COM --queue sc_streaming --class com.reni.scmplatform.data.producer.DPMain --executor-memory 5g --driver-memory 8g --conf spark.sql.shuffle.partitions=10 --conf spark.default.parallelism=50 --jars /usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-server.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar,/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar --files /etc/spark/conf/hbase-site.xml,/etc/spark/conf/hive-site.xml hdfs://EAPROD/EA/supplychain/streaming/logistics/entaly/jars/DataProducer-assembly-1.0.15-SNAPSHOT.jar --platform.framework.hdfs.logging.dir=/EA/supplychain/process/logs/logistics/entaly/dataProducer --platform.framework.logging.level=info --platform.framework.logging.publish=true
Error:
18/03/12 05:14:30 ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Exception when registering SparkListener
org.apache.spark.SparkException: Exception when registering SparkListener
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2154)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:578)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2280)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:140)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:877)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:877)
at scala.Option.map(Option.scala:145)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:877)
at com.reni.scmplatform.data.producer.helper.DPStreamEventHandler.start(DPStreamEventHandler.scala:63)
at com.reni.scmplatform.data.producer.DPMain$.main(DPMain.scala:27)
at com.reni.scmplatform.data.producer.DPMain.main(DPMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:561)
Caused by: java.lang.ClassNotFoundException: com.pepperdata.spark.metrics.PepperdataSparkListener
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:175)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2122)
at org.apache.spark.SparkContext$$anonfun$setupAndStartListenerBus$1.apply(SparkContext.scala:2119)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at org.apache.spark.SparkContext.setupAndStartListenerBus(SparkContext.scala:2119)
... 15 more
18/03/12 05:14:30 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
18/03/12 05:14:30 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
18/03/12 05:14:30 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Exception when registering SparkListener)
You should add the JAR containing the missing class to the job classpath by using the --jars option (see this answer: spark submit add multiple jars in classpath).
Moreover, I use the sbt-assembly plugin to take care of these things for you:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
Then build with sbt assembly and all the jars needed for your application will be included in the job jar sent to YARN.
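A rough sketch of both options; the Pepperdata jar path is a placeholder for wherever that listener jar actually lives on your edge node:
# Option 1: ship the jar that contains the missing listener class with the job
spark-submit --master yarn-cluster \
  --jars /path/to/pepperdata-spark-metrics.jar \
  --class com.reni.scmplatform.data.producer.DPMain \
  hdfs://EAPROD/EA/supplychain/streaming/logistics/entaly/jars/DataProducer-assembly-1.0.15-SNAPSHOT.jar
# Option 2: bake the dependency into the fat jar with sbt-assembly, then submit the assembly jar
sbt assembly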

Spark runs in local but can't find file when running in YARN

I've been trying to submit a simple Python script to run in a cluster with YARN. When I execute the job locally there's no problem, everything works fine, but when I run it in the cluster it fails.
I executed the submit with the following command:
spark-submit --master yarn --deploy-mode cluster test.py
The log error I'm receiving is the following one:
17/11/07 13:02:48 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:49 INFO yarn.Client: Application report for application_1510046813642_0010 (state: ACCEPTED)
17/11/07 13:02:50 INFO yarn.Client: Application report for application_1510046813642_0010 (state: FAILED)
17/11/07 13:02:50 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1510046813642_0010 failed 2 times due to AM Container for appattempt_1510046813642_0010_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://myserver:8088/proxy/application_1510046813642_0010/Then, click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
java.io.FileNotFoundException: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.users.josholsan
start time: 1510056155796
final status: FAILED
tracking URL: http://myserver:8088/cluster/app/application_1510046813642_0010
user: josholsan
Exception in thread "main" org.apache.spark.SparkException: Application application_1510046813642_0010 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1025)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1072)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/11/07 13:02:50 INFO util.ShutdownHookManager: Shutdown hook called
17/11/07 13:02:50 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-5cc8bf5e-216b-4d9e-b66d-9dc01a94e851
I paid special attention to this line:
Diagnostics: File does not exist: hdfs://myserver:8020/user/josholsan/.sparkStaging/application_1510046813642_0010/test.py
I don't know why it can't find the test.py; I also tried to put it in HDFS under the directory of the user executing the job: /user/josholsan/
To finish my post, I would also like to share my test.py script:
from pyspark import SparkContext
file="/user/josholsan/concepts_copy.csv"
sc = SparkContext("local","Test app")
textFile = sc.textFile(file).cache()
linesWithOMOP=textFile.filter(lambda line: "OMOP" in line).count()
linesWithICD=textFile.filter(lambda line: "ICD" in line).count()
print("Lines with OMOP: %i, lines with ICD9: %i" % (linesWithOMOP,linesWithICD))
Could the error also be in here?
sc = SparkContext("local","Test app")
Thanks you so much for your help in advance.
Transferred from the comments section:
sc = SparkContext("local","Test app"): having "local" here will override any command line settings; from the docs:
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.
The test.py file must be placed somewhere where it is visible throughout the whole cluster. E.g. spark-submit --master yarn --deploy-mode cluster http://somewhere/accessible/to/master/and/workers/test.py
Any additional files and resources can be specified using the --py-files argument (tested in mesos, not in yarn unfortunately), e.g. --py-files http://somewhere/accessible/to/all/extra_python_code_my_code_uses.zip
Edit: as @desertnaut commented, this argument should be used before the script to be executed.
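For example, a short sketch of the ordering (deps.zip is a placeholder for whatever extra Python code you ship):
# --py-files (and --files) must come before the script being submitted
spark-submit --master yarn --deploy-mode cluster --py-files /path/to/deps.zip test.py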
yarn logs -applicationId <app ID> will give you the output of your submitted job.
Hope this helps, good luck!

running spark on yarn as client

I'm trying to run a spark job with yarn using:
./bin/spark-submit --class "KafkaToMaprfs" --master yarn --deploy-mode client /home/mapr/kafkaToMaprfs/target/scala-2.10/KafkaToMaprfs.jar
But I am facing this error:
/opt/mapr/hadoop/hadoop-2.7.0
17/01/03 11:19:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/03 11:19:38 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at KafkaToMaprfs$.main(KafkaToMaprfs.scala:61)
at KafkaToMaprfs.main(KafkaToMaprfs.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:752)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/03 11:19:39 WARN MetricsSystem: Stopping a MetricsSystem that is not running
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
at KafkaToMaprfs$.main(KafkaToMaprfs.scala:61)
at KafkaToMaprfs.main(KafkaToMaprfs.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:752)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have a multi-node cluster, and I'm deploying the application from a remote node.
I'm using Spark 1.6.1 and Hadoop 2.7.x.
I didn't set up the cluster, so I couldn't find where the mistake lies.
Can anyone please help me fix this?
In my case I'm using the MapR distribution, and I hadn't configured the environment.
So I dug through all the conf folders and made changes in the files below:
1. In spark-env.sh, make sure these values are set right (a concrete sketch is given below):
export SPARK_LOG_DIR=
export SPARK_PID_DIR=
export HADOOP_HOME=
export HADOOP_CONF_DIR=
export JAVA_HOME=
export SPARK_SUBMIT_OPTIONS=
2. In yarn-env.sh, make sure YARN_CONF_DIR and JAVA_HOME are set with correct values.
3. In spark-defaults.conf:
1. set spark.driver.extraClassPath
2. set the value for HADOOP_CONF_DIR
4. Set HADOOP_CONF_DIR and JAVA_HOME in $SPARK_HOME/conf/spark-env.sh:
1. export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop
2. export JAVA_HOME=
5. Spark assembly jar:
1. Copy the following JAR file from the local file system to a world-readable location on MapR-FS, substituting your Spark version and the specific JAR file name:
/opt/mapr/spark/spark-/lib/spark-assembly--hadoop-mapr-.jar
Now I'm able to run my Spark application as YARN-CLIENT smoothly using spark-submit.
These are the basic essentials to make Spark connect with YARN.
Correct me if I missed anything else.
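To make step 1 concrete, here is a spark-env.sh sketch; every value is a placeholder based on my MapR layout (the Hadoop path comes from the logs above) and will differ on your nodes:
# illustrative values only; point these at your actual install
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.7.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export JAVA_HOME=/usr/java/default
export SPARK_LOG_DIR=/opt/mapr/spark/logs
export SPARK_PID_DIR=/opt/mapr/spark/pid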
