Running Spark job in Oozie using Yarn-cluster - apache-spark

I had created a Spark Job using Oozie which configure run on yarn-cluster.
The Spark program is written in Scala and is a very simple program which just initialize SparkContext, println("hello world") and stop the SparkContext.
Below is the workflow.xml:
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
<start to="spark-0177"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-0177">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>com.test1</class>
<jar>/user/hue/oozie/workspaces/tl_test/lib/testOozie1.jar</jar>
<spark-opts>--executor-cores 2 --driver-memory 5g --num-executors 2 --executor-memory 5g</spark-opts>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
However, i was getting the following error:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.<init>(Path.java:135)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:191)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$3.apply(Client.scala:254)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$3.apply(Client.scala:248)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:248)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:384)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:102)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:623)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:651)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:105)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:96)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:46)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:40)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:228)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:370)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:295)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:181)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://MYRNDSVRVM350:8020/user/oozie-oozi/0000084-150828094553499-oozie-oozi-W/spark-156b--spark/action-data.seq
Oozie Launcher ends
Please help as i'm totally stuck already.
Thanks.

I meet the exactly same wired problem. It turns out that multiple spaces in <spark-opts> cause this un-informtive error. You have two spaces between --executor-cores 2 --driver-memory 5g.

Can not create a Path from an empty string <- Spark is trying to stage your code, but your Oozie workflow doesn't provide the jar path, hence the empty path exception.
Providing the jar path in Oozie will fix this

Related

Oozie spark action involving spark sql fails if enableHiveSupport() is used

Using AWS EMR here.
Release label:emr-6.7.0
Hadoop distribution:Amazon 3.2.1
Applications:Spark 3.2.1, JupyterHub 1.4.1, Hue 4.10.0, Livy 0.7.1, Hive 3.1.3, Oozie 5.2.1
I'm trying to run a simple oozie workflow (via Hue) consisting of a spark action (python) that tries to create a database using spark.sql("create database temp_db_analytics").
My spark session is initialized as spark = SparkSession.builder.appName('test-app').enableHiveSupport().getOrCreate()
workflow.xml
<workflow-app name="test-workflow" xmlns="uri:oozie:workflow:0.5">
<start to="spark-3853"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-3853">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>client</mode>
<name></name>
<jar>test-job.py</jar>
<file>s3://mybucket/test-job.py#test-job.py</file>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
job.properties
oozie.use.system.libpath=True
master=yarn-cluster
mode=cluster
send_email=False
dryrun=False
nameNode=hdfs://<hostname>:8020
jobTracker=<hostname>:8032
security_enabled=False
oozie.wf.application.path=${nameNode}/user/hadoop/SampleWorkflow
test-job.py
from pyspark.sql import SparkSession
if __name__=='__main__':
spark = SparkSession.builder.appName('test-app') \
.enableHiveSupport() \
.getOrCreate()
spark.sql("create database temp_db_analytics").show()
The corresponding error message when I run this using Oozie command line or Hue UI (only relevant portion shown here):
stdout
Traceback (most recent call last):
File "/mnt/yarn/filecache/299/test-job.py", line 16, in <module>
spark.sql("create database temp_db_analytics").show()
File "/mnt/yarn/usercache/hadoop/appcache/application_1669716233310_0015/container_1669716233310_0015_01_000001/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
File "/mnt/yarn/usercache/hadoop/appcache/application_1669716233310_0015/container_1669716233310_0015_01_000001/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/mnt/yarn/usercache/hadoop/appcache/application_1669716233310_0015/container_1669716233310_0015_01_000001/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/mnt/yarn/usercache/hadoop/appcache/application_1669716233310_0015/container_1669716233310_0015_01_000001/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o46.sql.
: java.lang.NoClassDefFoundError: org/apache/calcite/plan/RelOptRule
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2729)
at java.lang.Class.privateGetMethodRecursive(Class.java:3076)
at java.lang.Class.getMethod0(Class.java:3046)
at java.lang.Class.getMethod(Class.java:1812)
at org.apache.spark.sql.hive.client.HiveClientImpl.getHive(HiveClientImpl.scala:205)
at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:269)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:236)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:235)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:285)
at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:396)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:249)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:105)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:151)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:141)
at org.apache.spark.sql.internal.SharedState.isDatabaseExistent$1(SharedState.scala:185)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:217)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:169)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:53)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:119)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:119)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:244)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:83)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:221)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.calcite.plan.RelOptRule
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:267)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:256)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 74 more
Things I have tried
Running the same job without enableHiveSupport() while creating spark session. This works fine however, I'm given to understand this uses an in-memory hive metastore rather than a persistent metastore which is not the desired outcome.
Running the same job using spark-submit --master yarn --deploy-mode cluster test-job.py works fine.
Explicitly passing the hive-site.xml to the workflow (only tried this on hue) using the --files option. Same error occurs.
Providing the calcite-core JAR to the spark job (via --packages). This gives a different error java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/CreationMetadata at the same line as above. It feels like there is a chain of unfulfilled dependencies. Don't get why this is happening only via Oozie
The dependency on calcite-core seemed to be related to Hive Cost Based Optimizer (check this out). Even tried adding the property hive.cbo.enable=false to hive-site.xml. Still the same issue.
What am I missing here? Any suggestions welcome. More than happy to furnish any more info.

Stop and Restart SparkContext executing in deploy mode "cluster"

In order to fit the efficiency requirements, I am forced to stop SparkContext and restart it with a new configuration more optimal in terms of number of executors, memory per executor, executor memory overhead...
I can achieve this launching my spark-submit in client mode :
spark-submit --num-executors 5 \
--deploy-mode client \
--class className spark.jar
And then within my code executing:
spark.stop()
val spark2 : SparkSession = SparkSession.builder
.config("spark.submit.deployMode", "client")
.config("spark.executor.instances", "8")
.getOrCreate()
And everything works OK.
However, when launching in client mode, stopping the SparkContext and restarting sparkContext in cluster mode, I get the following error:
20/05/28 18:05:24 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.
at org.apache.spark.SparkContext.<init>(SparkContext.scala:379)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/05/28 18:05:24 ERROR util.Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$postApplicationEnd(SparkContext.scala:2416)
at org.apache.spark.SparkContext$$anonfun$stop$1.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1385)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I have also tried launching spark-submit in cluster mode, stopping SparkCOntext and restarting it again in cluster mode. In this case I get the error:
Exception in thread "main" org.apache.spark.SparkException: Application application_1583287354042_80626 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1171)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1608)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I am not sure if it might be related to the fact that the driver is running on the cluster...
I'd be very grateful if someone could provide a solution to achieve these requirements.

Convert Excel file to csv in Spark 1.X

Is there a tool to convert Excel files into csv using Spark 1.X ?
got this issue when executing this tuto
https://github.com/ZuInnoTe/hadoopoffice/wiki/Read-Excel-document-using-Spark-1.x
Exception in thread "main" java.lang.NoClassDefFoundError: org/zuinnote/hadoop/office/format/mapreduce/ExcelFileInputFormat
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn$.convertToCSV(SparkScalaExcelIn.scala:63)
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn$.main(SparkScalaExcelIn.scala:56)
at org.zuinnote.spark.office.example.excel.SparkScalaExcelIn.main(SparkScalaExcelIn.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
Spark is unable to find org.zuinnote.hadoop.office.format.mapreduce.ExcelFileInputFormat File format class in classpath.
Supply below dependency to spark-submit using --jars parameter-
<!-- https://mvnrepository.com/artifact/com.github.zuinnote/hadoopoffice-fileformat -->
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>hadoopoffice-fileformat</artifactId>
<version>1.0.4</version>
</dependency>
Command:
spark-submit --jars hadoopoffice-fileformat-1.0.4.jar \
#rest of the command arguments
You have to build a fat jar that contains all the necessary dependencies. The example project on the HadoopOffice page shows how you build one. One you build the fat/uber jar you simply use it in Spark summit.

Unable to run Spark in yarn-cluster mode

I'm trying to run spark job with YARN in cluster deploy mode.
I tried to run the simpliest spark-submit command only with jar path, class parameter and master yarn-cluster. However I still have the same error, which actually tells me nothing.
Exception in thread "main" org.apache.spark.SparkException: Application application_1506196351647_0032 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1078)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1125)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
If anyone had the similar problem please let me know, I'm using spark 1.6, hadoop 2.6.

I am getting the below error while trying to execute spark submit using oozie on emr

I am running on cluster mode. The apacheds-kerberos-codec-2.0.0-M15.jar is present in multiple places in oozie/share/lib/lib*/spark and oozie/share/lib/lib*/oozie. Is this an environmental issue ?
ava.lang.IllegalArgumentException: Attempt to add (hdfs://ip-172-20-10-53.ec2.internal:8020/user/oozie/share/lib/lib_20170208121307/oozie/apacheds-kerberos-codec-2.0.0-M15.jar) multiple times to the distributed cache.
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:608)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11$$anonfun$apply$8.apply(Client.scala:599)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:599)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$11.apply(Client.scala:598)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:598)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:868)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1154)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:338)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:257)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:60)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:78)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:232)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:455)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:380)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:301)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:187)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:230)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It appears that the oozie sharelib and the spark sharelib directory share the same jars, and running a spark workflow imports both directories, which hadoop-3 code base doesn't like.
I've had to reorganize the oozie sharelib directory to only have oozie specific jars such that there are no duplicates between both oozie and spark sharelib dirs:
export HADOOP_USER_NAME=oozie
hdfs dfs -mv /user/oozie/share/lib/lib_20170222143042/oozie /user/oozie/share/lib/lib_20170222143042/oozie.old
hdfs dfs -mkdir /user/oozie/share/lib/lib_20170222143042/oozie
hdfs dfs -cp /user/oozie/share/lib/lib_20170222143042/oozie.old/oozie-hadoop-utils-hadoop-2-4.3.0.jar /user/oozie/share/lib/lib_20170222143042/oozie
hdfs dfs -cp /user/oozie/share/lib/lib_20170222143042/oozie.old/oozie-sharelib-oozie-4.3.0.jar /user/oozie/share/lib/lib_20170222143042/oozie
This fixes the immediate issue of being able to run spark workflows from oozie, but I'm not sure if this affects non-spark workflows.
I have an oozie job that starts a spark job, running in Amazon EMR. I got the same error when the EMR Hadoop setup has one instance for the master and one instance for the slaves. When I increased the amount of instances for the slaves to two, everything worked as expected.

Resources