Unable to schedule job in oozie. Getting Error while creating HiveContext - apache-spark

I am trying to run a Spark job from Oozie. Below is the code I am trying to run.
SparkConf conf = getConf(appName);
JavaSparkContext sc = new JavaSparkContext(conf);
HiveContext hiveContext = new HiveContext(sc);
I am getting the following error:
JOB[0000000-170808082825775-oozie-oozi-W] ACTION[0000000-170808082825775-oozie-oozi-W#Sample-node] Launcher exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
Here's my workflow xml file
<workflow-app name="DataSampling" xmlns="uri:oozie:workflow:0.4">
<start to='Sample-node'/>
<action name="Sample-node">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>tez.lib.uris</name>
<value>/hdp/apps/2.5.3.0-37/tez/tez.tar.gz</value>
</property>
</configuration>
<master>${master}</master>
<mode>${mode}</mode>
<name>Sample class on Oozie - Sampling</name>
<class>Sampling</class>
<jar>/path/jarfile.jar</jar>
<arg>${numEventsPerPattern}</arg>
<arg>${eventdate}</arg>
<arg>${eventtype}</arg>
<arg>${user}</arg>
</spark>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>
<end name='end'/>
</workflow-app>
I am using Hortonworks Data Platform 2.5. Can anyone please help me find out whether I am missing something in the classpath?
Thanks in advance.

It finally worked: Oozie is now able to create the HiveContext.
The issue was with the classpath. Delete the folder /user/oozie/share/lib in HDFS.
In Ambari, set the following core-site.xml properties to *:
hadoop.proxyuser.oozie.groups
hadoop.proxyuser.oozie.hosts
hadoop.proxyuser.root.groups
hadoop.proxyuser.root.hosts
Create a new shared library using the following command:
/usr/hdp/current/oozie-client/bin/oozie-setup.sh sharelib create -fs /user/oozie/share/lib
Restart the Oozie service.
The two steps above should be done as the oozie user.
Add the following tag to the workflow XML file:
<spark-opts>--num-executors 6 --driver-memory 8g --executor-memory 6g</spark-opts>
Run the Oozie job as the hdfs user.
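For reference, the proxy-user settings listed in the steps above correspond to entries like the following in core-site.xml. This is only a sketch of what Ambari would end up writing; the wildcard value * comes from the answer, and a narrower list of hosts and groups can be used instead.

```xml
<!-- core-site.xml fragment: the four proxy-user properties named in the answer. -->
<configuration>
  <!-- Groups and hosts from which the oozie user may impersonate other users. -->
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <!-- Same settings for the root user. -->
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
</configuration>
```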

Related

Oozie Spark2 Action throws "Attempt to add ({dependencyJar}) multiple times to the distributed cache."

I get the error below while trying to load a dependency jar for an Oozie Spark2 action. The workflow.xml is included below.
Error:
2019-06-12 07:00:35,140 WARN SparkActionExecutor:523 -
SERVER[manager-0] USER[root] GROUP[-] TOKEN[] APP[spark-wf]
JOB[0000068-190611183932696-oozie-root-W]
ACTION[0000068-190611183932696-oozie-root-W#spark-node] Launcher
ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain],
main() threw exception,
Attempt to add (hdfs://${nameNode}/${workflowAppUri}/lib/${dependencyJar})
multiple times to the distributed cache.
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.3" name="spark-wf">
<start to="spark-node"/>
<action name="spark-node">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<master>yarn-cluster</master>
<name>test_spark</name>
<class>${className}</class>
<jar>${workflowAppUri}/lib/${executableJar}</jar>
<spark-opts>--jars ${workflowAppUri}/lib/${dependencyJar}</spark-opts>
<arg>${arg1}</arg>
<arg>${arg2}</arg>
</spark>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
This is not the same as the known issue caused by duplicate jars in the Oozie and Spark2 sharelib directories. I have removed the duplicate jars from the Spark2 sharelib, but that does not help.
What could be the reason for this?
Please help me with this!
If you put jars in the lib directory under the application root directory, Oozie automatically distributes them through its distributed cache. In my case, I was passing a jar that was already in the lib directory, so it was added twice. I just needed to remove the line below from my workflow definition.
<spark-opts>--jars ${workflowAppUri}/lib/${dependencyJar}</spark-opts>
I have also tested that, if you want to attach jars that are not in your lib directory, you can specify them like this in your workflow definition:
<spark-opts>--jars ${nameNode}/tmp/{someJar}</spark-opts>
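A sketch of how the spark action from the question might look after this fix is shown below. It assumes the dependency jar stays in the workflow's lib directory; the extra jar path /tmp/extra-dependency.jar is a hypothetical placeholder for a jar stored elsewhere on HDFS.

```xml
<!-- The dependency jar sits in ${workflowAppUri}/lib, so Oozie ships it
     on its own and it is not repeated under the jars option. -->
<spark xmlns="uri:oozie:spark-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <master>yarn-cluster</master>
    <name>test_spark</name>
    <class>${className}</class>
    <jar>${workflowAppUri}/lib/${executableJar}</jar>
    <!-- Only jars that live outside the lib directory are listed here.
         The path below is hypothetical. -->
    <spark-opts>--jars ${nameNode}/tmp/extra-dependency.jar</spark-opts>
    <arg>${arg1}</arg>
    <arg>${arg2}</arg>
</spark>
```

If no external jars are needed, the spark-opts element can be dropped entirely.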

Oozie Spark Action failing

I have a simple Spark application that reads CSV data and writes it out as Avro. The application works fine when submitted from the command line with spark-submit, but fails with the error below when run from an Oozie Spark action.
Error message:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
at org.apache.spark.sql.execution.SparkPlan.org$apache$spark$sql$execution$SparkPlan$$decodeUnsafeRows(SparkPlan.scala:274)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$1.apply(SparkPlan.scala:366)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$1.apply(SparkPlan.scala:366)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
Oozie details :
job.properties
nameNode=NAMEMODE:8020
jobTracker=JT:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/oozie/spark/
workflow.xml
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
<start to="sparkAction" />
<action name="sparkAction">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx777m</value>
</property>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx1111m</value>
</property>
</configuration>
<master>yarn</master>
<mode>client</mode>
<name>tssETL</name>
<class>com.sc.eni.main.tssStart</class>
<jar>${nameNode}/user/oozie/spark/tss-assembly-1.0.jar</jar>
<spark-opts>--driver-memory 512m --executor-memory 512m --num-executors 1 </spark-opts>
</spark>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}] </message>
</kill>
<end name="end" />
</workflow-app>
In the job tracker, the MapReduce launcher job shows as Succeeded, since it only invokes the Spark action and the failure happens there, but the overall Oozie workflow fails.
Versions used:
EMR Cluster: emr-5.13.0
Spark : 2.3
Scala 2.11
I also checked the Oozie sharelib in HDFS, /user/oozie/share/lib/lib_20180517102659/spark, and it contains lz4-1.3.0.jar, which has the class net.jpountz.lz4.LZ4BlockInputStream mentioned in the error.
Any help would be really appreciated, as I have been struggling with this for quite a long time.
Many thanks
Oozie throws java.lang.NoSuchMethodError when the same library reaches the classpath through more than one route in different versions, which creates a conflict. Since you have specified
oozie.use.system.libpath=true
all of the Oozie Spark shared libraries are available to the job, and so are all the jars listed in build.sbt and bundled into your assembly jar.
To resolve this, check which of the dependencies in your build.sbt are also present in the Oozie Spark sharelib folder, and mark those dependencies as % "provided". That removes them from the assembly jar, so the jars no longer conflict.
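The advice above is phrased for sbt. If the project were built with Maven instead (an assumption; the question only mentions build.sbt), the same idea would be expressed with the provided scope. The artifact and version below are illustrative, chosen to match the Spark 2.3 and Scala 2.11 versions given in the question.

```xml
<!-- pom.xml fragment for a hypothetical Maven build of the same application. -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
    <!-- provided: used at compile time, expected on the cluster at run time,
         and therefore left out of the assembly jar. -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```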

Spark job in oozie coordinator error - emr: Can not create a Path from an empty string

I have a problem configuring a coordinator with Oozie on a YARN cluster. It is a Spark job: when I run the workflow from the console, the job is launched and executed correctly by YARN, but when I call the same workflow from a coordinator.xml I get this error:
ERROR org.apache.spark.SparkContext - Error initializing SparkContext.
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.<init>(Path.java:135)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:337)
The job is never launched in the YARN cluster; it looks like YARN does not receive the correct .jar path from Oozie. Any idea?
Here are the coordinator.xml and the workflow.xml, simplified.
<coordinator-app name="Firebase acquisition process coordinator" frequency="${coord:days(1)}"
start="${startTime}" end="${endTime}" timezone="UTC" xmlns="uri:oozie:coordinator:0.5">
<controls>
...
</controls>
<action>
<workflow>
<app-path>hdfs://ip-111-11-11-111.us-west-2.compute.internal:8020/user/hadoop/emr-spark/</app-path>
</workflow>
</action>
</coordinator-app>
<workflow-app name="bbbbbbbbbbbbbbb" xmlns="uri:oozie:workflow:0.5">
<start to="spark-0324"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-0324">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>client</mode>
<class>classsxxx.Process</class>
<jar>hdfs://ip-111-11-11-111.us-west-2.compute.internal:8020/user/hadoop/emr-spark/lib/jarnamex.jar</jar>
<file>lib#lib</file>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
To be clear, when I run oozie job -config ~/emr-spark/job.properties -run
it works, but when I run oozie job -run -config ~/emr-coordinator/coordinator.properties it does not.
job properties
oozie.use.system.libpath=true
send_email=False
dryrun=False
nameNode=hdfs://ip-111-11-11-111.us-west-2.compute.internal:8020
jobTracker=ip-111-11-11-111.us-west-2.compute.internal:8032
oozie.wf.application.path=/user/hadoop/emr-spark
coordinator properties
startTime=2017-09-08T19:46Z
endTime=2030-01-01T06:00Z
jobTracker=ip-111-11-11-111.us-west-2.compute.internal:8032
nameNode=hdfs://ip-111-11-11-111.us-west-2.compute.internal:8020
oozie.coord.application.path=hdfs://ip-111-11-11-111.us-west-2.compute.internal:8020/user/hadoop/emr-coordinator
oozie.use.system.libpath=true
Resources on the HDFS file system have to be referred to by a relative path only, that is, without the hdfs://host:port prefix.
The full (absolute) path is computed on demand.
So the solution was simply to replace:
hdfs://ip-111-11-11-111.us-west-2.compute.internal:8020/user/hadoop/emr-spark/workflow.xml with /user/hadoop/emr-spark/workflow.xml
and hdfs://ip-111-11-11-111.us-west-2.compute.internal:8020/user/hadoop/emr-spark/lib/xxxx.jar with /user/hadoop/emr-spark/lib/xxxx.jar
in the workflow.xml, coordinator.xml, or properties files.
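As a sketch of the change described above, the coordinator from the question would end up looking roughly like this. The controls element, which was elided in the question, is left out here; everything else is copied from the question with only the app-path shortened.

```xml
<!-- coordinator.xml after the fix: app-path no longer carries the hdfs://host:port prefix. -->
<coordinator-app name="Firebase acquisition process coordinator" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.5">
    <action>
        <workflow>
            <app-path>/user/hadoop/emr-spark/workflow.xml</app-path>
        </workflow>
    </action>
</coordinator-app>
```

The jar element in workflow.xml is shortened in the same way, to /user/hadoop/emr-spark/lib/ followed by the jar name.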

oozie spark action - how to specify spark-opts

I am running a Spark job in yarn-client mode via an Oozie Spark action. I need to specify driver and application master settings. I tried configuring spark-opts as documented by Oozie, but it is not working.
Here is the relevant part of the Oozie docs:
Example:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
...
<action name="myfirstsparkjob">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<prepare>
<delete path="${jobOutput}"/>
</prepare>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<master>local[*]</master>
<mode>client</mode>
<name>Spark Example</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
<spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
<arg>inputpath=hdfs://localhost/input/file.txt</arg>
<arg>value=2</arg>
</spark>
<ok to="myotherjob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
In the example above, spark-opts is specified as --executor-memory 20G --num-executors 50,
while the description on the same page says:
"The spark-opts element if present, contains a list of spark options that can be passed to spark driver. Spark configuration options can be passed by specifying '--conf key=value' here"
So according to the documentation it should be --conf executor-memory=20G.
Which one is right? I tried both, but neither seems to work. I am running in yarn-client mode, so I mainly want to set driver-related settings. I think this is the only place where I can set driver settings.
<spark-opts>--driver-memory 10g --driver-java-options "-XX:+UseCompressedOops -verbose:gc" --conf spark.driver.memory=10g --conf spark.yarn.am.memory=2g --conf spark.driver.maxResultSize=10g</spark-opts>
<spark-opts>--driver-memory 10g</spark-opts>
None of the driver-related settings above are applied to the actual driver JVM. I verified this from the Linux process info.
reference: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html
I found the issue. In yarn-client mode you cannot specify driver-related parameters using <spark-opts>--driver-memory 10g</spark-opts>, because your driver (the Oozie launcher job) has already been launched by that point. The Oozie launcher (which is a MapReduce job) launches your actual Spark job, or any other job, and spark-opts is relevant only to that job. To set driver parameters in yarn-client mode, you need to configure the launcher in the Oozie workflow configuration:
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx6000m</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.cpu.vcores</name>
<value>24</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>default</value>
</property>
</configuration>
I haven't tried yarn-cluster mode, but spark-opts may work for driver settings there. My question, however, was about yarn-client mode.
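Putting the two mechanisms side by side, a yarn-client action might look like the sketch below: the launcher properties size the JVM that hosts the driver, while spark-opts carries the executor settings. The action name, class, jar path, transition targets, and executor values are placeholders for illustration, not values from a tested setup.

```xml
<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Launcher container size: in yarn-client mode the driver runs here. -->
            <property>
                <name>oozie.launcher.mapreduce.map.memory.mb</name>
                <value>8192</value>
            </property>
            <!-- Launcher JVM heap, kept below the container size above. -->
            <property>
                <name>oozie.launcher.mapreduce.map.java.opts</name>
                <value>-Xmx6000m</value>
            </property>
        </configuration>
        <master>yarn</master>
        <mode>client</mode>
        <name>Spark Example</name>
        <!-- Hypothetical main class and jar location. -->
        <class>com.example.Main</class>
        <jar>${nameNode}/path/to/app.jar</jar>
        <!-- Executor settings still go through spark-opts. -->
        <spark-opts>--executor-memory 4g --num-executors 4</spark-opts>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
</action>
```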
<spark-opts>--executor-memory 20G</spark-opts> should ideally work.
Also, try using:
<master>yarn-cluster</master>
<mode>cluster</mode>
"Spark configuration options can be passed by specifying '--conf key=value' here" probably refers to the configuration tag.
For example:
--conf mapred.compress.map.output=true would translate to:
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
Try changing <master>local[*]</master> to <master>yarn</master>.

Oozie spark action error: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]

I am currently setting up an Oozie workflow that uses a Spark action. The Spark code that I use works correctly, tested on both local and YARN. However, when running it as an Oozie workflow I am getting the following error:
Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]
Having read up on this error, I saw that the most common cause was a problem with Oozie sharelibs. I have added all Spark jar files to the Oozie /user/oozie/share/lib/spark on hdfs, restarted Oozie and run sudo -u oozie oozie admin -oozie http://192.168.26.130:11000/oozie -sharelibupdate
to ensure the sharelibs are properly updated. Unfortunately, none of this has stopped the error from occurring.
My workflow is as follows:
<workflow-app xmlns='uri:oozie:workflow:0.4' name='SparkBulkLoad'>
<start to = 'bulk-load-node'/>
<action name = 'bulk-load-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>client</mode>
<name>BulkLoader</name>
<jar>${nameNode}/user/spark-test/BulkLoader.py</jar>
<spark-opts>
--num-executors 3 --executor-cores 1 --executor-memory 512m --driver-memory 512m\
</spark-opts>
</spark>
<ok to = 'end'/>
<error to = 'fail'/>
</action>
<kill name = 'fail'>
<message>
Error occurred while bulk loading files
</message>
</kill>
<end name = 'end'/>
</workflow-app>
and job.properties is as follows:
nameNode=hdfs://192.168.26.130:8020
jobTracker=http://192.168.26.130:8050
queueName=spark
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/spark-test/workflow.xml
workflowAppUri=${nameNode}/user/spark-test/BulkLoader.py
Any advice would be greatly appreciated.
I have also specified the libpath explicitly:
oozie.libpath=<path>/oozie/share/lib/lib_<timestamp>
Its value is the path printed by the command you ran:
sudo -u oozie oozie admin -oozie http://192.168.26.130:11000/oozie -sharelibupdate
Example:
[ShareLib update status]
sharelibDirOld = hdfs://nameservice1/user/oozie/share/lib/lib_20190328034943
host = http://vghd08hr.dc-ratingen.de:11000/oozie
sharelibDirNew = hdfs://nameservice1/user/oozie/share/lib/lib_20190328034943
status = Successful
Optional:
You can also specify the YARN configuration pointing into the Cloudera folder:
oozie.launcher.yarn.app.mapreduce.am.env=/opt/SP/apps/cloudera/parcels/SPARK2-2.2.0.cloudera4-1.cdh5.13.3.p0.603055/lib/spark2
However, this might not solve the issue. The other hint I have: if you are using Spark 1.x, this file is needed in your Oozie sharelib folder:
/user/oozie/share/lib/lib_20190328034943/spark2/oozie-sharelib-spark.jar
If you copy it into your spark2 folder, it solves the "missing SparkMain" issue but then asks for other dependencies (that might be a problem in my environment). I think it is worth a try: copy the library over, run your job, and check the logs.

Resources