oozie spark action - how to specify spark-opts - apache-spark

I am running a Spark job in yarn-client mode via an Oozie Spark action. I need to specify driver and application master related settings. I tried configuring spark-opts as documented by Oozie, but it's not working.
Here's what the Oozie doc shows:
Example:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
...
<action name="myfirstsparkjob">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>foo:8021</job-tracker>
<name-node>bar:8020</name-node>
<prepare>
<delete path="${jobOutput}"/>
</prepare>
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<master>local[*]</master>
<mode>client</mode>
<name>Spark Example</name>
<class>org.apache.spark.examples.mllib.JavaALS</class>
<jar>/lib/spark-examples_2.10-1.1.0.jar</jar>
<spark-opts>--executor-memory 20G --num-executors 50</spark-opts>
<arg>inputpath=hdfs://localhost/input/file.txt</arg>
<arg>value=2</arg>
</spark>
<ok to="myotherjob"/>
<error to="errorcleanup"/>
</action>
...
</workflow-app>
In the above, spark-opts is specified as --executor-memory 20G --num-executors 50,
while the description on the same page says:
"The spark-opts element if present, contains a list of spark options that can be passed to spark driver. Spark configuration options can be passed by specifying '--conf key=value' here"
so according to the document it should be --conf executor-memory=20G.
Which one is right here, then? I tried both, but neither seems to work. I am running in yarn-client mode, so I mainly want to set up driver-related settings, and I think this is the only place I can set them:
<spark-opts>--driver-memory 10g --driver-java-options "-XX:+UseCompressedOops -verbose:gc" --conf spark.driver.memory=10g --conf spark.yarn.am.memory=2g --conf spark.driver.maxResultSize=10g</spark-opts>
<spark-opts>--driver-memory 10g</spark-opts>
None of the above driver-related settings get applied to the actual driver JVM; I verified this from the Linux process info.
reference: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html

I found what the issue is. In yarn-client mode you can't specify driver-related parameters via <spark-opts>--driver-memory 10g</spark-opts>, because your driver (the Oozie launcher job) has already been launched by that point. The Oozie launcher is a MapReduce job that launches your actual Spark (or other) job, and spark-opts only applies to that child job. To set driver parameters in yarn-client mode you instead need to configure the launcher in the Oozie workflow:
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx6000m</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.cpu.vcores</name>
<value>24</value>
</property>
<property>
<name>mapreduce.job.queuename</name>
<value>default</value>
</property>
</configuration>
I haven't tried yarn-cluster mode, but spark-opts may well work for driver settings there, since in that mode the driver runs inside the Spark job's application master rather than inside the Oozie launcher. My question, though, was about yarn-client mode.
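For reference, if you do run in yarn-cluster mode, this is the kind of action snippet where I would expect spark-opts to control the driver (a sketch, not something I have verified):
<master>yarn-cluster</master>
<mode>cluster</mode>
<spark-opts>--driver-memory 10g --conf spark.driver.maxResultSize=10g</spark-opts>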

<spark-opts>--executor-memory 20G</spark-opts> should work ideally.
Also, try using:
<master>yarn-cluster</master>
<mode>cluster</mode>
"Spark configuration options can be passed by specifying '--conf key=value' here " is probably referring the configuration tag.
For Ex:
--conf mapred.compress.map.output=true would translate to:
<configuration>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
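Note that the --conf form also works directly inside spark-opts, but only with fully qualified Spark property names (spark.executor.memory rather than executor-memory). A rough equivalent of the documented example, which I have not verified against this exact workflow, would be:
<spark-opts>--conf spark.executor.memory=20G --conf spark.executor.instances=50</spark-opts>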

Try changing <master>local[*]</master> to <master>yarn</master>.

Related

How to connect to remote Hive server without hive downloaded?

I am trying to access a Hive cluster without having Hive installed on my machine. I read on here that I just need a JDBC client to do so. I have a URL, username and password for the Hive cluster. I have tried making a hive-site.xml with these, as well as doing it programmatically, although this method does not seem to have a place to input the username and password. No matter what I do, it seems that the following error is keeping me from accessing Hive:
Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Judging from the answers to this error online, I feel like this is because I do not have Hive installed on my computer. What exactly do I need to do to access it without installing Hive, or do I actually have to install it? Here is my code for reference:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("interfacing spark sql to hive metastore without configuration file") \
    .config("hive.metastore.uris", "https://prod-fmhdinsight-eu.azurehdinsight.net") \
    .enableHiveSupport() \
    .getOrCreate()

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = spark.createDataFrame(data)
# see the frame created
df.show()
# write the frame
df.write.mode("overwrite").saveAsTable("t4")
and the hive-site.xml:
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>https://prod-fmhdinsight-eu.azurehdinsight.net</value>
</property>
<!--
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
-->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>https://prod-fmhdinsight-eu.azurehdinsight.net</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>username</value>
<description>user name for connecting to mysql server
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
<description>password for connecting to mysql server
</description>
</property>
</configuration>
tl;dr Use spark.sql.hive.metastore.jars configuration property with maven to let Spark SQL download the required jars.
The other options are builtin (that simply assumes Hive 1.2.1) and a classpath of the Hive JARs (e.g. spark.sql.hive.metastore.jars="/Users/jacek/dev/apps/hive/lib/*").
If your Hive metastore is available remotely via thrift protocol you may want to create $SPARK_HOME/conf/hive-site.xml as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
</configuration>
A nice feature of Hive is that you can define configuration properties as Java system properties, so the above would look as follows:
$SPARK_HOME/bin/spark-shell \
--driver-java-options="-Dhive.metastore.uris=thrift://localhost:9083"
You may want to add the following to conf/log4j.properties for more low-level logging:
log4j.logger.org.apache.spark.sql.hive.HiveUtils$=ALL
log4j.logger.org.apache.spark.sql.internal.SharedState=ALL

No FileSystem for scheme: wasbs Apache Zeppelin 0.8.0 with Spark 2.3.2

I am trying to run a note in Apache Zeppelin 0.8.0 with Spark 2.3.2 and Azure Blob Storage, but I'm getting a No FileSystem for scheme: wasbs error, even though I configured everything as recommended in related issues.
Here are some conf files:
spark-defaults.conf
spark.driver.extraClassPath /opt/jars/*
spark.driver.extraLibraryPath /opt/jars
spark.jars /opt/jars/hadoop-azure-2.7.3.jar,/opt/jars/azure-storage-2.2.0.jar
spark.driver.memory 28669m
core-site.xml
<configuration>
<property>
<name>fs.AbstractFileSystem.wasb.Impl</name>
<value>org.apache.hadoop.fs.azure.Wasb</value>
</property>
<property>
<name>fs.azure.account.key.{storage_account_name}.blob.core.windows.net</name>
<value>{account_key_value}</value>
</property>
</configuration>
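For reference, the core-site.xml entries that hadoop-azure 2.7.x typically expects for the wasbs scheme look roughly like the following (note that the property above registers wasb, with a capitalized Impl, rather than wasbs; the class names here are assumptions to verify against the hadoop-azure jar in /opt/jars):
<property>
<name>fs.wasbs.impl</name>
<value>org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure</value>
</property>
<property>
<name>fs.AbstractFileSystem.wasbs.impl</name>
<value>org.apache.hadoop.fs.azure.Wasbs</value>
</property>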

Enabling dynamic allocation on spark on YARN mode

This question is similar to this but there was no answer.
I am trying to enable dynamic allocation in Spark in YARN mode. I have an 11-node cluster with 1 master node and 10 worker nodes. I am following the links below for instructions:
For setup in YARN:
http://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
Config variables need to be set in spark-defaults.conf: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
I have also taken reference from the link below and a few other resources:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-dynamic-allocation.html#spark.dynamicAllocation.testing
Here are the steps I am following:
Setting up config variables in spark-defaults.conf.
My spark-defaults.conf settings related to dynamic allocation and the shuffle service are:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
Making changes in yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value> $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/*,$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/client/*,$HADOOP_MAPRED_HOME/share/hadoop/client/lib/*,/home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar </value>
</property>
All these steps are replicated on all worker nodes, i.e. spark-defaults.conf has the above-mentioned values and yarn-site.xml has these properties. I have made sure that /home/hadoop/spark/common/network-yarn/target/scala-2.11/spark-2.2.2-SNAPSHOT-yarn-shuffle.jar exists on all worker nodes.
Then I run $SPARK_HOME/sbin/start-shuffle-service.sh on the worker nodes and the master node. On the master node, I restart YARN using stop-yarn.sh and then start-yarn.sh.
Then I run yarn node -list -all to see the worker nodes, but I am not able to see any nodes.
When I disable the property
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
I can see all the worker nodes as normal, so it seems like the shuffle service is not properly configured.
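For comparison, the Spark running-on-yarn documentation linked above spells the class property with a hyphen (yarn.nodemanager.aux-services.spark_shuffle.class), whereas the yarn-site.xml above uses auxservices. A sketch of the documented entries, keeping any existing services such as mapreduce_shuffle in the list:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>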

Oozie Spark Action failing

I have a simple Spark application which reads CSV data and then writes it out as Avro. The application works fine when submitted via the spark-submit command line, but fails with the error below when executed from an Oozie Spark action.
Error message:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
at org.apache.spark.sql.execution.SparkPlan.org$apache$spark$sql$execution$SparkPlan$$decodeUnsafeRows(SparkPlan.scala:274)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$1.apply(SparkPlan.scala:366)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeTake$1.apply(SparkPlan.scala:366)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
Oozie details:
job.properties
nameNode=NAMEMODE:8020
jobTracker=JT:8032
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/oozie/spark/
workflow.xml
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
<start to="sparkAction" />
<action name="sparkAction">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>oozie.launcher.mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx777m</value>
</property>
<property>
<name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property>
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-Xmx1111m</value>
</property>
</configuration>
<master>yarn</master>
<mode>client</mode>
<name>tssETL</name>
<class>com.sc.eni.main.tssStart</class>
<jar>${nameNode}/user/oozie/spark/tss-assembly-1.0.jar</jar>
<spark-opts>--driver-memory 512m --executor-memory 512m --num-executors 1 </spark-opts>
</spark>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}] </message>
</kill>
<end name="end" />
</workflow-app>
In the job tracker, the MapReduce job shows as Succeeded, since it only launches the Spark action, which is what fails; overall, the Oozie workflow fails.
Versions used:
EMR Cluster: emr-5.13.0
Spark: 2.3
Scala: 2.11
I also checked the Oozie sharelib in HDFS (/user/oozie/share/lib/lib_20180517102659/spark) and it contains lz4-1.3.0.jar, which has the class net.jpountz.lz4.LZ4BlockInputStream mentioned in the error.
Any help would be really appreciated, as I have been struggling with this for quite a long time.
Many Thanks
Oozie gives a java.lang.NoSuchMethodError when a library is available through more than one path, creating a conflict. Since you have specified
oozie.use.system.libpath=true
all of the Oozie Spark sharelib jars are available to your job, and all the jars mentioned in your build.sbt are available as well.
To resolve this, check which of the dependencies in your build.sbt are also present in the Oozie Spark sharelib folder, and add "% provided" to those dependencies. That removes them from the assembly jar, so there is no longer a conflict of jars.
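If trimming the assembly jar is not an option, another approach sometimes used for such conflicts (an assumption on my part, not verified with this workflow) is to ask the launcher to prefer the user classpath by adding this to the action's configuration block:
<property>
<name>oozie.launcher.mapreduce.job.user.classpath.first</name>
<value>true</value>
</property>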

SparkR-submit core not allocated

Hi, I'm working with SparkR in YARN mode.
When I submit an application in this way:
./spark-submit --master yarn-client --packages com.databricks:spark-csv_2.10:1.0.3 \
  --driver-memory 6g --num-executors 8 --executor-memory 6g \
  --total-executor-cores 32 --executor-cores 8 /home/sentiment/Scrivania/test3.R
One node starts as the AM (I think it is chosen randomly) and takes 1 GB of memory and 1 vcore.
After that, ALL nodes have 7 GB of memory and 1 vcore each (except for the node that runs the AM, which has 8 GB and 2 cores).
Why do the nodes not acquire 4 cores, as the configuration and spark-submit command say?
spark-default
spark.master spark://server1:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.memory 6g
spark.executor.cores 4
spark.akka.frameSize 1000
spark.yarn.am.cores 4
spark.kryoserializer.buffer.max 700m
spark.kryoserializer.buffer 100m
Yarn-manager
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>server1:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>server1:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>server1:8050</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>server1:8088</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>4</value>
</property>
</configuration>
Update1:
I read in an old post that I needed to change the value of the property below from the default calculator to DominantResourceCalculator in capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
Added to spark-env.sh:
SPARK_EXECUTOR_CORES=4
Nothing changed.
Update 2:
I read this on the official Spark page; so is 1 core per executor the maximum in YARN mode?
spark.executor.cores: The number of cores to use on each executor. For YARN and standalone mode only. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
