Unable to access Hive table when using yarn-cluster mode - apache-spark

I have a Kerberos-enabled Cloudera cluster. I executed the kinit command and then ran spark2-submit. Spark can access the Hive table when I use the client deployment mode:
spark2-submit --master yarn --deploy-mode client --keytab XXXXXXXXXX.keytab --principal XXXXXXXXXXX#USER.COM --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxPermSize=1024M -Djava.security.krb5.conf=/etc/krb5.conf" test.jar
But when I use cluster mode, Spark gives a "table not found" error:
spark2-submit --master yarn --deploy-mode cluster --keytab XXXXXXXXXX.keytab --principal XXXXXXXXXXX#USER.COM --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxPermSize=1024M -Djava.security.krb5.conf=/etc/krb5.conf" test.jar
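A common cause (not confirmed here) is that in cluster mode the driver runs inside a YARN container on a cluster node, so it may not pick up the Hive configuration that the client machine has locally. A minimal sketch of shipping hive-site.xml with the job, assuming the usual Cloudera path /etc/hive/conf/hive-site.xml:
spark2-submit --master yarn --deploy-mode cluster --keytab XXXXXXXXXX.keytab --principal XXXXXXXXXXX#USER.COM --files /etc/hive/conf/hive-site.xml --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC -XX:MaxPermSize=1024M -Djava.security.krb5.conf=/etc/krb5.conf" test.jar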

Related

What is the difference between yarn mode and deploy mode in Spark?

I'm very confused right now.
Please check whether my understanding is right.
The four cases and commands are below:
# This means: YARN in cluster mode, with deploy mode cluster.
# The cluster has a YARN container (holding the Spark AM and the Spark driver) and the YARN NodeManager.
spark-submit --master yarn --deploy-mode cluster
# This means: YARN in cluster mode, with deploy mode client.
# The client has the Spark driver.
# The cluster has a YARN container (holding the Spark AM and the Spark driver) and the YARN NodeManager.
spark-submit --master yarn --deploy-mode client
# This means: YARN in client mode, with deploy mode cluster.
# The cluster has a YARN container (holding the Spark AM) and the YARN NodeManager.
spark-submit --master yarn-client --deploy-mode cluster
# This means: YARN in client mode, with deploy mode client.
# The client has the Spark driver.
# The cluster has a YARN container (holding the Spark AM) and the YARN NodeManager.
spark-submit --master yarn-client --deploy-mode client
Is the explanation of the above code correct?
#Use yarn, deploy the driver into the yarn cluster.
spark-submit --master yarn --deploy-mode cluster
#Use yarn, deploy the driver on my local machine (the machine that is launching the code)
spark-submit --master yarn --deploy-mode client # this is the default if you don't specify --deploy-mode
These aren't actual options anymore so they aren't really worth discussing:
spark-submit --master yarn-client --deploy-mode cluster
spark-submit --master yarn-client --deploy-mode client
--master yarn-client may have been an option in early versions of Spark, but it isn't used today (as referenced in the documentation above).
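For reference, the old-style master URLs map onto the current syntax roughly like this (a sketch; yarn-client and yarn-cluster were deprecated in Spark 2.x, and app.jar is a placeholder):
# old: spark-submit --master yarn-client app.jar
# new: spark-submit --master yarn --deploy-mode client app.jar
# old: spark-submit --master yarn-cluster app.jar
# new: spark-submit --master yarn --deploy-mode cluster app.jar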

Load properties file in Spark classpath during spark-submit execution

I'm installing the Spark Atlas Connector in a spark submit script (https://github.com/hortonworks-spark/spark-atlas-connector)
Due to security restrictions, I can't put the atlas-application.properties in the spark/conf directory.
I used these two options in spark-submit:
--driver-class-path "spark.driver.extraClassPath=hdfs:///directory_to_properties_files" \
--conf "spark.executor.extraClassPath=hdfs:///directory_to_properties_files" \
When I launch spark-submit, I encounter this issue:
20/07/20 11:32:50 INFO ApplicationProperties: Looking for atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Looking for /atlas-application.properties in classpath
20/07/20 11:32:50 INFO ApplicationProperties: Loading atlas-application.properties from null
Please see the CDP Atlas configuration article:
https://community.cloudera.com/t5/Community-Articles/How-to-pass-atlas-application-properties-configuration-file/ta-p/322158
Client Mode:
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-java-options="-Datlas.conf=/tmp/" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
Cluster Mode:
sudo -u spark spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --files /tmp/atlas-application.properties --conf spark.driver.extraJavaOptions="-Datlas.conf=./" /opt/cloudera/parcels/CDH/jars/spark-examples*.jar 10
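As I understand it, --files ships atlas-application.properties into the YARN container's working directory, which is why -Datlas.conf=./ works for the driver in cluster mode, while in client mode the driver runs locally and reads the file directly from /tmp/. If the executors also need the file, passing the same flag through spark.executor.extraJavaOptions should work as well (an untested sketch):
--conf spark.executor.extraJavaOptions="-Datlas.conf=./"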

Spark metrics sink doesn't expose executor's metrics

I'm using Spark on YARN with
Ambari 2.7.4
HDP Standalone 3.1.4
Spark 2.3.2
Hadoop 3.1.1
Graphite on Docker latest
I was trying to get Spark metrics with Graphite sink following this tutorial.
Advanced spark2-metrics-properties in Ambari are:
driver.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
executor.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
worker.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
master.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=ap-test-m.c.gcp-ps.internal
*.sink.graphite.port=2003
*.sink.graphite.protocol=tcp
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=app-test
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark submit:
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data
As a result, I'm only getting driver metrics.
Also, I tried adding metrics.properties to the spark-submit command together with the global Spark metrics props, but that didn't help.
And finally, I tried setting the conf both in spark-submit and in the Java SparkConf:
--conf "spark.metrics.conf.driver.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.executor.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "worker.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "master.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink"
--conf "spark.metrics.conf.*.sink.graphite.host"="host"
--conf "spark.metrics.conf.*.sink.graphite.port"=2003
--conf "spark.metrics.conf.*.sink.graphite.period"=10
--conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
--conf "spark.metrics.conf.*.sink.graphite.prefix"="app-test"
--conf "spark.metrics.conf.*.source.jvm.class"="org.apache.spark.metrics.source.JvmSource"
But that didn't help either.
CSVSink also gives only driver metrics.
UPD
When I submit the job in cluster mode, I get the same metrics as in the Spark History Server, but the JVM metrics are still absent.
Posting to a dated question, but maybe it will help.
It seems the executors do not have the metrics.properties file on their filesystems.
One way to confirm this would be to look at the executor logs:
2020-01-16 10:00:10 ERROR MetricsConfig:91 - Error loading configuration file metrics.properties
java.io.FileNotFoundException: metrics.properties (No such file or directory)
at org.apache.spark.metrics.MetricsConfig.loadPropertiesFromFile(MetricsConfig.scala:132)
at org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:55)
at org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:95)
at org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:233)
To fix this on YARN, pass two parameters to spark-submit:
$ spark-submit \
--files metrics.properties \
--conf spark.metrics.conf=metrics.properties
The --files option ensures that the files specified in it are shipped to the executors.
The spark.metrics.conf option specifies a custom file location for the metrics.
Another way to fix the issue would be to place the metrics.properties file into $SPARK_HOME/conf/metrics.properties on both the driver and executor before starting the job.
More on metrics here: https://spark.apache.org/docs/latest/monitoring.html
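Putting the two options together with the submit command from the question, a sketch (assuming the metrics.properties with the Graphite sink settings above sits in the directory you submit from):
export HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf/; spark-submit --class com.Main --master yarn --deploy-mode client --files metrics.properties --conf spark.metrics.conf=metrics.properties --driver-memory 1g --executor-memory 10g --num-executors 2 --executor-cores 2 spark-app.jar /data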

spark-submit yarn java.lang.ClassNotFoundException:com.microsoft.sqlserver.jdbc.SQLServerDriver

spark-submit --master yarn --deploy-mode cluster sqlserver.py --jars sqljdbc42.jar
I get this error:
java.lang.ClassNotFoundException:com.microsoft.sqlserver.jdbc.SQLServerDriver
Everything works when I use --deploy-mode client and copy the jar file to /usr/hdp/current/spark2-client/jars/sqljdbc42.jar
Should I copy the sqljdbc42.jar to /usr/hdp/current/hadoop-yarn-client/lib/ on all data nodes?
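Not a confirmed answer, but note that in the command above --jars comes after sqlserver.py: spark-submit treats everything after the application file as arguments to the application itself, so the --jars option is never applied. Moving the option before the script may already fix the cluster-mode run, for example:
spark-submit --master yarn --deploy-mode cluster --jars sqljdbc42.jar sqlserver.py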

Dynamic Resource allocation in Spark-Yarn Cluster Mode

When I use the setting below to start the Spark application (the default is yarn-client mode), it works fine:
spark_memory_setting="--master yarn --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
ISSUE
Whereas when I change the deploy mode to cluster, the application does not start up. It doesn't even throw an error that I could work from.
spark_memory_setting="--master yarn-cluster --deploy-mode=cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"
LOG
18/01/08 01:21:00 WARN Client: spark.yarn.am.extraJavaOptions will not take effect in cluster mode
This is the last line from the logger.
Any suggestions most welcome.
One important thing to highlight here: the Spark application I'm trying to deploy starts the Apache Thrift server. From my searching, I think it's because of Thrift that I'm not able to run in YARN cluster mode. Any help to run it in cluster mode?
The option --master yarn-cluster is wrong; it is not a valid master URL. It should be just "yarn" instead of "yarn-cluster". Please cross-check.
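For reference, a sketch of the corrected setting with the same flags as in the question:
spark_memory_setting="--master yarn --deploy-mode cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.queue=ciqhigh --conf spark.dynamicAllocation.initialExecutors=50 --conf spark.dynamicAllocation.maxExecutors=50 --executor-memory 2G --driver-memory 4G"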
