Spark 3.x on HDP 3.1 in headless mode with Hive - Hive tables not found

How can I configure Spark 3.x on HDP 3.1 to interact with Hive, using the headless version of Spark (https://spark.apache.org/docs/latest/hadoop-provided.html)?
First, I downloaded and unpacked headless Spark 3.x:
cd ~/development/software/spark-3.0.0-bin-without-hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export SPARK_DIST_CLASSPATH=$(hadoop --config /usr/hdp/current/spark2-client/conf classpath)
ls /usr/hdp # note the HDP version and replace 3.1.x.x-xxx below with it
./bin/spark-shell --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
spark.sql("show databases").show
// only showing default namespace, existing hive tables are missing
+---------+
|namespace|
+---------+
| default|
+---------+
spark.conf.get("spark.sql.catalogImplementation")
res2: String = in-memory // I want to see "hive" here - how? How do I get the Hive JARs onto the classpath?
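For reference, this is what I would normally write once the Hive classes are on the classpath (just a sketch; the application name is made up):
import org.apache.spark.sql.SparkSession

// enableHiveSupport() sets spark.sql.catalogImplementation=hive,
// but it fails if the Hive classes are not found on the classpath
val spark = SparkSession
  .builder()
  .appName("headless-hive-check")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()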
NOTE
This is an updated version of "How can I run spark in headless mode in my custom version on HDP?" for Spark 3.x and HDP 3.1, and of "custom spark does not find hive databases when running on yarn".
Furthermore: I am aware of the problems with ACID Hive tables in Spark. For now, I simply want to be able to see the existing databases.
edit
We must get the Hive JARs onto the classpath. Trying as follows:
export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}"
And now using spark-sql:
./bin/spark-sql --master yarn --queue myqueue --conf spark.driver.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=3.1.x.x-xxx' --conf spark.hadoop.metastore.catalog.default=hive --files /usr/hdp/current/hive-client/conf/hive-site.xml
fails with:
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
I.e. the line export SPARK_DIST_CLASSPATH="/usr/hdp/current/hive-client/lib*:${SPARK_DIST_CLASSPATH}" had no effect (the same issue occurs if it is not set).

As noted above and in "custom spark does not find hive databases when running on yarn", the Hive JARs are needed; they are not supplied in the headless version.
I was unable to retrofit them.
Solution: instead of worrying about retrofitting, simply use the Spark build that bundles Hadoop 3.2 (it works on HDP 3.1).
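With that build, the same checks as in the question should then report the Hive catalog, e.g.:
spark.conf.get("spark.sql.catalogImplementation")  // expected: hive
spark.sql("show databases").show()                 // the existing Hive databases should now be listed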

Related

SparkSession Application Source Code Config Properties not Overriding JupyterHub & Zeppelin on AWS EMR defaults

I have the Spark driver set up to use Zeppelin and/or JupyterHub as a client for interactive Spark programming on AWS EMR. However, when I create the SparkSession with custom config properties (application name, number of cores, executor RAM, number of executors, serializer, etc.), it does not override the default values for those configs (confirmed under the Environment tab in the Spark UI and via spark.conf.get(...)).
Like any Spark app, these clients on EMR should be using my custom config properties, because properties set in SparkSession code take the highest precedence, ahead of spark-submit flags, the Spark config file, and then spark-defaults. JupyterHub also immediately launches a Spark application without coding one, or when just running an empty cell.
Is there a setting specific to Zeppelin, JupyterHub, or a separate XML conf that needs to be adjusted to get custom configs recognized and working? Any help is much appreciated.
Example of creating a basic application where these cluster resource configs should be applied, instead of the standard defaults, which is what happens with Zeppelin/JupyterHub on EMR.
# via zep or jup [configs NOT being recognized]
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("app_name") \
    .master("yarn") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.executor.instances", 11) \
    .config("spark.executor.cores", 5) \
    .config("spark.executor.memory", "19g") \
    .getOrCreate()
# via ssh terminal [configs ARE recognized at run-time]
pyspark \
--name "app_name" \
--master yarn \
--deploy-mode client \
--num-executors 11 \
--executor-cores 5 \
--executor-memory 19g \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
Found a solution. The config.json file under /etc/jupyter/conf had some default Spark config values, hence I removed them to leave an empty JSON key/value like => _configs":{}. Creating a custom SparkSession via JupyterHub now picks up the specified cluster configs.
These magic commands always work: %%configure
https://github.com/jupyter-incubator/sparkmagic
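For example, a %%configure cell could look like this (a sketch, assuming a sparkmagic/Livy-backed kernel; -f forces the session to restart with the new settings, and the keys follow Livy's session API):
%%configure -f
{"numExecutors": 11, "executorCores": 5, "executorMemory": "19g",
 "conf": {"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}}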

Dependency is not added to Spark + Zeppelin

I can't add a custom dependency to the Spark classpath from Zeppelin.
Environment:
AWS EMR: Zeppelin 0.8.0, Spark 2.4.0
extra configs for spark interpreter:
spark.jars.ivySettings /tmp/ivy-settings.xml
spark.jars.packages my-group-name:artifact_2.11:version
The files from my-group-name appeared in
spark.yarn.dist.jars
spark.yarn.secondary.jars
but they are not accessible from the Zeppelin notebook (checked with import my.lab._).
However, when I run with the same configs in spark-shell, it works both on my local machine and over SSH on the EMR cluster, and the imports are available from spark-shell.
sun.java.command for Zeppelin:
org.apache.spark.deploy.SparkSubmit --master yarn-client ... --conf spark.jars.packages=my-group-name:artifact_2.11:version ... --conf spark.jars.ivySettings=/tmp/ivy-settings.xml ... --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer /usr/lib/zeppelin/interpreter/spark/spark-interpreter-0.8.0.jar <IP ADDRESS> 34717 :
spark-shell on EMR:
spark-shell --master yarn-client --conf spark.jars.ivySettings="/tmp/ivy-settings.xml" --conf spark.jars.packages="my-group-name:artifact_2.11:version"
Any advice on where to look for the error?
You can try to add your jar directly to Zeppelin, in Interpreter settings.
http://zeppelin.apache.org/docs/0.8.0/usage/interpreter/dependency_management.html
Or, add the jar to the Spark libs (in my case it's the /usr/hdp/current/spark2/jars/ directory).
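To narrow it down, you can also check from a notebook paragraph what Spark itself thinks it has shipped (a sketch, assuming the Scala Spark interpreter where sc and spark are predefined):
// jars registered with the SparkContext (should include the resolved my-group-name artifact)
sc.listJars().foreach(println)
// all spark.jars* properties, to compare with spark.yarn.dist.jars / spark.yarn.secondary.jars
spark.conf.getAll.filter(_._1.startsWith("spark.jars")).foreach(println)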

Spark on YARN + Secured hbase

I am submitting a job to YARN (on Spark 2.1.1 + Kafka 0.10.2.1) which connects to a secured HBase cluster. This job performs just fine when I run in "local" mode (spark.master=local[*]).
However, as soon as I submit the job with master as YARN (and deploy mode as client), I see the following error message:
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
I am following the Hortonworks recommendations for providing HBase and keytab information to the YARN cluster. I followed this KB article: https://community.hortonworks.com/content/supportkb/48988/how-to-run-spark-job-to-interact-with-secured-hbas.html
Any pointers on what could be going on?
The mechanism for logging into HBase:
UserGroupInformation.setConfiguration(hbaseConf)
val keyTab = "keytab-location"
val principal = "kerberos-principal"
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keyTab)
UserGroupInformation.setLoginUser(ugi)
ugi.doAs(new PrivilegedExceptionAction[Void]() {
  override def run: Void = {
    hbaseCon = Some(ConnectionFactory.createConnection(hbaseConf))
    null
  }
})
Also, I tried the alternative mechanism for logging in, as:
UserGroupInformation.loginUserFromKeytab(principal, keyTab)
connection=ConnectionFactory.createConnection(hbaseConf)
please suggest.
You are not alone in the quest for Kerberos auth to HBase from Spark; cf. SPARK-12279.
A little-known fact is that Spark now generates Hadoop "auth tokens" for YARN, HDFS, Hive and HBase on startup. These tokens are then broadcast to the executors, so that they don't have to mess again with Kerberos auth, keytabs, etc.
The first problem is that it's not explicitly documented, and in case of failure the errors are hidden by default (i.e. most people don't connect to HBase with Kerberos, so it's usually pointless to state that the HBase JARs are not in the CLASSPATH and no HBase token was created... usually.)
To log all details about these tokens, you have to set the log level for org.apache.spark.deploy.yarn.Client to DEBUG.
The second problem is that beyond the properties, Spark supports many env variables, some documented, some not documented, and some actually deprecated.
For instance, SPARK_CLASSPATH is now deprecated, and its content is actually injected into the Spark properties spark.driver.extraClassPath / spark.executor.extraClassPath. But SPARK_DIST_CLASSPATH is still in use, and in the Cloudera distro, for example, it is used to inject the core Hadoop libs & config into the Spark "launcher" so that it can bootstrap a YARN-cluster execution before the driver is started (i.e. before spark.driver.extraClassPath is evaluated).
Other variables of interest are:
HADOOP_CONF_DIR
SPARK_CONF_DIR
SPARK_EXTRA_LIB_PATH
SPARK_SUBMIT_OPTS
SPARK_PRINT_LAUNCH_COMMAND
The third problem is that, in some specific cases (e.g. YARN-cluster mode in the Cloudera distro), the Spark property spark.yarn.tokens.hbase.enabled is silently set to false -- which makes absolutely no sense, since that default is hard-coded to true in the Spark source code!
So you are advised to force it explicitly to true in your job config.
The fourth problem is that, even if the HBase token has been created at startup, then the executors must explicitly use it to authenticate. Fortunately, Cloudera has contributed a "Spark connector" to HBase, to take care of this kind of nasty stuff automatically. It's now part of the HBase client, by default (cf. hbase-spark*.jar).
The fifth problem is that, AFAIK, if you don't have metrics-core*.jar in the CLASSPATH then the HBase connections will fail with puzzling (and unrelated) ZooKeeper errors.
¤¤¤¤¤ How to make that stuff work, with debug traces
# we assume that spark-env.sh and spark-default.conf are already Hadoop-ready,
# and also *almost* HBase-ready (as in a CDH distro);
# especially HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH are expected to be set
# but spark.*.extraClassPath / .extraJavaOptions are expected to be unset
KRB_DEBUG_OPTS="-Dlog4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext"
EXTRA_HBASE_CP=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar
export SPARK_SUBMIT_OPTS="$KRB_DEBUG_OPTS"
export HADOOP_JAAS_DEBUG=true
export SPARK_PRINT_LAUNCH_COMMAND=True
spark-submit --master yarn-client \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal XX#Z.NET --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
spark-submit --master yarn-cluster --conf spark.yarn.report.interval=4000 \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal XX#Z.NET --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.driver.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.driverEnv.HADOOP_JAAS_DEBUG=true \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
PS: when using an HBaseContext you don't need /etc/hbase/conf/ in the executor's CLASSPATH; the conf is propagated automatically.
PPS: I advise you to set log4j.logger.org.apache.zookeeper.ZooKeeper=WARN in log4j.properties because it's verbose, useless, and even confusing (all the interesting stuff is logged at the HBase level).
PPPS: instead of that verbose SPARK_SUBMIT_OPTS var, you could also list the Log4J options statically in $SPARK_CONF_DIR/log4j.properties and the rest in $SPARK_CONF_DIR/java-opts; the same goes for the Spark properties in $SPARK_CONF_DIR/spark-defaults.conf and env variables in $SPARK_CONF_DIR/spark-env.sh
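For instance, the static Log4J equivalent of the debug loggers used above would be (a sketch that keeps only loggers already mentioned in this answer):
# $SPARK_CONF_DIR/log4j.properties (excerpt)
log4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG
log4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG
log4j.logger.org.apache.zookeeper.ZooKeeper=WARN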
¤¤¤¤¤ About the "Spark connector" to HBase
Excerpt from the official HBase documentation, chapter 83 Basic Spark
At the root of all Spark and HBase integration is the HBaseContext.
The HBaseContext takes in HBase configurations and pushes them to
the Spark executors. This allows us to have an HBase Connection per
Spark Executor in a static location.
What is not mentioned in the doc is that the HBaseContext automatically uses the HBase "auth token" (when present) to authenticate the executors.
Note also that the doc has an example (in Scala then in Java) of a Spark foreachPartition operation on an RDD, using a BufferedMutator for async bulk load into HBase.
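A hedged sketch of that pattern (the table name, column family and sample RDD are purely illustrative; HBaseContext.foreachPartition hands each partition an HBase Connection created on the executor):
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, Put}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, hbaseConf)  // also wires up the HBase auth token on the executors

val rdd = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))

hbaseContext.foreachPartition(rdd, (it: Iterator[(String, String)], conn: Connection) => {
  // one BufferedMutator per partition, flushed once at the end (async bulk load pattern)
  val mutator = conn.getBufferedMutator(TableName.valueOf("my_table"))
  it.foreach { case (key, value) =>
    val put = new Put(Bytes.toBytes(key))
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    mutator.mutate(put)
  }
  mutator.flush()
  mutator.close()
})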

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.IOException: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which master and port you are using. If you are on a cluster, log in to the master node and include:
--master spark://XXXX:7077
You can always find it in the Spark UI on port 8080.
Also check your Spark builder config: if you have already set the master there, it takes priority when launching, e.g.:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]")
  .getOrCreate()

Dependency is not distributed to Spark cluster

I'm trying to execute a Spark job on a Mesos cluster that depends on the spark-cassandra-connector library, but it keeps failing with
Exception in thread "main" java.lang.NoClassDefFoundError: com/datastax/spark/connector/package$
As I understand from spark documentation
JARs and files are copied to the working directory for each SparkContext on the executor nodes.
...
Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages.
But it seems that only the task jar pucker-assembly-1.0.jar is distributed.
I'm running Spark 1.6.1 with Scala 2.10.6.
And here's the spark-submit command I'm executing:
spark-submit --deploy-mode cluster
--master mesos://localhost:57811
--conf spark.ssl.noCertVerification=true
--packages datastax:spark-cassandra-connector:1.5.1-s_2.10
--conf spark.cassandra.connection.host=10.0.1.83,10.0.1.86,10.0.1.85
--driver-cores 3
--driver-memory 4000M
--class SimpleApp
https://dripit-spark.s3.amazonaws.com/pucker-assembly-1.0.jar
s3n://logs/E1SR85P3DEM3LU.2016-05-05-11.ceaeb015.gz
So why isn't the spark-cassandra-connector distributed to all my Spark executors?
You should use the correct Maven coordinate syntax:
--packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0
See
https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10
http://spark.apache.org/docs/latest/submitting-applications.html
http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
