Spark on YARN + Secured hbase - apache-spark

I am submitting a job to YARN (on spark 2.1.1 + kafka 0.10.2.1) which connects to a secured hbase cluster. This job, performs just fine when i am running in "local" mode (spark.master=local[*]).
However, as soon as I submit the job with master as YARN (and deploy mode as client), I see the following error message -
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
I am following hortonworks recommendations for providing information to yarn cluster regarding the HBase and keytab etc. Followed this kb article - https://community.hortonworks.com/content/supportkb/48988/how-to-run-spark-job-to-interact-with-secured-hbas.html
Any pointers what could be going on ?
the mechanism for logging into HBase:
UserGroupInformation.setConfiguration(hbaseConf)
val keyTab = "keytab-location")
val principal = "kerberos-principal"
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keyTab)
UserGroupInformation.setLoginUser(ugi)
ugi.doAs(new PrivilegedExceptionAction[Void]() {
override def run: Void = {
hbaseCon = Some(ConnectionFactory.createConnection(hbaseConf))
null
}
})
Also, I tried the alternative mechanism for logging in, as:
UserGroupInformation.loginUserFromKeytab(principal, keyTab)
connection=ConnectionFactory.createConnection(hbaseConf)
please suggest.

You are not alone in the quest for Kerberos auth to HBase from Spark, cf. SPARK-12279
A little-known fact is that Spark now generates Hadoop "auth tokens" for Yarn, HDFS, Hive, HBase on startup. These tokens are then broadcasted to the executors, so that they don't have to mess again with Kerberos auth, keytabs, etc.
The first problem is that it's not explicitly documented, and in case of failure the errors are hidden by default (i.e. most people don't connect to HBase with Kerberos, so it's usually pointless to state that the HBase JARs are not in the CLASSPATH and no HBase token was created... usually.)
To log all details about these tokens, you have to set the log level for org.apache.spark.deploy.yarn.Client to DEBUG.
The second problem is that beyond the properties, Spark supports many env variables, some documented, some not documented, and some actually deprecated.
For instance, SPARK_CLASSPATH is now deprecated, and its content actually injected in Spark properties spark.driver / spark.executor.extraClassPath.But SPARK_DIST_CLASSPATH is still in use, and in the Cloudera distro for example, it is used to inject the core Hadoop libs & config into the Spark "launcher" so that it can bootstrap a YARN-cluster execution, before the driver is started (i.e. before spark.driver.extraClassPath is evaluated).
Other variables of interest are
HADOOP_CONF_DIR
SPARK_CONF_DIR
SPARK_EXTRA_LIB_PATH
SPARK_SUBMIT_OPTS
SPARK_PRINT_LAUNCH_COMMAND
The third problem is that, in some specific cases (e.g. YARN-cluster mode in the Cloudera distro), the Spark property spark.yarn.tokens.hbase.enabled is set silently to false -- which makes absolutely no sense, that default is hard-coded to true in Spark source code...!
So you are advised to force it explicitly to true in your job config.
The fourth problem is that, even if the HBase token has been created at startup, then the executors must explicitly use it to authenticate. Fortunately, Cloudera has contributed a "Spark connector" to HBase, to take care of this kind of nasty stuff automatically. It's now part of the HBase client, by default (cf. hbase-spark*.jar).
The fifth problem is that, AFAIK, if you don't have metrics-core*.jar in the CLASSPATH then the HBase connections will fail with puzzling (and unrelated) ZooKepper errors.
¤¤¤¤¤ How to make that stuff work, with debug traces
# we assume that spark-env.sh and spark-default.conf are already Hadoop-ready,
# and also *almost* HBase-ready (as in a CDH distro);
# especially HADOOP_CONF_DIR and SPARK_DIST_CLASSPATH are expected to be set
# but spark.*.extraClassPath / .extraJavaOptions are expected to be unset
KRB_DEBUG_OPTS="-Dlog4j.logger.org.apache.spark.deploy.yarn.Client=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation=DEBUG -Dlog4j.logger.org.apache.hadoop.hbase.spark.HBaseContext=DEBUG -Dsun.security.krb5.debug=true -Djava.security.debug=gssloginconfig,configfile,configparser,logincontext"
EXTRA_HBASE_CP=/etc/hbase/conf/:/opt/cloudera/parcels/CDH/lib/hbase/hbase-spark.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/metrics-core-2.2.0.jar
export SPARK_SUBMIT_OPTS="$KRB_DEBUG_OPTS"
export HADOOP_JAAS_DEBUG=true
export SPARK_PRINT_LAUNCH_COMMAND=True
spark-submit --master yarn-client \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal XX#Z.NET --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
spark-submit --master yarn-cluster --conf spark.yarn.report.interval=4000 \
--files "/etc/spark/conf/log4j.properties#yarn-log4j.properties" \
--principal XX#Z.NET --keytab /a/b/XX.keytab \
--conf spark.yarn.tokens.hbase.enabled=true \
--conf spark.driver.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.driver.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.driverEnv.HADOOP_JAAS_DEBUG=true \
--conf spark.executor.extraClassPath=$EXTRA_HBASE_CP \
--conf "spark.executor.extraJavaOptions=$KRB_DEBUG_OPTS -Dlog4j.configuration=yarn-log4j.properties" \
--conf spark.executorEnv.HADOOP_JAAS_DEBUG=true \
--class TestSparkHBase TestSparkHBase.jar
PS: when using a HBaseContext you don't need /etc/hbase/conf/ in the executor's CLASSPATH, the conf is propagated automatically.
PPS: I advise you to set log4j.logger.org.apache.zookeeper.ZooKeeper=WARN in log4j.properties because it's verbose, useless, and even confusing (all the interesting stuff is logged at HBase level)
PPS: instead of that verbose SPARK_SUBMIT_OPTS var, you could also list statically the Log4J options in $SPARK_CONF_DIR/log4j.properties and the rest in $SPARK_CONF_DIR/java-opts; same goes for the Spark properties in $SPARK_CONF_DIR/spark-defaults.conf and env variables in $SPARK_CONF_DIR/spark-env.sh
¤¤¤¤¤ About the "Spark connector" to HBase
Excerpt from the official HBase documentation, chapter 83 Basic Spark
At the root of all Spark and HBase integration is the HBaseContext.
The HBaseContext takes in HBase configurations and pushes them to
the Spark executors. This allows us to have an HBase Connection per
Spark Executor in a static location.
What is not mentioned in the doc is that the HBaseContext uses automatically the HBase "auth token" (when present) to authenticate the executors.
Note also that the doc has an example (in Scala then in Java) of a Spark foreachPartition operation on a RDD, using a BufferedMutator for async bulk load into HBase.

Related

SparkSession Application Source Code Config Properties not Overriding JupyterHub & Zeppelin on AWS EMR defaults

I have Spark Driver setup to use Zeppelin and or JupyterHub as client for interactive Spark Programming on AWS EMR. However, when I create the SparkSession with custom config properties (application name, # of cores, executor ram, # of executors, serializer, etc) it is not overriding the default values for those configs (confirmed under Environment tab in Spark UI and spark.conf.get(...)).
Like any Spark App these clients on EMR should be using my custom config properties because SparkSession code is the 1st highest override before spark-submit, spark config file, and then spark-defaults. JupyterHub also immediately launches a Spark Application w/o coding one or when just running an empty cell.
Is there a setting specific to Zeppelin, JupyterHub, or a separate xml conf that needs adjusted to get custom configs recognized and working? Any help is much appreciated.
Example of creating a basic application where these cluster resource configs should be implemented instead of the standard default configs which is what is happening with Zeppelin/JupyterHub on EMR.
# via zep or jup [configs NOT being recognized]
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("app_name")\
.master("yarn")\
.config("spark.submit.deployMode","client")\
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
.config("spark.executor.instances", 11)\
.config("spark.executor.cores", 5)\
.config("spark.executor.memory", "19g")\
.getOrCreate()
# via ssh terminal [configs ARE recognized at run-time]
pyspark \
--name "app_name" \
--master yarn \
--deploy-mode client \
--num-executors 11 \
--executor-cores 5 \
--executor-memory 19 \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
Found a solution. The config.json file under /etc/jupyter/conf had some default spark config values hence I removed them to display an empty json key/value like => _configs":{}. Creating a custom SparkSession via JupyterHub now understands the specified cluster configs.
These magic commands are always working %%configure
https://github.com/jupyter-incubator/sparkmagic

What is happening when starting a Spark application on Kubernetes

I read this: Running Spark on Kubernetes.
I want to know more details about the interaction between Kubernetes Controller/Scheduler and Spark runtime when launching a Spark job on K8s.
Specially, assuming we launch an Spark app by :
bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--..............
My question is: the K8s may not be able to allocate 5 executors (or called containers/pods) immediately due to unavailability of cluster resources at the moment the Spark app is launched. Which way does Spark app take? (1) Spark starts running tasks as soon as possible when there is at least one executor is allocated. (2) Spark won't launch any tasks until all of the 5 executors have been allocated.
If you know Hadoop YARN, it would be great if you could also answer the question in the scenario of running Spark app on Hadoop YARN(DynamicAllocation Disabled) and point out the difference.

Spark Standalone: application gets 0 cores

I seem to be unable to assign cores to an application. This leads to the following (apparently common) error message:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have one master and two slaves in a Spark cluster. All are 8-core i7s with 16GB of RAM.
I have left the spark-env.sh virtually virgin on all three, just specifying the master's IP address.
My spark-submit is the following:
nohup ./bin/spark-submit
--jars ./ikoda/extrajars/ikoda_assembled_ml_nlp.jar,./ikoda/extrajars/stanford-corenlp-3.8.0.jar,./ikoda/extrajars/stanford-parser-3.8.0.jar \
--packages datastax:spark-cassandra-connector:2.0.1-s_2.11 \
--class ikoda.mlserver.Application \
--conf spark.cassandra.connection.host=192.168.0.33 \
--conf spark.cores.max=4 \
--driver-memory 4g –num-executors 2 --executor-memory 2g --executor-cores 2 \
--master spark://192.168.0.141:7077 ./ikoda/ikodaanalysis-mlserver-0.1.0.jar 1000 > ./logs/nohup.out &
I suspect I am conflating the sparkConf initialization in my code with the spark-submit. I need this as the app involves SparkStreaming which can require reinitializing the SparkContext.
The sparkConf setup is as follows:
val conf = new SparkConf().setMaster(s"spark://$sparkmaster:7077").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
conf.set("spark.jars.packages", "datastax:spark-cassandra-connector:"+datastaxpackageversion)
conf.set("spark.cores.max", sparkcoresmax)
The Spark UI shows the following:
OK, this is definitely a case of programmer error.
But maybe others will make a similar error. The Master had been used as a local Spark previously. I had put some executor settings in spark-defaults.conf and then months later had forgotten about this.
There is a cascading hierarchy whereby SparkConf settings get precedence, then spark-submit settings and then spark-defaults.conf. spark-defaults.conf overrides defaults set by Apache Spark team
Once I removed the settings from spark-defaults, all was fixed.
It is because of the maximum of your physical memory.
your spark memory in spark UI 14.6GB So you must request memory for each executor below the volume
14.6GB, for this you can add config to your spark conf something like this:
conf.set("spark.executor.memory", "10gb")
if you request more than your physical memory spark does't allocate cpu cores to your job and display 0 in Cores in spark UI and run NOTHING.

With sbt accembly pkg, but still got Err "java.sql.SQLException: No suitable driver found for jdbc:mysql" [duplicate]

I am using
df.write.mode("append").jdbc("jdbc:mysql://ip:port/database", "table_name", properties)
to insert into a table in MySQL.
Also, I have added Class.forName("com.mysql.jdbc.Driver") in my code.
When I submit my Spark application:
spark-submit --class MY_MAIN_CLASS
--master yarn-client
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
This yarn-client mode works for me.
But when I use yarn-cluster mode:
spark-submit --class MY_MAIN_CLASS
--master yarn-cluster
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
It doens't work. I also tried setting "--conf":
spark-submit --class MY_MAIN_CLASS
--master yarn-cluster
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
--conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
but still get the "No suitable driver found for jdbc" error.
I had to add the driver option when using the sparkSession's read function.
.option("driver", "org.postgresql.Driver")
var jdbcDF - sparkSession.read
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://<host>:<port>/<DBName>")
.option("dbtable", "<tableName>")
.option("user", "<user>")
.option("password", "<password>")
.load()
Depending on how your dependencies are setup, you'll notice that when you include something like compile group: 'org.postgresql', name: 'postgresql', version: '42.2.8' in Gradle, for example, this will include the Driver class at org/postgresql/Driver.class, and that's the one you want to instruct spark to load.
There is 3 possible solutions,
You might want to assembly you application with your build manager (Maven,SBT) thus you'll not need to add the dependecies in your spark-submit cli.
You can use the following option in your spark-submit cli :
--jars $(echo ./lib/*.jar | tr ' ' ',')
Explanation : Supposing that you have all your jars in a lib directory in your project root, this will read all the libraries and add them to the application submit.
You can also try to configure these 2 variables : spark.driver.extraClassPath and spark.executor.extraClassPath in SPARK_HOME/conf/spark-default.conf file and specify the value of these variables as the path of the jar file. Ensure that the same path exists on worker nodes.
I tried the suggestions shown here which didn't work for me (with mysql). While debugging through the DriverManager code, I realized that I needed to register my driver since this was not happening automatically with "spark-submit". I therefore added
Driver driver = new Driver();
The constructor registers the driver with the DriverManager, which solved the SQLException problem for me.

Spark YARN on EMR - JavaSparkContext - IllegalStateException: Library directory does not exist

I have Java Spark job that works on manually deployed Spark 1.6.0 in standalone mode on an EC2.
I am spark-submitting this job to a EMR 5.3.0 cluster on the master using YARN but it fails.
Spark-submit line is,
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
The "yarn-client" is the first argument to the x.jar application and is fed to the SparkContext as setMaster,
conf.setMaster(args[0]);
When I submit it, it starts out running fine, until I initialize the JavaSparkContext from a SparkConf,
JavaSparkContext sc = new JavaSparkContext(conf);
... and then Spark crashes.
In the YARN log, I can see the following,
yarn logs -applicationId application_1487325147456_0051
...
17/02/17 16:27:13 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/17 16:27:13 INFO Client: Deleted staging directory hdfs://ip-172-31-8-237.eu-west-1.compute.internal:8020/user/ec2-user/.sparkStaging/application_1487325147456_0052
17/02/17 16:27:13 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Library directory '/mnt/yarn/usercache/ec2-user/appcache/application_1487325147456_0051/container_1487325147456_0051_01_000001/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
...
Noting the WARN of spark.yarn.jars flag missing, I found a spark yarn JAR file in
/usr/lib/spark/jars/
... and uploaded it to HDFS per Cloudera's guide on how to run YARN applications on Spark and tried to add that conf, so this became my spark-submit line,
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --conf spark.yarn.jars=hdfs://`hostname -f`:8020/sparkyarnlibs/spark-yarn_2.11-2.1.0.jar --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
But that did not work and gave this:
Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I am really puzzled as to what that Library error is caused by and how to proceed onwards from here.
You have specified "--deploy-mode cluster" and yet are calling conf.setMaster("yarn-client") from the code. Using a master URL of "yarn-client" means "use YARN as the master, and use client mode (not cluster mode)", so I wouldn't be surprised if this is somehow confusing Spark because on one hand you're telling it to use cluster mode and on the other you're telling it to use client mode.
By the way, using a master URL like "yarn-client" or "yarn-cluster" is actually deprecated because the "-client" or "-cluster" part is not really part of the Master but rather is the deploy mode. That is, "--master yarn-client" is really more of a shortcut/alias for "--master yarn --deploy-mode client", and similarly "--master yarn-cluster" just means "--master yarn --deploy-mode cluster".
My recommendation would be to not call conf.setMaster() from your code, since the master is already set to "yarn" automatically in /etc/spark/conf/spark-defaults.conf. For this reason, you also don't need to pass "--master yarn" to spark-submit.
Lastly, it sounds like you need to decide whether you really want to use client deploy mode or cluster deploy mode. With client deploy mode, the driver runs on the master instance, and with cluster deploy mode, the driver runs in a YARN container on one of the core/task instances. See https://spark.apache.org/docs/latest/running-on-yarn.html for more information.
If you want to use client deploy mode, you don't need to pass anything extra because it's already the default. If you want to use cluster deploy mode, pass "--deploy-mode cluster".

Resources