Enable Spark security with a secure YARN Hadoop cluster

I have a Hadoop 3.0 cluster configured with Kerberos. Everything works fine and YARN starts as well.
Now I wish to add Spark on top of it and make full use of Hadoop and its security. To do so, I use a binary distribution of Spark 2.3 and made the following changes.
In spark-env.sh:
YARN_CONF_DIR, set to the folder where my Hadoop configuration files core-site.xml, hdfs-site.xml and yarn-site.xml are located.
In spark-defaults.conf:
spark.master yarn
spark.submit.deployMode cluster
spark.authenticate true
spark.yarn.principal mysparkprincipal
spark.yarn.keytab mykeytabfile
If I understood correctly, when using YARN the secret key is generated automatically, so I don't need to set spark.authenticate.secret manually.
The problem I have is that the worker is complaining about the key:
java.lang.IllegalArgumentException: A secret key must be specified via the spark.authenticate.secret config
I also don't have any indication in the logs that Spark is using YARN or trying to do anything with my HDFS volume. It's almost as if the Hadoop configuration files are ignored completely. I've read the documentation on YARN and security for Spark, but it's not very clear to me.
My questions are:
How can I be sure that Spark is using YARN? (See the sketch below this list.)
Do I need to set spark.yarn.access.hadoopFileSystems if I only use the server set in YARN_CONF_DIR?
Is it best to set LOCAL_DIRS to HDFS, and if so, what is the syntax?
Do I need both HADOOP_CONF_DIR and YARN_CONF_DIR?
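For the first question, here is the kind of runtime check I have in mind, a minimal sketch assuming a SparkSession named spark is available (e.g. the one provided by spark-shell):
// Minimal sketch: confirm at runtime that the application was really submitted to YARN.
// Assumes a SparkSession named `spark` (e.g. the one created by spark-shell).
val sc = spark.sparkContext
println(s"master         = ${sc.master}")        // expected: "yarn"
println(s"deploy mode    = ${sc.deployMode}")     // "client" or "cluster"
println(s"application id = ${sc.applicationId}")  // YARN ids look like application_<timestamp>_<sequence>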
Edit/Add:
After looking at the source code, the exception comes from SASL, which is not enabled for Spark, so I don't understand it.
My Hadoop cluster has SSL enabled (data confidentiality), and since I give Spark my server configuration, maybe it requires SSL for Spark because the Hadoop configuration has it enabled.
So far I am really confused about everything.
The documentation says that environment variables need to be set using spark.yarn.appMasterEnv. But which ones? All of them?
It also says that it is the Hadoop client configuration files that I need to have on the classpath, but which properties should be present in a client file?
I guess I can replace the XML files using spark.hadoop.* properties, but which properties are required for Spark to know where my YARN cluster is?
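For illustration, this is the kind of thing I have in mind; the property names are the standard YARN/HDFS client settings, and the hosts and ports are placeholders for my cluster:
// Sketch only: point Spark at the cluster via spark.hadoop.* properties instead of shipping the XML files.
// Hostnames and ports are placeholders; I am not sure this fully replaces HADOOP_CONF_DIR/YARN_CONF_DIR.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("yarn")
  .config("spark.hadoop.fs.defaultFS", "hdfs://namenode-host:8020")
  .config("spark.hadoop.yarn.resourcemanager.address", "resourcemanager-host:8032")
  .getOrCreate()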
Setting spark.authenticate.enableSaslEncryption to false seems to have no effect, as the exception is still about SparkSaslClient.
The exception is:
java.lang.IllegalArgumentException: A secret key must be specified via the spark.authenticate.secret config
at org.apache.spark.SecurityManager$$anonfun$getSecretKey$4.apply(SecurityManager.scala:510)
at org.apache.spark.SecurityManager$$anonfun$getSecretKey$4.apply(SecurityManager.scala:510)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.SecurityManager.getSecretKey(SecurityManager.scala:509)
at org.apache.spark.SecurityManager.getSecretKey(SecurityManager.scala:551)
at org.apache.spark.network.sasl.SparkSaslClient$ClientCallbackHandler.handle(SparkSaslClient.java:137)
at com.sun.security.sasl.digest.DigestMD5Client.processChallenge(DigestMD5Client.java:337)
at com.sun.security.sasl.digest.DigestMD5Client.evaluateChallenge(DigestMD5Client.java:220)
at org.apache.spark.network.sasl.SparkSaslClient.response(SparkSaslClient.java:98)
at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:71)
at org.apache.spark.network.crypto.AuthClientBootstrap.doSaslAuth(AuthClientBootstrap.java:115)
at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:74)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:257)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Related

How to load data into Spark from a remote HDFS?

Our data is stored on a remote Hadoop cluster, but for a PoC I need to run a Spark application locally on my machine. How can I load data from that remote HDFS?
You can configure Spark to access any Hadoop instance you have access to (ports open, nodes reachable).
Custom Hadoop/Hive Configuration
If your Spark application is interacting with Hadoop, Hive, or both, there are probably Hadoop/Hive configuration files in Spark’s classpath. Multiple running applications might require different Hadoop/Hive client side configurations. You can copy and modify hdfs-site.xml, core-site.xml, yarn-site.xml, hive-site.xml in Spark’s classpath for each application. In a Spark cluster running on YARN, these configuration files are set cluster-wide, and cannot safely be changed by the application.
Since you want to access HDFS, you need hdfs-site.xml and core-site.xml from the cluster you are trying to access.
For anyone who wants to access a remote HDFS from a Spark Java app, here are the steps.
First, you need to add a --conf key to your run command, depending on the Spark version:
(Spark 1.x-2.1)
spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB
(Spark 2.2+)
spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB
Second, when you create Spark's Java context, add the following:
javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));
If you face this exception:
java.net.UnknownHostException: clusterB
then try putting the full NameNode address of your remote HDFS, including the port (instead of the hdfs/cluster short name), into the --conf of your run command.
More details in my article: https://mchesnavsky.tech/spark-java-access-remote-hdfs.

Error reading Hive table from Spark using JdbcStorageHandler

I've set up access to an external relational store (PostgreSQL) via my Spark/Hive deployment. I can read this table via Hive/Beeline, but it fails when I try to read via SparkSQL in a pyspark3 Jupyter notebook, because it's unable to find JdbcStorageHandler. I've tried to add the appropriate jars in a couple of ways, but I'm hitting the same stack trace across the board. Any advice on which jar and version I need, and where exactly I should put it, for this to work? Stack trace:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hive.storage.jdbc.JdbcStorageHandler
..
..
java.lang.ClassNotFoundException: org.apache.hive.storage.jdbc.JdbcStorageHandler
In terms of getting Hive/Beeline to work: I did as described in this JDBC Storage Handler document. I hit a few jar dependency problems while doing this, but resolved them by adding the hive-jdbc-2.0.0.jar and postgresql-42.2.12.jar jars after launching Beeline, and I can now successfully read data directly from the relational store from Beeline.
Some things I've tried:
Add the jars listed above with spark.jars.packages in the notebook sparkmagic conf. hive-jdbc 2.0.0 installs cleanly but yields the aforementioned error. I also tried hive-jdbc 3.1.0, but it errors out and does not install. I was a little confused about how to assess compatibility here; this might be a distraction.
Launch spark-sql on the cluster directly and add the hive-jdbc-2.0.0.jar jar (successfully). Same stack trace.
Add the Apache Hive libraries across the cluster during cluster creation (the hive-jdbc jar and the Postgres driver).
Look around the rest of /usr/hdp for hive-jdbc, of which there are a variety of versions (beneath zeppelin, spark2, oozie, hive-hcatalog, hive, ranger-admin).
Environment details:
running on Azure HDInsight
Spark 2.4 (HDI 4.0)
Please copy hive-jdbc-handler.jar to the $SPARK_HOME/standalone-metastore directory on all nodes:
cp /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar /usr/hdp/current/spark2-client/standalone-metastore/
After that, launch spark-shell and test the example:
sudo -u spark spark-shell --master yarn --jars /usr/hdp/current/hive-client/lib/hive-jdbc-handler.jar
scala> spark.sql("select * from table_name").show()
If you get the error below, there is an issue filed for this; you need to backport that fix from Spark 3.0.0 to your cluster's Spark version, for example Spark 2.3.2.
Caused by: java.lang.IllegalArgumentException: requirement failed: length (-1) cannot be negative
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.rdd.InputFileBlockHolder$.set(InputFileBlockHolder.scala:70)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:226)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
https://issues.apache.org/jira/browse/SPARK-27259

Set Cloudera application tags for Spark application

I have set spark.yarn.tags in my Spark application, and it is visible in my config when printed.
But Cloudera Manager is unable to detect it in the application_tags field of the YARN application.
Does application_tags map to spark.yarn.tags for Spark applications?
I think I found the solution.
When spark.yarn.tags is set while calling spark-submit, Cloudera Manager detects it. So I believe it is something that is required before the Spark context is created, hence it has to be passed as a conf when submitting.
This is how it can be passed to spark-submit:
--conf spark.yarn.tags=tag-name
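To double-check from inside the application that the tag made it into the configuration, here is a small sketch (assuming a SparkSession named spark, e.g. a spark-shell launched with the --conf above):
// Sketch: print the tag that was submitted with the application, if any.
// Assumes a SparkSession named `spark` is available.
println(spark.sparkContext.getConf.getOption("spark.yarn.tags").getOrElse("spark.yarn.tags is not set"))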

Spark Kafka security with Kerberos

I am trying to use Kafka (0.9.1) in secure mode. I want to read the data with Spark, so I must pass the JAAS conf file to the JVM. I use this command to start my job:
/opt/spark/bin/spark-submit -v --master spark://master1:7077 \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.conf=kafka_client_jaas.conf" \
--files "./conf/kafka_client_jaas.conf,./conf/kafka.client.1.keytab" \
--class kafka.ConsumerSasl ./kafka.jar --topics test
I still get the same error:
Caused by: java.lang.IllegalArgumentException: You must pass java.security.auth.login.config in secure mode.
at org.apache.kafka.common.security.kerberos.Login.login(Login.java:289)
at org.apache.kafka.common.security.kerberos.Login.<init>(Login.java:104)
at org.apache.kafka.common.security.kerberos.LoginManager.<init>(LoginManager.java:44)
at org.apache.kafka.common.security.kerberos.LoginManager.acquireLoginManager(LoginManager.java:85)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:55)
I think Spark does not inject the parameter -Djava.security.auth.login.conf into the JVM!
The main cause of this issue is that you have used the wrong property name: it should be java.security.auth.login.config and not java.security.auth.login.conf. Moreover, if you are using a keytab file, make sure it is available on all executors using the --files argument of spark-submit. If you are using a Kerberos ticket, make sure to set KRB5CCNAME on all executors using the property SPARK_YARN_USER_ENV.
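Applied to the spark-submit command from the question, only the property name changes (everything else stays as in the question):
/opt/spark/bin/spark-submit -v --master spark://master1:7077 \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \
--files "./conf/kafka_client_jaas.conf,./conf/kafka.client.1.keytab" \
--class kafka.ConsumerSasl ./kafka.jar --topics test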
If you are using an older version of Spark, 1.6.x or earlier, there are known issues with Spark where this integration does not work, and you have to write a custom receiver.
For Spark 1.8 and later, you can see the configuration here.
In case you need to create a custom receiver, you can see this.

How can I access a CFS URL from a remote non-DSE (DataStax) node?

I am trying to do the following from my program:
val file = sc.textFile("cfs://ip/.....")
but I get a java.io.IOException: No FileSystem for scheme: cfs exception.
How and where should I modify core-site.xml? Should it be on the DSE nodes, or should I add it as a resource in my jar?
I use Maven to build my jar and execute the jobs remotely, from a non-DSE node which does not have Cassandra, Spark, or anything similar. Other types of flows without CFS files work fine, so the jar is OK so far.
Thanks!
There is some info in the middle of this page about Spark using Hadoop for some operations, such as CFS access: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkCassProps.html
I heard about a problem using Hive from a non-DSE node that was solved by adding a property to core-site.xml. This is really a long shot since this is Spark, but if you're willing to experiment, try adding the IP address of the remote machine to the core-site.xml file:
<property>
  <name>cassandra.host</name>
  <value>192.168.2.100</value>
</property>
Find the core-site.xml in /etc/dse/hadoop/conf/ or install_location/resources/hadoop/conf/, depending on the type of installation.
I assume you started the DSE cluster in hadoop and spark mode: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkStart.html
It has been quite some time.
The integration is done as usual for any integration of a Hadoop client with a compatible Hadoop filesystem.
Copy core-site.xml (append dse-core-default.xml to it) along with dse.yaml and cassandra.yaml, and then set up the proper dependencies on the classpath, e.g. dse.jar, cassandra-all, etc.
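As a rough sketch of what the client-side code can look like once those files and jars are in place (paths are placeholders, and this assumes the copied core-site.xml is what maps the cfs:// scheme):
// Rough, unofficial sketch: load the cluster's client-side core-site.xml so the cfs:// scheme
// can be resolved from a non-DSE node. Paths below are placeholders.
import org.apache.hadoop.fs.Path

sc.hadoopConfiguration.addResource(new Path("/path/to/copied/core-site.xml"))
val file = sc.textFile("cfs://192.168.2.100/some/file.txt")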
Note: this is not officially supported, so it is better to use another way.
