Persisting to Kerberized HDFS from Spark cluster - apache-spark

My current set-up:
Spark 2.3.1 (cluster running on Windows), secured with a Spark secret (basic authentication).
HDFS (cluster running on Linux), Kerberized.
Not ideal, but there is a good reason why I can't use the same set of machines for both clusters.
I am able to read/write to HDFS from a standalone Spark application, but when I try to run similar code on the Spark cluster I get an authentication error:
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via: [TOKEN, KERBEROS]; Host Details....

Where is your other cluster node? Which user is running Spark in cluster mode? Does that user have permission to access the keytab? It could be a permission issue or a typo.
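If the suspicion is a keytab permission problem, a quick sanity check on the worker host can rule it out. A minimal sketch (the keytab path is a placeholder; Kerberos tools commonly expect the keytab to be readable only by its owner):

```python
import os
import stat

def keytab_readable(path):
    """Return (ok, reason) for a keytab candidate: it must exist,
    be readable by the current user, and not be group/world-readable
    (loose modes are a common cause of authentication failures)."""
    if not os.path.isfile(path):
        return False, "keytab not found"
    if not os.access(path, os.R_OK):
        return False, "current user cannot read keytab"
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        return False, "keytab is group/world-readable; tighten to 0600"
    return True, "ok"

# Example (hypothetical path):
# keytab_readable("/etc/security/keytabs/spark.keytab")
```

Run this as the same user that launches the Spark workers, since that is the identity that must read the keytab.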

Related

Unable to Connect Remote Spark Session with YARN mode on Kubeflow

The main problem is that we are unable to run Spark in client mode.
Whenever we try to connect to Spark in YARN mode from a Kubeflow notebook, we get the following error:
`Py4JJavaError: An error occurred while calling o81.showString.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:932)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:930)`
It seems we have exactly the same issue here:
So far:
We have managed to submit Spark jobs from the notebook.
It is also possible to connect in cluster mode from the Kubeflow notebook.
We have also managed to run a Spark session with the Python shell on one of the worker servers on Kubernetes, and we are able to connect to a remote edge node managed by Cloudera.
We have checked that there are no network issues between the Hadoop cluster and the Kubernetes cluster.
However, we still cannot access interactive Spark from the Jupyter notebook.

How to load data into spark from a remote HDFS?

Our data is stored on a remote Hadoop cluster, but for doing a PoC I need to run my Spark application locally on my machine. How can I load data from that remote HDFS?
You can configure Spark to access any Hadoop instance you have access to (ports open, nodes reachable).
Custom Hadoop/Hive Configuration
If your Spark application is interacting with Hadoop, Hive, or both, there are probably Hadoop/Hive
configuration files in Spark’s classpath.
Multiple running applications might require different Hadoop/Hive
client side configurations. You can copy and modify hdfs-site.xml,
core-site.xml, yarn-site.xml, hive-site.xml in Spark’s classpath for
each application. In a Spark cluster running on YARN, these
configuration files are set cluster-wide, and cannot safely be changed
by the application.
As you want to access HDFS, you need hdfs-site.xml and core-site.xml from the cluster you are trying to access.
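As an illustration, the smallest useful core-site.xml simply points fs.defaultFS at the remote cluster. The sketch below generates one (the namenode address is a placeholder, and a real file copied from the cluster will contain many more properties):

```python
def minimal_core_site(default_fs):
    """Render a minimal core-site.xml whose fs.defaultFS points at the
    remote HDFS; real cluster files carry many additional properties."""
    return (
        "<configuration>\n"
        "  <property>\n"
        "    <name>fs.defaultFS</name>\n"
        "    <value>" + default_fs + "</value>\n"
        "  </property>\n"
        "</configuration>\n"
    )

# e.g. minimal_core_site("hdfs://namenode.example.com:8020")
```

In practice you should still copy the real files from the cluster, since properties such as HA nameservices and security settings live there too.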
For anyone who wants to access a remote HDFS from a Spark Java app, here are the steps.
First, you need to add a --conf key to your run command. The key depends on the Spark version:
(Spark 1.x-2.1)
spark.yarn.access.namenodes=hdfs://clusterA,hdfs://clusterB
(Spark 2.2+) spark.yarn.access.hadoopFileSystems=hdfs://clusterA,hdfs://clusterB
Secondly, when creating Spark's Java context, add the following:
javaSparkContext.hadoopConfiguration().addResource(new Path("core-site-clusterB.xml"));
javaSparkContext.hadoopConfiguration().addResource(new Path("hdfs-site-clusterB.xml"));
If you are facing this exception:
java.net.UnknownHostException: clusterB
then try putting the full namenode address of your remote HDFS, including the port (instead of the hdfs/cluster short name), into the --conf entry of your run command.
More details in my article: https://mchesnavsky.tech/spark-java-access-remote-hdfs.
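The version-dependent choice between the two --conf keys above can be sketched as a small helper. This is a sketch only: it covers just the two keys named above and assumes a plain major.minor[.patch] version string:

```python
def remote_hdfs_submit_args(spark_version, namenodes):
    """Pick the --conf entry that grants the application access to
    extra HDFS namespaces: the key was renamed in Spark 2.2."""
    major, minor = (int(x) for x in spark_version.split(".")[:2])
    key = (
        "spark.yarn.access.hadoopFileSystems"
        if (major, minor) >= (2, 2)
        else "spark.yarn.access.namenodes"
    )
    return ["--conf", key + "=" + ",".join(namenodes)]

# e.g. remote_hdfs_submit_args("2.3.1", ["hdfs://clusterA", "hdfs://clusterB"])
```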

Does Spark 2.4.4 support forwarding Delegation Tokens when master is k8s?

I'm currently in the process of setting up a Kerberized environment for submitting Spark Jobs using Livy in Kubernetes.
What I've achieved so far:
Running Kerberized HDFS Cluster
Livy using SPNEGO
Livy submitting Jobs to k8s and spawning Spark executors
KNIME is able to interact with Namenode and Datanodes from outside the k8s Cluster
To achieve this I used the following Versions for the involved components:
Spark 2.4.4
Livy 0.5.0 (The currently only supported version by KNIME)
Namenode and Datanode 2.8.1
Kubernetes 1.14.3
What I'm currently struggling with:
Accessing HDFS from the Spark executors
The error message I'm currently getting, when trying to access HDFS from the executor is the following:
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "livy-session-0-1575455179568-exec-1/10.42.3.242"; destination host is: "hdfs-namenode-0.hdfs-namenode.hdfs.svc.cluster.local":8020;
The following is the current state:
KNIME connects to HDFS after having successfully challenged against the KDC (using Keytab + Principal) --> Working
KNIME puts staging jars to HDFS --> Working
KNIME requests new Session from Livy (SPNEGO challenge) --> Working
Livy submits Spark Job with k8s master / spawns executors --> Working
KNIME submits tasks to Livy which should be executed by the executors --> Basically working
When trying to access HDFS to read a file the error mentioned before occurs --> The problem
Since KNIME is placing jar files on HDFS which have to be included in the dependencies for the Spark Jobs it is important to be able to access HDFS. (KNIME requires this to be able to retrieve preview data from DataSets for example)
I tried to find a solution to this but unfortunately haven't found any useful resources yet.
I had a look at the code and checked UserGroupInformation.getCurrentUser().getTokens().
But that collection seems to be empty. That's why I assume that there are no delegation tokens available.
Has anybody ever achieved running something like this and can help me with this?
Thank you all in advance!
For everybody struggling with this:
It took a while to find out why this is not working, but basically it is related to Spark's Kubernetes implementation as of 2.4.4.
There is no override of CoarseGrainedSchedulerBackend's fetchHadoopDelegationTokens in KubernetesClusterSchedulerBackend.
There is a pull request that solves this by passing secrets containing the delegation tokens to the executors.
It was already merged into master and is available in Spark 3.0.0-preview, but is not, at least not yet, available in the Spark 2.4 branch.
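For anyone debugging the same symptom: besides inspecting UserGroupInformation.getCurrentUser().getTokens(), Hadoop processes normally discover delegation tokens through the HADOOP_TOKEN_FILE_LOCATION environment variable. A tiny sketch of that check (it only inspects the environment; the standard Hadoop token-file mechanism is assumed):

```python
import os

def delegation_token_file():
    """Return the path of the delegation-token file a Hadoop client
    would read, or None. An executor with no token file and no
    Kerberos credentials cannot authenticate via TOKEN or KERBEROS,
    which matches the AccessControlException in the question."""
    return os.environ.get("HADOOP_TOKEN_FILE_LOCATION")
```

Running this inside the executor container quickly shows whether the driver ever shipped tokens at all.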

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I try to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
the following error is thrown:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only run
start-dfs.sh
in Hadoop and have not really configured anything in Spark. Do I need to run Spark with the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and can access HDFS files?
I have tried configuring yarn-site.xml in Hadoop following the tutorialspoint guide https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm, and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
Thanks!
EDIT
The initial Hadoop version was 2.8.0 and the Spark version was 2.1.1, built against Hadoop 2.7. I tried downloading hadoop-2.7.4, but the same error still exists.
The question here suggests this is a Java serialization issue rather than a Spark/HDFS issue. I will try this approach and see if it solves the error.
Inspired by the post here, I solved the problem myself.
The map-reduce job depends on a Serializable class, so when running in Spark local mode, this serializable class can be found and the job executes correctly.
When running in Spark standalone cluster mode, it is best to submit the application through spark-submit rather than running it from an IDE. I packaged everything into a jar and ran it with spark-submit, and it worked like a charm!
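For illustration, the submit step can be sketched as assembling the usual spark-submit invocation (the jar path, class name, and master URL below are placeholders for your own values):

```python
import shlex

def build_spark_submit(jar, main_class, master):
    """Assemble the spark-submit command line for a packaged jar;
    run it with subprocess.run(...) or paste it into a shell."""
    cmd = ["spark-submit", "--class", main_class, "--master", master, jar]
    return " ".join(shlex.quote(part) for part in cmd)

# e.g. build_spark_submit("target/app.jar", "com.example.Main",
#                         "spark://master-host:7077")
```

Submitting a jar this way ensures the application classes (including the Serializable class above) are shipped to the executors, which running from an IDE does not guarantee.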

Spark Application connecting to kerborized Hadoop cluster

I have a Spark machine running in standalone mode, with a Spark job that writes to Kerberized HDFS.
Based on the Cloudera documentation, standalone Spark can't connect to Kerberized HDFS. Is this true?
https://www.cloudera.com/documentation/enterprise/5-5-x/topics/sg_spark_auth.html
My Spark node is not Kerberized. Do I need to switch to YARN mode to write to Kerberized HDFS? Does my Spark cluster also need to be Kerberized to connect to HDFS?
I have posted this earlier, but none of these worked for me:
Kinit with Spark when connecting to Hive
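One thing that is often tried in this situation (not guaranteed to be sufficient for a non-Kerberized standalone cluster, as the question notes) is obtaining a TGT from a keytab with kinit before launching the job. A sketch of that invocation (the principal and keytab path are placeholders):

```python
def kinit_command(principal, keytab):
    """Build the kinit invocation that obtains a Kerberos TGT from a
    keytab; pass the list to subprocess.run on the submitting host."""
    return ["kinit", "-kt", keytab, principal]

# e.g. kinit_command("sparkuser@EXAMPLE.COM",
#                    "/etc/security/keytabs/sparkuser.keytab")
```

In standalone mode the resulting ticket cache is only visible to processes on the same host as the same user, which is one reason executors on other nodes can still fail to authenticate.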