I'm running a PySpark (Spark 3.1.1) application in cluster mode on a YARN cluster. It is supposed to process input data and send appropriate Kafka messages to a given topic.
The data manipulation part is already covered, but I am struggling to use the kafka-python library to send the notifications. The problem is that it can't find a valid Kerberos ticket to authenticate to the Kafka cluster.
When executing spark3-submit I add the --principal and --keytab options (equivalent to spark.kerberos.principal and spark.kerberos.keytab). Moreover, I am able to access HDFS and HBase resources.
Does Spark store the TGT in a ticket cache that I can reference by setting the KRB5CCNAME environment variable? I am not able to locate a valid Kerberos ticket while the app is running.
Is it common to issue kinit from a PySpark application to create a ticket for accessing resources outside HDFS etc.? I tried using the krbticket module to issue the kinit command from the app (using the keytab that I pass as a parameter to spark3-submit), but then the process hangs.
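For illustration, here is a minimal sketch of running kinit from the Python process with plain subprocess instead of krbticket, with a timeout so that a hang surfaces as an exception. It assumes MIT Kerberos kinit is on the PATH of the container and that the keytab is readable at the given path; the function names, principal and paths are hypothetical, and it has not been verified against this particular cluster.

```python
import os
import subprocess


def build_kinit_command(principal, keytab_path, ccache_path):
    """Return the kinit argument list for a keytab-based login."""
    # -k: authenticate with a keytab, -t: keytab file, -c: credential cache to write
    return ["kinit", "-k", "-t", keytab_path, "-c", ccache_path, principal]


def kinit_from_keytab(principal, keytab_path, ccache_path, timeout_s=30):
    """Run kinit non-interactively and point this process at the new cache."""
    cmd = build_kinit_command(principal, keytab_path, ccache_path)
    # stdin is detached so kinit cannot block on an interactive prompt;
    # the timeout raises subprocess.TimeoutExpired instead of hanging forever.
    subprocess.run(cmd, stdin=subprocess.DEVNULL, check=True, timeout=timeout_s)
    # GSSAPI-based clients started from this process look up the cache via KRB5CCNAME.
    os.environ["KRB5CCNAME"] = "FILE:" + ccache_path
    return os.environ["KRB5CCNAME"]


# Usage (hypothetical principal and paths):
#   kinit_from_keytab("app@EXAMPLE.COM", "app.keytab", "/tmp/krb5cc_app")
```

Setting KRB5CCNAME before the Kafka producer is created matters here, because the GSSAPI library reads the variable when it acquires credentials.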
We have a Spark application that reads data using Spark SQL from HMS tables built on Parquet files stored in HDFS. The Spark application is running on a separate Hadoop environment. We use delegation tokens to allow the Spark application to authenticate to the Kerberized HMS/HDFS. We cannot and must not use keytabs to authenticate the Spark application directly.
Because delegation tokens expire, after a certain period of time our Spark application will no longer be able to authenticate, and it will fail if it has not completed within the timeframe during which the token is valid.
My question is this.
If I call .cache or .persist on the source dataframe against which all subsequent operations are executed, my understanding is that this will cause Spark to store all the data in memory. If all the data is in memory, it should not need to make subsequent calls to read leaf files in HDFS, and the authentication error could be avoided. Note that the Spark application has its own local file system; it is not using the remote HDFS source as its default fs.
Is this assumption about the behavior of .cache or .persist correct, or is the only solution to rewrite the data to intermediate storage?
Solve the Kerberos issue instead of adding workarounds. I'm not sure how you are using the Kerberos principal, but I will point out that the documentation describes a solution for this issue:
Long-Running Applications
Long-running applications may run into
issues if their run time exceeds the maximum delegation token lifetime
configured in services it needs to access.
This feature is not available everywhere. In particular, it’s only
implemented on YARN and Kubernetes (both client and cluster modes),
and on Mesos when using client mode.
Spark supports automatically creating new tokens for these
applications. There are two ways to enable this functionality.
Using a Keytab
By providing Spark with a principal and keytab (e.g.
using spark-submit with --principal and --keytab parameters), the
application will maintain a valid Kerberos login that can be used to
retrieve delegation tokens indefinitely.
Note that when using a keytab in cluster mode, it will be copied over
to the machine running the Spark driver. In the case of YARN, this
means using HDFS as a staging area for the keytab, so it’s strongly
recommended that both YARN and HDFS be secured with encryption, at
least.
I would also point out that caching reduces visits to HDFS but may still require reads from HDFS if there isn't sufficient space in memory. If you can't solve the Kerberos issue for whatever reason, you may wish to use checkpoints instead. They are slower than caching, but they are made specifically to help long-running processes that sometimes fail get over the hurdle of expensive recalculation. They do require writing to disk, and that written copy removes any need to revisit the original HDFS cluster. Typically they're used in streaming to truncate data lineage, but they also have their place in expensive, long-running Spark applications. (You also need to manage their cleanup.)
How to recover with a checkpoint file.
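As a rough sketch of the checkpoint approach described above, the helper below sets a checkpoint directory and eagerly checkpoints a DataFrame, so that later stages read the checkpointed copy instead of the original source. It uses SparkContext.setCheckpointDir and DataFrame.checkpoint from PySpark; the helper name, the table name and the directory are hypothetical, and the directory is assumed to live on storage the application can reach without the remote cluster's tokens.

```python
def materialize_with_checkpoint(spark, df, checkpoint_dir):
    """Checkpoint df under checkpoint_dir and return the checkpointed DataFrame."""
    # The checkpoint directory has to be set before checkpoint() is called.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    # eager=True writes the data now, while the delegation token is still valid,
    # rather than lazily at the first action.
    return df.checkpoint(eager=True)


# Usage (requires a running SparkSession; table name and path are hypothetical):
#   source = spark.table("db.events")
#   local_copy = materialize_with_checkpoint(spark, source, "/path/on/own/filesystem")
#   local_copy.groupBy("key").count().show()
```

The checkpoint files are not removed automatically by default, so the directory needs to be cleaned up once the job is done.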
I have a working Spark cluster, with a master node and some worker nodes running on Kubernetes. This cluster has been used for multiple spark submit jobs and is operational.
On the master node, I have started up a Spark History server using the $SPARK_HOME/sbin/start-history-server.sh script and some configs to determine where the History Server's logs should be written:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
spark.hadoop.fs.s3a.access.key=...
spark.hadoop.fs.s3a.secret.key=...
spark.hadoop.fs.s3a.endpoint=...
spark.hadoop.fs.s3a.path.style.access=true
This was done a while after the cluster became operational. The server is supposed to write the logs to an external object store (MinIO, using the s3a protocol).
Now, whenever I submit Spark jobs, it seems like nothing is being written to the location I'm specifying.
I'm wondering about the following: How can the workers know I have started up the spark history server on the master node? Do I need to communicate this to the workers somehow?
Possible causes that I have checked:
No access/permissions to write to MinIO: This shouldn't be the case, as I'm running spark-submit jobs that read/write files to the same MinIO using the same settings
Logs folder does not exist: I was getting these errors before, but then I created a location for the files to be written to, and since then I'm not getting these errors
spark.eventLog.dir should be the same as spark.history.fs.logDirectory: they are
Just found out the answer: the way your workers will know where to store the logs is by supplying the following configs to your spark-submit job:
spark.eventLog.enabled=true
spark.eventLog.dir=...
spark.history.fs.logDirectory=...
It is probably also enough to have these in the spark-defaults.conf used by the driver program. That is why I couldn't find a lot of info on this: I hadn't added them to my spark-defaults.conf.
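To make the answer above concrete, here is a small sketch that builds the `--conf` flags for spark-submit. The helper name and the example directory are hypothetical. It only emits spark.eventLog.enabled and spark.eventLog.dir, since those are the settings the application itself uses to write its event log; spark.history.fs.logDirectory is read by the History Server process, so passing it to the job should be harmless but is not what makes the logs appear.

```python
def event_log_submit_args(event_log_dir):
    """Build spark-submit --conf flags that make an application write event logs."""
    conf = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
    }
    args = []
    for key, value in conf.items():
        # Each setting becomes its own "--conf key=value" pair.
        args += ["--conf", f"{key}={value}"]
    return args


# Usage (hypothetical bucket and job file):
#   cmd = ["spark-submit", *event_log_submit_args("s3a://my-bucket/spark-events"), "job.py"]
```

The s3a access settings from the question still have to be available to the job as well, which the question says is already the case.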
My code is at a point where I can successfully authenticate to Kerberized Livy to submit a Spark job. The issue now is that in the Spark submit job parameters I can only pass one keytab+principal pair. So far I am passing my Spark keytab+principal so that the job can be submitted at all, but soon after the Spark job starts I get an error, something about the wrong YARN user, and I don't know how to proceed.
Below is the error I get. I have Googled this error and tried all the solutions I found but nothing worked so far.
Things I've tried:
Making sure the spark user exists, and that it is able to be authenticated via keytab+principal.
Trying to run the Spark submit with different principals (hadoop, spark, yarn).
Making sure the user/principal is present on all Hadoop nodes.
Making sure my Scala jar is valid and correct.
main : run as user is spark
main : requested yarn user is spark
User spark not found
By the way, why does it say user spark not found? I checked my users list, spark is one of the users.
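One possible reading, which is an assumption and not confirmed anywhere in this thread: these lines look like output from YARN's container-executor, which resolves the user as an operating system account on the NodeManager host where the container is launched. A Kerberos principal, or an account that exists on only some hosts, would not satisfy that lookup. A quick sketch for checking the OS account on a given host (the helper name is hypothetical):

```python
import pwd


def os_user_exists(username):
    """Return True if username resolves to an OS account on this host."""
    try:
        # getpwnam consults the host's user database (local files, LDAP/SSSD, ...).
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False


# Usage: run on each NodeManager host
#   print(os_user_exists("spark"))
```

If this returns False on any NodeManager host, that would be consistent with the error even though the principal authenticates fine.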
I'm currently in the process of setting up a Kerberized environment for submitting Spark Jobs using Livy in Kubernetes.
What I've achieved so far:
Running Kerberized HDFS Cluster
Livy using SPNEGO
Livy submitting Jobs to k8s and spawning Spark executors
KNIME is able to interact with Namenode and Datanodes from outside the k8s Cluster
To achieve this I used the following Versions for the involved components:
Spark 2.4.4
Livy 0.5.0 (currently the only version supported by KNIME)
Namenode and Datanode 2.8.1
Kubernetes 1.14.3
What I'm currently struggling with:
Accessing HDFS from the Spark executors
The error message I'm currently getting when trying to access HDFS from the executor is the following:
org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "livy-session-0-1575455179568-exec-1/10.42.3.242"; destination host is: "hdfs-namenode-0.hdfs-namenode.hdfs.svc.cluster.local":8020;
The following is the current state:
KNIME connects to HDFS after having successfully challenged against the KDC (using Keytab + Principal) --> Working
KNIME puts staging jars to HDFS --> Working
KNIME requests new Session from Livy (SPNEGO challenge) --> Working
Livy submits Spark Job with k8s master / spawns executors --> Working
KNIME submits tasks to Livy which should be executed by the executors --> Basically working
When trying to access HDFS to read a file the error mentioned before occurs --> The problem
Since KNIME is placing jar files on HDFS which have to be included in the dependencies for the Spark Jobs it is important to be able to access HDFS. (KNIME requires this to be able to retrieve preview data from DataSets for example)
I tried to find a solution to this but unfortunately, haven't found any useful resources yet.
I had a look at the code and checked UserGroupInformation.getCurrentUser().getTokens().
But that collection seems to be empty. That's why I assume that there are no delegation tokens available.
Has anybody ever achieved running something like this and can help me with this?
Thank you all in advance!
For everybody struggling with this:
It took a while to find the reason why this is not working, but basically it is related to Spark's Kubernetes implementation as of 2.4.4.
There is no override defined for CoarseGrainedSchedulerBackend's fetchHadoopDelegationTokens in KubernetesClusterSchedulerBackend.
There is a pull request that solves this by passing secrets containing the delegation tokens to the executors.
It was already pulled into master and is available in Spark 3.0.0-preview but is not, at least not yet, available in the Spark 2.4 branch.
I have a standalone Spark cluster on Kubernetes, and I want to use it to load some temp views in memory and expose them via JDBC using the Spark Thrift Server.
I already got it working with no security by submitting a Spark job (PySpark in my case) and starting the Thrift Server in this same job so I can access the temp views.
Since I'll need to expose some sensitive data, I want to apply at least an authentication mechanism.
I've been reading a lot and I see basically 2 methods to do so:
PAM - which is not advised for production, since some critical files need to grant permissions to users other than root.
Kerberos - which appears to be the most appropriate one for this situation.
My question is:
- For a standalone Spark cluster (running on K8s), is Kerberos the best approach? If not, which one is?
- If Kerberos is the best one, how do I set it up to work with the Spark Thrift Server? It's really hard to find guidance or a step-by-step walkthrough, especially in my case, where I'm not using any specific distribution (MapR, Hortonworks, etc.).
Appreciate your help