How to refresh a Kerberos ticket in a running Structured Streaming Spark application once every 7 days? - apache-spark

I've been running a Structured Streaming application that joins 2 streams from Kafka and pushes the result to a third stream. The application fails every 7 days because the HDFS_DELEGATION_TOKEN expires. I'm using a JAAS file to pass the relevant configuration:
RegistryClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./user.keytab"
storeKey=true
useTicketCache=false
principal="uder#Principal";
};

Pass the parameters below in the spark-submit command:
--conf spark.yarn.keytab=/path/to/file.keytab
--conf spark.yarn.principal=principalName@domain
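For illustration, a full spark-submit invocation along these lines might look roughly as follows; kafka_client_jaas.conf, com.example.StreamJoinApp, and stream-join-app.jar are hypothetical names used only for this sketch, not values from the question:

spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.keytab=/path/to/file.keytab \
  --conf spark.yarn.principal=principalName@domain \
  --files kafka_client_jaas.conf \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \
  --class com.example.StreamJoinApp \
  stream-join-app.jar

With spark.yarn.keytab and spark.yarn.principal set, Spark on YARN can use the keytab to obtain fresh HDFS delegation tokens before the 7-day maximum lifetime runs out, instead of letting the job die.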

Related

Spark NiFi site to site connection

I am new to NiFi. I am trying to send data from NiFi to Spark, or to establish a stream from a NiFi output port to Spark, according to this tutorial.
NiFi is running on Kubernetes, and I am using the Spark operator on the same cluster to submit my applications.
It seems like Spark is able to access the NiFi web UI and it starts a streaming receiver. However, data is not coming into the Spark app through the output port and I get empty RDDs. I have not seen any warnings or errors in the Spark logs.
Any idea or information that could help me solve this issue is appreciated.
My code:
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.storage.StorageLevel

// site-to-site client configuration pointing at the NiFi output port named "spark"
val conf = new SiteToSiteClient.Builder()
  .keystoreFilename("..")
  .keystorePass("...")
  .keystoreType(...)
  .truststoreFilename("..")
  .truststorePass("..")
  .truststoreType(...)
  .url("https://...../nifi")
  .portName("spark")
  .buildConfig()
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
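For reference, this is roughly how such a receiver stream is usually consumed and started; the print/start/awaitTermination calls below are illustrative and not taken from the question:

// each element is a NiFiDataPacket; pull out the FlowFile content as text
lines.map(packet => new String(packet.getContent)).print()

ssc.start()             // start pulling from the NiFi output port
ssc.awaitTermination()  // keep the streaming context alive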

Spark doesn't acknowledge Kerberos authentication and the application fails when the delegation token is issued

I'm using Spark to read data files from HDFS.
When I perform a Spark action, a Spark exception is raised:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication
In the logs, before the exception is thrown, I can see:
WARN [hadoop.security.UserGroupInformation] PriviledgedActionException as:principal (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
[21/02/22 17:27:17.439] WARN [hadoop.ipc.Client] Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
[21/02/22 17:27:17.440] WARN [hadoop.security.UserGroupInformation] PriviledgedActionException as:principal (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
This is really weird because I set up the spark-submit config with Kerberos using the --keytab and --principal configs:
spark.master yarn-cluster
spark.app.name app
spark.submit.deployMode cluster
spark.yarn.principal pincipal/principal.com
spark.yarn.keytab all.keytab
spark.driver.memory 4G
spark.executor.memory 8G
spark.executor.instances 4
spark.executor.cores 8
spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url jdbc:phoenix:m1,m2,m3,m4:2181:/hbase
spark.driver.extraJavaOptions -XX:MaxPermSize=1024M -Dlog4j.configuration=log4j.xml
spark.executor.extraJavaOptions -Dlog4j.configuration=log4j.xml
I don't understand why issuing the delegation token wouldn't be possible, since it is set up with Kerberos auth.
I also don't understand why it displays those warnings as if the authentication mode of my Spark job were set to SIMPLE. Is Spark ignoring my config?
I have 2 environments, one on which the application works properly, but I don't know which config I should look at.
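One thing worth checking in a case like this is whether the driver actually logged in via Kerberos or fell back to SIMPLE auth, as the "(auth:SIMPLE)" in the warnings suggests. A rough diagnostic sketch using the Hadoop UserGroupInformation API; where exactly you log this is up to you:

import org.apache.hadoop.security.UserGroupInformation

// expect KERBEROS here when the keytab/principal settings were actually applied;
// SIMPLE would match the "(auth:SIMPLE)" shown in the warnings above
val ugi = UserGroupInformation.getCurrentUser
println(s"user=${ugi.getUserName} auth=${ugi.getAuthenticationMethod}")
println(s"securityEnabled=${UserGroupInformation.isSecurityEnabled}")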

Using Apache Spark with a local S3-compatible Object store

I am trying to run a simple Apache Spark (Cloudera) read operation using a local object store that is fully S3 SDK/API compatible, but I cannot seem to figure out how to get Spark to understand that I am trying to access a local S3 bucket and not remote AWS S3.
Here's what I've tried...
pyspark2 --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks --conf fs.s3a.endpoint=https://myenvironment.domain.com
df = spark.read.parquet("s3a://mybucket/path1/")
Error message...
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to mybucket.s3.amazonaws.com:443 [mybucket.s3.amazonaws.com/12.345.678.90] failed: Connection refused (Connection refused)
I can list the local bucket contents without issue on the command line, so I know that I have the access/secret keys correct, but I need to make Spark understand not to reach out to AWS to try to resolve the bucket URL.
Thanks.
Update / Resolution:
The fix was adding a missing prerequisite jar at Maven coordinates org.apache.hadoop:hadoop-aws:2.6.0.
So the final pyspark call looked like:
pyspark2 --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks --conf fs.s3a.endpoint=https://myenvironment.domain.com --jars hadoop-aws-2.6.0.jar
df = spark.read.parquet("s3a://mybucket/path1/")
This is covered in HDP docs, Working with third party object stores.
Settings are the same for CDH.
It comes down to:
the endpoint: fs.s3a.endpoint = hostname
disabling the DNS-to-bucket mapping: fs.s3a.path.style.access = true
playing with the signing options.
There are a few other switches you can turn for better compatibility; they're in those docs.
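For illustration, pushing those two switches through Spark's spark.hadoop. prefix for Hadoop options might look roughly like this; the endpoint hostname is the placeholder from the question, not a verified value:

pyspark2 \
  --conf spark.hadoop.fs.s3a.endpoint=https://myenvironment.domain.com \
  --conf spark.hadoop.fs.s3a.path.style.access=true

df = spark.read.parquet("s3a://mybucket/path1/")

With path-style access enabled, the S3A connector sends requests to https://myenvironment.domain.com/mybucket/... instead of trying to resolve mybucket.s3.amazonaws.com.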
You might find the Cloudstore storediag command useful.

Kerberos ticket renewal on a Spark streaming job that communicates with Kafka

I have a long-running Spark streaming job that runs on a Kerberized Hadoop cluster. It fails every few days with the following error:
Diagnostics: token (token for XXXXXXX: HDFS_DELEGATION_TOKEN owner=XXXXXXXXX#XX.COM, renewer=yarn, realUser=, issueDate=XXXXXXXXXXXXXXX, maxDate=XXXXXXXXXX, sequenceNumber=XXXXXXXX, masterKeyId=XXX) can't be found in cache
I tried adding the --keytab and --principal options to spark-submit, but we already have the following options that do the same thing.
For the second option, we already pass in the keytab and principal with the following:
'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf -Djava.security.krb5.conf=krb5.conf -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12' \
The same goes for spark.executor.extraJavaOptions. If we add the --principal and --keytab options, it results in an "attempt to add file (keytab) multiple times to distributed cache" error.
There are 2 ways you can do it:
Have a shell script that regenerates the ticket from the keytab at a regular interval (a sketch of such a script follows after this list).
[RECOMMENDED] Pass your keytab to Spark, with access restricted strictly to the spark user, and it can automatically regenerate the tickets for you. Visit this Cloudera community page for more details. It's just a simple set of steps and you can get going!
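For the first approach, the renewal script is typically just a scheduled kinit from the keytab; a hypothetical crontab entry, where the principal, realm, and paths are placeholders:

# re-initialise the Kerberos ticket cache from the keytab every 6 hours
0 */6 * * * kinit -kt /path/to/user.keytab user@REALM.COM

Note that kinit only refreshes the local ticket cache; the second, recommended approach is what lets Spark itself re-obtain delegation tokens for a job already running on YARN.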
Hope that helps!

Spark - pass property to spark-submit

Spark 1.5.1 with --master yarn-cluster. What I am trying to accomplish is to pass a variable to the spark-submit command that will uniquely identify the spawned application. I submit Spark jobs from an external application via a web service (we have another simple web-layer application on Dropwizard with an endpoint that submits applications). Another web service will return the status of an operation for a given identifier. The flow:
SUBMIT JOB:
MyApp -> "/Dropwizard/submit-job?id=100" -> Dropwizard -> "spark-submit --conf=id=100" -> Spark
GET STATUS
MyApp -> "/Dropwizard/status?id=100" -> Dropwizard -> "this will get information from files that are created when the application runs. Files will have the id in their names"
The problem is that sparkContext.getConf().get("id") returns null.
Can you please give me a clue how to use --conf, or suggest another way I could resolve the problem?
It should be --conf id=100 as shown in the samples here
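A minimal sketch of how this usually fits together; note that spark-submit normally forwards only properties whose names start with spark., so the key below uses a spark.-prefixed name as an assumption rather than the bare id from the question, and the class and jar names are hypothetical:

spark-submit --master yarn-cluster --conf spark.myapp.id=100 --class com.example.MyJob my-job.jar

// in the driver, read the value back from the SparkConf
val id = sparkContext.getConf().get("spark.myapp.id")  // "100"

The same value can then be used to name the status files that the Dropwizard status endpoint reads back.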
