Kerberos ticket renewal for a Spark streaming job that communicates with Kafka - apache-spark

I have a long running Spark streaming job that runs on a kerberized Hadoop cluster. It fails every few days with the following error:
Diagnostics: token (token for XXXXXXX: HDFS_DELEGATION_TOKEN owner=XXXXXXXXX@XX.COM, renewer=yarn, realUser=, issueDate=XXXXXXXXXXXXXXX, maxDate=XXXXXXXXXX, sequenceNumber=XXXXXXXX, masterKeyId=XXX) can't be found in cache
I tried adding the --keytab and --principal options to spark-submit, but we already pass the keytab and principal through the Kafka JAAS configuration, via the following options:
'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf -Djava.security.krb5.conf=krb5.conf -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=12' \
The same is set for spark.executor.extraJavaOptions. If we add the --principal and --keytab options on top of this, it results in "attempt to add file (keytab) multiple times to distributed cache".

There are two ways you can do it:
1. Have a shell script that regenerates the keytab/ticket at a regular interval.
2. [RECOMMENDED] Pass your keytab to Spark with access restricted to the spark user only, and Spark can automatically regenerate the tickets for you. Visit this Cloudera community page for more details. It's just a handful of simple steps and you can get going! See the example spark-submit command below.
Hope that helps!
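A minimal sketch of the recommended option, with hypothetical paths, principal, and jar name (none of these come from the original post). A common pitfall: if the keytab referenced by kafka_client_jaas.conf is shipped with --files and the same file is also given to --keytab, spark-submit reports the "added multiple times to distributed cache" error, so a renamed copy is often used on the JAAS side:

# renamed copy referenced by kafka_client_jaas.conf; the original path goes to --keytab
cp /etc/security/keytabs/user.keytab user-jaas.keytab

spark-submit \
  --master yarn --deploy-mode cluster \
  --principal user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/user.keytab \
  --files kafka_client_jaas.conf,user-jaas.keytab,krb5.conf \
  --conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf -Djava.security.krb5.conf=krb5.conf' \
  --conf 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf -Djava.security.krb5.conf=krb5.conf' \
  --class com.example.StreamingJob streaming-job.jar

With --principal and --keytab set, Spark can obtain fresh HDFS delegation tokens from the keytab before the existing ones reach their maxDate, which is what keeps a long-running job alive.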

Related

Spark doesn't acknowledge Kerberos authentication and the application fails when a delegation token is issued

I'm using Spark to read data files from HDFS.
When I perform a Spark action, a Spark exception is raised:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication
In the logs, before the exception is thrown, I can see:
WARN [hadoop.security.UserGroupInformation] PriviledgedActionException as:principal (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
[21/02/22 17:27:17.439] WARN [hadoop.ipc.Client] Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
[21/02/22 17:27:17.440] WARN [hadoop.security.UserGroupInformation] PriviledgedActionException as:principal (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
This is really weird, because I set the spark-submit config up with Kerberos using the keytab and principal configs:
spark.master yarn-cluster
spark.app.name app
spark.submit.deployMode cluster
spark.yarn.principal pincipal/principal.com
spark.yarn.keytab all.keytab
spark.driver.memory 4G
spark.executor.memory 8G
spark.executor.instances 4
spark.executor.cores 8
spark.deploy.recoveryMode ZOOKEEPER
spark.deploy.zookeeper.url jdbc:phoenix:m1,m2,m3,m4:2181:/hbase
spark.driver.extraJavaOptions -XX:MaxPermSize=1024M -Dlog4j.configuration=log4j.xml
spark.executor.extraJavaOptions -Dlog4j.configuration=log4j.xml
I don't understand why issuing the delegation token wouldn't be possible, since it is set up with Kerberos auth.
I also don't understand why it displays those warnings as if the authentication mode of my Spark job was set to SIMPLE. Is Spark ignoring my config?
I have two environments; the application works properly on one of them, but I don't know which config I should look at.

How to refresh the Kerberos ticket in a running Structured Streaming Spark application once every 7 days?

I've been running a Structured Streaming application to join two streams from Kafka and push to a third stream. The application fails once every 7 days as the HDFS_DELEGATION_TOKEN expires. I'm using a JAAS file to pass the relevant configuration:
RegistryClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="./user.keytab"
storeKey=true
useTicketCache=false
principal="uder#Principal";
};
Pass the below parameters in the spark-submit command (a fuller sketch follows below):
--conf spark.yarn.keytab=/path/to/file.keytab
--conf spark.yarn.principal=principalName@domain
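A hedged end-to-end sketch of the submit command, with hypothetical file names, principal, and class (not taken from the original post). The JAAS entry above refers to ./user.keytab, so that keytab is shipped with --files under the same name; as in the first question on this page, handing spark.yarn.keytab a copy with a different file name helps avoid the distributed-cache name clash:

spark-submit \
  --master yarn --deploy-mode cluster \
  --files jaas.conf,user.keytab \
  --conf spark.yarn.keytab=/path/to/user-copy.keytab \
  --conf spark.yarn.principal=principalName@domain \
  --conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf' \
  --conf 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf' \
  --class com.example.StreamJoinApp app.jar

With the keytab and principal known to Spark, the HDFS_DELEGATION_TOKEN can be re-obtained before the 7-day expiry instead of the application failing.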

Pyspark job queue config precedence - spark-submit vs SparkSession.builder

I have a shell script which runs a spark-submit command. I want to specify the name of the resource queue on which the job runs.
When I use:
spark-submit --queue myQueue job.py (here the job is properly submitted on 'myQueue')
But when I use spark-submit job.py and inside job.py I create a Spark session like:
spark = SparkSession.builder.appName(appName).config("spark.yarn.queue", "myQueue").getOrCreate()
In this case the job runs on the default queue. Also, checking the configs of this running job in the Spark UI shows the queue name as "myQueue", but the job still runs on the default queue only.
Can someone explain how I can pass the queue name in the SparkSession.builder config so that it takes effect?
Using pyspark version 2.3
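Not a definitive answer, but a hedged note on the ordering involved, reusing the job.py script and myQueue queue from the question: when the job is submitted in cluster mode, the YARN application (and hence its queue) is created by spark-submit before job.py ever runs, so a value set later through SparkSession.builder can appear in the UI's environment tab without affecting where the application was placed. Setting the queue at submit time avoids that:

spark-submit --queue myQueue job.py
spark-submit --conf spark.yarn.queue=myQueue job.py

Both forms are equivalent; --queue is shorthand for the spark.yarn.queue configuration.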

Using Apache Spark with a local S3-compatible Object store

I am trying to run a simple Apache Spark (Cloudera) read operation using a local object store that is fully S3 SDK/API compatible, but I cannot seem to figure out how to get Spark to understand that I am trying to access a local S3 bucket and not remote AWS S3.
Here's what I've tried...
pyspark2 --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks --conf fs.s3a.endpoint=https://myenvironment.domain.com
df = spark.read.parquet("s3a://mybucket/path1/")
Error message...
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to mybucket.s3.amazonaws.com:443 [mybucket.s3.amazonaws.com/12.345.678.90] failed: Connection refused (Connection refused)
I can list the local bucket contents without issue on the command line, so I know that I have the access/secret keys correct, but I need to make Spark understand not to reach out to AWS to try to resolve the bucket URL.
Thanks.
Update / Resolution:
The fix for the issue was adding a missing prerequisite jar, at Maven coordinates org.apache.hadoop:hadoop-aws:2.6.0.
So the final pyspark call looked like:
pyspark2 --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks --conf fs.s3a.endpoint=https://myenvironment.domain.com --jars hadoop-aws-2.6.0.jar
df = spark.read.parquet("s3a://mybucket/path1/")
This is covered in HDP docs, Working with third party object stores.
Settings are the same for CDH.
It comes down to:
endpoint: fs.s3a.endpoint = hostname
disable the DNS-to-bucket map: fs.s3a.path.style.access = true
play with the signing options.
There are a few other switches you can turn for better compatibility; they're in those docs.
You might find the Cloudstore storediag command useful.
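A hedged example of those settings as pyspark2/spark-submit options, reusing the endpoint and bucket names from the question; note that fs.s3a.* keys are passed through the spark.hadoop. prefix so they reach the Hadoop configuration:

pyspark2 \
  --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/user/myusername/awskeyfile.jceks \
  --conf spark.hadoop.fs.s3a.endpoint=https://myenvironment.domain.com \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --jars hadoop-aws-2.6.0.jar

df = spark.read.parquet("s3a://mybucket/path1/")

With the endpoint and path-style access set, requests go to https://myenvironment.domain.com/mybucket/... instead of the default mybucket.s3.amazonaws.com address seen in the error above.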

Spark job on DC/OS cluster on AWS

I am trying to run a batch process in Spark on DC/OS on AWS. For each batch process, I have some specific parameters that I send when I do the spark submit (for example, which users to perform the batch process for).
I have a Spark cluster on DC/OS, with one master and 3 private nodes.
I have created an application.conf file, uploaded it to S3, and enabled the permissions for accessing that file.
My spark submit command looks like this:
dcos spark run --submit-args='-Dspark.mesos.coarse=true --driver-class-path https://path_to_the_folder_root_where_is_the_file --conf spark.driver.extraJavaOptions=-Dconfig.file=application.conf --conf spark.executor.extraJavaOptions=-Dconfig.file=application.conf --class class_name jar_location_on_S3'
And I get the error that the job.properties file is not found:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'wattio-batch'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:218)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:224)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:33)
at com.enerbyte.spark.jobs.wattiobatch.WattioBatchJob$.main(WattioBatchJob.scala:31)
at com.enerbyte.spark.jobs.wattiobatch.WattioBatchJob.main(WattioBatchJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:123)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
How do I set this properly? Although one of the private slaves executes the driver, does it have access to the Internet (is it able to go to S3 and download the conf file)?
Thank you
I didn't succeed in passing the conf file through the spark-submit command, but what I did was hard-code the location of the application.conf file at the beginning of my program using:
System.setProperty("config.url", "https://s3_location/application.conf")
ConfigFactory.invalidateCaches()
This way, the program was able to read the application.conf file at every launch.
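A slightly fuller sketch of that workaround (the key name wattio-batch is taken from the error above; everything else is assumed):

import com.typesafe.config.ConfigFactory

// Point Typesafe Config at the remote file before anything calls ConfigFactory.load()
System.setProperty("config.url", "https://s3_location/application.conf")
ConfigFactory.invalidateCaches()

// load() honours the config.url system property, so application.conf is fetched over HTTPS here
val config = ConfigFactory.load()
val batchConfig = config.getConfig("wattio-batch")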
