Too many KDC calls from KafkaConsumer on Spark streaming

I have a standalone (master=local, for its own reasons) Spark Structured Streaming application that reads from a kerberized Kafka cluster.
It works functionally, but it makes too many calls to the KDC, fetching a TGS for every micro-batch execution.
Whether I used useTicketCache=true or provided a keytab in the JAAS config, the behavior was the same: the KDC was bombarded for each broker, for each task.
Spark seemed to fetch the TGT correctly from the cache, whereas the TGS was not reused across multiple runs against the same broker.
The Kerberos debug logs show the following messages before fetching the TGS:
Found ticket for UPN/DOMAIN@REALM to go to krbtgt/DOMAIN@REALM expiring on Thu Jul 23 09:08:39 CEST 2020
Entered Krb5Context.initSecContext with state=STATE_NEW
Service ticket not found in the subject
Am I missing any Spark configuration?

It turned out to be an issue with Kerberos configuration.
We had accidentally set the system property javax.security.auth.useSubjectCredsOnly=false.
Removing it (the default is true) fixed the problem.
Ref: sun.security.jgss.krb5.Krb5Context.java in the JDK source.
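For illustration, a minimal sketch (not from the original post) of guarding against that property being forced to false on the driver JVM:

    // Sketch only: ensure javax.security.auth.useSubjectCredsOnly is not forced to false.
    // The JGSS default is true, which lets service tickets cached in the JAAS Subject be
    // reused instead of triggering a fresh TGS request per broker per task.
    val key = "javax.security.auth.useSubjectCredsOnly"
    if (Option(System.getProperty(key)).exists(_.equalsIgnoreCase("false"))) {
      System.clearProperty(key) // restore the default (true)
    }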

Related

How to control the number of Hadoop IPC retry attempts for a Spark job submission?

Suppose I attempt to submit a Spark (2.4.x) job to a Kerberized cluster, without having valid Kerberos credentials. In this case, the Spark launcher tries repeatedly to initiate a Hadoop IPC call, but fails:
20/01/22 15:49:32 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "node-1.cluster/172.18.0.2"; destination host is: "node-1.cluster":8032; , while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over null after 1 failover attempts. Trying to failover after sleeping for 35160ms.
This will repeat a number of times (30, in my case), until eventually the launcher gives up and the job submission is considered failed.
Various similar questions mention these properties (which are actually YARN properties, prefixed with spark. per the standard mechanism for passing them with a Spark application):
spark.yarn.maxAppAttempts
spark.yarn.resourcemanager.am.max-attempts
However, neither of these properties affects the behavior I'm describing. How can I control the number of IPC retries in a Spark job submission?
After a good deal of debugging, I figured out the properties involved here.
yarn.client.failover-max-attempts (controls the max attempts)
If this is not specified, the number of attempts appears to be derived from the ratio of the following two properties (numerator first, denominator second):
yarn.resourcemanager.connect.max-wait.ms
yarn.client.failover-sleep-base-ms
Of course, as with any YARN property, these must be prefixed with spark.hadoop. in the context of a Spark job submission.
The relevant class (which resolves all these properties) is RMProxy, within the Hadoop YARN project (source here). All these, and related, properties are documented here.
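For example, a hedged sketch of overriding these from the Spark side (the values shown are illustrative only):

    import org.apache.spark.SparkConf

    // Cap the RM connection retries directly...
    val conf = new SparkConf()
      .set("spark.hadoop.yarn.client.failover-max-attempts", "3")
      // ...or bound the implied attempt count via the ratio described above.
      .set("spark.hadoop.yarn.resourcemanager.connect.max-wait.ms", "30000")
      .set("spark.hadoop.yarn.client.failover-sleep-base-ms", "10000")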

Spark streaming job changing status to ACCEPTED from RUNNING after few days

I have a long-running Spark Streaming job which reads from Kafka. The job is started once and expected to run forever.
The cluster is kerberized.
What I have observed is that the job runs fine for a few days (more than 7 days). At the start of the job we can see that it acquires an HDFS delegation token, which is valid for 7 days.
18/06/17 12:32:11 INFO hdfs.DFSClient: Created token for user: HDFS_DELEGATION_TOKEN owner=user@domain, renewer=yarn, realUser=, issueDate=1529213531903, maxDate=1529818331903, sequenceNumber=915336, masterKeyId=385 on ha-hdfs:cluster
The job keeps running for more than 7 days, but after that period (a few days after maxDate) it suddenly changes status to ACCEPTED. It then tries to acquire a new Kerberos ticket and fails with a Kerberos error:
18/06/26 01:17:40 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:41 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:42 INFO yarn.Client: Application report for application_xxxx_80353 (state: ACCEPTED)
18/06/26 01:17:42 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service:}
Final exception -
18/06/26 01:17:45 WARN security.UserGroupInformation: PriviledgedActionException as:user@domain
(auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate
failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]
Note: I have already tried passing the keytab file so that delegation tokens could be renewed indefinitely, but I am not able to pass the keytab file to Spark because it conflicts with the Kafka jaas.conf.
So there are three related questions:
Why could the job change status from RUNNING to ACCEPTED?
Is the issue happening because I am not able to pass the keytab? If yes, how can a keytab be passed when using Kafka and Spark Streaming over Kerberos?
--keytab does not work as we are already passing the keytab with --files; the keytab is referenced in jaas.conf and distributed via the --files parameter of spark-submit. Is there any other way the job can acquire a new ticket?
When the job tries to go back to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Would it help to ensure that the driver node always has a valid Kerberos ticket? Then, when this happens, it would be like submitting a new Spark job: that node would have a valid ticket and would not raise the Kerberos error.
Why could the job change status from RUNNING to ACCEPTED?
A job will transition from RUNNING to ACCEPTED if the current application attempt failed and you still have AM retry attempts available.
Is the issue happening because I am not able to pass the keytab? If yes, how can a keytab be passed when using Kafka and Spark Streaming over Kerberos? --keytab does not work as we are already passing the keytab with --files; the keytab is referenced in jaas.conf and distributed via the --files parameter of spark-submit. Is there any other way the job can acquire a new ticket?
Yes. Spark allows for long-running applications, but on a secure system you must pass in a keytab.
Quoting Configuring Spark on YARN for Long-Running Applications with emphasis added:
Long-running applications such as Spark Streaming jobs must be able to write to HDFS, which means that the hdfs user may need to delegate tokens possibly beyond the default lifetime. This workload type REQUIRES passing Kerberos principal and keytab to the spark-submit script using the --principal and --keytab parameters. The keytab is copied to the host running the ApplicationMaster, and the Kerberos login is renewed periodically by using the principal and keytab to generate the required delegation tokens needed for HDFS.
Based on KAFKA-1696, this issue has not been resolved yet so I'm not sure what you can do unless you're running CDH and can upgrade to Spark 2.1.
References:
What does state transition RUNNING --> ACCEPTED mean?
Hadoop Delegation Tokens Explained - (see section titled "Long-running Applications")
KAFKA-1696 - Kafka should be able to generate Hadoop delegation tokens
YARN Application Security - section "Securing Long-lived YARN Services"
Reading data securely from Apache Kafka to Apache Spark
Updating here, for the benefit of others, the solution that solved my problem: simply provide --principal and --keytab pointing to a separate copy of the keytab file so that there is no conflict.
Why could the job change status from RUNNING to ACCEPTED?
The application changed status because the Kerberos ticket was no longer valid. This can happen any time after the lease has expired, but it does not happen at any deterministic time after expiry.
Is the issue happening because I am not able to pass the keytab?
It was indeed because of the keytab. There is an easy solution. A simple way to think about it: whenever a streaming job requires HDFS access, you need to pass a keytab and principal. Just make a copy of your keytab file and pass it with: --keytab "my-copy-yarn.keytab" --principal "user@domain". All other considerations (the JAAS file, etc.) stay the same, so you still need to apply those; this does not interfere with them.
When the job tries to go back to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Would it help to ensure that the driver node always has a valid Kerberos ticket?
This is essentially happening because YARN is trying to renew the ticket internally. It does not really matter whether the node the application was launched from has a valid ticket at the time a new attempt is launched. YARN has to have sufficient information to renew the ticket, and the application needs to have had a valid ticket when it was launched (the second part will always be true, since without it the job would not even start, but you need to take care of the first part).
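For illustration, a hedged sketch of the resulting configuration (the principal and file names are placeholders; the JAAS file keeps pointing at the copy of the keytab shipped via --files):

    import org.apache.spark.SparkConf

    // Sketch only: spark.yarn.keytab gets a *separate copy* of the keytab, so Spark's own
    // upload for delegation-token renewal does not clash with the --files distribution
    // that the Kafka JAAS login relies on.
    val conf = new SparkConf()
      .set("spark.yarn.principal", "user@DOMAIN")                  // placeholder principal
      .set("spark.yarn.keytab", "my-copy-yarn.keytab")             // the copy, used only for token renewal
      .set("spark.files", "user.keytab,kafka_client_jaas.conf")    // placeholder names referenced by jaas.conf
      .set("spark.driver.extraJavaOptions",
           "-Djava.security.auth.login.config=./kafka_client_jaas.conf")
      .set("spark.executor.extraJavaOptions",
           "-Djava.security.auth.login.config=./kafka_client_jaas.conf")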

App server Log process

I have a requirement from my client to process the application (Tomcat) server log files for a back-end REST-based app server which is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is: get the data from the app server logs --> push it to Spark Streaming via Kafka and process it --> store the processed data in Hive --> use Zeppelin to query the processed, centralized log data and generate reports per the client's requirements.
But as far as I know, Kafka does not have any feature that can read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the logs from time to time and sends them to a Kafka broker, which I would prefer not to do: it would not be real time, and there could be synchronization issues to worry about, as we have 4 instances of the application server.
Another option in this case, I think, is Apache Flume.
Can anyone suggest which would be the better approach in this case, whether Kafka has any built-in way to read data from log files on its own, and what advantages or disadvantages each option has?
I guess another option is Flume + Kafka together, but I cannot speculate much on how that would work as I have almost no knowledge of Flume.
Any help will be highly appreciated. :-)
Thanks a lot.
You can use Kafka Connect (the file source connector) to read/consume the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and process the data:
tomcat -> logs ---> kafka connect -> kafka -> spark -> Hive
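As a rough sketch of the kafka -> spark -> Hive leg (assuming Spark 2.4+ Structured Streaming; the broker list, topic, table, and checkpoint path are placeholders):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder()
      .appName("tomcat-log-pipeline")
      .enableHiveSupport()
      .getOrCreate()

    // Consume the raw log lines published to Kafka (e.g. by Kafka Connect's file source).
    val logs = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder brokers
      .option("subscribe", "tomcat-logs")                 // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS line", "timestamp")

    // Append each micro-batch to a Hive table that Zeppelin can query for the reports.
    logs.writeStream
      .option("checkpointLocation", "/tmp/checkpoints/tomcat-logs")  // placeholder path
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write.mode("append").saveAsTable("access_logs")        // placeholder table
      }
      .start()
      .awaitTermination()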

Spark reading from HBase in secured cluster

I am trying to execute a Spark job on a Kerberos-enabled YARN cluster (Hortonworks). The job reads and writes data from/to HBase.
Unfortunately I have some problems with the authentication (especially when the Spark job tries to access the HBase data) and with understanding how the authentication works. Here is the error I am getting:
ERROR yarn.ApplicationMaster: User class threw exception:
java.io.IOException: Login failure for username from keytab
keytabFile java.io.IOException: Login failure for username from
keytab keytabFile    at
org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1146)
I want the authentication to happen based on a keytab of a (technical) user.
Therefore I currently have two places where I provide the principal and keytab information:
In the spark-submit script, with the --principal and --keytab options
In the code, with UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab) (see the sketch below)
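For reference, the in-code login from the second item looks roughly like this (the principal and keytab path are placeholders):

    import org.apache.hadoop.security.UserGroupInformation

    // Logs in from the keytab and returns a UGI without changing the process-wide login user.
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "techuser@EXAMPLE.COM",                    // placeholder principal
      "/etc/security/keytabs/techuser.keytab")   // placeholder keytab path

    // HBase/HDFS calls would then typically run inside ugi.doAs(...) to use these credentials.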
My questions:
What is the purpose of each of the two places mentioned above for providing the keytab? Is one of them only used to authenticate against the YARN cluster to get resources? Or do I really need to provide the principal/keytab information twice, for the two different authentications (against YARN and against HBase)? How does Spark handle all that internally?
Do I need to provide the principal as username or as the full principal (username@REALM)? Is it the same for both places?
Do I need to have the keytab file distributed to all worker nodes in the same location? To which user must the keytab file be readable? Or is there also a way to pass it around through the spark-submit script?
I know, a lot of questions...
I appreciate your help or any hints!
Thanks and regards

Passing in Kerberos keytab/principal via SparkLauncher

spark-submit allows us to pass in Kerberos credentials via the --keytab and --principal options. If I try to add these via addSparkArg("--keytab", keytab), I get a '--keytab' does not expect a value error; I presume this is due to lack of support as of v1.6.0.
Is there another way to submit my Spark job using the SparkLauncher class with Kerberos credentials? I'm using YARN with secured HDFS.
--principal arg is described as "Principal to be used to login to KDC, while running on secure HDFS".
So it is specific to Hadoop integration. I'm not sure you are aware of that, because your post does not mention either Hadoop, YARN or HDFS.
Now, Spark properties that are Hadoop-specific are described on the manual page Running on YARN. Surprise! Some of these properties sound familiar, like spark.yarn.principal and spark.yarn.keytab.
Bottom line: the --blahblah command-line arguments are just shortcuts to properties that you can otherwise set in your code, or in the "spark-defaults" conf file.
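A hedged sketch of that approach through SparkLauncher (the app jar, main class, principal, and keytab are placeholders):

    import org.apache.spark.launcher.SparkLauncher

    // Set the Hadoop-specific properties directly instead of using the --principal/--keytab shortcuts.
    val launcher = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")                 // placeholder jar
      .setMainClass("com.example.MyApp")                     // placeholder main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf("spark.yarn.principal", "user@EXAMPLE.COM")   // placeholder principal
      .setConf("spark.yarn.keytab", "/path/to/user.keytab")  // placeholder keytab path

    val process = launcher.launch()
    process.waitFor()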
Since Samson's answer, I thought I'd add what I've experienced with Spark 1.6.1:
(a) You could use SparkLauncher.addSparkArg("--proxy-user", userName) to send in proxy-user info.
(b) You could use SparkLauncher.addSparkArg("--principal", kerbPrincipal) and SparkLauncher.addSparkArg("--keytab", kerbKeytab).
So, you can use either (a) or (b), but not both together - see https://github.com/apache/spark/pull/11358/commits/0159499a55591f25c690bfdfeecfa406142be02b
In other words, either the launched process triggers a Spark job on YARN as itself, using its own Kerberos credentials (b), or the launched process impersonates an end user to trigger the Spark job on a cluster without Kerberos (a). On YARN, in the former case the job is owned by the launcher itself, while in the latter case the job is owned by the proxied user.
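For completeness, a hedged sketch of option (b) via addSparkArg on Spark 1.6.1+ (the jar, main class, principal, and keytab are placeholders; use --proxy-user instead for option (a), never both):

    import org.apache.spark.launcher.SparkLauncher

    val launcher = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")   // placeholder jar
      .setMainClass("com.example.MyApp")       // placeholder main class
      .setMaster("yarn")

    // Option (b): run the job as the keytab's principal.
    launcher.addSparkArg("--principal", "user@EXAMPLE.COM")   // placeholder principal
    launcher.addSparkArg("--keytab", "/path/to/user.keytab")  // placeholder keytab path
    // Option (a), mutually exclusive with the above:
    // launcher.addSparkArg("--proxy-user", "endUser")

    val handle = launcher.startApplication()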

Resources