Spark Streaming job changing status from RUNNING to ACCEPTED after a few days

I have a long-running Spark Streaming job which reads from Kafka. The job is started once and is expected to run forever.
The cluster is Kerberized.
What I have observed is that the job runs fine for a few days (more than 7 days). At the start of the job we can see that it acquires an HDFS delegation token, which is valid for 7 days.
18/06/17 12:32:11 INFO hdfs.DFSClient: Created token for user: HDFS_DELEGATION_TOKEN owner=user#domain, renewer=yarn, realUser=, issueDate=1529213531903, maxDate=1529818331903, sequenceNumber=915336, masterKeyId=385 on ha-hdfs:cluster
The job keeps running for more than 7 days, but after that period (a few days after maxDate) it suddenly changes status to ACCEPTED. After this it tries to acquire a new Kerberos ticket and fails with a Kerberos error:
18/06/26 01:17:40 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:41 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:42 INFO yarn.Client: Application report for application_xxxx_80353 (state: ACCEPTED)
18/06/26 01:17:42 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service:}
Final exception -
18/06/26 01:17:45 WARN security.UserGroupInformation: PriviledgedActionException as:user#domain
(auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate
failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]
Note: I already tried passing a keytab file so that delegation could be renewed indefinitely, but I am not able to pass the keytab file to Spark because it conflicts with the Kafka jaas.conf.
So there are three related questions:
Why could the job change status from RUNNING to ACCEPTED?
Is the issue happening because I am not able to pass a keytab? If yes, how do I pass a keytab when using Kafka and Spark Streaming over Kerberos?
--keytab does not work, because we already distribute a keytab with --files: it is referenced in jaas.conf and shipped with the --files parameter of spark-submit. Is there any other way the job can acquire a new ticket?
When the job tries to go back to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Will it help if we ensure that the driver node always has a valid Kerberos ticket? Then, whenever this happens, it would be like submitting a new Spark job: the node has a valid ticket and the Kerberos error would not occur.

Why could the job change status from RUNNING to ACCEPTED?
A job transitions from RUNNING back to ACCEPTED when the current application attempt fails and you still have ApplicationMaster (AM) retries available.
Is the issue happening because I am not able to pass a keytab? If yes, how do I pass a keytab when using Kafka and Spark Streaming over Kerberos? --keytab does not work, because we already distribute a keytab with --files: it is referenced in jaas.conf and shipped with the --files parameter of spark-submit. Is there any other way the job can acquire a new ticket?
Yes. Spark allows for long-running applications, but on a secure system you must pass in a keytab.
Quoting Configuring Spark on YARN for Long-Running Applications with emphasis added:
Long-running applications such as Spark Streaming jobs must be able to write to HDFS, which means that the hdfs user may need to delegate tokens possibly beyond the default lifetime. This workload type REQUIRES passing Kerberos principal and keytab to the spark-submit script using the --principal and --keytab parameters. The keytab is copied to the host running the ApplicationMaster, and the Kerberos login is renewed periodically by using the principal and keytab to generate the required delegation tokens needed for HDFS.
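For illustration, a minimal sketch of what that looks like on the command line (the principal, keytab path, and application jar below are placeholders, not the exact values from this question):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  my-streaming-app.jar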
Based on KAFKA-1696, this issue has not been resolved yet so I'm not sure what you can do unless you're running CDH and can upgrade to Spark 2.1.
References:
What does state transition RUNNING --> ACCEPTED mean?
Hadoop Delegation Tokens Explained - (see section titled "Long-running Applications")
KAFKA-1696 - Kafka should be able to generate Hadoop delegation tokens
YARN Application Security - section "Securing Long-lived YARN Services"
Reading data securely from Apache Kafka to Apache Spark

Updating here, for the benefit of others, the solution that solved my problem. The solution was simply to provide --principal and --keytab pointing to another copy of the keytab file, so that there is no conflict.
Why could the job change status from RUNNING to ACCEPTED?
The application changed status because the Kerberos ticket was no longer valid. This can happen at any point after the lease expires, but not at any deterministic time after expiry.
Is the issue happening because I am not able to pass a keytab?
It was indeed because of the keytab. There is an easy solution for this. A simple way to think about it: whenever a streaming job needs HDFS access, you need to pass a keytab and principal. Just make a copy of your keytab file and pass it with: --keytab "my-copy-yarn.keytab" --principal "user#domain". All other considerations (jaas.conf distributed with --files, etc.) stay the same, so you still need to apply those; this does not interfere with them.
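For concreteness, the final spark-submit looked roughly like the sketch below. File names, the principal, and the JAAS-related options are illustrative (they follow the usual Kafka-client-on-YARN pattern and may differ in your setup); the essential point is that --keytab points to a separate copy of the keytab, while the original keytab referenced by jaas.conf is still shipped with --files:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files "kafka_client_jaas.conf,kafka.keytab" \
  --driver-java-options "-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
  --principal "user@EXAMPLE.COM" \
  --keytab "my-copy-yarn.keytab" \
  my-streaming-app.jar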
When the job tries to go back to the RUNNING state, YARN rejects it because it does not have a valid Kerberos ticket. Will it help if we ensure that the driver node always has a valid Kerberos ticket?
This is essentially happening because YARN tries to renew the ticket internally. It does not really matter whether the node the application was launched from has a valid ticket at the time the new attempt starts. YARN has to have sufficient information to renew the ticket, and the application needed a valid ticket when it was originally launched (the second part is always true, since without it the job would not even start; you need to take care of the first part).

Related

Spark app fails after ACCEPTED state for a long time. Log says Socket timeout exception

I have Hadoop 3.2.2 running on a cluster with 1 name node, 2 data nodes and 1 resource manager node. I tried to run the SparkPi example in cluster mode; the spark-submit is done from my local machine. YARN accepts the job, but the application UI shows it stuck in the ACCEPTED state, and the terminal where I submitted the job says:
2021-06-05 13:10:03,881 INFO yarn.Client: Application report for application_1622897708349_0001 (state: ACCEPTED)
This continues to print until the job fails with the socket timeout exception mentioned in the title.
I tried increasing spark.executor.heartbeatInterval to 3600 seconds, with no luck. I also tried running the code from the name node, thinking there must be some connection issue with my local machine, but I am still unable to run it.
I found the answer, although I don't know why it works: adding the private IP address to the security group in AWS did the trick.

How to control the number of Hadoop IPC retry attempts for a Spark job submission?

Suppose I attempt to submit a Spark (2.4.x) job to a Kerberized cluster, without having valid Kerberos credentials. In this case, the Spark launcher tries repeatedly to initiate a Hadoop IPC call, but fails:
20/01/22 15:49:32 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "node-1.cluster/172.18.0.2"; destination host is: "node-1.cluster":8032; , while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over null after 1 failover attempts. Trying to failover after sleeping for 35160ms.
This will repeat a number of times (30, in my case), until eventually the launcher gives up and the job submission is considered failed.
Various other similar questions mention these properties (which are actually YARN properties but prefixed with spark. as per the standard mechanism to pass them with a Spark application).
spark.yarn.maxAppAttempts
spark.yarn.resourcemanager.am.max-attempts
However, neither of these properties affects the behavior I'm describing. How can I control the number of IPC retries in a Spark job submission?
After a good deal of debugging, I figured out the properties involved here.
yarn.client.failover-max-attempts (controls the max attempts)
Without specifying this, the number of attempts appears to come from the ratio of these two properties (numerator first, denominator second).
yarn.resourcemanager.connect.max-wait.ms
yarn.client.failover-sleep-base-ms
Of course as with any YARN properties, these must be prefixed with spark.hadoop. in the context of a Spark job submission.
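For example, to cap the retry behaviour at submission time, the properties can be passed like this (the values, class and jar are placeholders, not recommendations):

spark-submit \
  --conf spark.hadoop.yarn.client.failover-max-attempts=3 \
  --conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=10000 \
  --conf spark.hadoop.yarn.client.failover-sleep-base-ms=1000 \
  --class com.example.MyApp my-app.jar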
The relevant class (which resolves all these properties) is RMProxy, within the Hadoop YARN project (source here). All these, and related, properties are documented here.

Spark reading from HBase in secured cluster

I am trying to execute a Spark job on a Kerberos-enabled YARN cluster (Hortonworks). This job reads data from and writes data to HBase.
Unfortunately I have some problems with the authentication (especially when the Spark job tries to access the HBase data) and with understanding how the authentication works. Here is the error I am getting:
ERROR yarn.ApplicationMaster: User class threw exception:
java.io.IOException: Login failure for username from keytab
keytabFile java.io.IOException: Login failure for username from
keytab keytabFile    at
org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytabAndReturnUGI(UserGroupInformation.java:1146)
I want the authentication to happen based on a keytab of a (technical) user.
Therefore I currently have two places where I provide the principal and keytab information:
In the spark-submit script with the --principal and --keytab options
In the code with UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
My questions:
What is the purpose of each of the two above-mentioned places for providing the keytab? Is one of them used only to authenticate against the YARN cluster to get the resources? Or do I really need to provide the principal/keytab information twice, for the two different authentications (against YARN and against HBase)? How does Spark handle all that internally?
Do I need to provide the principal as username or as username#principal? Is it the same for both places?
I need to have the keytab file distributed to all worker nodes in the same location, right? To which user must the keytab files be readable? Or is there also a way to pass it around through the spark-submit script?
I know, a lot of questions...
I appreciate your help or any hints!

spark: WARN amfilter.AmIpFilter: Could not find proxy-user cookie, so user will not be set

Do you know why the error below happens in the Spark shell when I try to access the Spark UI at master:4040?
WARN amfilter.AmIpFilter: Could not find proxy-user cookie, so user will not be set
This happens if you start the Spark shell with YARN:
spark-shell --master yarn
In that case, YARN will start a proxy web application to increase the security of the overall system.
The URL of the proxy is displayed in the log while the Spark shell is starting.
Here is a sample from my log:
16/06/26 08:38:28 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> myhostname, PROXY_URI_BASES -> http://myhostname:8088/proxy/application_1466917030969_0003), /proxy/application_1466917030969_0003
You shouldn't access the standard Spark Web UI using port 4040 (or whatever you have configured).
Instead, I know of these two options (I prefer the second one):
Scan the log for the proxy application URL and use that
Open the YARN Web UI http://localhost:8088/cluster and
follow the link to the ApplicationMaster (column Tracking UI) of the
running Spark application
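With the sample log above, the second option ends up at the same address that appears under PROXY_URI_BASES, i.e. something like:

http://myhostname:8088/proxy/application_1466917030969_0003/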
This is also described briefly in the YARN and Spark documentation:
Spark Security documentation:
https://spark.apache.org/docs/latest/security.html
Yarn Web Application Proxy documentation:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html

Passing in Kerberos keytab/principal via SparkLauncher

spark-submit allows us to pass in Kerberos credentials via the --keytab and --principal options. If I try to add these via addSparkArg("--keytab", keytab), I get a '--keytab' does not expect a value error. I presume this is due to lack of support as of v1.6.0.
Is there another way I can submit my Spark job with Kerberos credentials using this SparkLauncher class? I'm using YARN with secured HDFS.
The --principal arg is described as "Principal to be used to login to KDC, while running on secure HDFS".
So it is specific to Hadoop integration. I'm not sure you are aware of that, because your post does not mention either Hadoop, YARN or HDFS.
Now, Spark properties that are Hadoop-specific are described on the manual page Running on YARN. Surprise! Some of these properties sound familiar, like spark.yarn.principal and spark.yarn.keytab
Bottom line: the --blahblah command-line arguments are just shortcuts to properties that you can otherwise set in your code, or in the "spark-defaults" conf file.
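For example (the principal and keytab path below are placeholders), the same thing could go into spark-defaults.conf, which spark-submit, and therefore SparkLauncher, which invokes it, should pick up without any command-line flags:

spark.yarn.principal   user@EXAMPLE.COM
spark.yarn.keytab      /path/to/user.keytab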
Since Samson's answer, I thought I'd add what I've experienced with Spark 1.6.1:
(a) You could use SparkLauncher.addSparkArg("--proxy-user", userName) to send in proxy-user info.
(b) You could use SparkLauncher.addSparkArg("--principal", kerbPrincipal) and SparkLauncher.addSparkArg("--keytab", kerbKeytab).
So, you can only use either (a) OR (b) but not both together - see https://github.com/apache/spark/pull/11358/commits/0159499a55591f25c690bfdfeecfa406142be02b
In other words, either the launched process triggers a Spark job on YARN as itself, using its own Kerberos credentials, or the launched process impersonates an end user to trigger the Spark job on a cluster without Kerberos. On YARN, in the former case the job is owned by the launching process itself, while in the latter case it is owned by the proxied user.
