EMR Step application state changes randomly - apache-spark

In the stderr logs of my EMR Step, I see the state changing from ACCEPTED to RUNNING, but after some time it changes back to ACCEPTED. Does this mean another application has been submitted to the same cluster, or is there an issue with the code running in the step? Note that this does not happen with every execution of the same code; at times it does go from RUNNING to FINISHED. How can I avoid this difference in behavior across multiple executions of the same job? How can I make sure that the step always moves from RUNNING to FINISHED and does not go back to ACCEPTED? Sorry if the description is not very clear; I have also not been able to find much help by searching online. Sample logs are shown below:
22/11/15 03:48:23 INFO Client: Application report for application_1668478752419_0001 (state: RUNNING)
22/11/15 03:48:24 INFO Client: Application report for application_1668478752419_0001 (state: RUNNING)
22/11/15 03:48:25 INFO Client: Application report for application_1668478752419_0001 (state: RUNNING)
22/11/15 03:48:26 INFO Client: Application report for application_1668478752419_0001 (state: RUNNING)
22/11/15 03:48:27 INFO Client: Application report for application_1668478752419_0001 (state: ACCEPTED)
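A quick way to check whether this is YARN restarting a failed ApplicationMaster attempt (the usual cause of a RUNNING-to-ACCEPTED regression, as discussed in the related answers below) is to inspect the application with the YARN CLI from the EMR master node. This is only a diagnostic sketch, with the application ID taken from the logs above:

# List the ApplicationMaster attempts; more than one attempt means an earlier AM failed
yarn applicationattempt -list application_1668478752419_0001

# Show the overall application report, including final status and diagnostics
yarn application -status application_1668478752419_0001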

Related

Spark app fails after ACCEPTED state for a long time. Log says Socket timeout exception

I have Hadoop 3.2.2 running on a cluster with 1 name node, 2 data nodes and 1 resource manager node. I tried to run the SparkPi example in cluster mode; the spark-submit is done from my local machine. YARN accepts the job but the application UI says this. Further, in the terminal where I submitted the job it says
2021-06-05 13:10:03,881 INFO yarn.Client: Application report for application_1622897708349_0001 (state: ACCEPTED)
This continues to print until it fails. Upon failure it prints
I tried increasing spark.executor.heartbeatInterval to 3600 seconds, still with no luck. I also tried running the job from the name node, thinking there must be some connection issue with my local machine, but I am still unable to run it.
I found the answer, albeit I don't know why it works: adding the private IP address to the security group in AWS did the trick.
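For anyone who wants to apply the same fix from the command line, here is a minimal, hedged sketch; the security group ID, port range and private IP are placeholders to replace with your own values:

# Allow inbound traffic from the submitting machine's private IP (placeholder values)
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 1024-65535 \
    --cidr 172.31.5.10/32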

Spark streaming job changing status to ACCEPTED from RUNNING after few days

I have a long-running Spark streaming job which reads from Kafka. This job is started once and expected to run forever.
The cluster is kerberized.
What I have observed is that the job runs fine for a few days (more than 7 days). At the start of the job we can see that it acquires an HDFS delegation token, which is valid for 7 days.
18/06/17 12:32:11 INFO hdfs.DFSClient: Created token for user: HDFS_DELEGATION_TOKEN owner=user#domain, renewer=yarn, realUser=, issueDate=1529213531903, maxDate=1529818331903, sequenceNumber=915336, masterKeyId=385 on ha-hdfs:cluster
The job keeps running for more than 7 days, but after that period (a few days after maxDate) it suddenly changes status to ACCEPTED. After this it tries to acquire a new Kerberos ticket and fails with a Kerberos error:
18/06/26 01:17:40 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:41 INFO yarn.Client: Application report for application_xxxx_80353 (state: RUNNING)
18/06/26 01:17:42 INFO yarn.Client: Application report for application_xxxx_80353 (state: ACCEPTED)
18/06/26 01:17:42 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service:}
Final exception -
18/06/26 01:17:45 WARN security.UserGroupInformation: PriviledgedActionException as:user#domain (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Note - I already tried passing a keytab file so that delegation could be done forever, but I am not able to pass the keytab file to Spark as it conflicts with the Kafka jaas.conf.
So there are 3 related questions:
Why would the job change status from RUNNING to ACCEPTED?
Is the issue happening because I am not able to pass the keytab? If yes, how do I pass a keytab when using Kafka and Spark streaming over Kerberos?
--keytab does not work, as we are already passing the keytab with --files; the keytab is already configured in jaas.conf and distributed with the --files parameter in spark-submit. Is there any other way the job can acquire a new ticket?
When the job again tries to go to the RUNNING state, YARN rejects it as it does not have a valid Kerberos ticket. Will it help if we ensure that the driver node always has a valid Kerberos ticket? Then, when this happens, it would be like submitting a new Spark job: as that node has a valid Kerberos ticket, it will not give the Kerberos error.
Why would the job change status from RUNNING to ACCEPTED?
A job will transition from RUNNING to ACCEPTED if the current application attempt failed and you still have attempts left within your ApplicationMaster (AM) retry limit.
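As a hedged illustration of where that retry limit comes from: the cluster-wide ceiling is yarn.resourcemanager.am.max-attempts, and a single job can cap itself with spark.yarn.maxAppAttempts, for example:

# Cap this job at a single AM attempt so a failure surfaces as FAILED instead of
# quietly restarting in ACCEPTED (the value and script name are illustrative only)
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=1 \
    my_job.py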
Is the issue happening because I am not able to pass the keytab? If yes, how do I pass a keytab when using Kafka and Spark streaming over Kerberos? --keytab does not work, as we are already passing the keytab with --files; the keytab is already configured in jaas.conf and distributed with the --files parameter in spark-submit. Is there any other way the job can acquire a new ticket?
Yes. Spark allows for long running applications but on a secure system you must pass in a keytab.
Quoting Configuring Spark on YARN for Long-Running Applications with emphasis added:
Long-running applications such as Spark Streaming jobs must be able to write to HDFS, which means that the hdfs user may need to delegate tokens possibly beyond the default lifetime. This workload type REQUIRES passing Kerberos principal and keytab to the spark-submit script using the --principal and --keytab parameters. The keytab is copied to the host running the ApplicationMaster, and the Kerberos login is renewed periodically by using the principal and keytab to generate the required delegation tokens needed for HDFS.
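A minimal spark-submit sketch following that guidance might look like the example below; the principal, keytab path, main class and jar are placeholders rather than values from the question:

# Let Spark/YARN renew delegation tokens from the keytab for a long-running job
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --principal user@EXAMPLE.COM \
    --keytab /etc/security/keytabs/user.keytab \
    --class com.example.StreamingJob \
    streaming-job.jar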
Based on KAFKA-1696, this issue has not been resolved yet so I'm not sure what you can do unless you're running CDH and can upgrade to Spark 2.1.
References:
What does state transition RUNNING --> ACCEPTED mean?
Hadoop Delegation Tokens Explained - (see section titled "Long-running Applications")
KAFKA-1696 - Kafka should be able to generate Hadoop delegation tokens
YARN Application Security - section "Securing Long-lived YARN Services"
Reading data securely from Apache Kafka to Apache Spark
Updating here the solution which solved my problem, for the benefit of others. The solution was to simply provide --principal and --keytab pointing at another copy of the keytab file so that there won't be a conflict.
Why would the job change status from RUNNING to ACCEPTED?
The application changed status because the Kerberos ticket was no longer valid. This can happen at any time after the lease has expired, but not at any deterministic time after expiry.
Is the issue happening because I am not able to pass the keytab?
It was indeed because of the keytab. There is an easy solution for this. The simple way to think about it is: whenever a streaming job needs HDFS access, you need to pass a keytab and principal. Just make a copy of your keytab file and pass it with --keytab "my-copy-yarn.keytab" --principal "user#domain". All other considerations (the jaas file, etc.) remain the same, so you still need to apply those; this does not interfere with them.
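A hedged sketch of that arrangement is below; the file names, principal, main class and jar are placeholders, and the key point is that the keytab referenced by jaas.conf is shipped with --files while a separate copy is handed to --keytab:

# Ship jaas.conf and the original keytab for the Kafka client via --files,
# and give Spark/YARN its own copy of the keytab for delegation-token renewal
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --files jaas.conf,user.keytab \
    --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
    --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf" \
    --principal user@EXAMPLE.COM \
    --keytab my-copy-yarn.keytab \
    --class com.example.StreamingJob \
    streaming-job.jar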
When the job again tries to go to the RUNNING state, YARN rejects it as it does not have a valid Kerberos ticket. Will it help if we ensure that the driver node always has a valid Kerberos ticket?
This is essentially happening because YARN is trying to renew the ticket internally. It does not really matter whether the node the application was launched from has a valid ticket at the time a new attempt is launched. YARN has to have sufficient information to renew the ticket, and the application must have had a valid ticket when it was launched (the second part will always be true, since without it the job won't even start, but you need to take care of the first part).

Spark workers stopped after driver commanded a shutdown

Basically, the master node also acts as one of the slaves. Once the slave on the master completed, it called SparkContext to stop, and hence this command propagated to all the slaves, which stopped execution in the middle of processing.
Error log in one of the worker:
INFO SparkHadoopMapRedUtil: attempt_201612061001_0008_m_000005_18112: Committed
INFO Executor: Finished task 5.0 in stage 8.0 (TID 18112). 2536 bytes result sent to driver
INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
Check your resource manager user interface: in case you see any failed executors, it gives details about the memory error. However, if no executor has failed but the driver still called for a shutdown, this is usually due to driver memory; please try to increase the driver memory. Let me know how it goes.
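A hedged example of increasing the driver memory at submit time (the 4g value, main class and jar are placeholders to adjust for your own workload):

# Raise driver memory; equivalently set spark.driver.memory in spark-defaults.conf
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --class com.example.MyJob \
    my-job.jar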

spark: WARN amfilter.AmIpFilter: Could not find proxy-user cookie, so user will not be set

Do you know why the error below happens in the Spark shell when I try to access the Spark UI at master:4040?
WARN amfilter.AmIpFilter: Could not find proxy-user cookie, so user will not be set
This happens if you start the Spark shell with YARN:
spark-shell --master yarn
In that case, YARN will start a proxy web application to increase the security of the overall system.
The URL of the proxy will be displayed in the log while the Spark shell starts.
Here is a sample from my log:
16/06/26 08:38:28 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> myhostname, PROXY_URI_BASES -> http://myhostname:8088/proxy/application_1466917030969_0003), /proxy/application_1466917030969_0003
You shouldn't access the standard Spark Web UI using port 4040 (or whatever you have configured).
Instead, I know these 2 options (of which I prefer the 2nd one; a CLI-based sketch also follows the list):
Scan the log for the proxy application URL and use that
Open the YARN Web UI http://localhost:8088/cluster and
follow the link to the ApplicationMaster (column Tracking UI) of the
running Spark application
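As a hedged alternative to scanning the log by hand, the YARN CLI can print the same tracking URL for a running application; the application ID below is taken from the log sample above:

# Prints an application report that includes the Tracking-URL (the YARN proxy address)
yarn application -status application_1466917030969_0003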
This is also described briefly in the YARN and Spark documentation.
Spark Security documentation:
https://spark.apache.org/docs/latest/security.html
Yarn Web Application Proxy documentation:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html

Election of new zookeeper leader shuts down the Spark Master

I realized that the Spark master becomes unresponsive when I kill the ZooKeeper leader (of course, I assigned the leader-election task to ZooKeeper). The following is the error log that I see on the Spark master node. Do you have any suggestion to resolve it?
15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x14dd82e22f70ef1, likely server has closed socket, closing socket connection and attempting reconnect
15/06/22 10:44:00 INFO ClientCnxn: Unable to read additional data from server sessionid 0x24dc5a319b40090, likely server has closed socket, closing socket connection and attempting reconnect
15/06/22 10:44:01 INFO ConnectionStateManager: State change: SUSPENDED
15/06/22 10:44:01 INFO ConnectionStateManager: State change: SUSPENDED
15/06/22 10:44:01 WARN ConnectionStateManager: There are no ConnectionStateListeners registered.
15/06/22 10:44:01 INFO ZooKeeperLeaderElectionAgent: We have lost leadership
15/06/22 10:44:01 ERROR Master: Leadership has been revoked -- master shutting down.
This is the expected behaviour. You have to set up 'n' masters, and you need to specify the ZooKeeper URL in every master's spark-env.sh:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
Note that ZooKeeper maintains a quorum. This means you need an odd number of ZooKeeper nodes, and the ZooKeeper ensemble is only up while the quorum is maintained. Since Spark depends on ZooKeeper, it follows that the Spark cluster will not be up until the ZooKeeper quorum is maintained.
When you set up two (n) masters and bring down a ZooKeeper node, the current master will go down, a new master will be elected, and all the worker nodes will attach to the new master.
You should have started your workers by giving:
./start-slave.sh spark://master1:port1,master2:port2
You have to wait 1-2 minutes to notice this failover.
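As a hedged sketch, the configuration on each master and the worker start command might look like the following; the ZooKeeper hosts, the optional recovery directory and the master ports are placeholders:

# spark-env.sh on every master node: enable ZooKeeper-based standby recovery
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# Workers register against the full list of masters so they can fail over
./start-slave.sh spark://master1:7077,master2:7077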
