When could Resource Manager report "AM Release Container" operation success? - apache-spark

I've been running a Spark application and one of the stages failed with a FetchFailedException. At roughly the same time, a log entry similar to the following appeared in the ResourceManager logs.
<date> <time>,988 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: User=<user> OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=<appid> CONTAINERID=<containerid>
My application was using more resources than YARN had allocated to it, but it had been running for several days. What I suspect happened is that other applications started up and wanted to use the cluster, and the Resource Manager killed one of my containers to give the resources to them.
Can anyone help me verify my assumption and/or point me to the documentation that describes the log messages that the Resource Manager outputs?
Edit:
If it helps, the YARN version I'm running is 2.6.0-cdh5.4.9.

The INFO message with OPERATION=AM Released Container comes from org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt. My admittedly vague reading of the code suggests it records a successful container release, meaning that the container for the ApplicationMaster of your Spark application finished successfully.
I've just answered a similar question Why Spark application on YARN fails with FetchFailedException due to Connection refused? (yours was almost a duplicate).
A FetchFailedException is thrown when a reducer task (for a ShuffleDependency) cannot fetch shuffle blocks.
The root cause of the FetchFailedException is usually that the executor (with the BlockManager holding the shuffle blocks) is lost, i.e. no longer available, because:
An OutOfMemoryError was thrown (the executor "OOMed"), or some other unhandled exception occurred.
The cluster manager that manages the workers with the executors of your Spark application, e.g. YARN, enforces the container memory limits and eventually decided to kill the executor due to excessive memory usage.
You should review the logs of the Spark application using the web UI, the Spark History Server, or cluster-specific tools such as yarn logs -applicationId for Hadoop YARN (which is your case).
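For example, on Hadoop YARN the aggregated container logs can usually be retrieved with the following command, assuming log aggregation is enabled (the application id is the one YARN assigned to your job):
yarn logs -applicationId <appid>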
A solution is usually to tune the memory of your Spark application.
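As a hedged starting point, tuning usually means giving the executors more heap and more off-heap overhead than before; the property names below are standard Spark settings, but the values are purely illustrative and depend on your workload (on older Spark versions the overhead property is spark.yarn.executor.memoryOverhead instead):
spark-submit \
  --master yarn \
  --conf spark.executor.memory=6g \
  --conf spark.executor.memoryOverhead=1g \
  --conf spark.driver.memory=4g \
  --class <main-class> \
  <application-jar>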

Related

Spark job on Kubernetes Under Resource Starvation Wait Indefinitely For SPARK_MIN_EXECUTORS

I am using Spark 3.0.1 and working on a project that deploys Spark on Kubernetes, where Kubernetes acts as the cluster manager for the Spark job and the job is submitted in client mode. If the cluster does not have sufficient resources (CPU/memory) for the minimum number of executors, the executor pods stay in the Pending state indefinitely until resources are freed.
Suppose the cluster configuration is:
total memory = 204Gi
used memory = 200Gi
free memory = 4Gi
spark.executor.memory=10g
spark.dynamicAllocation.minExecutors=4
spark.dynamicAllocation.maxExecutors=8
Here the job should not be submitted, because fewer executors can be scheduled than spark.dynamicAllocation.minExecutors requires.
How can the driver abort the job in this scenario?
First, I would like to mention that Spark dynamic allocation is not yet supported for Kubernetes (as of version 3.0.1); it is in the pipeline for a future release (Link).
As for the requirement you have posted, you could address it by running a resource-monitoring snippet before the job is initialized and terminating the initialization pod itself with an error if there is not enough capacity.
If you want to run this from the CLI, you could use kubectl describe nodes or the kube-capacity utility to monitor the resources, as sketched below.
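A minimal sketch of such a pre-flight check from the shell, assuming kubectl is configured against the target cluster (kube-capacity is a third-party kubectl plugin, and the grep pattern is only illustrative):
kubectl describe nodes | grep -A 8 'Allocated resources'
kube-capacity --util
If the free (allocatable minus requested) memory is smaller than spark.executor.memory times spark.dynamicAllocation.minExecutors, you can skip the spark-submit and exit with an error instead.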

What should I pay attention to when optimizing a Spark task, in order to avoid generating excessive local logs

According to log analysis, the reason my EMR YARN ResourceManager restarted is an NPE crash caused by a disk failure on a YARN node.
What should I pay attention to when optimizing the Spark task so that it does not generate excessive local logs while running, since those logs can cause the node to be marked as unhealthy and lead to abnormal conditions?
Or, what parameters should I adjust to reduce the logs that are kept locally?
You can set spark.history.fs.cleaner.maxAge and spark.history.fs.cleaner.interval so that old event logs are cleaned up from the file system.
More info here: https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/
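A hedged sketch of the corresponding spark-defaults.conf entries; the cleaner also has to be enabled explicitly, and the retention values below are only examples:
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge    7d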

Spark on Kubernetes: Is it possible to keep the crashed pods when a job fails?

I have the strange problem that a Spark job run on Kubernetes fails with a lot of "Missing an output location for shuffle X" errors in jobs where there is a lot of shuffling going on. Increasing executor memory does not help. However, the same job run on just a single node of the Kubernetes cluster in local[*] mode runs fine, so I suspect it has to do with Kubernetes or the underlying Docker.
When an executor dies, the pods are deleted immediately so I cannot track down why it failed. Is there an option that keeps failed pods around so I can view their logs?
You can view the logs of the previous terminated pod like this:
kubectl logs -p <terminated pod name>
Also, you can use the spec.ttlSecondsAfterFinished field of a Job, as mentioned here.
Executors are deleted by default on any failure, and you cannot do anything about that unless you customize the Spark on K8s code or use some advanced K8s tooling.
What you can do (and it is probably the easiest approach to start with) is configure an external log collector, e.g. Grafana Loki, which can be deployed with one click to any K8s cluster, or some ELK stack components. These will persist the logs even after the pods are deleted.
There is a deleteOnTermination setting in the spark application yaml. See the spark-on-kubernetes README.md.
deleteOnTermination - (Optional)
DeleteOnTermination specifies whether executor pods should be deleted in case of failure or normal termination. It maps to spark.kubernetes.executor.deleteOnTermination, which is available since Spark 3.0.
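A hedged sketch of the same setting passed directly to spark-submit instead of through the operator's YAML (the property name is the real Spark 3.0+ one; the master URL and application details are placeholders):
spark-submit \
  --master k8s://https://<k8s-apiserver>:<port> \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  --class <main-class> \
  <application-jar>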

Cloudera Execution Problem: Initial job has not accepted any resources

I'm trying to fetch some data from Cloudera's Quick Start Hadoop distribution (a Linux VM for us) on our SAP HANA database using SAP Spark Controller. Every time I trigger the job in HANA, it gets stuck and I see the following warning being logged continuously every 10-15 seconds in SPARK Controller's log file, unless I kill the job.
WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Although it's logged as a warning, it looks like a problem that prevents the job from executing on Cloudera. From what I've read, it's either an issue with resource management on Cloudera or an issue with blocked ports. In our case we don't have any blocked ports, so it must be the former.
Our Cloudera is running a single node and has 16GB RAM with 4 CPU cores.
Looking at the overall configuration, I see a bunch of warnings, but I can't determine whether they are relevant to the issue or not.
Here's also how the RAM is distributed on Cloudera
It would be great if you can help me pinpoint the cause for this issue because I've been trying various combinations of things over the past few days without any success.
Thanks,
Dimitar
You're trying to use the Cloudera Quickstart VM for a purpose beyond its capacity. It's really meant for someone to play around with Hadoop and CDH and should not be used for any production-level work.
Your NodeManager only has 5GB of memory to use for compute resources. In order to do any work, YARN needs to create an ApplicationMaster (AM) container and a Spark executor container, and then reserve overhead memory for your executors, which you won't have room for on a Quickstart VM.
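To illustrate the arithmetic, a minimal-footprint configuration that might fit inside that 5GB NodeManager could look like the sketch below; the values are purely illustrative, and whether they actually fit also depends on yarn.scheduler.minimum-allocation-mb and the per-container memory overhead:
spark.yarn.am.memory      512m
spark.executor.instances  1
spark.executor.memory     1g
spark.executor.cores      1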

Apache Spark running on YARN with fixed allocation

What's happening right now is that YARN simply takes a number of executors from one Spark job and gives them to another Spark job. As a result, the first Spark job encounters errors and dies.
Is there a way, or an existing configuration, to give a certain Spark job running on YARN a fixed resource allocation?
Fixed resource allocation is an old concept and doesn't give the benefit of proper resource utilization; dynamic resource allocation is an advanced/expected feature of YARN. So I recommend that you look at what is actually happening: if a job is already running, YARN doesn't take its resources away and give them to others. If resources are not available, the second job gets queued and resources are not pulled abruptly from the first job. The reason is that a container is a combination of memory and CPU; if its memory were given to another job, the JVM of the first job would essentially be lost forever. YARN doesn't do what you have described.
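That said, if you still want one job to keep a fixed footprint instead of relying on dynamic allocation, a hedged sketch of the relevant spark-submit settings would be the following (the sizes are placeholders for your own sizing):
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  --class <main-class> \
  <application-jar>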
