Google dataproc spark cluster with too many Preemptible nodes sometime hangs

Google dataproc spark cluster with too many Preemptible nodes sometime hangs - apache-spark

When running spark cluster on dataproc with only 2 nonpreemptable worker nodes and other 100~ preemptable nodes I sometimes get a cluster that is not usable at all due to too many connection errors, datanode errors, lost executors but still being tracked for the heartbeat...
Always getting errors like that:
18/08/08 15:40:11 WARN org.apache.hadoop.hdfs.DataStreamer: Error Recovery for BP-877400388-10.128.0.31-1533740979408:blk_1073742308_1487 in pipeline [DatanodeInfoWithStorage[10.128.0.35:9866,DS-60d8a566-a1b3-4fce-b9e2-1eeeb4ac840b,DISK], DatanodeInfoWithStorage[10.128.0.7:9866,DS-9f1d8b17-0fee-41c7-9d31-8ad89f0df69f,DISK]]: datanode 0(DatanodeInfoWithStorage[10.128.0.35:9866,DS-60d8a566-a1b3-4fce-b9e2-1eeeb4ac840b,DISK]) is bad.
And errors reporting Slow ReadProcessor read fields for block BP-877400388-10.128.0.31-1533740979408:blk_1073742314_1494
from what I see there appear to be something not functioning correctly for those clusters but nothing is reported to indicate that.
Plus the application master is also created on a preemptable node why is that?

According to the documentation, the number of preemptible workers needs to be less than 50% of the total number of nodes within your cluster to have the best results. Regarding the application master within the preemptible node, you could report this behavior by filling an issue tracker for Dataproc.

Related

How can I increase spark.driver.memoryOverhead in Google dataproc?

I am getting two types of errors in running a job on Google dataproc and it is causing executors to be lost one by one until the last executor is lost and the job fails. I have set my master node to n1-highmem-2 (2 vCPU, 13 GB memory) and have set two worker nodes to n1-highmem-8 (8 vCPU, 52 GB memory). The two errors I get are:
"Container exited from explicit termination request."
"Lost executor x: Executor heartbeat timed out"
From my understanding in what I could see online, I need to increase spark.executor.memoryOverhead. I don't know if this is the right answer or not, but I can't see how to change this in the Google dataproc console, and I don't know what to change it to. Any help would be great!
Thanks,
jim

You can set Spark properties at cluster-level with
gcloud dataproc clusters create ... --properties spark:<name>=<value>,...
and/or job-level with
gcloud dataproc jobs submit spark ... --properties <name>=<value>,...
The former requires the spark: prefix, the latter doesn't. If both are set, the latter takes precedence. See more details in this doc.

It turns out the memory per vCPU was the limitation causing the executors to fail one by one. Initially, I was trying to use the custom configuration in console mode for the cluster to add add'l memory per vCPU. It turns that the UI has some bug (per the Google Dataproc team), which limits you from increasing the memory per vCPU (if you use the slider bar to increase the memory beyond the default max of 6.5GB, the cluster set up will fail). However, if you use the command line equivalent of the console, it does allow the set up of the cluster and the increased memory per vCPU was enough to complete the job w/o the executors failing one by one.

How is abnormal Driver termination handled for a Spark App in Yarn cluster mode

We're using AWS EMR for our spark jobs. All our jobs are submitted in yarn cluster mode, so the driver will run in one of the cluster nodes. We use on-demand node for master, and spot-instances for the core nodes. Now, although we almost always choose instances with < 5% interruption rate, sometimes it so happens that a significant fraction of our cluster nodes get terminated prematurely (probably because of higher demands).
So, I was wondering, in the above situation, what happens if a node containing the driver process goes down? Is there any chance of recovery for the spark job in that case? Or is the job gone forever?

The Spark driver is a single point of failure because it holds all cluster state for the running App.
In practice non-ephemeral storage can be used for check-pointing batch Apps after expensive expensive transformations. That said, trying to re-start after such a situation can be done, but when I looked into it, it is quite difficult to say the least. I asked such a question under my name some time ago, you can find it. I am quite technical but felt: gosh what a lot of hard work.
So, the recovery means rolling your own stuff, or accepting a re-run. Since I last evaluated EMR I see that the driver can run on the Master and that can be failed-over, but that is not the same thing as far as I can see, nor what you wish.

EMR has node leveling for CORE nodes in Yarn. Your spark driver/ Application master only gets created in CORE nodes. And HDFS also resides in CORE nodes only.
So to handle your situation in a best way, you may consider to use both CORE and TASK group.
What you can do to tackle this -
MASTER: On-demand
CORE: On-demand. Minimum no of Instances can be 1.
TASK: Spot with autoscaling with minimal EBS volume. Minimum no of Instances can be 0 this case.
This will reduce your cost also ensure that node containing the driver process never goes down.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html

Spark 2.4 to Elasticsearch : prevent data loss during Dataproc nodes decommissioning?

My technical task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster.
We use Apache Spark 2.4 with the Elastic Hadoop connector on a Google Dataproc cluster (autoscaling enabled).
During the execution, if the Dataproc cluster downscaled, all tasks on the decommissioned node are lost and the processed data on this node are never pushed to elastic.
This problem does not exist when I save to GCS or HDFS for example.
How to make resilient this task even when nodes are decommissioned?
An extract of the stacktrace:
Lost task 50.0 in stage 2.3 (TID 427, xxxxxxx-sw-vrb7.c.xxxxxxx, executor 43): FetchFailed(BlockManagerId(30, xxxxxxx-w-23.c.xxxxxxx, 7337, None), shuffleId=0, mapId=26, reduceId=170, message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to xxxxxxx-w-23.c.xxxxxxx:7337
Caused by: java.net.UnknownHostException: xxxxxxx-w-23.c.xxxxxxx
Task 50.0 in stage 2.3 (TID 427) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
Thanks.
Fred

I'll go through a bit of background on the "downscaling problem" and how to mitigate it. Note that this information applies to both manual downscaling as well as preemptible VMs getting preempted.
Background
Autoscaling removes nodes based on the amount of "available" YARN memory in the cluster. It does not take into account shuffle data on the cluster. Here's an illustration from a recent presentation we gave.
In a MapReduce-style job (Spark jobs are a set of MapReduce-style shuffles between stages), data from all mappers must get to all reducers. Mappers write their shuffle data to local disk, and then reducers fetch data from each mapper. There is a server on every node dedicated to serving shuffle data, and it runs outside of YARN. Therefore, a node can appear idle in YARN even though it needs to stay around to serve its shuffle data.
When a single node gets removed, pretty much all reducers will fail, since they all need to fetch data from every node. Reducers will specifically fail with FetchFailedException (as you saw), indicating they were unable to get shuffle data from a particular node. The driver will eventually re-run necessary mappers, and then re-run the reduce stage. Spark is a bit inefficient (https://issues.apache.org/jira/browse/SPARK-20178), but it works.
Note that you can lose nodes in one of three scenarios:
Intentionally removing nodes (autoscaling or manual downscaling)
Preemptible VMs
getting preempted. Preemptible VMs get preempted at least every 24 hours.
(Relatively rare) a standard GCE VM is
ungracefully terminated by GCE, and restarted. Usually, standard VMs
are transparently live migrated.
When you create an autoscaling cluster, Dataproc adds several properties to improve
job resiliency in the face of losing nodes:
yarn:yarn.resourcemanager.am.max-attempts=10
mapred:mapreduce.map.maxattempts=10
mapred:mapreduce.reduce.maxattempts=10
spark:spark.task.maxFailures=10
spark:spark.stage.maxConsecutiveAttempts=10
spark:spark.yarn.am.attemptFailuresValidityInterval=1h
spark:spark.yarn.executor.failuresValidityInterval=1h
Note that if you enable autoscaling on an existing cluster, it will not have these properties set. (But you can set them manually when creating a cluster).
Mitigations
1) Use graceful decommissioning
Dataproc integrates with YARN's Graceful Decommissioning, and can be set on Autoscaling Policies or manual downscale operations.
When gracefully decommissioning a node, YARN keeps it around until applications that ran containers on the node finish, but does not let it run new containers. That gives nodes an opportunity to serve their shuffle data before being removed.
You will need to ensure that your graceful decommission timeout is long enough to encompass your longest jobs. The autoscaling docs suggest 1h as a starting point.
Note that graceful decommissioning only really makes sense only long-running clusters that process lots of short jobs.
On ephemeral clusters, you would be better off "right-sizing" the cluster from the start, or disabling downscaling unless the cluster is completely idle (set scaleDownMinWorkerFraction=1.0).
2) Avoid preemptible VMs
Even when using graceful decommissioning, preemptible VMs will be periodically terminated through "preemptions". GCE guarantees preemptible VMs will get preempted within 24 hours, and preemptions on large clusters are very spread out.
If you are using graceful decommissioning, and the FetchFailedException error messages include -sw-, you are likely seeing fetch failures due to nodes being preempted.
You have two options to avoid using preemptible VMs:
1. In your autoscaling policy, you can set secondaryWorkerConfig to have 0 min and max instances, and instead put all workers in the primary group.
2. Alternatively, you can keep using "secondary" workers, but set --properties dataproc:secondary-workers.is-preemptible.override=false. That will make your secondary workers be standard VMs.
3) Long term: Enhanced Flexibility Mode
Dataproc's Enhanced Flexibility Mode is the long term answer to the shuffle problem.
The downscaling problem is caused by shuffle data getting stored on local disk. EFM will include new shuffle implementations that allow placing shuffle data on a fixed set of nodes (e.g. just primary workers), or on storage outside of the cluster.
That will make secondary workers stateless, which means they can be removed at any time. This makes autoscaling far more compelling.
At the moment, EFM is still in Alpha, and does not scale to real-world workloads, but look out for a production-ready Beta by the summer.

Uneven distribution of tasks among the spark executors

I am using spark-streaming 2.2.1 on production and in this application i read the data from RabbitMQ and do the further processing and finally save it in the cassandra. So, i am facing this strange issue where number of tasks are not evenly distributed amongst executors on one of the nodes. I restarted the streaming but still issue persist.
As you can see on 10.10.4.72 i have 2 executors. The one running on 41893 port has completed approx. double the number of tasks on rest of the nodes (10.10.3.73 and 10.10.3.72). where as executor running on 33451 port on 10.10.4.72 has completed only 18 tasks. And this issue persist even if i restart my spark-streaming.
Edit question
After 12 hours still as you can see in the below image the same executor has not processed not even a single task in this duration.

Cloudera Execution Problem: Problem:Initial job has not accepted any resources

I'm trying to fetch some data from Cloudera's Quick Start Hadoop distribution (a Linux VM for us) on our SAP HANA database using SAP Spark Controller. Every time I trigger the job in HANA, it gets stuck and I see the following warning being logged continuously every 10-15 seconds in SPARK Controller's log file, unless I kill the job.
WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Although it's logged like a warning it looks like it's a problem that prevents the job from executing on Cloudera. From what I read, it's either an issue with the resource management on Cloudera, or an issue with blocked ports. In our case we don't have any blocked ports so it must be the former.
Our Cloudera is running a single node and has 16GB RAM with 4 CPU cores.
Looking at the overall configuration I have a bunch of warnings, but I can't determine if they are relevant to the issue or not.
Here's also how the RAM is distributed on Cloudera
It would be great if you can help me pinpoint the cause for this issue because I've been trying various combinations of things over the past few days without any success.
Thanks,
Dimitar

You're trying to use the Cloudera Quickstart VM‎ for a purpose beyond it's capacity. It's really meant for someone to play around with Hadoop and CDH and should not be used for any production level work.
Your Node Manager only has 5GB of memory to use for compute resources. In order to do any work, you need to create an Application Master(AM) and a Spark Executor and then have reserve memory for your executors which you won't have on a Quickstart VM.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string