Spark 2.4 to Elasticsearch : prevent data loss during Dataproc nodes decommissioning? - apache-spark

My technical task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster.
We use Apache Spark 2.4 with the Elastic Hadoop connector on a Google Dataproc cluster (autoscaling enabled).
During the execution, if the Dataproc cluster downscaled, all tasks on the decommissioned node are lost and the processed data on this node are never pushed to elastic.
This problem does not exist when I save to GCS or HDFS for example.
How to make resilient this task even when nodes are decommissioned?
An extract of the stacktrace:
Lost task 50.0 in stage 2.3 (TID 427, xxxxxxx-sw-vrb7.c.xxxxxxx, executor 43): FetchFailed(BlockManagerId(30, xxxxxxx-w-23.c.xxxxxxx, 7337, None), shuffleId=0, mapId=26, reduceId=170, message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to xxxxxxx-w-23.c.xxxxxxx:7337
Caused by: java.net.UnknownHostException: xxxxxxx-w-23.c.xxxxxxx
Task 50.0 in stage 2.3 (TID 427) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
Thanks.
Fred

I'll go through a bit of background on the "downscaling problem" and how to mitigate it. Note that this information applies to both manual downscaling as well as preemptible VMs getting preempted.
Background
Autoscaling removes nodes based on the amount of "available" YARN memory in the cluster. It does not take into account shuffle data on the cluster. Here's an illustration from a recent presentation we gave.
In a MapReduce-style job (Spark jobs are a set of MapReduce-style shuffles between stages), data from all mappers must get to all reducers. Mappers write their shuffle data to local disk, and then reducers fetch data from each mapper. There is a server on every node dedicated to serving shuffle data, and it runs outside of YARN. Therefore, a node can appear idle in YARN even though it needs to stay around to serve its shuffle data.
When a single node gets removed, pretty much all reducers will fail, since they all need to fetch data from every node. Reducers will specifically fail with FetchFailedException (as you saw), indicating they were unable to get shuffle data from a particular node. The driver will eventually re-run necessary mappers, and then re-run the reduce stage. Spark is a bit inefficient (https://issues.apache.org/jira/browse/SPARK-20178), but it works.
Note that you can lose nodes in one of three scenarios:
Intentionally removing nodes (autoscaling or manual downscaling)
Preemptible VMs
getting preempted. Preemptible VMs get preempted at least every 24 hours.
(Relatively rare) a standard GCE VM is
ungracefully terminated by GCE, and restarted. Usually, standard VMs
are transparently live migrated.
When you create an autoscaling cluster, Dataproc adds several properties to improve
job resiliency in the face of losing nodes:
yarn:yarn.resourcemanager.am.max-attempts=10
mapred:mapreduce.map.maxattempts=10
mapred:mapreduce.reduce.maxattempts=10
spark:spark.task.maxFailures=10
spark:spark.stage.maxConsecutiveAttempts=10
spark:spark.yarn.am.attemptFailuresValidityInterval=1h
spark:spark.yarn.executor.failuresValidityInterval=1h
Note that if you enable autoscaling on an existing cluster, it will not have these properties set. (But you can set them manually when creating a cluster).
Mitigations
1) Use graceful decommissioning
Dataproc integrates with YARN's Graceful Decommissioning, and can be set on Autoscaling Policies or manual downscale operations.
When gracefully decommissioning a node, YARN keeps it around until applications that ran containers on the node finish, but does not let it run new containers. That gives nodes an opportunity to serve their shuffle data before being removed.
You will need to ensure that your graceful decommission timeout is long enough to encompass your longest jobs. The autoscaling docs suggest 1h as a starting point.
Note that graceful decommissioning only really makes sense only long-running clusters that process lots of short jobs.
On ephemeral clusters, you would be better off "right-sizing" the cluster from the start, or disabling downscaling unless the cluster is completely idle (set scaleDownMinWorkerFraction=1.0).
2) Avoid preemptible VMs
Even when using graceful decommissioning, preemptible VMs will be periodically terminated through "preemptions". GCE guarantees preemptible VMs will get preempted within 24 hours, and preemptions on large clusters are very spread out.
If you are using graceful decommissioning, and the FetchFailedException error messages include -sw-, you are likely seeing fetch failures due to nodes being preempted.
You have two options to avoid using preemptible VMs:
1. In your autoscaling policy, you can set secondaryWorkerConfig to have 0 min and max instances, and instead put all workers in the primary group.
2. Alternatively, you can keep using "secondary" workers, but set --properties dataproc:secondary-workers.is-preemptible.override=false. That will make your secondary workers be standard VMs.
3) Long term: Enhanced Flexibility Mode
Dataproc's Enhanced Flexibility Mode is the long term answer to the shuffle problem.
The downscaling problem is caused by shuffle data getting stored on local disk. EFM will include new shuffle implementations that allow placing shuffle data on a fixed set of nodes (e.g. just primary workers), or on storage outside of the cluster.
That will make secondary workers stateless, which means they can be removed at any time. This makes autoscaling far more compelling.
At the moment, EFM is still in Alpha, and does not scale to real-world workloads, but look out for a production-ready Beta by the summer.

Related

Scale up the spark worker nodes using code

I want to scale up the spark cluster to make all the worker nodes up and running before I start my processing. The issue is because the autoscaling of worker nodes is not happening immediately on load and is leading to worker node crashes. The cluster has 32 nodes but is overloading only 4 nodes and crashing so what I am trying to do is write some lines of code in the start of the python notebook which will kick start the remaining nodes and have 24 nodes up and running and then do the actual data processing. Is this possible using code ? Please advise.
In general, autoscale is for interactive workloads. I've rarely seen it provide benefits in jobs, though marketing makes a good job of selling it as a cost saving feature.
You can use Databricks jobs to create an automated cluster. When you run a job on a new automated cluster and terminates the cluster when the job is complete.
If you know when scaling up should happen better than auto scale then you can use this resize API: https://docs.databricks.com/dev-tools/api/latest/clusters.html#resize

How is abnormal Driver termination handled for a Spark App in Yarn cluster mode

We're using AWS EMR for our spark jobs. All our jobs are submitted in yarn cluster mode, so the driver will run in one of the cluster nodes. We use on-demand node for master, and spot-instances for the core nodes. Now, although we almost always choose instances with < 5% interruption rate, sometimes it so happens that a significant fraction of our cluster nodes get terminated prematurely (probably because of higher demands).
So, I was wondering, in the above situation, what happens if a node containing the driver process goes down? Is there any chance of recovery for the spark job in that case? Or is the job gone forever?
The Spark driver is a single point of failure because it holds all cluster state for the running App.
In practice non-ephemeral storage can be used for check-pointing batch Apps after expensive expensive transformations. That said, trying to re-start after such a situation can be done, but when I looked into it, it is quite difficult to say the least. I asked such a question under my name some time ago, you can find it. I am quite technical but felt: gosh what a lot of hard work.
So, the recovery means rolling your own stuff, or accepting a re-run. Since I last evaluated EMR I see that the driver can run on the Master and that can be failed-over, but that is not the same thing as far as I can see, nor what you wish.
EMR has node leveling for CORE nodes in Yarn. Your spark driver/ Application master only gets created in CORE nodes. And HDFS also resides in CORE nodes only.
So to handle your situation in a best way, you may consider to use both CORE and TASK group.
What you can do to tackle this -
MASTER: On-demand
CORE: On-demand. Minimum no of Instances can be 1.
TASK: Spot with autoscaling with minimal EBS volume. Minimum no of Instances can be 0 this case.
This will reduce your cost also ensure that node containing the driver process never goes down.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html

apache spark executors and data locality

The spark literature says
Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads.
And If I understand this right, In static allocation the executors are acquired by the Spark application when the Spark Context is created on all nodes in the cluster (in a cluster mode). I have a couple of questions
If executors are acquired on all nodes and will stay allocated to
this application during the the duration of the whole application,
isn't there a chance a lot of nodes remain idle?
What is the advantage of acquiring resources when Spark context is
created and not in the DAGScheduler? I mean the application could be
arbitrarily long and it is just holding the resources.
So when the DAGScheduler tries to get the preferred locations and
the executors in those nodes are running the tasks, would it
relinquish the executors on other nodes?
I have checked a related question
Does Spark on yarn deal with Data locality while launching executors
But I'm not sure there is a conclusive answer
If executors are acquired on all nodes and will stay allocated to this application during the the duration of the whole application, isn't there a chance a lot of nodes remain idle?
Yes there is chance. If you have data skew this will happen. The challenge is to tune the executors and executor core so that you get maximum utilization. Spark also provides dynamic resource allocation which ensures the idle executors are removed.
What is the advantage of acquiring resources when Spark context is created and not in the DAGScheduler? I mean the application could be arbitrarily long and it is just holding the resources.
Spark tries to keep data in memory while doing transformation. Contrary to map-reduce model where after every Map operation it writes to disk. Spark can keep the data in memory only if it can ensure the code is executed in the same machine. This is the reason of allocating resource beforehand.
So when the DAGScheduler tries to get the preferred locations and the executors in those nodes are running the tasks, would it relinquish the executors on other nodes?
Spark can't start a task on an executor unless the executor is free. Now spark application master negotiates with the yarn to get the preferred location. It may or may not get that. If it doesn't get, it will start task in different executor.

auto scale spark cluster

I have a spark streaming job running on a cluster. Spark job pulls messages from Kafka and do the required processing before dumping the processed data to database. I have sized my cluster as per the current load. But this load requirement may go up/down in the future. I want to know the techniques to facilitate this auto scaling without restarting the job. Scaling becomes more complicated if kakfa is being used (as in my case) as I won't like the partitions to be moved around in stateful streaming. Currently the cluster is completely in house but I won't mind migrating to cloud if that assists the scaling use case.
it is not an answer. Just some notes
"in stateful streaming". What did you mean by that? All state in spark is distributed. And you should not rely on local system, as if some task failed, it can be send to any other executor.
do you speak about increasing size of cluster or resources dedicated for your spark job in cluster?
If the first one, you need to monitor each node (memory, cpu) and when it's time (hit some threshold) add more nodes.
If the second one: we didn't find nice solution. Spark provides 'autoscaling' feature, however it doesn't work properly with kafka streaming.

Pausing Dataproc cluster - Google Compute engine

is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs ? The cluster management instructions at this link: https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/clusters/
only show how to destroy a cluster but I have installed spark cassandra connector API for example. Is my only alternative to just creating an image that I'll need to install every time ?
In general, the best thing to do is to distill out the steps you used to customize your cluster into some setup scripts, and then use Dataproc's initialization actions to easily automate doing the installation during cluster deployment.
This way, you can easily reproduce the customizations without requiring manual involvement if you ever want, for example, to do the same setup on multiple concurrent Dataproc clusters, or want to change machine types, or receive sub-minor-version bug fixes that Dataproc releases occasionally.
There's indeed no officially supported way of pausing a Dataproc cluster at the moment, in large part simply because being able to have reproducible cluster deployments along with several other considerations listed below means that 99% of the time it's better to use initialization-action customizations instead of pausing a cluster in-place. That said, there are possible short-term hacks, such as going into the Google Compute Engine page, selecting the instances that are part of the Dataproc cluster you want to pause, and clicking "stop" without deleting them.
The Compute Engine hourly charges and Dataproc's per-vCPU charges are only incurred when the underlying instance is running, so while you've "stopped" the instances manually, you won't incur Dataproc or Compute Engine's instance-hour charges despite Dataproc still listing the cluster as "RUNNING", albeit with warnings that you'll see if you go to the "VM Instances" tab of the Dataproc cluster summary page.
You should then be able to just click "start" from the Google Compute Engine page page to have the cluster running again, but it's important to consider the following caveats:
The cluster may occasionally fail to start up into a healthy state again; anything using local SSDs already can't be stopped and started again cleanly, but beyond that, Hadoop daemons may have failed for whatever reason to flush something important to disk if the shutdown wasn't orderly, or even user-installed settings may have broken the startup process in unknown ways.
Even when VMs are "stopped", they depend on the underlying Persistent Disks remaining, so you'll continue to incur charges for those even while "paused"; if we assume $0.04 per GB-month, and a default 500GB disk per Dataproc node, that comes out to continuing to pay ~$0.028/hour per instance; generally your data will be more accessible and also cheaper to just put in Google Cloud Storage for long term storage rather than trying to keep it long-term on the Dataproc cluster's HDFS.
If you come to depend on a manual cluster setup too much, then it'll become much more difficult to re-do if you need to size up your cluster, or change machine types, or change zones, etc. In contrast, with Dataproc's initialization actions, you can use Dataproc's cluster scaling feature to resize your cluster and automatically run the initialization actions for new workers created.
Update
Dataproc recently launched the ability to stop and start clusters: https://cloud.google.com/dataproc/docs/guides/dataproc-start-stop

Resources