Doubts related to Spark resource usage - apache-spark

I am executing a Spark Streaming application and I am caching the rdds for history look-back, my batch is of duration one minute and average processing time is 14 seconds, So executors are not computing for whole batch duration. So are executors, are still hold up as I am caching the rdds in memory. And if executors are hold up should we consider this holding up of executors is wastage of resources.

It depends of what you want to achieve.
In Spark 2.0, dynamic allocation is enabled to Spark Streaming with no Bugs.
What was the problem, if you have a huge workload of data if you don't keep at least one executor for the data receive you may lose data. Now this is solved with Spark 2.0 and the release of the data is working.
What is the advantage of keeping your data in cache when a huge amount of data comes? You can process your data without a shuffle, it can increase your response time.
But, if you have a process of 1 by 1 minute. And it takes just 14 seconds to process your data in an average time. I suggest you to release your data and release your workers to open space for other tasks.
If you will not have enough resources for your tasks, the tasks will be queued and will be handled as soon you have the resources.
What is the risk? If you release the workers could be hard to get the resources back if you don't have preemption in your Yarn. This can be a waste of resource depends of your cluster.
What I would do is: create some queues that can handle your jobs. Set the High priority queue, set your streaming there, the other jobs in other queues and turn on the Dynamic Allocation and release the cache. If your application needs something with more resources the Yarn will handle it.

Related

best way to run 300+ concurrent spark jobs in Dataproc?

I have a Dataproc cluster with 2 worker nodes (n1s2). There is an external server which submits around 360 spark jobs within an hour (with a couple of minutes spacing between each submission). The first job completes successfully but the subsequent ones get stuck and do not proceed at all.
Each job crunches some timeseries numbers and writes to Cassandra. And the time taken is usually 3-6 minutes when the cluster is completely free.
I feel this can be solved by just scaling up the cluster, but would become very costly for me.
What would be the other options to best solve this use case?
Running 300+ concurrent jobs on a 2 worker nodes cluster doesn't sound like feasible. You want to first estimate how much resource (CPU, memory, disk) each job needs then make a plan for the cluster size. YARN metrics like available CPU, available memory, especially pending memory would be helpful for identifying the situation where it is lack of resources.

Spark 2.4 to Elasticsearch : prevent data loss during Dataproc nodes decommissioning?

My technical task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster.
We use Apache Spark 2.4 with the Elastic Hadoop connector on a Google Dataproc cluster (autoscaling enabled).
During the execution, if the Dataproc cluster downscaled, all tasks on the decommissioned node are lost and the processed data on this node are never pushed to elastic.
This problem does not exist when I save to GCS or HDFS for example.
How to make resilient this task even when nodes are decommissioned?
An extract of the stacktrace:
Lost task 50.0 in stage 2.3 (TID 427, xxxxxxx-sw-vrb7.c.xxxxxxx, executor 43): FetchFailed(BlockManagerId(30, xxxxxxx-w-23.c.xxxxxxx, 7337, None), shuffleId=0, mapId=26, reduceId=170, message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to xxxxxxx-w-23.c.xxxxxxx:7337
Caused by: java.net.UnknownHostException: xxxxxxx-w-23.c.xxxxxxx
Task 50.0 in stage 2.3 (TID 427) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).
Thanks.
Fred
I'll go through a bit of background on the "downscaling problem" and how to mitigate it. Note that this information applies to both manual downscaling as well as preemptible VMs getting preempted.
Background
Autoscaling removes nodes based on the amount of "available" YARN memory in the cluster. It does not take into account shuffle data on the cluster. Here's an illustration from a recent presentation we gave.
In a MapReduce-style job (Spark jobs are a set of MapReduce-style shuffles between stages), data from all mappers must get to all reducers. Mappers write their shuffle data to local disk, and then reducers fetch data from each mapper. There is a server on every node dedicated to serving shuffle data, and it runs outside of YARN. Therefore, a node can appear idle in YARN even though it needs to stay around to serve its shuffle data.
When a single node gets removed, pretty much all reducers will fail, since they all need to fetch data from every node. Reducers will specifically fail with FetchFailedException (as you saw), indicating they were unable to get shuffle data from a particular node. The driver will eventually re-run necessary mappers, and then re-run the reduce stage. Spark is a bit inefficient (https://issues.apache.org/jira/browse/SPARK-20178), but it works.
Note that you can lose nodes in one of three scenarios:
Intentionally removing nodes (autoscaling or manual downscaling)
Preemptible VMs
getting preempted. Preemptible VMs get preempted at least every 24 hours.
(Relatively rare) a standard GCE VM is
ungracefully terminated by GCE, and restarted. Usually, standard VMs
are transparently live migrated.
When you create an autoscaling cluster, Dataproc adds several properties to improve
job resiliency in the face of losing nodes:
yarn:yarn.resourcemanager.am.max-attempts=10
mapred:mapreduce.map.maxattempts=10
mapred:mapreduce.reduce.maxattempts=10
spark:spark.task.maxFailures=10
spark:spark.stage.maxConsecutiveAttempts=10
spark:spark.yarn.am.attemptFailuresValidityInterval=1h
spark:spark.yarn.executor.failuresValidityInterval=1h
Note that if you enable autoscaling on an existing cluster, it will not have these properties set. (But you can set them manually when creating a cluster).
Mitigations
1) Use graceful decommissioning
Dataproc integrates with YARN's Graceful Decommissioning, and can be set on Autoscaling Policies or manual downscale operations.
When gracefully decommissioning a node, YARN keeps it around until applications that ran containers on the node finish, but does not let it run new containers. That gives nodes an opportunity to serve their shuffle data before being removed.
You will need to ensure that your graceful decommission timeout is long enough to encompass your longest jobs. The autoscaling docs suggest 1h as a starting point.
Note that graceful decommissioning only really makes sense only long-running clusters that process lots of short jobs.
On ephemeral clusters, you would be better off "right-sizing" the cluster from the start, or disabling downscaling unless the cluster is completely idle (set scaleDownMinWorkerFraction=1.0).
2) Avoid preemptible VMs
Even when using graceful decommissioning, preemptible VMs will be periodically terminated through "preemptions". GCE guarantees preemptible VMs will get preempted within 24 hours, and preemptions on large clusters are very spread out.
If you are using graceful decommissioning, and the FetchFailedException error messages include -sw-, you are likely seeing fetch failures due to nodes being preempted.
You have two options to avoid using preemptible VMs:
1. In your autoscaling policy, you can set secondaryWorkerConfig to have 0 min and max instances, and instead put all workers in the primary group.
2. Alternatively, you can keep using "secondary" workers, but set --properties dataproc:secondary-workers.is-preemptible.override=false. That will make your secondary workers be standard VMs.
3) Long term: Enhanced Flexibility Mode
Dataproc's Enhanced Flexibility Mode is the long term answer to the shuffle problem.
The downscaling problem is caused by shuffle data getting stored on local disk. EFM will include new shuffle implementations that allow placing shuffle data on a fixed set of nodes (e.g. just primary workers), or on storage outside of the cluster.
That will make secondary workers stateless, which means they can be removed at any time. This makes autoscaling far more compelling.
At the moment, EFM is still in Alpha, and does not scale to real-world workloads, but look out for a production-ready Beta by the summer.

apache spark executors and data locality

The spark literature says
Each application gets its own executor processes, which stay up for
the duration of the whole application and run tasks in multiple
threads.
And If I understand this right, In static allocation the executors are acquired by the Spark application when the Spark Context is created on all nodes in the cluster (in a cluster mode). I have a couple of questions
If executors are acquired on all nodes and will stay allocated to
this application during the the duration of the whole application,
isn't there a chance a lot of nodes remain idle?
What is the advantage of acquiring resources when Spark context is
created and not in the DAGScheduler? I mean the application could be
arbitrarily long and it is just holding the resources.
So when the DAGScheduler tries to get the preferred locations and
the executors in those nodes are running the tasks, would it
relinquish the executors on other nodes?
I have checked a related question
Does Spark on yarn deal with Data locality while launching executors
But I'm not sure there is a conclusive answer
If executors are acquired on all nodes and will stay allocated to this application during the the duration of the whole application, isn't there a chance a lot of nodes remain idle?
Yes there is chance. If you have data skew this will happen. The challenge is to tune the executors and executor core so that you get maximum utilization. Spark also provides dynamic resource allocation which ensures the idle executors are removed.
What is the advantage of acquiring resources when Spark context is created and not in the DAGScheduler? I mean the application could be arbitrarily long and it is just holding the resources.
Spark tries to keep data in memory while doing transformation. Contrary to map-reduce model where after every Map operation it writes to disk. Spark can keep the data in memory only if it can ensure the code is executed in the same machine. This is the reason of allocating resource beforehand.
So when the DAGScheduler tries to get the preferred locations and the executors in those nodes are running the tasks, would it relinquish the executors on other nodes?
Spark can't start a task on an executor unless the executor is free. Now spark application master negotiates with the yarn to get the preferred location. It may or may not get that. If it doesn't get, it will start task in different executor.

How does spark.dynamicAllocation.enabled influence the order of jobs?

Need an understanding on when to use spark.dynamicAllocation.enabled - What are advantages and disadvantages of using it? I have queue where jobs get submitted.
9:30 AM --> Job A gets submitted with dynamicAllocation enabled.
10:30 AM --> Job B gets submitted with dynamicAllocation enabled.
Note: My Data is huge (processing will be done on 10GB data with transformations).
Which Job gets the preference on allocation of executors to Job A or Job B and how does the spark co-ordinates b/w 2 applications?
Dynamic Allocation of Executors is about resizing your pool of executors.
Quoting Dynamic Allocation:
spark.dynamicAllocation.enabled Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.
And later on in Dynamic Resource Allocation:
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
In other words, job A will usually finish before job B will be executed. Spark jobs are usually executed sequentially, i.e. a job has to finish before another can start.
Usually...
SparkContext is thread-safe and can handle jobs from a Spark application. That means that you may submit jobs at the same time or one after another and in some configuration expect that these two jobs will run in parallel.
Quoting Scheduling Within an Application:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
Wrapping up...
Which Job gets the preference on allocation of executors to Job A or Job B and how does the spark co-ordinates b/w 2 applications?
Job A.
Unless you have enabled Fair Scheduler Pools:
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options (e.g. weight) for each pool. This can be useful to create a “high-priority” pool for more important jobs, for example, or to group the jobs of each user together and give users equal shares regardless of how many concurrent jobs they have instead of giving jobs equal shares.

Spark on Mesos - running multiple Streaming jobs

I have 2 spark streaming jobs that I want to run, as well as keeping some available resources for batch jobs and other operations.
I evaluated Spark Standalone cluster manager, but I realized that I would have to fix the resources for two jobs, which would leave almost no computing power to batch jobs.
I started evaluating Mesos, because it has "fine grained" execution model, where resources are shifted between Spark applications.
1) Does it mean that a single core can be shifted between 2 streaming applications?
2) Although I have spark & cassandra, in order to exploit data locality, do I need to have dedicated core on each of the slave machines to avoid shuffling?
3) Would you recommend running Streaming jobs in "fine grained" or "course grained" mode. I know that logical answer is course grained (in order to minimize the latency of streaming apps) but what when resource in total cluster are limited (cluster of 3 nodes, 4 cores each - there are 2 streaming applications to run and multiple time to time batch jobs)
4) In Mesos, when I run spark streaming job in cluster mode, will it occupy 1 core permanently (like Standalone cluster manager is doing), or will that core execute driver process and sometimes act as executor?
Thank you
Fine grained mode is actually deprecated now. Even with it, each core is allocated to task until completion, but in Spark Streaming, each processing interval is a new job, so tasks only last as long the time it takes to process each interval's data. Hopefully that time is less than the interval time or your processing will back up, eventually running out of memory to store all those RDDs waiting for processing.
Note also that you'll need to have one core dedicated to each stream's Reader. Each will be pinned for the life of the stream! You'll need extra cores in case the stream ingestion needs to be restarted; Spark will try to use a different core. Plus you'll have a core tied up by your driver, if it's also running on the cluster (as opposed to on your laptop or something).
Still, Mesos is a good choice, because it will allocate the tasks to nodes that have capacity to run them. Your cluster sounds pretty small for what you're trying to do, unless the data streams are small themselves.
If you use the Datastax connector for Spark, it will try to keep input partitions local to the Spark tasks. However, I believe that connector assumes it will manage Spark itself, using Standalone mode. So, before you adopt Mesos, check to see if that's really all you need.

Resources