best way to run 300+ concurrent spark jobs in Dataproc? - apache-spark

I have a Dataproc cluster with 2 worker nodes (n1s2). An external server submits around 360 Spark jobs within an hour (with a couple of minutes' spacing between submissions). The first job completes successfully, but the subsequent ones get stuck and do not proceed at all.
Each job crunches some time-series numbers and writes to Cassandra. The run time is usually 3-6 minutes when the cluster is completely free.
I feel this could be solved by just scaling up the cluster, but that would become very costly for me.
What would be the other options to best solve this use case?

Running 300+ concurrent jobs on a cluster with 2 worker nodes doesn't sound feasible. You first want to estimate how much resource (CPU, memory, disk) each job needs, then plan the cluster size accordingly. YARN metrics such as available CPU, available memory, and especially pending memory are helpful for telling whether the cluster is simply short on resources.
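A quick way to check this is some back-of-envelope arithmetic plus the YARN metrics mentioned above. The sketch below assumes the numbers from the question and hypothetical per-job executor sizes, with a placeholder ResourceManager host; field names in the metrics API can vary slightly by Hadoop version.

```python
import requests  # only needed for the YARN metrics check below

# Back-of-envelope concurrency estimate using the question's numbers and
# hypothetical per-job executor sizes (replace with what your jobs really use).
jobs_per_hour = 360
avg_job_minutes = 5                                       # jobs take 3-6 minutes on an idle cluster
concurrent_jobs = jobs_per_hour * avg_job_minutes / 60.0  # ~30 jobs overlapping at any time

executors_per_job, exec_mem_gb, exec_cores = 2, 4, 2      # hypothetical sizing
print(f"~{concurrent_jobs:.0f} overlapping jobs -> "
      f"~{concurrent_jobs * executors_per_job * exec_mem_gb:.0f} GB RAM and "
      f"~{concurrent_jobs * executors_per_job * exec_cores:.0f} vCores of YARN capacity")

# Cluster-wide YARN metrics: apps stuck in ACCEPTED while available memory is near zero
# is the classic sign that the cluster is out of resources, not that the jobs are broken.
rm = "http://<resource-manager-host>:8088"                # placeholder host
metrics = requests.get(f"{rm}/ws/v1/cluster/metrics").json()["clusterMetrics"]
print("availableMB:", metrics.get("availableMB"),
      "allocatedMB:", metrics.get("allocatedMB"),
      "appsPending:", metrics.get("appsPending"))
```

If the overlap estimate comes out far above what two small workers can provide, the jobs are queuing behind each other rather than hanging.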

Related

How is abnormal Driver termination handled for a Spark App in Yarn cluster mode

We're using AWS EMR for our Spark jobs. All our jobs are submitted in YARN cluster mode, so the driver runs on one of the cluster nodes. We use an on-demand instance for the master and spot instances for the core nodes. Although we almost always choose instance types with a < 5% interruption rate, sometimes a significant fraction of our cluster nodes gets terminated prematurely (probably because of higher demand).
So I was wondering: in the above situation, what happens if the node containing the driver process goes down? Is there any chance of recovery for the Spark job in that case, or is the job gone forever?
The Spark driver is a single point of failure because it holds all the cluster state for the running application.
In practice, non-ephemeral storage can be used for checkpointing batch applications after expensive transformations. That said, trying to restart after such a situation can be done, but when I looked into it, it was quite difficult to say the least. I asked such a question under my own name some time ago; you can find it. I am quite technical but felt: gosh, what a lot of hard work.
So recovery means rolling your own solution, or accepting a re-run. Since I last evaluated EMR I see that the driver can run on the master node and that can be failed over, but that is not the same thing as far as I can see, nor what you want.
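For the checkpointing idea, a minimal PySpark sketch might look like the following; the bucket paths and the groupBy stand in for real inputs and for the expensive transformation, and recovery after a lost driver still means re-submitting and reading the checkpointed data back yourself.

```python
from pyspark.sql import SparkSession

# Sketch: checkpoint to durable (non-ephemeral) storage after an expensive step.
# Paths are placeholders; the groupBy is a stand-in for the real costly work.
spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("s3://my-bucket/spark-checkpoints")

df = spark.read.parquet("s3://my-bucket/input")
expensive = df.groupBy("key").count()

# Materialise the intermediate result so a re-run can start from here instead of
# recomputing everything from the source data.
checkpointed = expensive.checkpoint(eager=True)
checkpointed.write.mode("overwrite").parquet("s3://my-bucket/output")
```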
EMR uses YARN node labels for CORE nodes: your Spark driver / ApplicationMaster is only created on CORE nodes, and HDFS also resides only on CORE nodes.
So the best way to handle your situation is to use both CORE and TASK instance groups.
What you can do to tackle this:
MASTER: On-demand.
CORE: On-demand. The minimum number of instances can be 1.
TASK: Spot with autoscaling and a minimal EBS volume. The minimum number of instances can be 0 in this case.
This will reduce your cost and also ensure that the node containing the driver process never goes down.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html
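A rough boto3 sketch of that MASTER / CORE / TASK split is below; the release label, instance types, roles, and counts are placeholders, and an autoscaling policy for the TASK group would be attached separately.

```python
import boto3

# Sketch of the layout described above: on-demand MASTER and CORE (so the driver/AM
# and HDFS never sit on spot capacity), spot TASK instances for cheap executors.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-on-demand-core-spot-task",       # hypothetical cluster name
    ReleaseLabel="emr-6.9.0",                    # use whatever release you run today
    Applications=[{"Name": "Spark"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Spot TASK group; autoscaling can shrink this towards 0 when idle.
            {"InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
    },
)
print(response["JobFlowId"])
```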

Does memory configuration really matter with fair scheduler?

We have a Hadoop cluster with the Fair Scheduler configured. We used to see that when there were not many jobs in the cluster, the running job would try to take as much memory and as many cores as were available.
With the Fair Scheduler, do executor memory and cores really matter for Spark jobs? Or does the Fair Scheduler decide how much to give them?
It's the policy of the Fair Scheduler that the first job assigned to it gets all the resources available.
When a second job is run, the resources are divided into (available resources) / (number of jobs).
The main thing to look at is the maximum container memory you have given the job. If it is equal to the total resources available, then it is expected for your job to use all of them.
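In other words, the executor settings still matter because they define the size of each container the job requests; the Fair Scheduler only decides what share of the queue those requests can actually get. A small illustrative PySpark configuration (values and queue name are hypothetical):

```python
from pyspark.sql import SparkSession

# The executor settings cap what *this* job asks YARN for; the Fair Scheduler then
# decides how much of that ask is granted once other jobs arrive. Values are examples.
spark = (SparkSession.builder
         .appName("fair-scheduler-sizing")
         .config("spark.executor.memory", "4g")      # per-container memory request
         .config("spark.executor.cores", "2")        # per-container vCores
         .config("spark.executor.instances", "10")   # total ask; may be trimmed to the fair share
         .config("spark.yarn.queue", "analytics")    # hypothetical Fair Scheduler queue
         .getOrCreate())
```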

Single worker node stalls job

I have a spark-submit job (PySpark) that works properly 90% of the time, but for the other 10% it stalls on a specific host. Tasks that take seconds to complete on other hosts sometimes grind to a halt on a host I can identify via the Spark UI. In such cases I end up killing the process and re-running. I am wondering what my options are to mitigate this issue.
My infrastructure is a standalone Spark 2.1 cluster on EC2 instances on Amazon AWS. I have considered speculative execution, but my process writes to S3, and I've been advised that enabling speculative execution for processes that persist to S3 is a bad idea. Any suggestions are welcome.
Stalling at 90% is not unusual if your data is skewed, i.e. some partitions hold really large amounts of data, which can lead to a lot of GC and OOM errors.
In this case, repartitioning the data, e.g. via a RangePartitioner, would be a solution.
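A minimal PySpark sketch of that repartitioning, assuming a hypothetical skewed key column; note that repartitionByRange (the DataFrame analogue of a RangePartitioner) needs Spark 2.3+, so on the 2.1 cluster above a plain key-based repartition is the nearer option.

```python
from pyspark.sql import SparkSession

# Spread a skewed dataset across more, evener partitions before the heavy stage.
# Paths, column name and partition count are illustrative.
spark = SparkSession.builder.appName("skew-repartition").getOrCreate()

df = spark.read.parquet("s3://my-bucket/events")
balanced = df.repartition(200, "customer_id")            # hash-partition on the hot key
# balanced = df.repartitionByRange(200, "customer_id")   # range-based, Spark 2.3+ only
balanced.write.mode("overwrite").parquet("s3://my-bucket/events_balanced")
```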

Doubts related to Spark resource usage

I am running a Spark Streaming application and caching RDDs for history look-back. My batch duration is one minute and the average processing time is 14 seconds, so the executors are idle for most of the batch interval. Are the executors still held because I am caching the RDDs in memory? And if they are held, should that be considered a waste of resources?
It depends on what you want to achieve.
In Spark 2.0, dynamic allocation works with Spark Streaming without the earlier bugs.
The old problem was that, under a heavy workload, if you did not keep at least one executor for the receiver you could lose data. That is solved in Spark 2.0, and releasing executors now works.
What is the advantage of keeping your data in cache when a huge amount of data comes in? You can reuse it without re-doing the shuffle, which can improve your response time.
But if your batch interval is one minute and processing takes only about 14 seconds on average, I suggest you release your cached data and your idle workers to make room for other tasks.
If you later don't have enough resources for your tasks, they will be queued and handled as soon as resources become available.
What is the risk? If you release the workers, it can be hard to get the resources back if you don't have preemption enabled in YARN; whether that is a waste of resources depends on your cluster.
What I would do: create queues that can handle your jobs, put your streaming application in a high-priority queue and the other jobs in other queues, turn on dynamic allocation, and release the cache. If your application needs more resources, YARN will provide them.
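A hedged sketch of that setup in PySpark is below; the queue name is hypothetical, and in 2.x Spark Streaming also has its own spark.streaming.dynamicAllocation.enabled switch, so check which variant applies to your deployment.

```python
from pyspark.sql import SparkSession

# Dynamic allocation lets YARN reclaim executors that sit idle between one-minute batches;
# unpersisting stale cached RDDs frees the memory that would otherwise pin them.
spark = (SparkSession.builder
         .appName("streaming-dynamic-allocation")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")        # needed for dynamic allocation on YARN
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .config("spark.yarn.queue", "streaming_high_priority")  # hypothetical high-priority queue
         # For DStream apps, spark.streaming.dynamicAllocation.enabled is the
         # streaming-aware variant; enable one or the other, not both.
         .getOrCreate())

# Inside the job, once an RDD cached for history look-back is no longer needed:
# old_rdd.unpersist()   # releases executor storage so those executors can be given back
```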

Spark SQL - Not using all the vCores available

I am running a set of Spark SQL queries in parallel on Google Dataproc. I spin up my own cluster, and ideally it should consume all the resources available. However, only 40 vCores are used even though 320 vCores are available. Do you know how I can tune the performance in this case?
I have tried different numbers of cores and executors. Although some apps are pending, they don't seem to take the additional resources. I am spinning up a cluster of 20 nodes, but the computation still takes a lot of time.
I am setting the number of partitions to 50 to restrict the number of output files. Even if I skip setting it, performance doesn't seem to improve.
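For reference, the knobs being tuned here look roughly like the sketch below; the numbers are illustrative rather than a fix, and the table and bucket names are placeholders. Keeping shuffle parallelism high during the query and only coalescing to 50 at write time is one way to limit output files without also limiting compute parallelism.

```python
from pyspark.sql import SparkSession

# Illustrative settings: executor count/size determines how many of the cluster's vCores
# one application can claim; the output-file count is controlled separately at write time.
spark = (SparkSession.builder
         .appName("dataproc-sql-tuning")
         .config("spark.executor.instances", "38")        # example sizing, leaves room for the AM
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "8g")
         .config("spark.sql.shuffle.partitions", "200")   # parallelism while the query runs
         .getOrCreate())

result = spark.sql("SELECT * FROM my_table")              # stand-in for the real queries
result.coalesce(50).write.mode("overwrite").parquet("gs://my-bucket/out")  # 50 files at write time
```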
