Spark Direct Stream Concurrent Job Limit - apache-spark

I am running a Spark direct stream from Kafka where I need to run many concurrent jobs in order to process all the data in time. In Spark you can set spark.streaming.concurrentJobs to the number of concurrent jobs you want to run.
What I want to know is a logical way to determine how many concurrent jobs I can run within my given environment. For privacy reasons at my company, I cannot share the specs that I have, but what I want to know is which specs are relevant in determining a limit, and why.
Of course, the alternative is to keep increasing the value, testing, and adjusting based on the results, but I would like a more logical approach, and I want to actually understand what determines that limit and why.

Testing different numbers of concurrent jobs and comparing the overall execution time is the most reliable method. However, I suppose the best number is roughly equal to the value of Runtime.getRuntime().availableProcessors().
So my advice is to start with the number of available processors, then increase and decrease it by 1, 2, and 3. Then plot execution time against the number of jobs, and the optimal number of jobs should be visible.
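The sweep described above can be sketched in Python. This is only an illustration: `measure_seconds` is a hypothetical callback standing in for however you time a run with a given spark.streaming.concurrentJobs value, and `os.cpu_count()` is used as the Python analogue of Runtime.getRuntime().availableProcessors().

```python
import os

def candidate_job_counts(cores, spread=3):
    """Values to try: the core count plus/minus 1..spread, never below 1."""
    return [n for n in range(cores - spread, cores + spread + 1) if n >= 1]

def best_job_count(measure_seconds, cores=None, spread=3):
    """Time each candidate and return (best_count, timings_by_count)."""
    if cores is None:
        cores = os.cpu_count() or 1
    timings = {n: measure_seconds(n) for n in candidate_job_counts(cores, spread)}
    best = min(timings, key=timings.get)
    return best, timings

# Made-up timing curve for demonstration only; real numbers must come from test runs.
def fake_measure(n):
    return abs(n - 6) + 10.0

best, timings = best_job_count(fake_measure, cores=8)
print(best)  # 6 for this made-up curve
```

The `timings` dictionary is exactly the data needed for the chart of execution time against number of jobs.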

Related

Best practices to submit a huge number of jobs with slurm

I need to submit several thousand jobs to our cluster. Each job needs around six hours to complete. This would take around a week if I used all available resources. Theoretically I could do that, but then I would block all other users for a week, so this is not an option.
I have two ideas that could possibly solve the problem:
Create an array job and limit the maximum number of running jobs. I don't like this option because quite often (overnight, on weekends, etc.) no one uses the cluster, and my jobs would not be able to use these idle resources.
Submit all jobs at once but somehow set the priority of each job really low. Ideally anyone could still use the cluster, because the jobs they submit would start sooner than mine. I do not know if this is possible in slurm or whether I would have permission to do it.
Is there a slurm mechanism I am missing? Is it possible to set the priority of a slurm job as described above, and would I have permission to do that?
Generally this is a problem for the cluster admins. They should have configured the cluster in a way that prioritizes short and small jobs over long and large ones and/or prevents large jobs from running on some nodes.
However, as a non-admin you can also manually reduce the priority of your job with the nice factor option (a higher value means lower priority):
sbatch --nice=POSITIVE_NUMBER script.sh
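When thousands of jobs are submitted from a script, the same option can be assembled programmatically. The sketch below only builds the argument list and does not submit anything; the nice value of 100 and the script name are placeholders, and how much a given value actually lowers priority depends on the site configuration.

```python
import shlex

def sbatch_command(script, nice):
    """Build an sbatch argument list that lowers priority via --nice."""
    if nice <= 0:
        # Lowering priority as a regular user means a positive adjustment.
        raise ValueError("nice must be a positive integer")
    return ["sbatch", f"--nice={nice}", script]

cmd = sbatch_command("script.sh", 100)
print(shlex.join(cmd))  # sbatch --nice=100 script.sh
```

On a cluster, the list could be handed to `subprocess.run(cmd, check=True)` once per job.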

large gap between tasks in same job/stage in spark

I have a job that should take less than 1 sec.
In this case it takes around 10-12 sec. Drilling down into one stage shows that the tasks are running fine; the longest-running task took 0.4 sec:
However, when looking at the timeline, there is a large gap (~10 sec.) between some tasks under the same stage:
Is there anything I'm missing?
What should I configure in order to avoid that long gap?
Edit:
Here is the entire list of tasks in the timeline; it seems pretty balanced.
Try to repartition your RDDs so that each partition contains roughly the same volume of data. This kind of problem often happens when partitions hold heavily imbalanced data volumes. Check this article; it may help in understanding partitioning and its effects on performance:
https://dzone.com/articles/apache-spark-performance-tuning-degree-of-parallel
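One way to check whether partitions are imbalanced before repartitioning is to compare per-partition record counts. Below is a minimal sketch in plain Python, assuming the counts have already been collected (in PySpark, `rdd.glom().map(len).collect()` is one way to obtain them). The threshold of 2.0 is an arbitrary illustration, not a Spark default.

```python
def skew_ratio(partition_sizes):
    """Ratio of the largest partition to the mean partition size."""
    if not partition_sizes:
        raise ValueError("need at least one partition")
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean if mean else 0.0

def looks_skewed(partition_sizes, threshold=2.0):
    """True when the largest partition exceeds threshold times the mean."""
    return skew_ratio(partition_sizes) > threshold

# Hypothetical record counts for two four-partition RDDs.
print(looks_skewed([100, 100, 100, 100]))  # False
print(looks_skewed([10, 10, 10, 370]))     # True
```

A ratio close to 1 suggests the partitions are already balanced, in which case repartitioning is unlikely to explain a gap between tasks.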

What does the hint USE_ADDITIONAL_PARALLELISM do in Cloud Spanner

In the doc we can find a query hint named USE_ADDITIONAL_PARALLELISM here: https://cloud.google.com/spanner/docs/query-syntax#statement-hints
However the documentation is very short for it.
From my understanding it will spread a single query to be executed on multiple nodes; is that correct?
In what scenario would we use it?
What is its impact on the infrastructure?
How does it scale with number of nodes?
Does it need a query that picks data from different splits, or does it work on a single split?
Any meaningful information about it is welcome.
PS: I was originally introduced to the hint in this thread
A Spanner query may be executed on multiple remote servers.
Source: An illustration of the life of a query from the Cloud Spanner "Query execution plans" documentation
The root node coordinates the query execution.
If the execution plan expects rows on multiple splits to satisfy the query predicate(s), multiple subplans are executed on the respective remote servers.
Due to the distributed nature of Spanner these subplans can sometimes be executed in parallel; for example, the right subplan execution is not dependent on the left subplan results.
If the USE_ADDITIONAL_PARALLELISM query hint is provided, the root node may choose to increase the number of parallel remote executions, if the execution plan includes multiple subplans.
To answer the original questions:
From my understanding it will spread a single query to be executed on multiple nodes; is that correct?
This hint does not change how a query is executed; it only makes it possible for subplans of that execution to be initiated with increased parallelism.
In what scenario would we use it?
Especially in cases where a full table scan is required, this may lead to faster query completion in wall-clock time, but the trade-offs concerning resource allocation, and the effects on other parallel operations, should also be considered.
What is its impact on the infrastructure?
If an increased number of remote executions run in parallel, the average CPU utilization for the instance may increase.
How does it scale with number of nodes?
An increased number of nodes provides additional capacity for parallel operations.
Does it need a query that picks data from different splits, or does it work on a single split?
Benefits will likely be significantly higher for queries which require data that resides on multiple splits.
A Cloud Spanner query may have multiple levels of distribution. The USE_ADDITIONAL_PARALLELISM query hint will cause a node executing a query to try to prefetch the results of subqueries further up in the distribution queue. This can be useful for queries doing full table scans, or full table scans with aggregations such as COUNT(), MAX, MIN, etc., where identical subqueries can be distributed to many splits and where the individual subqueries to the splits return relatively little data (such as aggregation state). However, if the individual subqueries return significant data, then using this hint can cause memory usage on the consuming node to go up significantly due to prefetching.
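As a concrete illustration, a statement hint is written in front of the query using the `@{...}` syntax from the linked documentation. The sketch below only builds the SQL text; the table name `Orders` is hypothetical, and the resulting string would be passed to whatever client you already use to run queries.

```python
def with_additional_parallelism(sql):
    """Prefix a query with the USE_ADDITIONAL_PARALLELISM statement hint."""
    return "@{USE_ADDITIONAL_PARALLELISM=TRUE} " + sql.lstrip()

# An aggregation over a full table scan, the case described above as a good fit.
query = with_additional_parallelism("SELECT COUNT(*) FROM Orders")
print(query)  # @{USE_ADDITIONAL_PARALLELISM=TRUE} SELECT COUNT(*) FROM Orders
```

Comparing the same query with and without the hint, while watching CPU and memory on the instance, is a reasonable way to judge whether it helps for a given workload.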

Dynamic CPUs per Task in Spark

Let's say my job performs several Spark actions, where the first few do not use multiple cores for a single task, so I would like each instance to run (executor.cores) tasks in parallel (spark.task.cpus=1).
Then suppose I have another action which can be parallelized. I would like a feature that lets me increase spark.task.cpus (say, to use more cores on the executor) and run fewer tasks simultaneously on each instance.
My workaround right now is to save the data, start a new SparkContext with new settings, and reload the data.
The use case: my later actions may be unavoidably skewed, and I may want to apply more than one core per task to avoid bottlenecking on such large tasks, but I don't want this to impact the earlier actions, which can benefit from using 1 core per task.
From looking around, my guess is that I can't do this currently, so I'm mainly wondering whether there is a significant limitation that prevents it. Alternatively, I would welcome suggestions for how I could trick Spark into achieving something similar.
Note: Currently using 1.6.2 but willing to hear other options for Spark2+
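The trade-off in the question comes down to simple arithmetic: the number of tasks an executor runs at once is spark.executor.cores divided by spark.task.cpus, rounded down. A small sketch with made-up core counts:

```python
def concurrent_tasks_per_executor(executor_cores, task_cpus):
    """Task slots on one executor: executor cores divided by CPUs per task."""
    if task_cpus < 1 or executor_cores < task_cpus:
        raise ValueError("need 1 <= task_cpus <= executor_cores")
    return executor_cores // task_cpus

# Hypothetical 8-core executor.
print(concurrent_tasks_per_executor(8, 1))  # 8 slots for the early, single-core actions
print(concurrent_tasks_per_executor(8, 4))  # 2 slots for the later, multi-core action
```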

How do partitions map to tasks in Spark?

If I partition an RDD into say 60 and I have a total of 20 cores spread across 20 machines, i.e. 20 instances of single core machines, then the number of tasks is 60 (equal to the number of partitions). Why is this beneficial over having a single partition per core and having 20 tasks?
Additionally, I have run an experiment where I have set the number of partitions to 2, checking the UI shows 2 tasks running at any one time; however, what has surprised me is that it switches instances on completion of tasks, e.g. node1 and node2 do the first 2 tasks, then node6 and node8 do the next set of 2 tasks etc. I thought by setting the number of partitions to less than the cores (and instances) in a cluster then the program would just use the minimum number of instances required. Can anyone explain this behaviour?
For the first question: you might want more granular tasks than strictly necessary in order to load less into memory at the same time. It can also help with fault tolerance, as less work needs to be redone in case of failure. It is nevertheless a tunable parameter. In general the answer depends on the kind of workload (IO bound, memory bound, CPU bound).
As for the second one, I believe version 1.3 has some code to dynamically request resources. I'm unsure in which version the change happened, but older versions just request the exact resources you configure your driver with. As for why a partition moves from one node to another: AFAIK Spark will run a task on a node that has a local copy of that task's data on HDFS. Since HDFS keeps multiple copies (3 by default) of each block of data, there are multiple options for where to run any given task.
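The granularity argument can be made concrete with the numbers from the question. This is a rough sketch, assuming evenly sized partitions and a hypothetical total of 6,000,000 records:

```python
import math

def scheduling_waves(num_partitions, total_cores):
    """Rounds of tasks needed when each core runs one task at a time."""
    return math.ceil(num_partitions / total_cores)

def records_per_task(total_records, num_partitions):
    """Approximate records a single task holds, assuming even partitions."""
    return math.ceil(total_records / num_partitions)

# 20 single-core machines, as in the question; the record total is made up.
for parts in (20, 60):
    print(parts, scheduling_waves(parts, 20), records_per_task(6_000_000, parts))
# 20 1 300000
# 60 3 100000
```

With 60 partitions each task holds a third of the data at a time, and a failed task means redoing a third as much work, at the cost of three scheduling rounds instead of one.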

Resources