Running several spark jobs concurrently from driver - apache-spark

Imagine that we have 3 customers and we want do some same work for each of them in parallel.
def doSparkJob(customerId: String) = {
spark
.read.json(s"$customerId/file.json")
.map(...)
.reduceByKey(...)
.write
.partitionBy("id")
.parquet("output/")
}
We do it concurrently like this (from spark driver):
val jobs: Future[(Unit, Unit, Unit)] = for {
f1 <- Future { doSparkJob("customer1") }
f2 <- Future { doSparkJob("customer1") }
f3 <- Future { doSparkJob("customer1") }
} yield (f1, f2, f3)
Await.ready(jobs, 5.hours)
Do I understand correctly that this is bad approach? Many spark job will push out context of each other from executors and there will be many spilling data to disc appears. How spark will be manage execute task from parallel jobs? How shuffle appears when we have 3 concurrent job from one driver and only 3 executors with one core.
I guess, a good approach should looks like this:
We read all data together for all customers groupByKey by customer and do what we want to do.

Do I understand correctly that this is bad approach?
Not necessarily. A lot depends on the context and Spark implements it's own set of AsyncRDDActions to address scenarios like this one (though there is no Dataset equivalent).
In the simplest scenario, with static allocation, it is quite likely that Spark will just schedule all jobs sequentially, due to lack of resources. Unless configured otherwise, this is the most probable outcome with the described configuration. Please keep in mind that Spark can use in-application scheduling with FAIR scheduler to share limited resources between multiple concurrent jobs. See Scheduling Within an Application.
If amount of resources is sufficient to start multiple jobs at the same, there can be competition between individual jobs, especially for IO and memory intensive jobs. If all jobs use the same of resources (especially databases) it is possible that Spark will cause throttling and subsequent failures or timeouts. A less severe effect of running multiple jobs can be increased cache eviction.
Overall there multiple factors to consider when you choose between sequential and concurrent execution including, but not limited to, available resource (Spark cluster and external services), choice of the API (RDD tend to be more greedy than SQL, therefore requires some low level management) and choice of operators. Even if jobs are sequentially you may still decide to use asynchronous to improve driver utilization and reduce latency. This is particularly useful with Spark SQL and complex execution plans (common bottleneck in Spark SQL). This way Spark can crunch new execution plans, while other jobs are executed.

Related

Limit cores per Apache Spark job

I have a dataset for which I'd like to run multiple jobs for in parallel.
I do this by launching each action in its own thread to get multiple Spark jobs per Spark application like the docs say.
Now the task I'm running doesn't benefit endlessly from throwing more cores at it - at like 50 cores or so the gain of adding more resources is quite minimal.
So for example if I have 2 jobs and 100 cores I'd like to run both jobs in parallel each of them only occupying 50 cores at max to get faster results.
One thing I could probably do is to set the amount of partitions to 50 so the jobs could only spawn 50 tasks(?). But apparently there are some performance benefits of having more partitions than available cores to get a better overall utilization.
But other than that I didn't spot anything useful in the docs to limit the resources per Apache Spark job inside one application. (I'd like to avoid spawning multiple applications to split up the executors).
Is there any good way to do this?
Perhaps asking Spark driver to use fair scheduling is the most appropriate solution in your case.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
There is also a concept of pools, but I've not used them, perhaps that gives you some more flexibility on top of fair scheduling.
Seems like conflicting requirements with no silver bullet.
parallelize as much as possible.
limit any one job from hogging resources IF (and only if) another job is running as well.
So:
if you increase number of partitions then you'll address #1 but not #2.
if you specify spark.cores.max then you'll address #2 but not #1.
if you do both (more partitions and limit spark.cores.max) then you'll address #2 but not #1.
If you only increase number of partitions then only thing you're risking is that a long running big job will delay the completion/execution of some smaller jobs, though overall it'll take the same amount of time to run two jobs on given hardware in any order as long as you're not restricting concurrency (spark.cores.max).
In general I would stay away from restricting concurrency (spark.cores.max).
Bottom line, IMO
don't touch spark.cores.max.
increase partitions if you're not using all your cores.
use fair scheduling
if you have strict latency/response-time requirements then use separate auto-scaling clusters for long running and short running jobs

Deadlock when many spark jobs are concurrently scheduled

Using spark 2.4.4 running in YARN cluster mode with the spark FIFO scheduler.
I'm submitting multiple spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);
Dataset<Row> aHugeDf = spark.read.json(hundredsOfPaths);
List<Future<Void>> futures = listOfSeveralHundredThings
.stream()
.map(aThing -> ecs.submit(() -> {
df
.filter(col("some_column").equalTo(aThing))
.write()
.format("org.apache.hudi")
.options(writeOptions)
.save(outputPathFor(aThing));
return null;
}))
.collect(Collectors.toList());
IntStream.range(0, futures.size()).forEach(i -> ecs.poll(30, TimeUnit.MINUTES));
exec.shutdownNow();
At some point, as nThreads increases, spark no longer seems to be scheduling any jobs as evidenced by:
ecs.poll(...) timing out eventually
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job ID's
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x rd5.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible write the output of the jobs to AWS Elastic MapReduce hdfs (to leverage on the almost instantaneous renames and better file IO of local hdfs) and add a dstcp step to move the files to S3, to save yourself all the troubles of handling the innards of an object store trying to be a filesystem. Also writing to local hdfs will allow you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommiter.
If you must use S3 as the output directory ensure that the following Spark configurations are set
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter is removed from Spark 2.0 due to the chance of data loss. Unfortunately until we have improved consistency from S3a we have to work with the workarounds. Things are improving with Hadoop 2.8
Avoid keynames in lexicographic order. One could use hashing/random prefixes or reverse date-time to get around.The trick is to name your keys hierarchically, putting the most common things you filter by on the left side of your key. And never have underscores in bucket names due to DNS issues.
Enabling fs.s3a.fast.upload upload parts of a single file to Amazon S3 in parallel
Refer these articles for more detail-
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/#subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem wrong. Unless you can guarantee that the number of tasks per job is very low, you're likely not going to get much performance improvement by parallelizing 100s of jobs at once. Your cluster can only support 300 tasks at once, assuming you're using the default parallelism of 200 thats only 1.5 jobs. I'd suggest rewriting your code to cap max concurrent queries at 10. I highly suspect that you have 300 queries with only a single task of several hundred actually running. Most OLTP data processing system intentionally have a fairly low level of concurrent queries compared to more traditional RDS systems for this reason.
also
Apache Hudi has a default parallelism of several hundred FYI.
Why don't you just partition based on your filter column?
I would start by eliminating possible causes. Are you sure its spark that is not able to submit many jobs? Is it spark or is it YARN? If it is the later, you might need to play with the YARN scheduler settings. Could it be something to do with ExecutorService implementation that may have some limitation for the scale you are trying to achieve? Could it be hudi? With the snippet thats hard to determine.
How does the problem manifest itself other than no jobs starting up? Do you see any metrics / monitoring on the cluster or any logs that point to the problem as you say it?
If it is to do with scaling, is is possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as comment rather than answer but text too long for comment)
Using threads or thread pools are always problematic and error prone.
I had similar problem in processing spark jobs in one of Internet of things application. I resolved using fair scheduling.
Suggestions :
Use fair scheduling (fairscheduler.xml) instead of yarn capacity scheduler
how to ? see this by using dedicated resource pools one per module. when used it will look like below spark ui
See that unit of parllelism (number of partitions ) are correct for data frames you use by seeing spark admin ui. This is spark native way of using parllelism.

Number of Executor Cores and benefits or otherwise - Spark

Some run-time clarifications are requested.
In a thread elsewhere I read, it was stated that a Spark Executor should only have a single Core allocated. However, I wonder if this is really always true. Reading the various SO-questions and the likes of, as well as Karau, Wendell et al, it is clear that there are equal and opposite experts who state one should in some cases specify more Cores per Executor, but the discussion tends to be more technical than functional. That is to say, functional examples are lacking.
My understanding is that a Partition of an RDD or DF, DS, is serviced by a single Executor. Fine, no issue, makes perfect sense. So, how can the Partition benefit from multiple Cores?
If I have a map followed by, say a, filter, these are not two Tasks that can be interleaved - as in what Informatica does, as my understanding is they are fused together. This being so, then what is an example of benefit from an assigned Executor running more Cores?
From JL: In other (more technical) words, a Task is a computation on the records in a RDD partition in a Stage of a RDD in a Spark Job. What does it mean functionally speaking, in practice?
Moreover, can Executor be allocated if not all Cores can be acquired? I presume there is a wait period and that after a while it may be allocated in a more limited capacity. True?
From a highly rated answer on SO, What is a task in Spark? How does the Spark worker execute the jar file?, the following is stated: When you create the SparkContext, each worker starts an executor. From another SO question: When a SparkContext is created, each worker node starts an executor.
Not sure I follow these assertions. If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
I ask this, as even this excellent post How are stages split into tasks in Spark? does not give a practical example of multiple Cores per Executor. I can follow the post clearly and it fits in with my understanding of 1 Core per Executor.
My understanding is that a Partition (...) serviced by a single Executor.
That's correct, however the opposite is not true - a single executor can handle multiple partitions / tasks across multiple stages or even multiple RDDs).
then what is an example of benefit from an assigned Executor running more Cores?
First and foremost processing multiple tasks at the same time. Since each executor is a separate JVM, which is a relatively heavy process, it might preferable to keep only instance for a number of threads. Additionally it can provide further advantages, like exposing shared memory that can be used across multiple tasks (for example to store broadcast variables).
Secondary application is applying multiple threads to a single partition when user invokes multi-threaded code. That's however not something that is done by default (Number of CPUs per Task in Spark)
See also What are the benefits of running multiple Spark tasks in the same JVM?
If Spark does not know the number of partitions etc. in advance, why allocate Executors so early?
Pretty much by extension of the points made above - executors are not created to handle specific task / partition. There are long running processes, and as long as dynamic allocation is not enabled, there are intended to last for the full lifetime of the corresponding application / driver (preemption or failures, as well as already mentioned dynamic allocation, can affect that, but that's the basic model).

Could FAIR scheduling mode make Spark Streaming jobs that read from different topics running in parallel?

I use Spark 2.1 and Kafka 0.9.
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish.
According to this if i have multiple jobs from multiple threads in case of spark streaming(one topic from each thread) is it possible that multiple topics can run simultaneously if i have enough cores in my cluster or would it just do a round robin across pools but run only one job at a time ?
Context:
I have two topics T1 and T2, both with one 1 partition. I have configured a pool with scheduleMode to be FAIR. I have 4 cores registered with spark. Now each topic has two actions(hence two jobs - totally 4 jobs across topics) Let's say J1 and J2 are jobs for T1 and J3 and J4 are jobs for topic T2. What spark is doing in FAIR mode is execute J1 J3 J2 J4, but at any time only one job is executing. Now as each topic has only one partition, only once core is being used and 3 are just free. This is something which i don't want.
Any way i can avoid this ?
if i have multiple jobs from multiple threads...is it possible that multiple topics can run simultaneously
Yes. That's the purpose of FAIR scheduling mode.
As you may have noticed, I removed "Spark Streaming" from your question since it does not contribute in any way to how Spark schedules Spark jobs. It does not really matter whether you start your Spark jobs from a "regular" application or Spark Streaming one.
Quoting Scheduling Within an Application (highlighting mine):
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.
And then the quote you used to ask the question that should now get clearer.
it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources.
So, speaking about Spark Streaming you'd have to configure FAIR scheduling mode and Spark Streaming's JobScheduler should submit Spark jobs per topic in parallel (haven't tested it out myself so it's more theory than practice).
I think that fair scheduler alone will not help, as it's the Spark Streaming engine that takes care of submitting the Spark Jobs and normally does so in a sequential mode.
There's a non-documented configuration parameter in Spark Streaming: spark.streaming.concurrentJobs[1], which is set to 1 by default. It controls the parallelism level of jobs submitted to Spark.
By increasing this value, you may see parallel processing of the different spark stages of your streaming job.
I would think that combining this configuration with the fair scheduler in Spark, you will be able to achieve controlled parallel processing of the independent topic consumers. This is mostly uncharted territory.

Improving SQL Query using Spark Multi Clusters

I was experimenting if Spark with multi clusters can improve slow SQL query. I created two workers for master and they are running on local Spark Standalone. Yes, I did halve the memory and the number of cores to create workers on local machine. I specified partitions for sqlContext, using partitionColumn, lowerBound, UpperBoundand numberPartitions, so that tasks (or partitions) can be distributed over workers. I described them as below (partitionColumn is unique):
df = sqlContext.read.format("jdbc").options(
url = "jdbc:sqlserver://localhost;databasename=AdventureWorks2014;integratedSecurity=true;",
driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
dbtable = query,
partitionColumn = "RowId",
lowerBound = 1,
upperBound = 10000000,
numPartitions = 4).load()
I ran my script over the master after specifying the options, but I couldn't get any performance improvement against when running on spark without cluster. I know I should have not halved the memory for integrity of the experiment. But I would like to know if that might be the case or any reason if that's not the case. Any thoughts are welcome. Many thanks.
There are multiple factors which play a role here, though the weights of each of these can differ on a case by case basis.
As nicely pointed out by mtoto, increasing number of workers on a single machine, is unlikely to bring any performance gains.
Multiple workers on a single machine have access to the same fixed pool of resources. Since worker doesn't participate in the processing itself, you just use a higher fraction of this pool for management.
There legitimate cases when we prefer a higher number of executor JVMs, but it is not the same as increasing number of workers (the former one is an application resource, the latter one is a cluster resource).
It is not clear if you use the same number of cores for baseline and multi-worker configuration, nevertheless cores are not the only resource you have to consider working with Spark. Typical Spark jobs are IO (mostly network and disk) bound. Increasing number of threads on a single node, without making sure that there is sufficient disk and network configuration, will just make them wait for the data.
Increasing cores alone is useful only for jobs which are CPU bound (and these will typically scale better on a single machine).
Fiddling with Spark resources won't help you, if external resource cannot keep up with the requests. A high number of concurrent batch reads from a single non-replicated database will just throttle the server.
In this particular case you make it even worse by running a database server on the same node as Spark. It has some advantages (all traffic can go through loopback), but unless database and Spark use different sets of disks, they'll be competing over disk IO (and other resources as well).
Note:
It is not clear what is the query, but if it is slow when executed directly against database, fetching it from Spark will it even slower. You should probably take a closer look at query and/or database structure and configuration first.

Resources