Spark shuffle blocks replication - apache-spark

I'd like to know whether it is possible to define replication logic for shuffle blocks without using persist.
The use case is a complex SQL query with multiple joins, which requires a large number of shuffles whose output is saved on the worker machines (with spill). Losing a machine might require stage retries (via the DAG), which is very expensive and might not always work.
Can this be done through configuration, or by inheriting from some class in the Spark context?
Version: Spark 2.3

Related

Deadlock when many spark jobs are concurrently scheduled

Using spark 2.4.4 running in YARN cluster mode with the spark FIFO scheduler.
I'm submitting multiple spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);
Dataset<Row> aHugeDf = spark.read().json(hundredsOfPaths);
List<Future<Void>> futures = listOfSeveralHundredThings
    .stream()
    .map(aThing -> ecs.submit(() -> {
        // One filtered write per thing, each submitted as its own Spark job.
        aHugeDf
            .filter(col("some_column").equalTo(aThing))
            .write()
            .format("org.apache.hudi")
            .options(writeOptions)
            .save(outputPathFor(aThing));
        return null;
    }))
    .collect(Collectors.toList());
// Wait up to 30 minutes for each submitted job; poll returns null on timeout.
for (int i = 0; i < futures.size(); i++) {
    ecs.poll(30, TimeUnit.MINUTES);
}
pool.shutdownNow();
At some point, as nThreads increases, spark no longer seems to be scheduling any jobs as evidenced by:
ecs.poll(...) timing out eventually
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job IDs
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x r5d.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible, write the output of the jobs to the HDFS of the AWS Elastic MapReduce cluster (to take advantage of the almost instantaneous renames and better file I/O of local HDFS) and add a DistCp step to move the files to S3. This saves you the trouble of handling the innards of an object store trying to be a filesystem. Writing to local HDFS also allows you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommitter.
If you must use S3 as the output directory, ensure that the following Spark configurations are set:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter was removed in Spark 2.0 due to the chance of data loss. Unfortunately, until S3A offers improved consistency, we have to rely on workarounds. Things are improving with Hadoop 2.8.
Avoid key names in lexicographic order. One could use hashing/random prefixes or reverse date-time to get around this. The trick is to name your keys hierarchically, putting the most common things you filter by on the left side of your key. And never use underscores in bucket names, due to DNS issues.
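To make the hashing-prefix idea concrete, here is a minimal sketch in plain Java. The class name, the helper withHashPrefix, the 4-character prefix length and the sample key are all made up for illustration; SHA-256 is just one convenient hash, nothing S3 requires.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashedKeyPrefix {
    // Prepends the first `hexChars` hex digits of the key's SHA-256 hash,
    // so keys that sort next to each other no longer share a prefix.
    static String withHashPrefix(String key, int hexChars) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.substring(0, hexChars) + "/" + key;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical key; the prefix is deterministic for a given key.
        System.out.println(withHashPrefix("logs/2020-01-01/part-0000.json", 4));
    }
}
```

The prefix is derived from the key itself, so a reader that knows the original key can recompute the full path without a lookup table.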
Enabling fs.s3a.fast.upload uploads parts of a single file to Amazon S3 in parallel.
Refer to these articles for more detail:
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/#subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem wrong. Unless you can guarantee that the number of tasks per job is very low, you're not likely to get much performance improvement by parallelizing hundreds of jobs at once. Your cluster can only support 300 tasks at once; assuming you're using the default parallelism of 200, that's only 1.5 jobs. I'd suggest rewriting your code to cap max concurrent queries at 10. I strongly suspect that you have 300 queries, each with only a single task out of several hundred actually running. Most OLAP data processing systems intentionally have a fairly low level of concurrent queries compared to more traditional RDBMS systems for this reason.
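A rough sketch of that arithmetic and of the cap, in plain Java with no Spark involved. The class and method names are invented here, and the Thread.sleep is only a stand-in for the real filter-and-write job:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedSubmission {
    // How many jobs can have all of their tasks running at the same time.
    static double jobsRunningFully(int taskSlots, int tasksPerJob) {
        return (double) taskSlots / tasksPerJob;
    }

    // Runs `jobs` placeholder jobs on at most `maxConcurrent` threads and
    // returns the highest number of jobs observed running simultaneously.
    static int peakConcurrency(int jobs, int maxConcurrent) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < jobs; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5); // stand-in for the filter + write
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // surfaces any job failure instead of hiding it
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println(jobsRunningFully(300, 200));    // prints 1.5
        System.out.println(peakConcurrency(50, 10) <= 10); // prints true
    }
}
```

All the work still gets submitted; the fixed pool size just keeps the scheduler from being handed hundreds of jobs it cannot run at once.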
Also, FYI, Apache Hudi has a default parallelism of several hundred.
Why don't you just partition based on your filter column?
I would start by eliminating possible causes. Are you sure it's Spark that is not able to submit many jobs? Is it Spark or is it YARN? If it is the latter, you might need to play with the YARN scheduler settings. Could it have something to do with the ExecutorService implementation, which may have some limitation at the scale you are trying to achieve? Could it be Hudi? With the snippet, that's hard to determine.
How does the problem manifest itself other than no jobs starting up? Do you see any metrics / monitoring on the cluster or any logs that point to the problem as you say it?
If it has to do with scaling, is it possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as comment rather than answer but text too long for comment)
Using threads or thread pools is always problematic and error-prone.
I had a similar problem processing Spark jobs in an Internet of Things application. I resolved it using fair scheduling.
Suggestions:
Use fair scheduling (fairscheduler.xml) instead of the YARN capacity scheduler.
How? By using dedicated resource pools, one per module. When they are in use, the pools are visible in the Spark UI.
Check in the Spark admin UI that the unit of parallelism (number of partitions) is correct for the DataFrames you use. This is the Spark-native way of using parallelism.

Get SparkSession in partition loop [duplicate]

I'm writing Spark Jobs that talk to Cassandra in Datastax.
Sometimes when working through a sequence of steps in a Spark job, it is easier to just get a new RDD rather than join to the old one.
You can do this by calling the SparkContext getOrCreate method.
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network.
In this case - you're registering a singleton for that JVM, and so it gets around the problem of serialization.
One day my tech lead came to me and said
Don't use SparkContext getOrCreate you can and should use joins instead
But he didn't give a reason.
My question is: Is there a reason not to use SparkContext.getOrCreate when writing a spark job?
TL;DR There are many legitimate applications of the getOrCreate methods, but attempting to find a loophole to perform map-side joins is not one of them.
In general there is nothing deeply wrong with SparkContext.getOrCreate. The method has its applications, although there are some caveats, most notably:
In its simplest form it doesn't allow you to set job specific properties, and the second variant ((SparkConf) => SparkContext) requires passing SparkConf around, which is hardly an improvement over keeping SparkContext / SparkSession in the scope.
It can lead to opaque code with "magic" dependency. It affects testing strategies and overall code readability.
However your question, specifically:
Now sometimes there are concerns inside a Spark Job that referring to the SparkContext can take a large object (the Spark Context) which is not serializable and try and distribute it over the network
and
Don't use SparkContext getOrCreate you can and should use joins instead
suggests you're actually using the method in a way that it was never intended to be used: by using SparkContext on an executor node.
val rdd: RDD[_] = ???
rdd.map(_ => {
val sc = SparkContext.getOrCreate()
...
})
This is definitely something that you shouldn't do.
Each Spark application should have one, and only one, SparkContext initialized on the driver, and Apache Spark developers have made a lot of effort to prevent users from any attempt to use SparkContext outside the driver. It is not because SparkContext is large, or impossible to serialize, but because it is a fundamental feature of Spark's computing model.
As you probably know, computation in Spark is described by a directed acyclic graph of dependencies, which:
Describes processing pipeline in a way that can be translated into actual task.
Enables graceful recovery in case of task failures.
Allows proper resource allocation and ensures lack of cyclic dependencies.
Let's focus on the last part. Since each executor JVM gets its own instance of SparkContext, cyclic dependencies are not an issue: RDDs and Datasets exist only in the scope of their parent context, so you won't be able to access objects belonging to the application driver.
Proper resource allocation is a different thing. Since each SparkContext creates its own Spark application, your "main" process won't be able to account for resources used by the contexts initialized in the tasks. At the same time, the cluster manager won't have any indication that the applications are somehow interconnected. This is likely to cause deadlock-like conditions.
It is technically possible to work around it, with careful resource allocation and usage of manager-level scheduling pools, or even a separate cluster manager with its own set of resources, but it is not something that Spark is designed for, it is not supported, and overall it would lead to a brittle and convoluted design, where correctness depends on configuration details, the specific cluster manager choice and overall cluster utilization.

How to use driver to load data and executors for processing and writing?

I would like to use Spark structured streaming to watch a drop location that exists on the driver only. I do this with
val trackerData = spark.readStream.text(sourcePath)
After that I would like to parse, filter, and map incoming data and write it out to elastic.
This works well, except that it only works when spark.master is set to e.g. local[*]. When set to yarn, no files are found, even when the deployment mode is set to client.
I thought that reading data in from local driver node is achieved by setting deployment to client and doing the actual processing and writing within the Spark cluster.
How could I improve my code to use driver for reading in and cluster for processing and writing?
What you want is possible, but not recommended in Spark Structured Streaming in particular and in Apache Spark in general.
The main motivation of Apache Spark is to bring computation to the data, not the opposite, as Spark is meant to process petabytes of data that a single JVM (of a driver) would not be able to handle.
The driver's "job" (no pun intended) is to convert an RDD lineage (= a DAG of transformations) into tasks that know how to load the data. Tasks are executed on Spark executors (in most cases), and that's where data processing happens.
There are some ways to do the reading part on the driver and the processing on executors, and among them the most "lucrative" would be to use broadcast variables.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
One idea that came to my mind is that you could "hack" Spark "Streams" and write your own streaming sink that would do collect or whatever. That could make the processing local.

Why does Spark save Map phase output to local disk?

I'm trying to understand the Spark shuffle process deeply. When I started reading, I came across the following point.
Spark writes the Map task (ShuffleMapTask) output directly to disk on completion.
I would like to understand the following with respect to Hadoop MapReduce.
If both MapReduce and Spark write the data to the local disk, then how is the Spark shuffle process different from Hadoop MapReduce?
Since data is represented as RDDs in Spark, why don't these outputs remain in the node executors' memory?
How is the output of the Map tasks from Hadoop MapReduce and Spark different?
If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?
First of all, Spark doesn't work in a strict map-reduce manner, and map output is not written to disk unless it is necessary. What is written to disk are shuffle files.
It doesn't mean that data after the shuffle is not kept in memory. Shuffle files in Spark are written mostly to avoid re-computation in case of multiple downstream actions. Why write to a file system at all? There are at least two intertwined reasons:
Memory is a valuable resource, and in-memory caching in Spark is ephemeral. Old data can be evicted from cache when needed.
Shuffle is an expensive process that we want to avoid if it is not necessary. It makes more sense to store shuffle data in a manner which makes it persistent during the lifetime of a given context.
Shuffle itself, apart from the ongoing low-level optimization efforts and implementation details, isn't different at all. It is based on the same basic approach with all its limitations.
How are tasks different from Hadoop maps? As nicely illustrated by Justin Pihony, multiple transformations which don't require shuffles are squashed together into a single task. Since these operate on standard Scala Iterators, operations on individual elements can be piped.
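To show what "piped" means here, a small analogy using Java streams rather than Spark itself (streams are likewise lazy and process one element at a time). The class name and the trace helper are invented for this illustration. Each element passes through map and then filter before the next element is touched, so no intermediate collection is built between the two steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class PipelinedNarrowOps {
    // Applies map then filter and records the order in which the two
    // functions are invoked, followed by the final result.
    static List<String> trace(List<Integer> input) {
        List<String> log = new ArrayList<>();
        List<Integer> out = input.stream()
            .map(x -> { log.add("map " + x); return x * 2; })
            .filter(x -> { log.add("filter " + x); return x > 2; })
            .collect(Collectors.toList());
        log.add("result " + out);
        return log;
    }

    public static void main(String[] args) {
        // prints [map 1, filter 2, map 2, filter 4, map 3, filter 6, result [4, 6]]
        System.out.println(trace(Arrays.asList(1, 2, 3)));
    }
}
```

The interleaved "map, filter, map, filter" order in the log is the point: the steps are fused per element rather than run as two full passes over the data.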
Regarding network and I/O bottlenecks, there is no silver bullet here. While Spark can reduce the amount of data which is written to disk or shuffled by combining transformations, caching in memory and providing transformation-aware worker preferences, it is subject to the same limitations as any other distributed framework.
If both Map-Reduce and Spark writes the data to the local disk then how spark shuffle process is different from Hadoop MapReduce?
When you execute a Spark application, the very first thing that happens is starting the SparkContext, which becomes the home of multiple interconnected services, with DAGScheduler, TaskScheduler and SchedulerBackend being among the most important ones.
DAGScheduler is the main orchestrator and is responsible for transforming an RDD lineage graph (i.e. a directed acyclic graph of RDDs) into stages. While doing it, DAGScheduler traverses the parent dependencies of the final RDD and creates a ResultStage with parent ShuffleMapStages.
A ResultStage is (mostly) the last stage, with ShuffleMapStages being its parents. I said mostly because I think I may have seen that you can "schedule" a ShuffleMapStage.
This is the very first optimization Spark applies to your Spark jobs (that together create a Spark application): execution pipelining, where multiple transformations are wired together to create a single stage (because their inter-dependencies are narrow). That's what makes Spark faster than Hadoop MapReduce, since two or more transformations can get executed one by one with no data shuffling, possibly all in memory.
A single stage extends until it hits a ShuffleDependency (aka wide dependency).
There are RDD transformations that will cause shuffling (due to creating a ShuffleDependency). That's the moment where Spark is very much like Hadoop's MapReduce, since it will save partial shuffle outputs to...local disks on executors.
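As a toy illustration of where stage boundaries fall, here is a sketch in plain Java. It is not how DAGScheduler is implemented: the real scheduler walks a graph of RDD dependencies, while this assumes a simple linear chain of operator names and a hard-coded list of operators treated as wide. The class name and the WIDE list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StageSplitSketch {
    // Operators that, in this toy model, introduce a shuffle (wide) dependency.
    static final List<String> WIDE =
        Arrays.asList("groupByKey", "reduceByKey", "repartition");

    // Groups a linear chain of operators into stages: narrow operators are
    // pipelined into the current stage, a wide operator starts a new one.
    static List<List<String>> splitIntoStages(List<String> lineage) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String op : lineage) {
            if (WIDE.contains(op) && !current.isEmpty()) {
                stages.add(current);
                current = new ArrayList<>();
            }
            current.add(op);
        }
        if (!current.isEmpty()) {
            stages.add(current);
        }
        return stages;
    }

    public static void main(String[] args) {
        // prints [[textFile, flatMap, map], [reduceByKey, mapValues]]
        System.out.println(splitIntoStages(
            Arrays.asList("textFile", "flatMap", "map", "reduceByKey", "mapValues")));
    }
}
```

In the example, the first group plays the role of a ShuffleMapStage whose output is written to local disk, and the second group reads that output.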
When a Spark application starts it requests executors from a cluster manager (there are three supported: Spark Standalone, Apache Mesos and Hadoop YARN). This is what SchedulerBackend is for -- to manage communication between your Spark application and cluster resources.
(Let's assume you are not using the External Shuffle Service)
Executors host their own local BlockManagers that are responsible for managing RDD blocks that are kept on local hard drive (possibly in memory and replicated too). You can control RDD block persistence using cache and persist operators and StorageLevels. You can use Storage and Executors tabs in web UI to track blocks with their location and size.
The difference between Spark storing data locally (on executors) and Hadoop MapReduce is that:
The partial results (after computing ShuffleMapStages) are saved on local hard drives, not on HDFS, which is a distributed file system with very expensive saves.
Only some files are saved to local hard drives (after operations have been pipelined), whereas in a chain of Hadoop MapReduce jobs every job writes its final output to HDFS (map outputs themselves go to local disk there too).
Let me answer the following item:
If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?
That's the trickiest part of the Spark execution plan, and it heavily depends on how wide the shuffling is. If you work only with local data (multiple executors on a single machine), you will see no data traffic, since the data is in place already.
If a data shuffle is required, executors will send data to each other, and that will increase the traffic.
Data Exchange Between Nodes in Spark Application
Just to elaborate on the traffic between nodes in a Spark application.
Broadcast variables are the means of sending data from the driver to executors.
Accumulators are the means of sending data from executors to the driver.
Operators like collect will pull all the remote blocks from executors to the driver.
