Spark master memory requirements related to data size - apache-spark

Are Spark master memory requirements related to the size of the processed data?
The Spark driver and Spark workers/executors deal with processed data directly (and execute application code), so their memory needs can be linked to the size of the processed data. But is the Spark master in any way affected by the data size? It seems to me that it isn't, because it just manages the Spark workers and doesn't work with the data itself directly.

Spark main data entities like DataFrames or DataSets are based on RDD, or Resilient Distributed Datasets. They are distributed meaning the processing generally takes place in the executors.
Some RDD actions will end with data on the driver process though. Most notably collect and other actions that use it (like show, take or toPandas if you are using python). collect, as the name implies, will collect some or all of the rows of the distributed datasets and materialize them in the driver process. At this point, yes, you will need to take into account the memory footprint of your data.
This is why you will generally want to reduce as much as possible the data you collect. You can groupBy, filter, and many other transformations so that if you need to process the data in the driver, it's the most refined possible.

Related

How can Spark process data that is way larger than Spark storage?

Currently taking a course in Spark and came across the definition of an executor:
Each executor will hold a chunk of the data to be processed. This
chunk is called a Spark partition. It is a collection of rows that
sits on one physical machine in the cluster. Executors are responsible
for carrying out the work assigned by the driver. Each executor is
responsible for two things: (1) execute code assigned by the driver,
(2) report the state of the computation back to the driver
I am wondering what will happen if the storage of the spark cluster is less than the data that needs to be processed? How executors will fetch the data to sit on the physical machine in the cluster?
The same question goes for streaming data, which unbound data. Do Spark save all the incoming data on disk?
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Although Spark uses all available memory by default, it could be configured to run the jobs only with disk.
In section 2.6.4 Behavior with Insufficient Memory of Matei's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact due to the reduced amount of memory available.
In practice, you don't usually persist the source dataframe of 100TB, but only the aggregations or intermediate computations that are reused.

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory as compared to Hive, which repeatedly reads, writes into disk. Is that correct?
There is one angle that you need to consider there. You may get memory leaks if the data is not properly distributed. That means that you need to distribute your data evenly (if possible) on the Tasks so that you reduce shuffling as much as possible and make those Tasks to manage their own data. So if you need to perform a join, if data is distributed randomly, every Task (and therefore executor) will have to:
See what data they have
Send data to other executors (and tasks) to provide the same keys they need
Request the data that is needed by that task to the others
All that data exchange may cause network bottlenecks if you have a large dataset and also will make every Task to hold their data in memory plus whatever has been sent and temporary objects. All of those will blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean, if you are loading from a DB, try Spark stride as defined here. Please refer to the partitionColumn, lowerBound, upperBound attributes. That way you will create a number of partitions on the dataframe that will set the data on different tasks based on the criteria you need. If you are going to use a join of two dataframes, try similar approach on them so that partitions are similar (for not to say same) and that will prevent shuffling over network.
When you define partitions, try to make those values as evenly distributed among tasks as possible
The size of each partition should fit on memory. Although there could be spill to disk, that would slow down performance
If you don't have a column that make the data evenly distributed, try to create one that would have n number of different values, depending on the n number of tasks that you have
If you are reading from a csv, that would make it harder to create partitions, but still it's possible. You can either split the data (csv) on multiple files and create multiple dataframes (performing a union after they are loaded) or you can read that big csv and apply a repartition on the column you need. That will create shuffling as well, but it will be done once if you cache the dataframe already repartitioned
Reading from parquet it's possible that you may have multiple files but if they are not evenly distributed (because the previous process that generated didn't do it well) you may end up on OOM errors. To prevent that situation, you can load and apply repartition on the dataframe too
Or another trick valid for csv, parquet files, orc, etc. is to create a Hive table on top of that and run a query from Spark running a distribute by clause on the data, so that you can make Hive to redistribute, instead of Spark
To your question about Hive and Spark, I think you are right up to some point. Depending on the execute engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP) you can have different behaviours. With map/reduce, as they are mostly disk operations, the chance to have a OOM is much lower than on Spark. Actually from Memory point of view, map/reduce is not that affected because of a skewed data distribution. But (IMHO) your goal should be to find always the best data distribution for the Spark job you are running and that will prevent that problem
Another consideration is if you are testing in a dev environment that doesn't have same data as in a prod environment. I suppose the data distribution should be similar although volumes may differ a lot (I am talking from experience ;)). In that case, when you assign Spark tuning parameters on the spark-submit command, they may be different in prod. So you need to invest some time on finding the best approach on dev and fine tune in prod
Huge majority of OOM in Spark are on the driver, not executors. This is usually a result of running .collect or similar actions on a dataset that won't fit in the driver memory.
Spark does a lot of work under the hood to parallelize the work, when using structured APIs (in contrast to RDDs) the chances of causing OOM on executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that will impact performance and cause lots of garbage collection to happen so you need to address it, however spark should be able to handle low memory without explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using structured APIs, however it may need intervention if you see garbage collection and performance impact.

How to ensure that loading of Spark DataFrame from Parquet is distributed and parallelized?

When Spark loads source data from a file into a DataFrame, what factors govern whether the data are loaded fully into memory on a single node (most likely the driver/master node) or in the minimal, parallel subsets needed for computation (presumably on the worker/executor nodes)?
In particular, if using Parquet as the input format and loading via the Spark DataFrame API, what considerations are necessary in order to ensure that loading from the Parquet file is parallelized and deferred to the executors, and limited in scope to the columns needed by the computation on the executor node in question?
(I am looking to understand the mechanism Spark uses to schedule loading of source data in the distributed execution plan, in order to avoid exhausting memory on any one node by loading the full data set.)
As long as you use spark operations, all data transformations and aggregations are perfored only on executors. Therefore there is no need for driver to load the data, its job is to manage processing flow. Driver gets the data only in case you use some terminal operations, like collect(), first(), show(), toPandas(), toLocalIterator() and similar. Additionally, executors does not load all files content into memory, but gets the smallest posible chunks (which are called partitions).
If you use column store format such as Parquet only columns required for the execution plan are loaded - this is default behaviour in spark.
Edit: I just saw that there might be a bug in spark and if you use nested columns inside your schema then unnecessary columns may be loaded, see: Why does Apache Spark read unnecessary Parquet columns within nested structures?

Why does Spark save Map phase output to local disk?

I'm trying to understand spark shuffle process deeply. When i start reading i came across the following point.
Spark writes the Map task(ShuffleMapTask) output directly to disk on completion.
I would like to understand the following w.r.t to Hadoop MapReduce.
If both Map-Reduce and Spark writes the data to the local disk then how spark shuffle process is different from Hadoop MapReduce?
Since data is represented as RDD's in Spark why don't these outputs remain in the node executors memory?
How is the output of the Map tasks from Hadoop MapReduce and Spark different?
If there are lot of small intermediate files as output how spark handles the network and I/O bottleneck?
First of all Spark doesn't work in a strict map-reduce manner and map output is not written to disk unless it is necessary. To disk are written shuffle files.
It doesn't mean that data after the shuffle is not kept in memory. Shuffle files in Spark are written mostly to avoid re-computation in case of multiple downstream actions. Why to write to a file system at all? There at least two interleaved reasons:
memory is a valuable resource and in-memory caching in Spark is ephemeral. Old data can be evicted from cache when needed.
shuffle is an expensive process we want to avoid if not necessary. It makes more sense to store shuffle data in a manner which makes it persistent during a lifetime of a given context.
Shuffle itself, apart from the ongoing low level optimization efforts and implementation details, isn't different at all. It is based on the same basic approach with all its limitations.
How tasks are different form Hadoo maps? As nicely illustrated by Justin Pihony multiple transformations which doesn't require shuffles are squashed together in a single tasks. Since these operate on standard Scala Iterators operations on individual elements can be piped.
Regarding network and I/O bottlenecks there is no silver bullet here. While Spark can reduce amount of data which is written to disk or shuffled by combining transformations, caching in memory and providing transformation aware worker preferences, it is a subject to the same limitations like any other distributed framework.
If both Map-Reduce and Spark writes the data to the local disk then how spark shuffle process is different from Hadoop MapReduce?
When you execute a Spark application, the very first thing is starting the SparkContext first that becomes the home of multiple interconnected services with DAGScheduler, TaskScheduler and SchedulerBackend being among the most important ones.
DAGScheduler is the main orchestrator and is responsible for transforming a RDD lineage graph (i.e. a directed acyclic graph of RDDs) into stages. While doing it, DAGScheduler traverses the parent dependencies of the final RDD and creates a ResultStage with parent ShuffleMapStages.
A ResultStage is (mostly) the last stage with ShuffleMapStages being its parents. I said mostly because I think I may have seen that you can "schedule" a ShuffleMapStage.
This is the very early and first optimization Spark applies to your Spark jobs (that together create a Spark application) - execution pipelining where multiple transformations are wired together to create a single stage (because their inter-dependencies are narrow). That's what makes Spark faster than Hadoop MapReduce since two or more transformations can get executed one by one with no data shuffling possibly all in memory.
A single stage is as wide until it hits ShuffleDependency (aka wide dependency).
There are RDD transformations that will cause shuffling (due to creating a ShuffleDependency). That's the moment where Spark is very much like Hadoop's MapReduce since it will save partial shuffle outputs to...local disks on executors.
When a Spark application starts it requests executors from a cluster manager (there are three supported: Spark Standalone, Apache Mesos and Hadoop YARN). This is what SchedulerBackend is for -- to manage communication between your Spark application and cluster resources.
(Let's assume you are not using External Shuffle Manager)
Executors host their own local BlockManagers that are responsible for managing RDD blocks that are kept on local hard drive (possibly in memory and replicated too). You can control RDD block persistence using cache and persist operators and StorageLevels. You can use Storage and Executors tabs in web UI to track blocks with their location and size.
The difference between Spark storing data locally (on executors) and Hadoop MapReduce is that:
The partial results (after computing ShuffleMapStages) are saved on local hard drives not HDFS which is a distributed file system with a very expensive saves.
Only some files are saved to local hard drive (after operations being pipelined) which does not happen in Hadoop MapReduce that saves all maps to HDFS.
Let me answer the following item:
If there are lot of small intermediate files as output how spark handles the network and I/O bottleneck?
That's the trickest part in the Spark execution plan and heavily depends on how wide the shuffling is. If you work only with local data (multiple executors on a single machine) you will see no data traffic since the data is in place already.
If the data shuffle is required, executors will send data between each other and that will increase the traffic.
Data Exchange Between Nodes in Spark Application
Just to elaborate on the traffic between nodes in a Spark application.
Broadcast variables are the means of sending data from the driver to executors.
Accumulators are the means of sending data from executors to the driver.
Operators like collect will pull all the remote blocks from executors to the driver.

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
Master Node in Spark is for allocating the resources to a particular job and once the resources are allocated, the Driver ships the complete code with all its dependencies to the various executors.
The first step in every code is to load the data to the Spark cluster. You can read the data from any underlying data repository like Database, filesystem, webservices etc.
Once data is loaded it is wrapped into an RDD which is partitioned across the nodes in the cluster and further stored in the workers/ Executors Memory. Though you can control the number of partitions by leveraging various RDD API's but you should do it only when you have valid reasons to do so.
Now all operations are performed over RDD's using its various methods/ Operations exposed by RDD API. RDD keep tracks of partitions and partitioned data and depending upon the need or request it automatically query the appropriate partition.
In nutshell, you do not have to worry about the way data is partitioned by RDD or which partition stores which data and how they communicate with each other but if you do care, then you can write your own custom partitioner, instructing Spark of how to partition your data.
Secondly if your data cannot be partitioned then I do not think Spark would be an ideal choice because that will result in processing of everything in 1 single machine which itself is contrary to the idea of distributed computing.
Not sure what is exactly your use case but there are people who have been leveraging Spark for Image processing. see here for the comments from Databricks

Resources