How is data accessed by worker nodes in a Spark Cluster? - apache-spark

I'm trying to understand how Spark works. I know the cluster manager allocates resources (workers) to the driver program.
I want to know how the cluster manager sends tasks (which transformations) to the worker nodes, and how the worker nodes access the data (assume my data is in S3).
Do the worker nodes each read only a part of the data, apply all the transformations to it, and return the results of the actions to the driver program? Or does each worker node read the entire file, apply only a specific transformation, and return the result to the driver program?
Follow-up questions:
How, and by whom, is it decided how much data needs to be sent to the worker nodes, given that we have established that only partial data is present on each worker node? E.g. I have two worker nodes with 4 cores each and one 1 TB CSV file to read, a few transformations to perform, and an action. Assume the CSV is on S3 and on the master node's local storage.

It's going to be a long answer, but I will try to simplify it as best I can:
Typically a Spark cluster contains multiple nodes; each node has multiple CPUs, a bunch of memory, and storage. Each node holds some chunks of data, which is why the nodes are sometimes also referred to as data nodes.
When Spark application(s) are started, they create multiple workers, or executors. Those workers/executors take resources (CPU, RAM) from the cluster's nodes above. In other words, the nodes in a Spark cluster play both roles: data storage and computation.
But as you might have guessed, the data on a node is (sometimes) incomplete, so workers may have to "pull" data across the network to do a partial computation. The results are then sent back to the driver. The driver just does the "collection work" and combines them all to get the final result.
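A minimal sketch of that flow in Scala (the bucket, path and column names are hypothetical, and it assumes the s3a connector and credentials are already configured): each executor reads and aggregates only its own slices of the S3 file, and only the small aggregated result is collected to the driver.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("s3-read-sketch").getOrCreate()

// Hypothetical S3 path; each executor reads only the file splits assigned to it.
val df = spark.read.option("header", "true").csv("s3a://my-bucket/data/events.csv")

// Transformations run per partition on the executors...
val perCountry = df.filter(col("status") === "active").groupBy("country").count()

// ...and only the small aggregated result is pulled back to the driver.
perCountry.collect().foreach(println)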
Edit #1:
How, and by whom, is it decided how much data needs to be sent to the worker nodes?
Each task decides which data is "local" and which is not. This article explains pretty well how data locality works.
I have two worker nodes with 4 cores each and one 1 TB CSV file to read, a few transformations to perform, and an action
This situation is different from the question above: here you have only one file, and your worker would most likely be exactly the same machine as your data node. The executor(s) sitting on that worker, however, would read the file piece by piece (task by task), in parallel, to increase parallelism.
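As a rough illustration of that piece-by-piece reading (the path is hypothetical; spark.sql.files.maxPartitionBytes is the setting that caps the size of each input split):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioning-sketch")
  // Each input split is at most 128 MB (the default), so a 1 TB CSV becomes roughly 8000 read tasks.
  .config("spark.sql.files.maxPartitionBytes", "134217728")
  .getOrCreate()

val df = spark.read.option("header", "true").csv("s3a://my-bucket/big/one-terabyte.csv")

// With 2 workers x 4 cores = 8 task slots, about 8 of those tasks run at any given time.
println(s"input partitions: ${df.rdd.getNumPartitions}")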

Related

How can Spark process data that is way larger than Spark storage?

Currently taking a course in Spark and came across the definition of an executor:
Each executor will hold a chunk of the data to be processed. This chunk is called a Spark partition. It is a collection of rows that sits on one physical machine in the cluster. Executors are responsible for carrying out the work assigned by the driver. Each executor is responsible for two things: (1) execute code assigned by the driver, (2) report the state of the computation back to the driver.
I am wondering what will happen if the storage of the Spark cluster is less than the data that needs to be processed. How will the executors fetch the data to sit on the physical machines in the cluster?
The same question goes for streaming data, which is unbounded. Does Spark save all the incoming data on disk?
The Apache Spark FAQ briefly mentions the two strategies Spark may adopt:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Although Spark uses all available memory by default, it can be configured to run jobs with disk only.
Section 2.6.4, Behavior with Insufficient Memory, of Matei's PhD dissertation on Spark (An Architecture for Fast and General Data Processing on Large Clusters) benchmarks the performance impact of reducing the amount of memory available.
In practice, you don't usually persist a 100 TB source dataframe, but only the aggregations or intermediate computations that are reused.
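A sketch of that practice (paths and column names are hypothetical): the huge source is only scanned, while the reused aggregation is persisted with a storage level that is allowed to spill.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-sketch").getOrCreate()

// The (hypothetical) 100 TB source is only scanned, never cached as a whole.
val events = spark.read.parquet("s3a://my-bucket/events/")

// Persist only the much smaller aggregation that is reused downstream.
// MEMORY_AND_DISK keeps what fits in memory and spills the rest; DISK_ONLY skips memory entirely.
val dailyCounts = events.groupBy("event_date").count().persist(StorageLevel.MEMORY_AND_DISK)

dailyCounts.count()                        // the first action materializes the cached result
dailyCounts.orderBy("event_date").show()   // later actions reuse it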

Spark dataframe.load(): when I execute a load command, where does my data actually get stored?

If I am loading one table from Cassandra using Spark dataframe.load(), where will my data get loaded? Is it in Spark memory, or in data node blocks, if I am using the YARN resource manager?
It will try to store the data in memory, per the number of partitions, on the Worker Nodes, which in this context is a slightly better term than Data Nodes.
It will spill to disk if there is not enough memory on the Worker Nodes.
Processing occurs per the number of Cores / Executors. E.g. if you have, say, 20 Executors with 1 Core each, your processing concurrency is 20, and spilling will occur via eviction. If you run out of disk, an error will result.
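As a sketch of that sizing (the memory figure is a hypothetical choice; spark.executor.instances is honored on YARN and Kubernetes, not in local mode), 20 Executors with 1 Core each give a processing concurrency of 20:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("concurrency-sketch")
  .config("spark.executor.instances", "20")   // 20 Executors...
  .config("spark.executor.cores", "1")        // ...with 1 Core each => 20 concurrent tasks
  .config("spark.executor.memory", "4g")      // hypothetical per-executor memory
  .getOrCreate()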
Worker Nodes is the better term here compared to Data Nodes, unless you have HDFS and local processing, in which case a Worker Node is equal to a Data Node. Although you could argue: what's in a name?
Of course, an Action will need to have been initiated.
And repartition, join, or union later in the data pipeline affect things, but that goes without saying.
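A minimal sketch of the load in question, assuming the DataStax spark-cassandra-connector is on the classpath (host, keyspace and table names are hypothetical):
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("cassandra-load-sketch")
  .config("spark.cassandra.connection.host", "cassandra-host")   // hypothetical host
  .getOrCreate()

val orders = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "shop", "table" -> "orders"))
  .load()

// Nothing is materialized until an action runs; partitions then land in memory on the
// Worker Nodes and spill to local disk if memory is short.
orders.persist(StorageLevel.MEMORY_AND_DISK)
println(orders.count())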

How Spark internally works when reading HDFS files

Say I have a 256 KB file stored on the HDFS file system of one node, as two blocks of 128 KB each. Assume I have a two-node cluster with only 1 core each. My understanding is that, during a transformation, Spark will read the complete file into memory on one node and then transfer one file block's data in memory to the other node so that both nodes/cores can execute it in parallel. Is that correct?
What if both nodes had two cores each instead of one core? In that case, could two cores on a single node do the computation? Is that right?
val text = sc.textFile("mytextfile.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
Your question is a little hypothetical, as it is unlikely you would have a Hadoop cluster with HDFS consisting of only one Data Node and two Worker Nodes, one being both Worker and Data Node. That is to say, the whole idea of Spark (and MR) with HDFS is to bring the processing to the data. The Worker Nodes are in fact the Data Nodes in the standard Hadoop setup. This is the original intent.
Some variations to answer your question:
Assuming the case described above, each Worker Node would process one partition and the subsequent transformations on the newly generated RDDs until finished. You may of course repartition the data, and what happens then depends on the number of partitions and the number of Executors per Worker Node.
In a nutshell: if you have N blocks / partitions initially and fewer than N Executors (E) allocated on a Hadoop cluster with HDFS, then you will get some transfer of blocks (not a shuffle, as is talked about elsewhere) to the Workers assigned, from Workers where no Executor was allocated to your Spark app; otherwise the block is obviously assigned to be processed on that Data / Worker Node. Each block / partition is processed in some way, shuffled, and the next set of partitions (or partition) is read in and processed, depending on the speed of processing of your transformation(s).
In the case of AWS S3 and Microsoft's and Google's equivalent cloud storage, which set aside the principle of data locality as in the above case (i.e. compute power is divorced from storage, on the assumption that the network is not the bottleneck, which was exactly Hadoop's classic reason to bring the processing to the data), it works similarly to the aforementioned: the S3 data is transferred to the Workers.
All of this assumes an Action has been invoked.
I leave aside the principles of Rack Awareness, etc., as it all becomes quite complicated, but the Resource Managers understand these things and decide accordingly.
In the first case, Spark will usually load one partition on the first node and then, if it cannot find a free core, will load the 2nd partition on the 2nd node after waiting for spark.locality.wait (default 3 seconds).
In the 2nd case both partitions will be loaded on the same node, unless it does not have both cores free.
Many circumstances can cause this to change if you play with the default configurations.
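A sketch of the knob being described (the value shown is the default; the HDFS path is hypothetical):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-sketch")
  // How long the scheduler waits for a node-local slot before settling for a less local one.
  .config("spark.locality.wait", "3s")
  .getOrCreate()

val rdd = spark.sparkContext.textFile("hdfs:///data/mytextfile.txt")   // hypothetical path
rdd.count()   // tasks prefer the nodes holding the blocks, falling back once the wait expires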

Where does Spark schedule .textFile() task

Say I want to read data from an external HDFS cluster, and I have 3 workers in my cluster (one maybe a bit closer to external_host, but not on the same host).
sc.textFile("hdfs://external_host/file.txt")
I understand that Spark schedules tasks based on the locality of the underlying RDD. But on which worker (i.e. executor) is .textFile(..) scheduled, since we do not have an executor running on external_host?
I imagine it loads the HDFS blocks as partitions into worker memory, but how does Spark decide which worker is best? (I would imagine it chooses the closest one, based on latency or something else; is this correct?)
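One way to check rather than guess, sketched below (it assumes a SparkContext sc, e.g. in spark-shell): each partition of the RDD reports the hosts holding its block, and the scheduler prefers those hosts, falling back to others after spark.locality.wait.
val rdd = sc.textFile("hdfs://external_host/file.txt")

rdd.partitions.foreach { p =>
  // preferredLocations lists the hosts where this partition's HDFS block lives.
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}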

Why does Spark save Map phase output to local disk?

I'm trying to understand the Spark shuffle process in depth. When I started reading, I came across the following point:
Spark writes the Map task(ShuffleMapTask) output directly to disk on completion.
I would like to understand the following w.r.t. Hadoop MapReduce:
If both MapReduce and Spark write the data to the local disk, then how is the Spark shuffle process different from Hadoop MapReduce?
Since data is represented as RDDs in Spark, why don't these outputs remain in the executors' memory?
How is the output of the Map tasks different between Hadoop MapReduce and Spark?
If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?
First of all, Spark doesn't work in a strict map-reduce manner, and map output is not written to disk unless it is necessary. What is written to disk are the shuffle files.
That doesn't mean that data after the shuffle is not kept in memory. Shuffle files in Spark are written mostly to avoid re-computation in case of multiple downstream actions. Why write to a file system at all? There are at least two interleaved reasons:
memory is a valuable resource and in-memory caching in Spark is ephemeral. Old data can be evicted from the cache when needed.
shuffle is an expensive process we want to avoid if it is not necessary. It makes more sense to store shuffle data in a manner which makes it persistent for the lifetime of a given context.
The shuffle itself, apart from the ongoing low-level optimization efforts and implementation details, isn't different at all. It is based on the same basic approach, with all its limitations.
How are tasks different from Hadoop maps? As nicely illustrated by Justin Pihony, multiple transformations which don't require shuffles are squashed together into single tasks. Since these operate on standard Scala Iterators, operations on individual elements can be pipelined.
Regarding network and I/O bottlenecks, there is no silver bullet here. While Spark can reduce the amount of data which is written to disk or shuffled by combining transformations, caching in memory and providing transformation-aware worker preferences, it is subject to the same limitations as any other distributed framework.
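A small sketch of that pipelining (it assumes a SparkContext sc; the path is hypothetical): the narrow transformations form one task, and reduceByKey introduces the shuffle boundary, which toDebugString makes visible.
val words = sc.textFile("hdfs:///data/mytextfile.txt")
  .flatMap(_.split(" "))            // narrow: pipelined with the read, same task
  .map(word => (word, 1))           // narrow: still the same task

val counts = words.reduceByKey(_ + _)   // wide: shuffle files written to executor-local disk

// The lineage printout shows the stage boundary introduced by the shuffle.
println(counts.toDebugString)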
If both MapReduce and Spark write the data to the local disk, then how is the Spark shuffle process different from Hadoop MapReduce?
When you execute a Spark application, the very first thing is starting the SparkContext, which becomes the home of multiple interconnected services, with DAGScheduler, TaskScheduler and SchedulerBackend being among the most important ones.
DAGScheduler is the main orchestrator and is responsible for transforming an RDD lineage graph (i.e. a directed acyclic graph of RDDs) into stages. While doing so, DAGScheduler traverses the parent dependencies of the final RDD and creates a ResultStage with parent ShuffleMapStages.
A ResultStage is (mostly) the last stage with ShuffleMapStages being its parents. I said mostly because I think I may have seen that you can "schedule" a ShuffleMapStage.
This is the very first and earliest optimization Spark applies to your Spark jobs (that together create a Spark application): execution pipelining, where multiple transformations are wired together into a single stage (because their inter-dependencies are narrow). That's what makes Spark faster than Hadoop MapReduce, since two or more transformations can get executed one after another with no data shuffling, possibly all in memory.
A single stage extends until it hits a ShuffleDependency (aka a wide dependency).
There are RDD transformations that will cause shuffling (due to creating a ShuffleDependency). That's the moment where Spark is very much like Hadoop's MapReduce, since it will save partial shuffle outputs to... local disks on executors.
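Where exactly those local shuffle files land can be pointed at specific disks via spark.local.dir, sketched below (the directories are hypothetical; on YARN the node manager's local directories are used instead):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-dirs-sketch")
  // Comma-separated list of scratch directories for shuffle and spill files.
  .config("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")
  .getOrCreate()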
When a Spark application starts it requests executors from a cluster manager (there are three supported: Spark Standalone, Apache Mesos and Hadoop YARN). This is what SchedulerBackend is for -- to manage communication between your Spark application and cluster resources.
(Let's assume you are not using the External Shuffle Service.)
Executors host their own local BlockManagers that are responsible for managing RDD blocks kept on the local hard drive (possibly in memory and replicated too). You can control RDD block persistence using the cache and persist operators and StorageLevels. You can use the Storage and Executors tabs in the web UI to track blocks with their locations and sizes.
The difference between Spark storing data locally (on executors) and Hadoop MapReduce is that:
The partial results (after computing ShuffleMapStages) are saved to local hard drives, not to HDFS, which is a distributed file system with very expensive saves.
Only some files are saved to the local hard drive (after operations have been pipelined), whereas Hadoop MapReduce saves the output of all map tasks to HDFS.
Let me answer the following item:
If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?
That's the trickiest part of the Spark execution plan, and it heavily depends on how wide the shuffling is. If you work only with local data (multiple executors on a single machine) you will see no data traffic, since the data is in place already.
If the data shuffle is required, executors will send data between each other and that will increase the traffic.
Data Exchange Between Nodes in Spark Application
Just to elaborate on the traffic between nodes in a Spark application.
Broadcast variables are the means of sending data from the driver to executors.
Accumulators are the means of sending data from executors to the driver.
Operators like collect will pull all the remote blocks from executors to the driver.
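A compact sketch of those three mechanisms (it assumes a SparkContext sc; the lookup table is made up, and accumulator updates inside transformations are only best-effort):
val lookup = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))   // driver -> executors
val unknownCodes = sc.longAccumulator("unknownCodes")                        // executors -> driver

val named = sc.parallelize(Seq("US", "DE", "??")).map { code =>
  val name = lookup.value.getOrElse(code, { unknownCodes.add(1); "unknown" })
  (code, name)
}

// collect() pulls every resulting block from the executors back to the driver.
named.collect().foreach(println)
println(s"codes without a name: ${unknownCodes.value}")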
