Spark - how does it distribute data around the nodes? - apache-spark

How does spark distribute data to workers?
Do the workers read from the data source, or does the driver reads it and sends it to workers? And when a worker needs data that is in another worker, do they communicate directly?
Thanks!

If you use distributed input methods like SparkContext.textFile then workers read directly from your data source (or if you explicitly open HDFS files from inside worker task code then of course those will also occur on the workers).
If you manually read data in on your main driver program, and then used SparkContext.parallelize, then indeed your driver will be sending data to your workers.
Data dependencies from worker to worker are generally referred to as the shuffle; this type of worker-to-worker communication is in a lot of ways the heart of most big data processing systems, precisely because it's tricky to do efficiently and reliably. Conceptually you can treat it more-or-less as "communicating directly", but there may be a lot more going on under the hood depending on how the data dependency is taken on.

Related

How is data accessed by worker nodes in a Spark Cluster?

I'm trying to understand the functioning of Spark, I know the Cluster manager allocates the resources (Workers) for the driver program.
I want to know, how (which transformations) the cluster manager sends the tasks to worker nodes and how worker nodes access the data (Assume my data is in S3)?
Does worker nodes read only a part of data and apply all transformations on it and return the actions to the driver program? or The worker nodes reads the entire file but only apply specific transformation and return back the result to the driver program?
Follow-up questions:
How and who decides how much amount of data needs to be sent to worker nodes? as we have established a point that partial data is present on each worker node. Eg: I have two worker nodes with 4 cores each and I have one 1TB csv file to read and perform few transformations and an action. assume the csv is on S3 and on the master node's local storage.
It's going to be a long answer, but I will try to simplify it at my best:
Typically a Spark cluster contains multiple nodes, each node would have multiple CPUs, a bunch of memory, and storage. Each node would hold some chunks of data, therefore sometimes they're also referred to data nodes as well.
When Spark application(s) are started, they tend to create multiple workers or executors. Those workers/executors took resources (CPU, RAM) from the cluster's nodes above. In other words, the nodes in a Spark cluster play both roles: data storage and computation.
But as you might have guessed, data in a node (sometimes) is incomplete, therefore, workers would have to "pull" data across the network to do a partial computation. Then the results are sent back to the driver. The driver would just do the "collection work", and combine them all to get the final results.
Edit #1:
How and who decides how much amount of data needs to be sent to worker nodes
Each task would decide which data is "local" and which is not. This article explains pretty well how data locality works
I have two worker nodes with 4 cores each and I have one 1TB csv file to read and perform a few transformations and an action
This situation is different with the above question, where you have only one file and most likely your worker would be exactly the same as your data node. The executor(s) those are sitting on that worker, however, would read the file piece by piece (by tasks), in parallel, in order to increase parallelism.

Best way to return data back to driver from spark workers

We're facing performance issues running big data task on a single machine. Task by design is memory and compute intensive and is running an optimization algorithm (branch and bound algorithm) on huge data sets. A single EC2 c5.24x large machine (96vCPU/192GiB) is now taking over 2 days to complete the task. We have tried to optimize the code (multiple threads, more memory, parallelism, optimized algorithm) but looks like there's a limit to how much we could achieve and doesn't sound like a scalable option as the data set is growing and we're adding more use cases to it.
Thinking of distributing this in to smaller tasks and have it executed by multiple workers in spark cluster. Output of the task will be a single Gzipped JSON (2- 20 MB in size) and by distributing i want each worker to build smaller JSONs or RDD chunks which could later be merged at the driver side.
Is this doable? Is there a limit on how much data each worker can send back to the driver? Or is it better to store each worker output to some database (S3) and then merge at driver side? What are the pros and cons of each approach?
I would suggest not to collect data from executors to driver and combine them there. It makes the driver a bottleneck and it will run into frequent resource related issues. The best option is to let the executors process the data and produce JSONs. You can leave the output JSONs as is which will help in reprocessing them again. If you are too worried about size of these JSONs or want to combine them to send to another process then you can use a file utility to combine them or you can use dataframe.coleasc(1) to generate one output file.

How to use driver to load data and executors for processing and writing?

I would like to use Spark structured streaming to watch a drop location that exists on the driver only. I do this with
val trackerData = spark.readStream.text(sourcePath)
After that I would like to parse, filter, and map incoming data and write it out to elastic.
This works well except that it does only work when spark.master is set to e.g. local[*]. When set to yarn, no files get found even when deployment mode is set to client.
I thought that reading data in from local driver node is achieved by setting deployment to client and doing the actual processing and writing within the Spark cluster.
How could I improve my code to use driver for reading in and cluster for processing and writing?
What you want is possible, but not recommended in Spark Structured Streaming in particular and in Apache Spark in general.
The main motivation of Apache Spark is to bring computation to the data not the opposite as Spark is to process petabytes of data that a single JVM (of a driver) would not be able to handle.
The driver's "job" (no pun intended) is to convert a RDD lineage (= a DAG of transformations) to tasks that know how to load a data. Tasks are executed on Spark executors (in most cases) and that's where data processing happens.
There are some ways to make the reading part on driver and processing on executors and among them the most "lucrative" would be to use broadcast variables.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
One idea that came to my mind is that you could "hack" Spark "Streams" and write your own streaming sink that would do collect or whatever. That could make the processing local.

Spark Ingestion path: "Source to Driver to Worker" or "Source to Workers"

When Spark ingest the Data, is there specific situation where it has to go trough the driver and then from the driver the worker ? Same question apply for a direct read by the worker.
I guess i am simply trying to map out what are the condition or situation that lead to one way or the other, and how does partitioning happen in each case.
If you limit yourself to built-in methods then unless you create distributed data structure from a local one with method like:
SparkSession.createDataset
SparkContext.parallelize
data is always accessed directly by the workers, but the details of the data distribution will vary from source to source.
RDDs typically depend on Hadoop input formats, but Spark SQL and data source API, are at least partially independent, at least when it comes to configuration,
It doesn't mean data is always properly distributed. In some cases (JDBC, streaming receivers) data may still be piped trough a single node.

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
Master Node in Spark is for allocating the resources to a particular job and once the resources are allocated, the Driver ships the complete code with all its dependencies to the various executors.
The first step in every code is to load the data to the Spark cluster. You can read the data from any underlying data repository like Database, filesystem, webservices etc.
Once data is loaded it is wrapped into an RDD which is partitioned across the nodes in the cluster and further stored in the workers/ Executors Memory. Though you can control the number of partitions by leveraging various RDD API's but you should do it only when you have valid reasons to do so.
Now all operations are performed over RDD's using its various methods/ Operations exposed by RDD API. RDD keep tracks of partitions and partitioned data and depending upon the need or request it automatically query the appropriate partition.
In nutshell, you do not have to worry about the way data is partitioned by RDD or which partition stores which data and how they communicate with each other but if you do care, then you can write your own custom partitioner, instructing Spark of how to partition your data.
Secondly if your data cannot be partitioned then I do not think Spark would be an ideal choice because that will result in processing of everything in 1 single machine which itself is contrary to the idea of distributed computing.
Not sure what is exactly your use case but there are people who have been leveraging Spark for Image processing. see here for the comments from Databricks

Resources