Spark Ingestion path: "Source to Driver to Worker" or "Source to Workers" - apache-spark

When Spark ingests data, are there specific situations where it has to go through the driver and then from the driver to the workers? The same question applies to a direct read by the workers.
I guess I am simply trying to map out the conditions or situations that lead to one way or the other, and how partitioning happens in each case.

If you limit yourself to built-in methods, then unless you create a distributed data structure from a local one with a method like:
SparkSession.createDataset
SparkContext.parallelize
data is always accessed directly by the workers, but the details of the data distribution will vary from source to source.
RDDs typically depend on Hadoop input formats, while Spark SQL and the data source API are at least partially independent, at least when it comes to configuration.
That doesn't mean data is always properly distributed. In some cases (JDBC, streaming receivers) data may still be piped through a single node.
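To make the contrast concrete, here is a minimal sketch of the two paths (the HDFS path is just a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ingestion-paths").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// Path 1: the data starts as a local collection on the driver, so the driver
// has to ship it to the executors when the job runs.
val localRows = Seq(("a", 1), ("b", 2), ("c", 3))
val fromDriverRdd = sc.parallelize(localRows, numSlices = 3)
val fromDriverDs  = spark.createDataset(localRows)

// Path 2: only the path/URI is sent to the executors; each task opens and
// reads its own split of the source directly, with no driver hop.
val readByWorkers = spark.read.text("hdfs:///data/events/*.txt")

println(fromDriverRdd.getNumPartitions)      // driven by numSlices
println(readByWorkers.rdd.getNumPartitions)  // driven by the source's splits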

Related

Why there is no JDBC Spark Streaming receiver?

I suggest it's a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I suppose there is no monitoring of new rows in the table, just reading the table once.
I was surprised that there is no JDBC Spark Streaming receiver implementation. Implementing a Receiver doesn't look difficult.
Could you describe why such a receiver doesn't exist (is this approach a bad idea?) or provide links to implementations?
I've found Stratio/datasource-receiver. But it reads all data in a DataFrame before processing by Spark Streaming.
Thanks!
First of all, an actual streaming source would require a reliable mechanism for monitoring updates, which is simply not a part of the JDBC interface, nor is it a standardized (if existent at all) feature of major RDBMSs, not to mention other platforms that can be accessed through JDBC. It means that streaming from a source like this typically requires using log replication or similar facilities, and is highly resource dependent.
At the same time, what you describe:
suggest it's a good idea to process a huge JDBC table by reading rows in batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I suppose there is no monitoring of new rows in the table, just reading the table once
is really not a use case for streaming. Streaming deals with infinite streams of data, while what you ask for is simply a scenario for partitioning, and such capabilities are already a part of the standard JDBC connector (either by range or by predicate).
Additionally, receiver-based solutions simply don't scale well and effectively model a sequential process. As a result their applications are fairly limited, and they would be even less appealing if the data were bounded (if you're going to read finite data sequentially on a single node, there is no value in adding Spark to the equation).
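For the bounded-table scenario described in the question, the partitioning already built into the JDBC connector is usually what you want. A hedged sketch, with connection details, table and column names as placeholders:

// Range-based partitioning: each of the 8 partitions issues its own range
// query, so the executors pull their slices in parallel instead of piping
// everything through a single connection.
val byRange = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "big_table")
  .option("user", "spark")
  .option("password", "...")
  .option("partitionColumn", "id")   // must be numeric, date, or timestamp
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// Predicate-based partitioning: one partition per predicate.
val predicates = Array("region = 'EU'", "region = 'US'", "region = 'APAC'")
val byPredicate = spark.read.jdbc(
  "jdbc:postgresql://db-host:5432/mydb", "big_table", predicates,
  new java.util.Properties())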
I don't think it is a bad idea, since in some cases you have constraints that are outside your control, e.g. legacy systems to which you cannot apply strategies such as CDC, but which you still have to consume as a source of streaming data.
On the other hand, the Spark Structured Streaming engine, in micro-batch mode, requires the definition of an offset that can be advanced, as you can see in this class. So, if your table has some column that can be used as an offset, you can definitely stream from it, although RDBMSs are not that "streaming-friendly" as far as I know.
I have developed Jdbc2s, which is a DataSource V1 streaming source for Spark. It's also deployed to Maven Central, if you need it. Coordinates are in the documentation.

How to use driver to load data and executors for processing and writing?

I would like to use Spark structured streaming to watch a drop location that exists on the driver only. I do this with
val trackerData = spark.readStream.text(sourcePath)
After that I would like to parse, filter, and map incoming data and write it out to elastic.
This works well, except that it only works when spark.master is set to e.g. local[*]. When set to yarn, no files get found, even when deploy mode is set to client.
I thought that reading data in from the local driver node is achieved by setting deploy mode to client, with the actual processing and writing happening within the Spark cluster.
How could I improve my code to use driver for reading in and cluster for processing and writing?
What you want is possible, but not recommended in Spark Structured Streaming in particular and in Apache Spark in general.
The main motivation of Apache Spark is to bring computation to the data, not the opposite, as Spark is meant to process petabytes of data that a single JVM (the driver's) would not be able to handle.
The driver's "job" (no pun intended) is to convert an RDD lineage (= a DAG of transformations) into tasks that know how to load data. Tasks are executed on Spark executors (in most cases) and that's where data processing happens.
There are some ways to do the reading on the driver and the processing on the executors, and among them the most "lucrative" would be to use broadcast variables.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
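As a rough illustration of that idea (not a drop-in fix for the streaming source in the question, since broadcast only helps for static data read once on the driver; the lookup file path and field layout below are made up):

import scala.io.Source
import spark.implicits._

// Read a small lookup file that exists only on the driver machine; this code
// runs in the driver JVM, so the local path is visible.
val localLookup: Map[String, String] =
  Source.fromFile("/driver/only/lookup.csv").getLines()
    .map(_.split(","))
    .map(a => a(0) -> a(1))
    .toMap

// Broadcast it once; executors fetch the value lazily, at most once per JVM.
val lookupBc = spark.sparkContext.broadcast(localLookup)

// This map runs on the executors but can read the broadcast value freely.
val enriched = trackerData.as[String].map { line =>
  val key = line.split("\\|").head
  s"$line,${lookupBc.value.getOrElse(key, "unknown")}"
}
// `enriched` can then be filtered and written out to Elastic as before.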
One idea that came to my mind is that you could "hack" Spark Structured Streaming and write your own streaming sink that would do collect or whatever. That could make the processing local.
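A minimal sketch of that "sink that collects" idea, using the built-in foreachBatch hook (available since Spark 2.4) instead of a hand-written sink; every micro-batch is pulled back to the driver with collect(), so this is only reasonable for small batches:

import org.apache.spark.sql.DataFrame

// Handle each micro-batch on the driver.
val handleBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
  val rows = batchDf.collect()   // materializes the batch in the driver JVM
  rows.foreach(r => println(s"batch $batchId: ${r.getString(0)}"))
}

val query = trackerData.writeStream
  .foreachBatch(handleBatch)
  .start()

query.awaitTermination()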

Spark - how does it distribute data around the nodes?

How does Spark distribute data to the workers?
Do the workers read from the data source, or does the driver read it and send it to the workers? And when a worker needs data that is on another worker, do they communicate directly?
Thanks!
If you use distributed input methods like SparkContext.textFile, then the workers read directly from your data source (and if you explicitly open HDFS files from inside worker task code, those reads will of course also occur on the workers).
If you manually read data in your main driver program and then use SparkContext.parallelize, then indeed your driver will send that data to your workers.
Data dependencies from worker to worker are generally referred to as the shuffle; this type of worker-to-worker communication is in a lot of ways the heart of most big data processing systems, precisely because it's tricky to do efficiently and reliably. Conceptually you can treat it more or less as "communicating directly", but there may be a lot more going on under the hood depending on how the data dependency is realized.
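A tiny sketch of where that shuffle shows up in ordinary code (the input path is a placeholder): the map-side tasks write shuffle files, and the reduce-side tasks running on other workers fetch those blocks directly from the executors that wrote them.

val words = sc.textFile("hdfs:///data/corpus/*.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

val counts = words.reduceByKey(_ + _)   // <- shuffle boundary: worker-to-worker exchange
counts.take(10).foreach(println)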

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
The master node in Spark is responsible for allocating resources to a particular job; once the resources are allocated, the driver ships the complete code, with all its dependencies, to the various executors.
The first step in every job is to load the data into the Spark cluster. You can read the data from any underlying data repository, such as a database, a filesystem, web services, etc.
Once the data is loaded, it is wrapped into an RDD, which is partitioned across the nodes in the cluster and stored in the workers'/executors' memory. You can control the number of partitions by leveraging the various RDD APIs, but you should do so only when you have a valid reason.
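A couple of examples of controlling the partition count explicitly (the input path is a placeholder):

val raw = sc.textFile("hdfs:///data/input", minPartitions = 16)
println(raw.getNumPartitions)

val wider    = raw.repartition(64)   // full shuffle into 64 partitions
val narrower = raw.coalesce(8)       // merges partitions, avoids a shuffle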
All operations are then performed on RDDs using the various methods and operations exposed by the RDD API. An RDD keeps track of its partitions and the partitioned data, and depending on the need or request it automatically queries the appropriate partition.
In a nutshell, you do not have to worry about the way data is partitioned by the RDD, which partition stores which data, or how the partitions communicate with each other; but if you do care, you can write your own custom partitioner, instructing Spark how to partition your data.
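For completeness, a hedged sketch of such a custom partitioner; the "region prefix" keying rule is made up for illustration:

import org.apache.spark.Partitioner

// Route keys to partitions by a rule of your own (here, the text before ':')
// instead of the default hash partitioning on the whole key.
class RegionPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val region = key.toString.takeWhile(_ != ':')   // e.g. "EU:1234" -> "EU"
    math.abs(region.hashCode) % numPartitions
  }
}

val keyed  = sc.parallelize(Seq("EU:1" -> 1, "US:2" -> 2, "EU:3" -> 3))
val custom = keyed.partitionBy(new RegionPartitioner(4))
println(custom.partitioner)   // Some(RegionPartitioner@...)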
Secondly, if your data cannot be partitioned, then I do not think Spark would be an ideal choice, because that would result in everything being processed on a single machine, which is contrary to the idea of distributed computing.
Not sure exactly what your use case is, but people have been leveraging Spark for image processing; see here for the comments from Databricks.

Spark Streaming with large number of streams and models used for analytical processing of RDDs

We are creating a real-time stream processing system with Spark Streaming which uses a large number (millions) of analytic models applied to RDDs in many different types of incoming metric data streams (more than 100,000). These streams are either original or transformed streams. Each RDD has to go through an analytical model for processing. Since we do not know which Spark cluster node will process which specific RDDs from the different streams, we need to make ALL these models available at each Spark compute node. This will create huge overhead at each Spark node. We are considering using in-memory data grids to provide these models at the Spark compute nodes. Is this the right approach?
Or
Should we avoid using Spark Streaming altogether and just use in-memory data grids like Redis (with pub/sub) to solve this problem? In that case we would stream data to the specific Redis nodes which contain the specific models. Of course we would have to do all the binning/windowing etc. ourselves.
Please suggest.
Sounds to me like you need a combination of a stream processing engine and a distributed data store. I would design the system like this:
1. The distributed datastore (Redis, Cassandra, etc.) holds the data you want to access from all the nodes.
2. Receive the data streams through a data ingestion system (Kafka, Flume, ZeroMQ, etc.) and process them in the stream processing system (Spark Streaming [preferably ;)], Storm, etc.).
3. In the functions used to process the stream records, the necessary data is pulled from the data store and possibly cached locally as appropriate (see the sketch after this list).
4. You may also have to update the data store from Spark Streaming as the application needs it, in which case you will also have to worry about versioning of the data that you pull in step 3.
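A hedged sketch of step 3 in Spark Streaming, with a lazy per-executor cache so each model is fetched from the grid at most once per JVM. ModelStore.fetch is a hypothetical placeholder standing in for whatever client your data grid (Redis, Cassandra, ...) provides, and the socket source and record format are made up:

import org.apache.spark.streaming.{Seconds, StreamingContext}

object ModelStore {
  // Placeholder for the real data-grid client call; returns a dummy model.
  def fetch(metricKey: String): Array[Double] = Array.fill(8)(0.0)
}

object ModelCache {
  // Per-JVM (i.e. per-executor) cache of models already pulled from the store.
  private val cache = scala.collection.concurrent.TrieMap.empty[String, Array[Double]]
  def get(metricKey: String): Array[Double] =
    cache.getOrElseUpdate(metricKey, ModelStore.fetch(metricKey))
}

val ssc = new StreamingContext(sc, Seconds(10))
val metrics = ssc.socketTextStream("ingest-host", 9999)   // placeholder source

metrics
  .map(line => line.split(",", 2) match { case Array(k, v) => (k, v.toDouble) })
  .foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach { case (metricKey, value) =>
        val model = ModelCache.get(metricKey)   // cached per executor
        // ... score `value` against `model`, write results back if needed ...
      }
    }
  }

ssc.start()
ssc.awaitTermination()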
Hopefully that made sense. It's hard to give any more specifics of the implementation without the exact computation model. Hope this helps!
