Data receiving in Spark Streaming - apache-spark

Recently I've been doing performance tests on Spark Streaming, but a few things puzzle me.
In Spark Streaming, receivers are scheduled to run in executors on worker nodes.
How many receivers are there in a cluster? Can I control the number of receivers?
If not all workers run a receiver, will the other worker nodes receive any data at all? In that case, how can I guarantee task scheduling based on data locality? By copying data from the nodes that run the receivers?

There is only one receiver per DStream, but you can create more than one DStream and union them together so they act as one. This is why it is suggested to run Spark Streaming on a cluster with at least N(receivers) + 1 cores. Once the data is past the receiving stage, it is mostly a simple Spark application and follows the same rules as a batch job. (This is why streaming is referred to as micro-batching.)
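A minimal sketch of the union approach, assuming a socket-based source; the host name, ports and receiver count are placeholders, not something from the question:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The cluster needs at least numReceivers + 1 cores: one per receiver, plus cores for processing.
    val conf = new SparkConf().setAppName("multi-receiver-wordcount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // One receiver per input DStream: create several and union them into a single logical stream.
    val numReceivers = 3
    val streams = (1 to numReceivers).map(i => ssc.socketTextStream("stream-host", 9000 + i))
    val unified  = ssc.union(streams)

    unified.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()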

Related

Spark Streaming - Kafka Integration

We are using a small Spark cluster with 5 nodes, and all 5 nodes are connected to the Kafka brokers.
We are planning to scale the cluster by adding more nodes, which may require configuring the additional nodes to connect to the Kafka cluster. We are assessing the best practices for this integration:
How should it actually be integrated to make the integration as easy as possible?
Is it necessary for all the worker nodes to be connected to the brokers? In that case, might it not be scalable?
I would advise going over the Spark documentation on Kafka integration:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"How should it actually be integrated to make the integration as easy as possible?":
I'm not sure what you mean, but basically when you connect to Kafka you provide the bootstrap servers: a list of host/port pairs used to establish the initial connection to the Kafka cluster.
These servers are only used for the initial connection, to discover the full cluster membership, so the number of nodes in the Kafka cluster does not change the way you integrate.
"Is it necessary for all the worker nodes to be connected to the brokers? In that case, might it not be scalable?":
Spark's Kafka integration works roughly as follows:
the Spark driver connects to Kafka to determine the required partitions and offsets;
based on that, the partitions are assigned to the Spark "workers", usually as a 1:1 mapping from a Kafka partition to a Spark partition.
Not every worker (I guess you mean executor) connects to every Kafka node, so in this respect it is also scalable.
Side note: you can use a configuration option to further split the data read from a single Kafka partition into more Spark partitions; it is called minPartitions and is available from Spark 2.4 (see the sketch at the end of this answer).
Last note: Spark Streaming with Kafka is a very common and well-known use case, used in very large data ecosystems, so as a first intuition I would assume it scales.
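A minimal sketch of those settings with the Structured Streaming Kafka source; the broker addresses, topic name and minPartitions value are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-ingest").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // initial contact points only
      .option("subscribe", "events")                                  // hypothetical topic name
      .option("minPartitions", "20")                                  // optional: split Kafka partitions further
      .load()

    // Kafka records arrive as binary key/value columns plus metadata (topic, partition, offset, timestamp)
    val values = df.selectExpr("CAST(value AS STRING) AS value")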
I came across the following passage while going through the book https://learning.oreilly.com/library/view/stream-processing-with/9781491944233/ch19.html, particularly the sentence "The driver does not send data to the executors; instead, it simply sends a few offsets they use to directly consume data." It seems that all the executors (worker nodes) have to have a connection to Kafka, as the tasks might run on any executor.
The gist of data delivery is that the Spark driver queries offsets and decides offset ranges for every batch interval from Apache Kafka. After it receives those offsets, the driver dispatches them by launching a task for each partition, resulting in a 1:1 parallelism between the Kafka partitions and the Spark partitions at work. Each task retrieves data using its specific offset ranges. The driver does not send data to the executors; instead, it simply sends a few offsets they use to directly consume data. As a consequence, the parallelism of data ingestion from Apache Kafka is much better than the legacy receiver model, where each stream was consumed by a single machine.
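For reference, a minimal sketch of the direct (receiver-less) DStream API the passage describes; the broker addresses, group id and topic are placeholders, and a StreamingContext ssc is assumed:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // The driver only plans offset ranges; each task fetches its own range directly from Kafka,
    // giving one Spark partition per Kafka partition.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)
    )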

Execution order of windows in spark streaming

Background
I've got a Spark Streaming application that reads data from Kinesis -> does windowing on it -> saves the data to an external system (via foreachRDD).
Recently I've observed that my windows are consumed by foreachRDD one by one. This means that if I have a sudden burst of data in my app (so that foreachRDD for a window takes a long time), the windows stack up in a queue before being processed (while most of the machines in my cluster are idle).
Question
Is it a semantic of Spark Streaming that windows are processed one by one? If yes, is there any way to do the "windowing" operation in parallel in Spark, so that windows are consumed by foreachRDD at the same time?
Find out how many shards your Kinesis stream has and create that many receivers by invoking createStream, defined in the KinesisUtils Scala class (a sketch follows after the quoted passage below).
This blog post from Amazon explains it well:
Every input DStream is associated with a receiver, and in this case also with a KCL worker. The best way to understand this is to refer to the method createStream defined in the KinesisUtils Scala class. Every call to KinesisUtils.createStream instantiates a Spark Streaming receiver and a KCL worker process on a Spark executor. The first time a KCL worker is created, it connects to the Amazon Kinesis stream and instantiates a record processor for every shard that it manages. For every subsequent call, a new KCL worker is created and the record processors are re-balanced among all available KCL workers. The KCL workers pull data from the shards and route it to the receiver, which in turn stores it in the associated DStream.
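A minimal sketch of that approach with the spark-streaming-kinesis-asl KinesisUtils API the quote refers to; the application name, stream name, region and shard count are placeholders:

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val batchInterval = Seconds(10)
    val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-windows"), batchInterval)

    // One receiver (and one KCL worker) per shard, then union them so the downstream
    // windowing and foreachRDD see a single DStream.
    val numShards = 4                                   // match your stream's shard count
    val shardStreams = (1 to numShards).map { _ =>
      KinesisUtils.createStream(
        ssc, "my-kinesis-app", "my-stream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, batchInterval, StorageLevel.MEMORY_AND_DISK_2)
    }
    val unified = ssc.union(shardStreams)

The unified DStream can then be windowed and passed to foreachRDD as before.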

Spark Streaming and Kafka: one cluster or several standalone boxes?

I am about to make a decision on using the Spark Streaming Kafka integration.
I have a Kafka topic (which I can break into several topics) queuing tens of thousands of messages per minute; my Spark Streaming application ingests the messages, applies transformations, and then updates a UI.
Knowing that all failures are handled and data is replicated in Kafka, which of these options is best for implementing the Spark Streaming application in order to achieve the best possible performance and robustness:
One Kafka topic and one Spark cluster.
Several Kafka topics and several stand-alone Spark boxes (one machine with a stand-alone Spark cluster for each topic).
Several Kafka topics and one Spark cluster.
I am tempted to go for the second option, but I couldn't find people talking about such a solution.
An important element to consider in this case is the partitioning of the topic.
The parallelism of your Kafka-Spark integration is determined by the number of partitions of the topic. The direct Kafka model simplifies the consumption model by establishing a 1:1 mapping between the partitions of the topic and the RDD partitions of the corresponding Spark job.
So the recommended setup is one Kafka topic with n partitions (where n is tuned for your use case) and a Spark cluster with enough resources to process the data from those partitions in parallel.
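As a rough way to see that 1:1 mapping in practice, a sketch assuming a direct Kafka DStream named stream, like the one shown earlier on this page:

    // Each micro-batch RDD should have one Spark partition per Kafka partition
    // of the subscribed topic, unless you repartition explicitly.
    stream.foreachRDD { rdd =>
      println(s"Spark partitions in this micro-batch: ${rdd.getNumPartitions}")
    }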
Option #2 feels like trying to re-implement what Spark gives you out of the box: Spark gives you resilient distributed computing. Option #2 is trying to parallelize the payload over several machines and deal with failure by having independent executors. You get that with a single Spark cluster, with the benefit of improved resource usage and a single deployment.
Option 1 is straightforward, simple and probably more efficient. If it meets your requirements, that's the one to go for (and it honors the KISS principle).

Data locality in Spark Streaming

Recently I've been doing performance tests on Spark Streaming. I ran a receiver on one of the 6 slaves and submitted a simple word count application to the cluster (I know this configuration is not proper in practice; it is just a simple test). I analyzed the scheduling log and found that nearly 88% of the tasks were scheduled to the node the receiver ran on, the locality was always PROCESS_LOCAL, and the CPU utilization on that node was very high. Why doesn't Spark Streaming distribute the data across the cluster and make full use of it? I've read the official guide and it does not explain this in detail, especially for Spark Streaming. Will it copy stream data to another node with a free CPU and start a new task there when a task lands on a node with a busy CPU? If so, how can we explain the former case?
When you run the stream receiver on just one of the 6 nodes, all the received data is processed on that node (that is the data locality).
Data is not distributed across the other nodes by default. If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
This distributes the received batches of data across the specified number of machines in the cluster before further processing.
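For example (a sketch assuming a single socket receiver and a word count job; the host, port and partition count are placeholders):

    // One receiver on one node; repartition spreads each received batch across the
    // cluster before the CPU-heavy word count runs.
    val lines = ssc.socketTextStream("receiver-host", 9999)
    val counts = lines
      .repartition(12)                       // e.g. roughly 2 partitions per worker node
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()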
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning

Where is the Apache Spark reduceByWindow function executed?

I am trying to learn Apache Spark and I can't understand from the documentation how window operations work.
I have two worker nodes and I use the Kafka Spark utilities to create a DStream from a topic.
On this DStream I apply a map function and a reduceByWindow.
I can't understand whether reduceByWindow is executed on each worker or in the driver.
I have searched on Google without any result.
Can someone explain this to me?
Both receiving and processing of data happen on the worker nodes. The driver creates receivers (which run on worker nodes) that are responsible for data collection, and periodically starts jobs to process the collected data. Everything else is pretty much standard RDDs and normal Spark jobs.
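A minimal sketch of this, assuming a direct Kafka DStream named stream (as sketched earlier on this page) whose records are ConsumerRecord[String, String]; the window and slide durations are placeholders:

    import org.apache.spark.streaming.Seconds

    // Both the map and the windowed reduction run as tasks on the executors;
    // the driver only schedules a job for each batch/window.
    val msgSizes = stream.map(record => record.value().length.toLong)
    val bytesPerWindow = msgSizes.reduceByWindow(_ + _, Seconds(60), Seconds(10))
    bytesPerWindow.print()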
