Spark Streaming - Kafka Integration

We are using a small Spark cluster with 5 nodes, and all 5 nodes are connected to the Kafka brokers.
We are planning to scale the cluster by adding more nodes, which may require configuring these additional nodes to connect to the Kafka cluster. We are assessing integration best practices:
How should the integration be done to keep it as simple as possible?
Do all worker nodes need to be connected to the brokers? If so, wouldn't that limit scalability?

I would advise going over the documentation on Spark's Kafka integration:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"How should the integration be done to keep it as simple as possible?":
I'm not sure what you mean, but basically when you connect to Kafka you should provide the bootstrap servers: a list of host/port pairs used for establishing the initial connection to the Kafka cluster.
These servers are only used for the initial connection, to discover the full cluster membership, so the number of nodes in the Kafka cluster will not change the way you integrate.
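For illustration, a minimal Structured Streaming source sketch (broker addresses and the topic name are placeholders; assumes the spark-sql-kafka-0-10 package is on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("KafkaIngest").getOrCreate()

// Only a few initial contact points are listed here; the client discovers
// the rest of the cluster from them. Broker addresses and the topic name
// ("events") are placeholders.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```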
"Is it needed for all the workers node to be connected with the brokers , in that case , it might not be scalable ?" :
spark integration works in the following way (sort of):
the sprak driver - connects to the kafka to understand the required partitions and offsets
based on part 1 the partitions are assigned to the spark "workers" - which is usually a 1 to 1 from a kafka partition to a spark partition.
not all workers (I guess you mean executors) connect to all kafka nodes - so in this case it is also scalable
Side note: there is a configuration option to further split the work so that more than one Spark partition reads from a single Kafka partition; it is called minPartitions and is available from Spark 2.4.7.
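As a sketch, on the same kind of source it is just one extra option (all values are placeholders):

```scala
// Same Structured Streaming source as above; minPartitions asks Spark to
// split the topic's data into at least 24 Spark partitions, even if the
// topic itself has fewer Kafka partitions.
val finerGrained = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribe", "events")
  .option("minPartitions", "24")
  .load()
```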
Last note: Spark streaming with Kafka is a very common, well-known use case, used in very large data ecosystems, so as a first intuition I would assume it is scalable.

I came across the following passage while going through this book: https://learning.oreilly.com/library/view/stream-processing-with/9781491944233/ch19.html
In particular, the sentence "The driver does not send data to the executors; instead, it simply sends a few offsets they use to directly consume data." suggests that all the executors (worker nodes) have to have a connection to Kafka, since tasks might run on any executor:
The gist of data delivery is that the Spark driver queries offsets and decides offset ranges for every batch interval from Apache Kafka. After it receives those offsets, the driver dispatches them by launching a task for each partition, resulting in a 1:1 parallelism between the Kafka partitions and the Spark partitions at work. Each task retrieves data using its specific offset ranges. The driver does not send data to the executors; instead, it simply sends a few offsets they use to directly consume data. As a consequence, the parallelism of data ingestion from Apache Kafka is much better than the legacy receiver model, where each stream was consumed by a single machine.

Related

When is a Kafka connector preferred over a Spark streaming solution?

With Spark streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive, and Kudu. But this can also be done using Kafka connectors for these tables. My question is: in which situations should I prefer connectors over the Spark streaming solution?
Also, how fault tolerant is the Kafka connector solution? We know that with Spark streaming we can use checkpoints and executors running on multiple nodes for fault-tolerant execution, but how is fault tolerance (if possible) achieved with Kafka connectors? By running the connector on multiple nodes?
So, generally, there should be no big difference in functionality when it comes to simply reading records from Kafka and sending them into other services.
Kafka Connect is probably easier when it comes to standard tasks, since it offers various connectors out of the box, which will quite probably reduce the need to write any code. So, if you just want to copy a bunch of records from Kafka to HDFS or Hive, it will probably be easier and faster to do with Kafka Connect.
With this in mind, Spark Streaming takes over when you need to do things that are not standard, i.e. if you want to perform some aggregations or calculations over records and write them to Hive, then you should probably go for Spark Streaming from the beginning.
Generally, I have found doing non-standard things with Kafka Connect, like splitting one message into multiple ones (assuming it was, for example, a JSON array), to be quite troublesome; it often requires much more work than it would in Spark, as sketched below.
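For example, splitting a JSON-array message into one row per element is a few lines in Spark. A hedged sketch (topic, broker, and the element schema are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

val spark = SparkSession.builder.appName("SplitJsonArray").getOrCreate()

// Hypothetical schema for the elements inside each JSON-array message.
val elementSchema = new StructType().add("id", StringType).add("payload", StringType)

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .load()

// Parse each record's value as a JSON array and emit one row per element.
val split = kafkaDf
  .select(from_json(col("value").cast("string"), ArrayType(elementSchema)).as("items"))
  .select(explode(col("items")).as("item"))
  .select(col("item.id"), col("item.payload"))
```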
As for Kafka Connect fault tolerance, as described in the docs, it is achieved by running multiple distributed workers with the same group.id; the workers redistribute tasks and connectors if one of them fails.
in which situations should I prefer connectors over the Spark streaming solution?
"It Depends" :-)
Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run
If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration)
As a framework, Kafka Connect is more transferable, since the concepts are the same; you just plug in the appropriate connector for the technology you want to integrate with each time
Kafka Connect handles all the tricky stuff for you: schemas, offsets, restarts, scale-out, etc.
Kafka Connect supports Single Message Transforms for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc.). For more advanced processing you would use something like Kafka Streams or ksqlDB.
If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)
Also, how fault tolerant is the Kafka connector solution? … how is fault tolerance (if possible) achieved with Kafka connectors?
Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker in, Kafka Connect will rebalance to ensure workload distribution. This was drastically improved in Apache Kafka 2.3 (KIP-415)
Kafka Connect uses the Kafka consumer API and tracks offsets of records delivered to a target system in Kafka itself. If the task or worker fails you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch, etc)
If you want to learn more about Kafka Connect see the docs here and my talk here. See a list of connectors here, and tutorial videos here.
Disclaimer: I work for Confluent and a big fan of Kafka Connect :-)

Scaling with Apache Spark/Apache Flink

I plan an application that reads from Apache Kafka and after (potentially time-consuming) processing saves data to a database.
My case is messages, not streams, but for scalability I'm thinking about plugging this into Spark or Flink; however, I can't grasp how these scale: should my app, when running as part of Spark/Flink, read some data from Kafka and then exit, or keep reading continuously?
How will Spark/Flink then decide that they must spawn more instances of my app to improve throughput?
Thanks!
In Apache Flink you can define the parallelism of operations by setting env.setParallelism(#parallelism), which makes all operators run with #parallelism parallel instances; you can even define/override it per operator, e.g. dataStream.map(...).setParallelism(#parallelism).
For more info Check Flink docs https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/parallel.html.
Regarding reading from Kafka, you can define parallel receivers (same consumer group) to scale up/down with the Kafka topic partitions: env.addSource(kafkaConsumer).setParallelism(#topicPartitions)
Check Kafka documentation for more info about Kafka topic and partitions and consumer group : https://kafka.apache.org/documentation/.
Note that if you don't specify the parallelism level inside the Flink program and you deploy it on a local Flink cluster, the value of the parallelism.default parameter in the config file flinkDir/conf/flink-conf.yaml will be used, unless you specify it with -p, as in ./bin/flink run ... -p #parallelism. Check the Flink CLI options.
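To make that concrete, a minimal sketch (broker address, topic, and group id are placeholders; the Kafka connector class name varies across Flink versions):

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object KafkaParallelismSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // default parallelism for all operators

    val props = new Properties()
    props.setProperty("bootstrap.servers", "broker1:9092") // placeholder
    props.setProperty("group.id", "my-consumer-group")     // placeholder

    // Source parallelism chosen to match the number of topic partitions.
    val source = env
      .addSource(new FlinkKafkaConsumer[String]("events", new SimpleStringSchema(), props))
      .setParallelism(8)

    source.map(_.toUpperCase).print() // stand-in for the real processing
    env.execute("kafka-parallelism-sketch")
  }
}
```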

Spark streaming + Kafka vs Just Kafka

Why and when one would choose to use Spark streaming with Kafka?
Suppose I have a system getting thousand messages per seconds through Kafka. I need to apply some real time analytics on these messages and store the result in a DB.
I have two options:
Create my own worker that reads messages from Kafka, runs the analytics algorithm, and stores the result in the DB. In the Docker era it is easy to scale this worker across my entire cluster with just a scale command. I just need to make sure I have an equal or greater number of partitions than workers, and all is good: I have true concurrency.
Create a Spark cluster with a Kafka streaming input. Let the Spark cluster do the analytics computations and then store the result.
Is there any case when the second option is a better choice? Sounds to me like it is just an extra overhead.
In a Docker era it is easy to scale this worker through my entire cluster
If you already have that infrastructure available, then great, use that. Bundle your Kafka libraries in some minimal container with health checks, and what not, and for the most part, that works fine. Adding a Kafka client dependency + a database dependency is all you really need, right?
If you're not using Spark, Flink, etc., you will need to handle Kafka errors, retries, and offset/commit handling closer to your code, rather than letting the framework handle those for you.
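For a sense of what "closer to your code" means, here is a minimal hand-rolled consumer loop sketch (broker, topic, group id, and the saveToDatabase helper are hypothetical):

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

// Hypothetical stand-in for the real analytics + database write.
def saveToDatabase(value: String): Unit = println(s"persist: $value")

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // placeholder
props.put("group.id", "analytics-workers")     // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("enable.auto.commit", "false") // commit only after the write succeeds

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events")) // placeholder topic
while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.forEach(r => saveToDatabase(r.value()))
  consumer.commitSync() // manual offset handling: the work a framework would do for you
}
```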
I'll add in here that if you want Kafka + database interactions, check out the Kafka Connect API. There are existing solutions for JDBC, Mongo, Couchbase, Cassandra, etc. already.
If you need more involved processing, I'd go for Kafka Streams rather than separately maintaining a Spark cluster, so that's "just Kafka".
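As a sketch of the "just Kafka" route with Kafka Streams (topic names and the transformation are placeholders; exact Scala-API import paths vary slightly across Kafka versions):

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "analytics-app")   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // placeholder

val builder = new StreamsBuilder()
builder.stream[String, String]("input-topic")
  .mapValues(_.toUpperCase) // stand-in for the real analytics step
  .to("output-topic")

val streams = new KafkaStreams(builder.build(), props)
streams.start() // scale out by starting more instances with the same application.id
```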
Create a Spark cluster
Let's assume you don't want to maintain that, or rather you aren't able to pick between YARN, Mesos, Kubernetes, or Standalone. And if you are running the first three, it might be worth looking at running Docker on those anyway.
You're exactly right that it is extra overhead, so I find it comes down to what you have available (for example, an existing Hadoop/YARN cluster with idle memory resources) or what you're willing to support internally (or pay a vendor for, e.g. Kafka and Databricks in some hosted solution).
Plus, Spark doesn't ship the latest Kafka client library (Spark 2.4.0 updated to the Kafka 2.0 client, I believe), so you'll need to determine whether that's a selling point.
For actual streaming libraries, rather than Spark batches, Apache Beam or Flink would probably let you do the same types of workloads against Kafka
In general, in order to scale a producer/consumer, you need some form of resource scheduler. Installing Spark may not be difficult for some, but knowing how to use it efficiently and tune it for appropriate resources can be.

Spark Streaming and Kafka: one cluster or several standalone boxes?

I am about to make a decision on using the Spark Streaming Kafka integration.
I have a Kafka topic (which I can break into several topics) queuing several tens of thousands of messages per minute; my Spark streaming application ingests the messages, applies transformations, and then updates a UI.
Knowing that all failures are handled and data is replicated in Kafka, what is the best option for implementing the Spark Streaming application in order to achieve the best possible performance and robustness:
One Kafka topic and one Spark cluster.
Several Kafka topics and several standalone Spark boxes (one machine with a standalone Spark cluster for each topic)
Several Kafka topics and one Spark cluster.
I am tempted to go for the second option, but I couldn't find people talking about such a solution.
An important element to consider in this case is the partitioning of the topic.
The parallelism level of your Kafka-Spark integration will be determined by the number of partitions of the topic. The direct Kafka model simplifies the consumption model by establishing a 1:1 mapping between the number of partitions of the topic and RDD partitions for the corresponding Spark job.
So, the recommended setup would be: one Kafka topic with n partitions (where n is tuned for your use case) and a Spark cluster with enough resources to process the data from those partitions in parallel.
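As a hedged illustration of that direct 1:1 model (broker address, topic, and group id are placeholders; this uses the spark-streaming-kafka-0-10 DStream API):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("UiFeed") // placeholder app name
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092", // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "ui-stream" // placeholder
)

// Direct model: each of the topic's n partitions becomes one RDD partition.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.map(_.value()).print() // stand-in for the real transformations
ssc.start()
ssc.awaitTermination()
```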
Option #2 feels like trying to re-implement what Spark gives you out of the box: Spark gives you resilient distributed computing. Option #2 is trying to parallelize the payload over several machines and deal with failure by having independent executors. You get that with a single Spark cluster, with the benefit of improved resource usage and a single deployment.
Option #1 is straightforward, simple, and probably more efficient. If your requirements are met, that's the one to go for (and honor the KISS principle).

Data locality in Spark Streaming

Recently I've been doing performance tests on Spark Streaming. I ran a receiver on one of the 6 slaves and submitted a simple word count application to the cluster (I know this configuration is not proper in practice, it's just a simple test). I analyzed the scheduling log and found that nearly 88% of tasks were scheduled to the node where the receiver ran, the locality was always PROCESS_LOCAL, and the CPU utilization was very high. Why does Spark Streaming not distribute data across the cluster and make full use of it? I've read the official guide and it does not explain this in detail, especially for Spark Streaming. Will it copy stream data to another node with a free CPU and start a new task there when a task is on a node with a busy CPU? If so, how can we explain the former case?
When you run the stream receiver on just one of the 6 nodes, all the received data is processed on that node (that is the data locality).
Data is not distributed across the other nodes by default. If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
This distributes the received batches of data across the specified number of machines in the cluster before further processing.
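For instance, a minimal word count sketch with rebalancing (the socket source and partition count are placeholder choices):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("BalancedWordCount")
val ssc = new StreamingContext(conf, Seconds(2))

// A single receiver lands all blocks on one executor...
val inputStream = ssc.socketTextStream("localhost", 9999) // placeholder source
// ...so spread them over 12 partitions across the cluster before the heavy work.
val counts = inputStream
  .repartition(12)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()
```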
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
