Spark Streaming and Kafka: one cluster or several standalone boxes? - apache-spark

I am about to decide how to use the Spark Streaming Kafka integration.
I have a Kafka topic (which I can split into several topics) queuing several tens of thousands of messages per minute. My Spark Streaming application ingests the messages, applies transformations, and then updates a UI.
Given that all failures are handled and the data is replicated in Kafka, what is the best option for implementing the Spark Streaming application in order to achieve the best possible performance and robustness?
1. One Kafka topic and one Spark cluster.
2. Several Kafka topics and several standalone Spark boxes (one machine running a standalone Spark cluster for each topic).
3. Several Kafka topics and one Spark cluster.
I am tempted to go for the second option, but I couldn't find anyone discussing such a solution.

An important element to consider in this case is the partitioning of the topic.
The parallelism level of your Kafka-Spark integration will be determined by the number of partitions of the topic. The direct Kafka model simplifies the consumption model by establishing a 1:1 mapping between the number of partitions of the topic and RDD partitions for the corresponding Spark job.
So, the recommended setup would be: one Kafka topic with n partitions (where n is tuned for your use case) and a Spark cluster with enough resources to process the data from those partitions in parallel.
Option #2 feels like trying to re-implement what Spark gives you out of the box: Spark gives you resilient distributed computing. Option #2 is trying to parallelize the payload over several machines and deal with failure by having independent executors. You get that with a single Spark cluster, with the benefit of improved resource usage and a single deployment.
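To make the relationship between n and cluster size concrete, here is a minimal back-of-the-envelope sketch. It assumes one Spark task per Kafka partition (the 1:1 mapping described above) and one core per task; the function names are hypothetical and not part of any Spark API.

```python
import math


def processing_waves(num_partitions: int, total_cores: int) -> int:
    """Sequential task waves needed to read one batch.

    Assumes one task per Kafka partition and one core per task.
    """
    if num_partitions < 1 or total_cores < 1:
        raise ValueError("partitions and cores must be positive")
    return math.ceil(num_partitions / total_cores)


def idle_cores(num_partitions: int, total_cores: int) -> int:
    """Cores the Kafka read stage cannot use because there are too few partitions."""
    return max(0, total_cores - num_partitions)


if __name__ == "__main__":
    # 12 partitions on 4 cores: the read stage runs in several waves.
    print(processing_waves(12, 4))
    # 4 partitions on 12 cores: some cores sit idle during the read stage.
    print(idle_cores(4, 12))
```

If the number of waves is greater than one, either add cores or accept longer batches; if cores are idle, more partitions would let the read stage use them.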

Option 1 is straightforward, simple, and probably more efficient. If it meets your requirements, that's the one to go for (and it honors the KISS principle).

Related

What is the best way to consume the same topic from many different kafka brokers with spark structured streaming?

I have a situation where my load is distributed across a few data centers (DCs). Each data center has its own Kafka broker and data processors that process the data only for that data center.
So, I'll have the brokers broker1-dc1, broker1-dc2, ..., broker1-dcn, and all brokers will have the same topics, e.g. DATA_TOPIC.
I want to consume the topic DATA_TOPIC from all my different brokers and persist this data in a single data lake table. I am doing it with Structured Streaming, but that isn't a requirement.
I don't have much experience with Spark, and what I want to know is the best way to do this. I'm considering two options:
1. Have different Spark jobs, each one consuming the data from a different data center and each with its own checkpoint location;
2. Have a single job that has a consumer (Kafka readStream) for each data center, and do a union of all the consumers.
Which of these options is better, or is there an even better option?
I don't know if this helps, but I'm planning to use an AWS architecture with EMR, S3, Glue, and Delta Lake or Iceberg as table formats.
Thanks
A Kafka client can only use one bootstrap.servers setting (that is, one cluster) at a time, so the single-job plan means defining N streaming DataFrames. That seems like a poor design choice, since one failing stream ideally shouldn't stop your whole application.
Instead, I'd suggest looking into using MirrorMaker2 to consolidate the topics into one Kafka cluster that you run your processing against, which should have the same effect as the union.
Your first option is somewhat similar, but it's a trade-off between managing N Spark applications along with their checkpoints, and managing N Kafka Connect processes that each serve a single purpose and can be run in one Connect cluster.
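For the "one job per data center" option, the moving parts are mostly configuration. The sketch below only builds the option dictionaries; `kafka.bootstrap.servers`, `subscribe`, `startingOffsets`, and `checkpointLocation` are real Structured Streaming option names, while the port numbers, the bucket path, and the helper function names are assumptions made up for illustration.

```python
def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for one Kafka source; each source points at exactly one cluster."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def per_datacenter_jobs(brokers: dict, topic: str, checkpoint_root: str) -> dict:
    """Build one job config per data center, each with its own checkpoint path."""
    root = checkpoint_root.rstrip("/")
    jobs = {}
    for dc, servers in sorted(brokers.items()):
        jobs[dc] = {
            "source": kafka_source_options(servers, topic),
            # A checkpoint location must never be shared between queries.
            "checkpointLocation": f"{root}/{topic}/{dc}",
        }
    return jobs


# Hypothetical broker addresses and bucket name.
brokers = {"dc1": "broker1-dc1:9092", "dc2": "broker1-dc2:9092"}
jobs = per_datacenter_jobs(brokers, "DATA_TOPIC", "s3://my-bucket/checkpoints/")

# In a real job each config would be used roughly like this (requires pyspark):
#   spark.readStream.format("kafka").options(**jobs["dc1"]["source"]).load()
```

With MirrorMaker2 consolidating the topics into one cluster, the same helper would be called once with a single bootstrap address, and there would be a single checkpoint to manage.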

Spark Streaming - Kafka Integration

We are using a small Spark cluster with 5 nodes, and all 5 nodes are connected to the Kafka brokers.
We are planning to scale the cluster by adding more nodes, which may require configuring the additional nodes to connect to the Kafka cluster. We are assessing the best practices for the integration:
How should it be integrated to make the integration as easy as possible?
Do all the worker nodes need to be connected to the brokers? In that case, it might not be scalable.
I would advise going over the documentation for the Spark and Kafka integration:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
"How should it be integrated to make the integration as easy as possible?":
I'm not sure what you mean, but basically when you connect to Kafka you provide the bootstrap servers: a list of host/port pairs used to establish the initial connection to the Kafka cluster.
These servers are only used for the initial connection, to discover the full cluster membership, so the number of nodes in the Kafka cluster will not change the way you integrate.
"Do all the worker nodes need to be connected to the brokers? In that case, it might not be scalable.":
The Spark integration works in the following way (sort of):
1. The Spark driver connects to Kafka to determine the required partitions and offsets.
2. Based on step 1, the partitions are assigned to the Spark "workers", usually 1:1 from a Kafka partition to a Spark partition.
3. Not all workers (I guess you mean executors) connect to all Kafka nodes, so in this respect it is also scalable.
Side note: you can use a configuration option to further split a single Kafka partition across several Spark partitions. It is called minPartitions and is available in Spark 2.4 and later.
Last note: Spark Streaming with Kafka is a very common and well-known use case and is used in very large data ecosystems, so as a first intuition I would assume it is scalable.
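To illustrate the idea behind the minPartitions side note, here is a small sketch that divides one Kafka partition's offset range into several contiguous sub-ranges, each of which could be read by its own Spark task. This is not Spark's actual splitting algorithm, only an assumption about what an even split might look like; the function name is hypothetical.

```python
def split_offset_range(start: int, end: int, pieces: int) -> list:
    """Split the half-open offset range [start, end) into contiguous sub-ranges.

    Returns at most `pieces` ranges, never producing an empty one.
    """
    total = end - start
    if total <= 0 or pieces < 1:
        return []
    pieces = min(pieces, total)
    base, extra = divmod(total, pieces)
    ranges, cursor = [], start
    for i in range(pieces):
        # The first `extra` ranges take one additional offset each.
        size = base + (1 if i < extra else 0)
        ranges.append((cursor, cursor + size))
        cursor += size
    return ranges


if __name__ == "__main__":
    # One Kafka partition holding offsets 0..9, read by three tasks.
    print(split_offset_range(0, 10, 3))
```

Each sub-range is still fetched from the broker that leads that Kafka partition, so splitting adds Spark-side parallelism without adding Kafka partitions.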
I came across the following passage while going through the book, https://learning.oreilly.com/library/view/stream-processing-with/9781491944233/ch19.html
In particular, the sentence "The driver does not send data to the executors; instead, it simply sends a few offsets they use to directly consume data." suggests that all the executors (worker nodes) must have a connection to Kafka, since a task might run on any executor:
The gist of data delivery is that the Spark driver queries offsets and decides offset ranges for every batch interval from Apache Kafka. After it receives those offsets, the driver dispatches them by launching a task for each partition, resulting in a 1:1 parallelism between the Kafka partitions and the Spark partitions at work. Each task retrieves data using its specific offset ranges. The driver does not send data to the executors; instead, it simply sends a few offsets they use to directly consume data. As a consequence, the parallelism of data ingestion from Apache Kafka is much better than the legacy receiver model, where each stream was consumed by a single machine.
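The driver-side planning described in the quoted passage can be sketched in a few lines. This is a simplified model and not Spark's implementation: the `OffsetRange` tuple below is a hypothetical stand-in, and the offset dictionaries are assumed to have been fetched from Kafka already.

```python
from typing import Dict, List, NamedTuple


class OffsetRange(NamedTuple):
    """What the driver hands to a task: where to read, not the data itself."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(topic: str,
               last_processed: Dict[int, int],
               latest: Dict[int, int]) -> List[OffsetRange]:
    """Create one offset range (one task) per Kafka partition for this batch."""
    ranges = []
    for partition in sorted(latest):
        # Partitions not seen before start at offset 0 in this simplified model.
        start = last_processed.get(partition, 0)
        ranges.append(OffsetRange(topic, partition, start, latest[partition]))
    return ranges


if __name__ == "__main__":
    for r in plan_batch("events", {0: 10, 1: 5}, {0: 15, 1: 5, 2: 3}):
        print(r)
```

Because each task only receives a small range description, any executor that runs it has to be able to reach the broker hosting that partition, which is the point raised in the comment above.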

When is a Kafka connector preferred over a Spark streaming solution?

With Spark Streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive, and Kudu. But this can also be done by using Kafka connectors for these tables. My question is: in which situations should I prefer connectors over the Spark Streaming solution?
Also, how fault tolerant is the Kafka connector solution? We know that with Spark Streaming we can use checkpoints and executors running on multiple nodes for fault-tolerant execution, but how is fault tolerance (if possible) achieved with Kafka connectors? By running the connector on multiple nodes?
So, generally, there should be no big difference in functionality when it comes to simply reading records from Kafka and sending them into other services.
Kafka Connect is probably easier when it comes to standard tasks since it offers various connectors out of the box, so it will quite probably reduce the need to write any code. So, if you just want to copy a bunch of records from Kafka to HDFS or Hive, it will probably be easier and faster to do with Kafka Connect.
That said, Spark Streaming clearly wins when you need to do things that are not standard. For example, if you want to perform some aggregations or calculations over records and write the results to Hive, you should probably go for Spark Streaming from the beginning.
Generally, I found doing non-standard things with Kafka Connect, for example splitting one message into multiple ones (assuming it was, say, a JSON array), to be quite troublesome, often requiring much more work than it would in Spark.
As for Kafka Connect fault tolerance, as described in the docs, it is achieved by running multiple distributed workers with the same group.id; the workers redistribute tasks and connectors if one of them fails.
in which situations I should prefer connectors over the Spark streaming solution.
"It Depends" :-)
Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run
If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration)
As a framework, Kafka Connect is more transferable since the concepts are the same; you just plug in the appropriate connector for the technology that you want to integrate with each time
Kafka Connect handles all the tricky stuff for you, like schemas, offsets, restarts, scale-out, etc.
Kafka Connect supports Single Message Transforms for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc.). For more advanced processing you would use something like Kafka Streams or ksqlDB.
If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)
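To give a feel for the "just JSON" point, here is a sketch of a connector definition with one Single Message Transform, built as a Python dictionary and serialized to the JSON that would be submitted to a Connect worker's REST API. `FileStreamSinkConnector` and the `MaskField` transform ship with Apache Kafka; the connector name, topic, file path, and masked field are made-up assumptions.

```python
import json

# Hypothetical connector definition: copy a topic to a local file,
# masking one field on the way through.
connector = {
    "name": "example-file-sink",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        "tasks.max": "1",
        "topics": "my-topic",
        "file": "/tmp/my-topic.out",
        # Single Message Transform: blank out the (hypothetical) "password" field.
        "transforms": "mask",
        "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
        "transforms.mask.fields": "password",
    },
}

# This JSON document is what would be sent in a POST to /connectors.
payload = json.dumps(connector, indent=2)

if __name__ == "__main__":
    print(payload)
```

Swapping the sink for another system would mostly mean changing `connector.class` and its connector-specific settings, which is the transferability argument made in the list above.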
Also how tolerant is the Kafka connector solution? … how is fault tolerance (if possible) achieved with Kafka connectors?
Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker in, Kafka Connect will rebalance to ensure workload distribution. This was drastically improved in Apache Kafka 2.3 (KIP-415)
Kafka Connect uses the Kafka consumer API and tracks offsets of records delivered to a target system in Kafka itself. If the task or worker fails you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch, etc)
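The rebalancing behaviour can be pictured with a toy model. The sketch below simply redistributes tasks round-robin over whichever workers are alive; it is deliberately naive and is not how Kafka Connect decides assignments (the incremental cooperative protocol from KIP-415 tries to move as few tasks as possible). Worker and task names are hypothetical.

```python
def assign_tasks(tasks: list, workers: set) -> dict:
    """Spread tasks round-robin over the live workers (sorted for determinism)."""
    if not workers:
        raise ValueError("no live workers")
    ordered = sorted(workers)
    assignment = {worker: [] for worker in ordered}
    for i, task in enumerate(tasks):
        assignment[ordered[i % len(ordered)]].append(task)
    return assignment


tasks = [f"sink-task-{i}" for i in range(4)]

# Three workers share the four tasks.
before = assign_tasks(tasks, {"worker-a", "worker-b", "worker-c"})

# worker-b fails: its task is picked up by the remaining workers.
after = assign_tasks(tasks, {"worker-a", "worker-c"})

if __name__ == "__main__":
    print(before)
    print(after)
```

The property that matters is that every task still has an owner after a worker disappears; since offsets are stored in Kafka, the new owner can resume from where the failed one left off.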
If you want to learn more about Kafka Connect see the docs here and my talk here. See a list of connectors here, and tutorial videos here.
Disclaimer: I work for Confluent and am a big fan of Kafka Connect :-)

Kafka Spark Streaming ingestion for multiple topics

We are currently ingesting Kafka messages into HDFS using Spark Streaming. So far we spawn a whole Spark job for each topic.
Since messages are produced pretty rarely for some topics (average of 1 per day), we're thinking about organising the ingestion in pools.
The idea is to avoid creating a whole container (and related resources) for these "infrequent" topics. In fact, Spark Streaming accepts a list of topics as input, so we're thinking about using this feature in order to have a single job consuming all of them.
Do you think this is a good strategy? We also thought about batch ingestion, but we would like to keep real-time behavior, so we excluded this option. Do you have any tips or suggestions?
Does Spark Streaming handle multiple topics as a source well in case of failures, in terms of offset consistency etc.?
Thanks!
I think Spark should be able to handle multiple topics fine, as it has supported this for a long time. And yes, Kafka Connect is not a Confluent API. Confluent does provide connectors for its own platform, but you can use Kafka Connect without it too. You can see that Apache Kafka also has documentation for the Connect API.
It is a little more difficult with the plain Apache version of Kafka, but you can use it.
https://kafka.apache.org/documentation/#connectapi
Also, if you opt for multiple Kafka topics in a single Spark Streaming job, you may need to take care not to create lots of small files, since your message frequency seems very low.
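The pooling idea from the question can be sketched as a small planning function: busy topics keep a dedicated job, and everything below a threshold shares one job. The threshold, the topic names, and the function name are assumptions for illustration. The comma-separated `subscribe` value is the format the Structured Streaming Kafka source accepts; with the older DStream API you would pass the topic list itself.

```python
def plan_ingestion_jobs(daily_rates: dict, busy_threshold: int) -> list:
    """Return one source config per job: dedicated jobs first, then one pooled job."""
    dedicated = sorted(t for t, rate in daily_rates.items() if rate >= busy_threshold)
    pooled = sorted(t for t, rate in daily_rates.items() if rate < busy_threshold)
    jobs = [{"subscribe": topic} for topic in dedicated]
    if pooled:
        # All infrequent topics share a single job (and a single container).
        jobs.append({"subscribe": ",".join(pooled)})
    return jobs


if __name__ == "__main__":
    # Hypothetical topics with their average messages per day.
    rates = {"orders": 100_000, "audit": 1, "alerts": 3}
    for job in plan_ingestion_jobs(rates, busy_threshold=1_000):
        print(job)
```

For the pooled job, a longer trigger interval or a periodic compaction step is one way to limit the small-files problem mentioned above.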

Spark streaming + Kafka vs Just Kafka

Why and when would one choose to use Spark Streaming with Kafka?
Suppose I have a system receiving a thousand messages per second through Kafka. I need to apply some real-time analytics to these messages and store the result in a DB.
I have two options:
1. Create my own worker that reads messages from Kafka, runs the analytics algorithm, and stores the result in the DB. In the Docker era it is easy to scale this worker across my entire cluster with just a scale command. I just need to make sure the number of partitions is equal to or greater than the number of workers, and then all is good and I have true concurrency.
2. Create a Spark cluster with Kafka streaming input. Let the Spark cluster do the analytics computations and then store the result.
Is there any case where the second option is the better choice? It sounds to me like it is just extra overhead.
In a Docker era it is easy to scale this worker through my entire cluster
If you already have that infrastructure available, then great, use that. Bundle your Kafka libraries in some minimal container with health checks, and what not, and for the most part, that works fine. Adding a Kafka client dependency + a database dependency is all you really need, right?
If you're not using Spark, Flink, etc, you will need to handle Kafka errors, retries, offset and commit handling more closely to your code rather than letting the framework handle those for you.
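As a rough picture of what "handling retries, offsets, and commits yourself" means, here is a toy consumer loop over an in-memory list that stands in for one Kafka partition. It is not a real Kafka client: the list index plays the role of the offset, and the "commit" is simply advancing that index after the handler succeeds, which gives at-least-once processing. All names are hypothetical.

```python
def consume(log: list, committed_offset: int, handler, max_retries: int = 3) -> int:
    """Process records starting at committed_offset; return the new committed offset.

    The offset only advances after the handler succeeds, so a restart
    re-processes the failed record (at-least-once semantics).
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    offset = committed_offset
    while offset < len(log):
        for attempt in range(1, max_retries + 1):
            try:
                handler(log[offset])
                break  # success: fall through to the commit below
            except Exception:
                if attempt == max_retries:
                    # Give up for now; the caller restarts from this offset.
                    return offset
        offset += 1  # "commit" only after successful processing
    return offset


if __name__ == "__main__":
    processed = []
    new_offset = consume(["a", "b", "c"], 0, processed.append)
    print(new_offset, processed)
```

A real worker would also need to cope with consumer group rebalances, poison messages, and making the database write idempotent, which is the kind of plumbing a framework takes off your hands.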
I'll add here that if you want Kafka + database interactions, check out the Kafka Connect API. There are existing solutions for JDBC, Mongo, Couchbase, Cassandra, etc. already.
If you need more complete processing power, I'd go for Kafka Streams rather than needing to separately maintain a Spark cluster, and so that's "just Kafka"
Create a Spark cluster
Let's assume you don't want to maintain that, or rather you aren't able to pick between YARN, Mesos, Kubernetes, or Standalone. And if you are running the first three, it might be worth looking at running Docker on those anyway.
You're exactly right that it is extra overhead, so I find it all comes down to what you have available (for example, an existing Hadoop / YARN cluster with idle memory resources), or what you're willing to support internally (or pay for as vendor services, e.g. Kafka & Databricks in some hosted solution).
Plus, Spark isn't running the latest Kafka client library (it was only updated to Kafka 2.0 as of Spark 2.4.0, I believe), so you'll need to determine whether that matters to you.
For actual streaming libraries, rather than Spark batches, Apache Beam or Flink would probably let you do the same types of workloads against Kafka
In general, in order to scale a producer / consumer, you need some form of resource scheduler. Installing Spark may not be difficult for some, but knowing how to use it efficiently and tune it for appropriate resources can be.

Resources