Spark Structured Streaming Kafka Integration - Streaming Query - apache-spark

I'm working on an application that connects to a Kafka source, and on that same source I want to create multiple streaming queries with different filter conditions. Each query would process its own business logic and write the result to HBase.
I'm trying to solve some race conditions in the business logic, and want to understand how the internals of Spark Structured Streaming work while reading from Kafka.
1) How many Kafka consumers would be created across the application? Is that related to the number of partitions on the topic, or to the number of executors running the application?
2) Would each streaming query write to the same unbounded table, with one Kafka consumer per query?
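A minimal sketch (Scala) of the setup in question: one Kafka source DataFrame and two streaming queries with different filters. The broker address, topic name, filter conditions, checkpoint paths, and console sink are placeholders; the real job would write to HBase.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("multi-query-demo").getOrCreate()

// One Kafka source DataFrame; every streaming query started from it is planned
// and executed independently (each query reads the topic on its own).
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "events")                      // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Query 1: one filter condition, its own checkpoint location
val q1 = kafkaDf.filter(col("value").contains("typeA"))
  .writeStream
  .option("checkpointLocation", "/tmp/chk/q1")        // placeholder path
  .format("console")                                  // placeholder sink (real job: HBase)
  .start()

// Query 2: a different filter condition, a separate checkpoint location
val q2 = kafkaDf.filter(col("value").contains("typeB"))
  .writeStream
  .option("checkpointLocation", "/tmp/chk/q2")        // placeholder path
  .format("console")
  .start()

spark.streams.awaitAnyTermination()
```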

Related

What is the best way to consume the same topic from many different kafka brokers with spark structured streaming?

I have a situation where my load is distributed between a few data centers (DCs); each data center has its own Kafka broker and data processors that process the data only for that data center.
So I'll have the brokers broker1-dc1, broker1-dc2, ..., broker1-dcn, and all brokers will have the same topics, e.g. DATA_TOPIC.
What I want is to consume the topic DATA_TOPIC from all my different brokers and persist this data in a single data lake table. I am doing it with Structured Streaming, but that isn't a requirement.
I don't have much experience with Spark, and what I want to know is the best way I can do this. I'm considering two options:
Have different Spark jobs, where each one consumes the data from a different data center and has a unique checkpoint location;
Have a single job that has one consumer (Kafka readStream) for each data center, and do a union between all consumers.
Which of these options is better, or is there an even better option?
I don't know if this helps, but I'm planning to use an AWS architecture with EMR, S3, Glue, and Delta Lake or Iceberg as table formats.
Thanks
Kafka clients can only use one bootstrap.servers list at a time, so if the plan is to define N streaming dataframes, that seems like a poor design choice, since one failing stream ideally shouldn't stop your whole application.
Instead, I'd suggest looking into MirrorMaker2 to consolidate the topics into one Kafka cluster that you run your processing against, which should give the same effect as the union.
Your first option is somewhat similar, but it's a tradeoff between managing N Spark applications along with their checkpoints, or N Kafka Connect processes that serve a single purpose and can be run in one Connect cluster.
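For illustration, a minimal sketch (Scala) of the union approach from option 2, under the assumption that every data center's broker is reachable from one job. Broker addresses, the checkpoint location, and the Delta sink path are placeholders. As noted above, a failure on any one of the unioned streams affects the whole query, which is the main argument for consolidating with MirrorMaker2 instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("multi-dc-union").getOrCreate()

// Placeholder list: one bootstrap.servers entry per data center
val dcBootstrapServers = Seq("broker1-dc1:9092", "broker1-dc2:9092")

// One readStream per data center, since a Kafka source can only point at one cluster
def readDc(bootstrap: String): DataFrame =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap)
    .option("subscribe", "DATA_TOPIC")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")

// Union all data centers into one unbounded DataFrame and write to a single table
val allDcs = dcBootstrapServers.map(readDc).reduce(_ union _)

allDcs.writeStream
  .format("delta")                                                     // assumes Delta Lake on the classpath
  .option("checkpointLocation", "s3://bucket/checkpoints/data_topic")  // placeholder
  .start("s3://bucket/lake/data_topic")                                // placeholder table path
```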

Kafka Spark Streaming ingestion for multiple topics

We are currently ingesting Kafka messages into HDFS using Spark Streaming. So far we spawn a whole Spark job for each topic.
Since messages are produced pretty rarely for some topics (an average of one per day), we're thinking about organising the ingestion in pools.
The idea is to avoid creating a whole container (and related resources) for these "infrequent" topics. In fact, Spark Streaming accepts a list of topics as input, so we're thinking about using this feature to have a single job consume all of them.
Do you think this is a good strategy? We also thought about batch ingestion, but we'd like to keep the real-time behaviour, so we excluded that option. Do you have any tips or suggestions?
Does Spark Streaming handle multiple topics well as a source, in case of failures, in terms of offset consistency etc.?
Thanks!
I think Spark should be able to handle multiple topics fine, as it has supported this for a long time. And yes, Kafka Connect is not a Confluent-only API; Confluent does provide connectors for its platform, but you can use Connect too. Apache Kafka also has documentation for the Connect API.
It is a little more difficult with the Apache version of Kafka, but you can use it.
https://kafka.apache.org/documentation/#connectapi
Also, if you're opting for multiple Kafka topics in a single Spark Streaming job, you may need to think about not creating small files, since your message frequency seems very low.
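A minimal sketch (Scala, assuming the spark-streaming-kafka-0-10 integration) of a single Spark Streaming job subscribed to a list of topics, as the question considers. The broker address, topic names, batch interval, group id, and HDFS output path are placeholders; a longer batch interval is one way to limit the small-files problem mentioned above for low-volume topics.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("multi-topic-ingest")
val ssc = new StreamingContext(conf, Seconds(60))    // placeholder batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",              // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "hdfs-ingest",                       // placeholder
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// One direct stream subscribed to several topics, rather than one job per topic
val topics = Seq("rare-topic-1", "rare-topic-2", "busy-topic")   // placeholder list
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {                              // skip empty batches to avoid empty output files
    rdd.map(record => s"${record.topic}\t${record.value}")                    // tag each record with its topic
      .saveAsTextFile(s"hdfs:///ingest/batch-${System.currentTimeMillis()}")  // placeholder path
  }
}

ssc.start()
ssc.awaitTermination()
```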

Are Spark Streaming, Structured Streaming and Kafka Streaming the same thing?

I have come across three popular streaming techniques: Spark Streaming, Structured Streaming and Kafka Streaming.
I have gone through various sites but haven't found an answer: are these three the same thing or different?
If not, what is the basic difference?
I am not looking for an in-depth answer, just an answer to the above question (yes or no) and a little intro to each of them so that I can explore more. :)
Thanks in advance
Subrat
I guess you are referring to Kafka Streams when you say "Kafka Streaming".
Kafka Streams is a JVM library and part of Apache Kafka. It is a way of processing data in Kafka topics that provides an abstraction layer. Applications using the Kafka Streams library can run anywhere (not just on the Kafka cluster; in fact, that is not recommended). They consume, process, and produce data to and from the Kafka cluster.
Spark Streaming is part of the Apache Spark distributed data processing library that provides stream (as opposed to batch) processing. Spark initially provided batch computation only, so a specific layer, Spark Streaming, was added for stream processing. Spark Streaming can be fed with Kafka data, but it can be connected to other sources as well.
Structured Streaming, also within Apache Spark, is a different approach that came to overcome certain limitations of the stream processing model that Spark Streaming used. It was added to Spark from a certain version onwards (2.0, IIRC).
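To make the difference between the two Spark APIs concrete, here is a rough side-by-side sketch (Scala, not meant to run as a single program): a DStream-based Spark Streaming word count next to the equivalent Structured Streaming query. The socket source and console sink are placeholders just to keep the sketch small.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.SparkSession

// Spark Streaming (DStreams): micro-batches exposed as RDDs on a fixed batch interval
val ssc = new StreamingContext(new SparkConf().setAppName("dstream-wordcount"), Seconds(10))
ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

// Structured Streaming: the stream is an unbounded DataFrame queried through the SQL engine
val spark = SparkSession.builder().appName("structured-wordcount").getOrCreate()
spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .selectExpr("explode(split(value, ' ')) AS word")
  .groupBy("word").count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
```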

Spark streaming multiple Kafka topic to multiple Database table with checkpoint

I'm building a Spark Streaming pipeline from Apache Kafka to our columnar database.
To ensure fault tolerance I'm using HDFS checkpointing and a write-ahead log.
Apache Kafka topic -> Spark Streaming -> HDFS checkpoint -> Spark SQL (for message manipulation) -> Spark JDBC to our DB.
When I use a Spark job for one topic and one table, everything works fine.
When I try to stream multiple Kafka topics in one Spark job and write to multiple tables, the problem with the checkpoint starts (the checkpoint is per topic/table).
The problem is with checkpoints :(
1) If I use "KafkaUtils.createDirectStream" with a list of topics and "groupBy" topic name, there is only one checkpoint folder, so if, for example, I need to increase resources during the ongoing streaming (change the number of cores due to Kafka lag), this is impossible, because today that is only possible if I delete the checkpoint folder and restart the Spark job.
2) Use multiple Spark StreamingContexts; I will try this today and see if it works.
3) Multiple Spark Streaming jobs with high-level consumers (offsets saved in Kafka, 0.10 consumer...)
Any other ideas/solutions that I'm missing?
Does Structured Streaming with multiple Kafka topics and checkpoints behave differently?
Thx
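On the last question: a minimal sketch (Scala) of how the same pipeline could look in Structured Streaming, assuming one query per topic/table, each with its own checkpoint directory, writing through JDBC in foreachBatch. Topic names, table names, the JDBC URL, and checkpoint paths are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("multi-topic-to-db").getOrCreate()

def topicToTable(topic: String, table: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")       // placeholder
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    // Separate checkpoint per query: offsets for each topic/table are tracked independently,
    // so one query can be changed or restarted without touching the others' checkpoints.
    .option("checkpointLocation", s"hdfs:///checkpoints/$table")  // placeholder
    .foreachBatch { (batch: DataFrame, _: Long) =>
      batch.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://db:5432/analytics")     // placeholder
        .option("dbtable", table)
        .mode("append")
        .save()
    }
    .start()

topicToTable("topic_a", "table_a")
topicToTable("topic_b", "table_b")
spark.streams.awaitAnyTermination()
```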

Spark Streaming and Kafka: one cluster or several standalone boxes?

I am about to make a decision about using the Spark Streaming Kafka integration.
I have a Kafka topic (which I can break into several topics) queuing several tens of thousands of messages per minute; my Spark Streaming application ingests the messages, applies transformations, and then updates a UI.
Knowing that all failures are handled and data is replicated in Kafka, what is the best option for implementing the Spark Streaming application in order to achieve the best possible performance and robustness:
One Kafka topic and one Spark cluster.
Several Kafka topics and several standalone Spark boxes (one machine with a standalone Spark cluster for each topic).
Several Kafka topics and one Spark cluster.
I am tempted to go for the second option, but I couldn't find people talking about such a solution.
An important element to consider in this case is the partitioning of the topic.
The parallelism level of your Kafka-Spark integration will be determined by the number of partitions of the topic. The direct Kafka model simplifies the consumption model by establishing a 1:1 mapping between the number of partitions of the topic and RDD partitions for the corresponding Spark job.
So the recommended setup would be: one Kafka topic with n partitions (where n is tuned for your use case) and a Spark cluster with enough resources to process the data from those partitions in parallel.
Option #2 feels like trying to re-implement what Spark gives you out of the box: resilient distributed computing. Option #2 tries to parallelize the payload over several machines and deal with failure with independent executors, but you get that with a single Spark cluster, with the benefit of improved resource usage and a single deployment.
Option 1 is straightforward, simple, and probably more efficient. If your requirements are met, that's the one to go for (and honor the KISS principle).
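A small sketch (Scala, assuming the spark-streaming-kafka-0-10 direct stream) of the 1:1 mapping described in the answer: each micro-batch RDD carries one partition per Kafka partition, along with the offset range it covers. The broker, topic, and group id are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(new SparkConf().setAppName("partition-demo"), Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",              // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "partition-demo"                     // placeholder
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("ui_events"), kafkaParams)  // placeholder topic
)

stream.foreachRDD { rdd =>
  // With the direct model, a topic with n partitions yields an RDD with n partitions,
  // so n tasks per batch; that is the parallelism ceiling the answer refers to.
  println(s"RDD partitions this batch: ${rdd.getNumPartitions}")
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    .foreach(r => println(s"${r.topic}-${r.partition}: ${r.fromOffset}..${r.untilOffset}"))
}

ssc.start()
ssc.awaitTermination()
```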
