Spark streaming applications subscribing to same kafka topic - apache-spark

I am new to spark and kafka and I have a slightly different usage pattern of spark streaming with kafka.
I am using
spark-core_2.10 - 2.1.1
spark-streaming_2.10 - 2.1.1
spark-streaming-kafka-0-10_2.10 - 2.0.0
kafka_2.10 - 0.10.1.1
Continuous event data is being streamed to a kafka topic which I need to process from multiple spark streaming applications. But when I run the spark streaming apps, only one of them receives the data.
Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("group.id", "test-consumer-group");
kafkaParams.put("enable.auto.commit", "true");
kafkaParams.put("auto.commit.interval.ms", "1000");
kafkaParams.put("session.timeout.ms", "30000");
Collection<String> topics = Arrays.asList("4908100105999_000005");
JavaInputDStream<ConsumerRecord<String, String>> stream = org.apache.spark.streaming.kafka010.KafkaUtils.createDirectStream(
        ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
... //spark processing
I have two Spark Streaming applications; usually the first one I submit consumes the Kafka messages, while the second application just waits for messages and never proceeds.
From what I have read, a Kafka topic can be subscribed to by multiple consumers. Is that not true for Spark Streaming? Or is there something I am missing about the Kafka topic and its configuration?
Thanks in advance.

You can create multiple streams over the same topic as long as each uses its own group id. Here are more details from the online documentation for the 0.8 integration; there are two approaches:
Approach 1: Receiver-based Approach
Multiple Kafka input DStreams can be created with different groups and
topics for parallel receiving of data using multiple receivers.
Approach 2: Direct Approach (No Receivers)
No need to create multiple input Kafka streams and union them. With
directStream, Spark Streaming will create as many RDD partitions as
there are Kafka partitions to consume, which will all read data from
Kafka in parallel. So there is a one-to-one mapping between Kafka and
RDD partitions, which is easier to understand and tune.
You can read more at Spark Streaming + Kafka Integration Guide 0.8
From your code it looks like you are using 0.10, so refer to the Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher).
Even though it is using the Spark Streaming API, everything is controlled by Kafka properties, so depending on the group id you specify in the properties file, you can start multiple streams with different group ids.
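For illustration, here is a minimal sketch of the consumer setup (re-using the topic and broker from the question; the group names "app-1-group" / "app-2-group" are just placeholders). The only difference between the two applications is the group.id, and with distinct group ids both of them receive every record:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

// ssc is the JavaStreamingContext already created in the question.
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("auto.offset.reset", "latest");
// The only difference between the two applications: a distinct group id each.
kafkaParams.put("group.id", "app-1-group"); // use e.g. "app-2-group" in the second application
kafkaParams.put("enable.auto.commit", "true");

JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(
                Arrays.asList("4908100105999_000005"), kafkaParams));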
Cheers !

The number of consumers within a consumer group cannot exceed the number of partitions in the topic. If you want to consume the messages in parallel, you will need to introduce a suitable number of partitions and create receivers to process each partition.
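As a rough sketch of setting up those partitions up front (this uses the Kafka AdminClient, which requires kafka-clients 0.11+, newer than the 0.10.1.1 in the question; topic name, partition count and replication factor are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

// Call from a method declaring "throws Exception", since get() throws checked exceptions.
try (AdminClient admin = AdminClient.create(props)) {
    // 4 partitions allow up to 4 consumers of one group to read in parallel.
    NewTopic topic = new NewTopic("4908100105999_000005", 4, (short) 1);
    admin.createTopics(Collections.singleton(topic)).all().get();
}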

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient. Instead, I would like to have a single stream of data that would then be processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink, and there can only be one sink per streaming query (I'm repeating it to myself to remember it better, as I seem to have been caught out multiple times with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (not sharing data, for which the Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so that the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select('value.cast("string")) ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
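A minimal Java sketch of that layout (topic name, filter predicates and console sinks are placeholders; note that, per the generated-group-id code quoted above, each start() below still launches its own streaming query):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaFanout {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-fanout").getOrCreate();

        // One source definition...
        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "input-topic")          // placeholder topic name
                .load();

        Dataset<Row> strDf = df.selectExpr("CAST(value AS STRING) AS value");

        // ...split into two "parallel" branches with different filters.
        Dataset<Row> df1 = strDf.filter("value LIKE 'A%'");  // placeholder predicate
        Dataset<Row> df2 = strDf.filter("value LIKE 'B%'");  // placeholder predicate

        // Each started query is independent (and gets its own generated consumer group id).
        df1.writeStream().format("console").start();
        df2.writeStream().format("console").start();

        spark.streams().awaitAnyTermination();
    }
}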

Kafka Partition+Spark Streaming Context

Scenario: I have 1 topic with 2 partitions holding different data set collections, say A and B. I am aware that the DStream can consume the messages at the partition level and the topic level.
Query: Can we use two different streaming contexts, one for each partition, or a single streaming context for the entire topic and later filter the partition-level data? I am concerned about the performance impact of increasing the number of streaming contexts.
Quoting from the documentation.
Simplified Parallelism: No need to create multiple input Kafka streams
and union them. With directStream, Spark Streaming will create as many
RDD partitions as there are Kafka partitions to consume, which will
all read data from Kafka in parallel. So there is a one-to-one mapping
between Kafka and RDD partitions, which is easier to understand and
tune.
Therefore, if you are using the Direct Stream based Spark Streaming consumer, it should handle the parallelism for you.
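For example (a sketch, re-using a direct stream called stream as in the main question above), each ConsumerRecord knows its partition, so the partition-level data sets A and B can be filtered out of the single stream:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;

stream.foreachRDD(rdd -> {
    // Partition 0 holds data set A, partition 1 holds data set B (per the scenario above).
    JavaRDD<ConsumerRecord<String, String>> setA = rdd.filter(record -> record.partition() == 0);
    JavaRDD<ConsumerRecord<String, String>> setB = rdd.filter(record -> record.partition() == 1);
    System.out.println("A: " + setA.count() + ", B: " + setB.count());
});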

Spark Streaming Design for 1000+ topics

I have to design a Spark Streaming application for the use case below. I am looking for the best possible approach for this.
I have an application which pushes data into 1000+ different topics, each with a different purpose. Spark Streaming will receive data from each topic and, after processing, will write it back to a corresponding output topic.
Ex.
Input Type 1 Topic --> Spark Streaming --> Output Type 1 Topic
Input Type 2 Topic --> Spark Streaming --> Output Type 2 Topic
Input Type 3 Topic --> Spark Streaming --> Output Type 3 Topic
.
.
.
Input Type N Topic --> Spark Streaming --> Output Type N Topic and so on.
I need to answer the following questions.
1. Is it a good idea to launch 1000+ Spark Streaming applications on a per-topic basis? Or should I have one streaming application for all topics, since the processing logic is going to be the same?
2. If one streaming context, how will I determine which RDD belongs to which Kafka topic, so that after processing I can write it back to its corresponding output topic?
3. A client may add/delete topics from Kafka; how do I handle this dynamically in Spark Streaming?
4. How do I restart the job automatically on failure?
5. Any other issues you guys see here?
I highly appreciate your response.
1000 different Spark applications will not be maintainable; imagine deploying or upgrading each application.
You will have to use the recommended "Direct approach" instead of the Receiver approach, otherwise your application is going to use more than 1000 cores just for receiving; if you don't have more than that, it will be able to receive data from your Kafka topics but not to process them. From the Spark Streaming doc:
Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application.
You can see in the Kafka integration doc (there is one for Kafka 0.8 and one for 0.10) how to find out which topic a message belongs to; a sketch of routing records back to their corresponding output topics is shown after this answer.
If a client adds new topics or partitions, you will need to update your Spark Streaming topics conf and redeploy it. If you use Kafka 0.10 you can also use a regex for topic names, see Consumer Strategies. I've experienced reading from a deleted topic in Kafka 0.8 and there were no problems, but still verify ("trust, but verify").
See the Spark Streaming doc about Fault Tolerance; also use the --supervise mode when submitting your application to your cluster, see the Deploying documentation for more information.
To achieve exactly-once semantics, I suggest this GitHub repo from Spark Streaming's main committer: https://github.com/koeninger/kafka-exactly-once
Bonus, a good similar Stack Overflow post: Spark: processing multiple kafka topic in parallel
Bonus2: Watch out for the soon-to-be-released Spark 2.2 and the Structured Streaming component
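As a sketch of question 2 (the producer settings and the input/output topic naming convention are assumptions, not part of the original answer): each ConsumerRecord carries its source topic, so the corresponding output topic can be derived per record and written with a plain KafkaProducer:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// stream is a direct stream subscribed to all input topics (or a topic regex).
stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    // One producer per partition iteration; pooling/broadcasting it would be more efficient.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        while (records.hasNext()) {
            ConsumerRecord<String, String> record = records.next();
            // Hypothetical naming convention: "xyz-input" -> "xyz-output".
            String outputTopic = record.topic().replace("input", "output");
            // Apply your processing to record.value() here before sending.
            producer.send(new ProducerRecord<>(outputTopic, record.key(), record.value()));
        }
    }
}));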

What is use of "spark.streaming.blockInterval" in Spark Streaming DirectAPI

I want to understand what role "spark.streaming.blockInterval" plays in the Spark Streaming Direct API. As per my understanding, "spark.streaming.blockInterval" is used for calculating partitions, i.e. #partitions = (receivers x batchInterval) / blockInterval, but in the Direct API the number of Spark Streaming partitions is equal to the number of Kafka partitions.
How is "spark.streaming.blockInterval" used in the Direct API?
spark.streaming.blockInterval :
Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark.
And KafkaUtils.createDirectStream() does not use a receiver.
With directStream, Spark Streaming will create as many RDD partitions
as there are Kafka partitions to consume
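One quick way to see this for yourself (a sketch assuming a 0-10 direct stream named stream, like in the main question): print the RDD partition count per batch; it matches the number of Kafka partitions being consumed, whatever blockInterval is set to:

// With the direct API, this prints the Kafka partition count of the topic(s),
// independent of spark.streaming.blockInterval.
stream.foreachRDD(rdd -> System.out.println("RDD partitions this batch: " + rdd.getNumPartitions()));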

Why did two spark streaming jobs pull messages from the same Kafka topic with same group id not balancing load but getting same messages?

Kafka 0.8 official doc describes Kafka Consumer as follows:
"Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers."
I set up a Kafka cluster with Kafka 0.8.1.1 and use a Spark Streaming job (Spark 1.3) to pull data from its topics. The Spark Streaming code is as follows:
... ...
HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", brokerList);
kafkaParams.put("group.id", groupId);
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet
);
messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
    @Override
    public Void call(JavaPairRDD<String, String> rdd) throws Exception {
        long msgNum = rdd.count();
        System.out.println("There are " + msgNum + " messages read from Kafka.");
        ... ...
        return null;
    }
});
And then I submitted two Spark Streaming jobs to access the same topic with the same group id. I assumed that when I send 100 messages to the topic, the two jobs would get 100 messages in total (e.g. job1 gets 50 and job2 gets 50, or job1 gets 100 and job2 gets 0). However, each of them gets all 100. Such a result seems different from what the Kafka doc says.
Is there anything wrong with my code? Did I set the group id config correctly? Is this a bug or by design for createDirectStream()?
Test Env: Kafka 0.8.1.1 + Spark 1.3.1
Before version 0.9, consumer groups are a feature of Kafka's high-level consumer API; they are not available in the simple consumer API. createDirectStream uses the simple consumer API.
Some Tips:
The main reason to use a SimpleConsumer implementation is that you want greater control over partition consumption than consumer groups give you (e.g. read a message multiple times).
createDirectStream: Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch.
Refer:
Spark Streaming + Kafka Integration Guide
0.8.0 SimpleConsumer Example
The Kafka 0.9.0 release added a new Java consumer to replace the existing high-level ZooKeeper-based consumer and low-level consumer APIs. With it you can use a consumer group and commit offsets manually at the same time.
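For completeness, a small sketch of that new (0.9+) consumer outside Spark (topic and group names are placeholders): two processes running this with the same group.id split the partitions between them, and offsets are committed manually:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "shared-group");        // same group id in both processes -> load balancing
props.put("enable.auto.commit", "false");     // commit manually instead
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Arrays.asList("my-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("partition=%d offset=%d value=%s%n",
                    record.partition(), record.offset(), record.value());
        }
        consumer.commitSync();                 // manual offset commit
    }
}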
Creating two different spark apps to do the same thing with the same messages doesn't make sense. Use one app with more executors.
