I have two Kafka streams: one stream has an event ID, and the other stream has an event ID and a transaction ID.
One event ID can have multiple transaction IDs.
I want to use Spark Streaming to keep sinking all the data from the second Kafka topic to a database (without any aggregation), but only for the event IDs that appear on the first Kafka topic, and only within 20 minutes of when we first get that event ID from the first Kafka topic.
I am wondering if there are any suggestions on how to do this in Spark Streaming.
Any comments or suggestions are most welcome.
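One common way to approach this with Structured Streaming (rather than the older DStream API) is a stream-stream inner join with watermarks and a time-range condition, sinking each micro-batch with foreachBatch. Below is a minimal sketch only: the JSON fields eventId/txnId, the topic names events and transactions, and the broker, JDBC and checkpoint settings are all hypothetical placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("event-txn-sink").getOrCreate()
import spark.implicits._

// Stream 1: event ids (topic name "events" is a placeholder)
val events = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(
    get_json_object($"value".cast("string"), "$.eventId").as("eventId"),
    $"timestamp".as("eventTime"))
  .withWatermark("eventTime", "20 minutes")

// Stream 2: event id + transaction id (topic name "transactions" is a placeholder)
val txns = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "transactions")
  .load()
  .select(
    get_json_object($"value".cast("string"), "$.eventId").as("txnEventId"),
    get_json_object($"value".cast("string"), "$.txnId").as("txnId"),
    $"timestamp".as("txnTime"))
  .withWatermark("txnTime", "20 minutes")

// Keep only transactions whose event id appeared on the first stream,
// and only within 20 minutes of that event's arrival.
val joined = txns.join(events, expr(
  """txnEventId = eventId AND
     txnTime >= eventTime AND
     txnTime <= eventTime + interval 20 minutes"""))

// Sink every micro-batch to the database (JDBC details are placeholders).
joined.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    batch.write.format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/mydb")
      .option("dbtable", "transactions_sink")
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/chk/event-txn")
  .start()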
Related
I have a Spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient; instead I would like to have a single stream of data that would then be processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group), causing the same data to be streamed to the application multiple times (please correct me if I'm wrong), which seems very inefficient; instead I would like to have a single stream of data that would then be processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one sink in a streaming query (I'm repeating it to myself to remember it better, as I seem to have been caught out by this multiple times with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (so as not to share data, for which the Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so that the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka?
Guess so.
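For what it's worth, the number of topic partitions bounds the read parallelism of a single query, and the Structured Streaming Kafka source can additionally split partitions with the minPartitions option. A short sketch (broker and topic names are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "my-topic")                   // placeholder
  .option("minPartitions", "24")  // ask for at least 24 Spark input partitions per micro-batch
  .load()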
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select($"value".cast("string")) ...
val df1 = strDf.filter(...) // in "parallel"
val df2 = strDf.filter(...) // in "parallel"
Only the first line should create Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.
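To round out the sketch above: each filtered branch still needs its own sink before it can run, so starting both produces two running streaming queries. A hedged continuation, with placeholder output and checkpoint paths:

// Each branch becomes its own streaming query once a sink is attached.
val q1 = df1.writeStream
  .format("parquet")
  .option("path", "/data/out1")              // placeholder
  .option("checkpointLocation", "/chk/q1")   // placeholder
  .start()

val q2 = df2.writeStream
  .format("parquet")
  .option("path", "/data/out2")              // placeholder
  .option("checkpointLocation", "/chk/q2")   // placeholder
  .start()

spark.streams.awaitAnyTermination()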
Let's say I have a Kafka topic without any duplicate messages.
If I consumed this topic with Spark Structured Streaming, added a column with currentTime(), partitioned by this time column, and saved the records to S3, would there be a risk of creating duplicates in S3 in case of some failures?
Or is Spark smart enough to deliver these messages exactly once?
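For reference, a minimal sketch of the pipeline being described (topic, bucket and checkpoint paths are placeholders, and the time column is reduced to a date here so the partitions stay coarse); whether duplicates can appear on failure depends on the sink and its use of the checkpoint:

import org.apache.spark.sql.functions.{current_timestamp, to_date}

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")        // placeholder
  .option("subscribe", "my-topic")                          // placeholder
  .load()

val withTime = kafkaDf
  .selectExpr("CAST(value AS STRING) AS value")
  .withColumn("ingest_date", to_date(current_timestamp())) // processing-time column

withTime.writeStream
  .format("parquet")
  .partitionBy("ingest_date")
  .option("path", "s3a://my-bucket/events/")                // placeholder
  .option("checkpointLocation", "s3a://my-bucket/chk/")     // placeholder
  .start()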
I'm a newbie to Kafka and Spark, wondering how to recover the offset from Kafka after a Spark job has failed.
Conditions:
say 5 GB/s of Kafka stream, so it's hard to consume from the beginning
the stream data has already been consumed, so how do I tell Spark to re-consume the messages / redo the failed task smoothly?
I'm not too sure which area to search in; maybe someone can point me in the right direction.
When we are dealing with Kafka, we must have 2 different topics: one for success and one for failure.
Let's say I have 2 topics, Topic-Success and Topic-Failed.
When Kafka processes the data stream successfully, we can mark it and store it in the Topic-Success topic, and when Kafka is unable to process the data stream, we will store it in the Topic-Failed topic.
That way, when you want to re-consume the failed data stream, you can process just the failed records from the Topic-Failed topic. This eliminates re-consuming all the data from the beginning.
Hope this helps you.
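One hedged sketch of this routing with the Structured Streaming Kafka sink, assuming a hypothetical processedDf that carries key/value columns and a boolean processedOk flag (the sink writes each row to the topic named in its topic column); broker and checkpoint settings are placeholders:

import org.apache.spark.sql.functions.{col, when}

// Route each record to Topic-Success or Topic-Failed based on the
// hypothetical processedOk flag.
val routed = processedDf
  .withColumn("topic", when(col("processedOk"), "Topic-Success").otherwise("Topic-Failed"))
  .selectExpr("topic", "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

routed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("checkpointLocation", "/chk/routing")        // placeholder
  .start()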
In Kafka 0.10.x there is a concept of a Consumer Group, which is used to track the offsets of the messages.
If you have set enable.auto.commit=true and auto.offset.reset=latest, it will not consume from the beginning. With this approach you might also need to track your offsets yourself, as the process might fail after consumption. I would suggest you use the method recommended in the Spark docs to commit the offsets back to Kafka yourself:
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // some time later, after outputs have completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
With CanCommitOffsets, it lies in your hands to commit those offsets once your end-to-end pipeline has executed.
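For context, the stream in that snippet is typically created with the Kafka 0.10 direct API, with auto-commit disabled so that the commitAsync call above is the only thing committing offsets. A sketch with placeholder broker, group and topic names:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",                    // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-consumer-group",              // placeholder
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)        // commit manually via commitAsync
)

// ssc is an existing StreamingContext
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams))  // placeholder topic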
I have to design a Spark Streaming application with the use case below. I am looking for the best possible approach for this.
I have an application which pushes data into 1000+ different topics, each with a different purpose. Spark Streaming will receive data from each topic and, after processing, will write the result back to a corresponding output topic.
Ex.
Input Type 1 Topic --> Spark Streaming --> Output Type 1 Topic
Input Type 2 Topic --> Spark Streaming --> Output Type 2 Topic
Input Type 3 Topic --> Spark Streaming --> Output Type 3 Topic
.
.
.
Input Type N Topic --> Spark Streaming --> Output Type N Topic and so on.
I need to answer the following questions.
Is it a good idea to launch 1000+ Spark Streaming applications, one per topic? Or should I have one streaming application for all topics, since the processing logic is going to be the same?
If one streaming context, then how will I determine which RDD belongs to which Kafka topic, so that after processing I can write it back to its corresponding output topic?
The client may add/delete topics from Kafka; how do I handle that dynamically in Spark Streaming?
How do I restart the job automatically on failure?
Any other issues you guys see here?
Highly appreciate your response.
1000 different Spark applications will not be maintainable; imagine deploying or upgrading each application.
You will have to use the recommended "Direct approach" instead of the Receiver approach, otherwise your application is going to use more than 1000 cores; if you don't have that many, it will be able to receive data from your Kafka topics but not to process it. From the Spark Streaming docs:
Note that, if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams. But note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application.
You can see in the Kafka Integration doc (there is one for Kafka 0.8 and one for 0.10) how to find out which topic a message belongs to.
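For example, with the Kafka 0.10 direct stream each ConsumerRecord carries its source topic, so the output topic can be derived per record; the Input-to-Output naming rule below is purely hypothetical:

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach { record =>
      val inputTopic  = record.topic()
      val outputTopic = inputTopic.replace("Input", "Output") // hypothetical naming rule
      // send (outputTopic, record.value()) with a Kafka producer created per partition
    }
  }
}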
If a client adds new topics or partitions, you will need to update your Spark Streaming topics configuration and redeploy the application. If you use Kafka 0.10 you can also use a regex for the topic names, see Consumer Strategies. I've experienced reading from a deleted topic in Kafka 0.8 and there were no problems, but still verify ("Trust, but verify").
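A sketch of the regex subscription (Kafka 0.10 integration); topics created later that match the pattern can then be picked up without a code change. The pattern and kafkaParams here are placeholders:

import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,                                       // an existing StreamingContext
  PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("Input-.*"), kafkaParams))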
See Spark Streaming's doc about Fault Tolerance; also use the --supervise mode when submitting your application to your cluster, see the Deploying documentation for more information.
To achieve exactly-once semantics, I suggest this GitHub repo from Spark Streaming's main committer: https://github.com/koeninger/kafka-exactly-once
Bonus: a good, similar StackOverflow post: Spark: processing multiple kafka topic in parallel
Bonus 2: Watch out for the soon-to-be-released Spark 2.2 and the Structured Streaming component
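As an illustration of the Structured Streaming route, one job can subscribe to all input topics with a pattern and route each record back out via a topic column on the Kafka sink; the Input-to-Output naming rule and connection settings below are hypothetical:

val in = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")        // placeholder
  .option("subscribePattern", "Input-.*")                   // placeholder pattern
  .load()

val out = in.selectExpr(
  "regexp_replace(topic, '^Input-', 'Output-') AS topic",   // hypothetical naming rule
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value")

out.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")         // placeholder
  .option("checkpointLocation", "/chk/multi-topic")          // placeholder
  .start()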
We need to sort the consumed records in the Kafka consumer part of Spark Streaming. Is it possible to know whether all the published records have been consumed by the Kafka consumer?
You can use KafkaConsumer#endOffsets(...) to get the offsets of the current end-of-log per partition. Of course, keep in mind that the end-of-log moves as long as new data is written by a producer. Thus, to get "end offsets" you must be sure that there is no running producer...
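A small sketch of that check with the plain consumer API (independent of the Spark job itself); as noted above, the result only stays true while no producer is writing:

import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

// Returns true if the consumer's position has reached the current end-of-log
// for every assigned partition. The end offsets keep moving while producers write.
def caughtUp(consumer: KafkaConsumer[String, String]): Boolean = {
  val assignment = consumer.assignment()
  consumer.endOffsets(assignment).asScala.forall {
    case (tp, end) => consumer.position(tp) >= end.longValue()
  }
}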