Spark + Read kafka topic from a specific offset based on timestamp - apache-spark

How do I set a spark job to pick up a kafka topic from a specific offset based on a timestamp ? Let's say that I need to get all data from a kafka topic starting 6 hours ago.

Kafka does not work in that way. You are seeing Kafka like something you can query with another different parameter than offset, besides keep in mind that topic can have more than one partition so each one has a different one. Maybe you can use another relational storage to map offset/partition with timestamp, a little bit risky. Thinking in akka stream kafka consumer, for example, each of your request by timestamp should be send via another topic to activate your consumers(each of them with one ore more partitions assigned) and query for the specific offset, produce and merge. With Spark, you can adjust your consumer strategies for each job but the process should be the same.
Another thing is if your Kafka recovers it´s possible that you need to read the whole topic to update your pair (timestamp/offset). All of this can sound a little bit weird and maybe it should be better to store your topic in Cassandra (for example) and you can query it later.

The answers provided here seems to be dated. As with the latest API documentation for Spark 3.x.x given here, Structured Streaming Kafka Integration
There are quite few flexible ways in which the messages can be retrieved between a specified window from Kafka.
An example code for the batch api that get messages from all the partitions, that are between the window specified via startingTimestamp and endingTimestamp, which is in epoch time with millisecond precision.
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingTimestamp", 1650418006000)
.option("endingTimestamp", 1650418008000)
.load()

Kafka is an append-only log storage. You can start consuming from a particular offset in a partition given that you know the offset. Consumption is super fast, you can have a design where you start from the smallest offset and start doing some logic only once you have come across a message (which could probably have a timestamp field to check for).

Related

Is it possible to have a single kafka stream for multiple queries in structured streaming?

I have a spark application that has to process multiple queries in parallel using a single Kafka topic as the source.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
What would be the recommended way to improve performance in the scenario above ? Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Any thoughts are welcome,
Thank you.
The behavior I noticed is that each query has its own consumer (which is in its own consumer group) causing the same data to be streamed to the application multiple times (please correct me if I'm wrong) which seems very inefficient, instead I would like to have a single stream of data that would be then processed in parallel by Spark.
tl;dr Not possible in the current design.
A single streaming query "starts" from a sink. There can only be one in a streaming query (I'm repeating it myself to remember better as I seem to have been caught multiple times while with Spark Structured Streaming, Kafka Streams and recently with ksqlDB).
Once you have a sink (output), the streaming query can be started (on its own daemon thread).
For exactly the reasons you mentioned (not to share data for which Kafka Consumer API requires group.id to be different), every streaming query creates a unique group ID (cf. this code and the comment in 3.3.0) so the same records can be transformed by different streaming queries:
// Each running query should use its own group id. Otherwise, the query may be only assigned
// partial data since Kafka will assign partitions to multiple consumers having the same group
// id. Hence, we should generate a unique id for each query.
val uniqueGroupId = KafkaSourceProvider.batchUniqueGroupId(sourceOptions)
And that makes sense IMHO.
Should I focus on optimizing Kafka partitions instead of how Spark interacts with Kafka ?
Guess so.
You can separate your source data frame into different stages, yes.
val df = spark.readStream.format("kafka") ...
val strDf = df.select(cast('value).as("string")) ...
val df1 = strDf.filter(...) # in "parallel"
val df2 = strDf.filter(...) # in "parallel"
Only the first line should be creating Kafka consumer instance(s), not the other stages, as they depend on the consumer records from the first stage.

Consume a Kafka Topic every hour using Spark

I want to consume a Kafka topic as a batch where I want to read Kafka topic hourly and read the latest hourly data.
val readStream = existingSparkSession
.read
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "kafka.raw")
.load()
But this always read first 20 data rows and these rows are starting from the very beginning so this never pick latest data rows.
How can I read the latest rows on a hourly basis using scala and spark?
If you read Kafka messages in Batch mode you need to take care of the bookkeeping which data is new and which is not yourself. Remember that Spark will not commit any messages back to Kafka, so every time you restart the batch job it will read from beginning (or based on the setting startingOffsets which defaults to earliest for batch queries.
For your scenario where you want to run the job once every hour and only process the new data that arrived to Kafka in the previous hour, you can make use of the writeStream trigger option Trigger.Once for streaming queries.
There is a nice blog from Databricks that nicely explains why a streaming query with Trigger.Once should be preferred over a batch query.
The main point being:
"When you’re running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you."
Make sure that you also set the option "checkpointLocation" in your writeStream. In the end, you can have a simple cron job that submits your streaming job once an hour.

Best deduplication strategy to be used with spark

What is the best de-duplication strategy to be used with spark?
I have a Kafka source that is continuously fed with structured information (say JSON) from various producers continuously.
I am having an HDInsight spark cluster that can pick messages in real time for this Kafka source, process them and put it into a destination Kafka source in real time.
My use case demands that the information received from the source may have duplicates which need to be eliminated. The duplicates have to be be checked against say last 24 hours.
My attempt :
I tried using the .dropduplicate method in spark along with watermarking , but I think it's not the best thing to do since the data for a single day window may exceed 50 GB in my use case.
I also looked for bloom filter implementation which can be used with spark but couldn't find a good one.
My question:
What are the possible approaches to eliminate duplication in general for large scale spark streaming application.?
Which of these features can be used along with HDInsight clusters on Azure ?
What are the fault tolerance capability in such services ?

Spark Kafka Streaming - Send original timestamp rather than current timestamp

I am using spark Structured streaming to send records to a kafka topic. The kafka topic is created with the config - message.timestamp.type=CreateTime
This is done so that the target Kafka topic records have the same timestamp as the original Records.
My kafka streaming code :
kafkaRecords.selectExpr("CAST(key AS STRING)", "CAST(value AS BINARY)","CAST(timestamp AS TIMESTAMP)")
.write
.format("kafka")
.option("kafka.bootstrap.servers","IP Of kafka")
.option("topic",targetTopic)
.option("kafka.max.in.flight.requests.per.connection", "1")
.option("checkpointLocation",checkPointLocation)
.save()
However, this does not preserve the original timestamp that is 2018/11/04, instead the timestamp reflects the latest date 2018/11/9.
On another note, just to confirm that kafka config is functioning, when I explicitly create a Kafka Producer and producer records having the timestamp and send that across, the original timestamp is preserved.
How can I get the same behaviour in Kafka Structured Streaming as well.
The CreateTime config of a topic would mean when the records are created, that is the time you get.
It's not clear where you're reading the data and seeing the timestamps, if you are running the producer code "today", that's the time they get, not before.
If you want timestamps of the past, you'll need to actually make your ProducerRecords contain that timestamp by using the constructor that includes a timestamp parameter, but Spark does not expose it.
If you put just the timestamp in the payload value, as you're doing, that's the time you'll want to be doing analysis on, probably, not a ConsumerRecord.timestamp()
If you want to exactly copy data from one topic to another, Kafka uses MirrorMaker to accomplish this. Then you only need config files, not writing&deploying Spark code

Spark Structured Streaming - compare two streams

I am using Kafka and Spark 2.1 Structured Streaming. I have two topics with data in json format eg:
topic 1:
{"id":"1","name":"tom"}
{"id":"2","name":"mark"}
topic 2:
{"name":"tom","age":"25"}
{"name":"mark","age:"35"}
I need to compare those two streams in Spark base on tag:name and when value is equal execute some additional definition/function.
How to use Spark Structured Streaming to do this ?
Thanks
Following the current documentation (Spark 2.1.1)
Any kind of joins between two streaming Datasets are not yet
supported.
ref: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations
At this moment, I think you need to rely on Spark Streaming as proposed by #igodfried's answer.
I hope you got your solution. In case not, then you can try creating two KStreams from two topics and then join those KStreams and put joined data back to one topic. Now you can read the joined data as one DataFrame using Spark Structured Streaming. Now you'll be able to apply any transformations you want on the joined data. Since Structured streaming doesn't support join of two streaming DataFrames you can follow this approach to get the task done.
I faced a similar requirement some time ago: I had 2 streams which had to be "joined" together based on some criteria. What I used was a function called mapGroupsWithState.
What this functions does (in few words, more details on the reference below) is to take stream in the form of (K,V) and accumulate together its elements on a common state, based on the key of each pair. Then you have ways to tell Spark when the state is complete (according to your application), or even have a timeout for incomplete states.
Example based on your question:
Read Kafka topics into a Spark Stream:
val rawDataStream: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", "topic1,topic2") // Both topics on same stream!
.option("startingOffsets", "latest")
.option("failOnDataLoss", "true")
.load()
.selectExpr("CAST(value AS STRING) as jsonData") // Kafka sends bytes
Do some operations on your data (I prefer SQL, but you can use the DataFrame API) to transform each element into a key-value pair:
spark.sqlContext.udf.register("getKey", getKey) // You define this function; I'm assuming you will be using the name as key in your example.
val keyPairsStream = rawDataStream
.sql("getKey(jsonData) as ID, jsonData from rawData")
.groupBy($"ID")
Use the mapGroupsWithState function (I will show you the basic idea; you will have to define the myGrpFunct according to your needs):
keyPairsStream
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(myGrpFunct)
Thats it! If you implement myGrpFunct correctly, you will have one stream of merged data, which you can further transform, like the following:
["tom",{"id":"1","name":"tom"},{"name":"tom","age":"25"}]
["mark",{"id":"2","name":"mark"},{"name":"mark","age:"35"}]
Hope this helps!
An excellent explanation with some code snippets: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming/
One method would be to transform both streams into (K,V) format. In your case this would probably take the form of (name, otherJSONData) See the Spark documentation for more information on joining streams and an example located here. Then do a join on both streams and then perform whatever function on the newly joined stream. If needed you can use map to return (K,(W,V)) to (K,V).

Resources