Spark Kafka Streaming - Send original timestamp rather than current timestamp - apache-spark

I am using Spark Structured Streaming to send records to a Kafka topic. The Kafka topic is created with the config message.timestamp.type=CreateTime.
This is done so that the target Kafka topic records have the same timestamp as the original records.
My Kafka streaming code:
kafkaRecords.selectExpr("CAST(key AS STRING)", "CAST(value AS BINARY)", "CAST(timestamp AS TIMESTAMP)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "IP Of kafka")
  .option("topic", targetTopic)
  .option("kafka.max.in.flight.requests.per.connection", "1")
  .option("checkpointLocation", checkPointLocation)
  .save()
However, this does not preserve the original timestamp (2018/11/04); instead, the timestamps reflect the latest date, 2018/11/09.
On another note, just to confirm that the Kafka config is working: when I explicitly create a KafkaProducer and ProducerRecords carrying the timestamp and send those across, the original timestamp is preserved.
How can I get the same behaviour with Structured Streaming as well?

The CreateTime config of a topic means the timestamp is set when the records are created; that is the time you get.
It's not clear where you're reading the data and seeing the timestamps. If you are running the producer code "today", that's the timestamp the records get, not an earlier one.
If you want timestamps from the past, you'll need to make your ProducerRecords actually contain that timestamp by using the constructor that includes a timestamp parameter, but Spark's Kafka sink does not expose it.
If you put just the timestamp in the payload value, as you're doing, that's probably the time you'll want to do analysis on, not ConsumerRecord.timestamp().
If you want to copy data exactly from one topic to another, Kafka provides MirrorMaker to accomplish this. Then you only need config files, rather than writing and deploying Spark code.
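For reference, a minimal sketch of the plain-producer approach mentioned above; the broker address, topic name, key/value, and timestamp below are placeholders, not values taken from the question:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

val props = new Properties()
props.put("bootstrap.servers", "broker:9092") // placeholder broker address
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[ByteArraySerializer].getName)

val producer = new KafkaProducer[String, Array[Byte]](props)

// ProducerRecord(topic, partition, timestamp, key, value): the explicit timestamp
// passed here is what a CreateTime topic stores for the record.
val originalTimestamp: java.lang.Long = 1541289600000L // 2018-11-04 00:00 UTC in epoch millis
val record = new ProducerRecord[String, Array[Byte]](
  "targetTopic", null, originalTimestamp, "some-key", "some-value".getBytes("UTF-8"))

producer.send(record)
producer.close()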

Related

Sending time ordered events into Kafka

I am using Autoloader (from Databricks) to ingest some parquet files and send them later to a Kafka topic.
I am able to read the files and write them without any problem but I have doubts about the order.
These files contain a timestamp field inside the payload which indicates the modification date of the file.
Is it possible to write each of the events that I receive with the autoloader in the Kafka sink ordered by that date?
I would like to be able to write in Kafka from the oldest to the newest events based on this timestamp.
I have considered defining a function that is invoked by foreachBatch, in which it does a simple orderBy on each batch.
Something like this:
def orderByFunc(batchDF: DataFrame, batchID: Long): Unit = {
  val orderedDF = batchDF.orderBy($"some_field".asc) // order by the timestamp field, oldest first
  orderedDF.write.format("kafka").option(...).save() // write into Kafka
}
streamingInputDF
  .writeStream
  .queryName(job_name)
  .option("checkpointLocation", checkpoint_path)
  .foreachBatch(orderByFunc _)
  .start()
Is there a less cumbersome way? Am I missing something?
Thank you very much to all.

Consume a Kafka Topic every hour using Spark

I want to consume a Kafka topic as a batch job, reading the topic hourly and picking up only the latest hour's data.
val readStream = existingSparkSession
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("subscribe", "kafka.raw")
  .load()
But this always reads the first 20 rows, and those rows start from the very beginning of the topic, so it never picks up the latest data.
How can I read the latest rows on an hourly basis using Scala and Spark?
If you read Kafka messages in batch mode, you need to do the bookkeeping yourself of which data is new and which is not. Remember that Spark will not commit any offsets back to Kafka, so every time you restart the batch job it will read from the beginning (or based on the setting startingOffsets, which defaults to earliest for batch queries).
For your scenario, where you want to run the job once every hour and only process the new data that arrived in Kafka during the previous hour, you can make use of the writeStream trigger option Trigger.Once for streaming queries.
There is a nice blog post from Databricks that explains why a streaming query with Trigger.Once should be preferred over a batch query.
The main point being:
"When you’re running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you."
Make sure that you also set the option "checkpointLocation" in your writeStream. In the end, you can have a simple cron job that submits your streaming job once an hour.
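As an illustration, a minimal sketch of such a Trigger.Once query; the servers, topic, sink, and paths are placeholders rather than values from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("hourly-kafka-ingest").getOrCreate()

// Read as a stream; Structured Streaming tracks the consumed offsets in the checkpoint.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "hostAddress:9092") // placeholder
  .option("subscribe", "kafka.raw")
  .load()

// Trigger.Once processes everything that arrived since the last run, then stops.
val query = df.writeStream
  .format("parquet") // any sink works; parquet is just an example
  .option("path", "/tmp/kafka-raw-output") // placeholder output path
  .option("checkpointLocation", "/tmp/kafka-raw-checkpoint") // keeps the offset bookkeeping
  .trigger(Trigger.Once())
  .start()

query.awaitTermination()

A cron entry can then spark-submit this job once an hour, and each run only processes the data that arrived since the previous run.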

Kafka S3 Sink Connector - how to mark a partition as complete

I am using Kafka sink connector to write data from Kafka to s3. The output data is partitioned into hourly buckets - year=yyyy/month=MM/day=dd/hour=hh. This data is used by a batch job downstream. So, before starting the downstream job, I need to be sure that no additional data will arrive in a given partition once the processing for that partition has started.
What is the best way to design this? How can I mark a partition as complete? i.e. no additional data will be written to it once marked as complete.
EDIT: I am using RecordField as the timestamp.extractor. My Kafka messages are guaranteed to be sorted within partitions by the partition field.
It depends on which timestamp extractor you are using in the sink config.
You would have to guarantee that no records can have a timestamp earlier than the time you consume them.
AFAIK, the only way that's possible is by using the Wallclock timestamp extractor. Otherwise, you are consuming either the Kafka record timestamp or some timestamp within each message, both of which can be overwritten on the producer end to some event in the past.

What happens when you restart a Spark job if it encounters an unexpected format in the data fed to Kafka

I have a question regarding Spark Structured Streaming with Kafka.
Suppose that I am running a Spark job and everything is working perfectly.
One fine day, my Spark job fails because of inconsistencies in the data that is fed to Kafka. Inconsistencies may be anything like data format issues or junk characters which Spark couldn't process. In such a case, how do we fix the issue? Is there a way we can get into the Kafka topic and make changes to the data manually?
If we don't fix the data issue and restart the Spark job, it will read the same old row which contributed to the failure, since we have not yet committed the checkpoint. So how do we get out of this loop? How do we fix the data issue in the Kafka topic so we can resume the aborted Spark job?
I would avoid trying to manually change one single message within a Kafka topic unless you really know what you are doing.
To prevent this from happening in the future, you might want to consider using a schema for your data (in combination with a schema registry).
To mitigate the problem you described, I see the following options:
Manually change the offset of the consumer group of your Structured Streaming application
Create a "new" streaming job that starts reading from a particular offset
Manually change offset
When using Spark's Structured Streaming, the consumer group is automatically set by Spark. According to the code, the consumer group will be defined as:
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
You can change the offset by using the kafka-consumer-groups tool. First, identify the actual name of the consumer group:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
and then set the offset of that consumer group for a particular topic (e.g. to offset 100):
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --execute --reset-offsets --group spark-kafka-source-1337 --topic topic1 --to-offset 100
If you need to change the offset only for a particular partition you can have a look at the help function of the tool on how to do this.
Create new Streaming Job
You could make use of the Spark option startingOffsets as described in the Spark + Kafka integration guide:
Option: startingOffsets
Value: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
Default: "latest" for streaming, "earliest" for batch
Meaning: The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
For this to work, it is important to have a "new" query. That means you need to delete the checkpoint files of your existing job or create a completely new application.
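For illustration, a minimal sketch of such a new query; it assumes topic1 has a single partition 0 and that offset 100 is where you want to resume, and the servers, topic, and paths are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("restart-after-bad-record").getOrCreate()

// startingOffsets only applies when the query starts with a fresh (empty) checkpoint.
// All partitions of the topic must be listed; -2 means earliest, -1 means latest.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "topic1")
  .option("startingOffsets", """{"topic1":{"0":100}}""")
  .load()

df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/new-checkpoint-location") // must be a new, empty location
  .start()
  .awaitTermination()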

Spark + Read kafka topic from a specific offset based on timestamp

How do I set up a Spark job to pick up a Kafka topic from a specific offset based on a timestamp? Let's say that I need to get all data from a Kafka topic starting 6 hours ago.
Kafka does not work that way. You are treating Kafka as something you can query by a parameter other than the offset; besides, keep in mind that a topic can have more than one partition, and each partition has its own offsets. Maybe you could use another relational store to map offset/partition to timestamp, though that is a little risky. Thinking of an akka-stream Kafka consumer, for example, each of your requests by timestamp would have to be sent via another topic to activate your consumers (each of them with one or more partitions assigned), query for the specific offset, produce, and merge. With Spark, you can adjust your consumer strategies for each job, but the process would be the same.
Another thing: if your Kafka cluster has to recover, it's possible that you would need to read the whole topic to update your (timestamp/offset) pairs. All of this may sound a little weird, and it might be better to store your topic in Cassandra (for example) and query it there later.
The answers provided here seem to be dated. As per the latest API documentation for Spark 3.x (Structured Streaming Kafka Integration), there are quite a few flexible ways in which messages can be retrieved from Kafka within a specified window.
Example code for the batch API that gets messages from all partitions falling within the window specified via startingTimestamp and endingTimestamp, which are epoch times with millisecond precision:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .option("startingTimestamp", 1650418006000L)
  .option("endingTimestamp", 1650418008000L)
  .load()
Kafka is an append-only log store. You can start consuming from a particular offset in a partition, given that you know the offset. Consumption is super fast, so you can have a design where you start from the smallest offset and only begin applying your logic once you come across a message that passes your check (which could probably be on a timestamp field).
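As a sketch of that idea, here is one way to read from the earliest offset and keep only records whose Kafka timestamp falls within the last 6 hours; the servers and topic are placeholders, and the filter uses the record's Kafka timestamp column rather than a field inside the payload:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().appName("last-6-hours").getOrCreate()

// Cutoff: 6 hours before now, as a SQL timestamp.
val cutoff = new Timestamp(System.currentTimeMillis() - 6 * 60 * 60 * 1000L)

val recent = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "some-topic") // placeholder
  .option("startingOffsets", "earliest")
  .load()
  // Each Kafka row exposes a `timestamp` column; keep only the recent ones.
  .filter(col("timestamp") >= lit(cutoff))

recent.show(false)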
