I am using Autoloader (from Databricks) to ingest some parquet files and send them later to a Kafka topic.
I am able to read the files and write them without any problem but I have doubts about the order.
These files contain a timestamp field inside the payload which indicates the modification date of the file.
Is it possible to write each of the events that I receive with the autoloader in the Kafka sink ordered by that date?
I would like to be able to write in Kafka from the oldest to the newest events based on this timestamp.
I have considered defining a function to be invoked by foreachBatch, which simply does an orderBy on each batch.
Something like this:
def orderByFunc(batchDF: DataFrame, batchID: Long): Unit = {
  val ordered_df = batchDF.orderBy($"some_field".asc) // order by the timestamp field, oldest first
  ordered_df.write.format("kafka").option(...).save() // write into Kafka
}
streamingInputDF
.writeStream
.queryName(job_name)
.option("checkpointLocation", checkpoint_path)
.foreachBatch(orderByFunc _)
.start()
Is there a less cumbersome way? Am I missing something?
Thank you very much to all.
Related
I want to log the number of records read from an incoming Spark Structured Streaming stream to a database. I'm using foreachBatch to transform each incoming micro-batch and write it to the desired location. I want to log 0 records read if there are no records in a particular hour, but foreachBatch does not execute when there is no incoming data. Can anyone help me with it? My code is as below:
val incomingStream = spark.readStream.format("eventhubs")
.options(customEventhubParameters.toMap).load()
val query=incomingStream.writeStream.foreachBatch{
(batchDF: DataFrame, batchId: Long) =>
writeStreamToDataLake(batchDF,batchId,partitionColumn,
fileLocation,errorFilePath,eventHubName,configMeta)
}.option("checkpointLocation",fileLocation+checkpointFolder+"/"+eventHubName)
.trigger(Trigger.ProcessingTime(triggerTime.toLong))
.start().awaitTermination()
This is how it works: even extensions to StreamingQueryListener are invoked only when there is something to process, and thus when the status of the stream changes.
There probably is another way, but I would say "think outside the box": pre-populate such a database with a 0 per timeframe, and aggregate when querying; you will then get the correct answer.
https://medium.com/#johankok/structured-streaming-in-a-flash-576cdb17bbee can give some insight, plus Spark: The Definitive Guide.
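A minimal sketch of that pre-populate-and-aggregate idea, assuming an active SparkSession named spark; the ingest_log table and its (hour_ts, record_count) columns are hypothetical placeholders:
import org.apache.spark.sql.functions._
import spark.implicits._

// 1) Pre-populate one zero-count row per hour, e.g. from a small scheduled job
//    that runs independently of the stream.
val zeros = spark.sql(
  "SELECT explode(sequence(to_timestamp('2023-01-01 00:00'), " +
  "to_timestamp('2023-01-01 23:00'), interval 1 hour)) AS hour_ts, " +
  "CAST(0 AS BIGINT) AS record_count")
zeros.write.mode("append").saveAsTable("ingest_log")

// 2) foreachBatch appends the real per-hour counts whenever a batch arrives, e.g.
//    batchDF.groupBy(date_trunc("hour", $"enqueuedTime")).count() ... into ingest_log

// 3) The reporting query sums per hour, so hours with no stream activity still show 0.
val report = spark.table("ingest_log")
  .groupBy($"hour_ts")
  .agg(sum($"record_count").as("records_read"))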
I have a use case in which I have to subscribe to multiple Kafka topics in Spark Structured Streaming, then parse each message and build a Delta Lake table out of it. I have written the parser, and the messages (in XML form) are correctly parsed and loaded into a Delta Lake table. However, I am only subscribing to one topic as of now. I want to subscribe to multiple topics and, based on the topic, send each message to the parser dedicated to that particular topic. So basically I want to identify the topic name for each message as it is processed, so that I can route it to the desired parser for further processing.
This is how I am accessing the messages from different topics. However, I have no idea how to identify the source of the incoming messages while processing them.
val stream_dataframe = spark.readStream
.format(ConfigSetting.getString("source"))
.option("kafka.bootstrap.servers", ConfigSetting.getString("bootstrap_servers"))
.option("kafka.ssl.truststore.location", ConfigSetting.getString("trustfile_location"))
.option("kafka.ssl.truststore.password", ConfigSetting.getString("truststore_password"))
.option("kafka.sasl.mechanism", ConfigSetting.getString("sasl_mechanism"))
.option("kafka.security.protocol", ConfigSetting.getString("kafka_security_protocol"))
.option("kafka.sasl.jaas.config",ConfigSetting.getString("jass_config"))
.option("encoding",ConfigSetting.getString("encoding"))
.option("startingOffsets",ConfigSetting.getString("starting_offset_duration"))
.option("subscribe",ConfigSetting.getString("topics_name"))
.option("failOnDataLoss",ConfigSetting.getString("fail_on_dataloss"))
.load()
var cast_dataframe = stream_dataframe.select(col("value").cast(StringType))
cast_dataframe = cast_dataframe.withColumn("parsed_column", parser(col("value"))) // parser is the UDF that parses the XML from the topic
How can I identify the topic name of the messages as they are processed in Spark Structured Streaming?
As per the official documentation (emphasis mine)
Each row in the source has the following schema:
Column      Type
key         binary
value       binary
topic       string
partition   int
...
As you can see, the input topic is part of the record schema and can be accessed without any special actions.
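For example, a sketch of routing on that column, adapting the question's cast_dataframe; the per-topic parser UDFs (parserA, parserB) and the topic names are hypothetical placeholders:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Keep the topic alongside the value so each message can be routed to its parser.
val cast_dataframe = stream_dataframe.select(col("topic"), col("value").cast(StringType))

// Messages from unmatched topics simply get a null parsed_column here.
val parsed_dataframe = cast_dataframe.withColumn(
  "parsed_column",
  when(col("topic") === "topic_a", parserA(col("value")))
    .when(col("topic") === "topic_b", parserB(col("value"))))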
I am using Spark Structured Streaming to send records to a Kafka topic. The Kafka topic is created with the config message.timestamp.type=CreateTime.
This is done so that the target Kafka topic records have the same timestamp as the original records.
My Kafka streaming code:
kafkaRecords.selectExpr("CAST(key AS STRING)", "CAST(value AS BINARY)","CAST(timestamp AS TIMESTAMP)")
.write
.format("kafka")
.option("kafka.bootstrap.servers","IP Of kafka")
.option("topic",targetTopic)
.option("kafka.max.in.flight.requests.per.connection", "1")
.option("checkpointLocation",checkPointLocation)
.save()
However, this does not preserve the original timestamp, which is 2018/11/04; instead the timestamp reflects the latest date, 2018/11/09.
On another note, just to confirm that the Kafka config is functioning: when I explicitly create a Kafka producer, build producer records carrying the timestamp and send them across, the original timestamp is preserved.
How can I get the same behaviour in Kafka Structured Streaming as well?
A topic's CreateTime config means the record timestamp is the time at which the record was created by the producer; that is the time you get.
It's not clear where you're reading the data and seeing the timestamps, but if you are running the producer code "today", that's the time the records get, not an earlier one.
If you want timestamps of the past, you'll need to actually make your ProducerRecords contain that timestamp by using the constructor that includes a timestamp parameter, but Spark does not expose it.
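For reference, a plain Kafka client sketch (outside Spark) of the constructor the answer refers to; the broker address, topic, key and value below are placeholders:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
val originalTimestamp = 1541289600000L // 2018/11/04 00:00 UTC in epoch millis
val record = new ProducerRecord[String, String](
  "targetTopic",                             // topic
  null: java.lang.Integer,                   // partition: let Kafka choose
  java.lang.Long.valueOf(originalTimestamp), // explicit record timestamp
  "some-key",
  "some-value")
producer.send(record)
producer.close()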
If you put the timestamp only in the payload value, as you're doing, then that is probably the time you'll want to do your analysis on, not ConsumerRecord.timestamp().
If you want to exactly copy data from one topic to another, Kafka ships with MirrorMaker to accomplish this; then you only need config files, rather than writing and deploying Spark code.
How do I set a Spark job to pick up a Kafka topic from a specific offset based on a timestamp? Let's say that I need to get all data from a Kafka topic starting 6 hours ago.
Kafka does not work that way. You are treating Kafka as something you can query by a parameter other than the offset; also keep in mind that a topic can have more than one partition, and each partition has its own offsets. Maybe you can use another relational store to map offset/partition to timestamp, though that is a little risky. Thinking of an Akka Streams Kafka consumer, for example, each of your requests by timestamp would have to be sent via another topic to activate your consumers (each of them with one or more partitions assigned), which query for the specific offset, produce and merge. With Spark, you can adjust your consumer strategies for each job, but the process would be the same.
Another thing: if your Kafka cluster recovers from a failure, it's possible that you will need to read the whole topic to update your (timestamp, offset) pairs. All of this may sound a little weird, and it might be better to store your topic in Cassandra (for example) and query it later.
The answers provided here seem to be dated. With the latest API documentation for Spark 3.x (see Structured Streaming Kafka Integration), there are quite a few flexible ways in which messages can be retrieved from Kafka between a specified window.
Here is example code for the batch API that gets messages from all the partitions that fall between the window specified via startingTimestamp and endingTimestamp, which are in epoch time with millisecond precision:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingTimestamp", 1650418006000)
.option("endingTimestamp", 1650418008000)
.load()
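For the streaming case in the original question (roughly the last 6 hours), a similar sketch, assuming the same bootstrapServers and topic values and a Spark version where startingTimestamp is supported for streaming reads; it only takes effect when the query starts without an existing checkpoint:
// Compute a timestamp ~6 hours in the past, in epoch milliseconds.
val sixHoursAgo = System.currentTimeMillis() - 6L * 60 * 60 * 1000

val streamDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topic)
  .option("startingTimestamp", sixHoursAgo.toString)
  .load()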
Kafka is an append-only log store. You can start consuming from a particular offset in a partition, given that you know the offset. Consumption is super fast, so you can have a design where you start from the smallest offset and only start applying your logic once you come across a matching message (which could have a timestamp field to check for).
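If you do need the offset for a given timestamp, the plain Kafka consumer API offers offsetsForTimes; a sketch under the assumption of placeholder broker and topic names (on Scala 2.12, use scala.collection.JavaConverters instead of CollectionConverters):
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("group.id", "offset-lookup")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val sixHoursAgo = System.currentTimeMillis() - 6L * 60 * 60 * 1000

// Map every partition of the topic to the target timestamp and look up the
// earliest offset whose record timestamp is >= that value (null if none exists).
val partitions = consumer.partitionsFor("my_topic").asScala
  .map(p => new TopicPartition(p.topic, p.partition))
val query = partitions.map(tp => tp -> java.lang.Long.valueOf(sixHoursAgo)).toMap.asJava
val offsets = consumer.offsetsForTimes(query)
offsets.asScala.foreach { case (tp, oat) =>
  if (oat != null) println(s"$tp -> start at offset ${oat.offset()}")
}
consumer.close()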
I have a structured streaming query which sinks to Kafka. This query has a complex aggregation logic.
I would like to sink the output DF of this query to multiple Kafka topics each partitioned on a different ‘key’ column. I don't want to have multiple Kafka sinks for each of the different Kafka topics because that would mean running multiple streaming queries - one for each Kafka topic, especially since my aggregation logic is complex.
Questions:
Is there a way to output the results of a structured streaming query to multiple Kafka topics each with a different key column but without having to execute multiple streaming queries?
If not, would it be efficient to cascade the multiple queries such that the first query does the complex aggregation and writes output to Kafka and then the other queries just read the output of the first query and write their topics to Kafka thus avoiding doing the complex aggregation again?
Thanks in advance for any help.
So the answer was kind of staring me in the face. It's documented as well; link below.
One can write to multiple Kafka topics from a single query. If the dataframe you want to write has a column named "topic" (along with the "key" and "value" columns), the contents of each row are written to the topic named in that row. This works automatically, so the only thing you need to figure out is how to generate the value of that column.
This is documented - https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
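A sketch of that approach; aggregatedDF, the event_type/key_col/value_col columns, the topic names, the broker address and the checkpoint path are all placeholders:
import org.apache.spark.sql.functions._

// Derive the "topic" column per row so a single query fans out to several topics.
val toKafka = aggregatedDF
  .withColumn("topic",
    when(col("event_type") === "orders", lit("orders_topic"))
      .otherwise(lit("other_topic")))
  .selectExpr("CAST(key_col AS STRING) AS key", "CAST(value_col AS STRING) AS value", "topic")

toKafka.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("checkpointLocation", "/tmp/checkpoints/multi-topic")
  .start() // no "topic" option needed; the per-row column decides the destination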
I am also looking for a solution to this problem, and in my case it is not necessarily a Kafka sink: I want to write some records of a dataframe to sink1 and other records to sink2 (depending on some condition), without reading the same data twice in two streaming queries.
Currently this does not seem possible with the current implementation (the createSink() method in DataSource.scala provides support for a single sink).
However, in Spark 2.4.0 there is a new API coming: foreachBatch(), which gives a handle to a dataframe micro-batch that can be cached, written to different sinks or processed multiple times before uncaching again.
Something like this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.cache()
  batchDF.write.format(...).save(...) // location 1
  batchDF.write.format(...).save(...) // location 2
  batchDF.unpersist() // Dataset has no uncache(); unpersist() releases the cached batch
}
Right now this feature is available in the Databricks runtime:
https://docs.databricks.com/spark/latest/structured-streaming/foreach.html#reuse-existing-batch-data-sources-with-foreachbatch
EDIT 15/Nov/18: It is now available in Spark 2.4.0 (https://issues.apache.org/jira/browse/SPARK-24565).
There is no way to have a single read and multiple writes in Structured Streaming out of the box; the only way is to implement a custom sink that writes into multiple topics.
Whenever you call dataset.writeStream().start(), Spark starts a new stream that reads from a source (readStream()) and writes into a sink (writeStream()).
Even if you try to cascade it, Spark will create two separate streams, each with one source and one sink. In other words, it will read, process and write the data twice:
Dataset df = <aggregation>;
StreamingQuery sq1 = df.writeStream()...start();
StreamingQuery sq2 = df.writeStream()...start();
There is a way to cache read data in Spark Streaming, but this option is not available for Structured Streaming yet.