Streaming from a Delta Live Table in Databricks to a Kafka instance - apache-spark

I have the following live table
And I'm looking to write it into a stream so it can be written back into my Kafka source.
I've seen in the Apache Spark docs that I can use writeStream (I've already used readStream to get it out of my Kafka stream). But how do I transform the table into the form it needs so I can use this?
I'm fairly new to both Kafka and the data world, so any further explanation is welcome here.
writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "updates")
  .start()
Thanks in Advance,
Ben

As of right now, Delta Live Tables can only write data as a Delta table - it's not possible to write in other formats. You can implement a workaround by creating a Databricks workflow that consists of two tasks (with or without dependencies between them, depending on whether the pipeline is triggered or continuous):
A DLT pipeline that does the actual data processing
A task (easiest to do with a notebook) that reads the table generated by DLT as a stream and writes its content into Kafka, with something like this:
df = spark.readStream.format("delta").table("database.table_name")
(df.selectExpr("to_json(struct(*)) AS value")   # Kafka sink expects a string/binary "value" column
   .writeStream                                 # streaming write, not a batch df.write
   .format("kafka")
   .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
   .option("topic", "updates")
   .option("checkpointLocation", "/path/to/checkpoint")   # required for streaming writes
   .trigger(availableNow=True)                  # if it's not continuous
   .start()
)
P.S. If you have a solution architect or customer success engineer attached to your Databricks account, you can communicate this requirement to them for product prioritization.

The transformation is done after the read stream process is started
read_df = spark.readStream.format('kafka') ... .... # other options
processed_df = read_df.withColumn('some column', some_calculation )
processed_df.writeStream.format('parquet') ... .... # other options
.start()
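For illustration, here is a fuller end-to-end sketch of the same read, transform, write pattern (Scala here; the broker address, topic name, and paths are hypothetical placeholders):
import org.apache.spark.sql.functions.col

val readDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
  .option("subscribe", "input_topic")                   // hypothetical topic
  .load()

// the transformation is applied to the streaming DataFrame returned by readStream
val processedDf = readDf.withColumn("value_string", col("value").cast("string"))

processedDf.writeStream
  .format("parquet")
  .option("path", "/tmp/output")                        // hypothetical output path
  .option("checkpointLocation", "/tmp/checkpoint")      // hypothetical checkpoint path
  .start()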
The Spark documentation is helpful and detailed, but some articles are not for beginners. You can look on YouTube or read articles to help you get started, like this one.

Related

Right way to read stream from Kafka topic using checkpointLocation offsets

I'm trying to develop a small Spark app (using Scala) to read messages from Kafka (Confluent) and write (insert) them into a Hive table. Everything works as expected, except for one important feature: managing offsets when the application is restarted (submitted). It confuses me.
Cut from my code:
def main(args: Array[String]): Unit = {
  val sparkSess = SparkSession
    .builder
    .appName("Kafka_to_Hive")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse/")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate()

  sparkSess.sparkContext.setLogLevel("ERROR")

  // don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
  sparkSess.udf.register("deserialize", (bytes: Array[Byte]) =>
    DeserializerWrapper.deserializer.deserialize(bytes)
  )

  val kafkaDataFrame = sparkSess
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("group.id", "kafka-to-hive-1")
    // ------> which Kafka options do I need to set here for starting from the last right offset to ensure completeness of data and "exactly once" writing? <--------
    .option("failOnDataLoss", (false: java.lang.Boolean))
    .option("subscribe", "some_topic")
    .load()

  import org.apache.spark.sql.functions._

  // don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
  val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")
  val df = valueDataFrame
    .select(from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
    .select("parsed_value.*")

  df.writeStream
    .foreachBatch((batchDataFrame, batchId) => {
      batchDataFrame.createOrReplaceTempView("`some_view_name`")
      val sqlText = "SELECT * FROM `some_view_name` a where some_field='some value'"
      val batchDataFrame_view = batchDataFrame.sparkSession.sql(sqlText)
      batchDataFrame_view.write.insertInto("default.some_hive_table")
    })
    .option("checkpointLocation", "/user/some_user/tmp/checkpointLocation")
    .start()
    .awaitTermination()
}
Questions (the questions are related to each other):
Which Kafka options do I need to apply on readStream.format("kafka") for starting from the last right offset on every submit of the Spark app?
Do I need to manually read the 3rd line of the checkpointLocation/offsets/latest_batch file to find the last offsets to read from Kafka? I mean something like this: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")
What is the right/convenient way to read a stream from a Kafka (Confluent) topic? (I'm not considering the offsets storing engine of Kafka)
"Which Kafka options do I need to apply on readStream.format("kafka") for starting from last right offset on every submit of spark app?"
You would need to set startingOffsets=latest and clean up the checkpoint files.
"Do I need to manually read 3rd line of checkpointLocation/offsets/latest_batch file to find last offsets to read from Kafka? I mean something like that: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")"
Similar to first question, if you set the startingOffsets as the json string, you need to delete the checkpointing files. Otherwise, the spark application will always fetch the information stored in the checkpoint files and override the settings given in the startingOffsets option.
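To illustrate the point about checkpoints, a minimal sketch (reusing the broker, topic, and checkpoint path from the question as placeholders): startingOffsets only matters on the very first run of a query; once a checkpoint exists, Spark resumes from the offsets stored there and the option is ignored.
val kafkaDf = sparkSess.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "some_topic")
  .option("startingOffsets", "latest")  // only used when no checkpoint exists yet
  .load()

kafkaDf.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "/user/some_user/tmp/output")  // hypothetical output path
  .option("checkpointLocation", "/user/some_user/tmp/checkpointLocation")
  .start()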
"What is the right/convenient way to read stream from Kafka (Confluent) topic? (I'm not considering offsets storing engine of Kafka)"
Asking about "the right way" might lead to opinion based answers and is therefore off-topic on Stackoverflow. Anyway, using Spark Structured Streaming is already a mature and production-ready approach in my experience. However, it is always worth also looking into KafkaConnect.

How to slow down the write speed of Kafka Producer?

I use Spark to write data to Kafka in this way:
df.write().format("kafka").save()
Can I control the write speed to Kafka to avoid putting pressure on it?
Are there any options that help to slow down the speed?
I think setting linger.ms to a non-zero value would help, as it controls the amount of time to wait for additional messages before sending the current batch. Note that in Spark the Kafka producer configs have to be passed with the kafka. prefix. The code can look like the following:
df.write.format("kafka").option("kafka.linger.ms", "100").save()
But this really depends on a lot of things. If your Kafka cluster is 'big' enough and configured properly, I wouldn't worry too much about the speed. After all, Kafka is designed to cope with this situation (traffic spikes).
Generally, Structured Streaming will try to process data as fast as possible by default. There are options in each source that allow you to control the processing rate, such as maxFilesPerTrigger in the File source and maxOffsetsPerTrigger in the Kafka source.
val streamingETLQuery = cloudtrailEvents
  .withColumn("date", $"timestamp".cast("date"))  // derive the date
  .writeStream
  .trigger(ProcessingTime("10 seconds"))          // check for files every 10s
  .format("parquet")                              // write as Parquet partitioned by date
  .partitionBy("date")
  .option("path", "/cloudtrail")
  .option("checkpointLocation", "/cloudtrail.checkpoint/")
  .start()
val df = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", 1)
  .load("text-logs")
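For the Kafka source specifically, the equivalent knob is maxOffsetsPerTrigger; a minimal sketch (the broker address and topic name are hypothetical):
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
  .option("subscribe", "events")                        // hypothetical topic
  .option("maxOffsetsPerTrigger", 10000L)               // cap the records read per micro-batch
  .load()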
Read the following links for more details:
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

How can I handle old data in the kafka topic?

I started using Spark Structured Streaming.
I read a stream from a Kafka topic (startingOffsets: latest), apply a watermark, group by event time with a window duration, and write back to a Kafka topic.
My question is:
How can I handle the data that was written to the Kafka topic before the Spark Structured Streaming job started?
I tried running with startingOffsets: earliest at first, but the data in the Kafka topic is too large, so the Spark streaming process does not start because of a YARN timeout (even though I increased the timeout value).
If I simply create a batch job and filter by a specific date range, the result is not reflected in the current state of the Spark streaming job; there seems to be a problem with the consistency and accuracy of the result. I tried resetting the checkpoint directory, but it did not work.
How can I handle the old and large data?
Help me.
You can try the parameter maxOffsetsPerTrigger for Kafka + Structured Streaming to receive old data from Kafka. Set the value of this parameter to the number of records you want to receive from Kafka at one time.
Use:
sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-name")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 1)  // number of records fetched per trigger
  .load()
// note: "group.id" and "auto.offset.reset" are left out on purpose - the Spark Kafka source
// manages its own consumer group, and startingOffsets replaces auto.offset.reset

How to process Avro messages while reading a stream of messages from Kafka?

The code below reads the messages from Kafka, and the messages are in Avro, so how do I parse the message and put it into a dataframe in Spark 2.2.0?
Dataset<Row> df = sparkSession.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic1")
    .load();
This https://github.com/databricks/spark-avro library has no example for the streaming case.
how do I parse the message and put it into a dataframe in Spark 2.2.0?
That's your home exercise that is going to require some coding.
This https://github.com/databricks/spark-avro library has no example for the streaming case.
I've been told (and seen a couple of questions here) that spark-avro does not support Spark Structured Streaming (aka Spark Streams). It works fine with non-streaming Datasets, but can't handle streaming ones.
That's why I wrote that this is something you have to code yourself.
That could look as follows (I use Scala for simplicity):
// Step 1. convert messages to be strings
val avroMessages = df.select($"value" cast "string")
// Step 2. Strip the avro layer off
val from_avro = udf { (s: String) => ...processing here... }
val cleanDataset = avroMessages.withColumn("no_avro_anymore", from_avro($"value"))
That would require developing a from_avro custom UDF that would do what you want (and would be similar to how Spark handles JSON format using from_json standard function!)
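For illustration only (not the answer's actual code), here is a minimal sketch of what such a UDF could look like, assuming plain Avro-encoded bytes with a known writer schema; Confluent's wire format would additionally require stripping the 5-byte magic-byte/schema-id prefix first:
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.sql.functions.udf

// hypothetical writer schema, just for illustration
val schemaJson = """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"}]}"""

// deserialize raw Avro bytes and return the record as a JSON string,
// which could then be parsed with from_json as mentioned above
val from_avro = udf { (bytes: Array[Byte]) =>
  val schema = new Schema.Parser().parse(schemaJson)
  val reader = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  reader.read(null, decoder).toString
}

// note: this operates on the raw binary value column rather than the string cast used above
val decodedDataset = df.withColumn("no_avro_anymore", from_avro($"value"))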
Alternatively (a slightly more advanced and perhaps convoluted approach), write your own custom streaming Source for datasets in Avro format in Kafka and use it instead.
Dataset<Row> df = sparkSession.readStream()
    .format("avro-kafka") // <-- HERE YOUR CUSTOM Source
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic1")
    .load();
I'm yet to find out how doable the avro-kafka format is. It is indeed doable, but it does two things at once, i.e. reading from Kafka and doing the Avro conversion, and I am not convinced that's the way to do things in Spark Structured Streaming and in software engineering in general. I wish there were a way to apply one format after another, but that's not possible in Spark 2.2.1 (and it is not planned for 2.3 either).
I think then that a UDF is the best solution for the time being.
Just a thought, you could also write a custom Kafka Deserializer that would do the deserialization while Spark loads messages.

How to stop streaming from a Kafka topic after a limited time duration or record count?

My ultimate goal is to see if a Kafka topic is running and if the data in it is good, otherwise fail / throw an error.
If I could pull just 100 messages, or pull for just 60 seconds, I think I could accomplish what I wanted. But all the streaming examples / questions I have found online have no intention of shutting down the streaming connection.
Here is the best working code I have so far. It pulls data and displays it, but it keeps trying to pull more data, and if I try to access it on the next line, it hasn't had a chance to pull the data yet. I assume I need some sort of callback. Has anyone done something similar? Is this the best way of going about this?
I am using Databricks notebooks to run my code.
import org.apache.spark.sql.functions.{explode, split}

val kafka = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "<kafka server>:9092")
  .option("subscribe", "<topic>")
  .option("startingOffsets", "earliest")
  .load()

val df = kafka.select(explode(split($"value".cast("string"), "\\s+")).as("word"))

display(df.select($"word"))
The trick is that you don't need streaming at all. The Kafka source supports batch queries if you replace readStream with read and adjust startingOffsets and endingOffsets.
val df = spark
.read
.format("kafka")
... // Remaining options
.load()
You can find examples in the Kafka streaming documentation.
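For example, a minimal sketch of such a batch query (reusing the broker and topic placeholders from the question): the read is bounded, finishes on its own, and you can then sample a fixed number of records.
val batchDf = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "<kafka server>:9092")
  .option("subscribe", "<topic>")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")  // bounded: the query finishes on its own
  .load()

display(batchDf.limit(100))  // inspect the first 100 records in the notebook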
For streaming queries you can use once trigger, although it might not be the best choice in this case:
df.writeStream
.trigger(Trigger.Once)
... // Handle the output, for example with foreach sink (?)
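A sketch of the once-trigger variant (the memory sink, query name, and checkpoint path are hypothetical choices for inspecting a small sample): the query processes whatever is currently available and then stops on its own.
import org.apache.spark.sql.streaming.Trigger

val query = kafka.writeStream
  .trigger(Trigger.Once())
  .format("memory")                                 // keep a small sample you can query afterwards
  .queryName("kafka_probe")                         // hypothetical name
  .option("checkpointLocation", "/tmp/kafka-probe") // hypothetical checkpoint path
  .start()

query.awaitTermination()  // returns once the single batch has been processed
display(spark.table("kafka_probe").limit(100))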
You could also use the standard Kafka client to fetch some data without starting a SparkSession.
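A rough sketch of that last option (plain Kafka consumer API, no Spark; the consumer group id is hypothetical and the broker/topic placeholders are reused from the question): poll until 100 records have been collected or 60 seconds have passed, then close the consumer.
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "<kafka server>:9092")
props.put("group.id", "topic-smoke-test")  // hypothetical consumer group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("<topic>"))

val deadline = System.currentTimeMillis() + 60000L  // stop after 60 seconds
val records = scala.collection.mutable.Buffer.empty[String]
while (records.size < 100 && System.currentTimeMillis() < deadline) {
  consumer.poll(Duration.ofSeconds(1)).iterator().asScala.foreach(r => records += r.value())
}
consumer.close()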
