Sink from Delta Live Table to Kafka, initial sink works, but any subsequent updates fail - apache-spark

I have a DLT pipeline that ingests a topic from my kafka stream, transforms it into a DLT, then I wish to write that table back into Kafka under a new topic.
So far, I have this working, however it only works on first load of the table, then after any subsequent updates will crash my read and write stream.
My DLT tables updates correctly, so I see updates from my pipeline into the Gold table,
CREATE OR REFRESH LIVE TABLE deal_gold1
TBLPROPERTIES ("quality" = "gold")
COMMENT "Gold Deals"
AS SELECT
documentId,
eventTimestamp,
substring(fullDocument.owner_id, 11, 24) as owner_id,
fullDocument.owner_type as owner_type,
substring(fullDocument.account_id, 11, 24) as account_id,
substring(fullDocument.manager_account_id, 11, 24) as manager_account_id,
fullDocument.hubspot_deal_id as hubspot_deal_id,
fullDocument.stage as stage,
fullDocument.status as status,
fullDocument.title as title
FROM LIVE.deal_bronze_cleansed
but then when I try to read from it via a separate notebook, these updates cause it to crash
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType
# this one is the problem not the write stream
df = spark.readStream.format("delta").table("deal_stream_test.deal_gold1")
display(df)
writeStream= (
df
.selectExpr("CAST(documentId AS STRING) AS key", "to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.outputMode("append")
.option("ignoreChanges", "true")
.option("checkpointLocation", "/tmp/benperram21/checkpoint")
.option("kafka.bootstrap.servers", confluentBootstrapServers)
.option("ignoreChanges", "true")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
.option("kafka.ssl.endpoint.identification.algorithm", "https")
.option("kafka.sasl.mechanism", "PLAIN")
.option("topic", confluentTopicName)
.start()
)
I was looking and can see this might be as a result of it not being read as "Append". But yeah any thoughts on this? Everything works upset updates.

Right now DLT doesn't support output to the arbitrary sinks. Also, all Spark operations should be done inside the nodes of the execution graph (functions labeled with dlt.table or dlt.view).
Right now the workaround would be to run that notebook outside of the DLT pipeline, as a separate task in the multitask job (workflow).

Related

How to stream data from Delta Table to Kafka Topic

Internet is filled with examples of streaming data from Kafka topic to delta tables. But my requirement is to stream data from Delta Table to Kafka topic. Is that possible? If yes, can you please share code example?
Here is the code I tried.
val schemaRegistryAddr = "https://..."
val avroSchema = buildSchema(topic) //defined this method
val Df = spark.readStream.format("delta").load("path..")
.withColumn("key", col("lskey").cast(StringType))
.withColumn("topLevelRecord",struct(col("col1"),col("col2")...)
.select(
to_avro($"key", lit("topic-key"), schemaRegistryAddr).as("key"),
to_avro($"topLevelRecord", lit("topic-value"), schemaRegistryAddr, avroSchema).as("value"))
Df.writeStream
.format("kafka")
.option("checkpointLocation",checkpointPath)
.option("kafka.bootstrap.servers", bootstrapServers)
.option("kafka.security.protocol", "SSL")
.option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
.option("kafka.ssl.keystore.password", keystorePassword)
.option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
.option("topic",topic)
.option("batch.size",262144)
.option("linger.ms",5000)
.trigger(ProcessingTime("25 seconds"))
.start()
But it fails with: org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
But when I try to write to the same topic using a Batch Producer it goes through successfully. Can anyone please let me know what am I missing in the streaming write to Kafka topic?
Later I found this old blog which says that current Structured Streaming API does not support 'kakfa' format.
https://www.databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html?_ga=2.177174565.1658715673.1672876248-681971438.1669255333

Continous data generator from Azure Databricks to Azure Event Hubs using Spark with Kafka API but no data is streamed

I'm trying to implement a continuous data generator from Databricks to an Event Hub.
My idea was to generate some data in a .csv file and then create a data frame with the data. In a loop I call a function that executes a query to stream that data to the Event Hub. Not sure if the idea was good or if spark can handle writing from the same data frame or if I understood correctly how queries work.
The code looks like this:
def write_to_event_hub(
df: DataFrame,
topic: str,
bootstrap_servers: str,
config: str,
checkpoint_path: str,
):
return (
df.writeStream.format("kafka")
.option("topic", topic)
.option("kafka.bootstrap.servers", bootstrap_servers)
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.jaas.config", config)
.option("checkpointLocation", checkpoint_path)
.trigger(once=True)
.start()
)
while True:
query = write_to_event_hub(
streaming_df,
topic,
bootstrap_servers,
sasl_jaas_config,
"/checkpoint",
)
query.awaitTermination()
print("Wrote once")
time.sleep(5)
I want to mention that this is how I read data from the CSV file (I have it in DBFS) and I also have the schema for it:
streaming_df = (
spark.readStream.format("csv")
.option("header", "true")
.schema(location_schema)
.load(f"{path}")
)
It looks like no data is written event though I have the message "Wrote once" printed. Any ideas how to handle this? Thank you!
The problem is that you're using readStream to get the CSV data, so it will wait until new data will be pushed to the directory with CSV files. But really, you don't need to use readStream/writeStream - Kafka connector works just fine in batch mode, so your code should be:
df = read_csv_file()
while True:
write_to_kafka(df)
sleep(5)

Change filter/where condition when restarting a Structured Streaming query reading data from Delta Table

In Structured Streaming, will the checkpoints keep track of which data has already been processed from a Delta Table?
def fetch_data_streaming(source_table: str):
print("Fetching now")
streamingInputDF = (
spark
.readStream
.format("delta")
.option("maxBytesPerTrigger",1024)
.table(source_table)
.where("measurementId IN (1351,1350)")
.where("year >= '2021'")
)
query = (
streamingInputDF
.writeStream
.outputMode("append")
.option("checkpointLocation", "/streaming_checkpoints/5")
.foreachBatch(customWriter)
.start()
.awaitTermination()
)
return query
def customWriter(batchDF,batchId):
print(batchId)
print(batchDF.count())
batchDF.show(10)
length = batchDF.count()
print("batchId,batch size:",batchId,length)
If I change the where clause in the streamingInputDF to add more measurentId, the structured streaming job doesn't always acknowledge the change and fetch the new data values. It continues to run as if nothing has changed, whereas at times it starts fetching new values.
Isn't the checkpoint supposed to identify the change?
Edit: Schema of delta table:
col_name
data_type
measurementId
int
year
int
time
timestamp
q
smallint
v
string
"In structured streaming, will the checkpoints will keep track of which data has already been processed?"
Yes, the Structured Streaming job will store the read version of the Delta table in its checkpoint files to avoid producing duplicates.
Within the checkpoint directory in the folder "offsets", you will see that Spark stored the progress per batchId. For example it will look like below:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1619695775288,"conf":[...]}
{"sourceVersion":1,"reservoirId":"d910a260-6aa2-4a7c-9f5c-1be3164127c0","reservoirVersion":2,"index":2,"isStartingVersion":true}
Here, the important part is the "reservoirVersion":2 which tells you that the streaming job has consumed all data from the Delta Table as of version 2.
Re-starting your Structured Streaming query with an additional filter condition will therefore not be applied to historic records but only to those that were added to the Delta Table after version 2.
In order to see this behavior in action you can use below code and analyse the content in the checkpoint files.
val deltaPath = "file:///tmp/delta/table"
val checkpointLocation = "file:///tmp/checkpoint/"
// run the following two lines once
val deltaDf = Seq(("1", "foo1"), ("2", "foo2"), ("3", "foo2")).toDF("id", "value")
deltaDf.write.format("delta").mode("append").save(deltaPath)
// run this code for the first time, then add filter condition, then run again
val query = spark.readStream
.format("delta")
.load(deltaPath)
.filter(col("id").isin("1")) // in the second run add "2"
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
query.awaitTermination()
Now, if you append some more data to the Delta table while the streaming query is shut down and then restart is with the new filter condition it will be applied to the new data.

Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic
code:
1- Reading two topics
val PERSONINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xx:9092")
.option("subscribe", "PERSONINFORMATION")
.option("group.id", "info")
.option("maxOffsetsPerTrigger", 1000)
.option("startingOffsets", "earliest")
.load()
val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xxx:9092")
.option("subscribe", "CANDIDATEINFORMATION")
.option("group.id", "candent")
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.option("failOnDataLoss", "false")
.load()
2- Parse data to join them:
val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s")).select("s.*")
val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s")).select("s.*")
val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
3- Join two frames
val joined_df : DataFrame = df_candidate.join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"),"inner")
val string2json: DataFrame = joined_df.select($"dfcandidate.ID".as("key"),to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))
4- Write them to a topic
string2json.writeStream.format("kafka")
.option("kafka.bootstrap.servers", xxxx:9092")
.option("topic", "toDelete")
.option("checkpointLocation", "checkpoints")
.option("failOnDataLoss", "false")
.start()
.awaitTermination()
Error message:
21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.
Your code looks fine to me, it is rather the checkpointing that is causing the issue.
Based on the error message you are getting you probably ran this job with only one stream source. Then, you added the code for the stream join and tried to re-start the application without remiving existing checkpoint files. Now, the application tries to recover from the checkpoint files but realises that you initially had only one source and now you have two sources.
The section Recovery Semantics after Changes in a Streaming Query explains which changes are allowed and not allowed when using checkpointing. Changing the number of input sources is not allowed:
"Changes in the number or type (i.e. different source) of input sources: This is not allowed."
To solve your problem: Delete the current checkpoint files and re-start the job.

groupby ideal strategy in Spark Streaming

I am reading data using Spark Streaming from a Kafka Source, from where I create a dataframe with columns wsid, year, month, day, oneHourPrecip:
val df = spark.readStream
.format("kafka")
.option("subscribe", "raw_weather")
.option("kafka.bootstrap.servers", "<host1:port1,host2:port2>...")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.mechanism" , "PLAIN")
.option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"token\" password=\"" + "<some password>" + "\";")
.option("kafka.ssl.protocol", "TLSv1.2")
.option("kafka.ssl.enabled.protocols", "TLSv1.2")
.option("kafka.ssl.endpoint.identification.algorithm", "HTTPS")
.load()
.selectExpr("CAST(value as STRING)")
.as[String]
.withColumn("_tmp", split(col("value"), "\\,"))
.select(
$"_tmp".getItem(0).as("wsid"),
$"_tmp".getItem(1).as("year").cast("int"),
$"_tmp".getItem(2).as("month").cast("int"),
$"_tmp".getItem(3).as("day").cast("int"),
$"_tmp".getItem(11).as("oneHourPrecip").cast("double")
)
.drop("_tmp")
I then perform a groupby and then try to write this stream data into a table using JDBC. For that purpose, this is my code:
val query= df.writeStream
.outputMode(OutputMode.Append())
.foreachBatch((df: DataFrame , id: Long) => {
println(df.count())
df.groupBy($"wsid" , $"year" , $"month" , $"day")
.agg(sum($"oneHourPrecip").as("precipitation"))
.write
.mode(SaveMode.Append)
.jdbc(url , s"$schema.$table" , getProperties)
})
.trigger(Trigger.ProcessingTime(1))
.start()
The problem comes with the batch. With Spark Streaming, we cannot predict the number of rows that come every batch in a dataframe. So quite a lot of times, I get data that is disjointed (ie. for the given common values (wsid,year,month,day), some rows appear in one batch while some others appear in another batch).
Then when I groupby and try to write it using JDBC, this is the error I get:
com.ibm.db2.jcc.am.BatchUpdateException: [jcc][t4][102][10040][4.25.13] Batch failure. The batch was submitted, but at least one exception occurred on an individual member of the batch.
Use getNextException() to retrieve the exceptions for specific batched elements. ERRORCODE=-4229, SQLSTATE=null
at com.ibm.db2.jcc.am.b6.a(b6.java:502)
at com.ibm.db2.jcc.am.Agent.endBatchedReadChain(Agent.java:434)
at com.ibm.db2.jcc.am.k4.a(k4.java:5452)
at com.ibm.db2.jcc.am.k4.c(k4.java:5026)
at com.ibm.db2.jcc.am.k4.executeBatch(k4.java:3058)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:672)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: com.ibm.db2.jcc.am.SqlIntegrityConstraintViolationException: Error for batch element #1: DB2 SQL Error: SQLCODE=-803, SQLSTATE=23505, SQLERRMC=1;SPARK.DAILY_PRECIPITATION_DATA, DRIVER=4.25.13
at com.ibm.db2.jcc.am.b6.a(b6.java:806)
at com.ibm.db2.jcc.am.b6.a(b6.java:66)
at com.ibm.db2.jcc.am.b6.a(b6.java:140)
at com.ibm.db2.jcc.t4.ab.a(ab.java:1283)
at com.ibm.db2.jcc.t4.ab.a(ab.java:128)
at com.ibm.db2.jcc.t4.p.a(p.java:57)
at com.ibm.db2.jcc.t4.aw.a(aw.java:225)
at com.ibm.db2.jcc.am.k4.a(k4.java:3605)
at com.ibm.db2.jcc.am.k4.d(k4.java:6020)
at com.ibm.db2.jcc.am.k4.a(k4.java:5372)
... 17 more
As evident from the SqlIntegrityConstraintViolationException above, it is because after one batch writes the groupbyed values using JDBC, insertion for the next set of values fail because of the primary key (wsid,year,month,day).
Given that there will be a fixed number of oneHourPrecip values (24) for a given (wsid,year,month,day) from the source, how do we ensure that groupBy works properly for all data that is streamed from the source, so that insertion into Database is not a problem?
SaveMode.Upsert is not available :-)
There is nothing to do with groupBy. group by just groups the values. integrity violation (com.ibm.db2.jcc.am.SqlIntegrityConstraintViolationException) you need to take care at sql level.
Option 1:
You can do insert update to avoid integrety violation.
for this you need to use like below pseudo code...
dataframe.foreachPartition {
update TABLE_NAME set FIELD_NAME=xxxxx where MyID=XXX;
INSERT INTO TABLE_NAME values (colid,col1,col2)
WHERE NOT EXISTS(select 1 from TABLE_NAME where colid=xxxx);
}
Option 2 :
or check merge statement in db2
one way is create a empty temp table ( with out any connstraints) which has same schema and populate it and at the end you can execute a script which will merge in to the target
table.
I did figure something out, but this may have some performance concerns. Anyways, it worked for me so am posting the answer:
I figured out that in order to store a groupbyed data into a DB2 table, we would have to wait until we retrieve all the data from the source. For that, I utilize OutputMode.Complete().
Then I realized if I were to write it into DB2 after grouping in the current method, it would still throw me the same error. For that, I had to use SaveMode.Overwrite inside foreachBatch.
I tried running my program with this approach, but it threw this error:
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets
So I decided to do groupby and aggregation during readStream itself. Thus my code looks like this:
readStream part:
val df = spark.readStream
.format("kafka")
.option("subscribe", "raw_weather")
.option("kafka.bootstrap.servers", "<host1:port1,host2:port2>...")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.mechanism" , "PLAIN")
.option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"token\" password=\"" + "<some password>" + "\";")
.option("kafka.ssl.protocol", "TLSv1.2")
.option("kafka.ssl.enabled.protocols", "TLSv1.2")
.option("kafka.ssl.endpoint.identification.algorithm", "HTTPS")
.load()
.selectExpr("CAST(value as STRING)")
.as[String]
.withColumn("_tmp", split(col("value"), "\\,"))
.select(
$"_tmp".getItem(0).as("wsid"),
$"_tmp".getItem(1).as("year").cast("int"),
$"_tmp".getItem(2).as("month").cast("int"),
$"_tmp".getItem(3).as("day").cast("int"),
$"_tmp".getItem(11).as("oneHourPrecip").cast("double")
)
.drop("_tmp")
.groupBy($"wsid" , $"year" , $"month" , $"day")
.agg(sum($"oneHourPrecip").as("precipitation"))
writeStream part:
val query= df.writeStream
.outputMode(OutputMode.Complete())
.foreachBatch((df: DataFrame , id: Long) => {
println(df.count())
df.write
.mode(SaveMode.Overwrite)
.jdbc(url , s"$schema.$table" , getProperties)
})
.trigger(Trigger.ProcessingTime(1))
.start()
query.awaitTermination()

Resources