Kafka topic message's timestamp cannot be merged with the timeStamp field inside the Kafka message value - apache-spark

Hello everyone, the above picture is how Databricks recommends connecting Kafka topic messages to a Delta Live Table (DLT). My current situation is that I have a Kafka topic I am subscribed to, but I cannot load it directly into a Delta Live Table.
My code is as below:
import dlt

@dlt.table(name = "tableOne", table_properties={"pipelines.reset.allowed":"false"})
def stream_bronze():
    return (
        spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafkaserver address") #this is the kafka server address
        .option("subscribe", "topic_name") #this is the topic name of the kafka server
        .option("kafka.security.protocol", "SASL_SSL") #in order to properly "subscribe" to a kafka topic, you need to supply the proper security protocols
        .option("kafka.sasl.mechanism", "PLAIN") #like above
        .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="api key" password="secret";""") #this is where you input the Kafka API key (username) and the Kafka API secret (password)
        .load()
    )
When I run the above code, I get an error saying there is an issue merging the timestamp of the Kafka message with the timeStamp inside the Kafka message's value (i.e. the content of the message, which is JSON).
The Kafka schema, which is fixed, looks like this:
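(For reference, the Spark Structured Streaming Kafka source always returns this fixed set of columns, which is where the TimestampType column comes from:)
key: binary
value: binary
topic: string
partition: int
offset: long
timestamp: timestamp
timestampType: int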
And the schema of the details inside value looks like this:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("event", StringType()),
    StructField("ecommerce", StringType()),
    StructField("gtm.uniqueEventId", StringType()),
    StructField("eventId", StringType()),
    StructField("eventType", StringType()),
    StructField("pagePath", StringType()),
    StructField("pageType", StringType()),
    StructField("isRegistered", StringType()),
    StructField("memberId", StringType()),
    StructField("sessionId", StringType()),
    StructField("campaignId", StringType()),
    StructField("elabRecommType", StringType()),
    StructField("elabRecommId", StringType()),
    StructField("gaUserId", StringType()),
    StructField("gaSessionId", StringType()),
    StructField("siteCode", StringType()),
    StructField("timeStamp", StringType()),
])
The issue is that timestamp from Kafka and timeStamp from the schema of the JSON inside the Kafka message's value have different types: one is TimestampType and the other is StringType.
The error message that I get is:
org.apache.spark.sql.AnalysisException: Failed to merge fields 'timeStamp' and 'timestamp'. Failed to merge incompatible data types StringType and TimestampType
Of course, it would be great if I could modify the type directly at the Kafka message producer, but I do not have that kind of permission. Is there a way for me to stream the Kafka messages into a Delta Live Table successfully?
Any help is appreciated. Thanks!
Edit:
I run the following code to apply the schema, but this is not really what is blocking me; even with this bit taken out, simply asking the DLT to return the Kafka readStream fails with the error mentioned above.
wtchk_value_df = wtchk_kafka_df.select(from_json(col("value").cast("string"), schema).alias("value"))
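For what it's worth, here is a minimal sketch of one way around the conflict (not a confirmed fix): rename the Kafka metadata column before DLT materializes the table, so the TimestampType timestamp and the JSON's StringType timeStamp never end up under names that differ only in case. It reuses the schema defined above; the kafka_timestamp name is just an illustration.

import dlt
from pyspark.sql.functions import col, from_json

@dlt.table(name = "tableOne", table_properties={"pipelines.reset.allowed":"false"})
def stream_bronze():
    raw = (
        spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafkaserver address")
        .option("subscribe", "topic_name")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="api key" password="secret";""")
        .load()
    )
    return (
        raw
        # keep the Kafka metadata timestamp under a name that cannot clash with the JSON field
        .withColumnRenamed("timestamp", "kafka_timestamp")
        # parse the JSON payload; its timeStamp stays a StringType nested under value
        .select(col("kafka_timestamp"), from_json(col("value").cast("string"), schema).alias("value"))
    )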

Related

Spark Streaming HUDI HoodieException: Config conflict(key current value existing value): RecordKey:

I am connecting to the Kafka topic with Spark, creating the dataframe, and then storing it into Hudi:
df
  .selectExpr("key", "topic", "partition", "offset", "timestamp", "timestampType", "CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("hudi")
  .options(getQuickstartWriteConfigs)
  .option(PRECOMBINE_FIELD.key(), "essDateTime")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option(RECORDKEY_FIELD.key(), "offset,timestamp") // "offset,essDateTime"
  .option(TBL_NAME.key, streamingTableName)
  .option("path", baseStreamingPath)
  .trigger(ProcessingTime(10000))
  .outputMode("append")
  .option("checkpointLocation", checkpointLocation)
  .start()
I am getting the following exception:
[ERROR] 2023-01-31 09:35:25.474 [stream execution thread for [id = 8b30fd4b-8506-490b-80ad-76868c14594f, runId = 25d34e6f-10e2-42c2-b094-654797f5d79c]] HoodieStreamingSink - Micro batch id=1 threw following exception:
org.apache.hudi.exception.HoodieException: Config conflict(key current value existing value):
RecordKey: offset,timestamp uuid
KeyGenerator: org.apache.hudi.keygen.ComplexKeyGenerator org.apache.hudi.keygen.SimpleKeyGenerator
at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:167) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:90) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:129) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at scala.util.Try$.apply(Try.scala:213) ~[scala-library-2.12.15.jar:?]
at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:128) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:214) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:127) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:666) ~[spark-sql_2.12-3.3.1.jar:3.3.1]
I want to store all the Kafka data into a Hudi table.
In Apache Hudi, there are some configurations which you cannot override once the table exists, like the KeyGenerator. It seems you have already written to the table with org.apache.hudi.keygen.SimpleKeyGenerator, so you need to recreate the table to change this config and the record/partition keys.
If you want a quick test, you can change baseStreamingPath to write the data into a new Hudi table, as in the sketch below.
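A rough illustration of that quick test (a PySpark sketch rather than the original Scala; the path, table name, and precombine field are placeholders, and the option keys are Hudi's documented hoodie.datasource.write.* equivalents of the constants used above):

new_base_path = "s3://my-bucket/hudi/kafka_events_v2"   # fresh path => a brand-new Hudi table, no config conflict

(df
  .selectExpr("offset", "timestamp", "CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("hudi")
  .option("hoodie.table.name", "kafka_events_v2")
  .option("hoodie.datasource.write.recordkey.field", "offset,timestamp")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.write.precombine.field", "timestamp")
  .option("checkpointLocation", "/tmp/hudi_kafka_checkpoint_v2")   # new checkpoint to go with the new table
  .option("path", new_base_path)
  .outputMode("append")
  .start())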

How to stream data from Delta Table to Kafka Topic

The internet is filled with examples of streaming data from a Kafka topic to Delta tables, but my requirement is to stream data from a Delta table to a Kafka topic. Is that possible? If yes, can you please share a code example?
Here is the code I tried.
val schemaRegistryAddr = "https://..."
val avroSchema = buildSchema(topic) //defined this method
val Df = spark.readStream.format("delta").load("path..")
  .withColumn("key", col("lskey").cast(StringType))
  .withColumn("topLevelRecord", struct(col("col1"), col("col2"), ...))
  .select(
    to_avro($"key", lit("topic-key"), schemaRegistryAddr).as("key"),
    to_avro($"topLevelRecord", lit("topic-value"), schemaRegistryAddr, avroSchema).as("value"))

Df.writeStream
  .format("kafka")
  .option("checkpointLocation", checkpointPath)
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("kafka.security.protocol", "SSL")
  .option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
  .option("kafka.ssl.keystore.password", keystorePassword)
  .option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
  .option("topic", topic)
  .option("batch.size", 262144)
  .option("linger.ms", 5000)
  .trigger(ProcessingTime("25 seconds"))
  .start()
But it fails with: org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
But when I try to write to the same topic using a batch producer, it goes through successfully. Can anyone please let me know what I am missing in the streaming write to the Kafka topic?
Later I found this old blog, which says that the current Structured Streaming API does not support the 'kafka' format:
https://www.databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html?_ga=2.177174565.1658715673.1672876248-681971438.1669255333
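(As an aside, not an answer from this thread: if the schema-registry lookup is what fails, the same Delta-to-Kafka write can be sketched with the open-source to_avro that takes the Avro schema as a plain JSON string, sidestepping the registry. Everything below — the schema, paths, broker, and topic names — is a placeholder, and the spark-avro package must be on the classpath.)

from pyspark.sql.functions import col, struct
from pyspark.sql.avro.functions import to_avro

# Placeholder Avro schema for the value record; in practice it should match the registry subject.
value_avro_schema = """
{"type": "record", "name": "topLevelRecord",
 "fields": [{"name": "col1", "type": ["null", "string"]},
            {"name": "col2", "type": ["null", "string"]}]}
"""

stream_df = (spark.readStream.format("delta").load("/path/to/delta/table")
             .select(
                 col("lskey").cast("string").alias("key"),
                 to_avro(struct("col1", "col2"), value_avro_schema).alias("value")))

(stream_df.writeStream
    .format("kafka")
    .option("checkpointLocation", "/tmp/delta_to_kafka_checkpoint")
    .option("kafka.bootstrap.servers", "broker1:9093")
    .option("kafka.security.protocol", "SSL")
    .option("topic", "my-topic")
    .start())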

Spark always broadcasts tables greater than spark.sql.autoBroadcastJoinThreshold when performing a streaming merge on a DeltaTable sink

I am trying to do a streaming merge between delta tables using this guide - https://docs.delta.io/latest/delta-update.html#upsert-from-streaming-queries-using-foreachbatch
Our Code Sample (Java):
Dataset<Row> sourceDf = sparkSession
    .readStream()
    .format("delta")
    .option("inferSchema", "true")
    .load(sourcePath);

DeltaTable deltaTable = DeltaTable.forPath(sparkSession, targetPath);

sourceDf.createOrReplaceTempView("vTempView");

StreamingQuery sq = sparkSession.sql("select * from vTempView").writeStream()
    .format("delta")
    .foreachBatch((microDf, id) -> {
        deltaTable.alias("e").merge(microDf.alias("d"), "e.SALE_ID = d.SALE_ID")
            .whenMatched().updateAll()
            .whenNotMatched().insertAll()
            .execute();
    })
    .outputMode("update")
    .option("checkpointLocation", util.getFullS3Path(target)+"/_checkpoint")
    .trigger(Trigger.Once())
    .start();
Problem:
Here the source path and target path are already in sync via the checkpoint folder. The target has around 8 million rows of data, amounting to around 450 MB of Parquet files.
When new data arrives in the source path (let's say 987 rows), the above code picks it up and performs a merge with the target table. During this operation Spark tries to perform a BroadcastHashJoin and broadcasts the target table, which has 8M rows.
Here's a DAG snippet for the merge operation (with a table of 1M rows):
Expectation:
I am expecting the smaller dataset (i.e. 987 rows) to be broadcast. If not, then at least Spark should not broadcast the target table, as it is larger than the configured spark.sql.autoBroadcastJoinThreshold and we are not providing any broadcast hint anywhere.
Things I have tried:
I searched around and found this article - https://learn.microsoft.com/en-us/azure/databricks/kb/sql/bchashjoin-exceeds-bcjointhreshold-oom.
It provides 2 solutions:
Run "ANALYZE TABLE ..." (but since we are reading the target table from a path and not from a table, this is not possible).
Cache the table you are broadcasting; DeltaTable does not have any provision to cache a table, so we can't do this.
I thought this was because we are using the DeltaTable.forPath() method for reading the target table and Spark is unable to calculate the target table metrics. So I also tried a different approach:
Dataset<Row> sourceDf = sparkSession
    .readStream()
    .format("delta")
    .option("inferSchema", "true")
    .load(sourcePath);

Dataset<Row> targetDf = sparkSession
    .read()
    .format("delta")
    .option("inferSchema", "true")
    .load(targetPath);

sourceDf.createOrReplaceTempView("vtempview");
targetDf.createOrReplaceTempView("vtemptarget");
targetDf.cache();

StreamingQuery sq = sparkSession.sql("select * from vtempview").writeStream()
    .format("delta")
    .foreachBatch((microDf, id) -> {
        microDf.createOrReplaceTempView("vtempmicrodf");
        microDf.sparkSession().sql(
            "MERGE INTO vtemptarget as t USING vtempmicrodf as s ON t.SALE_ID = s.SALE_ID WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT * "
        );
    })
    .outputMode("update")
    .option("checkpointLocation", util.getFullS3Path(target)+"/_checkpoint")
    .trigger(Trigger.Once())
    .start();
In the above snippet I am also caching targetDf so that Spark can calculate metrics and not broadcast the target table, but it didn't help and Spark still broadcasts it.
Now I am out of options. Can anyone give me some guidance on this?
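(Not from the thread, but for context: one knob that is commonly tried for this symptom is disabling automatic broadcast joins for the session running the merge. Whether Delta's merge planner honors it depends on the Spark/Delta version, so treat this only as a sketch of the experiment, shown here in Python.)

# Sketch only: turn off size-based broadcast joins before triggering the streaming merge.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# After the next micro-batch, the merge's physical plan (Spark UI, SQL tab) should show a
# SortMergeJoin instead of a BroadcastHashJoin if the setting took effect.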

Sink from Delta Live Table to Kafka, initial sink works, but any subsequent updates fail

I have a DLT pipeline that ingests a topic from my Kafka stream and transforms it into a DLT table, and then I wish to write that table back into Kafka under a new topic.
So far I have this working, however it only works on the first load of the table; any subsequent updates then crash my read and write streams.
My DLT table updates correctly, so I see updates from my pipeline flowing into the Gold table:
CREATE OR REFRESH LIVE TABLE deal_gold1
TBLPROPERTIES ("quality" = "gold")
COMMENT "Gold Deals"
AS SELECT
documentId,
eventTimestamp,
substring(fullDocument.owner_id, 11, 24) as owner_id,
fullDocument.owner_type as owner_type,
substring(fullDocument.account_id, 11, 24) as account_id,
substring(fullDocument.manager_account_id, 11, 24) as manager_account_id,
fullDocument.hubspot_deal_id as hubspot_deal_id,
fullDocument.stage as stage,
fullDocument.status as status,
fullDocument.title as title
FROM LIVE.deal_bronze_cleansed
but then when I try to read from it via a separate notebook, these updates cause it to crash
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType
# this one is the problem not the write stream
df = spark.readStream.format("delta").table("deal_stream_test.deal_gold1")
display(df)
writeStream = (
    df
    .selectExpr("CAST(documentId AS STRING) AS key", "to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .outputMode("append")
    .option("ignoreChanges", "true")
    .option("checkpointLocation", "/tmp/benperram21/checkpoint")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("ignoreChanges", "true")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("topic", confluentTopicName)
    .start()
)
I was looking around and can see this might be a result of it not being read as "Append". But yeah, any thoughts on this? Everything works except updates.
Right now DLT doesn't support output to arbitrary sinks. Also, all Spark operations should be done inside the nodes of the execution graph (functions labelled with the dlt.table or dlt.view decorators).
Right now the workaround would be to run that notebook outside of the DLT pipeline, as a separate task in a multitask job (workflow), along the lines of the sketch below.
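A minimal sketch of that standalone task (reading the table the pipeline publishes and pushing it to Kafka; the table name is taken from the question, while the checkpoint path, credentials, and topic are placeholders):

# Runs outside the DLT pipeline, e.g. as a second task in the same workflow.
src = (spark.readStream
       .format("delta")
       .option("ignoreChanges", "true")      # tolerate the pipeline's update/rewrite commits
       .table("deal_stream_test.deal_gold1"))

(src.selectExpr("CAST(documentId AS STRING) AS key", "to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<confluent bootstrap servers>")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
            "username='<api key>' password='<secret>';")
    .option("topic", "<target topic>")
    .option("checkpointLocation", "/tmp/dlt_to_kafka_checkpoint")
    .outputMode("append")
    .start())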

groupby ideal strategy in Spark Streaming

I am reading data using Spark Streaming from a Kafka source, from which I create a dataframe with the columns wsid, year, month, day, oneHourPrecip:
val df = spark.readStream
  .format("kafka")
  .option("subscribe", "raw_weather")
  .option("kafka.bootstrap.servers", "<host1:port1,host2:port2>...")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"token\" password=\"" + "<some password>" + "\";")
  .option("kafka.ssl.protocol", "TLSv1.2")
  .option("kafka.ssl.enabled.protocols", "TLSv1.2")
  .option("kafka.ssl.endpoint.identification.algorithm", "HTTPS")
  .load()
  .selectExpr("CAST(value as STRING)")
  .as[String]
  .withColumn("_tmp", split(col("value"), "\\,"))
  .select(
    $"_tmp".getItem(0).as("wsid"),
    $"_tmp".getItem(1).as("year").cast("int"),
    $"_tmp".getItem(2).as("month").cast("int"),
    $"_tmp".getItem(3).as("day").cast("int"),
    $"_tmp".getItem(11).as("oneHourPrecip").cast("double")
  )
  .drop("_tmp")
I then perform a groupby and then try to write this stream data into a table using JDBC. For that purpose, this is my code:
val query = df.writeStream
  .outputMode(OutputMode.Append())
  .foreachBatch((df: DataFrame, id: Long) => {
    println(df.count())
    df.groupBy($"wsid", $"year", $"month", $"day")
      .agg(sum($"oneHourPrecip").as("precipitation"))
      .write
      .mode(SaveMode.Append)
      .jdbc(url, s"$schema.$table", getProperties)
  })
  .trigger(Trigger.ProcessingTime(1))
  .start()
The problem comes with the batches. With Spark Streaming, we cannot predict the number of rows that arrive in each micro-batch's dataframe. So quite often I get data that is disjointed (i.e. for the same common values (wsid, year, month, day), some rows appear in one batch while others appear in another batch).
Then when I groupby and try to write it using JDBC, this is the error I get:
com.ibm.db2.jcc.am.BatchUpdateException: [jcc][t4][102][10040][4.25.13] Batch failure. The batch was submitted, but at least one exception occurred on an individual member of the batch.
Use getNextException() to retrieve the exceptions for specific batched elements. ERRORCODE=-4229, SQLSTATE=null
at com.ibm.db2.jcc.am.b6.a(b6.java:502)
at com.ibm.db2.jcc.am.Agent.endBatchedReadChain(Agent.java:434)
at com.ibm.db2.jcc.am.k4.a(k4.java:5452)
at com.ibm.db2.jcc.am.k4.c(k4.java:5026)
at com.ibm.db2.jcc.am.k4.executeBatch(k4.java:3058)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:672)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: com.ibm.db2.jcc.am.SqlIntegrityConstraintViolationException: Error for batch element #1: DB2 SQL Error: SQLCODE=-803, SQLSTATE=23505, SQLERRMC=1;SPARK.DAILY_PRECIPITATION_DATA, DRIVER=4.25.13
at com.ibm.db2.jcc.am.b6.a(b6.java:806)
at com.ibm.db2.jcc.am.b6.a(b6.java:66)
at com.ibm.db2.jcc.am.b6.a(b6.java:140)
at com.ibm.db2.jcc.t4.ab.a(ab.java:1283)
at com.ibm.db2.jcc.t4.ab.a(ab.java:128)
at com.ibm.db2.jcc.t4.p.a(p.java:57)
at com.ibm.db2.jcc.t4.aw.a(aw.java:225)
at com.ibm.db2.jcc.am.k4.a(k4.java:3605)
at com.ibm.db2.jcc.am.k4.d(k4.java:6020)
at com.ibm.db2.jcc.am.k4.a(k4.java:5372)
... 17 more
As evident from the SqlIntegrityConstraintViolationException above, this happens because after one batch writes the grouped values using JDBC, the insertion for the next set of values fails because of the primary key (wsid, year, month, day).
Given that there will be a fixed number of oneHourPrecip values (24) for a given (wsid, year, month, day) from the source, how do we ensure that groupBy works properly for all data that is streamed from the source, so that insertion into the database is not a problem?
SaveMode.Upsert is not available :-)
This has nothing to do with groupBy; group by just groups the values. The integrity violation (com.ibm.db2.jcc.am.SqlIntegrityConstraintViolationException) is something you need to take care of at the SQL level.
Option 1:
You can do an insert-or-update (upsert) to avoid the integrity violation.
For this you need something like the pseudo code below (DB2-flavoured; the xxxx placeholders stand for the actual values/bindings):
dataframe.foreachPartition { partition =>
  // for each row: update first, then insert only if no matching row exists
  UPDATE TABLE_NAME SET FIELD_NAME = xxxxx WHERE MyID = XXX;
  INSERT INTO TABLE_NAME (colid, col1, col2)
    SELECT xxxx, xxxx, xxxx FROM SYSIBM.SYSDUMMY1
    WHERE NOT EXISTS (SELECT 1 FROM TABLE_NAME WHERE colid = xxxx);
}
Option 2 :
Or check the MERGE statement in DB2.
One way is to create an empty temp/staging table (without any constraints) that has the same schema, populate it, and at the end execute a script which will MERGE it into the target table.
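As an illustration of option 2 (not part of the original answer): a PySpark sketch that writes each micro-batch's aggregates into a constraint-free staging table over JDBC, then fires a DB2 MERGE through a raw JDBC connection obtained via the JVM gateway. The staging/target table and column names are assumptions; url and props stand in for the question's JDBC URL and connection properties, and df is the pre-aggregation streaming dataframe.

STAGING = "SPARK.DAILY_PRECIPITATION_STAGING"   # hypothetical staging table, no primary key
TARGET  = "SPARK.DAILY_PRECIPITATION_DATA"

merge_sql = f"""
MERGE INTO {TARGET} AS t
USING {STAGING} AS s
ON t.wsid = s.wsid AND t.year = s.year AND t.month = s.month AND t.day = s.day
WHEN MATCHED THEN UPDATE SET t.precipitation = t.precipitation + s.precipitation
WHEN NOT MATCHED THEN INSERT (wsid, year, month, day, precipitation)
     VALUES (s.wsid, s.year, s.month, s.day, s.precipitation)
"""

def upsert_batch(batch_df, batch_id):
    # 1) overwrite the staging table with this micro-batch's partial aggregates
    (batch_df.groupBy("wsid", "year", "month", "day")
             .agg({"oneHourPrecip": "sum"})
             .withColumnRenamed("sum(oneHourPrecip)", "precipitation")
             .write.mode("overwrite")
             .jdbc(url, STAGING, properties=props))
    # 2) merge the staged rows into the constrained target on the DB2 side;
    #    adding to the existing precipitation handles rows split across batches
    conn = spark._sc._jvm.java.sql.DriverManager.getConnection(url, props["user"], props["password"])
    try:
        conn.createStatement().executeUpdate(merge_sql)
    finally:
        conn.close()

(df.writeStream
   .outputMode("append")
   .foreachBatch(upsert_batch)
   .start())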
I did figure something out, but it may have some performance concerns. Anyway, it worked for me, so I am posting the answer:
I figured out that in order to store the grouped data into a DB2 table, we would have to wait until we retrieve all the data from the source. For that, I utilize OutputMode.Complete().
Then I realized if I were to write it into DB2 after grouping in the current method, it would still throw me the same error. For that, I had to use SaveMode.Overwrite inside foreachBatch.
I tried running my program with this approach, but it threw this error:
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets
So I decided to do groupby and aggregation during readStream itself. Thus my code looks like this:
readStream part:
val df = spark.readStream
  .format("kafka")
  .option("subscribe", "raw_weather")
  .option("kafka.bootstrap.servers", "<host1:port1,host2:port2>...")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"token\" password=\"" + "<some password>" + "\";")
  .option("kafka.ssl.protocol", "TLSv1.2")
  .option("kafka.ssl.enabled.protocols", "TLSv1.2")
  .option("kafka.ssl.endpoint.identification.algorithm", "HTTPS")
  .load()
  .selectExpr("CAST(value as STRING)")
  .as[String]
  .withColumn("_tmp", split(col("value"), "\\,"))
  .select(
    $"_tmp".getItem(0).as("wsid"),
    $"_tmp".getItem(1).as("year").cast("int"),
    $"_tmp".getItem(2).as("month").cast("int"),
    $"_tmp".getItem(3).as("day").cast("int"),
    $"_tmp".getItem(11).as("oneHourPrecip").cast("double")
  )
  .drop("_tmp")
  .groupBy($"wsid", $"year", $"month", $"day")
  .agg(sum($"oneHourPrecip").as("precipitation"))
writeStream part:
val query = df.writeStream
  .outputMode(OutputMode.Complete())
  .foreachBatch((df: DataFrame, id: Long) => {
    println(df.count())
    df.write
      .mode(SaveMode.Overwrite)
      .jdbc(url, s"$schema.$table", getProperties)
  })
  .trigger(Trigger.ProcessingTime(1))
  .start()

query.awaitTermination()
