Duplicates while publishing data to kafka topic using spark-streaming - apache-spark

I have a Spark Streaming application which consumes data from topic1, parses it, and then publishes the same records to two sinks: one to topic2 and the other to a Hive table. While publishing data to Kafka topic2 I see duplicates, but I don't see duplicates in the Hive table.
Using Spark 2.2 and Kafka 0.10.0:
KafkaWriter.write(spark, storeSalesStreamingFinalDF, config)
writeToHIVE(spark, storeSalesStreamingFinalDF, config)
object KafkaWriter {
  def write(spark: SparkSession, df: DataFrame, config: Config): Unit = {
    df.select(to_json(struct("*")) as 'value)
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", config.getString("kafka.dev.bootstrap.servers"))
      .option("topic", config.getString("kafka.topic"))
      .option("kafka.compression.type", config.getString("kafka.compression.type"))
      .option("kafka.session.timeout.ms", config.getString("kafka.session.timeout.ms"))
      .option("kafka.request.timeout.ms", config.getString("kafka.request.timeout.ms"))
      .save()
  }
}
Can someone help with this?
I am expecting no duplicates in Kafka topic2.

To handle the duplicate data, we should set .option("kafka.processing.guarantee", "exactly_once").
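For reference, a minimal sketch of where that option would sit on the writer above. Note that processing.guarantee is normally a Kafka Streams setting, so whether the producer underneath Spark's Kafka sink actually honors it is this answer's assumption; Spark's Kafka sink itself only guarantees at-least-once delivery, which is why retried batches can produce duplicates.
df.select(to_json(struct("*")) as 'value)
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.getString("kafka.dev.bootstrap.servers"))
  .option("topic", config.getString("kafka.topic"))
  // options with the "kafka." prefix are passed through to the underlying producer config
  .option("kafka.processing.guarantee", "exactly_once")
  .save()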

Related

How to stream data from Delta Table to Kafka Topic

The internet is full of examples of streaming data from a Kafka topic to Delta tables, but my requirement is to stream data from a Delta table to a Kafka topic. Is that possible? If yes, can you please share a code example?
Here is the code I tried.
val schemaRegistryAddr = "https://..."
val avroSchema = buildSchema(topic) //defined this method
val Df = spark.readStream.format("delta").load("path..")
  .withColumn("key", col("lskey").cast(StringType))
  .withColumn("topLevelRecord", struct(col("col1"), col("col2"), ...))
  .select(
    to_avro($"key", lit("topic-key"), schemaRegistryAddr).as("key"),
    to_avro($"topLevelRecord", lit("topic-value"), schemaRegistryAddr, avroSchema).as("value"))
Df.writeStream
  .format("kafka")
  .option("checkpointLocation", checkpointPath)
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("kafka.security.protocol", "SSL")
  .option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
  .option("kafka.ssl.keystore.password", keystorePassword)
  .option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
  .option("topic", topic)
  .option("batch.size", 262144)
  .option("linger.ms", 5000)
  .trigger(ProcessingTime("25 seconds"))
  .start()
But it fails with: org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
But when I try to write to the same topic using a batch producer it goes through successfully. Can anyone please let me know what I am missing in the streaming write to the Kafka topic?
Later I found this old blog post which says that the current Structured Streaming API does not support the 'kafka' format.
https://www.databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html?_ga=2.177174565.1658715673.1672876248-681971438.1669255333
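For what it's worth, streaming from a Delta table to Kafka does work in general; below is a minimal sketch that uses plain JSON values instead of schema-registry Avro, so the schema-registry lookup is taken out of the picture. It assumes the Delta and Kafka connectors are on the classpath, and the path, broker address, topic name and column names are placeholders rather than values from the original post.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("delta-to-kafka").getOrCreate()

// read the Delta table as a stream and serialize each row to a JSON string
val deltaStreamDf = spark.readStream
  .format("delta")
  .load("/path/to/delta/table") // placeholder path
  .select(
    col("lskey").cast("string").as("key"),
    to_json(struct(col("col1"), col("col2"))).as("value"))

// write the stream to Kafka; a checkpoint location is required for streaming writes
deltaStreamDf.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")    // placeholder
  .option("topic", "my-topic")                         // placeholder
  .option("checkpointLocation", "/path/to/checkpoint") // placeholder
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()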

Writing Spark DataFrame to Kafka is ignoring the partition column and kafka.partitioner.class

I am trying to write a Spark DataFrame (batch DF) to Kafka, and I need to write the data to specific partitions.
I tried the following code:
myDF.write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaProps.getBootstrapServers)
  .option("kafka.security.protocol", "SSL")
  .option("kafka.truststore.location", kafkaProps.getTrustStoreLocation)
  .option("kafka.truststore.password", kafkaProps.getTrustStorePassword)
  .option("kafka.keystore.location", kafkaProps.getKeyStoreLocation)
  .option("kafka.keystore.password", kafkaProps.getKeyStorePassword)
  .option("kafka.partitioner.class", "util.MyCustomPartitioner")
  .option("topic", kafkaProps.getTopicName)
  .save()
And the schema of the DF I am writing is:
+---+---------+-----+
|key|partition|value|
+---+---------+-----+
+---+---------+-----+
I had to repartition "myDF" to 1 partition since I need to order the data based on a date column.
It is writing the data to a single partition, but not to the one in the DF's "partition" column or to the one returned by the custom partitioner (which is the same as the value in the partition column).
Thanks
Sateesh
The feature to use the column "partition" in your DataFrame is only available with version 3.x and not earlier, according to the 2.4.7 docs.
However, using the option kafka.partitioner.class will still work. Spark Structured Streaming lets you pass plain Kafka producer configuration through options with the kafka. prefix, so this also works on version 2.4.4.
The code below runs fine with Spark 3.0.1 and Confluent Community Edition 5.5.0. On Spark 2.4.4 the "partition" column has no effect, but my custom partitioner class is applied.
case class KafkaRecord(partition: Int, value: String)

val spark = SparkSession.builder()
  .appName("test")
  .master("local[*]")
  .getOrCreate()

// create DataFrame
import spark.implicits._
val df = Seq((0, "Alice"), (1, "Bob")).toDF("partition", "value").as[KafkaRecord]

df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .save()
What you then see in the console-consumer:
# partition 0
$ kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic test --partition 0
Alice
and
# partition 1
$ kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic test --partition 1
Bob
I also get the same results when using a custom Partitioner
.option("kafka.partitioner.class", "org.test.CustomPartitioner")
where my custom Partitioner is defined as
package org.test

import java.util
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

class CustomPartitioner extends Partitioner {
  // send records whose value is "Bob" to partition 0, everything else to partition 1
  override def partition(topic: String, key: Any, keyBytes: Array[Byte], value: Any, valueBytes: Array[Byte], cluster: Cluster): Int = {
    if (valueBytes.nonEmpty && valueBytes.map(_.toChar).mkString == "Bob") 0 else 1
  }

  override def configure(configs: util.Map[String, _]): Unit = {}

  override def close(): Unit = {}
}

Structured Streaming: Reading from multiple Kafka topics at once

I have a Spark Structured Streaming application which has to read from 12 Kafka topics (different schemas, Avro format) at once, deserialize the data and store it in HDFS. When I read from a single topic using my code, it works fine and without errors, but when running multiple queries together I get the following error:
java.lang.IllegalStateException: Race while writing batch 0
My code is as follows:
def main(args: Array[String]): Unit = {
  val kafkaProps = Util.loadProperties(kafkaConfigFile).asScala
  val topic_list = List("topic1", "topic2", "topic3", "topic4")
  topic_list.foreach(x => {
    kafkaProps.update("subscribe", x)
    val source = Source.fromInputStream(Util.getInputStream("/schema/topics/" + x)).getLines.mkString
    val schemaParser = new Schema.Parser
    val schema = schemaParser.parse(source)
    val sqlTypeSchema = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
    val kafkaStreamData = spark
      .readStream
      .format("kafka")
      .options(kafkaProps)
      .load()
    val udfDeserialize = udf(deserialize(source), DataTypes.createStructType(sqlTypeSchema.fields))
    val transformedDeserializedData = kafkaStreamData.select("value").as(Encoders.BINARY)
      .withColumn("rows", udfDeserialize(col("value")))
      .select("rows.*")
    val query = transformedDeserializedData
      .writeStream
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .outputMode("append")
      .format("parquet")
      .option("path", "/output/topics/" + x)
      .option("checkpointLocation", checkpointLocation + "//" + x)
      .start()
  })
  spark.streams.awaitAnyTermination()
}
Alternative: you can use Kafka Connect (from Confluent), NiFi, StreamSets, etc., as your use case seems to fit "dump/persist to HDFS". That said, you need to have these tools installed. The small-files problem you mention is not really an issue here.
From Apache Kafka 0.9 or later you can use the Kafka Connect API for a Kafka --> HDFS sink (various HDFS formats are supported). You do need a Kafka Connect cluster, but it can sit alongside your existing cluster, so that is not a big deal; someone does need to maintain it, though.
Some links to get you on your way:
https://data-flair.training/blogs/kafka-connect/
https://github.com/confluentinc/kafka-connect-hdfs

How to save kafka data into different location based on a column value in spark structured streaming?

I have a use case in which I am consuming data from Kafka using Spark Structured Streaming. I subscribe to multiple topics, and based on the topic name the DataFrame should be dumped to a defined location (a different location for each topic). I looked for some kind of split/filter function on the Spark DataFrame to solve this, but could not find any.
As of now I am subscribed to only one topic, and I am using my own method to dump the data to a location in Parquet format. Here is the code I am currently using:
def save_as_parquet(cast_dataframe: DataFrame, output_path: String, checkpointLocation: String): Unit = {
  val query = cast_dataframe.writeStream
    .format("parquet")
    .option("failOnDataLoss", true)
    .option("path", output_path)
    .option("checkpointLocation", checkpointLocation)
    .start()
    .awaitTermination()
}
When I subscribe to different topics, this cast_dataframe will also have values from different topics. I want to dump the data from each topic only to its assigned location. How can this be done?
As explained in the official documentation, the Dataset to be written may contain an optional topic column, which can be used for message routing:
* The topic column is required if the “topic” configuration option is not specified.
The value column is the only required option. If a key column is not specified then a null valued key column will be automatically added (see Kafka semantics on how null valued key values are handled). If a topic column exists then its value is used as the topic when writing the given row to Kafka, unless the “topic” configuration option is set i.e., the “topic” configuration option overrides the topic column.
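To illustrate the routing behavior quoted above with a sketch for a Kafka sink: dfWithTopicColumn below is a hypothetical streaming DataFrame that already has key, value and topic columns, and the broker address and checkpoint path are placeholders.
// no "topic" option is set, so each row is routed to the topic named in its "topic" column
dfWithTopicColumn.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")           // placeholder
  .option("checkpointLocation", "/checkpoints/kafka-routing") // placeholder
  .start()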
According to the documentation, each row from the Kafka source has the following schema:
Column | Type
key    | binary
value  | binary
topic  | string
...    | ...
Assuming you are reading from multiple topics using the subscribe source option,
val kafkaInputDf = spark.readStream.format("kafka").[...]
  .option("subscribe", "topic1,topic2,topic3")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic")
you can then apply a filter on the column topic to split the data accordingly:
val df1 = kafkaInputDf.filter(col("topic") === "topic1")
val df2 = kafkaInputDf.filter(col("topic") === "topic2")
val df3 = kafkaInputDf.filter(col("topic") === "topic3")
Then you can sink those three streaming DataFrames df1, df2 and df3 into their required sinks, as in the sketch below. As this creates three streaming queries running in parallel, it is important that each writeStream gets its own checkpoint location.
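A minimal sketch of that pattern, with placeholder output and checkpoint paths:
val query1 = df1.writeStream.format("parquet")
  .option("path", "/output/topic1")                    // placeholder paths
  .option("checkpointLocation", "/checkpoints/topic1")
  .start()

val query2 = df2.writeStream.format("parquet")
  .option("path", "/output/topic2")
  .option("checkpointLocation", "/checkpoints/topic2")
  .start()

val query3 = df3.writeStream.format("parquet")
  .option("path", "/output/topic3")
  .option("checkpointLocation", "/checkpoints/topic3")
  .start()

// block until any of the three queries terminates
spark.streams.awaitAnyTermination()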

How to print Json encoded messages using Spark Structured Streaming

I have a Dataset[Row] where each row is a JSON string. I want to just print the JSON stream, or count the JSON stream per batch.
Here is my code so far
val ds = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topicName)
  .option("checkpointLocation", hdfsCheckPointDir)
  .load()
val ds1 = ds.select(from_json(col("value").cast("string"), schema) as 'payload)
val ds2 = ds1.select($"payload.info")
val query = ds2.writeStream.outputMode("append").queryName("table").format("memory").start()
query.awaitTermination()
select * from table; -- I don't see anything and there are no errors. However, when I run my Kafka consumer separately (independent of Spark) I can see the data.
My question really is: what do I need to do to just print the data I am receiving from Kafka using Structured Streaming? The messages in Kafka are JSON-encoded strings, so I am converting the JSON-encoded strings to some struct and eventually to a Dataset. I am using Spark 2.1.0.
val ds1 = ds.select(from_json(col("value").cast("string"), schema) as 'payload).select($"payload.*")
The following will print your data to the console:
ds1.writeStream.format("console").option("truncate", "false").start().awaitTermination()
Always use something like awaitTermination() or Thread.sleep(<time in milliseconds>) in these types of situations.
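If the goal is to watch counts rather than full rows, a rough sketch (not from the original answer) is a global streaming aggregation written to the console in complete mode; note that this prints a running total on every trigger rather than a strict per-batch count:
// running count of all records seen so far, printed on every trigger
val countQuery = ds1.groupBy().count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

countQuery.awaitTermination()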
