I'm using a Kafka Source in Spark Structured Streaming to receive Confluent encoded Avro records. I intend to use Confluent Schema Registry, but the integration with spark structured streaming seems to be impossible.
I have seen this question, but unable to get it working with the Confluent Schema Registry. Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)
It took me a couple months of reading source code and testing things out. In a nutshell, Spark can only handle String and Binary serialization. You must manually deserialize the data. In spark, create the confluent rest service object to get the schema. Convert the schema string in the response object into an Avro schema using the Avro parser. Next, read the Kafka topic as normal. Then map over the binary typed "value" column with the Confluent KafkaAvroDeSerializer. I strongly suggest getting into the source code for these classes because there is a lot going on here, so for brevity I'll leave out many details.
//Used Confluent version 3.2.2 to write this.
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.Schema
case class DeserializedFromKafkaRecord(key: String, value: String)
val schemaRegistryURL = "http://127.0.0.1:8081"
val topicName = "Schema-Registry-Example-topic1"
val subjectValueName = topicName + "-value"
//create RestService object
val restService = new RestService(schemaRegistryURL)
//.getLatestVersion returns io.confluent.kafka.schemaregistry.client.rest.entities.Schema object.
val valueRestResponseSchema = restService.getLatestVersion(subjectValueName)
//Use Avro parsing classes to get Avro Schema
val parser = new Schema.Parser
val topicValueAvroSchema: Schema = parser.parse(valueRestResponseSchema.getSchema)
//key schema is typically just string but you can do the same process for the key as the value
val keySchemaString = "\"string\""
val keySchema = parser.parse(keySchemaString)
//Create a map with the Schema registry url.
//This is the only Required configuration for Confluent's KafkaAvroDeserializer.
val props = Map("schema.registry.url" -> schemaRegistryURL)
//Declare SerDe vars before using Spark structured streaming map. Avoids non serializable class exception.
var keyDeserializer: KafkaAvroDeserializer = null
var valueDeserializer: KafkaAvroDeserializer = null
//Create structured streaming DF to read from the topic.
val rawTopicMessageDF = sql.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 20) //remove for prod
.load()
//instantiate the SerDe classes if not already, then deserialize!
val deserializedTopicMessageDS = rawTopicMessageDF.map{
row =>
if (keyDeserializer == null) {
keyDeserializer = new KafkaAvroDeserializer
keyDeserializer.configure(props.asJava, true) //isKey = true
}
if (valueDeserializer == null) {
valueDeserializer = new KafkaAvroDeserializer
valueDeserializer.configure(props.asJava, false) //isKey = false
}
//Pass the Avro schema.
val deserializedKeyString = keyDeserializer.deserialize(topicName, row.key, keySchema).toString //topic name is actually unused in the source code, just required by the signature. Weird right?
val deserializedValueString = valueDeserializer.deserialize(topicName, row.value, topicValueAvroSchema).toString
DeserializedFromKafkaRecord(deserializedKeyString, deserializedValueString)
}
val deserializedDSOutputStream = deserializedTopicMessageDS.writeStream
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
Disclaimer
This code was only tested on a local master, and has been reported runs into serializer issues in a clustered environment. There's an alternative solution (step 7-9, with Scala code in step 10) that extracts out the schema ids to columns, looks up each unique ID, and then uses schema broadcast variables, which will work better, at scale.
Also, there is an external library AbsaOSS/ABRiS that also addresses using the Registry with Spark
Since the other answer that was mostly useful was removed, I wanted to re-add it with some refactoring and comments.
Here are the dependencies needed. Code tested with Confluent 5.x and Spark 2.4
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>${confluent.version}</version>
<exclusions>
<!-- Conflicts with Spark's version -->
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
And here is the Scala implementation (only tested locally on master=local[*])
First section, define the imports, some fields, and a few helper methods to get schemas
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.cli.CommandLine
import org.apache.spark.sql._
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.streaming.OutputMode
object App {
private var schemaRegistryClient: SchemaRegistryClient = _
private var kafkaAvroDeserializer: AvroDeserializer = _
def lookupTopicSchema(topic: String, isKey: Boolean = false) = {
schemaRegistryClient.getLatestSchemaMetadata(topic + (if (isKey) "-key" else "-value")).getSchema
}
def avroSchemaToSparkSchema(avroSchema: String) = {
SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
}
// ... continues below
Then define a simple main method that parses the CMD args to get Kafka details
def main(args: Array[String]): Unit = {
val cmd: CommandLine = parseArg(args)
val master = cmd.getOptionValue("master", "local[*]")
val spark = SparkSession.builder()
.appName(App.getClass.getName)
.master(master)
.getOrCreate()
val bootstrapServers = cmd.getOptionValue("bootstrap-server")
val topic = cmd.getOptionValue("topic")
val schemaRegistryUrl = cmd.getOptionValue("schema-registry")
consumeAvro(spark, bootstrapServers, topic, schemaRegistryUrl)
spark.stop()
}
// ... still continues
Then, the important method that consumes the Kafka topic and deserializes it
private def consumeAvro(spark: SparkSession, bootstrapServers: String, topic: String, schemaRegistryUrl: String): Unit = {
import spark.implicits._
// Setup the Avro deserialization UDF
schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
spark.udf.register("deserialize", (bytes: Array[Byte]) =>
kafkaAvroDeserializer.deserialize(bytes)
)
// Load the raw Kafka topic (byte stream)
val rawDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
// Deserialize byte stream into strings (Avro fields become JSON)
import org.apache.spark.sql.functions._
val jsonDf = rawDf.select(
// 'key.cast(DataTypes.StringType), // string keys are simplest to use
callUDF("deserialize", 'key).as("key"), // but sometimes they are avro
callUDF("deserialize", 'value).as("value")
// excluding topic, partition, offset, timestamp, etc
)
// Get the Avro schema for the topic from the Schema Registry and convert it into a Spark schema type
val dfValueSchema = {
val rawSchema = lookupTopicSchema(topic)
avroSchemaToSparkSchema(rawSchema)
}
// Apply structured schema to JSON stream
val parsedDf = jsonDf.select(
'key, // keys are usually plain strings
// values are JSONified Avro records
from_json('value, dfValueSchema.dataType).alias("value")
).select(
'key,
$"value.*" // flatten out the value
)
// parsedDf.printSchema()
// Sample schema output
// root
// |-- key: string (nullable = true)
// |-- header: struct (nullable = true) // Not a Kafka record "header". This is part of our value schema
// | |-- time: long (nullable = true)
// | ...
// TODO: Do something interesting with this stream
parsedDf.writeStream
.format("console")
.outputMode(OutputMode.Append())
.option("truncate", false)
.start()
.awaitTermination()
}
// still continues
The command line parser allows for passing in bootstrap servers, schema registry, topic name, and Spark master.
private def parseArg(args: Array[String]): CommandLine = {
import org.apache.commons.cli._
val options = new Options
val masterOption = new Option("m", "master", true, "Spark master")
masterOption.setRequired(false)
options.addOption(masterOption)
val bootstrapOption = new Option("b", "bootstrap-server", true, "Bootstrap servers")
bootstrapOption.setRequired(true)
options.addOption(bootstrapOption)
val topicOption = new Option("t", "topic", true, "Kafka topic")
topicOption.setRequired(true)
options.addOption(topicOption)
val schemaRegOption = new Option("s", "schema-registry", true, "Schema Registry URL")
schemaRegOption.setRequired(true)
options.addOption(schemaRegOption)
val parser = new BasicParser
parser.parse(options, args)
}
// still continues
In order for the UDF above to work, then there needed to be a deserializer to take the DataFrame of bytes to one containing deserialized Avro
// Simple wrapper around Confluent deserializer
class AvroDeserializer extends AbstractKafkaAvroDeserializer {
def this(client: SchemaRegistryClient) {
this()
// TODO: configure the deserializer for authentication
this.schemaRegistry = client
}
override def deserialize(bytes: Array[Byte]): String = {
val value = super.deserialize(bytes)
value match {
case str: String =>
str
case _ =>
val genericRecord = value.asInstanceOf[GenericRecord]
genericRecord.toString
}
}
}
} // end 'object App'
Put each of these blocks together, and it works in IntelliJ after adding -b localhost:9092 -s http://localhost:8081 -t myTopic to Run Configurations > Program Arguments
This is an example of my code integrating spark structured streaming with kafka and schema registry (code in scala)
import org.apache.spark.sql.SparkSession
import io.confluent.kafka.schemaregistry.client.rest.RestService // <artifactId>kafka-schema-registry</artifactId>
import org.apache.spark.sql.avro.from_avro // <artifactId>spark-avro_${scala.compat.version}</artifactId>
import org.apache.spark.sql.functions.col
object KafkaConsumerAvro {
def main(args: Array[String]): Unit = {
val KAFKA_BOOTSTRAP_SERVERS = "localhost:9092"
val SCHEMA_REGISTRY_URL = "http://localhost:8081"
val TOPIC = "transactions"
val spark: SparkSession = SparkSession.builder().appName("KafkaConsumerAvro").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest") // from starting
.load()
// Prints Kafka schema with columns (topic, offset, partition e.t.c)
df.printSchema()
// Create REST service to access schema registry and retrieve topic schema (latest)
val restService = new RestService(SCHEMA_REGISTRY_URL)
val valueRestResponseSchema = restService.getLatestVersion(TOPIC + "-value")
val jsonSchema = valueRestResponseSchema.getSchema
val transactionDF = df.select(
col("key").cast("string"), // cast to string from binary value
from_avro(col("value"), jsonSchema).as("transaction"), // convert from avro value
col("topic"),
col("offset"),
col("timestamp"),
col("timestampType"))
transactionDF.printSchema()
// Stream data to console for testing
transactionDF.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
}
}
When reading from kafka topic, we have this kind of schema:
key: binary | value: binary | topic: string | partition: integer | offset: long | timestamp: timestamp | timestampType: integer |
As we can see, key and value are binary so we need to cast key as string and in this case, value is avro formatted so we can achieve this by calling from_avro function.
In adition to Spark and Kafka dependencies, we need this dependencies:
<!-- READ AND WRITE AVRO DATA -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- INTEGRATION WITH SCHEMA REGISTRY -->
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-schema-registry</artifactId>
<version>${confluent.version}</version>
</dependency>
This library will do the job for you. It connects to Confluent Schema Registry through Spark Structured Stream.
For Confluent, it copes with the schema id that is sent along with the payload.
In the README you will find a code snippet of how to do it.
DISCLOSURE: I work for ABSA and I developed this library.
Another very simple alternative for pyspark (without full support for schema registry like schema registration, compatibility check, etc.) could be:
import requests
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.avro.functions import *
# variables
topic = "my-topic"
schemaregistry = "http://localhost:8081"
kafka_brokers = "kafka1:9092,kafka2:9092"
# retrieve the latest schema
response = requests.get('{}/subjects/{}-value/versions/latest/schema'.format(schemaregistry, topic))
# error check
response.raise_for_status()
# extract the schema from the response
schema = response.text
# run the query
query = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", kafka_brokers) \
.option("subscribe", topic) \
.load() \
# The magic goes here:
# Skip the first 5 bytes (reserved by schema registry encoding protocol)
.selectExpr("substring(value, 6) as avro_value") \
.select(from_avro(col("avro_value"), schema).alias("data")) \
.select(col("data.my_field")) \
.writeStream \
.format("console") \
.outputMode("complete") \
.start()
Databricks now provide this functionality but you have to pay for it :-(
dataDF
.select(
to_avro($"key", lit("t-key"), schemaRegistryAddr).as("key"),
to_avro($"value", lit("t-value"), schemaRegistryAddr).as("value"))
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("topic", "t")
.save()
See:
https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html for more info
A good free alternative is ABRIS. See: https://github.com/AbsaOSS/ABRiS the only downside we can see that you need to provide a file of your avro schema at runtime so the framework can enforce this schema on your dataframe before it publishes it to the Kafka topic.
Based on #cricket_007's answers I created the following solution which could run in our cluster environment, including the following new features:
You need add broadcast variables to transfer some values into map operations for cluster environment. Neither Schema.Parser nor KafkaAvroDeserializer could be serialized in spark, so it is why you need initialize them in map operations
My structured streaming used foreachBatch output sink.
I applied org.apache.spark.sql.avro.SchemaConverters to convert avro schema format to spark StructType, so that you could use it in from_json column function to parse dataframe in Kafka topic fields (key and value).
Firstly, you need load some packages:
SCALA_VERSION="2.11"
SPARK_VERSION="2.4.4"
CONFLUENT_VERSION="5.2.2"
jars=(
"org.apache.spark:spark-sql-kafka-0-10_${SCALA_VERSION}:${SPARK_VERSION}" ## format("kafka")
"org.apache.spark:spark-avro_${SCALA_VERSION}:${SPARK_VERSION}" ## SchemaConverters
"io.confluent:kafka-schema-registry:${CONFLUENT_VERSION}" ## import io.confluent.kafka.schemaregistry.client.rest.RestService
"io.confluent:kafka-avro-serializer:${CONFLUENT_VERSION}" ## import io.confluent.kafka.serializers.KafkaAvroDeserializer
)
./bin/spark-shell --packages ${"${jars[*]}"// /,}
Here are the whole codes I tested in spark-shell:
import org.apache.avro.Schema
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import io.confluent.kafka.schemaregistry.client.rest.RestService
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.avro.SchemaConverters
import scala.collection.JavaConverters._
import java.time.LocalDateTime
spark.sparkContext.setLogLevel("Error")
val brokerServers = "xxx.yyy.zzz:9092"
val topicName = "mytopic"
val schemaRegistryURL = "http://xxx.yyy.zzz:8081"
val restService = new RestService(schemaRegistryURL)
val exParser = new Schema.Parser
//-- For both key and value
val schemaNames = Seq("key", "value")
val schemaStrings = schemaNames.map(i => (i -> restService.getLatestVersion(s"$topicName-$i").getSchema)).toMap
val tempStructMap = schemaStrings.transform((k,v) => SchemaConverters.toSqlType(exParser.parse(v)).dataType)
val schemaStruct = new StructType().add("key", tempStructMap("key")).add("value", tempStructMap("value"))
//-- For key only
// val schemaStrings = restService.getLatestVersion(s"$topicName-key").getSchema
// val schemaStruct = SchemaConverters.toSqlType(exParser.parse(schemaStrings)).dataType
//-- For value only
// val schemaStrings = restService.getLatestVersion(s"$topicName-value").getSchema
// val schemaStruct = SchemaConverters.toSqlType(exParser.parse(schemaStrings)).dataType
val query = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load()
.writeStream
.outputMode("append")
//.option("checkpointLocation", s"cos://$bucket.service/checkpoints/$tableName")
.foreachBatch((batchDF: DataFrame, batchId: Long) => {
val bcTopicName = sc.broadcast(topicName)
val bcSchemaRegistryURL = sc.broadcast(schemaRegistryURL)
val bcSchemaStrings = sc.broadcast(schemaStrings)
val rstDF = batchDF.map {
row =>
val props = Map("schema.registry.url" -> bcSchemaRegistryURL.value)
//-- For both key and value
val isKeys = Map("key" -> true, "value" -> false)
val deserializers = isKeys.transform{ (k,v) =>
val des = new KafkaAvroDeserializer
des.configure(props.asJava, v)
des
}
//-- For key only
// val deserializer = new KafkaAvroDeserializer
// deserializer.configure(props.asJava, true)
//-- For value only
// val deserializer = new KafkaAvroDeserializer
// deserializer.configure(props.asJava, false)
val inParser = new Schema.Parser
//-- For both key and value
val values = bcSchemaStrings.value.transform( (k,v) =>
deserializers(k).deserialize(bcTopicName.value, row.getAs[Array[Byte]](k), inParser.parse(v)).toString)
s"""{"key": ${values("key")}, "value": ${values("value")} }"""
//-- For key only
// deserializer.deserialize(bcTopicName.value, row.getAs[Array[Byte]]("key"), inParser.parse(bcSchemaStrings.value)).toString
//-- For value only
// deserializer.deserialize(bcTopicName.value, row.getAs[Array[Byte]]("value"), inParser.parse(bcSchemaStrings.value)).toString
}
.select(from_json(col("value"), schemaStruct).as("root"))
.select("root.*")
println(s"${LocalDateTime.now} --- Batch $batchId: ${rstDF.count} rows")
rstDF.printSchema
rstDF.show(false)
})
.trigger(Trigger.ProcessingTime("60 seconds"))
.start()
query.awaitTermination()
Summarizing some of answer above and adding some of my own experience, those are the options at the time of writing:
3rd party Abris library. This is what we used initially, but it doesn't seem to support a permissive mode where you can drop malformed packages. It will crash the stream when it encounters a malformed message. If you can guarantee the message validity, that is okay, but it was an issue for us as it kept trying to parse the malformed message after stream restart.
Custom UDF which parses the Avro data, as outlined by OneCricketeer's answer. Gives the most flexibility but also requires the most custom code.
Using Databrick's from_avro variant which allows you to simply pass the URL, and it will find the right schema and parse it for you. Works really well, but only available in their environment, thus hard to test in a codebase.
Using Spark's built-in from_avro function. This functions allows you to pass a JSON schema and parse it from there. Only fix that you have to apply is that in Confluent's Wire format, there is one magic byte and 4 schema bytes before the actual Avro binary data starts as also pointed out in dudssource's answer. You can parse it like this in Scala:
val restService = new RestService(espConfig.schemaRegistryUrl)
val valueRestResponseSchema = restService.getVersion(espConfig.fullTopicName + "-value", schemaVersion)
valueRestResponseSchema.getSchema
streamDf
.withColumn("binary_data", substring(6, Int.MaxValue))
.withColumn("parsed_data", from_avr('binary_data, jsonSchema, Map("MODE" -> "PERMISSIVE")))
For anyone that want's to do this in pyspark: The library that felipe referenced worked nicely on the JVM for me before, so i wrote a small wrapper function that integrates it in python. This looks very hacky, because a lot of types that are implicit in the scala language have to be specified explicitly in py4j. Has been working nicely so far, though, even in spark 2.4.1.
def expand_avro(spark_context, sql_context, data_frame, schema_registry_url, topic):
j = spark_context._gateway.jvm
dataframe_deserializer = j.za.co.absa.abris.avro.AvroSerDe.DataframeDeserializer(data_frame._jdf)
naming_strategy = getattr(
getattr(j.za.co.absa.abris.avro.read.confluent.SchemaManager,
"SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
conf = getattr(getattr(j.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.url", schema_registry_url))
conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.topic", topic))
conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.id", "latest"))
conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.naming.strategy", naming_strategy))
schema_path = j.scala.Option.apply(None)
conf = j.scala.Option.apply(conf)
policy = getattr(j.za.co.absa.abris.avro.schemas.policy.SchemaRetentionPolicies, "RETAIN_SELECTED_COLUMN_ONLY$")()
data_frame = dataframe_deserializer.fromConfluentAvro("value", schema_path, conf, policy)
data_frame = DataFrame(data_frame, sql_context)
return data_frame
For that to work, you have to add the library to the spark packages, e.g.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' \
'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.1,' \
'org.apache.spark:spark-avro_2.11:2.4.1,' \
'za.co.absa:abris_2.11:2.2.2 ' \
'--repositories https://packages.confluent.io/maven/ ' \
'pyspark-shell'
#RvdV Great summary. I was trying Abris library and consuming CDC record generated by Debezium.
val abrisConfig: FromAvroConfig = (AbrisConfig
.fromConfluentAvro
.downloadReaderSchemaByLatestVersion
.andTopicNameStrategy(topicName)
.usingSchemaRegistry(schemaRegistryURL))
val df=(spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load())
val deserializedAvro = (df
.select(from_avro(col("value"), abrisConfig)
.as("data"))
.select(col("data.after.*")))
deserializedAvro.printSchema()
val query = (deserializedAvro
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", s"s3://$bucketName/checkpoints/$tableName")
.trigger(Trigger.ProcessingTime("60 seconds"))
.start())
I added column while the streaming job is running. I was expecting it to print the new col that I added. It did not. Does it not dynamically refresh the schema from the version information in the payload ?
Related
I'm using a Kafka Source in Spark Structured Streaming to receive Confluent encoded Avro records. I intend to use Confluent Schema Registry, but the integration with spark structured streaming seems to be impossible.
I have seen this question, but unable to get it working with the Confluent Schema Registry. Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)
It took me a couple months of reading source code and testing things out. In a nutshell, Spark can only handle String and Binary serialization. You must manually deserialize the data. In spark, create the confluent rest service object to get the schema. Convert the schema string in the response object into an Avro schema using the Avro parser. Next, read the Kafka topic as normal. Then map over the binary typed "value" column with the Confluent KafkaAvroDeSerializer. I strongly suggest getting into the source code for these classes because there is a lot going on here, so for brevity I'll leave out many details.
//Used Confluent version 3.2.2 to write this.
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.Schema
case class DeserializedFromKafkaRecord(key: String, value: String)
val schemaRegistryURL = "http://127.0.0.1:8081"
val topicName = "Schema-Registry-Example-topic1"
val subjectValueName = topicName + "-value"
//create RestService object
val restService = new RestService(schemaRegistryURL)
//.getLatestVersion returns io.confluent.kafka.schemaregistry.client.rest.entities.Schema object.
val valueRestResponseSchema = restService.getLatestVersion(subjectValueName)
//Use Avro parsing classes to get Avro Schema
val parser = new Schema.Parser
val topicValueAvroSchema: Schema = parser.parse(valueRestResponseSchema.getSchema)
//key schema is typically just string but you can do the same process for the key as the value
val keySchemaString = "\"string\""
val keySchema = parser.parse(keySchemaString)
//Create a map with the Schema registry url.
//This is the only Required configuration for Confluent's KafkaAvroDeserializer.
val props = Map("schema.registry.url" -> schemaRegistryURL)
//Declare SerDe vars before using Spark structured streaming map. Avoids non serializable class exception.
var keyDeserializer: KafkaAvroDeserializer = null
var valueDeserializer: KafkaAvroDeserializer = null
//Create structured streaming DF to read from the topic.
val rawTopicMessageDF = sql.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 20) //remove for prod
.load()
//instantiate the SerDe classes if not already, then deserialize!
val deserializedTopicMessageDS = rawTopicMessageDF.map{
row =>
if (keyDeserializer == null) {
keyDeserializer = new KafkaAvroDeserializer
keyDeserializer.configure(props.asJava, true) //isKey = true
}
if (valueDeserializer == null) {
valueDeserializer = new KafkaAvroDeserializer
valueDeserializer.configure(props.asJava, false) //isKey = false
}
//Pass the Avro schema.
val deserializedKeyString = keyDeserializer.deserialize(topicName, row.key, keySchema).toString //topic name is actually unused in the source code, just required by the signature. Weird right?
val deserializedValueString = valueDeserializer.deserialize(topicName, row.value, topicValueAvroSchema).toString
DeserializedFromKafkaRecord(deserializedKeyString, deserializedValueString)
}
val deserializedDSOutputStream = deserializedTopicMessageDS.writeStream
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
Disclaimer
This code was only tested on a local master, and has been reported runs into serializer issues in a clustered environment. There's an alternative solution (step 7-9, with Scala code in step 10) that extracts out the schema ids to columns, looks up each unique ID, and then uses schema broadcast variables, which will work better, at scale.
Also, there is an external library AbsaOSS/ABRiS that also addresses using the Registry with Spark
Since the other answer that was mostly useful was removed, I wanted to re-add it with some refactoring and comments.
Here are the dependencies needed. Code tested with Confluent 5.x and Spark 2.4
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>${confluent.version}</version>
<exclusions>
<!-- Conflicts with Spark's version -->
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
And here is the Scala implementation (only tested locally on master=local[*])
First section, define the imports, some fields, and a few helper methods to get schemas
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.cli.CommandLine
import org.apache.spark.sql._
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.streaming.OutputMode
object App {
private var schemaRegistryClient: SchemaRegistryClient = _
private var kafkaAvroDeserializer: AvroDeserializer = _
def lookupTopicSchema(topic: String, isKey: Boolean = false) = {
schemaRegistryClient.getLatestSchemaMetadata(topic + (if (isKey) "-key" else "-value")).getSchema
}
def avroSchemaToSparkSchema(avroSchema: String) = {
SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
}
// ... continues below
Then define a simple main method that parses the CMD args to get Kafka details
def main(args: Array[String]): Unit = {
val cmd: CommandLine = parseArg(args)
val master = cmd.getOptionValue("master", "local[*]")
val spark = SparkSession.builder()
.appName(App.getClass.getName)
.master(master)
.getOrCreate()
val bootstrapServers = cmd.getOptionValue("bootstrap-server")
val topic = cmd.getOptionValue("topic")
val schemaRegistryUrl = cmd.getOptionValue("schema-registry")
consumeAvro(spark, bootstrapServers, topic, schemaRegistryUrl)
spark.stop()
}
// ... still continues
Then, the important method that consumes the Kafka topic and deserializes it
private def consumeAvro(spark: SparkSession, bootstrapServers: String, topic: String, schemaRegistryUrl: String): Unit = {
import spark.implicits._
// Setup the Avro deserialization UDF
schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
spark.udf.register("deserialize", (bytes: Array[Byte]) =>
kafkaAvroDeserializer.deserialize(bytes)
)
// Load the raw Kafka topic (byte stream)
val rawDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
// Deserialize byte stream into strings (Avro fields become JSON)
import org.apache.spark.sql.functions._
val jsonDf = rawDf.select(
// 'key.cast(DataTypes.StringType), // string keys are simplest to use
callUDF("deserialize", 'key).as("key"), // but sometimes they are avro
callUDF("deserialize", 'value).as("value")
// excluding topic, partition, offset, timestamp, etc
)
// Get the Avro schema for the topic from the Schema Registry and convert it into a Spark schema type
val dfValueSchema = {
val rawSchema = lookupTopicSchema(topic)
avroSchemaToSparkSchema(rawSchema)
}
// Apply structured schema to JSON stream
val parsedDf = jsonDf.select(
'key, // keys are usually plain strings
// values are JSONified Avro records
from_json('value, dfValueSchema.dataType).alias("value")
).select(
'key,
$"value.*" // flatten out the value
)
// parsedDf.printSchema()
// Sample schema output
// root
// |-- key: string (nullable = true)
// |-- header: struct (nullable = true) // Not a Kafka record "header". This is part of our value schema
// | |-- time: long (nullable = true)
// | ...
// TODO: Do something interesting with this stream
parsedDf.writeStream
.format("console")
.outputMode(OutputMode.Append())
.option("truncate", false)
.start()
.awaitTermination()
}
// still continues
The command line parser allows for passing in bootstrap servers, schema registry, topic name, and Spark master.
private def parseArg(args: Array[String]): CommandLine = {
import org.apache.commons.cli._
val options = new Options
val masterOption = new Option("m", "master", true, "Spark master")
masterOption.setRequired(false)
options.addOption(masterOption)
val bootstrapOption = new Option("b", "bootstrap-server", true, "Bootstrap servers")
bootstrapOption.setRequired(true)
options.addOption(bootstrapOption)
val topicOption = new Option("t", "topic", true, "Kafka topic")
topicOption.setRequired(true)
options.addOption(topicOption)
val schemaRegOption = new Option("s", "schema-registry", true, "Schema Registry URL")
schemaRegOption.setRequired(true)
options.addOption(schemaRegOption)
val parser = new BasicParser
parser.parse(options, args)
}
// still continues
In order for the UDF above to work, then there needed to be a deserializer to take the DataFrame of bytes to one containing deserialized Avro
// Simple wrapper around Confluent deserializer
class AvroDeserializer extends AbstractKafkaAvroDeserializer {
def this(client: SchemaRegistryClient) {
this()
// TODO: configure the deserializer for authentication
this.schemaRegistry = client
}
override def deserialize(bytes: Array[Byte]): String = {
val value = super.deserialize(bytes)
value match {
case str: String =>
str
case _ =>
val genericRecord = value.asInstanceOf[GenericRecord]
genericRecord.toString
}
}
}
} // end 'object App'
Put each of these blocks together, and it works in IntelliJ after adding -b localhost:9092 -s http://localhost:8081 -t myTopic to Run Configurations > Program Arguments
This is an example of my code integrating spark structured streaming with kafka and schema registry (code in scala)
import org.apache.spark.sql.SparkSession
import io.confluent.kafka.schemaregistry.client.rest.RestService // <artifactId>kafka-schema-registry</artifactId>
import org.apache.spark.sql.avro.from_avro // <artifactId>spark-avro_${scala.compat.version}</artifactId>
import org.apache.spark.sql.functions.col
object KafkaConsumerAvro {
def main(args: Array[String]): Unit = {
val KAFKA_BOOTSTRAP_SERVERS = "localhost:9092"
val SCHEMA_REGISTRY_URL = "http://localhost:8081"
val TOPIC = "transactions"
val spark: SparkSession = SparkSession.builder().appName("KafkaConsumerAvro").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest") // from starting
.load()
// Prints Kafka schema with columns (topic, offset, partition e.t.c)
df.printSchema()
// Create REST service to access schema registry and retrieve topic schema (latest)
val restService = new RestService(SCHEMA_REGISTRY_URL)
val valueRestResponseSchema = restService.getLatestVersion(TOPIC + "-value")
val jsonSchema = valueRestResponseSchema.getSchema
val transactionDF = df.select(
col("key").cast("string"), // cast to string from binary value
from_avro(col("value"), jsonSchema).as("transaction"), // convert from avro value
col("topic"),
col("offset"),
col("timestamp"),
col("timestampType"))
transactionDF.printSchema()
// Stream data to console for testing
transactionDF.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
}
}
When reading from kafka topic, we have this kind of schema:
key: binary | value: binary | topic: string | partition: integer | offset: long | timestamp: timestamp | timestampType: integer |
As we can see, key and value are binary so we need to cast key as string and in this case, value is avro formatted so we can achieve this by calling from_avro function.
In adition to Spark and Kafka dependencies, we need this dependencies:
<!-- READ AND WRITE AVRO DATA -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- INTEGRATION WITH SCHEMA REGISTRY -->
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-schema-registry</artifactId>
<version>${confluent.version}</version>
</dependency>
This library will do the job for you. It connects to Confluent Schema Registry through Spark Structured Stream.
For Confluent, it copes with the schema id that is sent along with the payload.
In the README you will find a code snippet of how to do it.
DISCLOSURE: I work for ABSA and I developed this library.
Another very simple alternative for pyspark (without full support for schema registry like schema registration, compatibility check, etc.) could be:
import requests
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.avro.functions import *
# variables
topic = "my-topic"
schemaregistry = "http://localhost:8081"
kafka_brokers = "kafka1:9092,kafka2:9092"
# retrieve the latest schema
response = requests.get('{}/subjects/{}-value/versions/latest/schema'.format(schemaregistry, topic))
# error check
response.raise_for_status()
# extract the schema from the response
schema = response.text
# run the query
query = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", kafka_brokers) \
.option("subscribe", topic) \
.load() \
# The magic goes here:
# Skip the first 5 bytes (reserved by schema registry encoding protocol)
.selectExpr("substring(value, 6) as avro_value") \
.select(from_avro(col("avro_value"), schema).alias("data")) \
.select(col("data.my_field")) \
.writeStream \
.format("console") \
.outputMode("complete") \
.start()
Databricks now provide this functionality but you have to pay for it :-(
dataDF
.select(
to_avro($"key", lit("t-key"), schemaRegistryAddr).as("key"),
to_avro($"value", lit("t-value"), schemaRegistryAddr).as("value"))
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("topic", "t")
.save()
See:
https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html for more info
A good free alternative is ABRIS. See: https://github.com/AbsaOSS/ABRiS the only downside we can see that you need to provide a file of your avro schema at runtime so the framework can enforce this schema on your dataframe before it publishes it to the Kafka topic.
Based on #cricket_007's answers I created the following solution which could run in our cluster environment, including the following new features:
You need add broadcast variables to transfer some values into map operations for cluster environment. Neither Schema.Parser nor KafkaAvroDeserializer could be serialized in spark, so it is why you need initialize them in map operations
My structured streaming used foreachBatch output sink.
I applied org.apache.spark.sql.avro.SchemaConverters to convert avro schema format to spark StructType, so that you could use it in from_json column function to parse dataframe in Kafka topic fields (key and value).
Firstly, you need load some packages:
SCALA_VERSION="2.11"
SPARK_VERSION="2.4.4"
CONFLUENT_VERSION="5.2.2"
jars=(
"org.apache.spark:spark-sql-kafka-0-10_${SCALA_VERSION}:${SPARK_VERSION}" ## format("kafka")
"org.apache.spark:spark-avro_${SCALA_VERSION}:${SPARK_VERSION}" ## SchemaConverters
"io.confluent:kafka-schema-registry:${CONFLUENT_VERSION}" ## import io.confluent.kafka.schemaregistry.client.rest.RestService
"io.confluent:kafka-avro-serializer:${CONFLUENT_VERSION}" ## import io.confluent.kafka.serializers.KafkaAvroDeserializer
)
./bin/spark-shell --packages ${"${jars[*]}"// /,}
Here are the whole codes I tested in spark-shell:
import org.apache.avro.Schema
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import io.confluent.kafka.schemaregistry.client.rest.RestService
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.avro.SchemaConverters
import scala.collection.JavaConverters._
import java.time.LocalDateTime
spark.sparkContext.setLogLevel("Error")
val brokerServers = "xxx.yyy.zzz:9092"
val topicName = "mytopic"
val schemaRegistryURL = "http://xxx.yyy.zzz:8081"
val restService = new RestService(schemaRegistryURL)
val exParser = new Schema.Parser
//-- For both key and value
val schemaNames = Seq("key", "value")
val schemaStrings = schemaNames.map(i => (i -> restService.getLatestVersion(s"$topicName-$i").getSchema)).toMap
val tempStructMap = schemaStrings.transform((k,v) => SchemaConverters.toSqlType(exParser.parse(v)).dataType)
val schemaStruct = new StructType().add("key", tempStructMap("key")).add("value", tempStructMap("value"))
//-- For key only
// val schemaStrings = restService.getLatestVersion(s"$topicName-key").getSchema
// val schemaStruct = SchemaConverters.toSqlType(exParser.parse(schemaStrings)).dataType
//-- For value only
// val schemaStrings = restService.getLatestVersion(s"$topicName-value").getSchema
// val schemaStruct = SchemaConverters.toSqlType(exParser.parse(schemaStrings)).dataType
val query = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load()
.writeStream
.outputMode("append")
//.option("checkpointLocation", s"cos://$bucket.service/checkpoints/$tableName")
.foreachBatch((batchDF: DataFrame, batchId: Long) => {
val bcTopicName = sc.broadcast(topicName)
val bcSchemaRegistryURL = sc.broadcast(schemaRegistryURL)
val bcSchemaStrings = sc.broadcast(schemaStrings)
val rstDF = batchDF.map {
row =>
val props = Map("schema.registry.url" -> bcSchemaRegistryURL.value)
//-- For both key and value
val isKeys = Map("key" -> true, "value" -> false)
val deserializers = isKeys.transform{ (k,v) =>
val des = new KafkaAvroDeserializer
des.configure(props.asJava, v)
des
}
//-- For key only
// val deserializer = new KafkaAvroDeserializer
// deserializer.configure(props.asJava, true)
//-- For value only
// val deserializer = new KafkaAvroDeserializer
// deserializer.configure(props.asJava, false)
val inParser = new Schema.Parser
//-- For both key and value
val values = bcSchemaStrings.value.transform( (k,v) =>
deserializers(k).deserialize(bcTopicName.value, row.getAs[Array[Byte]](k), inParser.parse(v)).toString)
s"""{"key": ${values("key")}, "value": ${values("value")} }"""
//-- For key only
// deserializer.deserialize(bcTopicName.value, row.getAs[Array[Byte]]("key"), inParser.parse(bcSchemaStrings.value)).toString
//-- For value only
// deserializer.deserialize(bcTopicName.value, row.getAs[Array[Byte]]("value"), inParser.parse(bcSchemaStrings.value)).toString
}
.select(from_json(col("value"), schemaStruct).as("root"))
.select("root.*")
println(s"${LocalDateTime.now} --- Batch $batchId: ${rstDF.count} rows")
rstDF.printSchema
rstDF.show(false)
})
.trigger(Trigger.ProcessingTime("60 seconds"))
.start()
query.awaitTermination()
Summarizing some of answer above and adding some of my own experience, those are the options at the time of writing:
3rd party Abris library. This is what we used initially, but it doesn't seem to support a permissive mode where you can drop malformed packages. It will crash the stream when it encounters a malformed message. If you can guarantee the message validity, that is okay, but it was an issue for us as it kept trying to parse the malformed message after stream restart.
Custom UDF which parses the Avro data, as outlined by OneCricketeer's answer. Gives the most flexibility but also requires the most custom code.
Using Databrick's from_avro variant which allows you to simply pass the URL, and it will find the right schema and parse it for you. Works really well, but only available in their environment, thus hard to test in a codebase.
Using Spark's built-in from_avro function. This functions allows you to pass a JSON schema and parse it from there. Only fix that you have to apply is that in Confluent's Wire format, there is one magic byte and 4 schema bytes before the actual Avro binary data starts as also pointed out in dudssource's answer. You can parse it like this in Scala:
val restService = new RestService(espConfig.schemaRegistryUrl)
val valueRestResponseSchema = restService.getVersion(espConfig.fullTopicName + "-value", schemaVersion)
valueRestResponseSchema.getSchema
streamDf
.withColumn("binary_data", substring(6, Int.MaxValue))
.withColumn("parsed_data", from_avr('binary_data, jsonSchema, Map("MODE" -> "PERMISSIVE")))
For anyone that want's to do this in pyspark: The library that felipe referenced worked nicely on the JVM for me before, so i wrote a small wrapper function that integrates it in python. This looks very hacky, because a lot of types that are implicit in the scala language have to be specified explicitly in py4j. Has been working nicely so far, though, even in spark 2.4.1.
def expand_avro(spark_context, sql_context, data_frame, schema_registry_url, topic):
j = spark_context._gateway.jvm
dataframe_deserializer = j.za.co.absa.abris.avro.AvroSerDe.DataframeDeserializer(data_frame._jdf)
naming_strategy = getattr(
getattr(j.za.co.absa.abris.avro.read.confluent.SchemaManager,
"SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
conf = getattr(getattr(j.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.url", schema_registry_url))
conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.topic", topic))
conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.id", "latest"))
conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.naming.strategy", naming_strategy))
schema_path = j.scala.Option.apply(None)
conf = j.scala.Option.apply(conf)
policy = getattr(j.za.co.absa.abris.avro.schemas.policy.SchemaRetentionPolicies, "RETAIN_SELECTED_COLUMN_ONLY$")()
data_frame = dataframe_deserializer.fromConfluentAvro("value", schema_path, conf, policy)
data_frame = DataFrame(data_frame, sql_context)
return data_frame
For that to work, you have to add the library to the spark packages, e.g.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' \
'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.1,' \
'org.apache.spark:spark-avro_2.11:2.4.1,' \
'za.co.absa:abris_2.11:2.2.2 ' \
'--repositories https://packages.confluent.io/maven/ ' \
'pyspark-shell'
#RvdV Great summary. I was trying Abris library and consuming CDC record generated by Debezium.
val abrisConfig: FromAvroConfig = (AbrisConfig
.fromConfluentAvro
.downloadReaderSchemaByLatestVersion
.andTopicNameStrategy(topicName)
.usingSchemaRegistry(schemaRegistryURL))
val df=(spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load())
val deserializedAvro = (df
.select(from_avro(col("value"), abrisConfig)
.as("data"))
.select(col("data.after.*")))
deserializedAvro.printSchema()
val query = (deserializedAvro
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", s"s3://$bucketName/checkpoints/$tableName")
.trigger(Trigger.ProcessingTime("60 seconds"))
.start())
I added column while the streaming job is running. I was expecting it to print the new col that I added. It did not. Does it not dynamically refresh the schema from the version information in the payload ?
I'm using a Kafka Source in Spark Structured Streaming to receive Confluent encoded Avro records. I intend to use Confluent Schema Registry, but the integration with spark structured streaming seems to be impossible.
I have seen this question, but unable to get it working with the Confluent Schema Registry. Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)
It took me a couple months of reading source code and testing things out. In a nutshell, Spark can only handle String and Binary serialization. You must manually deserialize the data. In spark, create the confluent rest service object to get the schema. Convert the schema string in the response object into an Avro schema using the Avro parser. Next, read the Kafka topic as normal. Then map over the binary typed "value" column with the Confluent KafkaAvroDeSerializer. I strongly suggest getting into the source code for these classes because there is a lot going on here, so for brevity I'll leave out many details.
//Used Confluent version 3.2.2 to write this.
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.Schema
case class DeserializedFromKafkaRecord(key: String, value: String)
val schemaRegistryURL = "http://127.0.0.1:8081"
val topicName = "Schema-Registry-Example-topic1"
val subjectValueName = topicName + "-value"
//create RestService object
val restService = new RestService(schemaRegistryURL)
//.getLatestVersion returns io.confluent.kafka.schemaregistry.client.rest.entities.Schema object.
val valueRestResponseSchema = restService.getLatestVersion(subjectValueName)
//Use Avro parsing classes to get Avro Schema
val parser = new Schema.Parser
val topicValueAvroSchema: Schema = parser.parse(valueRestResponseSchema.getSchema)
//key schema is typically just string but you can do the same process for the key as the value
val keySchemaString = "\"string\""
val keySchema = parser.parse(keySchemaString)
//Create a map with the Schema registry url.
//This is the only Required configuration for Confluent's KafkaAvroDeserializer.
val props = Map("schema.registry.url" -> schemaRegistryURL)
//Declare SerDe vars before using Spark structured streaming map. Avoids non serializable class exception.
var keyDeserializer: KafkaAvroDeserializer = null
var valueDeserializer: KafkaAvroDeserializer = null
//Create structured streaming DF to read from the topic.
val rawTopicMessageDF = sql.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 20) //remove for prod
.load()
//instantiate the SerDe classes if not already, then deserialize!
val deserializedTopicMessageDS = rawTopicMessageDF.map{
row =>
if (keyDeserializer == null) {
keyDeserializer = new KafkaAvroDeserializer
keyDeserializer.configure(props.asJava, true) //isKey = true
}
if (valueDeserializer == null) {
valueDeserializer = new KafkaAvroDeserializer
valueDeserializer.configure(props.asJava, false) //isKey = false
}
//Pass the Avro schema.
val deserializedKeyString = keyDeserializer.deserialize(topicName, row.key, keySchema).toString //topic name is actually unused in the source code, just required by the signature. Weird right?
val deserializedValueString = valueDeserializer.deserialize(topicName, row.value, topicValueAvroSchema).toString
DeserializedFromKafkaRecord(deserializedKeyString, deserializedValueString)
}
val deserializedDSOutputStream = deserializedTopicMessageDS.writeStream
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
Disclaimer
This code was only tested on a local master, and has been reported runs into serializer issues in a clustered environment. There's an alternative solution (step 7-9, with Scala code in step 10) that extracts out the schema ids to columns, looks up each unique ID, and then uses schema broadcast variables, which will work better, at scale.
Also, there is an external library AbsaOSS/ABRiS that also addresses using the Registry with Spark
Since the other answer that was mostly useful was removed, I wanted to re-add it with some refactoring and comments.
Here are the dependencies needed. Code tested with Confluent 5.x and Spark 2.4
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>${confluent.version}</version>
<exclusions>
<!-- Conflicts with Spark's version -->
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
And here is the Scala implementation (only tested locally on master=local[*])
First section, define the imports, some fields, and a few helper methods to get schemas
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.cli.CommandLine
import org.apache.spark.sql._
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.streaming.OutputMode
object App {
private var schemaRegistryClient: SchemaRegistryClient = _
private var kafkaAvroDeserializer: AvroDeserializer = _
def lookupTopicSchema(topic: String, isKey: Boolean = false) = {
schemaRegistryClient.getLatestSchemaMetadata(topic + (if (isKey) "-key" else "-value")).getSchema
}
def avroSchemaToSparkSchema(avroSchema: String) = {
SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
}
// ... continues below
Then define a simple main method that parses the CMD args to get Kafka details
def main(args: Array[String]): Unit = {
val cmd: CommandLine = parseArg(args)
val master = cmd.getOptionValue("master", "local[*]")
val spark = SparkSession.builder()
.appName(App.getClass.getName)
.master(master)
.getOrCreate()
val bootstrapServers = cmd.getOptionValue("bootstrap-server")
val topic = cmd.getOptionValue("topic")
val schemaRegistryUrl = cmd.getOptionValue("schema-registry")
consumeAvro(spark, bootstrapServers, topic, schemaRegistryUrl)
spark.stop()
}
// ... still continues
Then, the important method that consumes the Kafka topic and deserializes it
private def consumeAvro(spark: SparkSession, bootstrapServers: String, topic: String, schemaRegistryUrl: String): Unit = {
import spark.implicits._
// Setup the Avro deserialization UDF
schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
spark.udf.register("deserialize", (bytes: Array[Byte]) =>
kafkaAvroDeserializer.deserialize(bytes)
)
// Load the raw Kafka topic (byte stream)
val rawDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
// Deserialize byte stream into strings (Avro fields become JSON)
import org.apache.spark.sql.functions._
val jsonDf = rawDf.select(
// 'key.cast(DataTypes.StringType), // string keys are simplest to use
callUDF("deserialize", 'key).as("key"), // but sometimes they are avro
callUDF("deserialize", 'value).as("value")
// excluding topic, partition, offset, timestamp, etc
)
// Get the Avro schema for the topic from the Schema Registry and convert it into a Spark schema type
val dfValueSchema = {
val rawSchema = lookupTopicSchema(topic)
avroSchemaToSparkSchema(rawSchema)
}
// Apply structured schema to JSON stream
val parsedDf = jsonDf.select(
'key, // keys are usually plain strings
// values are JSONified Avro records
from_json('value, dfValueSchema.dataType).alias("value")
).select(
'key,
$"value.*" // flatten out the value
)
// parsedDf.printSchema()
// Sample schema output
// root
// |-- key: string (nullable = true)
// |-- header: struct (nullable = true) // Not a Kafka record "header". This is part of our value schema
// | |-- time: long (nullable = true)
// | ...
// TODO: Do something interesting with this stream
parsedDf.writeStream
.format("console")
.outputMode(OutputMode.Append())
.option("truncate", false)
.start()
.awaitTermination()
}
// still continues
The command line parser allows for passing in bootstrap servers, schema registry, topic name, and Spark master.
private def parseArg(args: Array[String]): CommandLine = {
import org.apache.commons.cli._
val options = new Options
val masterOption = new Option("m", "master", true, "Spark master")
masterOption.setRequired(false)
options.addOption(masterOption)
val bootstrapOption = new Option("b", "bootstrap-server", true, "Bootstrap servers")
bootstrapOption.setRequired(true)
options.addOption(bootstrapOption)
val topicOption = new Option("t", "topic", true, "Kafka topic")
topicOption.setRequired(true)
options.addOption(topicOption)
val schemaRegOption = new Option("s", "schema-registry", true, "Schema Registry URL")
schemaRegOption.setRequired(true)
options.addOption(schemaRegOption)
val parser = new BasicParser
parser.parse(options, args)
}
// still continues
In order for the UDF above to work, then there needed to be a deserializer to take the DataFrame of bytes to one containing deserialized Avro
// Simple wrapper around Confluent deserializer
class AvroDeserializer extends AbstractKafkaAvroDeserializer {
def this(client: SchemaRegistryClient) {
this()
// TODO: configure the deserializer for authentication
this.schemaRegistry = client
}
override def deserialize(bytes: Array[Byte]): String = {
val value = super.deserialize(bytes)
value match {
case str: String =>
str
case _ =>
val genericRecord = value.asInstanceOf[GenericRecord]
genericRecord.toString
}
}
}
} // end 'object App'
Put each of these blocks together, and it works in IntelliJ after adding -b localhost:9092 -s http://localhost:8081 -t myTopic to Run Configurations > Program Arguments
This is an example of my code integrating spark structured streaming with kafka and schema registry (code in scala)
import org.apache.spark.sql.SparkSession
import io.confluent.kafka.schemaregistry.client.rest.RestService // <artifactId>kafka-schema-registry</artifactId>
import org.apache.spark.sql.avro.from_avro // <artifactId>spark-avro_${scala.compat.version}</artifactId>
import org.apache.spark.sql.functions.col
object KafkaConsumerAvro {
def main(args: Array[String]): Unit = {
val KAFKA_BOOTSTRAP_SERVERS = "localhost:9092"
val SCHEMA_REGISTRY_URL = "http://localhost:8081"
val TOPIC = "transactions"
val spark: SparkSession = SparkSession.builder().appName("KafkaConsumerAvro").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest") // from starting
.load()
// Prints Kafka schema with columns (topic, offset, partition e.t.c)
df.printSchema()
// Create REST service to access schema registry and retrieve topic schema (latest)
val restService = new RestService(SCHEMA_REGISTRY_URL)
val valueRestResponseSchema = restService.getLatestVersion(TOPIC + "-value")
val jsonSchema = valueRestResponseSchema.getSchema
val transactionDF = df.select(
col("key").cast("string"), // cast to string from binary value
from_avro(col("value"), jsonSchema).as("transaction"), // convert from avro value
col("topic"),
col("offset"),
col("timestamp"),
col("timestampType"))
transactionDF.printSchema()
// Stream data to console for testing
transactionDF.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
}
}
When reading from kafka topic, we have this kind of schema:
key: binary | value: binary | topic: string | partition: integer | offset: long | timestamp: timestamp | timestampType: integer |
As we can see, key and value are binary so we need to cast key as string and in this case, value is avro formatted so we can achieve this by calling from_avro function.
In adition to Spark and Kafka dependencies, we need this dependencies:
<!-- READ AND WRITE AVRO DATA -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- INTEGRATION WITH SCHEMA REGISTRY -->
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-schema-registry</artifactId>
<version>${confluent.version}</version>
</dependency>
This library will do the job for you. It connects to Confluent Schema Registry through Spark Structured Stream.
For Confluent, it copes with the schema id that is sent along with the payload.
In the README you will find a code snippet of how to do it.
DISCLOSURE: I work for ABSA and I developed this library.
Another very simple alternative for pyspark (without full support for schema registry like schema registration, compatibility check, etc.) could be:
import requests
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.avro.functions import *
# variables
topic = "my-topic"
schemaregistry = "http://localhost:8081"
kafka_brokers = "kafka1:9092,kafka2:9092"
# retrieve the latest schema
response = requests.get('{}/subjects/{}-value/versions/latest/schema'.format(schemaregistry, topic))
# error check
response.raise_for_status()
# extract the schema from the response
schema = response.text
# run the query
query = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", kafka_brokers) \
.option("subscribe", topic) \
.load() \
# The magic goes here:
# Skip the first 5 bytes (reserved by schema registry encoding protocol)
.selectExpr("substring(value, 6) as avro_value") \
.select(from_avro(col("avro_value"), schema).alias("data")) \
.select(col("data.my_field")) \
.writeStream \
.format("console") \
.outputMode("complete") \
.start()
Databricks now provide this functionality but you have to pay for it :-(
dataDF
.select(
to_avro($"key", lit("t-key"), schemaRegistryAddr).as("key"),
to_avro($"value", lit("t-value"), schemaRegistryAddr).as("value"))
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("topic", "t")
.save()
See:
https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html for more info
A good free alternative is ABRIS. See: https://github.com/AbsaOSS/ABRiS the only downside we can see that you need to provide a file of your avro schema at runtime so the framework can enforce this schema on your dataframe before it publishes it to the Kafka topic.
Based on #cricket_007's answers I created the following solution which could run in our cluster environment, including the following new features:
You need add broadcast variables to transfer some values into map operations for cluster environment. Neither Schema.Parser nor KafkaAvroDeserializer could be serialized in spark, so it is why you need initialize them in map operations
My structured streaming used foreachBatch output sink.
I applied org.apache.spark.sql.avro.SchemaConverters to convert avro schema format to spark StructType, so that you could use it in from_json column function to parse dataframe in Kafka topic fields (key and value).
Firstly, you need load some packages:
SCALA_VERSION="2.11"
SPARK_VERSION="2.4.4"
CONFLUENT_VERSION="5.2.2"
jars=(
"org.apache.spark:spark-sql-kafka-0-10_${SCALA_VERSION}:${SPARK_VERSION}" ## format("kafka")
"org.apache.spark:spark-avro_${SCALA_VERSION}:${SPARK_VERSION}" ## SchemaConverters
"io.confluent:kafka-schema-registry:${CONFLUENT_VERSION}" ## import io.confluent.kafka.schemaregistry.client.rest.RestService
"io.confluent:kafka-avro-serializer:${CONFLUENT_VERSION}" ## import io.confluent.kafka.serializers.KafkaAvroDeserializer
)
./bin/spark-shell --packages ${"${jars[*]}"// /,}
Here are the whole codes I tested in spark-shell:
import org.apache.avro.Schema
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import io.confluent.kafka.schemaregistry.client.rest.RestService
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.avro.SchemaConverters
import scala.collection.JavaConverters._
import java.time.LocalDateTime
spark.sparkContext.setLogLevel("Error")
val brokerServers = "xxx.yyy.zzz:9092"
val topicName = "mytopic"
val schemaRegistryURL = "http://xxx.yyy.zzz:8081"
val restService = new RestService(schemaRegistryURL)
val exParser = new Schema.Parser
//-- For both key and value
val schemaNames = Seq("key", "value")
val schemaStrings = schemaNames.map(i => (i -> restService.getLatestVersion(s"$topicName-$i").getSchema)).toMap
val tempStructMap = schemaStrings.transform((k,v) => SchemaConverters.toSqlType(exParser.parse(v)).dataType)
val schemaStruct = new StructType().add("key", tempStructMap("key")).add("value", tempStructMap("value"))
//-- For key only
// val schemaStrings = restService.getLatestVersion(s"$topicName-key").getSchema
// val schemaStruct = SchemaConverters.toSqlType(exParser.parse(schemaStrings)).dataType
//-- For value only
// val schemaStrings = restService.getLatestVersion(s"$topicName-value").getSchema
// val schemaStruct = SchemaConverters.toSqlType(exParser.parse(schemaStrings)).dataType
val query = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load()
.writeStream
.outputMode("append")
//.option("checkpointLocation", s"cos://$bucket.service/checkpoints/$tableName")
.foreachBatch((batchDF: DataFrame, batchId: Long) => {
val bcTopicName = sc.broadcast(topicName)
val bcSchemaRegistryURL = sc.broadcast(schemaRegistryURL)
val bcSchemaStrings = sc.broadcast(schemaStrings)
val rstDF = batchDF.map {
row =>
val props = Map("schema.registry.url" -> bcSchemaRegistryURL.value)
//-- For both key and value
val isKeys = Map("key" -> true, "value" -> false)
val deserializers = isKeys.transform{ (k,v) =>
val des = new KafkaAvroDeserializer
des.configure(props.asJava, v)
des
}
//-- For key only
// val deserializer = new KafkaAvroDeserializer
// deserializer.configure(props.asJava, true)
//-- For value only
// val deserializer = new KafkaAvroDeserializer
// deserializer.configure(props.asJava, false)
val inParser = new Schema.Parser
//-- For both key and value
val values = bcSchemaStrings.value.transform( (k,v) =>
deserializers(k).deserialize(bcTopicName.value, row.getAs[Array[Byte]](k), inParser.parse(v)).toString)
s"""{"key": ${values("key")}, "value": ${values("value")} }"""
//-- For key only
// deserializer.deserialize(bcTopicName.value, row.getAs[Array[Byte]]("key"), inParser.parse(bcSchemaStrings.value)).toString
//-- For value only
// deserializer.deserialize(bcTopicName.value, row.getAs[Array[Byte]]("value"), inParser.parse(bcSchemaStrings.value)).toString
}
.select(from_json(col("value"), schemaStruct).as("root"))
.select("root.*")
println(s"${LocalDateTime.now} --- Batch $batchId: ${rstDF.count} rows")
rstDF.printSchema
rstDF.show(false)
})
.trigger(Trigger.ProcessingTime("60 seconds"))
.start()
query.awaitTermination()
Summarizing some of answer above and adding some of my own experience, those are the options at the time of writing:
3rd party Abris library. This is what we used initially, but it doesn't seem to support a permissive mode where you can drop malformed packages. It will crash the stream when it encounters a malformed message. If you can guarantee the message validity, that is okay, but it was an issue for us as it kept trying to parse the malformed message after stream restart.
Custom UDF which parses the Avro data, as outlined by OneCricketeer's answer. Gives the most flexibility but also requires the most custom code.
Using Databrick's from_avro variant which allows you to simply pass the URL, and it will find the right schema and parse it for you. Works really well, but only available in their environment, thus hard to test in a codebase.
Using Spark's built-in from_avro function. This functions allows you to pass a JSON schema and parse it from there. Only fix that you have to apply is that in Confluent's Wire format, there is one magic byte and 4 schema bytes before the actual Avro binary data starts as also pointed out in dudssource's answer. You can parse it like this in Scala:
val restService = new RestService(espConfig.schemaRegistryUrl)
val valueRestResponseSchema = restService.getVersion(espConfig.fullTopicName + "-value", schemaVersion)
valueRestResponseSchema.getSchema
streamDf
.withColumn("binary_data", substring(6, Int.MaxValue))
.withColumn("parsed_data", from_avr('binary_data, jsonSchema, Map("MODE" -> "PERMISSIVE")))
For anyone that want's to do this in pyspark: The library that felipe referenced worked nicely on the JVM for me before, so i wrote a small wrapper function that integrates it in python. This looks very hacky, because a lot of types that are implicit in the scala language have to be specified explicitly in py4j. Has been working nicely so far, though, even in spark 2.4.1.
def expand_avro(spark_context, sql_context, data_frame, schema_registry_url, topic):
j = spark_context._gateway.jvm
dataframe_deserializer = j.za.co.absa.abris.avro.AvroSerDe.DataframeDeserializer(data_frame._jdf)
naming_strategy = getattr(
getattr(j.za.co.absa.abris.avro.read.confluent.SchemaManager,
"SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
conf = getattr(getattr(j.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.url", schema_registry_url))
conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.topic", topic))
conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.id", "latest"))
conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.naming.strategy", naming_strategy))
schema_path = j.scala.Option.apply(None)
conf = j.scala.Option.apply(conf)
policy = getattr(j.za.co.absa.abris.avro.schemas.policy.SchemaRetentionPolicies, "RETAIN_SELECTED_COLUMN_ONLY$")()
data_frame = dataframe_deserializer.fromConfluentAvro("value", schema_path, conf, policy)
data_frame = DataFrame(data_frame, sql_context)
return data_frame
For that to work, you have to add the library to the spark packages, e.g.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' \
'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.1,' \
'org.apache.spark:spark-avro_2.11:2.4.1,' \
'za.co.absa:abris_2.11:2.2.2 ' \
'--repositories https://packages.confluent.io/maven/ ' \
'pyspark-shell'
#RvdV Great summary. I was trying Abris library and consuming CDC record generated by Debezium.
val abrisConfig: FromAvroConfig = (AbrisConfig
.fromConfluentAvro
.downloadReaderSchemaByLatestVersion
.andTopicNameStrategy(topicName)
.usingSchemaRegistry(schemaRegistryURL))
val df=(spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load())
val deserializedAvro = (df
.select(from_avro(col("value"), abrisConfig)
.as("data"))
.select(col("data.after.*")))
deserializedAvro.printSchema()
val query = (deserializedAvro
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", s"s3://$bucketName/checkpoints/$tableName")
.trigger(Trigger.ProcessingTime("60 seconds"))
.start())
I added column while the streaming job is running. I was expecting it to print the new col that I added. It did not. Does it not dynamically refresh the schema from the version information in the payload ?
Using Spark 2.4.0
Confluent schema-Registry to receive schema
The message Key is serialized in String and Value in Avro, thus I am trying to de-serialize just the Value using io.confluent.kafka.serializers.KafkaAvroDeserializer, but it isn't working. Can anyone review my code to see whats wrong
libraries imported:
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.common.serialization.Deserializer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{ Encoder, SparkSession}
Code Body
val topics = "test_topic"
val spark: SparkSession = SparkSession.builder
.config("spark.streaming.stopGracefullyOnShutdown", "true")
.config("spark.streaming.backpressure.enabled", "true")
.config("spark.streaming.kafka.maxRatePerPartition", 2170)
.config("spark.streaming.kafka.maxRetries", 1)
.config("spark.streaming.kafka.consumer.poll.ms", "600000")
.appName("SparkStructuredStreamAvro")
.config("spark.sql.streaming.checkpointLocation", "/tmp/new_checkpoint/")
.enableHiveSupport()
.getOrCreate
//add settings for schema registry url, used to get deser
val schemaRegUrl = "http://xx.xx.xx.xxx:xxxx"
val client = new CachedSchemaRegistryClient(schemaRegUrl, 100)
//subscribe to kafka
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xx.xx.xxxx")
.option("subscribe", "test.topic")
.option("kafka.startingOffsets", "latest")
.option("group.id", "use_a_separate_group_id_for_each_stream")
.load()
//add confluent kafka avro deserializer, needed to read messages appropriately
val deser = new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]]
//needed to convert column select into Array[Bytes]
import spark.implicits._
val results = df.select(col("value").as[Array[Byte]]).map { rawBytes: Array[Byte] =>
//read the raw bytes from spark and then use the confluent deserializer to get the record back
val decoded = deser.deserialize(topics, rawBytes)
val recordId = decoded.get("nameId").asInstanceOf[org.apache.avro.util.Utf8].toString
recordId
}
results.writeStream
.outputMode("append")
.format("text")
.option("path", "/tmp/path_new/")
.option("truncate", "false")
.start()
.awaitTermination()
spark.stop()
It fails to deserialize, and Error Received is
Caused by: java.io.NotSerializableException: io.confluent.kafka.serializers.KafkaAvroDeserializer
Serialization stack:
- object not serializable (class: io.confluent.kafka.serializers.KafkaAvroDeserializer, value: io.confluent.kafka.serializers.KafkaAvroDeserializer#591024db)
- field (class: ca.bell.wireless.ingest$$anonfun$1, name: deser$1, type: interface org.apache.kafka.common.serialization.Deserializer)
- object (class ca.bell.wireless.ingest$$anonfun$1, <function1>)
- element of array (index: 1)
It works perfectly fine when I write a normal kafka consumer (not through spark) using
props.put("key.deserializer", classOf[StringDeserializer])
props.put("value.deserializer", classOf[KafkaAvroDeserializer])
You defined the variable('deser') for KafkaAvroDeserializer outside the map block.
it makes that exception.
Try to change the code like this:
val brdDeser = spark.sparkContext.broadcast(new KafkaAvroDeserializer(client).asInstanceOf[Deserializer[GenericRecord]])
val results = df.select(col("value").as[Array[Byte]]).map { rawBytes: Array[Byte] =>
val deser = brdDeser.value
val decoded = deser.deserialize(topics, rawBytes)
val recordId = decoded.get("nameId").asInstanceOf[org.apache.avro.util.Utf8].toString
recordId
}
I'm using a Kafka Source in Spark Structured Streaming to receive Confluent encoded Avro records. I intend to use Confluent Schema Registry, but the integration with spark structured streaming seems to be impossible.
I have seen this question, but unable to get it working with the Confluent Schema Registry. Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)
It took me a couple months of reading source code and testing things out. In a nutshell, Spark can only handle String and Binary serialization. You must manually deserialize the data. In spark, create the confluent rest service object to get the schema. Convert the schema string in the response object into an Avro schema using the Avro parser. Next, read the Kafka topic as normal. Then map over the binary typed "value" column with the Confluent KafkaAvroDeSerializer. I strongly suggest getting into the source code for these classes because there is a lot going on here, so for brevity I'll leave out many details.
//Used Confluent version 3.2.2 to write this.
import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import org.apache.avro.Schema
case class DeserializedFromKafkaRecord(key: String, value: String)
val schemaRegistryURL = "http://127.0.0.1:8081"
val topicName = "Schema-Registry-Example-topic1"
val subjectValueName = topicName + "-value"
//create RestService object
val restService = new RestService(schemaRegistryURL)
//.getLatestVersion returns io.confluent.kafka.schemaregistry.client.rest.entities.Schema object.
val valueRestResponseSchema = restService.getLatestVersion(subjectValueName)
//Use Avro parsing classes to get Avro Schema
val parser = new Schema.Parser
val topicValueAvroSchema: Schema = parser.parse(valueRestResponseSchema.getSchema)
//key schema is typically just string but you can do the same process for the key as the value
val keySchemaString = "\"string\""
val keySchema = parser.parse(keySchemaString)
//Create a map with the Schema registry url.
//This is the only Required configuration for Confluent's KafkaAvroDeserializer.
val props = Map("schema.registry.url" -> schemaRegistryURL)
//Declare SerDe vars before using Spark structured streaming map. Avoids non serializable class exception.
var keyDeserializer: KafkaAvroDeserializer = null
var valueDeserializer: KafkaAvroDeserializer = null
//Create structured streaming DF to read from the topic.
val rawTopicMessageDF = sql.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "127.0.0.1:9092")
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 20) //remove for prod
.load()
//instantiate the SerDe classes if not already, then deserialize!
val deserializedTopicMessageDS = rawTopicMessageDF.map{
row =>
if (keyDeserializer == null) {
keyDeserializer = new KafkaAvroDeserializer
keyDeserializer.configure(props.asJava, true) //isKey = true
}
if (valueDeserializer == null) {
valueDeserializer = new KafkaAvroDeserializer
valueDeserializer.configure(props.asJava, false) //isKey = false
}
//Pass the Avro schema.
val deserializedKeyString = keyDeserializer.deserialize(topicName, row.key, keySchema).toString //topic name is actually unused in the source code, just required by the signature. Weird right?
val deserializedValueString = valueDeserializer.deserialize(topicName, row.value, topicValueAvroSchema).toString
DeserializedFromKafkaRecord(deserializedKeyString, deserializedValueString)
}
val deserializedDSOutputStream = deserializedTopicMessageDS.writeStream
.outputMode("append")
.format("console")
.option("truncate", false)
.start()
Disclaimer
This code was only tested on a local master, and has been reported runs into serializer issues in a clustered environment. There's an alternative solution (step 7-9, with Scala code in step 10) that extracts out the schema ids to columns, looks up each unique ID, and then uses schema broadcast variables, which will work better, at scale.
Also, there is an external library AbsaOSS/ABRiS that also addresses using the Registry with Spark
Since the other answer that was mostly useful was removed, I wanted to re-add it with some refactoring and comments.
Here are the dependencies needed. Code tested with Confluent 5.x and Spark 2.4
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-avro-serializer</artifactId>
<version>${confluent.version}</version>
<exclusions>
<!-- Conflicts with Spark's version -->
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
And here is the Scala implementation (only tested locally on master=local[*])
First section, define the imports, some fields, and a few helper methods to get schemas
import io.confluent.kafka.schemaregistry.client.{CachedSchemaRegistryClient, SchemaRegistryClient}
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.commons.cli.CommandLine
import org.apache.spark.sql._
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.streaming.OutputMode
object App {
private var schemaRegistryClient: SchemaRegistryClient = _
private var kafkaAvroDeserializer: AvroDeserializer = _
def lookupTopicSchema(topic: String, isKey: Boolean = false) = {
schemaRegistryClient.getLatestSchemaMetadata(topic + (if (isKey) "-key" else "-value")).getSchema
}
def avroSchemaToSparkSchema(avroSchema: String) = {
SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))
}
// ... continues below
Then define a simple main method that parses the CMD args to get Kafka details
def main(args: Array[String]): Unit = {
val cmd: CommandLine = parseArg(args)
val master = cmd.getOptionValue("master", "local[*]")
val spark = SparkSession.builder()
.appName(App.getClass.getName)
.master(master)
.getOrCreate()
val bootstrapServers = cmd.getOptionValue("bootstrap-server")
val topic = cmd.getOptionValue("topic")
val schemaRegistryUrl = cmd.getOptionValue("schema-registry")
consumeAvro(spark, bootstrapServers, topic, schemaRegistryUrl)
spark.stop()
}
// ... still continues
Then, the important method that consumes the Kafka topic and deserializes it
private def consumeAvro(spark: SparkSession, bootstrapServers: String, topic: String, schemaRegistryUrl: String): Unit = {
import spark.implicits._
// Setup the Avro deserialization UDF
schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
spark.udf.register("deserialize", (bytes: Array[Byte]) =>
kafkaAvroDeserializer.deserialize(bytes)
)
// Load the raw Kafka topic (byte stream)
val rawDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", topic)
.option("startingOffsets", "earliest")
.load()
// Deserialize byte stream into strings (Avro fields become JSON)
import org.apache.spark.sql.functions._
val jsonDf = rawDf.select(
// 'key.cast(DataTypes.StringType), // string keys are simplest to use
callUDF("deserialize", 'key).as("key"), // but sometimes they are avro
callUDF("deserialize", 'value).as("value")
// excluding topic, partition, offset, timestamp, etc
)
// Get the Avro schema for the topic from the Schema Registry and convert it into a Spark schema type
val dfValueSchema = {
val rawSchema = lookupTopicSchema(topic)
avroSchemaToSparkSchema(rawSchema)
}
// Apply structured schema to JSON stream
val parsedDf = jsonDf.select(
'key, // keys are usually plain strings
// values are JSONified Avro records
from_json('value, dfValueSchema.dataType).alias("value")
).select(
'key,
$"value.*" // flatten out the value
)
// parsedDf.printSchema()
// Sample schema output
// root
// |-- key: string (nullable = true)
// |-- header: struct (nullable = true) // Not a Kafka record "header". This is part of our value schema
// | |-- time: long (nullable = true)
// | ...
// TODO: Do something interesting with this stream
parsedDf.writeStream
.format("console")
.outputMode(OutputMode.Append())
.option("truncate", false)
.start()
.awaitTermination()
}
// still continues
The command line parser allows for passing in bootstrap servers, schema registry, topic name, and Spark master.
private def parseArg(args: Array[String]): CommandLine = {
import org.apache.commons.cli._
val options = new Options
val masterOption = new Option("m", "master", true, "Spark master")
masterOption.setRequired(false)
options.addOption(masterOption)
val bootstrapOption = new Option("b", "bootstrap-server", true, "Bootstrap servers")
bootstrapOption.setRequired(true)
options.addOption(bootstrapOption)
val topicOption = new Option("t", "topic", true, "Kafka topic")
topicOption.setRequired(true)
options.addOption(topicOption)
val schemaRegOption = new Option("s", "schema-registry", true, "Schema Registry URL")
schemaRegOption.setRequired(true)
options.addOption(schemaRegOption)
val parser = new BasicParser
parser.parse(options, args)
}
// still continues
In order for the UDF above to work, then there needed to be a deserializer to take the DataFrame of bytes to one containing deserialized Avro
// Simple wrapper around Confluent deserializer
class AvroDeserializer extends AbstractKafkaAvroDeserializer {
def this(client: SchemaRegistryClient) {
this()
// TODO: configure the deserializer for authentication
this.schemaRegistry = client
}
override def deserialize(bytes: Array[Byte]): String = {
val value = super.deserialize(bytes)
value match {
case str: String =>
str
case _ =>
val genericRecord = value.asInstanceOf[GenericRecord]
genericRecord.toString
}
}
}
} // end 'object App'
Put each of these blocks together, and it works in IntelliJ after adding -b localhost:9092 -s http://localhost:8081 -t myTopic to Run Configurations > Program Arguments
This is an example of my code integrating spark structured streaming with kafka and schema registry (code in scala)
import org.apache.spark.sql.SparkSession
import io.confluent.kafka.schemaregistry.client.rest.RestService // <artifactId>kafka-schema-registry</artifactId>
import org.apache.spark.sql.avro.from_avro // <artifactId>spark-avro_${scala.compat.version}</artifactId>
import org.apache.spark.sql.functions.col
object KafkaConsumerAvro {
def main(args: Array[String]): Unit = {
val KAFKA_BOOTSTRAP_SERVERS = "localhost:9092"
val SCHEMA_REGISTRY_URL = "http://localhost:8081"
val TOPIC = "transactions"
val spark: SparkSession = SparkSession.builder().appName("KafkaConsumerAvro").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest") // from starting
.load()
// Prints Kafka schema with columns (topic, offset, partition e.t.c)
df.printSchema()
// Create REST service to access schema registry and retrieve topic schema (latest)
val restService = new RestService(SCHEMA_REGISTRY_URL)
val valueRestResponseSchema = restService.getLatestVersion(TOPIC + "-value")
val jsonSchema = valueRestResponseSchema.getSchema
val transactionDF = df.select(
col("key").cast("string"), // cast to string from binary value
from_avro(col("value"), jsonSchema).as("transaction"), // convert from avro value
col("topic"),
col("offset"),
col("timestamp"),
col("timestampType"))
transactionDF.printSchema()
// Stream data to console for testing
transactionDF.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
}
}
When reading from kafka topic, we have this kind of schema:
key: binary | value: binary | topic: string | partition: integer | offset: long | timestamp: timestamp | timestampType: integer |
As we can see, key and value are binary so we need to cast key as string and in this case, value is avro formatted so we can achieve this by calling from_avro function.
In adition to Spark and Kafka dependencies, we need this dependencies:
<!-- READ AND WRITE AVRO DATA -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.compat.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- INTEGRATION WITH SCHEMA REGISTRY -->
<dependency>
<groupId>io.confluent</groupId>
<artifactId>kafka-schema-registry</artifactId>
<version>${confluent.version}</version>
</dependency>
This library will do the job for you. It connects to Confluent Schema Registry through Spark Structured Stream.
For Confluent, it copes with the schema id that is sent along with the payload.
In the README you will find a code snippet of how to do it.
DISCLOSURE: I work for ABSA and I developed this library.
Another very simple alternative for pyspark (without full support for schema registry like schema registration, compatibility check, etc.) could be:
import requests
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.avro.functions import *
# variables
topic = "my-topic"
schemaregistry = "http://localhost:8081"
kafka_brokers = "kafka1:9092,kafka2:9092"
# retrieve the latest schema
response = requests.get('{}/subjects/{}-value/versions/latest/schema'.format(schemaregistry, topic))
# error check
response.raise_for_status()
# extract the schema from the response
schema = response.text
# run the query
query = spark.readStream.format("kafka") \
.option("kafka.bootstrap.servers", kafka_brokers) \
.option("subscribe", topic) \
.load() \
# The magic goes here:
# Skip the first 5 bytes (reserved by schema registry encoding protocol)
.selectExpr("substring(value, 6) as avro_value") \
.select(from_avro(col("avro_value"), schema).alias("data")) \
.select(col("data.my_field")) \
.writeStream \
.format("console") \
.outputMode("complete") \
.start()
Databricks now provide this functionality but you have to pay for it :-(
dataDF
.select(
to_avro($"key", lit("t-key"), schemaRegistryAddr).as("key"),
to_avro($"value", lit("t-value"), schemaRegistryAddr).as("value"))
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", servers)
.option("topic", "t")
.save()
See:
https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html for more info
A good free alternative is ABRIS. See: https://github.com/AbsaOSS/ABRiS the only downside we can see that you need to provide a file of your avro schema at runtime so the framework can enforce this schema on your dataframe before it publishes it to the Kafka topic.
Based on #cricket_007's answers I created the following solution which could run in our cluster environment, including the following new features:
You need add broadcast variables to transfer some values into map operations for cluster environment. Neither Schema.Parser nor KafkaAvroDeserializer could be serialized in spark, so it is why you need initialize them in map operations
My structured streaming used foreachBatch output sink.
I applied org.apache.spark.sql.avro.SchemaConverters to convert avro schema format to spark StructType, so that you could use it in from_json column function to parse dataframe in Kafka topic fields (key and value).
Firstly, you need load some packages:
SCALA_VERSION="2.11"
SPARK_VERSION="2.4.4"
CONFLUENT_VERSION="5.2.2"
jars=(
"org.apache.spark:spark-sql-kafka-0-10_${SCALA_VERSION}:${SPARK_VERSION}" ## format("kafka")
"org.apache.spark:spark-avro_${SCALA_VERSION}:${SPARK_VERSION}" ## SchemaConverters
"io.confluent:kafka-schema-registry:${CONFLUENT_VERSION}" ## import io.confluent.kafka.schemaregistry.client.rest.RestService
"io.confluent:kafka-avro-serializer:${CONFLUENT_VERSION}" ## import io.confluent.kafka.serializers.KafkaAvroDeserializer
)
./bin/spark-shell --packages ${"${jars[*]}"// /,}
Here are the whole codes I tested in spark-shell:
import org.apache.avro.Schema
import io.confluent.kafka.serializers.KafkaAvroDeserializer
import io.confluent.kafka.schemaregistry.client.rest.RestService
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.avro.SchemaConverters
import scala.collection.JavaConverters._
import java.time.LocalDateTime
spark.sparkContext.setLogLevel("Error")
val brokerServers = "xxx.yyy.zzz:9092"
val topicName = "mytopic"
val schemaRegistryURL = "http://xxx.yyy.zzz:8081"
val restService = new RestService(schemaRegistryURL)
val exParser = new Schema.Parser
//-- For both key and value
val schemaNames = Seq("key", "value")
val schemaStrings = schemaNames.map(i => (i -> restService.getLatestVersion(s"$topicName-$i").getSchema)).toMap
val tempStructMap = schemaStrings.transform((k,v) => SchemaConverters.toSqlType(exParser.parse(v)).dataType)
val schemaStruct = new StructType().add("key", tempStructMap("key")).add("value", tempStructMap("value"))
//-- For key only
// val schemaStrings = restService.getLatestVersion(s"$topicName-key").getSchema
// val schemaStruct = SchemaConverters.toSqlType(exParser.parse(schemaStrings)).dataType
//-- For value only
// val schemaStrings = restService.getLatestVersion(s"$topicName-value").getSchema
// val schemaStruct = SchemaConverters.toSqlType(exParser.parse(schemaStrings)).dataType
val query = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load()
.writeStream
.outputMode("append")
//.option("checkpointLocation", s"cos://$bucket.service/checkpoints/$tableName")
.foreachBatch((batchDF: DataFrame, batchId: Long) => {
val bcTopicName = sc.broadcast(topicName)
val bcSchemaRegistryURL = sc.broadcast(schemaRegistryURL)
val bcSchemaStrings = sc.broadcast(schemaStrings)
val rstDF = batchDF.map {
row =>
val props = Map("schema.registry.url" -> bcSchemaRegistryURL.value)
//-- For both key and value
val isKeys = Map("key" -> true, "value" -> false)
val deserializers = isKeys.transform{ (k,v) =>
val des = new KafkaAvroDeserializer
des.configure(props.asJava, v)
des
}
//-- For key only
// val deserializer = new KafkaAvroDeserializer
// deserializer.configure(props.asJava, true)
//-- For value only
// val deserializer = new KafkaAvroDeserializer
// deserializer.configure(props.asJava, false)
val inParser = new Schema.Parser
//-- For both key and value
val values = bcSchemaStrings.value.transform( (k,v) =>
deserializers(k).deserialize(bcTopicName.value, row.getAs[Array[Byte]](k), inParser.parse(v)).toString)
s"""{"key": ${values("key")}, "value": ${values("value")} }"""
//-- For key only
// deserializer.deserialize(bcTopicName.value, row.getAs[Array[Byte]]("key"), inParser.parse(bcSchemaStrings.value)).toString
//-- For value only
// deserializer.deserialize(bcTopicName.value, row.getAs[Array[Byte]]("value"), inParser.parse(bcSchemaStrings.value)).toString
}
.select(from_json(col("value"), schemaStruct).as("root"))
.select("root.*")
println(s"${LocalDateTime.now} --- Batch $batchId: ${rstDF.count} rows")
rstDF.printSchema
rstDF.show(false)
})
.trigger(Trigger.ProcessingTime("60 seconds"))
.start()
query.awaitTermination()
Summarizing some of answer above and adding some of my own experience, those are the options at the time of writing:
3rd party Abris library. This is what we used initially, but it doesn't seem to support a permissive mode where you can drop malformed packages. It will crash the stream when it encounters a malformed message. If you can guarantee the message validity, that is okay, but it was an issue for us as it kept trying to parse the malformed message after stream restart.
Custom UDF which parses the Avro data, as outlined by OneCricketeer's answer. Gives the most flexibility but also requires the most custom code.
Using Databrick's from_avro variant which allows you to simply pass the URL, and it will find the right schema and parse it for you. Works really well, but only available in their environment, thus hard to test in a codebase.
Using Spark's built-in from_avro function. This functions allows you to pass a JSON schema and parse it from there. Only fix that you have to apply is that in Confluent's Wire format, there is one magic byte and 4 schema bytes before the actual Avro binary data starts as also pointed out in dudssource's answer. You can parse it like this in Scala:
val restService = new RestService(espConfig.schemaRegistryUrl)
val valueRestResponseSchema = restService.getVersion(espConfig.fullTopicName + "-value", schemaVersion)
valueRestResponseSchema.getSchema
streamDf
.withColumn("binary_data", substring(6, Int.MaxValue))
.withColumn("parsed_data", from_avr('binary_data, jsonSchema, Map("MODE" -> "PERMISSIVE")))
For anyone that want's to do this in pyspark: The library that felipe referenced worked nicely on the JVM for me before, so i wrote a small wrapper function that integrates it in python. This looks very hacky, because a lot of types that are implicit in the scala language have to be specified explicitly in py4j. Has been working nicely so far, though, even in spark 2.4.1.
def expand_avro(spark_context, sql_context, data_frame, schema_registry_url, topic):
j = spark_context._gateway.jvm
dataframe_deserializer = j.za.co.absa.abris.avro.AvroSerDe.DataframeDeserializer(data_frame._jdf)
naming_strategy = getattr(
getattr(j.za.co.absa.abris.avro.read.confluent.SchemaManager,
"SchemaStorageNamingStrategies$"), "MODULE$").TOPIC_NAME()
conf = getattr(getattr(j.scala.collection.immutable.Map, "EmptyMap$"), "MODULE$")
conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.url", schema_registry_url))
conf = getattr(conf, "$plus")(j.scala.Tuple2("schema.registry.topic", topic))
conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.id", "latest"))
conf = getattr(conf, "$plus")(j.scala.Tuple2("value.schema.naming.strategy", naming_strategy))
schema_path = j.scala.Option.apply(None)
conf = j.scala.Option.apply(conf)
policy = getattr(j.za.co.absa.abris.avro.schemas.policy.SchemaRetentionPolicies, "RETAIN_SELECTED_COLUMN_ONLY$")()
data_frame = dataframe_deserializer.fromConfluentAvro("value", schema_path, conf, policy)
data_frame = DataFrame(data_frame, sql_context)
return data_frame
For that to work, you have to add the library to the spark packages, e.g.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ' \
'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.1,' \
'org.apache.spark:spark-avro_2.11:2.4.1,' \
'za.co.absa:abris_2.11:2.2.2 ' \
'--repositories https://packages.confluent.io/maven/ ' \
'pyspark-shell'
#RvdV Great summary. I was trying Abris library and consuming CDC record generated by Debezium.
val abrisConfig: FromAvroConfig = (AbrisConfig
.fromConfluentAvro
.downloadReaderSchemaByLatestVersion
.andTopicNameStrategy(topicName)
.usingSchemaRegistry(schemaRegistryURL))
val df=(spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokerServers)
.option("subscribe", topicName)
.load())
val deserializedAvro = (df
.select(from_avro(col("value"), abrisConfig)
.as("data"))
.select(col("data.after.*")))
deserializedAvro.printSchema()
val query = (deserializedAvro
.writeStream
.format("console")
.outputMode("append")
.option("checkpointLocation", s"s3://$bucketName/checkpoints/$tableName")
.trigger(Trigger.ProcessingTime("60 seconds"))
.start())
I added column while the streaming job is running. I was expecting it to print the new col that I added. It did not. Does it not dynamically refresh the schema from the version information in the payload ?
We are trying to write spark streaming application that will write to the hdfs. However whenever we are writing the files lots of duplicates shows up. This behavior is with or without we crashing application using the kill. And also for both Dstream and Structured apis. The source is kafka topic. The behavior of checkpoint directory sounds very random. I have not come across very relevant information on the issue.
Question is: Can checkpoint directory provide exactly once behavior?
scala version: 2.11.8
spark version: 2.3.1.3.0.1.0-187
kafka version : 2.11-1.1.0
zookeeper version : 3.4.8-1
HDP : 3.1
Any help is appreciate.
Thanks,
Gautam
object sparkStructuredDownloading {
val kafka_brokers="kfk01.*.com:9092,kfk02.*.com:9092,kfk03.*.com:9092"
def main(args: Array[String]): Unit = {
var topic = args(0).trim().toString()
new downloadingAnalysis(kafka_brokers ,topic).process()
}
}
class downloadingAnalysis(brokers: String,topic: String) {
def process(): Unit = {
// try{
val spark = SparkSession.builder()
.appName("sparkStructuredDownloading")
// .appName("kafka_duplicate")
.getOrCreate()
spark.conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
println("Application Started")
import spark.implicits._
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.streaming.Trigger
val inputDf = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", topic)
.option("startingOffsets", "latest")
//.option("kafka.group.id", "testduplicate")
.load()
val personJsonDf = inputDf.selectExpr("CAST(value AS STRING)") //Converting binary to text
println("READ STREAM INITIATED")
import scala.concurrent.duration._
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.streaming.Trigger
import spark.implicits._
val filteredDF= personJsonDf.filter(line=> new ParseLogs().validateLogLine(line.get(0).toString()))
spark.sqlContext.udf.register("parseLogLine", (logLine: String) => {
val df1 = filteredDF.selectExpr("parseLogLine(value) as result")
println(df1.schema)
println("WRITE STREAM INITIATED")
val checkpoint_loc="/warehouse/test_duplicate/download/checkpoint1"
val kafkaOutput = result.writeStream
.outputMode("append")
.format("orc")
.option("path", "/warehouse/test_duplicate/download/data1")
.option("maxRecordsPerFile", 10)
.trigger(Trigger.ProcessingTime("10 seconds"))
.start()
.awaitTermination()
}