Spark Structured Streaming: how to write to Kafka in Protobuf format

Spark: 3.0.0
Scala: 2.12
Confluent
I have a Spark Structured Streaming job and am looking for an example of writing DataFrames to Kafka in Protobuf format.
I read messages from PostgreSQL and, after doing all the transformations, have a DataFrame with key and value:
root
|-- key: string (nullable = true)
|-- value: binary (nullable = false)
Code to push messages to Kafka:
val kafkaOptions = Seq(
  KAFKA_BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
  ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringSerializer",
  ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG -> "io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer",
  "schema.registry.url" -> "http://localhost:8081",
  "topic" -> "test_users"
)
tDF
  .write
  .format(KAFKA)
  .options(kafkaOptions.toMap)
  .save()
The message is posted in binary format, but I am not able to deserialize it because there is no schema registered in Confluent.
Is there a library that can simplify things for me, or sample code I can refer to?
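One possible direction (not from the original post): Spark's Kafka sink ignores the key.serializer/value.serializer options and writes whatever bytes are in the value column, so the Confluent wire format (magic byte + schema id) has to be produced before the write, for example by calling KafkaProtobufSerializer inside a UDF. A minimal Scala sketch, assuming the pre-serialization columns (here id and name) are still available and a hypothetical generated Protobuf class UserProtos.User:
import org.apache.spark.sql.functions.{col, udf}
import io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer
import scala.collection.JavaConverters._
val toProtobufBytes = udf { (id: String, name: String) =>
  // Hypothetical generated class UserProtos.User; the serializer registers the
  // schema and prepends the magic byte + schema id that Confluent consumers expect.
  // (Created per call here only for brevity.)
  val serializer = new KafkaProtobufSerializer[UserProtos.User]()
  serializer.configure(Map("schema.registry.url" -> "http://localhost:8081").asJava, false)
  serializer.serialize("test_users", UserProtos.User.newBuilder().setId(id).setName(name).build())
}
tDF
  .select(col("key"), toProtobufBytes(col("id"), col("name")).as("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test_users")
  .save()
With the payload already in Confluent wire format, consumers configured with KafkaProtobufDeserializer and the same schema registry should be able to read it back.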

Related

kafka custom partitioner with pyspark structured streaming

I want to use a custom Kafka partitioner for my PySpark application, which reads from Kafka and pushes to another Kafka topic. Data is transformed from source to sink with PySpark processing. I want control over which partition data is pushed to, based on some key in the data/message. In the Spark Structured Streaming documentation I couldn't find any reference or example for such a use case. I am using Python processing along with PySpark, and confluent-kafka-python is used as the Kafka client there, but it also lacks documentation/examples for a custom partitioner.
Is a solution available to achieve this?
The Spark code below was tried with a partition column, but it does not push data according to that column.
df = spark.range(5)
df = (df
      .withColumn("topic", F.lit("test_temp"))
      .withColumn("partition", (F.col("id") % 2).cast("int"))
      .withColumn("key", F.lit("test"))
      .withColumn("value", F.lit("test_data"))
     ).select(["topic", "key", "value", "partition"])
df.printSchema()
(df.write.format("kafka")
    .partitionBy("partition")
    .option("kafka.bootstrap.servers", kafka_endpoint)
    # .option("topic", "test_temp")
    .save())
Output:
+---------+----+---------+---------+
| topic| key| value|partition|
+---------+----+---------+---------+
|test_temp|test|test_data| 0|
|test_temp|test|test_data| 1|
|test_temp|test|test_data| 0|
|test_temp|test|test_data| 1|
|test_temp|test|test_data| 0|
+---------+----+---------+---------+
root
|-- topic: string (nullable = false)
|-- key: string (nullable = false)
|-- value: string (nullable = false)
|-- partition: integer (nullable = true)
Kafka console consumer output:
./kafka-console-consumer --bootstrap-server <broker>:9092 --topic test_temp --partition 1
As written in the Structured Streaming documentation, you can either add an int-typed column named partition to the dataframe being written, which controls the partition Spark writes each row to, or you can add a JVM partitioner to the classpath.
If a “partition” column is not specified (or its value is null) then the partition is calculated by the Kafka producer. A Kafka partitioner can be specified in Spark by setting the kafka.partitioner.class option. If not present, Kafka default partitioner will be used.
Regarding the Confluent Python client - https://github.com/confluentinc/confluent-kafka-python/issues/1107
The kafka-python module, however, does support custom partitioners.
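For the JVM-partitioner route, here is a minimal Scala sketch (the package, class name and routing rule are illustrative, not from the post): implement Kafka's Partitioner interface, put the compiled jar on the Spark classpath, and point the sink at it with kafka.partitioner.class. The same option string works from PySpark, since the partitioner runs inside the JVM Kafka producer.
package com.example
import java.util.{Map => JMap}
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster
class KeyHashPartitioner extends Partitioner {
  override def configure(configs: JMap[String, _]): Unit = ()
  override def partition(topic: String, key: Any, keyBytes: Array[Byte],
                         value: Any, valueBytes: Array[Byte], cluster: Cluster): Int = {
    // Illustrative rule: route records by a stable hash of the key bytes.
    val numPartitions: Int = cluster.partitionCountForTopic(topic)
    if (keyBytes == null) 0 else math.abs(java.util.Arrays.hashCode(keyBytes)) % numPartitions
  }
  override def close(): Unit = ()
}
The sink then picks it up via an option (kafka_endpoint stands for the broker list from the question):
df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_endpoint)
  .option("kafka.partitioner.class", "com.example.KeyHashPartitioner")
  .option("topic", "test_temp")
  .save()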

Spark Structured streaming - reading timestamp from file using schema

I am working on a Structured Streaming job.
The data I am reading from files contains the timestamp (in millis), deviceId and a value reported by that device.
Multiple devices report data.
I am trying to write a job that aggregates (sums) values sent by all devices into tumbling windows of 1 minute.
The issue I am having is with the timestamp.
When I parse "timestamp" as a Long, the window function complains that it expects a timestamp type.
When I parse it as TimestampType, as in the snippet below, I get a scala.MatchError (the full exception can be seen below), and I am struggling to figure out why and what the correct way to handle it is.
// Create schema
StructType readSchema = new StructType()
        .add("value", "integer")
        .add("deviceId", "long")
        .add("timestamp", new TimestampType());
// Read data from file
Dataset<Row> inputDataFrame = sparkSession.readStream()
        .schema(readSchema)
        .parquet(path);
Dataset<Row> aggregations = inputDataFrame
        .groupBy(window(inputDataFrame.col("timestamp"), "1 minutes"),
                 inputDataFrame.col("deviceId"))
        .agg(sum("value"));
The exception:
org.apache.spark.sql.types.TimestampType@3eeac696 (of class org.apache.spark.sql.types.TimestampType)
scala.MatchError: org.apache.spark.sql.types.TimestampType@3eeac696 (of class org.apache.spark.sql.types.TimestampType)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1692)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:92)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:89)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:232)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:242)
at org.apache.spark.sql.streaming.DataStreamReader.parquet(DataStreamReader.scala:450)
Typically, when your timestamp is stored as a long in millis, you keep it as a long in the read schema and convert it into a timestamp type afterwards, as shown below (the MatchError above comes from instantiating TimestampType with new instead of using the DataTypes.TimestampType singleton):
// Create schema and keep column 'timestamp' as long
StructType readSchema = new StructType()
        .add("value", "integer")
        .add("deviceId", "long")
        .add("timestamp", "long");
// Read data from file
Dataset<Row> inputDataFrame = sparkSession.readStream()
        .schema(readSchema)
        .parquet(path);
// Convert the epoch-millis column into a proper timestamp type
// (requires import static org.apache.spark.sql.functions.expr and org.apache.spark.sql.types.DataTypes)
Dataset<Row> df1 = inputDataFrame.withColumn("new_timestamp", expr("timestamp/1000").cast(DataTypes.TimestampType));
df1.show(false);
+-----+--------+-------------+-----------------------+
|value|deviceId|timestamp |new_timestamp |
+-----+--------+-------------+-----------------------+
|1 |1337 |1618836775397|2021-04-19 14:52:55.397|
+-----+--------+-------------+-----------------------+
df1.printSchema();
root
|-- value: integer (nullable = true)
|-- deviceId: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- new_timestamp: timestamp (nullable = true)
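With new_timestamp in place, the 1-minute tumbling-window aggregation from the question can group on that column instead of the raw long. A Scala sketch of the same logic (column names taken from the question; the watermark duration is an assumption and is only needed for append-mode output):
import org.apache.spark.sql.functions.{col, sum, window}
// Sum the values reported by each device in tumbling 1-minute windows.
val aggregations = df1
  .withWatermark("new_timestamp", "5 minutes")
  .groupBy(window(col("new_timestamp"), "1 minute"), col("deviceId"))
  .agg(sum("value").as("total_value"))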

Does Spark 2.2.0 support Streaming Self-Joins?

I understand that joins of two different streaming DataFrames are not supported in Spark 2.2.0, but I am trying to do a self-join, so there is only one stream. Below is my code:
val jdf = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "join_test")
.option("startingOffsets", "earliest")
.load();
jdf.printSchema
which prints the following:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Now I run the join query below after reading through this SO post
jdf.as("jdf1").join(jdf.as("jdf2"), $"jdf1.key" === $"jdf2.key")
And I get the following Exception
org.apache.spark.sql.AnalysisException: cannot resolve '`jdf1.key`' given input columns: [timestamp, value, partition, timestampType, topic, offset, key];;
'Join Inner, ('jdf1.key = 'jdf2.key)
:- SubqueryAlias jdf1
: +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]
+- SubqueryAlias jdf2
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]
I don't think it makes any difference whether you join the same streaming DataFrame with itself or join two different ones: stream-stream joins are simply not supported in Spark 2.2.0.
There are two ways to work around this.
First, you can join a static and a streaming DataFrame: read the topic once as batch data and again as a streaming DataFrame.
Second, you can use Kafka Streams, which supports joining streaming data.
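A minimal sketch of the first workaround (a stream-static join), using the topic and key column from the question; Kafka batch queries are available since Spark 2.2.0:
// Read the same topic once as a *batch* DataFrame.
val staticDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .selectExpr("key", "value AS static_value")
// Joining the streaming DataFrame jdf with the static one is supported,
// unlike the stream-stream self-join above.
val joined = jdf.join(staticDf, Seq("key"))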

How to read streaming data in XML format from Kafka?

I am trying to read XML data from Kafka topic using Spark Structured streaming.
I tried using the Databricks spark-xml package, but I got an error saying that this package does not support streamed reading. Is there any way I can extract XML data from Kafka topic using structured streaming?
My current code:
df = spark \
.readStream \
.format("kafka") \
.format('com.databricks.spark.xml') \
.options(rowTag="MainElement")\
.option("kafka.bootstrap.servers", "localhost:9092") \
.option(subscribeType, "test") \
.load()
The error:
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.UnsupportedOperationException: Data source com.databricks.spark.xml does not support streamed reading
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:234)
.format("kafka") \
.format('com.databricks.spark.xml') \
The last one with com.databricks.spark.xml wins and becomes the streaming source (hiding Kafka as the source).
In other words, the above is equivalent to .format('com.databricks.spark.xml') alone.
As you may have experienced, the Databricks spark-xml package does not support streaming reading (i.e. cannot act as a streaming source). The package is not for streaming.
Is there any way I can extract XML data from Kafka topic using structured streaming?
You are left with accessing and processing the XML yourself with a standard function or a UDF. There's no built-in support for streaming XML processing in Structured Streaming up to Spark 2.2.0.
That should not be a big deal anyway. The Scala code could look as follows:
val input = spark.
readStream.
format("kafka").
...
load
val values = input.select('value cast "string")
val extractValuesFromXML = udf { (xml: String) => ??? }
val numbersFromXML = values.withColumn("number", extractValuesFromXML('value))
// print XMLs and numbers to the stdout
val q = numbersFromXML.
writeStream.
format("console").
start
Another possible solution could be to write your own custom streaming Source that would deal with the XML format in def getBatch(start: Option[Offset], end: Offset): DataFrame. That is supposed to work.
import xml.etree.ElementTree as ET
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option(subscribeType, "test") \
.load()
Then I wrote a Python UDF:
def parse(s):
    xml = ET.fromstring(s)
    ns = {'real_person': 'http://people.example.com',
          'role': 'http://characters.example.com'}
    actor, role = None, None
    # The prefix passed to find() must be one declared in the ns map.
    actor_el = xml.find('real_person:actor', ns)
    if actor_el is not None:
        actor = actor_el.text
        role_el = xml.find('real_person:role', ns)
        if role_el is not None:
            role = role_el.text
    return "{}|{}".format(actor, role)
Register this UDF and split the merged column:
from pyspark.sql.functions import col, split, udf
extractValuesFromXML = udf(parse)
xml_DF = df.withColumn("mergedCol", extractValuesFromXML(col("value").cast("string")))
AllCol_DF = xml_DF.withColumn("actorName", split(col("mergedCol"), "\\|").getItem(0)) \
    .withColumn("Role", split(col("mergedCol"), "\\|").getItem(1))
You cannot mix formats this way. The Kafka source is loaded as rows with a number of fields, such as key, value and topic, with the value column storing the payload as a binary type:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
...
value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
Parsing this content is the user's responsibility and cannot be delegated to other data sources. See for example my answer to How to read records in JSON format from Kafka using Structured Streaming?.
For XML you'll likely need a UDF (UserDefinedFunction), although you can try the Hive XPath functions first. You should also decode the binary data.
It looks like the above approach works, but it is not using the passed schema to parse the XML document.
If you print the relation schema, it is always:
INFO XmlToAvroConverter - .convert() : XmlRelation Schema ={} root
|-- fields: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- nullable: boolean (nullable = true)
| | |-- type: string (nullable = true)
|-- type: string (nullable = true)
For example, I am streaming the following XML documents from a Kafka topic:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Book>
  <Author>John Doe</Author>
  <Title>Test</Title>
  <PublishedDate></PublishedDate>
</Book>
And here is the code I have to parse the XML into a DataFrame:
val kafkaValueAsStringDF = kafakDF.selectExpr("CAST(key AS STRING) msgKey", "CAST(value AS STRING) xmlString")
var parameters = collection.mutable.Map.empty[String, String]
parameters.put("rowTag", "Book")
kafkaValueAsStringDF.writeStream.foreachBatch {
  (batchDF: DataFrame, batchId: Long) =>
    val xmlStringDF: DataFrame = batchDF.selectExpr("xmlString")
    xmlStringDF.printSchema()
    val rdd: RDD[String] = xmlStringDF.as[String].rdd
    val relation = XmlRelation(
      () => rdd,
      None,
      parameters.toMap,
      xmlSchema)(spark.sqlContext)
    logger.info(".convert() : XmlRelation Schema ={} " + relation.schema.treeString)
}
  .start()
  .awaitTermination()
When I read the same XML documents from the file system or S3 and use spark-xml, it parses the schema as expected.
Thanks
Sateesh
You can use the SQL built-in functions xpath and the like to extract data from a nested XML structure that comes as the value of a Kafka message.
Given a nested XML like
<CofiResults>
  <ExecutionTime>20201103153839</ExecutionTime>
  <FilterClass>S</FilterClass>
  <InputData>
    <Finance>
      <HeaderSegment>
        <Version>6</Version>
        <SequenceNb>1</SequenceNb>
      </HeaderSegment>
    </Finance>
  </InputData>
</CofiResults>
you can then just use those SQL functions in your selectExpr statement as below:
spark.readStream.format("kafka").options(...).load()
  .selectExpr("CAST(value AS STRING) as value")
  .selectExpr(
    "xpath(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsArryString",
    "xpath_long(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsLong",
    "xpath_string(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsString",
    "xpath_int(value, '/CofiResults/InputData/Finance/HeaderSegment/Version/text()') as VersionAsInt")
Remember that the xpath function will return an Array of Strings whereas you may find it more convenient to extract the value as String or even Long. Applying the code above in Spark 3.0.1 with a console sink stream will result in:
+-------------------------+-------------------+---------------------+------------+
|ExecutionTimeAsArryString|ExecutionTimeAsLong|ExecutionTimeAsString|VersionAsInt|
+-------------------------+-------------------+---------------------+------------+
|[20201103153839] |20201103153839 |20201103153839 |6 |
+-------------------------+-------------------+---------------------+------------+

Why do Spark DataFrames not change their schema and what to do about it?

I'm using Spark 2.1's Structured Streaming to read from a Kafka topic whose contents are binary avro-encoded.
Thus, after setting up the DataFrame:
val messages = spark
.readStream
.format("kafka")
.options(kafkaConf)
.option("subscribe", config.getString("kafka.topic"))
.load()
If I print the schema of this DataFrame (messages.printSchema()), I get the following:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- timestampType: integer (nullable = true)
This question should be orthogonal to the problem of Avro decoding, but let's assume I want to somehow convert the value content of the messages DataFrame into a Dataset[BusinessObject], via a function Array[Byte] => BusinessObject. For the sake of completeness, the function may just be (using avro4s):
case class BusinessObject(userId: String, eventId: String)
def fromAvro(bytes: Array[Byte]): BusinessObject =
AvroInputStream.binary[BusinessObject](
new ByteArrayInputStream(bytes)
).iterator.next
Of course, as miguno says in this related question I cannot just apply the transformation with a DataFrame.map(), because I need to provide an implicit Encoder for such a BusinessObject.
That can be defined as:
implicit val myEncoder : Encoder[BusinessObject] = org.apache.spark.sql.Encoders.kryo[BusinessObject]
Now, perform the map:
val transformedMessages : Dataset[BusinessObject] = messages.map(row => fromAvro(row.getAs[Array[Byte]]("value")))
But if I query the new schema, I get the following:
root
|-- value: binary (nullable = true)
And I think that does not make any sense, as the dataset should use the Product properties of the BusinessObject case-class and get the correct values.
I've seen some examples on Spark SQL using .schema(StructType) in the reader, but I cannot do that, not just because I'm using readStream, but because I actually have to transform the column before being able to operate in such fields.
I am hoping to tell the Spark SQL engine that the transformedMessages Dataset schema is a StructField with the case class' fields.
I would say you get exactly what you ask for. As I already explained today, Encoders.kryo generates a blob with the serialized object. Its internal structure is opaque to the SQL engine and cannot be accessed without deserializing the object. So effectively all your code does is take one serialization format and replace it with another.
Another problem is that you are trying to mix a dynamically typed DataFrame (Dataset[Row]) with statically typed objects. Excluding the UDT API, Spark SQL doesn't work like this: you either use a statically typed Dataset, or a DataFrame whose object structure is encoded using a struct hierarchy.
The good news is that simple product types like BusinessObject should work just fine without any need for the clumsy Encoders.kryo. Just skip the Kryo encoder definition and be sure to import the implicit encoders:
import spark.implicits._
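With that import in scope, the same map call yields a Dataset whose schema follows the case class fields; a short sketch (names taken from the question):
import org.apache.spark.sql.Dataset
// No Encoders.kryo needed: the implicit product encoder derives the schema
// from the BusinessObject case class.
val transformedMessages: Dataset[BusinessObject] =
  messages.map(row => fromAvro(row.getAs[Array[Byte]]("value")))
transformedMessages.printSchema()
// root
//  |-- userId: string (nullable = true)
//  |-- eventId: string (nullable = true)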
