I am trying to read XML data from a Kafka topic using Spark Structured Streaming.
I tried using the Databricks spark-xml package, but I got an error saying that this package does not support streamed reading. Is there any way I can extract XML data from a Kafka topic using Structured Streaming?
My current code:
df = spark \
    .readStream \
    .format("kafka") \
    .format('com.databricks.spark.xml') \
    .options(rowTag="MainElement") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option(subscribeType, "test") \
    .load()
The error:
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.UnsupportedOperationException: Data source com.databricks.spark.xml does not support streamed reading
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:234)
.format("kafka") \
.format('com.databricks.spark.xml') \
The last one, com.databricks.spark.xml, wins and becomes the streaming source (hiding Kafka as the source).
In other words, the above is equivalent to .format('com.databricks.spark.xml') alone.
As you may have experienced, the Databricks spark-xml package does not support streamed reading (i.e. it cannot act as a streaming source). The package is not for streaming.
Is there any way I can extract XML data from Kafka topic using structured streaming?
You are left with accessing and processing the XML yourself with a standard function or a UDF. There's no built-in support for streaming XML processing in Structured Streaming up to Spark 2.2.0.
That should not be a big deal anyway. In Scala, the code could look as follows.
val input = spark.
  readStream.
  format("kafka").
  ...
  load
val values = input.select('value cast "string")
val extractValuesFromXML = udf { (xml: String) => ??? }
val numbersFromXML = values.withColumn("number", extractValuesFromXML('value))
// print XMLs and numbers to the stdout
val q = numbersFromXML.
  writeStream.
  format("console").
  start
Another possible solution could be to write your own custom streaming Source that would deal with the XML format in def getBatch(start: Option[Offset], end: Offset): DataFrame. That is supposed to work.
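For reference, a bare-bones sketch of such a custom Source is shown below. It targets the internal Spark 2.x org.apache.spark.sql.execution.streaming.Source API; XmlStreamingSource is just a made-up name here, and the offset bookkeeping and the actual XML parsing are left out.

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.StructType

// skeleton only -- offset management and XML parsing are omitted
class XmlStreamingSource(sqlContext: SQLContext, userSchema: StructType) extends Source {

  // schema of the rows this source produces
  override def schema: StructType = userSchema

  // latest offset available, or None if no data has arrived yet
  override def getOffset: Option[Offset] = ???

  // parse the XML payloads for the (start, end] offset range into a DataFrame here
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???

  override def stop(): Unit = ()
}

You would also need a StreamSourceProvider so that .format(...) can pick the source up.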
import xml.etree.ElementTree as ET
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option(subscribeType, "test") \
    .load()
Then I wrote a Python UDF:
def parse(s):
    xml = ET.fromstring(s)
    ns = {'real_person': 'http://people.example.com',
          'role': 'http://characters.example.com'}
    actor_el = xml.find('real_person:actor', ns)
    if actor_el is not None:
        actor = actor_el.text
        role_el = xml.find('real_person:role', ns)
        if role_el is not None:
            role = role_el.text
            return actor + "|" + role
Register the UDF and apply it (the Kafka value column is binary, so it is cast to a string first):

from pyspark.sql.functions import udf, col, split

extractValuesFromXML = udf(parse)
XML_DF = df.withColumn("mergedCol", extractValuesFromXML(col("value").cast("string")))
AllCol_DF = XML_DF.withColumn("actorName", split(col("mergedCol"), "\\|").getItem(0)) \
    .withColumn("Role", split(col("mergedCol"), "\\|").getItem(1))
You cannot mix formats this way. The Kafka source is loaded as rows with a number of fields, such as key, value and topic, with the value column storing the payload as a binary type:
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:
...
value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
Parsing this content is the user's responsibility and cannot be delegated to other data sources. See for example my answer to How to read records in JSON format from Kafka using Structured Streaming?.
For XML you'll likely need a UDF (UserDefinedFunction), although you can try the Hive XPath functions first. You should also decode the binary data.
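For instance, a quick sketch, assuming df is the Kafka streaming DataFrame; the MainElement row tag comes from the question, while SomeField is only a placeholder path:

// cast the binary value to a string, then pull fields out with the built-in XPath functions
val parsed = df
  .selectExpr("CAST(value AS STRING) AS xml")
  .selectExpr("xpath_string(xml, '/MainElement/SomeField/text()') AS someField")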
The above approach seems to work, but it is not using the passed schema to parse the XML document.
If you print the relation schema, it is always:
INFO XmlToAvroConverter - .convert() : XmlRelation Schema ={} root
|-- fields: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- nullable: boolean (nullable = true)
| | |-- type: string (nullable = true)
|-- type: string (nullable = true)
For example, I am streaming the following XML documents from a Kafka topic:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Book>
<Author>John Doe</Author>
<Title>Test</Title>
<PublishedDate></PublishedDate>
</Book>
And here is the code I have to parse the XML into a DataFrame:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import com.databricks.spark.xml.XmlRelation

val kafkaValueAsStringDF = kafakDF.selectExpr("CAST(key AS STRING) msgKey", "CAST(value AS STRING) xmlString")

var parameters = collection.mutable.Map.empty[String, String]
parameters.put("rowTag", "Book")

kafkaValueAsStringDF.writeStream.foreachBatch {
  (batchDF: DataFrame, batchId: Long) =>
    val xmlStringDF: DataFrame = batchDF.selectExpr("xmlString")
    xmlStringDF.printSchema()
    val rdd: RDD[String] = xmlStringDF.as[String].rdd
    val relation = XmlRelation(
      () => rdd,
      None,
      parameters.toMap,
      xmlSchema)(spark.sqlContext)
    logger.info(".convert() : XmlRelation Schema ={} " + relation.schema.treeString)
}
.start()
.awaitTermination()
When I read the same XML documents from the file system or S3 using spark-xml, the schema is parsed as expected.
Thanks
Sateesh
You can use the SQL built-in functions xpath and the like to extract data from a nested XML structure that comes as the value of a Kafka message.
Given a nested XML like
<CofiResults>
<ExecutionTime>20201103153839</ExecutionTime>
<FilterClass>S</FilterClass>
<InputData>
<Finance>
<HeaderSegment>
<Version>6</Version>
<SequenceNb>1</SequenceNb>
</HeaderSegment>
</Finance>
</InputData>
</CofiResults>
you can then just use those SQL functions in your selectExpr statement as below:
df.readStream.format("kafka").options(...).load()
.selectExpr("CAST(value AS STRING) as value")
.selectExpr(
"xpath(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsArryString",
"xpath_long(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsLong",
"xpath_string(value, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsString",
"xpath_int(value, '/CofiResults/InputData/Finance/HeaderSegment/Version/text()') as VersionAsInt")
Remember that the xpath function will return an Array of Strings whereas you may find it more convenient to extract the value as String or even Long. Applying the code above in Spark 3.0.1 with a console sink stream will result in:
+-------------------------+-------------------+---------------------+------------+
|ExecutionTimeAsArryString|ExecutionTimeAsLong|ExecutionTimeAsString|VersionAsInt|
+-------------------------+-------------------+---------------------+------------+
|[20201103153839] |20201103153839 |20201103153839 |6 |
+-------------------------+-------------------+---------------------+------------+
I have a Spark Structured Streaming application connected to ActiveMQ. The application receives messages from a topic. These messages are in the form of an XML string. I want to extract information from this nested XML. How can I do this?
I referred to this post, but was not able to implement something similar in Scala.
XML Format:
<CofiResults>
<ExecutionTime>20201103153839</ExecutionTime>
<FilterClass>S </FilterClass>
<InputData format="something" id="someID"><ns2:FrdReq xmlns:ns2="http://someone.com">
<HeaderSegment xmlns="https://somelink.com">
<Version>6</Version>
<SequenceNb>1</SequenceNb>
</HeaderSegment>
.
.
.
My Code:
val df = spark.readStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
.option("brokerUrl", brokerUrl_)
.option("topic", topicName_)
.option("persistence", "memory")
.option("cleanSession", "true")
.option("username", username_)
.option("password", password_)
.load()
val payload_ = df.select('payload cast "string") // This payload IS the XMLString
Now I need to extract ExecutionTime, Version, and other fields from the above XML.
You can use the SQL built-in functions xpath and the like to extract data from a nested XML structure.
Given a nested XML like the following (for simplicity, I have omitted any tag attributes and namespaces)
<CofiResults>
<ExecutionTime>20201103153839</ExecutionTime>
<FilterClass>S</FilterClass>
<InputData>
<ns2>
<HeaderSegment>
<Version>6</Version>
<SequenceNb>1</SequenceNb>
</HeaderSegment>
</ns2>
</InputData>
</CofiResults>
you can then just use those SQL functions (without createOrReplaceTempView) in your selectExpr statement as below:
.selectExpr("CAST(payload AS STRING) as payload")
.selectExpr(
"xpath(payload, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsArryString",
"xpath_long(payload, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsLong",
"xpath_string(payload, '/CofiResults/ExecutionTime/text()') as ExecutionTimeAsString",
"xpath_int(payload, '/CofiResults/InputData/ns2/HeaderSegment/Version/text()') as VersionAsInt")
Remember that the xpath function will return an Array of Strings whereas you may find it more convenient to extract the value as String or even Long. Applying the code above in Spark 3.0.1 with a console sink stream will result in:
+-------------------------+-------------------+---------------------+------------+
|ExecutionTimeAsArryString|ExecutionTimeAsLong|ExecutionTimeAsString|VersionAsInt|
+-------------------------+-------------------+---------------------+------------+
|[20201103153839] |20201103153839 |20201103153839 |6 |
+-------------------------+-------------------+---------------------+------------+
Spark: 3.0.0
Scala: 2.12
confluent
I have a Spark Structured Streaming job and am looking for an example of writing DataFrames to Kafka in Protobuf format.
I read messages from PostgreSQL and, after doing all the transformations, have a DataFrame with key and value:
root
|-- key: string (nullable = true)
|-- value: binary (nullable = false)
Code to push messages to Kafka:
val kafkaOptions = Seq(
KAFKA_BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringSerializer",
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG -> "io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer",
"schema.registry.url" -> "http://localhost:8081",
"topic" -> "test_users"
)
tDF
.write
.format(KAFKA)
.options(kafkaOptions.toMap)
.save()
The message is posted in binary format, but I am not able to deserialize it as there is no schema registered in Confluent.
Is there a library that can simplify things for me, or some sample code I can refer to?
In Spark batch jobs I usually have a JSON data source written to a file and can use the corrupt-column features of the DataFrame reader to write the corrupt data out to a separate location, and another reader to write the valid data, both from the same job. (The data is written as Parquet.)
But in Spark Structured Streaming I'm first reading the stream in via Kafka as a string and then using from_json to get my DataFrame. from_json uses JsonToStructs, which uses a FailFast mode in the parser and does not return the unparsed string as a column in the DataFrame (see the note in the reference). So how can I write corrupt data that doesn't match my schema, and possibly invalid JSON, to another location using Structured Streaming?
Finally, in the batch case the same job can write both DataFrames, but Spark Structured Streaming requires special handling for multiple sinks. So, on Spark 2.3.1 (my current version), how should both the corrupt and the valid streams be written out properly?
Ref: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Expression-JsonToStructs.html
val rawKafkaDataFrame=spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", config.broker)
.option("kafka.ssl.truststore.location", path.toString)
.option("kafka.ssl.truststore.password", config.pass)
.option("kafka.ssl.truststore.type", "JKS")
.option("kafka.security.protocol", "SSL")
.option("subscribe", config.topic)
.option("startingOffsets", "earliest")
.load()
val jsonDataFrame = rawKafkaDataFrame.select(col("value").cast("string"))
// does not provide a corrupt column or way to work with corrupt
jsonDataFrame.select(from_json(col("value"), schema)).select("jsontostructs(value).*")
When you convert the string to JSON, if it cannot be parsed with the schema provided, from_json returns null. You can filter out the null values and select the string. Something like this:
val jsonDF = jsonDataFrame.withColumn("json", from_json(col("value"), schema))
val invalidJsonDF = jsonDF.filter(col("json").isNull).select("value")
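The valid rows can be selected the same way. Since a streaming query has a single sink, writing the two DataFrames to different locations needs either two separate queries (a rough sketch with placeholder paths is below) or foreachBatch on Spark 2.4+.

// rows that parsed successfully, flattened to the target schema
val validJsonDF = jsonDF.filter(col("json").isNotNull).select("json.*")

// each streaming query needs its own checkpoint location
invalidJsonDF.writeStream
  .format("parquet")
  .option("path", "/data/json/corrupt")                 // placeholder path
  .option("checkpointLocation", "/checkpoints/corrupt") // placeholder path
  .start()

validJsonDF.writeStream
  .format("parquet")
  .option("path", "/data/json/valid")
  .option("checkpointLocation", "/checkpoints/valid")
  .start()

Note that two independent queries over the same Kafka source will each consume the topic on their own.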
I was just trying to figure out the _corrupt_record equivalent for structured streaming as well. Here's what I came up with; hopefully it gets you closer to what you're looking for:
// add a status column to partition our output by
// optional: only keep the unparsed json if it was corrupt
// writes up to 2 subdirs: 'out.par/status=OK' and 'out.par/status=CORRUPT'
// additional status codes for validation of nested fields could be added in similar fashion
df.withColumn("struct", from_json($"value", schema))
.withColumn("status", when($"struct".isNull, lit("CORRUPT")).otherwise(lit("OK")))
.withColumn("value", when($"status" <=> lit("CORRUPT"), $"value"))
.write
.partitionBy("status")
.parquet("out.par")
I'm using Spark Structured Streaming as described on
this page.
I get the correct messages from the Kafka topic, but the value is in Avro format. Is there some way to deserialize the Avro records (something like the KafkaAvroDeserializer approach)?
Spark >= 2.4
You can use the from_avro function from the spark-avro library.
import org.apache.spark.sql.avro._
val schema: String = ???
df.withColumn("value", from_avro($"value", schema))
Spark < 2.4
Define a function which takes Array[Byte] (serialized object):
import scala.reflect.runtime.universe.TypeTag
def decode[T : TypeTag](bytes: Array[Byte]): T = ???
which will deserialize the Avro data and create an object that can be stored in a Dataset.
Create a udf based on the function:
val decodeUdf = udf(decode _)
Call the udf on value:
val df = spark
.readStream
.format("kafka")
...
.load()
df.withColumn("value", decodeUdf($"value"))
I have a Dataset[Row] where each row is a JSON string. I want to just print the JSON stream or count the JSON stream per batch.
Here is my code so far:
val ds = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", topicName)
  .option("checkpointLocation", hdfsCheckPointDir)
  .load()
val ds1 = ds.select(from_json(col("value").cast("string"), schema) as 'payload)
val ds2 = ds1.select($"payload.info")
val query = ds2.writeStream.outputMode("append").queryName("table").format("memory").start()
query.awaitTermination()
select * from table; -- I don't see anything and there are no errors. However, when I run my Kafka consumer separately (independent of Spark) I can see the data.
My question really is: what do I need to do to just print the data I am receiving from Kafka using Structured Streaming? The messages in Kafka are JSON-encoded strings, so I am converting them to a struct and eventually to a Dataset. I am using Spark 2.1.0.
val ds1 = ds.select(from_json(col("value").cast("string"), schema) as 'payload).select($"payload.*")
That will print your data on the console.
ds1.writeStream.format("console").option("truncate", "false").start().awaitTermination()
Always use something like awaitTermination() or Thread.sleep(...) in these types of situations.