How to parse an XML coming from Kafka topic via Spark Streaming? - apache-spark

I want to parse XML coming from a Kafka topic using Spark Streaming.
com.databricks:spark-xml_2.10:0.4.1 is able to parse XML, but only from files in HDFS.
Already tried with the library com.databricks:spark-xml_2.10:0.4.1:
val df = spark.read.format("com.databricks.spark.xml").option("rowTag", "ServiceRequest").load("/tmp/sanal/gems/gem_opr.xml")
Desired results:
1) Take the stream in Spark
2) Parse the XML stream in the output

You can use the com.databricks.spark.xml.XmlReader.xmlRdd(spark: SparkSession, xmlRDD: RDD[String]): DataFrame method to read XML from an RDD<String>. For example (Java):
import com.databricks.spark.xml.XmlReader;

// setting up sample data
List<ConsumerRecord<String, String>> recordsList = new ArrayList<>();
recordsList.add(new ConsumerRecord<String, String>("topic", 1, 0, "key",
    "<?xml version=\"1.0\"?><catalog><book id=\"bk101\"><genre>Computer</genre></book></catalog>"));
JavaRDD<ConsumerRecord<String, String>> rdd = jsc.parallelize(recordsList); // jsc is the JavaSparkContext
// map the ConsumerRecord RDD to an RDD of raw XML strings
JavaRDD<String> xmlRdd = rdd.map(r -> r.value());
// read the XML RDD into a DataFrame
Dataset<Row> df = new XmlReader().xmlRdd(spark, xmlRdd.rdd());
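In a streaming job the same method can be applied batch by batch inside foreachRDD. A minimal Scala sketch, assuming a direct Kafka stream where ssc, kafkaParams and topics are placeholders for your own setup:
import com.databricks.spark.xml.XmlReader
import org.apache.spark.streaming.kafka010._

// placeholder stream setup; adjust kafkaParams/topics to your environment
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))

stream.foreachRDD { rdd =>
  // extract the raw XML payload of each Kafka record
  val xmlRdd = rdd.map(_.value())
  // parse this micro-batch of XML strings into a DataFrame
  // (additional parsing options such as a row tag can be set on the XmlReader if needed)
  val df = new XmlReader().xmlRdd(spark, xmlRdd)
  df.show(false)
}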

Related

Spark Structured Streaming to read nested Kafka Connect jsonConverter message

I have ingested an XML file using the Kafka Connect file-pulse connector 1.5.3.
Now I want to read it with Spark Streaming to parse/flatten it, as it is quite nested.
The string I read out of Kafka (I used the console consumer to read it out, and put a newline before the payload for illustration) looks like this:
{
"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"string","optional":true,"field":"city"},{"type":"array","items":{"type":"struct","fields":[{"type":"array","items":{"type":"struct","fields":[{"type":"string","optional":true,"field":"unit"},{"type":"string","optional":true,"field":"value"}],"optional":true,"name":"Value"},"optional":true,"field":"value"}],"optional":true,"name":"ForcedArrayType"},"optional":true,"field":"forcedArrayField"},{"type":"string","optional":true,"field":"lastField"}],"optional":true,"name":"Data","field":"data"}],"optional":true}
,"payload":{"data":{"city":"someCity","forcedArrayField":[{"value":[{"unit":"unitField1","value":"123"},{"unit":"unitField1","value":"456"}]}],"lastField":"2020-08-02T18:02:00"}}
}
The data type (schema) I attempted:
StructType schema = new StructType();
schema = schema.add( "schema", StringType, false);
schema = schema.add( "payload", StringType, false);
StructType Data = new StructType();
StructType ValueArray = new StructType(new StructField[]{
new StructField("unit", StringType,true,Metadata.empty()),
new StructField("value", StringType,true,Metadata.empty())
});
StructType ForcedArrayType = new StructType(new StructField[]{
new StructField("valueArray", ValueArray,true,Metadata.empty())
});
Data = Data.add("city",StringType,true);
Data = Data.add("forcedArrayField",ForcedArrayType,true);
Data = Data.add("lastField",StringType,true);
StructType Record = new StructType();
Record = Record.add("data", Data, false);
The query I attempted:
//below worked for payload
Dataset<Row> parsePayload = lines
.selectExpr("cast (value as string) as json")
.select(functions.from_json(functions.col("json"), schema=schema).as("schemaAndPayload"))
.select("schemaAndPayload.payload").as("payload");
System.out.println(parsePayload.isStreaming());
//below makes the output empty:
Dataset<Row> parseValue = parsePayload.select(functions.from_json(functions.col("payload"), Record).as("cols"))
.select(functions.col("cols.data.city"));
//.select(functions.col("cols.*"));
StreamingQuery query = parseValue
.writeStream()
.format("console")
.outputMode(OutputMode.Append())
.start();
query.awaitTermination();
When I output the parsePayload stream, I can see the data (still in JSON structure), but when I try to select certain/all fields, like city above, the result is empty.
Help needed.
Is the cause that the data type is defined wrong, or is the query wrong?
PS:
At the console, when I output 'parsePayload' instead of 'parseValue', it displays some data, which made me think the 'payload' part worked:
|{"data":{"city":"...|
...
Your schema definition does not seem to be fully correct. I replicated your problem and was able to parse the JSON with the following schema:
val payloadSchema = new StructType()
  .add("data", new StructType()
    .add("city", StringType)
    .add("forcedArrayField", ArrayType(new StructType()
      .add("value", ArrayType(new StructType()
        .add("unit", StringType)
        .add("value", StringType)))))
    .add("lastField", StringType))
To then access the individual fields, I used the following selection:
val parsePayload = df
.selectExpr("cast (value as string) as json")
.select(functions.from_json(functions.col("json"), schema).as("schemaAndPayload"))
.select("schemaAndPayload.payload").as("payload")
.select(functions.from_json(functions.col("payload"), payloadSchema).as("cols"))
.select(col("cols.data.city").as("city"), explode(col("cols.data.forcedArrayField")).as("forcedArrayField"), col("cols.data.lastField").as("lastField"))
.select(col("city"), explode(col("forcedArrayField.value").as("middleFields")), col("lastField"))
This gives the output
+--------+-----------------+-------------------+
| city| col| lastField|
+--------+-----------------+-------------------+
|someCity|[unitField1, 123]|2020-08-02T18:02:00|
|someCity|[unitField1, 456]|2020-08-02T18:02:00|
+--------+-----------------+-------------------+
Your schema definition is wrong; payload and schema might not be a column/field.
Read the message as static JSON (spark.read.json) to get the schema, then use that schema in Structured Streaming.
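A minimal Scala sketch of that approach, assuming a sample message has been saved to /tmp/sample.json and that the bootstrap servers and topic name are placeholders:
import org.apache.spark.sql.functions.{col, from_json}

// infer the schema once from a static sample of the message
val staticDf = spark.read.json("/tmp/sample.json")
val inferredSchema = staticDf.schema

// reuse the inferred schema in the streaming query
val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "some-topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json(col("json"), inferredSchema).as("msg"))
  .select("msg.payload.data.*")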

Avro write java.sql.Timestamp conversion error

I need to write a timestamp to a Kafka partition and then read it back. I have defined an Avro schema for that:
{ "namespace":"sample",
"type":"record",
"name":"TestData",
"fields":[
{"name": "update_database_time", "type": "long", "logicalType": "timestamp-millis"}
]
}
However, I get a conversion error in the producer.send line:
java.lang.ClassCastException: java.sql.Timestamp cannot be cast to java.lang.Long
How can I fix this?
Here is the code for writing timestamp to Kafka:
val tmstpOffset = testDataDF
.select("update_database_time")
.orderBy(desc("update_database_time"))
.head()
.getTimestamp(0)
val avroRecord = new GenericData.Record(parseAvroSchemaFromFile("/avro-offset-schema.json"))
avroRecord.put("update_database_time", tmstpOffset)
val producer = new KafkaProducer[String, GenericRecord](kafkaParams().asJava)
val data = new ProducerRecord[String, GenericRecord]("app_state_test7", avroRecord)
producer.send(data)
Avro doesn't support timestamps directly; it represents them logically as a long. So you can convert the timestamp to a long and use it as shown below. The unix_timestamp() function is used for the conversion; if you have a specific date format, use the unix_timestamp(col, dateFormat) overload.
import org.apache.spark.sql.functions._

// convert the timestamp column to epoch milliseconds (a Long) so it fits the Avro long field
val tmstpOffset = testDataDF
  .select((unix_timestamp(col("update_database_time")) * 1000).as("update_database_time"))
  .orderBy(desc("update_database_time"))
  .head()
  .getLong(0)
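For completeness, a hedged sketch of the producer side with the value now being a plain Long (parseAvroSchemaFromFile and kafkaParams are the question's own helpers):
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.collection.JavaConverters._

val avroRecord = new GenericData.Record(parseAvroSchemaFromFile("/avro-offset-schema.json"))
// tmstpOffset is now epoch milliseconds (Long), so there is no Timestamp-to-Long cast failure
avroRecord.put("update_database_time", tmstpOffset)

val producer = new KafkaProducer[String, GenericRecord](kafkaParams().asJava)
producer.send(new ProducerRecord[String, GenericRecord]("app_state_test7", avroRecord))
producer.close()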

Is it possible to build spark code on fly and execute?

I am trying to create a generic function to read a CSV file using the Databricks CSV reader. But the options are not mandatory; they can differ based on my input JSON configuration file.
Example1 :
"ReaderOption":{
"delimiter":";",
"header":"true",
"inferSchema":"true",
"schema":"""some custome schema.."""
},
Example2:
"ReaderOption":{
"delimiter":";",
"schema":"""some custome schema.."""
},
Is it possible to construct the options, or the entire read statement, at runtime and run it in Spark?
Like below:
def readCsvWithOptions(): DataFrame=
{
val options:Map[String,String]= Map("inferSchema"->"true")
val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
.option(options)
.load(inputPath)
readDF
}
def readCsvWithOptions(): DataFrame=
{
val options:Map[String,String]= Map("inferSchema"->"true")
val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
.options(options)
.load(inputPath)
readDF
}
There is an options method which takes a Map of key/value pairs.
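As a hedged Scala sketch, the per-job "ReaderOption" block can be turned into a Map and passed straight to options, so only the keys present in the configuration are applied (readerOptions and inputPath are placeholders here):
import org.apache.spark.sql.{DataFrame, SparkSession}

// hypothetical helper: apply whatever reader options the configuration supplies
def readCsvWithOptions(spark: SparkSession, inputPath: String,
                       readerOptions: Map[String, String]): DataFrame = {
  spark.read
    .format("com.databricks.spark.csv")
    .options(readerOptions) // only the keys present in the config are applied
    .load(inputPath)
}

// usage: the option set differs per configuration file
val opts = Map("delimiter" -> ";", "header" -> "true", "inferSchema" -> "true")
val df = readCsvWithOptions(spark, "/data/input.csv", opts)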

How to write every record to multiple kafka topics in Spark Streaming 2.3.1?

How can I write every record to multiple Kafka topics in Spark Streaming 2.3.1? In other words, say I have 5 records and two output Kafka topics; I want all 5 records in both output topics.
The question linked here doesn't cover the Structured Streaming case. I am looking specifically for Structured Streaming.
Not sure if you are using Java or Scala. Below is Scala code that produces each message to two different topics; you'll have to call foreachPartition on your dataset:
dataset.foreachPartition(partitionRows => {
  // one producer per partition, created on the executor
  val props = new util.HashMap[String, Object]()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServer)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  partitionRows.foreach(row => {
    // value extracted from the row (unused in this example; "message" below is a placeholder payload)
    val offerId = row.get(0).toString.replace("[", "").replace("]", "")
    // send the same payload to both output topics
    val message1 = new ProducerRecord[String, String]("topic1", "message")
    producer.send(message1)
    val message2 = new ProducerRecord[String, String]("topic2", "message")
    producer.send(message2)
  })
  // flush and release the producer when the partition is done
  producer.close()
})
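For the Structured Streaming case specifically, a hedged sketch of an alternative: the built-in Kafka sink routes each row by its topic column when no fixed topic option is set, so the same records can be unioned once per target topic (topic names, servers, checkpoint path are placeholders, and the stream is assumed to already have a value column):
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.streaming.OutputMode

// duplicate the stream, tagging one copy per output topic
val toBothTopics = dataset
  .selectExpr("cast(value as string) as value").withColumn("topic", lit("topic1"))
  .union(dataset.selectExpr("cast(value as string) as value").withColumn("topic", lit("topic2")))

val query = toBothTopics.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/checkpoints/multi-topic")
  .outputMode(OutputMode.Append())
  .start()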

Trying to understand spark streaming flow

I have this piece of code:
val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()
The way I understand it, foreachRDD happens at the driver level? So basically this whole block of code:
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
is happening at the driver level? The sparkStreamingService.run(df) method basically does some transformations on the current DataFrame to yield a new DataFrame, and then calls another method (in another jar) which stores the DataFrame to Cassandra.
So if this all happens at the driver level, we are not utilizing the Spark executors. How can I make sure the executors are used to process each partition of the RDD in parallel?
My spark streaming service run method:
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
The invocation of foreachRDD does happen on the driver node. But, since we're operating at the RDD level, any transformation on it will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.
Since we don't know what your sparkStreamingService.run method is doing, we can't tell you about the locality of its execution.
The foreachRDD may run locally, but that only covers the setup; the RDD itself is a distributed collection, so the actual work is distributed.
To comment directly on the code from the docs:
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
Notice that the part of the code that is NOT based around the RDD is executed at the driver. It's the code built up using RDD that is distributed to the workers.
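The pattern that same guide recommends for pushing data out efficiently is to create the connection per partition on the workers; a sketch along those lines (createNewConnection stands in for whatever client you use):
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // executed at the worker: one connection per partition
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}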
Your code specifically is commented below:
//df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
//Since collect created a local collection then this is done on the driver
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
//This will run locally, creating a distributed record
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
//This will redistribute the work
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
//again, setting this up locally, to be run distributed
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Ultimately, you can probably rewrite this so it doesn't need the collect and keeps everything distributed, but that is for you, not Stack Overflow.
