Spark Structured Streaming to read nested Kafka Connect jsonConverter message

I have ingested an XML file using the Kafka Connect file-pulse connector 1.5.3.
Then I want to read it with Spark Structured Streaming to parse/flatten it, as it is quite nested.
The string I read out of Kafka (I used the console consumer to read it out, and put a newline before the payload for illustration) is like below:
{
"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"string","optional":true,"field":"city"},{"type":"array","items":{"type":"struct","fields":[{"type":"array","items":{"type":"struct","fields":[{"type":"string","optional":true,"field":"unit"},{"type":"string","optional":true,"field":"value"}],"optional":true,"name":"Value"},"optional":true,"field":"value"}],"optional":true,"name":"ForcedArrayType"},"optional":true,"field":"forcedArrayField"},{"type":"string","optional":true,"field":"lastField"}],"optional":true,"name":"Data","field":"data"}],"optional":true}
,"payload":{"data":{"city":"someCity","forcedArrayField":[{"value":[{"unit":"unitField1","value":"123"},{"unit":"unitField1","value":"456"}]}],"lastField":"2020-08-02T18:02:00"}}
}
The data types I attempted:
StructType schema = new StructType();
schema = schema.add("schema", StringType, false);
schema = schema.add("payload", StringType, false);

StructType Data = new StructType();
StructType ValueArray = new StructType(new StructField[]{
        new StructField("unit", StringType, true, Metadata.empty()),
        new StructField("value", StringType, true, Metadata.empty())
});
StructType ForcedArrayType = new StructType(new StructField[]{
        new StructField("valueArray", ValueArray, true, Metadata.empty())
});
Data = Data.add("city", StringType, true);
Data = Data.add("forcedArrayField", ForcedArrayType, true);
Data = Data.add("lastField", StringType, true);

StructType Record = new StructType();
Record = Record.add("data", Data, false);
The query I attempted:
//below worked for payload
Dataset<Row> parsePayload = lines
.selectExpr("cast (value as string) as json")
.select(functions.from_json(functions.col("json"), schema=schema).as("schemaAndPayload"))
.select("schemaAndPayload.payload").as("payload");
System.out.println(parsePayload.isStreaming());
//below makes the output empty:
Dataset<Row> parseValue = parsePayload.select(functions.from_json(functions.col("payload"), Record).as("cols"))
.select(functions.col("cols.data.city"));
//.select(functions.col("cols.*"));
StreamingQuery query = parseValue
.writeStream()
.format("console")
.outputMode(OutputMode.Append())
.start();
query.awaitTermination();
When I output the parsePayload stream, I could see the data (still in JSON structure), but when I want to select a certain field, or all fields, like city above, it is empty.
Help needed.
Is the cause that the data type is defined wrong, or is the query wrong?
P.S.
At the console, when I tried to output parsePayload instead of parseValue, it displayed some data, which made me think the payload part worked:
|{"data":{"city":"...|
...

Your schema definition does not seem to be fully correct. I replicated your problem and was able to parse the JSON with the following schema:
val payloadSchema = new StructType()
  .add("data", new StructType()
    .add("city", StringType)
    .add("forcedArrayField", ArrayType(new StructType()
      .add("value", ArrayType(new StructType()
        .add("unit", StringType)
        .add("value", StringType)))))
    .add("lastField", StringType))
To then access the individual fields I used the following selection:
val parsePayload = df
.selectExpr("cast (value as string) as json")
.select(functions.from_json(functions.col("json"), schema).as("schemaAndPayload"))
.select("schemaAndPayload.payload").as("payload")
.select(functions.from_json(functions.col("payload"), payloadSchema).as("cols"))
.select(col("cols.data.city").as("city"), explode(col("cols.data.forcedArrayField")).as("forcedArrayField"), col("cols.data.lastField").as("lastField"))
.select(col("city"), explode(col("forcedArrayField.value").as("middleFields")), col("lastField"))
This gives the output
+--------+-----------------+-------------------+
| city| col| lastField|
+--------+-----------------+-------------------+
|someCity|[unitField1, 123]|2020-08-02T18:02:00|
|someCity|[unitField1, 456]|2020-08-02T18:02:00|
+--------+-----------------+-------------------+

Your schema definition is wrong.
payload and schema might not be a column/field.
Read the message as static JSON (spark.read.json) to get the schema, then use that schema in Structured Streaming.
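A minimal sketch of that approach (assuming lines is the streaming Dataset from the question, spark is the active SparkSession, and one full Kafka message has been saved to a file; the path is illustrative):
import org.apache.spark.sql.functions.{col, from_json}
// infer the schema once from a saved sample message (batch read), then reuse it on the stream
val sampleDF = spark.read.json("/path/to/sample_kafka_message.json")
val inferredSchema = sampleDF.schema
val parsed = lines
  .selectExpr("cast(value as string) as json")
  .select(from_json(col("json"), inferredSchema).as("msg"))
  .select("msg.payload.data.*") // flatten down to the nested data fields
The inferred schema treats forcedArrayField (and the inner value field) as array types, which is what the hand-written schema above was missing.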

Related

How to parse an XML coming from Kafka topic via Spark Streaming?

I want to parse XML coming from a Kafka topic using Spark Streaming.
com.databricks:spark-xml_2.10:0.4.1 is able to parse XML, but only from files in HDFS.
I already tried it with that library:
val df = spark.read.format("com.databricks.spark.xml").option("rowTag", "ServiceRequest").load("/tmp/sanal/gems/gem_opr.xml")
Expected results:
1) Take the stream in Spark
2) Parse the XML stream in the output
You can use the com.databricks.spark.xml.XmlReader.xmlRdd(spark: SparkSession, xmlRDD: RDD[String]): DataFrame method to read XML from an RDD&lt;String&gt;. For example:
import com.databricks.spark.xml.XmlReader;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.ArrayList;
import java.util.List;

// setting up sample data
List<ConsumerRecord<String, String>> recordsList = new ArrayList<>();
recordsList.add(new ConsumerRecord<String, String>("topic", 1, 0, "key",
    "<?xml version=\"1.0\"?><catalog><book id=\"bk101\"><genre>Computer</genre></book></catalog>"));
// parallelize through the JavaSparkContext (SparkSession itself has no parallelize method)
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<ConsumerRecord<String, String>> rdd = jsc.parallelize(recordsList);
// map the ConsumerRecord RDD to an RDD of the raw XML strings
JavaRDD<String> xmlRdd = rdd.map(r -> r.value());
// read the XML RDD into a DataFrame (xmlRdd.rdd() converts the JavaRDD to an RDD of String)
Dataset<Row> df = new XmlReader().xmlRdd(spark, xmlRdd.rdd());
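For the streaming case the same call can go inside foreachRDD, once per micro-batch. A rough Scala sketch (assuming a direct stream of (String, String) pairs named stream, as returned by KafkaUtils.createDirectStream, and an existing SparkSession spark):
import com.databricks.spark.xml.XmlReader
stream.foreachRDD { rdd =>
  // keep only the record values, which hold the raw XML documents
  val xmlRdd = rdd.map(_._2)
  if (!xmlRdd.isEmpty()) {
    // parse the micro-batch into a DataFrame with the same XmlReader call as above
    val df = new XmlReader().xmlRdd(spark, xmlRdd)
    df.show()
  }
}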

How to set variables in "Where" clause when reading cassandra table by spark streaming?

I'm doing some statistics using Spark Streaming and Cassandra. When I read Cassandra tables with spark-cassandra-connector and turn the Cassandra row RDD into a DStream via ConstantInputDStream, the "currentDate" variable in the where clause stays at the day the program started.
The purpose is to analyze the total score by some dimensions up to the current date, but right now the code only analyzes data up to the day it started running. I ran the code on 2019-05-25, and data inserted into the table after that time cannot be taken in.
The code I use is like below:
class TestJob extends Serializable {
  def test(ssc: StreamingContext): Unit = {
    val readTableRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
      .select(
        "code",
        "date",
        "time",
        "score"
      ).where("date <= ?", new Utils().getCurrentDate())
    val DStreamRdd = new ConstantInputDStream(ssc, readTableRdd)
    DStreamRdd.foreachRDD { r =>
      //DO SOMETHING
    }
  }
}

object GetSSC extends Serializable {
  def getSSC(): StreamingContext = {
    val conf = new SparkConf()
      .setMaster(Configurations.getInstance().sparkHost)
      .setAppName(Configurations.getInstance().appName)
      .set("spark.cassandra.connection.host", Configurations.getInstance().casHost)
      .set("spark.cleaner.ttl", "3600")
      .set("spark.default.parallelism", "3")
      .set("spark.ui.port", "5050")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    @transient lazy val ssc = new StreamingContext(sc, Seconds(30))
    ssc
  }
}

object Main {
  val logger: Log = LogFactory.getLog(Main.getClass)
  def main(args: Array[String]): Unit = {
    val ssc = GetSSC.getSSC()
    try {
      new TestJob().test(ssc)
      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        logger.error(Main.getClass.getSimpleName + " error: " + e.printStackTrace())
    }
  }
}
The table used in this demo:
CREATE TABLE test.test_table (
code text PRIMARY KEY, //UUID
date text, // '20190520'
time text, // '12:00:00'
score int); // 90
Any help is appreciated!
In general, the RDDs returned by the Spark Cassandra Connector aren't streaming RDDs; Cassandra has no functionality that would let you subscribe to a change feed and analyze it. You can implement something similar by explicitly looping and fetching the data, but it requires careful design of the tables, and it's hard to say more without digging deeper into the requirements for latency, amount of data, etc.
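One way to do that explicit looping while keeping the DStream structure from the question is to rebuild the Cassandra RDD inside foreachRDD, so the date predicate is re-evaluated on every batch instead of once when the streaming graph is constructed. A rough sketch, reusing the question's Configurations, Constants and Utils helpers:
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ConstantInputDStream

def test(ssc: StreamingContext): Unit = {
  // a one-element placeholder stream that just fires once per batch interval
  val ticks = new ConstantInputDStream(ssc, ssc.sparkContext.parallelize(Seq(1)))
  ticks.foreachRDD { _ =>
    // recomputed for every batch, so it tracks the current date
    val currentDate = new Utils().getCurrentDate()
    val scoreRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
      .select("code", "date", "time", "score")
      .where("date <= ?", currentDate)
    //DO SOMETHING with scoreRdd
  }
}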

Avro write java.sql.Timestamp conversion error

I need to write a timestamp to a Kafka partition and then read it back. I have defined an Avro schema for that:
{ "namespace":"sample",
"type":"record",
"name":"TestData",
"fields":[
{"name": "update_database_time", "type": "long", "logicalType": "timestamp-millis"}
]
}
However, I get a conversion error in the producer.send line:
java.lang.ClassCastException: java.sql.Timestamp cannot be cast to java.lang.Long
How can I fix this?
Here is the code for writing the timestamp to Kafka:
val tmstpOffset = testDataDF
.select("update_database_time")
.orderBy(desc("update_database_time"))
.head()
.getTimestamp(0)
val avroRecord = new GenericData.Record(parseAvroSchemaFromFile("/avro-offset-schema.json"))
avroRecord.put("update_database_time", tmstpOffset)
val producer = new KafkaProducer[String, GenericRecord](kafkaParams().asJava)
val data = new ProducerRecord[String, GenericRecord]("app_state_test7", avroRecord)
producer.send(data)
Avro doesn't support timestamps directly; they are represented logically on top of long. So you can convert the value to a long and use it as below. The unix_timestamp() function is used for the conversion; if you have a specific date format, use the unix_timestamp(col, dateFormat) overload.
import org.apache.spark.sql.functions._
val tmstpOffset = testDataDF
  .select((unix_timestamp(col("update_database_time")) * 1000).as("update_database_time"))
  .orderBy(desc("update_database_time"))
  .head()
  .getLong(0) // the column is now a long (epoch millis), not a timestamp
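Alternatively, since the question's code already ends up with a java.sql.Timestamp, a simpler option (a sketch, not part of the answer above) is to convert it to epoch milliseconds yourself before putting it into the record:
// tmstpOffset is the java.sql.Timestamp from the question's original code;
// getTime returns epoch milliseconds, matching the long / timestamp-millis Avro field
avroRecord.put("update_database_time", tmstpOffset.getTime)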

NPE in Spark Java toBytes(row.<Long>getAs("value2"))

Why does the following generate an NPE?
final SparkSession spark = SparkSession.builder()
.appName(" test")
.getOrCreate();
spark.createDataFrame(singletonList(new GenericRow(new Object[] { 1L, null, 3L })),
new StructType(new StructField[] { DataTypes.createStructField("value1", DataTypes.LongType, true),
DataTypes.createStructField("value2", DataTypes.LongType, true),
DataTypes.createStructField("value3", DataTypes.LongType, true)}))
.foreach((ForeachFunction<Row>) row -> {
System.out.println("###" + row.getAs("value1"));
System.out.println(row.<Long>getAs("value2"));
System.out.println(toBytes(row.<Long>getAs("value2")));
System.out.println("###" + row.getAs("value3"));
});
I think this does not occur in Spark 1.6, but I'm unsure; it could just be better test data.
The line
System.out.println(toBytes(row.<Long>getAs("value2")));
contains the call
row.<Long>getAs("value2")
which returns a null Long object, but
toBytes(long l)
wants a primitive long, so Java tries to unbox the null Long into a long => NPE, as shown in this answer.
To protect against this we can use Java's Optional:
toBytes(Optional.ofNullable(row.<Long>getAs(name)).orElse(0L));

Trying to understand spark streaming flow

I have this piece of code:
val lines: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
ssc.start()
ssc.awaitTermination()
The way I understand it, foreachRDD happens at the driver level? So basically this whole block of code:
lines.foreachRDD { rdd =>
val df = cassandraSQLContext.read.json(rdd.map(x => x._2))
sparkStreamingService.run(df)
}
is happening at the driver level? The sparkStreamingService.run(df) method basically does some transformations on the current dataframe to yield a new dataframe, and then calls another method (in another jar) which stores the dataframe to Cassandra.
So if this is all happening at the driver level, we are not utilizing the Spark executors. How can I make it so that the executors are used in parallel to process each partition of the RDD?
My Spark streaming service's run method:
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
The invocation of foreachRDD does happen on the driver node. But, since we're operating at the RDD level, any transformation on it will be distributed. In your example, rdd.map will cause each partition to be sent to a particular worker node for computation.
Since we don't know what your sparkStreamingService.run method is doing, we can't tell you about the locality of its execution.

The foreachRDD call may run locally, but that is just the setup. The RDD itself is a distributed collection, so the actual work is distributed.
To comment directly on the code from the docs:
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
Notice that the part of the code that is NOT based around the RDD is executed at the driver. It's the code built up using RDD that is distributed to the workers.
Your code specifically is commented below:
//df.select will be distributed, but collect will pull it all back in
var metadataDataframe = df.select("customer", "tableName", "messageContent", "initialLoadRunning").collect()
//Since collect created a local collection then this is done on the driver
metadataDataframe.foreach(rowD => {
metaData = populateMetaDataService.populateSiteMetaData(rowD)
val headers = (rowD.getString(2).split(recordDelimiter)(0))
val fields = headers.split("\u0001").map(
fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val listOfRawData = rowD.getString(2).indexOf(recordDelimiter)
val dataWithoutHeaders = rowD.getString(2).substring(listOfRawData + 1)
//This will run locally, creating a distributed record
val rawData = sparkContext.parallelize(dataWithoutHeaders.split(recordDelimiter))
// val rawData = dataWithoutHeaders.split(recordDelimiter)
//This will redistribute the work
val rowRDD = rawData
.map(_.split("\u0001"))
.map(attributes => Row(attributes: _*))
//again, setting this up locally, to be run distributed
val newDF = cassandraSQLContext.createDataFrame(rowRDD, schema)
dataFrameFilterService.processBasedOnOpType(metaData, newDF)
})
Ultimately, you can probably rewrite this so it doesn't need the collect and keeps everything distributed, but that is for you, not Stack Overflow.
