Avro write java.sql.Timestamp conversion error - apache-spark

I need to write a timestamp to a Kafka partition and then read it back. I have defined an Avro schema for that:
{
  "namespace": "sample",
  "type": "record",
  "name": "TestData",
  "fields": [
    {"name": "update_database_time", "type": "long", "logicalType": "timestamp-millis"}
  ]
}
However, I get a conversion error in the producer.send line:
java.lang.ClassCastException: java.sql.Timestamp cannot be cast to java.lang.Long
How can I fix this?
Here is the code for writing timestamp to Kafka:
val tmstpOffset = testDataDF
  .select("update_database_time")
  .orderBy(desc("update_database_time"))
  .head()
  .getTimestamp(0)

val avroRecord = new GenericData.Record(parseAvroSchemaFromFile("/avro-offset-schema.json"))
avroRecord.put("update_database_time", tmstpOffset)

val producer = new KafkaProducer[String, GenericRecord](kafkaParams().asJava)
val data = new ProducerRecord[String, GenericRecord]("app_state_test7", avroRecord)
producer.send(data)

Avro doesn't support timestamps directly; they are represented logically as long. So convert the value to a long and use it as below. The unix_timestamp() function does the conversion; if you have a specific date format, use the unix_timestamp(col, format) overload.

import org.apache.spark.sql.functions._

val tmstpOffset = testDataDF
  .select((unix_timestamp(col("update_database_time")) * 1000).as("update_database_time"))
  .orderBy(desc("update_database_time"))
  .head()
  .getLong(0)
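Alternatively, if you already have the java.sql.Timestamp in hand (as head().getTimestamp(0) returns), you can convert it to epoch millis yourself before putting it into the record, since java.sql.Timestamp internally stores milliseconds since the epoch, which is exactly what the timestamp-millis logical type expects. A minimal plain-Java sketch (class name is illustrative):

```java
import java.sql.Timestamp;

public class TimestampToMillis {
    public static void main(String[] args) {
        // java.sql.Timestamp already carries millis since the epoch
        Timestamp ts = Timestamp.valueOf("2019-05-25 12:00:00");
        long epochMillis = ts.getTime(); // matches Avro's timestamp-millis logical type
        // avroRecord.put("update_database_time", epochMillis) then succeeds,
        // because the schema's "long" type receives an actual long
        System.out.println(epochMillis > 0L);
    }
}
```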

Related

Shareplex CDC output - complete after image possible?

Shareplex CDC offers three JSON sub-structs per CDC record:
meta - the operation type (insert, update, delete, ...)
data - the actual changed data, with column names
key - the before image, i.e. all fields, including those that changed in "data"
This is what the data engineers state, and the documentation seems to describe only this possibility as well.
My question is: how can we get the complete after image of the record, including both changed and non-changed data? Maybe it is simply not possible.
{
  "meta": {
    "op": "upd",
    "table": "BILL.PRODUCTS"
  },
  "data": {
    "PRICE": "3599"
  },
  "key": {
    "PRODUCT_ID": "230117",
    "DESCRIPTION": "Hamsberry vintage tee, cherry",
    "PRICE": "4099"
  }
}
The above approach is unhandy with Spark schemas being computed in batch, or with defining the complete schema in conjunction with NULL-value issues, as far as I can see.

No, this is not possible with standard Shareplex output. What you can do is read the Kafka JSON, compute the after image as below, write it to a new Kafka topic, and proceed from there:
import org.json4s._
import org.json4s.jackson.JsonMethods._

val jsonS =
  """
  {
    "meta": {
      "op": "upd",
      "table": "BILL.PRODUCTS"
    },
    "data": {
      "PRICE": "3599"
    },
    "key": {
      "PRODUCT_ID": "230117",
      "DESCRIPTION": "Hamsberry vintage tee, cherry",
      "PRICE": "4099"
    }
  }
  """.stripMargin

val jsonNN = parse(jsonS)
val meta = jsonNN \ "meta"
val data = jsonNN \ "data"
val key = jsonNN \ "key"

// changed = fields present in both with differing values (taking the values from "data"),
// deleted = fields only in the before image ("key")
val Diff(changed, added, deleted) = key diff data
val afterImage = changed merge deleted

// Convert to JSON
println(pretty(render(afterImage)))
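The key-overlaid-by-data merge that the json4s snippet performs can be illustrated with plain maps. A hypothetical sketch of the same logic, independent of any JSON library (class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AfterImage {
    // Start from the before image ("key", which carries every column) and
    // overlay the changed columns ("data") to get the complete after image
    static Map<String, String> afterImage(Map<String, String> key, Map<String, String> data) {
        Map<String, String> merged = new LinkedHashMap<>(key);
        merged.putAll(data);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> key = new LinkedHashMap<>();
        key.put("PRODUCT_ID", "230117");
        key.put("DESCRIPTION", "Hamsberry vintage tee, cherry");
        key.put("PRICE", "4099");
        Map<String, String> data = Map.of("PRICE", "3599");
        System.out.println(afterImage(key, data));
        // prints {PRODUCT_ID=230117, DESCRIPTION=Hamsberry vintage tee, cherry, PRICE=3599}
    }
}
```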

Spark Structured Streaming to read nested Kafka Connect jsonConverter message

I have ingested an XML file using the Kafka Connect file-pulse connector 1.5.3.
Then I want to read it with Spark Structured Streaming to parse/flatten it, as it is quite nested.
The string I read out of Kafka (I used the console consumer to read it, and put a newline before the payload for illustration) is like below:
{
"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"string","optional":true,"field":"city"},{"type":"array","items":{"type":"struct","fields":[{"type":"array","items":{"type":"struct","fields":[{"type":"string","optional":true,"field":"unit"},{"type":"string","optional":true,"field":"value"}],"optional":true,"name":"Value"},"optional":true,"field":"value"}],"optional":true,"name":"ForcedArrayType"},"optional":true,"field":"forcedArrayField"},{"type":"string","optional":true,"field":"lastField"}],"optional":true,"name":"Data","field":"data"}],"optional":true}
,"payload":{"data":{"city":"someCity","forcedArrayField":[{"value":[{"unit":"unitField1","value":"123"},{"unit":"unitField1","value":"456"}]}],"lastField":"2020-08-02T18:02:00"}}
}
The schema I attempted:

StructType schema = new StructType();
schema = schema.add("schema", StringType, false);
schema = schema.add("payload", StringType, false);

StructType Data = new StructType();
StructType ValueArray = new StructType(new StructField[]{
    new StructField("unit", StringType, true, Metadata.empty()),
    new StructField("value", StringType, true, Metadata.empty())
});
StructType ForcedArrayType = new StructType(new StructField[]{
    new StructField("valueArray", ValueArray, true, Metadata.empty())
});
Data = Data.add("city", StringType, true);
Data = Data.add("forcedArrayField", ForcedArrayType, true);
Data = Data.add("lastField", StringType, true);

StructType Record = new StructType();
Record = Record.add("data", Data, false);
The query I attempted:

// below worked for payload
Dataset<Row> parsePayload = lines
    .selectExpr("cast(value as string) as json")
    .select(functions.from_json(functions.col("json"), schema).as("schemaAndPayload"))
    .select("schemaAndPayload.payload").as("payload");
System.out.println(parsePayload.isStreaming());

// below makes the output empty:
Dataset<Row> parseValue = parsePayload
    .select(functions.from_json(functions.col("payload"), Record).as("cols"))
    .select(functions.col("cols.data.city"));
    //.select(functions.col("cols.*"));

StreamingQuery query = parseValue
    .writeStream()
    .format("console")
    .outputMode(OutputMode.Append())
    .start();
query.awaitTermination();
When I output the parsePayload stream, I could see the data (still in JSON structure), but when I try to select a certain field (or all fields), like city above, the output is empty.
Is the cause a wrongly defined data type, or a wrong query?
PS: at the console, when I output parsePayload instead of parseValue, it displays some data, which made me think the payload part worked:
|{"data":{"city":"...|
...
Your schema definition seems not to be fully correct. I replicated your problem and was able to parse the JSON with the following schema:
val payloadSchema = new StructType()
.add("data", new StructType()
.add("city", StringType)
.add("forcedArrayField", ArrayType(new StructType()
.add("value", ArrayType(new StructType()
.add("unit", StringType)
.add("value", StringType)))))
.add("lastField", StringType))
When accessing the individual fields, I used the following selection:
val parsePayload = df
.selectExpr("cast (value as string) as json")
.select(functions.from_json(functions.col("json"), schema).as("schemaAndPayload"))
.select("schemaAndPayload.payload").as("payload")
.select(functions.from_json(functions.col("payload"), payloadSchema).as("cols"))
.select(col("cols.data.city").as("city"), explode(col("cols.data.forcedArrayField")).as("forcedArrayField"), col("cols.data.lastField").as("lastField"))
.select(col("city"), explode(col("forcedArrayField.value").as("middleFields")), col("lastField"))
This gives the output
+--------+-----------------+-------------------+
| city| col| lastField|
+--------+-----------------+-------------------+
|someCity|[unitField1, 123]|2020-08-02T18:02:00|
|someCity|[unitField1, 456]|2020-08-02T18:02:00|
+--------+-----------------+-------------------+
Your schema definition is wrong: payload and schema might not be columns/fields.
Read the data as static JSON (spark.read.json) and get the schema from that, then use that schema in Structured Streaming.

How to parse an XML coming from Kafka topic via Spark Streaming?

I want to parse XML coming from a Kafka topic using Spark Streaming.
com.databricks:spark-xml_2.10:0.4.1 is able to parse XML, but only from files in HDFS. Already tried with that library:
val df = spark.read.format("com.databricks.spark.xml").option("rowTag", "ServiceRequest").load("/tmp/sanal/gems/gem_opr.xml")
Expected results:
1) Take the stream in Spark
2) Parse the XML stream in the output
You can use the com.databricks.spark.xml.XmlReader.xmlRdd(spark: SparkSession, xmlRDD: RDD[String]): DataFrame method to read XML from an RDD<String>. For example:

import com.databricks.spark.xml.XmlReader;

// setting up sample data; jsc is a JavaSparkContext
List<ConsumerRecord<String, String>> recordsList = new ArrayList<>();
recordsList.add(new ConsumerRecord<String, String>("topic", 1, 0, "key",
    "<?xml version=\"1.0\"?><catalog><book id=\"bk101\"><genre>Computer</genre></book></catalog>"));
JavaRDD<ConsumerRecord<String, String>> rdd = jsc.parallelize(recordsList);
// map the ConsumerRecord RDD to an RDD of the raw XML strings
JavaRDD<String> xmlRdd = rdd.map(r -> r.value());
// read the XML RDD into a DataFrame (xmlRdd expects a Scala RDD, hence .rdd())
Dataset<Row> df = new XmlReader().xmlRdd(spark, xmlRdd.rdd());

How to set variables in "Where" clause when reading cassandra table by spark streaming?

I'm doing some statistics using Spark Streaming and Cassandra. When reading Cassandra tables via the spark-cassandra-connector and turning the Cassandra row RDD into a DStream via ConstantInputDStream, the "currentDate" variable in the where clause stays fixed at the day the program started.
The purpose is to analyze the total score by some dimensions up to the current date, but the code runs the analysis only up to the day it started. I ran the code on 2019-05-25, and data inserted into the table after that time cannot be taken in.
The code I use is like below:
class TestJob extends Serializable {
  def test(ssc: StreamingContext): Unit = {
    val readTableRdd = ssc.cassandraTable(Configurations.getInstance().keySpace1, Constants.testTable)
      .select(
        "code",
        "date",
        "time",
        "score"
      ).where("date <= ?", new Utils().getCurrentDate())

    val DStreamRdd = new ConstantInputDStream(ssc, readTableRdd)
    DStreamRdd.foreachRDD { r =>
      // DO SOMETHING
    }
  }
}

object GetSSC extends Serializable {
  def getSSC(): StreamingContext = {
    val conf = new SparkConf()
      .setMaster(Configurations.getInstance().sparkHost)
      .setAppName(Configurations.getInstance().appName)
      .set("spark.cassandra.connection.host", Configurations.getInstance().casHost)
      .set("spark.cleaner.ttl", "3600")
      .set("spark.default.parallelism", "3")
      .set("spark.ui.port", "5050")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    @transient lazy val ssc = new StreamingContext(sc, Seconds(30))
    ssc
  }
}

object Main {
  val logger: Log = LogFactory.getLog(Main.getClass)

  def main(args: Array[String]): Unit = {
    val ssc = GetSSC.getSSC()
    try {
      new TestJob().test(ssc)
      ssc.start()
      ssc.awaitTermination()
    } catch {
      case e: Exception =>
        logger.error(Main.getClass.getSimpleName + " error: " + e.printStackTrace())
    }
  }
}
The table used in this demo:

CREATE TABLE test.test_table (
    code text PRIMARY KEY,  // UUID
    date text,              // '20190520'
    time text,              // '12:00:00'
    score int               // 90
);
Any help is appreciated!
In general, the RDDs returned by the Spark Cassandra Connector aren't streaming RDDs - Cassandra has no functionality that would allow you to subscribe to a change feed and analyze it. You can implement something similar by explicitly looping and fetching the data, but that requires careful design of the tables, and it's hard to say more without digging deeper into the requirements for latency, amount of data, etc.
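Independently of the lack of a change feed, the stale-date symptom in the question comes from getCurrentDate() being evaluated only once, when the RDD is defined, rather than once per batch; re-evaluating it inside each batch (e.g., rebuilding the cassandraTable query inside foreachRDD) picks up the new date. A self-contained Java sketch of the eager-vs-per-batch difference, where currentDate() is a hypothetical stand-in for the question's Utils.getCurrentDate():

```java
import java.util.function.Supplier;

public class EagerVsLazyDate {
    static int batch = 0;
    // Hypothetical stand-in for Utils.getCurrentDate()
    static String currentDate() { return batch == 0 ? "20190525" : "20190526"; }

    public static void main(String[] args) {
        // Eager: evaluated once at definition time, like the .where(...) clause in the question
        String capturedOnce = currentDate();
        // Lazy: re-evaluated on every call, the effect of calling it inside each batch
        Supplier<String> perBatch = EagerVsLazyDate::currentDate;

        batch = 1; // a day passes between two micro-batches
        System.out.println(capturedOnce);   // prints 20190525 - stale
        System.out.println(perBatch.get()); // prints 20190526 - fresh
    }
}
```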

Is it possible to build spark code on fly and execute?

I am trying to create a generic function to read a CSV file using the Databricks CSV reader. But the options are not mandatory; they can differ based on my input JSON configuration file.
Example 1:

"ReaderOption": {
  "delimiter": ";",
  "header": "true",
  "inferSchema": "true",
  "schema": """some custom schema.."""
},

Example 2:

"ReaderOption": {
  "delimiter": ";",
  "schema": """some custom schema.."""
},
Is it possible to construct the options, or the entire read statement, at runtime and run it in Spark? Like below:
def readCsvWithOptions(): DataFrame = {
  val options: Map[String, String] = Map("inferSchema" -> "true")
  val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
    .option(options)
    .load(inputPath)
  readDF
}

def readCsvWithOptions(): DataFrame = {
  val options: Map[String, String] = Map("inferSchema" -> "true")
  val readDF = jobContext.spark.read.format("com.databricks.spark.csv")
    .options(options)
    .load(inputPath)
  readDF
}

There is an options method which takes a Map of key-value pairs, so the second variant works: .option takes a single key and value, while .options accepts the whole Map at once.
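The pattern of building the options map at runtime from whatever the JSON config provides can be sketched like this (the config map here is a hypothetical stand-in for the parsed ReaderOption block, and the class/method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DynamicReaderOptions {
    // Keep only the options the config actually provides; absent keys
    // simply fall back to the reader's defaults
    static Map<String, String> readerOptions(Map<String, String> config) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String key : List.of("delimiter", "header", "inferSchema", "schema")) {
            if (config.containsKey(key)) {
                options.put(key, config.get(key));
            }
        }
        return options;
    }

    public static void main(String[] args) {
        // Example 2 from the question: only delimiter and schema are present
        Map<String, String> cfg = Map.of("delimiter", ";", "schema", "some custom schema");
        System.out.println(readerOptions(cfg));
        // spark.read.format("com.databricks.spark.csv").options(readerOptions(cfg)).load(inputPath)
        // would then apply exactly those options
    }
}
```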
