Spark Structed Streaming read nested json from kafka and flatten it - apache-spark

A json type data :
{
"id": "34cx34fs987",
"time_series": [
{
"time": "2020090300: 00: 00",
"value": 342342.12
},
{
"time": "2020090300: 00: 05",
"value": 342421.88
},
{
"time": "2020090300: 00: 10",
"value": 351232.92
}
]
}
I got the json from kafka:
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
df = spark.readStream.format("kafka")...
How can I manipulate df to get a DataFrame as shown below:
id time value
34cx34fs987 20200903 00:00:00 342342.12
34cx34fs987 20200903 00:00:05 342421.88
34cx34fs987 20200903 00:00:10 351232.92

Using Scala:
If you define your schema as
val schema: StructType = new StructType()
.add("id", StringType)
.add("time_series", ArrayType(new StructType()
.add("time", StringType)
.add("value", DoubleType)
))
you can then make use of Spark SQL built-in functions from_json and explode
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = df
.selectExpr("CAST(value as STRING) as json")
.select(from_json('json, schema).as("data"))
.select(col("data.id").as("id"), explode(col("data.time_series")).as("time_series"))
.select(col("id"), col("time_series.time").as("time"), col("time_series.value").as("value"))
Your output will be then
+-----------+-----------------+---------+
|id |time |value |
+-----------+-----------------+---------+
|34cx34fs987|20200903 00:00:00|342342.12|
|34cx34fs987|20200903 00:00:05|342421.88|
|34cx34fs987|20200903 00:00:10|351232.92|
+-----------+-----------------+---------+

Sample code in pyspark
df2 = df.select("id", f.explode("time_series").alias("col"))
df2.select("id", "col.time", "col.value").show()

Related

Cannot convert Catalyst type IntegerType to Avro type ["null","int"]

I've Spark Structured Streaming process build with Pyspark that reads a avro message from a kafka topic, make some transformations and load the data as avro in a target topic.
I use the ABRIS package (https://github.com/AbsaOSS/ABRiS) to serialize/deserialize the Avro from Confluent, integrating with Schema Registry.
The schema contains integer columns as follows:
{
"name": "total_images",
"type": [
"null",
"int"
],
"default": null
},
{
"name": "total_videos",
"type": [
"null",
"int"
],
"default": null
},
The process raises the following error: Cannot convert Catalyst type IntegerType to Avro type ["null","int"].
I've tried to convert the columns to be nullable but the error persists.
If someone have a suggestion I would appreciate that
I burned hours on this one
Actually, It is unrelated to Abris dependency (behaviour is the same with native spark-avro apis)
There may be several root causes but in my case … using Spark 3.0.1, Scala with Dataset : it was related to encoder and wrong type in the case class handling datas.
Shortly, avro field defined with "type": ["null","int"] can’t be mapped to scala Int, it needs Option[Int]
Using the following code:
test("Avro Nullable field") {
val schema: String =
"""
|{
| "namespace": "com.mberchon.monitor.dto.avro",
| "type": "record",
| "name": "TestAvro",
| "fields": [
| {"name": "strVal", "type": ["null", "string"]},
| {"name": "longVal", "type": ["null", "long"]}
| ]
|}
""".stripMargin
val topicName = "TestNullableAvro"
val testInstance = TestAvro("foo",Some(Random.nextInt()))
import sparkSession.implicits._
val dsWrite:Dataset[TestAvro] = Seq(testInstance).toDS
val allColumns = struct(dsWrite.columns.head, dsWrite.columns.tail: _*)
dsWrite
.select(to_avro(allColumns,schema) as 'value)
.write
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("topic", topicName)
.save()
val dsRead:Dataset[TestAvro] = sparkSession.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.load()
.select(from_avro(col("value"), schema) as 'Metric)
.select("Metric.*")
.as[TestAvro]
assert(dsRead.collect().contains(testInstance))
}
It fails if case class is defined as follow:
case class TestAvro(strVal:String,longVal:Long)
Cannot convert Catalyst type LongType to Avro type ["null","long"].
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type LongType to Avro type ["null","long"].
at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:219)
at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$1(AvroSerializer.scala:239)
It works properly with:
case class TestAvro(strVal:String,longVal:Option[Long])
Btw, it would be more than nice to have support for SpecificRecord within Spark encoders (you can use Kryo but it is sub efficient)
Since, in order to use efficiently typed Dataset with my avro data … I need to create additional case classes (which duplicates of my SpecificRecords).

Consuming Avro events from Kafka in Spark structured streaming

I have designed a Nifi flow to push JSON events serialized in Avro format into Kafka topic, then I am trying to consume it in Spark Structured streaming.
While Kafka part works fine, Spark Structured streaming is not able to read Avro events. It fails with below error.
[Stage 0:> (0 + 1) / 1]2019-07-19 16:56:57 ERROR Utils:91 - Aborting task
org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -62
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:422)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:414)
Spark code
import org.apache.spark.sql.types.{ StructField, StructType }
import org.apache.spark.sql.types.{ DecimalType, LongType, ByteType, StringType }
import org.apache.spark.sql.types.DataType._
import scala.collection.Seq
import org.apache.spark._
import spark.implicits._
import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.avro._
import java.nio.file.{Files, Path, Paths}
val spark = SparkSession.builder.appName("Spark-Kafka-Integration").master("local").getOrCreate()
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("schema.avsc")))
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host:port").option("subscribe", "topic_name").load()
val df1 = df.select(from_avro(col("value"),jsonFormatSchema).as("data")).select("data.*")
df1.writeStream.format("console").option("truncate","false").start()
))
Schema used in Spark
{
"type": "record",
"name": "kafka_demo_new",
"fields": [
{
"name": "host",
"type": "string"
},
{
"name": "event",
"type": "string"
},
{
"name": "connectiontype",
"type": "string"
},
{
"name": "user",
"type": "string"
},
{
"name": "eventtimestamp",
"type": "string"
}
]
}
Sample topic data in Kafka
{"host":"localhost","event":"Qradar_Demo","connectiontype":"tcp/ip","user":"user","eventtimestamp":"2018-05-24 23:15:07"}
Below is version information
HDP - 3.1.0
Kafka - 2.0.0
Spark - 2.4.0
Any help is appreciated.
Had a similar issue and found out that Kafka / KSQL have a different version of AVRO that made other components complain.
This might be your case also:
Have a look: https://github.com/confluentinc/ksql/issues/1742

Read JSON Array string with schema returns null spark 2.2.0

When I trying to read a spark dataframe column containing JSON string as array, with a defined schema it returns null. I tried Array, Seq and List for the schema but all returns null. My spark version is 2.2.0
val dfdata= spark.sql("""select "\[{ \"id\":\"93993\", \"name\":\"Phil\" }, { \"id\":\"838\", \"name\":\"Don\" }]" as theJson""")
dfdata.show(5,false)
val sch = StructType(
Array(StructField("id", StringType, true),
StructField("name", StringType, true)))
print(sch.prettyJson )
dfdata.select(from_json($"theJson", sch)).show
and the output
+---------------------------------------------------------------+
|theJson |
+---------------------------------------------------------------+
|[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]|
+---------------------------------------------------------------+
{
"type" : "struct",
"fields" : [ {
"name" : "id",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "name",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
}+----------------------+
|jsontostructs(theJson)|
+----------------------+
| null|
+----------------------+
Your schema isn't quite right for your example. Your example is an array of structs. Try by wrapping it in an ArrayType:
val sch = ArrayType(StructType(Array(
StructField("id", StringType, true),
StructField("name", StringType, true)
)))
Have you tried parsing your json string before obtaining a DF?
// obtaining this string should be easy:
val jsonStr = """[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]"""
// then you can take advantage of schema inference
val df2 = spark.read.json(Seq(jsonStr).toDS)
df2.show(false)
// it shows:
// +-----+----+
// |id |name|
// +-----+----+
// |93993|Phil|
// |838 |Don |
// +-----+----+

Spark sql streaming with Kafka on Json data : Function from_json not able to parse multiline json coming from kafka topic

Here, I am Sending the json data to kafka from "test" topic ,give the schema to json, do some transformation and print it on console.
Here is the code:-
val kafkadata = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("zookeeper.connect", "localhost:2181")
.option("subscribe", "test")
.option("startingOffsets", "earliest")
.option("max.poll.records", 10)
.option("failOnDataLoss", false)
.load()
val schema1 = new StructType()
.add("id_sales_order", StringType)
.add("item_collection",
MapType(
StringType,
new StructType()
.add("id", LongType)
.add("ip", StringType)
.add("description", StringType)
.add("temp", LongType)
.add("c02_level", LongType)
.add("geo",
new StructType()
.add("lat", DoubleType)
.add("long", DoubleType)
)
)
)
val df = kafkadata.selectExpr("cast (value as string) as
json")
.select(from_json($"json",
schema=schema1).as("data"))
.select($"data.id_sales_order",explode($"data.item_collection"))
val query = df.writeStream
.outputMode("append")
.queryName("table")
.format("console")
.start()
query.awaitTermination()
spark.stop()
I am sending data to kafka by 2 ways:-
1) Single line json:-
{"id_sales_order": "2", "item_collection": {"2": {"id": 10,"ip": "68.28.91.22","description": "Sensor attached to the container ceilings","temp":35,"c02_level": 1475,"geo": { "lat":38.00, "long":97.00}}}}
It is giving me output
+--------------+---+--------------------+
|id_sales_order|key| value|
+--------------+---+--------------------+
| 2| 2|[10,68.28.91.22,S...|
+--------------+---+--------------------+
2)Multiline json:-
{
"id_sales_order": "2",
"item_collection": {
"2": {
"id": 10,
"ip": "68.28.91.22",
"description": "Sensor attached to the container ceilings",
"temp":35,
"c02_level": 1475,
"geo":
{ "lat":38.00, "long":97.00}
}
}
}
It is not giving me any output.
+--------------+---+-----+
|id_sales_order|key|value|
+--------------+---+-----+
+--------------+---+-----+
Json coming from source is like 2nd one.
How do you handle json while reading streaming data from kafka?
I think the problem may be that the from_json function doesn't understand multiline json.

Writing to Cosmos DB Graph API from Databricks (Apache Spark)

I have a DataFrame in Databricks which I want to use to create a graph in Cosmos, with one row in the DataFrame equating to 1 vertex in Cosmos.
When I write to Cosmos I can't see any properties on the vertices, just a generated id.
Get data:
data = spark.sql("select * from graph.testgraph")
Configuration:
writeConfig = {
"Endpoint" : "******",
"Masterkey" : "******",
"Database" : "graph",
"Collection" : "TestGraph",
"Upsert" : "true",
"query_pagesize" : "100000",
"bulkimport": "true",
"WritingBatchSize": "1000",
"ConnectionMaxPoolSize": "100",
"partitionkeydefinition": "/id"
}
Write to Cosmos:
data.write.
format("com.microsoft.azure.cosmosdb.spark").
options(**writeConfig).
save()
Below is the working code to insert records into cosmos DB.
go to the below site, click on the download option and select the uber.jar
https://search.maven.org/artifact/com.microsoft.azure/azure-cosmosdb-spark_2.3.0_2.11/1.2.2/jar then add in your dependency
spark-shell --master yarn --executor-cores 5 --executor-memory 10g --num-executors 10 --driver-memory 10g --jars "path/to/jar/dependency/azure-cosmosdb-spark_2.3.0_2.11-1.2.2-uber.jar" --packages "com.google.guava:guava:18.0,com.google.code.gson:gson:2.3.1,com.microsoft.azure:azure-documentdb:1.16.1"
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val data = Seq(
Row(2, "Abb"),
Row(4, "Bcc"),
Row(6, "Cdd")
)
val schema = List(
StructField("partitionKey", IntegerType, true),
StructField("name", StringType, true)
)
val DF = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
val writeConfig = Map("Endpoint" -> "https://*******.documents.azure.com:443/",
"Masterkey" -> "**************",
"Database" -> "db_name",
"Collection" -> "collection_name",
"Upsert" -> "true",
"query_pagesize" -> "100000",
"bulkimport"-> "true",
"WritingBatchSize"-> "1000",
"ConnectionMaxPoolSize"-> "100",
"partitionkeydefinition"-> "/partitionKey")
DF.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(writeConfig).save()

Resources