I have been trying to define the schema for the incoming Spark stream, but it always returns NULLs. Below is the incoming payload, which I get by casting the binary values in the body to string (aka Bodycast):
[
  {
    "Node": "mm=400;b=fjdhkjfhds5454",
    "El": "jdj5544",
    "DisplayName": "Tag1",
    "Value": {
      "Value": 61.175327,
      "SourceTimestamp": "2025-01-23T20:49:51.885Z"
    }
  },
  {
    "Node": "mm=44;b=kjlkjl",
    "El": "fdsfdsfsfd",
    "DisplayName": "tag2",
    "Value": {
      "Value": 23.957619,
      "SourceTimestamp": "2025-10-23T20:49:51.885Z"
    }
  },
  {
    "Node": "mm=44;b=gfgfdgfd",
    "El": "fdsfdsfs",
    "DisplayName": "tag3",
    "Value": {
      "Value": 88.91336,
      "SourceTimestamp": "2012-07-23T20:50:52.555Z"
    }
  },
  {
    "Node": "mm=87788;b=dsadad",
    "El": "fdsfdsfsdf5454fd",
    "DisplayName": "tag4",
    "Value": {
      "Value": 89.044495,
      "SourceTimestamp": "2012-08-23T20:49:51.885Z"
    }
  },
  {
    "Node": "mm=2;b=fdsfdsf",
    "EL": "fdfsdfdsfsd",
    "DisplayName": "tag9",
    "Value": {
      "Value": 95.07896,
      "SourceTimestamp": "2022-11-23T20:49:51.885Z"
    }
  },
  {
    "Node": "mm=2;b=AdsadasdasdasdAA=",
    "EL": "54f54ds5f4ds5fds",
    "DisplayName": "tag5",
    "Value": {
      "Value": 19.866293,
      "SourceTimestamp": "1998-08-23T20:49:51.885Z"
    }
  },
  {
    "Node": "mm=2;b=54f5sd4f5ds4fds",
    "EL": "fkjhfjdshfsjdhjfds1221",
    "DisplayName": "tag6",
    "Value": {
      "Value": 611.9143,
      "SourceTimestamp": "2201-09-22T20:44:51.017Z"
    }
  }
]
I have defined the schema below to use with the from_json function on the DataFrame:
structureSchema = StructType([
    StructField("Node", StringType()),
    StructField("El", StringType()),
    StructField("DisplayName", StringType()),
    StructField("value_array", ArrayType(StructType([
        StructField("Value", StringType()),
        StructField("SourceTimestamp", TimestampType())
    ])))
])

stream_messages.select(stream_messages.Bodycast) \
    .withColumn('json', from_json(F.col('Bodycast'), structureSchema)) \
    .display()
Can anyone please advise whether the schema I have defined is wrong?
Thanks in advance
I realized that the schema needs to be wrapped in an ArrayType, since each row contains multiple records. I later converted the arrays into the required tabular form.
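For anyone hitting the same problem, here is a minimal sketch of the corrected schema, assuming the stream column is still called Bodycast and that DoubleType is acceptable for the numeric values (both taken from the snippet above); the explode at the end is one way of getting to the tabular form mentioned above.
from pyspark.sql import functions as F
from pyspark.sql.types import (ArrayType, DoubleType, StringType,
                               StructField, StructType, TimestampType)

# The payload is a JSON array of objects, so the top-level type must be an
# ArrayType wrapping the per-record struct. Field names mirror the sample
# payload above (note that some records use "EL" instead of "El").
recordSchema = StructType([
    StructField("Node", StringType()),
    StructField("El", StringType()),
    StructField("DisplayName", StringType()),
    StructField("Value", StructType([
        StructField("Value", DoubleType()),
        StructField("SourceTimestamp", TimestampType())
    ]))
])
structureSchema = ArrayType(recordSchema)

parsed = (stream_messages
          .withColumn("json", F.from_json(F.col("Bodycast"), structureSchema))
          .select(F.explode("json").alias("rec"))   # one row per array element
          .select("rec.Node", "rec.El", "rec.DisplayName",
                  "rec.Value.Value", "rec.Value.SourceTimestamp"))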
Related
I have a JSON file with the structure below that I need to read in as a PySpark DataFrame.
I tried reading it in using the multiLine option, but it doesn't seem to return anything more than the column names and data types. What am I possibly doing wrong, and how can I read in the structure below?
df = spark.read.format("json").option("inferSchema", "true") \
.option("multiLine", "true") \
.load("/mnt/blob/input/jsonfile.json")
Structure:
[
  [
    {
      "Key": "Val"
    },
    {
      "Key": "Val"
    }
  ],
  [
    {
      "Key": "Val"
    },
    {
      "Key": "Val"
    }
  ]
]
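One way to handle this kind of doubly nested file (a sketch, assuming the whole file fits in a single text record and reusing the path and key name from the snippet above) is to read it as plain text and parse it with an explicit ArrayType(ArrayType(...)) schema:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# The file is an array of arrays of {"Key": "Val"} objects.
schema = ArrayType(ArrayType(StructType([StructField("Key", StringType())])))

raw = spark.read.text("/mnt/blob/input/jsonfile.json", wholetext=True)

df = (raw
      .select(F.from_json(F.col("value"), schema).alias("outer"))
      .select(F.explode("outer").alias("inner"))   # one row per inner array
      .select(F.explode("inner").alias("rec"))     # one row per object
      .select("rec.Key"))
df.show()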
I have an Avro schema (.avsc) like the one below:
{
  "name": "address",
  "type": [
    "null",
    {
      "type": "record",
      "name": "Address",
      "namespace": "com.data",
      "fields": [
        {
          "name": "address",
          "type": ["null", "com.data.Address"],
          "default": null
        }
      ]
    }
  ],
  "default": null
}
On loading this data in pyspark:
jsonFormatSchema = open("Address.avsc", "r").read()
spark = SparkSession.builder.appName('abc').getOrCreate()
df = spark.read.format("avro")\
.option("avroSchema", jsonFormatSchema)\
.load("xxx.avro")
I got the following exception:
"Found recursive reference in Avro schema, which can not be processed by Spark"
I tried many other configurations, but without any success.
To execute it, I use spark-submit with:
--packages org.apache.spark:spark-avro_2.12:3.0.1
This is intended behaviour; you can take a look at the issue:
https://issues.apache.org/jira/browse/SPARK-25718
I am new to Spark and trying to explore Spark Structured Streaming. I will be consuming messages from Kafka (nested JSON) and filtering them based on certain conditions on the JSON attributes. Every message satisfying the filter should then be pushed to Cassandra.
I have read the documentation on the Spark Cassandra connector and the Structured Streaming + Kafka integration guide:
https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
Dataset<Row> df = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load();

df.selectExpr("CAST(value AS STRING)");
I only need a few of the many attributes present in this nested JSON. How do I apply a schema on top of it so that I can use Spark SQL for filtering?
For the sample JSON, I need to persist name, age, experience, hobby_name, and hobby_experience for players whose sum of playing frequency is more than 5.
{
  "name": "Tom",
  "age": "24",
  "gender": "male",
  "hobbies": [{
    "name": "Tennis",
    "experience": 5,
    "places": [{
      "city": "London",
      "frequency": 4
    }, {
      "city": "Sydney",
      "frequency": 3
    }]
  }]
}
I am relatively new to Spark, so please forgive me if this is a repeat. Also, I am looking for a solution in Java.
You can specify your schema like so:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// "hobbies" and "places" are JSON arrays in the sample, so they are modelled
// as array-of-struct fields here.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.StringType, true),
    DataTypes.createStructField("gender", DataTypes.StringType, true),
    DataTypes.createStructField("hobbies", DataTypes.createArrayType(
        DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("name", DataTypes.StringType, true),
            DataTypes.createStructField("experience", DataTypes.IntegerType, true),
            DataTypes.createStructField("places", DataTypes.createArrayType(
                DataTypes.createStructType(new StructField[] {
                    DataTypes.createStructField("city", DataTypes.StringType, true),
                    DataTypes.createStructField("frequency", DataTypes.IntegerType, true)
                })), true)
        })), true)
});
And then use the schema to create your dataframe as needed:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

df.select(from_json(col("value"), schema).as("data"))
  .select(
      col("data.name").as("name"),
      col("data.hobbies.name").as("hobbies_name"));
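Note that the schema and select above only cover the projection. To get the "sum of playing frequency is more than 5" condition from the question, you would still need to explode hobbies and places, aggregate the frequency per player, and filter on that sum before writing the matching rows to Cassandra.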
When I try to read a Spark DataFrame column containing a JSON string as an array with a defined schema, it returns null. I tried Array, Seq and List for the schema, but they all return null. My Spark version is 2.2.0.
val dfdata= spark.sql("""select "\[{ \"id\":\"93993\", \"name\":\"Phil\" }, { \"id\":\"838\", \"name\":\"Don\" }]" as theJson""")
dfdata.show(5,false)
val sch = StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)))
print(sch.prettyJson)
dfdata.select(from_json($"theJson", sch)).show
and the output
+---------------------------------------------------------------+
|theJson |
+---------------------------------------------------------------+
|[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]|
+---------------------------------------------------------------+
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "name",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ]
}
+----------------------+
|jsontostructs(theJson)|
+----------------------+
|                  null|
+----------------------+
Your schema isn't quite right for your example, which is an array of structs. Try wrapping it in an ArrayType:
val sch = ArrayType(StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
)))
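With the ArrayType wrapper, from_json returns an array of structs instead of null, and you can then explode the result if you want one row per element.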
Have you tried parsing your json string before obtaining a DF?
// obtaining this string should be easy:
val jsonStr = """[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]"""
// then you can take advantage of schema inference
val df2 = spark.read.json(Seq(jsonStr).toDS)
df2.show(false)
// it shows:
// +-----+----+
// |id |name|
// +-----+----+
// |93993|Phil|
// |838 |Don |
// +-----+----+
I am trying to call partitionBy on a nested field like below:
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)
I get the error below when I run it. I do see 'name' listed as a field in the schema below. Is there a different format for specifying a nested column name?
java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField(time,StringType,true), StructField(data,StructType(StructField(dataDetails,StructType(StructField(name,StringType,true), StructField(id,StringType,true),true)),true))
This is my JSON file:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}
This appears to be a known issue listed here: https://issues.apache.org/jira/browse/SPARK-18084
I had this issue as well, and to work around it I was able to un-nest the columns on my dataset. My dataset was a little different from yours, but here is the strategy...
Original JSON:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}
Modified JSON:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data_type": "EventData",
  "data_dataDetails_name": "EventName",
  "data_dataDetails_id": "1234"
}
Code to get to the modified JSON:
def main(args: Array[String]) {
  ...
  val data = df.select(children("data", df) ++ Seq($"name", $"time"): _*)
  data.printSchema
  data.write.partitionBy("data_dataDetails_name").format("csv").save(...)
}

def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  // prefix each child column with its parent name, e.g. data.dataDetails -> data_dataDetails
  fields.map(x => col(s"$colname.${x.name}").alias(s"${colname}_${x.name}"))
}
Since the feature is unavailable as of Spark 2.3.1, here's a workaround. Make sure to handle name conflicts between the nested fields and the fields at the root level.
{"date":"20180808","value":{"group":"xxx","team":"yyy"}}
df.select("date", "value.group", "value.team")
  .write
  .partitionBy("date", "group", "team")
  .parquet(filenameParquet)
The partitions end up like
date=20180808/group=xxx/team=yyy/part-xxx.parquet