I'm working on a Spark Java application using Spark 2.4.7. I have a JSON file that I'm loading into a dataframe like this:
Dataset<Row> df = sparkSession().read().option("multiline", true).format("json").load(path_of_json);
The issue is that one attribute in my JSON file has a numeric value, but when I printSchema() the dataframe it shows that attribute as StringType and not LongType.
JSON file:
[
{"first": {
"id" :"fdfd",
"name":"temp",
"type":-1 --> reading it as LongType
},
"something":"something_else",
"data" : {
"key": {
"field":7569, --> reading it as StringType
"temp":"dfdfd"
}
}
}]
I tried reproducing the issue in my local spark-shell, but it works fine there. Does anyone have an idea why this is happening?
By default, Spark tries to infer the schema automatically when reading from a Json file data source. However, if you know it, you can specify the schema when loading the Dataframe.
You first need to define the schema, an instance of the StructType class, where you specify each field name and data type.
You can do it manually:
StructType keyType = new StructType()
.add(new StructField("field", DataTypes.LongType, true, Metadata.empty()))
.add(new StructField("temp", DataTypes.StringType, true, Metadata.empty()));
StructType dataType = new StructType()
.add(new StructField("key", keyType, true, Metadata.empty()));
StructType firstType = new StructType()
.add(new StructField("id", DataTypes.StringType, true, Metadata.empty()))
.add(new StructField("name", DataTypes.StringType, true, Metadata.empty()))
.add(new StructField("type", DataTypes.LongType, true, Metadata.empty()));
StructType schema = new StructType()
.add(new StructField("data", dataType, true, Metadata.empty()))
.add(new StructField("first", firstType, true, Metadata.empty()))
.add(new StructField("something", DataTypes.StringType, true, Metadata.empty()));
or from a DDL string:
StructType schema = StructType.fromDDL("data STRUCT<key: STRUCT<field: BIGINT, temp: STRING>>, first STRUCT<id: STRING, name: STRING, type: BIGINT>, something STRING");
Then specify the schema when loading the Dataframe:
Dataset<Row> df = spark.read()
.option("multiline", true)
.format("json")
.schema(schema)
.load(jsonPath);
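With the explicit schema applied, printSchema() should now report the nested field as a long instead of a string. A quick check (shown in spark-shell/Scala syntax; the Java call is the same):
df.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- key: struct (nullable = true)
 |    |    |-- field: long (nullable = true)
 |    |    |-- temp: string (nullable = true)
 |-- first: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- type: long (nullable = true)
 |-- something: string (nullable = true)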
Related
I have some nested JSON that I have parallelized and spat out as JSON. A complete record would look like:
{
"id":"1",
"type":"site",
"attributes":{
"description":"Number 1 Park",
"activeInactive":{
"text":"Active",
"colour":"#4CBB17"
},
"lastUpdated":"2019-12-05T08:51:39"
},
"relationships":{
"region":{
"data":{
"type":"region",
"id":"1061",
"meta":{
"displayValue":"Park Region"
}
}
}
}
}
However, the data is pending a data cleanse and currently the region field is not populated.
{
"id":"1",
"type":"site",
"attributes":{
"description":"Number 1 Park",
"activeInactive":{
"text":"Active",
"colour":"#4CBB17"
},
"lastUpdated":"2019-12-05T08:51:39"
},
"relationships":{
"region":{
"data": null
}
}
}
The data element will be null if the relationship doesn't exist (i.e. it is an orphaned site).
I run this JSON into a spark dataframe via an RDD. The schema of the dataframe is:
attributes:struct
    activeInactive:struct
        colour:string
        text:string
    description:string
    lastUpdated:string
id:string
relationships:struct
    region:struct
        data:string
I get errors when coding for region using df.select(col('relationships.region.data.meta.displayValue')), as if the nested fields were there, whereas data was inferred as the plain string shown in the schema above. I'm going to assume this is because of the conflict with the dataframe's schema.
The question is: how can I make this more dynamic and still obtain the displayValue as and when it is populated, without needing to revisit the code?
While reading a json file, you can impose the schema on the output dataframe using this syntax:
df = spark.read.json("<path to json file>", schema = <schema object>)
This way the data field will still show you null, but it will be a StructType() with the complete nested structure.
Based on the data snippet that was provided the applicable schema object looks like this:
schemaObject = StructType([
StructField('id', StringType(), True),
StructField('type', StringType(), True),
StructField('attributes', StructType([
StructField('description', StringType(), True),
StructField('activeInactive', StructType([
StructField('text', StringType(), True),
StructField('colour', StringType(), True)
]), True),
StructField('lastUpdated', StringType(), True)
]), True),
StructField('relationships', StructType([
StructField('region', StructType([
StructField('data', StructType([
StructField('type', StringType(), True),
StructField('id', StringType(), True),
StructField('meta', StructType([
StructField('displayValue', StringType(), True)
]), True)
]), True)
]), True)
]), True)
])
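As a rough usage sketch (written in Scala here; the PySpark call is spark.read.json(path, schema=schemaObject), and both the path and the Scala schemaObject equivalent below are hypothetical), the nested select then works even for orphaned sites, returning null instead of failing:
import org.apache.spark.sql.functions.col

val df = spark.read.schema(schemaObject).json("/path/to/sites.json")
df.select(col("relationships.region.data.meta.displayValue").as("displayValue")).show(false)
// sites with a populated region show their displayValue; orphaned sites (data = null) show null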
A JSON record:
{
"id": "34cx34fs987",
"time_series": [
{
"time": "2020090300: 00: 00",
"value": 342342.12
},
{
"time": "2020090300: 00: 05",
"value": 342421.88
},
{
"time": "2020090300: 00: 10",
"value": 351232.92
}
]
}
I get the JSON from Kafka:
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
df = spark.readStream.format("kafka")...
How can I manipulate df to get a DataFrame as shown below:
id time value
34cx34fs987 20200903 00:00:00 342342.12
34cx34fs987 20200903 00:00:05 342421.88
34cx34fs987 20200903 00:00:10 351232.92
Using Scala:
If you define your schema as
val schema: StructType = new StructType()
.add("id", StringType)
.add("time_series", ArrayType(new StructType()
.add("time", StringType)
.add("value", DoubleType)
))
you can then make use of Spark SQL built-in functions from_json and explode
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = df
.selectExpr("CAST(value as STRING) as json")
.select(from_json('json, schema).as("data"))
.select(col("data.id").as("id"), explode(col("data.time_series")).as("time_series"))
.select(col("id"), col("time_series.time").as("time"), col("time_series.value").as("value"))
Your output will then be:
+-----------+-----------------+---------+
|id |time |value |
+-----------+-----------------+---------+
|34cx34fs987|20200903 00:00:00|342342.12|
|34cx34fs987|20200903 00:00:05|342421.88|
|34cx34fs987|20200903 00:00:10|351232.92|
+-----------+-----------------+---------+
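Note that df comes from readStream, so you cannot call show on df1 directly; to actually see that table you would attach a console sink (a minimal sketch, for inspection only):
df1.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .start()
  .awaitTermination()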
Sample code in PySpark (assuming df already holds the parsed id and time_series columns, and f is pyspark.sql.functions):
df2 = df.select("id", f.explode("time_series").alias("col"))
df2.select("id", "col.time", "col.value").show()
I am new to Spark and trying to explore Spark Structured Streaming. I will be consuming messages from Kafka (nested JSON) and filtering these messages based on certain conditions on the JSON attributes. Every message satisfying the filter should then be pushed to Cassandra.
I have read the documentation on the Spark Cassandra connector and the Kafka integration guide:
https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
Dataset<Row> df = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load();
df.selectExpr("CAST(value AS STRING)");
I only need a few of the many attributes present in this nested JSON. How do I apply a schema on top of it, so that I can use Spark SQL for filtering?
For the sample JSON, I need to persist name, age, experience, hobby_name, hobby_experience for players whose playing frequencies sum to more than 5.
{
"name": "Tom",
"age": "24",
"gender": "male",
"hobbies": [{
"name": "Tennis",
"experience": 5,
"places": [{
"city": "London",
"frequency": 4
}, {
"city": "Sydney",
"frequency": 3
}]
}]
}
I am relatively new to Spark, so please forgive me if this is a repeat. Also, I am looking for a solution in Java.
You can specify your schema like so:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("name", DataTypes.StringType, true),
DataTypes.createStructField("age", DataTypes.StringType, true),
DataTypes.createStructField("gender", DataTypes.StringType, true),
DataTypes.createStructField("hobbies", DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("name", DataTypes.StringType, true),
DataTypes.createStructField("experience", DataTypes.IntegerType, true),
DataTypes.createStructField("places", DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("city", DataTypes.StringType, true),
DataTypes.createStructField("frequency", DataTypes.IntegerType, true)
}), true)
}), true)
});
And then use the schema to create your dataframe as needed:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;
df.select(from_json(col("value").cast("string"), schema).as("data"))
.select(
col("data.name").as("name"),
col("data.hobbies.name").as("hobbies_name"))
When I try to read a Spark dataframe column containing a JSON string as an array with a defined schema, it returns null. I tried Array, Seq and List for the schema but all return null. My Spark version is 2.2.0.
val dfdata= spark.sql("""select "\[{ \"id\":\"93993\", \"name\":\"Phil\" }, { \"id\":\"838\", \"name\":\"Don\" }]" as theJson""")
dfdata.show(5,false)
val sch = StructType(
Array(StructField("id", StringType, true),
StructField("name", StringType, true)))
print(sch.prettyJson )
dfdata.select(from_json($"theJson", sch)).show
and the output
+---------------------------------------------------------------+
|theJson |
+---------------------------------------------------------------+
|[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]|
+---------------------------------------------------------------+
{
"type" : "struct",
"fields" : [ {
"name" : "id",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "name",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
}
+----------------------+
|jsontostructs(theJson)|
+----------------------+
| null|
+----------------------+
Your schema isn't quite right for your example: the example is an array of structs. Try wrapping it in an ArrayType:
val sch = ArrayType(StructType(Array(
StructField("id", StringType, true),
StructField("name", StringType, true)
)))
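With the array schema, from_json yields an array column that you can explode into one row per element. A minimal sketch reusing dfdata and sch from the question:
import org.apache.spark.sql.functions.{explode, from_json}

dfdata
  .select(from_json($"theJson", sch).as("parsed"))
  .select(explode($"parsed").as("record"))
  .select($"record.id", $"record.name")
  .show(false)

// roughly:
// +-----+----+
// |id   |name|
// +-----+----+
// |93993|Phil|
// |838  |Don |
// +-----+----+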
Have you tried parsing your json string before obtaining a DF?
// obtaining this string should be easy:
val jsonStr = """[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]"""
// then you can take advantage of schema inference
val df2 = spark.read.json(Seq(jsonStr).toDS)
df2.show(false)
// it shows:
// +-----+----+
// |id |name|
// +-----+----+
// |93993|Phil|
// |838 |Don |
// +-----+----+
Here I am sending JSON data to Kafka on the "test" topic, applying a schema to the JSON, doing some transformation, and printing it on the console.
Here is the code:
val kafkadata = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("zookeeper.connect", "localhost:2181")
.option("subscribe", "test")
.option("startingOffsets", "earliest")
.option("max.poll.records", 10)
.option("failOnDataLoss", false)
.load()
val schema1 = new StructType()
.add("id_sales_order", StringType)
.add("item_collection",
MapType(
StringType,
new StructType()
.add("id", LongType)
.add("ip", StringType)
.add("description", StringType)
.add("temp", LongType)
.add("c02_level", LongType)
.add("geo",
new StructType()
.add("lat", DoubleType)
.add("long", DoubleType)
)
)
)
val df = kafkadata.selectExpr("cast (value as string) as json")
.select(from_json($"json",
schema=schema1).as("data"))
.select($"data.id_sales_order",explode($"data.item_collection"))
val query = df.writeStream
.outputMode("append")
.queryName("table")
.format("console")
.start()
query.awaitTermination()
spark.stop()
I am sending data to Kafka in two ways:
1) Single-line JSON:
{"id_sales_order": "2", "item_collection": {"2": {"id": 10,"ip": "68.28.91.22","description": "Sensor attached to the container ceilings","temp":35,"c02_level": 1475,"geo": { "lat":38.00, "long":97.00}}}}
It is giving me output
+--------------+---+--------------------+
|id_sales_order|key| value|
+--------------+---+--------------------+
| 2| 2|[10,68.28.91.22,S...|
+--------------+---+--------------------+
2) Multiline JSON:
{
"id_sales_order": "2",
"item_collection": {
"2": {
"id": 10,
"ip": "68.28.91.22",
"description": "Sensor attached to the container ceilings",
"temp":35,
"c02_level": 1475,
"geo":
{ "lat":38.00, "long":97.00}
}
}
}
It is not giving me any output.
+--------------+---+-----+
|id_sales_order|key|value|
+--------------+---+-----+
+--------------+---+-----+
The JSON coming from the source is like the 2nd one.
How do you handle such JSON while reading streaming data from Kafka?
I think the problem may be with how the multiline JSON reaches Kafka: if each line is produced as a separate message, the value column never contains a complete JSON document, so from_json has nothing valid to parse (from_json itself has no problem with newlines inside a single message).
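One way to check is to dump the raw string values to the console before parsing (a small sketch reusing the kafkadata stream from above); each printed row should contain one complete JSON document:
val raw = kafkadata.selectExpr("CAST(value AS STRING) AS json")

raw.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .start()
  .awaitTermination()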