Add schema to a Dataset[Row] in Java - apache-spark

I am new to Spark and trying to explore Spark Structured Streaming. I will be consuming messages (nested JSON) from Kafka and filtering them based on certain conditions on the JSON attributes. Every message that satisfies the filter should then be pushed to Cassandra.
I have read the documentation on the Spark Cassandra connector, as well as the Kafka integration guide:
https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
Dataset<Row> df = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load();

df.selectExpr("CAST(value AS STRING)");
I only need a few of the many attributes present in this nested JSON. How do I apply a schema on top of it so that I can use Spark SQL for filtering?
For the sample JSON below, I need to persist name, age, experience, hobby_name and hobby_experience for players whose sum of playing frequency is more than 5.
{
  "name": "Tom",
  "age": "24",
  "gender": "male",
  "hobbies": [{
    "name": "Tennis",
    "experience": 5,
    "places": [{
      "city": "London",
      "frequency": 4
    }, {
      "city": "Sydney",
      "frequency": 3
    }]
  }]
}
I am relatively new to Spark, so please forgive me if this is a repeat. Also, I am looking for a solution in Java.

You can specify your schema like so. Note that in your sample JSON hobbies and places are arrays, so they need array types:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType schema = DataTypes.createStructType(new StructField[] {
  DataTypes.createStructField("name", DataTypes.StringType, true),
  DataTypes.createStructField("age", DataTypes.StringType, true),
  DataTypes.createStructField("gender", DataTypes.StringType, true),
  DataTypes.createStructField("hobbies", DataTypes.createArrayType(
    DataTypes.createStructType(new StructField[] {
      DataTypes.createStructField("name", DataTypes.StringType, true),
      DataTypes.createStructField("experience", DataTypes.IntegerType, true),
      DataTypes.createStructField("places", DataTypes.createArrayType(
        DataTypes.createStructType(new StructField[] {
          DataTypes.createStructField("city", DataTypes.StringType, true),
          DataTypes.createStructField("frequency", DataTypes.IntegerType, true)
        })), true)
    })), true)
});
And then use the schema to create your DataFrame as needed:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

df.select(from_json(col("value"), schema).as("data"))
  .select(
    col("data.name").as("name"),
    col("data.hobbies.name").as("hobbies_name"));

Related

What must type be for createDataFrame(data: DataFrameLike) parameter?

I have the following code to create a pyspark DataFrame:
from typing import List, Dict, Any
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    FloatType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)


def purchases_schema() -> StructType:
    return StructType(
        [
            StructField("Customer", StringType(), True),
            StructField("Store", StringType(), True),
            StructField("Product", StringType(), True),
            StructField("Quantity", IntegerType(), True),
            StructField("Basket", StringType(), True),
            StructField("GrossSpend", FloatType(), True),
        ]
    )


spark = SparkSession.builder.getOrCreate()
transactions = [
    {
        "Customer": "Leia",
        "Store": "Hammersmith",
        "Basket": "basket1",
        "items": [
            {"Product": "Cheddar", "Quantity": 2, "GrossSpend": 2.50},
            {"Product": "Grapes", "Quantity": 1, "GrossSpend": 3.00},
        ],
    },
    {
        "Customer": "Luke",
        "Store": "Ealing",
        "Basket": "basket2",
        "items": [
            {
                "Product": "Custard",
                "Quantity": 1,
                "GrossSpend": 3.00,
            }
        ],
    },
]
flattened_transactions: List[Dict[str, Any]] = [
    {
        "Customer": d["Customer"],
        "Store": d["Store"],
        "Basket": d["Basket"],
        **d2,
    }
    for d in transactions
    for d2 in d["items"]
]
df = spark.createDataFrame(flattened_transactions, schema=purchases_schema())
print(df.collect())
It works fine, here is the output:
➜ python script.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/11/29 10:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Row(Timestamp=None, Customer='Leia', Store='Hammersmith', Channel=None, Product='Cheddar', Quantity=2, Basket='basket1', GrossSpend=2.5), Row(Timestamp=None, Customer='Leia', Store='Hammersmith', Channel=None, Product='Grapes', Quantity=1, Basket='basket1', GrossSpend=3.0), Row(Timestamp=None, Customer='Luke', Store='Ealing', Channel=None, Product='Custard', Quantity=1, Basket='basket2', GrossSpend=3.0)]
What bothers me is that my IDE is complaining about the list passed to createDataFrame's data parameter:
Argument of type "list[Dict[str, Any]]" cannot be assigned to parameter "data" of type "DataFrameLike" in function "createDataFrame"
  "list[Dict[str, Any]]" is incompatible with "DataFrameLike"
Given that the code runs successfully, I'm slightly surprised by this complaint from the IDE. Can someone suggest how I can solve this problem (without simply sticking # type: ignore on the end of the line)?

Spark reading json number as string

I'm working on a Spark Java application using Spark 2.4.7. I have a JSON file that I'm loading into a dataframe like this:
Dataset<Row> df = sparkSession().read().option("multiline", true).format("json").load(path_of_json);
The issue is that in my JSON file I have an attribute whose value is a number, but when I printSchema() the dataframe it shows that attribute as StringType and not LongType.
JSON file:
[
  {
    "first": {
      "id": "fdfd",
      "name": "temp",
      "type": -1            --> reading it as LongType
    },
    "something": "something_else",
    "data": {
      "key": {
        "field": 7569,      --> reading it as StringType
        "temp": "dfdfd"
      }
    }
  }
]
I tried reproducing the issue in my local spark-shell but it works fine there. Does anyone have an idea why this is happening?
By default, Spark tries to infer the schema automatically when reading from a JSON data source. However, if you know it, you can specify the schema when loading the DataFrame.
You first need to define the schema, an instance of the StructType class, where you specify each field name and data type.
You can do it manually:
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType keyType = new StructType()
  .add(new StructField("field", DataTypes.LongType, true, Metadata.empty()))
  .add(new StructField("temp", DataTypes.StringType, true, Metadata.empty()));

StructType dataType = new StructType()
  .add(new StructField("key", keyType, true, Metadata.empty()));

StructType firstType = new StructType()
  .add(new StructField("id", DataTypes.StringType, true, Metadata.empty()))
  .add(new StructField("name", DataTypes.StringType, true, Metadata.empty()))
  .add(new StructField("type", DataTypes.LongType, true, Metadata.empty()));

StructType schema = new StructType()
  .add(new StructField("data", dataType, true, Metadata.empty()))
  .add(new StructField("first", firstType, true, Metadata.empty()))
  .add(new StructField("something", DataTypes.StringType, true, Metadata.empty()));
or from a DDL string:
StructType schema = StructType.fromDDL("data STRUCT<key: STRUCT<field: BIGINT, temp: STRING>>, first STRUCT<id: STRING, name: STRING, type: BIGINT>, something STRING");
Then specify the schema when loading the DataFrame:
Dataset<Row> df = spark.read()
.option("multiline", true)
.format("json")
.schema(schema)
.load(jsonPath);
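As a quick sanity check (a sketch, not part of the original answer), printing the schema of the DataFrame loaded above should now show field as a long rather than a string:
// Confirm the explicit schema is applied instead of the inferred one.
df.printSchema();
// Expected (roughly):
// root
//  |-- data: struct (nullable = true)
//  |    |-- key: struct (nullable = true)
//  |    |    |-- field: long (nullable = true)
//  |    |    |-- temp: string (nullable = true)
//  ...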

Can't extract value from <> need struct type but got string;

I have some nested JSON that I have parallelized and written out as JSON. A complete record looks like:
{
  "id": "1",
  "type": "site",
  "attributes": {
    "description": "Number 1 Park",
    "activeInactive": {
      "text": "Active",
      "colour": "#4CBB17"
    },
    "lastUpdated": "2019-12-05T08:51:39"
  },
  "relationships": {
    "region": {
      "data": {
        "type": "region",
        "id": "1061",
        "meta": {
          "displayValue": "Park Region"
        }
      }
    }
  }
}
However, the data is pending a data cleanse and currently the region field is not populated.
{
  "id": "1",
  "type": "site",
  "attributes": {
    "description": "Number 1 Park",
    "activeInactive": {
      "text": "Active",
      "colour": "#4CBB17"
    },
    "lastUpdated": "2019-12-05T08:51:39"
  },
  "relationships": {
    "region": {
      "data": null
    }
  }
}
The data element will be null if the relationship doesn't exist (i.e. it is an orphaned site).
I load this JSON into a Spark dataframe via an RDD. The schema of the dataframe is:
attributes: struct
    activeInactive: struct
        colour: string
        text: string
    description: string
    lastUpdated: string
id: string
relationships: struct
    region: struct
        data: string
I get errors when I select the region with df.select(col('relationships.region.data.meta.displayValue')), because the code assumes the nested fields are there while the inferred schema has data as a string, hence the error in the title. I'm going to assume this is because of the conflict with the dataframe's schema.
The question is how can I make this more dynamic and still obtain the displayValue as and when this is populated without needing to revisit the code?
While reading a json file, you can impose the schema on the output dataframe using this syntax:
df = spark.read.json("<path to json file>", schema = <schema object>)
This way the data field will still show null, but it will be a StructType() with the complete nested structure.
Based on the data snippet provided, the applicable schema object looks like this:
from pyspark.sql.types import StructType, StructField, StringType

schemaObject = StructType([
    StructField('id', StringType(), True),
    StructField('type', StringType(), True),
    StructField('attributes', StructType([
        StructField('description', StringType(), True),
        StructField('activeInactive', StructType([
            StructField('text', StringType(), True),
            StructField('colour', StringType(), True)
        ]), True),
        StructField('lastUpdated', StringType(), True)
    ]), True),
    StructField('relationships', StructType([
        StructField('region', StructType([
            StructField('data', StructType([
                StructField('type', StringType(), True),
                StructField('id', StringType(), True),
                StructField('meta', StructType([
                    StructField('displayValue', StringType(), True)
                ]), True)
            ]), True)
        ]), True)
    ]), True)
])

Spark Structured Streaming read nested json from kafka and flatten it

Some JSON data:
{
  "id": "34cx34fs987",
  "time_series": [
    {
      "time": "2020090300: 00: 00",
      "value": 342342.12
    },
    {
      "time": "2020090300: 00: 05",
      "value": 342421.88
    },
    {
      "time": "2020090300: 00: 10",
      "value": 351232.92
    }
  ]
}
I get the JSON from Kafka:
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
df = spark.readStream.format("kafka")...
How can I manipulate df to get a DataFrame as shown below:
id          time               value
34cx34fs987 20200903 00:00:00  342342.12
34cx34fs987 20200903 00:00:05  342421.88
34cx34fs987 20200903 00:00:10  351232.92
Using Scala:
If you define your schema as
val schema: StructType = new StructType()
  .add("id", StringType)
  .add("time_series", ArrayType(new StructType()
    .add("time", StringType)
    .add("value", DoubleType)
  ))
you can then make use of Spark SQL built-in functions from_json and explode
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = df
.selectExpr("CAST(value as STRING) as json")
.select(from_json('json, schema).as("data"))
.select(col("data.id").as("id"), explode(col("data.time_series")).as("time_series"))
.select(col("id"), col("time_series.time").as("time"), col("time_series.value").as("value"))
Your output will be then
+-----------+-----------------+---------+
|id |time |value |
+-----------+-----------------+---------+
|34cx34fs987|20200903 00:00:00|342342.12|
|34cx34fs987|20200903 00:00:05|342421.88|
|34cx34fs987|20200903 00:00:10|351232.92|
+-----------+-----------------+---------+
Sample code in PySpark (df here is assumed to be already parsed with from_json, as in the Scala example above):
from pyspark.sql import functions as f

df2 = df.select("id", f.explode("time_series").alias("col"))
df2.select("id", "col.time", "col.value").show()

Cannot convert Catalyst type IntegerType to Avro type ["null","int"]

I have a Spark Structured Streaming process built with PySpark that reads Avro messages from a Kafka topic, makes some transformations and loads the data as Avro into a target topic.
I use the ABRIS package (https://github.com/AbsaOSS/ABRiS) to serialize/deserialize the Avro from Confluent, integrating with Schema Registry.
The schema contains integer columns as follows:
{
  "name": "total_images",
  "type": [
    "null",
    "int"
  ],
  "default": null
},
{
  "name": "total_videos",
  "type": [
    "null",
    "int"
  ],
  "default": null
},
The process raises the following error: Cannot convert Catalyst type IntegerType to Avro type ["null","int"].
I've tried to convert the columns to be nullable but the error persists.
If someone has a suggestion, I would appreciate it.
I burned hours on this one.
Actually, it is unrelated to the ABRiS dependency (the behaviour is the same with the native spark-avro APIs).
There may be several root causes, but in my case (Spark 3.0.1, Scala with a Dataset) it was related to the encoder and a wrong type in the case class holding the data.
In short, an Avro field defined with "type": ["null","int"] cannot be mapped to a Scala Int; it needs Option[Int].
Using the following code:
test("Avro Nullable field") {
val schema: String =
"""
|{
| "namespace": "com.mberchon.monitor.dto.avro",
| "type": "record",
| "name": "TestAvro",
| "fields": [
| {"name": "strVal", "type": ["null", "string"]},
| {"name": "longVal", "type": ["null", "long"]}
| ]
|}
""".stripMargin
val topicName = "TestNullableAvro"
val testInstance = TestAvro("foo",Some(Random.nextInt()))
import sparkSession.implicits._
val dsWrite:Dataset[TestAvro] = Seq(testInstance).toDS
val allColumns = struct(dsWrite.columns.head, dsWrite.columns.tail: _*)
dsWrite
.select(to_avro(allColumns,schema) as 'value)
.write
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("topic", topicName)
.save()
val dsRead:Dataset[TestAvro] = sparkSession.read
.format("kafka")
.option("kafka.bootstrap.servers", bootstrap)
.option("subscribe", topicName)
.option("startingOffsets", "earliest")
.load()
.select(from_avro(col("value"), schema) as 'Metric)
.select("Metric.*")
.as[TestAvro]
assert(dsRead.collect().contains(testInstance))
}
It fails if case class is defined as follow:
case class TestAvro(strVal:String,longVal:Long)
Cannot convert Catalyst type LongType to Avro type ["null","long"].
org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Catalyst type LongType to Avro type ["null","long"].
at org.apache.spark.sql.avro.AvroSerializer.newConverter(AvroSerializer.scala:219)
at org.apache.spark.sql.avro.AvroSerializer.$anonfun$newStructConverter$1(AvroSerializer.scala:239)
It works properly with:
case class TestAvro(strVal:String,longVal:Option[Long])
By the way, it would be nice to have support for SpecificRecord within Spark encoders (you can use Kryo, but it is less efficient), since in order to use an efficiently typed Dataset with my Avro data I need to create additional case classes (which duplicate my SpecificRecords).
