Add schema to Spark structured streaming messages in JSON format - apache-spark

I'm implementing a Spark Structured Streaming job where I'm consuming messages coming from Kafka in JSON format.
def setup_input_stream(kafka_brokers, spark, topic_name):
    return spark.readStream.format("kafka") \
        .option("kafka.bootstrap.servers", kafka_brokers) \
        .option("subscribe", topic_name) \
        .load()
I am then able to extract the value field from the Kafka message in the form of a String containing a JSON payload.
deserialized_data = data_stream \
    .selectExpr("CAST(value AS STRING) AS json") \
    .select(f.from_json(f.col("json"), schema=JSON_SCHEMA).alias("schemaless_data")) \
    .select("schemaless_data.payload")
Once I have this payload column, I'm struggling to find a way to let Spark automatically infer its schema and convert it to a proper DataFrame.
I know I can hardcode a StructType containing the schema of my payload, but since I want to use this generic implementation to ingest data coming from different RDBMS tables (each table in its separate topic), I don't really want to hardcode the schema of every possible table.
Can message schemas be inferred somehow?
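One workaround, echoed in the related answers below, is to infer each topic's schema from a small batch read before starting the stream. A minimal sketch in Scala (the question's code is PySpark); kafkaBrokers, topicName, and dataStream mirror the question's variables, and the sample size is an assumption:

import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Read a small, bounded sample of the topic as a batch to infer its schema.
val sample = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", topicName)
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .limit(100) // hypothetical sample size

val inferredSchema = spark.read.json(sample.as[String]).schema

// Apply the inferred schema to the streaming DataFrame.
val parsed = dataStream
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", inferredSchema).as("data"))
  .select("data.*")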

Related

Writing data as JSON array with Spark Structured Streaming

I have to write data from Spark Structured Streaming as a JSON array. I have tried the code below:
df.selectExpr("to_json(struct(*)) AS value").toJSON
which returns a Dataset[String], but I am unable to write it as a JSON array.
Current Output:
{"name":"test","id":"id"}
{"name":"test1","id":"id1"}
Expected Output:
[{"name":"test","id":"id"},{"name":"test1","id":"id1"}]
Edit (moving comments into question):
After using the proposed collect_list method I am getting
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
Then I tried something like this -
withColumn("timestamp", unix_timestamp(col("event_epoch"), "MM/dd/yyyy hh:mm:ss aa")) .withWatermark("event_epoch", "1 minutes") .groupBy(col("event_epoch")) .agg(max(col("event_epoch")).alias("timestamp"))
But I don't want to add a new column.
You can use the SQL built-in function collect_list for this. This function collects and returns a list of elements, duplicates included (compared to collect_set, which returns only unique elements).
From the source code for collect_list you will see that this is an aggregation function. The Structured Streaming Programming Guide's section on Output Modes states that the output modes "complete" and "update" are supported for aggregations without a watermark.
As I understand from your comments, you do not wish to add a watermark or new columns. Also, the error you are facing
Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;
reminds you not to use the output mode "append".
In the comments, you have mentioned that you plan to produce the results into a Kafka message: one big JSON array as one Kafka value. The complete code could look like this:
import org.apache.spark.sql.streaming.Trigger

val df = spark.readStream
  .[...] // in my test I am reading from a Kafka source
  .load()
  .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "offset", "partition")
  // do not forget to convert your data into a String before writing to Kafka
  .selectExpr("CAST(collect_list(to_json(struct(*))) AS STRING) AS value")

df.writeStream
  .format("kafka")
  .outputMode("complete")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test")
  .option("checkpointLocation", "/path/to/sparkCheckpoint")
  .trigger(Trigger.ProcessingTime(10000))
  .start()
  .awaitTermination()
Given the key/value pairs (k1,v1), (k2,v2), and (k3,v3) as inputs you will get a value in the Kafka topic that contains all selected data as a JSON Array:
[{"key":"k1","value":"v1","offset":7,"partition":0}, {"key":"k2","value":"v2","offset":8,"partition":0}, {"key":"k3","value":"v3","offset":9,"partition":0}]
Tested with Spark 3.0.1 and Kafka 2.5.0.

JSON schema inference in Structured Streaming with Kafka as source

I'm currently using Spark Structured Streaming to read JSON data out of a Kafka topic. The JSON is stored as a string in the topic. To accomplish this I supply a hardcoded JSON schema as a StructType. I'm searching for a good way to dynamically infer the schema of a topic during streaming.
This is my code:
(It's Kotlin and not the usually used Scala)
spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "my_topic")
    .option("startingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(
        from_json(
            col("value"),
            JsonSchemaRegistry.mySchemaStructType)
            .`as`("data")
    )
    .select("data.*")
    .writeStream()
    .format("my_format")
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(ProcessingTime("25 seconds"))
    .start("/my/path")
    .awaitTermination()
Is this possible in a clean way right now, without inferring it again for each DataFrame? I'm looking for an idiomatic way. If schema inference is not advisable in Structured Streaming I will continue to hardcode my schemas, but I want to be sure. The option spark.sql.streaming.schemaInference is mentioned in the Spark docs but I can't see how to use it.
For Kafka this is NOT possible; it would take too much time. For file sources you can.
From the manual:
Schema inference and partition of streaming DataFrames/Datasets
By default, Structured Streaming from file based sources requires you
to specify the schema, rather than rely on Spark to infer it
automatically. This restriction ensures a consistent schema will be
used for the streaming query, even in the case of failures. For ad-hoc
use cases, you can reenable schema inference by setting
spark.sql.streaming.schemaInference to true.
But that applies to file sources only, which Kafka is not.
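For completeness, a minimal sketch of what the file-source case looks like; the application name and input path are assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-schema-inference").getOrCreate()

// Re-enable schema inference for file-based streaming sources only.
spark.conf.set("spark.sql.streaming.schemaInference", "true")

// With the flag set, .schema(...) can be omitted for a file source, e.g. JSON files
// landing in a directory (hypothetical path).
val fileStream = spark.readStream
  .format("json")
  .load("/data/incoming")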

How to validate every row of a streaming batch?

I need to validate each row of a streaming DataFrame (consumed through readStream from Kafka), but I am getting the error
Queries with streaming sources must be executed with writeStream.start()
because it does not allow me to validate row by row.
I had created a Spark batch job to consume data from Kafka, validate each row against HBase data, run another set of validations based on the row key, and create a DataFrame out of it. But there I need to handle the Kafka offsets manually in the code.
To avoid the offset handling, I am trying to use Spark Structured Streaming, but there I am not able to validate row by row.
writeStream.foreach(ForeachWriter) is fine for sinking to any external datasource or writing to Kafka.
But in my case, I need to validate each row and create a new DataFrame based on my validation. The ForeachWriter's process method does not allow collecting the data into other external classes/lists.
Errors:
I get this error when I try to access the streaming DataFrame to validate it:
Queries with streaming sources must be executed with writeStream.start();
I get "Task not serializable" when I try to build a list inside foreach (an object extending ForeachWriter). I will update with more details as I try other options.
I am trying to emulate a Spark batch run by using writeStream.trigger(Trigger.Once()) with a checkpointLocation.
Updating with the Spark batch and Structured Streaming code.
val rawData = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootStrap)
  .option("subscribePattern", kafkaSubTopic)
  .option("startingOffsets", "earliest")
  //.option("endingOffsets", "latest")
  .load()

rawData.collect.foreach(row => {
  if (dValidate.dValidate(row)) {
    validatedCandidates += (row.getString(0))
  }
})
In the above batch code I need to handle the offsets manually for reruns, so I decided to use Structured Streaming.
val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootStrap)
  .option("subscribe", kafkaSubTopic)
  .option("enable.auto.commit", "true")
  .option("startingOffsets", "latest")
  .option("minPartitions", "10")
  .option("failOnDataLoss", "true")
  .load()

val sInput = new SinkInput(validatedCandidates, dValidate)

rawData.writeStream
  .foreach(sInput)
  .outputMode(OutputMode.Append())
  .option("truncate", "false")
  .trigger(Trigger.Once())
  .start()
am getting "Task not serialized" error in here.
with class SinkInput , I am trying to do the same collect operation with external dValidate instance
Unless I misunderstood your case, rawData is a streaming query (a streaming Dataset) and does not support collect. The following part of your code is not correct:
rawData.collect
That's not supported and hence the exception.
You should be using foreach or foreachBatch to access rows.
Do this instead:
rawData.writeStream.foreach(...).start()
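If the validation needs driver-side state (like the validatedCandidates list from the question), a hedged sketch using foreachBatch, where each micro-batch arrives as a regular DataFrame; the validator comes from the question and the checkpoint path is an assumption:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Inside foreachBatch the micro-batch is a plain (non-streaming) DataFrame, so collect is allowed.
val processBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
  // collect() runs on the driver; acceptable for small batches, as in the original batch job.
  batchDf.collect().foreach { row =>
    if (dValidate.dValidate(row)) { // validator taken from the question
      validatedCandidates += row.getString(0)
    }
  }
}

rawData.writeStream
  .foreachBatch(processBatch)
  .trigger(Trigger.Once())
  .option("checkpointLocation", "/path/to/checkpoint") // hypothetical path
  .start()
  .awaitTermination()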

Remove (corrupt) rows from Spark Streaming DataFrame that don't fit schema (incoming JSON data from Kafka)

I have a Spark Structured Streaming application that reads from Kafka.
Here is the basic structure of my code.
I create the Spark session.
val spark = SparkSession
  .builder
  .appName("app_name")
  .getOrCreate()
Then I read from the stream
val data_stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server_list")
  .option("subscribe", "topic")
  .load()
From the Kafka record, I cast the "value" field as a string, which converts it from binary to string. At this point there is one column in the DataFrame:
val df = data_stream
  .select($"value".cast("string") as "json")
Based on a pre-defined schema, I try to parse the JSON structure out into columns. However, the problem is that if the data is "bad", or in a different format, it doesn't match the defined schema, so the next DataFrame (df2) gets null values in its columns.
val df2 = df.select(from_json($"json", schema) as "data")
  .select("data.*")
I'd like to be able to filter out from df2 the rows that have null in a certain column (one that I use as a primary key in a database), i.e. ignore bad data that doesn't match the schema.
EDIT: I was somewhat able to accomplish this, but not the way I intended to.
In my process, I use a query that uses the .foreach(writer) process. This opens a connection to a database, processes each row, and then closes the connection. The Structured Streaming documentation describes what this process requires. In the process method, I get the values from each row and check whether my primary key is null; if it is null I don't insert it into the database.
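A minimal sketch of that ForeachWriter pattern; the primary-key column name ("id") and the JDBC details are assumptions and are left as comments:

import org.apache.spark.sql.{ForeachWriter, Row}

class DbWriter extends ForeachWriter[Row] {
  // var connection: java.sql.Connection = _ // hypothetical JDBC connection

  override def open(partitionId: Long, epochId: Long): Boolean = {
    // connection = java.sql.DriverManager.getConnection(url, user, password)
    true // returning true means "process this partition"
  }

  override def process(row: Row): Unit = {
    // Skip rows whose primary key did not survive from_json (i.e. did not match the schema).
    if (row.getAs[Any]("id") != null) {
      // connection.prepareStatement("INSERT ...").executeUpdate()
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    // if (connection != null) connection.close()
  }
}

df2.writeStream.foreach(new DbWriter).start()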
Just filter out any null values you don't want:
df2
  .filter($"colName".isNotNull)
Kafka stores data as raw byte arrays. Data producers and consumers need to agree on a structure for the data they process.
If the produced message format changes, consumers need to adjust to read the new format. The problem arises when your data structure evolves: you may need to maintain compatibility on the consumer side.
Defining the message format with Protobuf solves this problem.

Spark structured streaming kafka convert JSON without schema (infer schema)

I read that Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve the schema the same way Spark Streaming does:
val dataFrame = spark.read.json(rdd.map(_.value()))
dataFrame.printSchema
Here is one possible way to do this:
1. Before you start streaming, get a small batch of the data from Kafka
2. Infer the schema from the small batch
3. Start streaming the data using the extracted schema.
The pseudo-code below illustrates this approach.
Step 1:
Extract a small batch (two records) from Kafka:
val smallBatch = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "node:9092")
  .option("subscribe", "topicName")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", """{"topicName":{"0":2}}""")
  .load()
  .selectExpr("CAST(value AS STRING) as STRING").as[String].toDF()
Step 2:
Write the small batch to a file:
smallBatch.write.mode("overwrite").format("text").save("/batch")
This command writes the small batch into the HDFS directory /batch. The name of the file that it creates is part-xyz*, so you first need to rename the file using the Hadoop FileSystem API (see org.apache.hadoop.fs._ and org.apache.hadoop.conf.Configuration; here's an example: https://stackoverflow.com/a/41990859) and then read the file as JSON:
val smallBatchSchema = spark.read.json("/batch/batchName.txt").schema
Here, batchName.txt is the new name of the file and smallBatchSchema contains the schema inferred from the small batch.
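For reference, a minimal sketch of that rename step using the Hadoop FileSystem API; the paths are assumptions:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Locate the part-* file Spark wrote and rename it to a predictable name.
val partFile = fs.globStatus(new Path("/batch/part-*"))(0).getPath
fs.rename(partFile, new Path("/batch/batchName.txt"))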
Finally, you can stream the data as follows (Step 3):
val inputDf = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "node:9092")
  .option("subscribe", "topicName")
  .option("startingOffsets", "earliest")
  .load()

val dataDf = inputDf.selectExpr("CAST(value AS STRING) as json")
  .select(from_json($"json", schema = smallBatchSchema).as("data"))
  .select("data.*")
Hope this helps!
It is possible using this construct:
val myStream = spark.readStream.schema(spark.read.json("my_sample_json_file_as_schema.json").schema).json("my_json_file")..
How can this be? Well, since spark.read.json("..").schema returns exactly the inferred schema you want, you can use this returned schema as an argument for the mandatory schema parameter of spark.readStream.
What I did was to specify a one-liner sample JSON as input for inferring the schema, so it does not unnecessarily take up memory. In case your data changes, simply update your sample JSON.
Took me a while to figure out (constructing StructTypes and StructFields by hand was a pain in the ..), therefore I'll be happy for all upvotes :-)
It is not possible. Structured Streaming supports limited schema inference in development with spark.sql.streaming.schemaInference set to true:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
but it cannot be used to extract JSON from Kafka messages and DataFrameReader.json doesn't support streaming Datasets as arguments.
You have to provide the schema manually; see How to read records in JSON format from Kafka using Structured Streaming?
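For illustration, a minimal sketch of providing such a schema manually; the field names and the inputDf streaming DataFrame are assumptions:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import spark.implicits._ // for the $"..." column syntax

// Hypothetical payload schema; replace the fields with your topic's actual layout.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))

val parsed = inputDf
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("data"))
  .select("data.*")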
It is possible to convert JSON to a DataFrame without having to manually type the schema, if that is what you meant to ask.
Recently I ran into a situation where I was receiving massively long nested JSON packets via Kafka, and manually typing the schema would have been both cumbersome and error-prone.
With a small sample of the data and some trickery you can provide the schema to Spark2+ as follows:
import spark.implicits._ // needed for .toDS and the $"..." column syntax

val jsonstr = """ copy paste a representative sample of data here"""
val jsondf = spark.read.json(Seq(jsonstr).toDS) // jsondf.schema has the nested JSON structure we need

val event = spark.readStream.format..option...load() // configure your source

val eventWithSchema = event
  .select($"value" cast "string" as "json")
  .select(from_json($"json", jsondf.schema) as "data")
  .select("data.*")
Now you can do whatever you want with this val as you would with Direct Streaming. Create temp view, run SQL queries, whatever..
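A hedged example of the "temp view + SQL" usage mentioned above; the view name, query, and console sink are assumptions:

// Register the parsed stream as a temporary view and query it with SQL.
eventWithSchema.createOrReplaceTempView("events")

val result = spark.sql("SELECT * FROM events") // hypothetical query

result.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()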
Taking Arnon's solution to the next step (since it's deprecated in Spark's newer versions, and would require iterating over the whole DataFrame just for a type cast):
spark.read.json(df.as[String])
Anyway, as of now, it's still experimental.
