Spark structured streaming kafka convert JSON without schema (infer schema) - apache-spark

I read that Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve the schema the same way Spark Streaming does?
val dataFrame = spark.read.json(rdd.map(_.value()))
dataFrame.printSchema

Here is one possible way to do this:
Before you start streaming, get a small batch of the data from Kafka
Infer the schema from the small batch
Start streaming the data using the extracted schema.
The pseudo-code below illustrates this approach.
Step 1:
Extract a small (two-record) batch from Kafka:
val smallBatch = spark.read.format("kafka")
.option("kafka.bootstrap.servers", "node:9092")
.option("subscribe", "topicName")
.option("startingOffsets", "earliest")
.option("endingOffsets", """{"topicName":{"0":2}}""")
.load()
.selectExpr("CAST(value AS STRING) as STRING").as[String].toDF()
Step 2:
Write the small batch to a file:
smallBatch.write.mode("overwrite").format("text").save("/batch")
This command writes the small batch into the HDFS directory /batch. The name of the file it creates has the form part-xyz*. So you first need to rename the file using the Hadoop FileSystem API (see org.apache.hadoop.fs._ and org.apache.hadoop.conf.Configuration; here's an example: https://stackoverflow.com/a/41990859) and then read the file as JSON:
val smallBatchSchema = spark.read.json("/batch/batchName.txt").schema
Here, batchName.txt is the new name of the file and smallBatchSchema contains the schema inferred from the small batch.
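For the file rename mentioned in Step 2, here is a minimal sketch using the Hadoop FileSystem API (it assumes HDFS and that exactly one part file ended up in /batch):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path("/batch/part-*"))(0).getPath // the single part-xyz* file
fs.rename(partFile, new Path("/batch/batchName.txt"))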
Finally, you can stream the data as follows (Step 3):
val inputDf = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "node:9092")
.option("subscribe", "topicName")
.option("startingOffsets", "earliest")
.load()
val dataDf = inputDf.selectExpr("CAST(value AS STRING) as json")
.select( from_json($"json", schema=smallBatchSchema).as("data"))
.select("data.*")
Hope this helps!

It is possible using this construct:
val myStream = spark.readStream.schema(spark.read.json("my_sample_json_file_as_schema.json").schema).json("my_json_file")
How can this be? Well, since spark.read.json("..").schema returns exactly the inferred schema you want, you can pass that returned schema as the argument for the mandatory schema parameter of spark.readStream.
What I did was to use a one-line sample JSON file as the input for the schema inference, so it doesn't take up memory unnecessarily. In case your data changes, simply update your sample JSON.
Took me a while to figure out (constructing StructTypes and StructFields by hand was a pain), therefore I'll be happy for all upvotes :-)
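For readability, the same construct can be split into two steps (the file paths are placeholders, just as in the one-liner above):
val sampleSchema = spark.read.json("my_sample_json_file_as_schema.json").schema
val myStream = spark.readStream
  .schema(sampleSchema)
  .json("my_json_file")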

It is not possible. Structured Streaming supports limited schema inference during development with spark.sql.streaming.schemaInference set to true:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
but it cannot be used to extract JSON from Kafka messages and DataFrameReader.json doesn't support streaming Datasets as arguments.
You have to provide the schema manually; see How to read records in JSON format from Kafka using Structured Streaming?
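For completeness, a minimal sketch of providing the schema manually to from_json (the broker, topic, and field names below are placeholders):
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val manualSchema = new StructType()
  .add("id", StringType)
  .add("name", StringType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "topicName")                    // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", manualSchema).as("data"))
  .select("data.*")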

It is possible to convert JSON to a DataFrame without having to manually type the schema, if that is what you meant to ask.
Recently I ran into a situation where I was receiving massively long nested JSON packets via Kafka, and manually typing the schema would have been both cumbersome and error-prone.
With a small sample of the data and some trickery, you can provide the schema to Spark 2+ as follows:
val jsonstr = """ copy paste a representative sample of data here"""
val jsondf = spark.read.json(Seq(jsonstr).toDS) //jsondf.schema has the nested json structure we need
val event = spark.readStream.format..option...load() //configure your source
val eventWithSchema = event.select($"value" cast "string" as "json").select(from_json($"json", jsondf.schema) as "data").select("data.*")
Now you can do whatever you want with this val as you would with Direct Streaming. Create temp view, run SQL queries, whatever..
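For example (the sink and predicate below are placeholders), you can register a temp view on the streaming DataFrame and query it with SQL:
eventWithSchema.createOrReplaceTempView("events")
val query = spark.sql("SELECT * FROM events WHERE someField IS NOT NULL") // someField is illustrative
  .writeStream
  .format("console")
  .start()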

Taking Arnon's solution one step further (since his approach is deprecated in newer Spark versions and would require iterating over the whole DataFrame just for a type cast):
spark.read.json(df.as[String])
Anyway, as of now, this is still marked experimental.
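In other words, a minimal sketch (broker and topic names are placeholders): read a bounded batch from Kafka, infer the schema with spark.read.json on the string values, and reuse that schema in the streaming query:
import spark.implicits._

val batchDf = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topicName")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
val inferredSchema = spark.read.json(batchDf.as[String]).schema // no hand-written StructType needed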

Related

Right way to read stream from Kafka topic using checkpointLocation offsets

I'm trying to develop a small Spark app (using Scala) to read messages from Kafka (Confluent) and write (insert) them into a Hive table. Everything works as expected, except for one important feature: managing offsets when the application is restarted (submitted again). It confuses me.
Cut from my code:
def main(args: Array[String]): Unit = {
val sparkSess = SparkSession
.builder
.appName("Kafka_to_Hive")
.config("spark.sql.warehouse.dir", "/user/hive/warehouse/")
.config("hive.metastore.uris", "thrift://localhost:9083")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.enableHiveSupport()
.getOrCreate()
sparkSess.sparkContext.setLogLevel("ERROR")
// don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
sparkSess.udf.register("deserialize", (bytes: Array[Byte]) =>
DeserializerWrapper.deserializer.deserialize(bytes)
)
val kafkaDataFrame = sparkSess
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", 'localhost:9092')
.option("group.id", 'kafka-to-hive-1')
// ------> which Kafka options do I need to set here for starting from last right offset to ensure completenes of data and "exactly once" writing? <--------
.option("failOnDataLoss", (false: java.lang.Boolean))
.option("subscribe", 'some_topic')
.load()
import org.apache.spark.sql.functions._
// don't consider this code block please, it's just a part of Confluent avro message deserializing adventures
val valueDataFrame = kafkaDataFrame.selectExpr("""deserialize(value) AS message""")
val df = valueDataFrame.select(
from_json(col("message"), sparkSchema.dataType).alias("parsed_value"))
.select("parsed_value.*")
df.writeStream
.foreachBatch((batchDataFrame, batchId) => {
batchDataFrame.createOrReplaceTempView("`some_view_name`")
val sqlText = "SELECT * FROM `some_view_name` a where some_field='some value'"
val batchDataFrame_view = batchDataFrame.sparkSession.sql(sqlText);
batchDataFrame_view.write.insertInto("default.some_hive_table")
})
.option("checkpointLocation", "/user/some_user/tmp/checkpointLocation")
.start()
.awaitTermination()
}
Questions (the questions are related to each other):
Which Kafka options do I need to apply on readStream.format("kafka") for starting from last right offset on every submit of spark app?
Do I need to manually read 3rd line of checkpointLocation/offsets/latest_batch file to find last offsets to read from Kafka? I mean something like that: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")
What is the right/convenient way to read stream from Kafka (Confluent) topic? (I'm not considering offsets storing engine of Kafka)
"Which Kafka options do I need to apply on readStream.format("kafka") for starting from last right offset on every submit of spark app?"
You would need to set startingOffsets=latest and clean up the checkpoint files.
"Do I need to manually read 3rd line of checkpointLocation/offsets/latest_batch file to find last offsets to read from Kafka? I mean something like that: readStream.format("kafka").option("startingOffsets", """{"some_topic":{"2":35079,"5":34854,"4":35537,"1":35357,"3":35436,"0":35213}}""")"
Similar to the first question, if you set startingOffsets as a JSON string, you need to delete the checkpoint files first. Otherwise, the Spark application will always use the offsets stored in the checkpoint files and ignore the settings given in the startingOffsets option.
"What is the right/convenient way to read stream from Kafka (Confluent) topic? (I'm not considering offsets storing engine of Kafka)"
Asking about "the right way" might lead to opinion based answers and is therefore off-topic on Stackoverflow. Anyway, using Spark Structured Streaming is already a mature and production-ready approach in my experience. However, it is always worth also looking into KafkaConnect.

JSON schema inference in Structured Streaming with Kafka as source

I'm currently using Spark Structured Streaming to read JSON data out of a Kafka topic. The JSON is stored as a string in the topic. To accomplish this I supply a hardcoded JSON schema as a StructType. I'm searching for a good way to dynamically infer the schema of a topic during streaming.
This is my code:
(It's Kotlin, not the usually used Scala.)
spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "my_topic")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "false")
.load()
.selectExpr("CAST(value AS STRING)")
.select(
from_json(
col("value"),
JsonSchemaRegistry.mySchemaStructType)
.`as`("data")
)
.select("data.*")
.writeStream()
.format("my_format")
.option("checkpointLocation", "/path/to/checkpoint")
.trigger(ProcessingTime("25 seconds"))
.start("/my/path")
.awaitTermination()
Is this possible in a clean way right now, without inferring it again for each DataFrame? I'm looking for some idiomatic way. If schema inference is not advisable in Structured Streaming I will continue to hardcode my schemas, but I want to be sure. The option spark.sql.streaming.schemaInference is mentioned in the Spark docs but I can't see how to use it.
For Kafka this is NOT possible; it would take too much time. For file sources you can.
From the manual:
Schema inference and partition of streaming DataFrames/Datasets
By default, Structured Streaming from file based sources requires you
to specify the schema, rather than rely on Spark to infer it
automatically. This restriction ensures a consistent schema will be
used for the streaming query, even in the case of failures. For ad-hoc
use cases, you can reenable schema inference by setting
spark.sql.streaming.schemaInference to true.
But that applies to file sources only, not to Kafka.
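For a file source it would look like this (shown in Scala for consistency with the rest of the thread; the path is a placeholder):
spark.conf.set("spark.sql.streaming.schemaInference", "true")
val fileStream = spark.readStream.json("/path/to/json/dir") // schema is inferred from the files already present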

How to convert JSON Dataset to DataFrame in Spark Structured Streaming [duplicate]

This question already has an answer here:
How to read records in JSON format from Kafka using Structured Streaming?
I am using Spark Structured Streaming to process data from Kafka. I transform each message to JSON. However, Spark needs an explicit schema to obtain columns from the JSON. Spark Streaming with DStreams allows doing the following:
spark.read.json(spark.createDataset(jsons))
where jsons is RDD[String].
In the case of Spark Structured Streaming, the similar approach
df.sparkSession.read.json(jsons)
(where jsons is Dataset[String])
results in the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
I assume that read triggers execution instead of start, but is there a way to bypass this?
To stream from JSON on Kafka to a DataFrame you need to do something like this:
case class Colour(red: Int, green: Int, blue: Int)
val colourSchema: StructType = new StructType()
.add("red", "int")
.add("green", "int")
.add("blue", "int")
val streamingColours: DataFrame = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092") // required; placeholder broker
.option("subscribe", "colours") // required; placeholder topic
.load()
.selectExpr("CAST(value AS STRING)")
.select(from_json($"value", colourSchema))
streamingColours
.writeStream
.outputMode("complete")
.format("console")
.start()
This should create a streaming DataFrame, and show the results of reading from Kafka on the console.
I do not believe it is possible to use "infer schema" with streaming data sets. And this makes sense, since infer schema looks at a large set of data to work out what the types are etc. With streaming datasets the schema that could be inferred by processing the first message might be different to the schema of the second message, etc. And Spark needs one schema for all elements of the DataFrame.
What we have done in the past is to process a batch of JSON messages with Spark's batch processing and use schema inference there. We then export that schema for use with the streaming datasets.
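A minimal sketch of that export/import round trip (paths are placeholders): a StructType can be serialized to JSON with .json and restored with DataType.fromJson:
import org.apache.spark.sql.types.{DataType, StructType}

val batchSchema = spark.read.json("/path/to/sample/batch").schema
val schemaJson = batchSchema.json // serialize the inferred schema to a JSON string
// store schemaJson wherever convenient (file, config, ...), then in the streaming job:
val restoredSchema = DataType.fromJson(schemaJson).asInstanceOf[StructType]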

How to process Avro messages while reading a stream of messages from Kafka?

The code below reads messages from Kafka, and the messages are in Avro. How do I parse the message and put it into a DataFrame in Spark 2.2.0?
Dataset<Row> df = sparkSession.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic1")
.load();
The https://github.com/databricks/spark-avro library has no example for the streaming case.
How do I parse the message and put it into a DataFrame in Spark 2.2.0?
That's your homework exercise; it is going to require some coding.
The https://github.com/databricks/spark-avro library has no example for the streaming case.
I've been told (and seen a couple of questions here) that spark-avro does not support Spark Structured Streaming (aka Spark Streams). It works fine with non-streaming Datasets, but can't handle streaming ones.
That's why I wrote that this is something you have to code yourself.
That could look as follows (I use Scala for simplicity):
// Step 1. convert messages to be strings
val avroMessages = df.select($"value" cast "string")
// Step 2. Strip the avro layer off
val from_avro = udf { (s: String) => ...processing here... }
val cleanDataset = avroMessages.withColumn("no_avro_anymore", from_avro($"value"))
That would require developing a from_avro custom UDF that does what you want (similar to how Spark handles the JSON format with the from_json standard function!)
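One possible shape of that UDF, as a hedged sketch only: MyAvroDeserializer below is a hypothetical helper standing in for whatever Avro / Schema Registry client you use; it is not part of Spark or spark-avro. Here the Kafka value is kept as binary rather than cast to a string:
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical decoder: replace the body with your actual Avro decoding logic.
object MyAvroDeserializer {
  def toJsonString(payload: Array[Byte]): String = ???
}

val from_avro = udf { (payload: Array[Byte]) => MyAvroDeserializer.toJsonString(payload) }
val cleanDataset = df
  .select($"value") // the Kafka value column is binary
  .withColumn("no_avro_anymore", from_avro($"value"))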
Alternatively (a slightly more advanced and somewhat convoluted approach), write your own custom streaming Source for Avro-formatted data in Kafka and use it instead.
Dataset<Row> df = sparkSession.readStream()
.format("avro-kafka") // <-- HERE YOUR CUSTOM Source
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic1")
.load();
I have yet to find out how doable an avro-kafka format is. It is doable, but it does two things at once, i.e. reading from Kafka and doing the Avro conversion, and I am not convinced that's the way to do things in Spark Structured Streaming or in software engineering in general. I wish there were a way to apply one format after another, but that's not possible in Spark 2.2.1 (and is not planned for 2.3 either).
I think then that a UDF is the best solution for the time being.
Just a thought, you could also write a custom Kafka Deserializer that would do the deserialization while Spark loads messages.

Spark Structured Streaming - compare two streams

I am using Kafka and Spark 2.1 Structured Streaming. I have two topics with data in JSON format, e.g.:
topic 1:
{"id":"1","name":"tom"}
{"id":"2","name":"mark"}
topic 2:
{"name":"tom","age":"25"}
{"name":"mark","age:"35"}
I need to compare those two streams in Spark based on the tag name, and when the values are equal, execute some additional definition/function.
How to use Spark Structured Streaming to do this ?
Thanks
Following the current documentation (Spark 2.1.1)
Any kind of joins between two streaming Datasets are not yet
supported.
ref: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations
At this moment, I think you need to rely on Spark Streaming (DStreams), as proposed by @igodfried's answer.
I hope you got your solution. If not, you can try creating two KStreams from the two topics, joining those KStreams, and writing the joined data back to one topic. Then you can read the joined data as one DataFrame using Spark Structured Streaming and apply any transformations you want to it. Since Structured Streaming doesn't support joining two streaming DataFrames, you can follow this approach to get the task done.
I faced a similar requirement some time ago: I had 2 streams which had to be "joined" together based on some criteria. What I used was a function called mapGroupsWithState.
What this function does (in a few words; more details in the reference below) is take a stream in the form of (K,V) pairs and accumulate its elements into a common state, based on the key of each pair. Then you have ways to tell Spark when the state is complete (according to your application), or even set a timeout for incomplete states.
Example based on your question:
Read Kafka topics into a Spark Stream:
val rawDataStream: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", "topic1,topic2") // Both topics on same stream!
.option("startingOffsets", "latest")
.option("failOnDataLoss", "true")
.load()
.selectExpr("CAST(value AS STRING) as jsonData") // Kafka sends bytes
Do some operations on your data (I prefer SQL, but you can use the DataFrame API) to transform each element into a key-value pair:
spark.sqlContext.udf.register("getKey", getKey) // You define this function; I'm assuming you will be using the name as key in your example.
val keyPairsStream = rawDataStream
.sql("getKey(jsonData) as ID, jsonData from rawData")
.groupBy($"ID")
Use the mapGroupsWithState function (I will show you the basic idea; you will have to define the myGrpFunct according to your needs):
keyPairsStream
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(myGrpFunct)
That's it! If you implement myGrpFunct correctly (a sketch follows below), you will have one stream of merged data, which you can further transform, like the following:
["tom",{"id":"1","name":"tom"},{"name":"tom","age":"25"}]
["mark",{"id":"2","name":"mark"},{"name":"mark","age":"35"}]
Hope this helps!
An excellent explanation with some code snippets: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming/
One method would be to transform both streams into (K,V) format; in your case this would probably take the form of (name, otherJSONData). See the Spark documentation for more information on joining streams and an example located here. Then join both streams and perform whatever function you need on the newly joined stream. If needed you can use map to turn (K,(W,V)) into (K,V).
