Spark Structured Streaming - compare two streams - apache-spark

I am using Kafka and Spark 2.1 Structured Streaming. I have two topics with data in JSON format, e.g.:
topic 1:
{"id":"1","name":"tom"}
{"id":"2","name":"mark"}
topic 2:
{"name":"tom","age":"25"}
{"name":"mark","age:"35"}
I need to compare those two streams in Spark based on the name field, and when the values are equal, execute some additional definition/function.
How can I use Spark Structured Streaming to do this?
Thanks

Following the current documentation (Spark 2.1.1)
Any kind of joins between two streaming Datasets are not yet
supported.
ref: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations
At the moment, I think you need to rely on Spark Streaming as proposed in @igodfried's answer.

I hope you got your solution. If not, you can try creating two KStreams from the two topics, joining those KStreams, and writing the joined data back to a single topic. You can then read the joined data as one DataFrame using Spark Structured Streaming and apply any transformations you want to it. Since Structured Streaming doesn't support joining two streaming DataFrames, you can follow this approach to get the task done.
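For illustration, here is a rough sketch of that idea using the Kafka Streams API from Scala; the application id, broker address, output topic, join window and the extractName JSON helper are all assumptions for the sketch, not part of the question:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{JoinWindows, KeyValueMapper, KStream, ValueJoiner}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-two-topics")   // assumed app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

val builder = new StreamsBuilder()
val topic1: KStream[String, String] = builder.stream[String, String]("topic1")
val topic2: KStream[String, String] = builder.stream[String, String]("topic2")

// Re-key both streams by the "name" field; extractName is a JSON helper you would supply.
val keyByName = new KeyValueMapper[String, String, String] {
  override def apply(key: String, json: String): String = extractName(json)
}
val byName1 = topic1.selectKey(keyByName)
val byName2 = topic2.selectKey(keyByName)

// Windowed KStream-KStream join on the name; concatenate the two JSON payloads.
val joiner = new ValueJoiner[String, String, String] {
  override def apply(left: String, right: String): String = s"[$left,$right]"
}
val joined = byName1.join(byName2, joiner, JoinWindows.of(Duration.ofMinutes(5)))

// Write the joined records back to a single topic for Structured Streaming to consume.
joined.to("joined-topic")

new KafkaStreams(builder.build(), props).start()

Structured Streaming can then read "joined-topic" like any other Kafka topic.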

I faced a similar requirement some time ago: I had 2 streams which had to be "joined" together based on some criteria. What I used was a function called mapGroupsWithState.
What this function does (in a few words; more details in the reference below) is take a stream in the form of (K,V) and accumulate its elements in a common state, based on the key of each pair. Then you have ways to tell Spark when the state is complete (according to your application), or even set a timeout for incomplete states.
Example based on your question:
Read Kafka topics into a Spark Stream:
val rawDataStream: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers)
.option("subscribe", "topic1,topic2") // Both topics on same stream!
.option("startingOffsets", "latest")
.option("failOnDataLoss", "true")
.load()
.selectExpr("CAST(value AS STRING) as jsonData") // Kafka sends bytes
Do some operations on your data (I prefer SQL, but you can use the DataFrame API) to transform each element into a key-value pair:
spark.sqlContext.udf.register("getKey", getKey) // You define this function; I'm assuming you will be using the name as the key in your example.
rawDataStream.createOrReplaceTempView("rawData")
val keyPairsStream = spark
.sql("SELECT getKey(jsonData) AS ID, jsonData FROM rawData")
.as[(String, String)] // requires import spark.implicits._
.groupByKey(_._1)     // mapGroupsWithState needs a KeyValueGroupedDataset, i.e. groupByKey rather than groupBy
Use the mapGroupsWithState function (I will show you the basic idea; you will have to define the myGrpFunct according to your needs):
keyPairsStream
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(myGrpFunct)
That's it! If you implement myGrpFunct correctly, you will have one stream of merged data, which you can further transform, like the following:
["tom",{"id":"1","name":"tom"},{"name":"tom","age":"25"}]
["mark",{"id":"2","name":"mark"},{"name":"mark","age:"35"}]
Hope this helps!
An excellent explanation with some code snippets: http://asyncified.io/2017/07/30/exploring-stateful-streaming-with-spark-structured-streaming/

One method would be to transform both streams into (K,V) format. In your case this would probably take the form of (name, otherJSONData). See the Spark documentation for more information on joining streams and an example located here. Then do a join on both streams and perform whatever function you need on the newly joined stream. If needed you can use map to go from (K,(W,V)) back to (K,V).
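For illustration, a rough sketch of that approach with the DStream API; jsonStream1 and jsonStream2 stand for the two value streams read from Kafka, and parseName is a hypothetical helper that extracts the name field from a JSON string:

// jsonStream1 and jsonStream2: DStream[String] of raw JSON values from topic 1 and topic 2
// (e.g. created via KafkaUtils.createDirectStream followed by .map(_.value)).

// Key both streams by name.
val keyed1 = jsonStream1.map(json => (parseName(json), json)) // (name, topic-1 record)
val keyed2 = jsonStream2.map(json => (parseName(json), json)) // (name, topic-2 record)

// Join on the name key -> DStream[(String, (String, String))]
val joined = keyed1.join(keyed2)

// Run the additional function on every matched pair.
joined.foreachRDD { rdd =>
  rdd.foreach { case (name, (record1, record2)) =>
    // names match here: call your custom logic
  }
}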

Related

spark.readStream vs Kafkautils.createDirectStream

I was wondering if anyone knew what the difference between the two syntaxes is? I know both are used to read data from Kafka, but what differentiates them?
spark.readStream.format("kafka")
KafkaUtils.createDirectStream(__)
They are part of different dependencies, for one.
The first is for Structured Streaming; it returns DataFrames and is considered the preferred API for Spark.
The second is for RDD-based Spark Streaming operations, where the data might not have any consistent structure, or where you want more direct access to the lower-level ConsumerRecord objects from Kafka.
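A rough side-by-side sketch of what each call gives you (the broker address, topic, group id and the ssc StreamingContext are placeholders):

// Structured Streaming: a streaming DataFrame with a fixed schema
// (key, value, topic, partition, offset, timestamp, ...).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .load()

// DStream API: a DStream[ConsumerRecord[String, String]], giving direct access
// to each ConsumerRecord (topic, partition, offset, key, value, ...).
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topic1"), kafkaParams))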

Spark Structured Streaming with Schema Registry integration for Avro based messages

We have a use case where we are trying to consume multiple Kafka topics (Avro messages), integrating with Schema Registry.
We are using Spark Structured Streaming (Spark version: 2.4.4) and Confluent Kafka (library version: 5.4.1) for the same:
val kafkaDF: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "<bootstrap.server>:9092")
.option("subscribe", [List of Topics]) // multi topic subscribe
.load()
We select the value from the above DataFrame and use the schema to de-serialize the Avro message:
val tableDF = kafkaDF.select(from_avro(col("value"), jsonSchema).as("table")).select("*")
The roadblock here is that, since we are using multiple Kafka topics, we have integrated all our JSON schemas into a Map, with the key being the topic name and the value being the respective schema.
How can we do a lookup using that map in the select above? We tried a UDF, but it returns a Column type, while jsonSchema has to be of String type. Also, the schema is different for different topics.
A couple of added questions:
Is the above approach correct for consuming multiple topics at once?
Should we use a single consumer per topic?
If we have a larger number of topics, how can we achieve parallel topic processing, since sequential processing might take a substantial amount of time?
Without checking it all, you appear to be OK on the basics with from_avro, from_json, etc.
https://sparkbyexamples.com/spark/spark-streaming-consume-and-produce-kafka-messages-in-avro-format/ can guide you on the first part. This is also very good: https://www.waitingforcode.com/apache-spark-structured-streaming/two-topics-two-schemas-one-subscription-apache-spark-structured-streaming/read#filter_topic.
I would do table.*
For multiple topics with multiple schemas: read the versions from .avsc files or code them yourself.
For consuming multiple topics in a Spark Streaming app, the question is how many per app. There is no hard rule except obvious ones, like large vs. small consumption and whether ordering is important. Executor resources can be relinquished.
Then you need to process all the topics separately, imho, by filtering on topic using the foreachBatch paradigm; you can fill in the details as I am a little rushed.
Not sure how you are writing the data out at rest, but the question is not about that.
Similar to this, then to process multiple Topics:
...
... // Need to get Topic first
stream.toDS()
.select($"topic", $"value")
.writeStream
.foreachBatch((dataset, _) => {
dataset.persist() // need this
dataset.filter($"topic" === topicA)
.select(from_avro(col("value"), jsonSchemaTA)
.as("tA"))
.select("tA.*")
.write.format(...).save(...)
dataset.filter($"topic" === topicB)
.select(from_avro(col("value"), jsonSchemaTB)
.as("tB"))
.select("tB.*")
.write.format(...).save(...)
...
...
dataset.unpersist()
...
})
.start().awaitTermination()
but blend with this excellent answer: Integrating Spark Structured Streaming with the Confluent Schema Registry

How to identify the origin of messages in spark structured streaming with kafka as a source?

I have a use case in which I have to subscribe to multiple topics in Kafka in Spark Structured Streaming, then parse each message and form a Delta Lake table out of it. I have built the parser, and the messages (in the form of XML) are parsing correctly and forming the Delta Lake table. However, I am only subscribing to one topic as of now. I want to subscribe to multiple topics and, based on the topic, each message should go to the parser dedicatedly made for that particular topic. So basically I want to identify the topic name for every message as it is processed, so that I can send it to the desired parser and process it further.
This is how I am accessing the messages from different topics. However, I have no idea how to identify the source of the incoming messages while processing them.
val stream_dataframe = spark.readStream
.format(ConfigSetting.getString("source"))
.option("kafka.bootstrap.servers", ConfigSetting.getString("bootstrap_servers"))
.option("kafka.ssl.truststore.location", ConfigSetting.getString("trustfile_location"))
.option("kafka.ssl.truststore.password", ConfigSetting.getString("truststore_password"))
.option("kafka.sasl.mechanism", ConfigSetting.getString("sasl_mechanism"))
.option("kafka.security.protocol", ConfigSetting.getString("kafka_security_protocol"))
.option("kafka.sasl.jaas.config",ConfigSetting.getString("jass_config"))
.option("encoding",ConfigSetting.getString("encoding"))
.option("startingOffsets",ConfigSetting.getString("starting_offset_duration"))
.option("subscribe",ConfigSetting.getString("topics_name"))
.option("failOnDataLoss",ConfigSetting.getString("fail_on_dataloss"))
.load()
var cast_dataframe = stream_dataframe.select(col("value").cast(StringType))
cast_dataframe = cast_dataframe.withColumn("parsed_column",parser(col("value"))) // Parser is the udf, made to parse the xml from the topic.
How can I identify the topic name of the messages as they are processed in Spark Structured Streaming?
As per the official documentation (emphasis mine)
Each row in the source has the following schema:
Column      Type
key         binary
value       binary
topic       string
partition   int
...
As you can see, the input topic is part of the output schema and can be accessed without any special actions.
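For example (the topic names and the per-topic parserA / parserB UDFs below are hypothetical), you can keep the topic column and branch on it:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// Keep the topic alongside the decoded value.
val withTopic = stream_dataframe.select(
  col("topic"),
  col("value").cast(StringType).as("value"))

// Route each message to the parser dedicated to its topic.
val parsedA = withTopic
  .filter(col("topic") === "topicA")
  .withColumn("parsed_column", parserA(col("value")))

val parsedB = withTopic
  .filter(col("topic") === "topicB")
  .withColumn("parsed_column", parserB(col("value")))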

Spark structured streaming kafka convert JSON without schema (infer schema)

I read that Spark Structured Streaming doesn't support schema inference for reading Kafka messages as JSON. Is there a way to retrieve the schema the same way Spark Streaming does:
val dataFrame = spark.read.json(rdd.map(_.value()))
dataFrame.printSchema
Here is one possible way to do this:
Before you start streaming, get a small batch of the data from Kafka
Infer the schema from the small batch
Start streaming the data using the extracted schema.
The pseudo-code below illustrates this approach.
Step 1:
Extract a small (two-record) batch from Kafka:
val smallBatch = spark.read.format("kafka")
.option("kafka.bootstrap.servers", "node:9092")
.option("subscribe", "topicName")
.option("startingOffsets", "earliest")
.option("endingOffsets", """{"topicName":{"0":2}}""")
.load()
.selectExpr("CAST(value AS STRING) as STRING").as[String].toDF()
Step 2:
Write the small batch to a file:
smallBatch.write.mode("overwrite").format("text").save("/batch")
This command writes the small batch into hdfs directory /batch. The name of the file that it creates is part-xyz*. So you first need to rename the file using hadoop FileSystem commands (see org.apache.hadoop.fs._ and org.apache.hadoop.conf.Configuration, here's an example https://stackoverflow.com/a/41990859) and then read the file as json:
val smallBatchSchema = spark.read.json("/batch/batchName.txt").schema
Here, batchName.txt is the new name of the file and smallBatchSchema contains the schema inferred from the small batch.
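For reference, the rename step itself might look roughly like this (the paths are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Locate the part-* file Spark wrote and give it a predictable name.
val partFile = fs.globStatus(new Path("/batch/part-*"))(0).getPath
fs.rename(partFile, new Path("/batch/batchName.txt"))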
Finally, you can stream the data as follows (Step 3):
val inputDf = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "node:9092")
.option("subscribe", "topicName")
.option("startingOffsets", "earliest")
.load()
val dataDf = inputDf.selectExpr("CAST(value AS STRING) as json")
.select( from_json($"json", schema=smallBatchSchema).as("data"))
.select("data.*")
Hope this helps!
It is possible using this construct:
myStream = spark.readStream.schema(spark.read.json("my_sample_json_file_as_schema.json").schema).json("my_json_file")..
How can this be? Well, since spark.read.json("..").schema returns exactly the inferred schema you want, you can use this returned schema as the argument for the mandatory schema parameter of spark.readStream.
What I did was specify a one-line sample JSON file as input for inferring the schema, so it does not unnecessarily take up memory. In case your data changes, simply update your sample JSON file.
It took me a while to figure out (constructing StructTypes and StructFields by hand was a pain in the ..), so I'll be happy for all upvotes :-)
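Spelled out a bit, using the same file names as in the one-liner above, the idea is:

// Infer the schema once from a small representative sample file ...
val sampleSchema = spark.read.json("my_sample_json_file_as_schema.json").schema

// ... then reuse it for the streaming read, which requires a schema up front.
val myStream = spark.readStream
  .schema(sampleSchema)
  .json("my_json_file")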
It is not possible. Structured Streaming supports limited schema inference in development, with spark.sql.streaming.schemaInference set to true:
By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting spark.sql.streaming.schemaInference to true.
but it cannot be used to extract JSON from Kafka messages and DataFrameReader.json doesn't support streaming Datasets as arguments.
You have to provide the schema manually: How to read records in JSON format from Kafka using Structured Streaming?
It is possible to convert JSON to a DataFrame without having to manually type the schema, if that is what you meant to ask.
Recently I ran into a situation where I was receiving massively long nested JSON packets via Kafka, and manually typing the schema would have been both cumbersome and error-prone.
With a small sample of the data and some trickery you can provide the schema to Spark2+ as follows:
val jsonstr = """ copy paste a representative sample of data here"""
val jsondf = spark.read.json(Seq(jsonstr).toDS) //jsondf.schema has the nested json structure we need
val event = spark.readStream.format..option...load() //configure your source
val eventWithSchema = event.select($"value" cast "string" as "json").select(from_json($"json", jsondf.schema) as "data").select("data.*")
Now you can do whatever you want with this val as you would with Direct Streaming. Create temp view, run SQL queries, whatever..
Taking Arnon's solution to the next step (since it's deprecated in Spark's newer versions and would require iterating over the whole dataframe just for a type cast):
spark.read.json(df.as[String])
Anyway, as of now, it's still experimental.

Saving values from spark to Cassandra

I need to store the values from kafka->spark streaming->cassandra.
Now, I am receiving the values from kafka->spark, and I have a spark job to save the values into the Cassandra db. However, I'm facing a problem with the DStream datatype.
In the following snippet you can see how I'm trying to convert the DStream into a Python-friendly list object so that I can work with it, but it gives an error.
input at kafka producer:
Byrne 24 San Diego robbyrne#email.com Rob
spark-job:
map1={'spark-kafka':1}
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)
lines = kafkaStream.map(lambda x: x[1])
words = lines.flatMap(lambda line: line.split(" "))
words.pprint() # outputs-> Byrne 24 SanDiego robbyrne#email.com Rob
list=[word for word in words]
#gives an error -> TypeError: 'TransformedDStream' object is not iterable
This is how I'm saving values from spark->cassandra
rdd2=sc.parallelize([{
... "lastname":'Byrne',
... "age":24,
... "city":"SanDiego",
... "email":"robbyrne#email.com",
... "firstname":"Rob"}])
rdd2.saveToCassandra("keyspace2","users")
What's the best way of converting the DStream object to a dictionary or what's the best way of doing what I'm trying to do here?
I just need the values received from kafka (in the form of DStream) to be saved in Cassandra.
Thanks and any help would be nice!
Versions:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
Like everything 'sparky', I think a short explanation is due, since even if you are familiar with RDDs, DStreams are a higher-level concept:
A Discretized Stream (DStream), is a continuous sequence of RDDs of the same type, representing a continuous stream of data. In your case, DStreams are created from live Kafka data.
While a Spark Streaming program is running, each DStream periodically generates an RDD from live Kafka data.
Now, to iterate over received RDDs, you need to use DStream#foreachRDD (and as implied by its name, it serves a similar purpose as foreach, but this time, to iterate over RDDs).
Once you have an RDD, you can invoke rdd.collect() or rdd.take() or any other standard API for RDDs.
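In Scala (the next answer shows the connector version) the pattern looks roughly like this; kafkaStream stands for your DStream, and the same shape applies in PySpark via kafkaStream.foreachRDD(lambda rdd: ...):

// Each micro-batch arrives as an ordinary RDD inside foreachRDD.
kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // The standard RDD API is available here: collect(), take(), map(), or
    // rdd.saveToCassandra(...) once the connector implicits are imported.
    rdd.take(10).foreach(println)
  }
}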
Now, as a closing note, to make things even more fun, Spark introduced a new receiver-less “direct” approach to ensure stronger end-to-end guarantees.
(KafkaUtils.createDirectStream which requires Spark 1.3+)
Instead of using receivers to receive data, this approach periodically queries Kafka for the latest offsets in each topic+partition, and accordingly defines the offset ranges to process in each batch. When the jobs to process the data are launched, Kafka’s simple consumer API is used to read the defined ranges of offsets from Kafka.
(which is a nice way to say you will have to "mess" with the offsets yourself)
See Direct Streams Approach for further details.
See here for a scala code example
According to the official doc of the spark-cassandra connector: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
import com.datastax.spark.connector.streaming._
val ssc = new StreamingContext(conf, Seconds(n))
val stream = ...
val wc = stream
.map(...)
.filter(...)
.saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
ssc.start()
Actually, I found the answer in this tutorial http://katychuang.me/blog/2015-09-30-kafka_spark.html.
