Performance of Spark structured streaming Stream-Stream join - apache-spark

I am trying out the stream-stream join feature of Spark Structured Streaming using Spark 2.4.0.
I am joining two simple sets of data just to observe the performance of the stream-stream join. I am currently running this on my local machine with only a few input records. I observe that it takes more than a couple of minutes to join data from the two streams and write the output to Kafka.
Here is what I have been trying:
val in1Df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
.option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic1"))
.load()
.select($"timestamp" as "timestamp1",$"value" cast "string" as "value1")
val in2Df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
.option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic2"))
.load()
.select($"timestamp" as "timestamp2", $"value" cast "string" as "value2")
val in1DfWithWatermark = in1Df
.select($"timestamp1",$"value1")
.withWatermark("timestamp1", "10 seconds")
val in2DfWithWatermark = in2Df
.select($"timestamp2",$"value2")
.withWatermark("timestamp2", "20 seconds")
val joinedDf = in1DfWithWatermark.join(in2DfWithWatermark,
expr(("""value1 = value2 AND
timestamp2 >= timestamp1 AND
timestamp2 <= timestamp1 + interval 1 minutes""")))
joinedDf.select(($"value1").alias("value"))
.writeStream
.format("kafka")
.option("topic", config.getString("SparkStrucStreamingPoc.outTopic"))
.option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
.option("checkpointLocation", config.getString("SparkStrucStreamingPoc.checkpoint"))
.start()
.awaitTermination()
Has anyone else observed this kind of behavior? Does it usually take this long to join two streams?

Related

How to guarantee sequence of execution of multiple sinks in spark structured streaming

In my scenario, I have a Structured Streaming application that reads from Kafka and writes to HDFS and Kafka using 3 different sinks. The HDFS sink is the primary one and the others are secondary. I want the primary sink to run first and then the secondary sinks. All have a trigger time of 60 seconds. Is there a way to achieve that in Spark Structured Streaming? Adding the code snippet:
val spark = SparkSession
.builder
.master(StreamerConfig.sparkMaster)
.appName(StreamerConfig.sparkAppName)
.getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown","true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles","true")
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("spark.shuffle.service.enabled","true")
val readData = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers",StreamerConfig.kafkaBootstrapServer)
.option("subscribe",StreamerConfig.topicName)
.option("failOnDataLoss", false)
.option("startingOffsets",StreamerConfig.kafkaStartingOffset)
.option("maxOffsetsPerTrigger",StreamerConfig.maxOffsetsPerTrigger)
.load()
val deserializedRecords = StreamerUtils.deserializeAndMapData(readData,spark)
val streamingQuery = deserializedRecords.writeStream
.queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
.outputMode("append")
.format("orc")
.option("path",StreamerConfig.hdfsLandingPath)
.option("checkpointLocation",StreamerConfig.checkpointLocation)
.partitionBy("date","hour")
.option("truncate","false")
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase1Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase2Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
PS: I am using spark 2.3.2

Send data to Kafka topics based on a condition in Dataframe

I want to choose the destination Kafka topic depending on the value of the data in Spark Streaming.
Is it possible to do this?
When I tried the following code, only the first query ran; the second one was never executed.
(testdf
.filter(f.col("value") == "A")
.selectExpr("CAST(value as STRING) as value")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_1")
.option("kafka.bootstrap.servers","~~:9092")
.option("topic", "test")
.option("startingOffsets", "latest")
.start()
)
(testdf
.filter(f.col("value") == "B")
.selectExpr("CAST(value as STRING) as value")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_2")
.option("kafka.bootstrap.servers","~~:9092")
.option("topic", "testB")
.option("startingOffsets", "latest")
.start()
)
With this code, data is written only to the topic named test.
Can anyone think of a way to do this?
I want to choose the destination topic for a data frame like this:
|type|value|
| A |testvalue|
| B |testvalue|
Type A should go to topic test.
Type B should go to topic testB.
With the latest versions of Spark, you could just create a column topic in your dataframe which is used to direct the record into the corresponding topic.
In your case it would mean you can do something like
# assumes the alias used in the question: from pyspark.sql import functions as f
(
testdf
.withColumn("topic", f.when(f.col("value") == "A", f.lit("test")).otherwise(f.lit("testB")))
.selectExpr("CAST(value as STRING) as value", "topic")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_1")
.option("kafka.bootstrap.servers","~~:9092")
.start()
)
thx mike.
I was able to achieve this by running the following code!
(
testdf
.withColumn("topic",f.when(f.col("testTime") == "A", f.lit("test")).otherwise(("testB")))
.selectExpr("CAST(value as STRING) as value", "topic")
.writeStream
.format("kafka")
.option("checkpointLocation", "/checkpoint_2")
.option("startingOffsets", "latest")
.option("kafka.bootstrap.servers","9092")
.start()
)

Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic
code:
1- Reading two topics
val PERSONINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xx:9092")
.option("subscribe", "PERSONINFORMATION")
.option("group.id", "info")
.option("maxOffsetsPerTrigger", 1000)
.option("startingOffsets", "earliest")
.load()
val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xxx:9092")
.option("subscribe", "CANDIDATEINFORMATION")
.option("group.id", "candent")
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.option("failOnDataLoss", "false")
.load()
2- Parse data to join them:
val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s")).select("s.*")
val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s")).select("s.*")
val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
3- Join two frames
val joined_df : DataFrame = df_candidate.join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"),"inner")
val string2json: DataFrame = joined_df.select($"dfcandidate.ID".as("key"),to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))
4- Write them to a topic
string2json.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "xxxx:9092")
.option("topic", "toDelete")
.option("checkpointLocation", "checkpoints")
.option("failOnDataLoss", "false")
.start()
.awaitTermination()
Error message:
21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.
Your code looks fine to me; it is the checkpointing that is causing the issue.
Based on the error message, you probably ran this job with only one stream source at first. Then you added the code for the stream join and tried to restart the application without removing the existing checkpoint files. Now the application tries to recover from the checkpoint files but finds that the checkpoint was written for one source while the query requests two.
The section Recovery Semantics after Changes in a Streaming Query explains which changes are allowed and not allowed when using checkpointing. Changing the number of input sources is not allowed:
"Changes in the number or type (i.e. different source) of input sources: This is not allowed."
To solve your problem: Delete the current checkpoint files and re-start the job.
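As a minimal sketch of that clean-up step, assuming the checkpoint directory lives on the local filesystem (the question uses the relative path "checkpoints"), the Python helper below removes it before the job is restarted. The name reset_checkpoint is hypothetical; on HDFS or object storage you would use that system's own delete tooling instead. Keep in mind that deleting the checkpoint also discards the stored offsets, so the restarted query begins again from startingOffsets.

```python
import shutil
import tempfile
from pathlib import Path


def reset_checkpoint(checkpoint_dir: str) -> bool:
    """Delete a local checkpoint directory; return True if something was removed."""
    path = Path(checkpoint_dir)
    if not path.is_dir():
        return False  # nothing to clean up
    shutil.rmtree(path)  # removes the whole directory tree, including offsets/
    return True


# Demo on a throwaway directory standing in for the real checkpoint location.
demo_dir = Path(tempfile.mkdtemp()) / "checkpoints"
(demo_dir / "offsets").mkdir(parents=True)
removed = reset_checkpoint(str(demo_dir))
print(removed, demo_dir.exists())  # True False
```

An alternative that keeps the old state around for inspection is to point checkpointLocation at a fresh, empty directory instead of deleting the existing one.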

Read Kafka topic tail in Spark

I need to subscribe to Kafka topic latest offset, read some newest records, print them and finish. How can I do this in Spark? I suppose I could do something like this
sqlContext
.read
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
.option("subscribe", "myTopic")
.option("startingOffsets", "latest")
.filter($"someField" === "someValue")
.take(10)
.show
You need to know in advance which offset range you want to consume from each partition. If you have that information, you can do something like:
// Subscribe to one topic, specifying explicit Kafka offsets per partition
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
.option("subscribe", "myTopic")
.option("startingOffsets", """{"myTopic":{"0":20,"1":20}}""")
.option("endingOffsets", """{"myTopic":{"0":25,"1":25}}""")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
.filter(...)
More details on startingOffsets and endingOffsets are given in the Kafka + Spark Integration Guide.
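If the end offset of each partition is known (for example, fetched beforehand with a Kafka client, which is outside this sketch), the two JSON option values can be derived rather than written by hand. The following is a minimal Python sketch using only the standard library; tail_offsets is a hypothetical helper, and it assumes the computed starting offsets are still retained by the broker.

```python
import json


def tail_offsets(topic, end_offsets, n):
    """Build (startingOffsets, endingOffsets) JSON strings that cover
    the last n records of every partition.

    end_offsets maps partition number -> end offset (exclusive).
    """
    starting = {str(p): max(off - n, 0) for p, off in end_offsets.items()}
    ending = {str(p): off for p, off in end_offsets.items()}
    return json.dumps({topic: starting}), json.dumps({topic: ending})


starting, ending = tail_offsets("myTopic", {0: 25, 1: 25}, 5)
print(starting)  # {"myTopic": {"0": 20, "1": 20}}
print(ending)    # {"myTopic": {"0": 25, "1": 25}}
```

The two strings have the same shape as the startingOffsets and endingOffsets values in the answer above and can be passed to those options unchanged.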

Mixing Spark Structured Streaming API and DStream to write to Kafka

I've recently noticed that I am confused about Spark Streaming (I'm currently learning Spark).
I am reading data from a Kafka topic like this:
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
Which creates a DStream.
In order to work with event-time (and not processing-time) I did this:
outputStream
.foreachRDD(rdd => {
rdd.toDF().withWatermark("timestamp", "60 seconds")
.groupBy(
window($"timestamp", "60 seconds", "10 seconds")
)
.sum("meterIncrement")
.toJSON
.toDF("value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "taxi-dollar-accurate")
.start()
})
And I get the error
'writeStream' can be called only on streaming Dataset/DataFrame
Which surprised me, because the source of the DF is a DStream. Anyway, I managed to solve this by changing .writeStream to .write and .start() to .save().
But I got the feeling that I lost the streaming power on that foreach somehow. Clearly that's why I am writing this question. Is this a correct approach? I've seen other scripts that use
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
But I don't know how this differs from just calling foreach on the DStream and then transforming each RDD to a DF.
When you are calling:
outputStream
.foreachRDD(rdd => {
rdd.toDF()
.[...]
.toJSON
.toDF("value")
.writeStream
.format("kafka")
the variable rdd (and the DataFrame derived from it) is a single RDD, which is no longer a stream. Hence, rdd.toDF.[...].writeStream will not work.
Continue with RDD
If you choose the DStream approach, you can send those single RDDs to Kafka by calling the KafkaProducer API.
An example:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val producer = new KafkaProducer[String, String](kafkaParameters)
partitionOfRecords.foreach { message =>
producer.send(message)
}
producer.close()
}
}
However, this is not the recommended approach, because it creates and closes a KafkaProducer for every partition in each batch interval. It should still give you a basic understanding of how to write data to Kafka using the DirectStream API.
To further optimize sending your data to Kafka you can follow the guidance given here.
Continue with Dataframe
Alternatively, you can transform your RDD into a DataFrame, but then make sure to call the batch-oriented API to write the data into Kafka:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()
All the details on how to write a batch DataFrame into Kafka are given in the Spark Structured Streaming + Kafka Integration Guide.
Note
Most importantly, I highly recommend not mixing the RDD and Structured APIs in a case like this; stick to one or the other.

Resources