How to guarantee sequence of execution of multiple sinks in spark structured streaming - apache-spark

In my scenario, I have a structured streaming application which reads from Kafka and writes to HDFS and Kafka using 3 different sinks. The primary sink is the HDFS one and the others are secondary. I want the primary sink to run first and then the secondary sinks. All have a trigger time of 60 seconds. Is there a way to achieve that in Spark Structured Streaming? Adding the code snippet:
val spark = SparkSession
  .builder
  .master(StreamerConfig.sparkMaster)
  .appName(StreamerConfig.sparkAppName)
  .getOrCreate()

spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.shuffle.service.enabled", "true")

val readData = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", StreamerConfig.kafkaBootstrapServer)
  .option("subscribe", StreamerConfig.topicName)
  .option("failOnDataLoss", false)
  .option("startingOffsets", StreamerConfig.kafkaStartingOffset)
  .option("maxOffsetsPerTrigger", StreamerConfig.maxOffsetsPerTrigger)
  .load()

val deserializedRecords = StreamerUtils.deserializeAndMapData(readData, spark)

// Primary sink: persist to HDFS as ORC
val streamingQuery = deserializedRecords.writeStream
  .queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
  .outputMode("append")
  .format("orc")
  .option("path", StreamerConfig.hdfsLandingPath)
  .option("checkpointLocation", StreamerConfig.checkpointLocation)
  .partitionBy("date", "hour")
  .option("truncate", "false")
  .trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
  .start()

// Secondary sink 1: write the records as JSON to Kafka
deserializedRecords.select(to_json(struct("*")).alias("value"))
  .writeStream
  .format("kafka") // Local Testing - "console"
  .option("topic", StreamerConfig.watermarkKafkaTopic)
  .option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
  .option("checkpointLocation", StreamerConfig.phase1Checkpoints)
  .trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
  .start()

// Secondary sink 2: write the records as JSON to Kafka
deserializedRecords.select(to_json(struct("*")).alias("value"))
  .writeStream
  .format("kafka") // Local Testing - "console"
  .option("topic", StreamerConfig.watermarkKafkaTopic)
  .option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
  .option("checkpointLocation", StreamerConfig.phase2Checkpoints)
  .trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
  .start()
PS: I am using Spark 2.3.2.
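One approach that is often suggested for this kind of ordering requirement is to drive all writes from a single streaming query with foreachBatch (available from Spark 2.4, so it would require an upgrade from 2.3.2): inside the batch function the sinks are plain batch writes and therefore run sequentially in the order they are called. A minimal sketch, reusing the names from the snippet above; this is not the only way to do it:

// Sketch only: foreachBatch needs Spark 2.4+; all StreamerConfig values are the ones used above.
deserializedRecords.writeStream
  .queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
  .option("checkpointLocation", StreamerConfig.checkpointLocation)
  .trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist() // the batch is reused by several sinks
    // 1. Primary sink: HDFS, runs first
    batchDF.write
      .mode("append")
      .format("orc")
      .partitionBy("date", "hour")
      .save(StreamerConfig.hdfsLandingPath)
    // 2. Secondary sink: Kafka, runs only after the HDFS write has finished
    //    (a second Kafka write would follow the same pattern)
    batchDF.select(to_json(struct("*")).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
      .option("topic", StreamerConfig.watermarkKafkaTopic)
      .save()
    batchDF.unpersist()
  }
  .start()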

Related

Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic.
Code:
1- Reading two topics
val PERSONINFORMATION_df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xx:9092")
  .option("subscribe", "PERSONINFORMATION")
  .option("group.id", "info")
  .option("maxOffsetsPerTrigger", 1000)
  .option("startingOffsets", "earliest")
  .load()

val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxx:9092")
  .option("subscribe", "CANDIDATEINFORMATION")
  .option("group.id", "candent")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 1000)
  .option("failOnDataLoss", "false")
  .load()
2- Parse data to join them:
val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
  .select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s"))
  .select("s.*")

val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
  .select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s"))
  .select("s.*")

val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
3- Join two frames
val joined_df: DataFrame = df_candidate.join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"), "inner")

val string2json: DataFrame = joined_df.select(
  $"dfcandidate.ID".as("key"),
  to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))
4- Write them to a topic
string2json.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxxx:9092")
  .option("topic", "toDelete")
  .option("checkpointLocation", "checkpoints")
  .option("failOnDataLoss", "false")
  .start()
  .awaitTermination()
Error message:
21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.
Your code looks fine to me; it is rather the checkpointing that is causing the issue.
Based on the error message you are getting, you probably ran this job with only one stream source. Then you added the code for the stream join and tried to restart the application without removing the existing checkpoint files. Now the application tries to recover from the checkpoint files but realises that you initially had only one source and now you have two.
The section Recovery Semantics after Changes in a Streaming Query explains which changes are allowed and not allowed when using checkpointing. Changing the number of input sources is not allowed:
"Changes in the number or type (i.e. different source) of input sources: This is not allowed."
To solve your problem: delete the current checkpoint files and restart the job.
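If deleting the existing files is not an option, an equivalent fix is to restart the joined query against a fresh checkpoint directory. A minimal sketch reusing the write from above; the new directory name is just a placeholder:

string2json.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxxx:9092")
  .option("topic", "toDelete")
  // a new, empty checkpoint directory so the two-source query does not try to
  // recover from the old single-source checkpoint (directory name is hypothetical)
  .option("checkpointLocation", "checkpoints_joined")
  .option("failOnDataLoss", "false")
  .start()
  .awaitTermination()

Note that with a fresh checkpoint the query starts again from the configured startingOffsets, so the already-processed data may be reprocessed.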

Read Kafka topic tail in Spark

I need to subscribe to the latest offset of a Kafka topic, read some of the newest records, print them and finish. How can I do this in Spark? I suppose I could do something like this:
sqlContext
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
  .option("subscribe", "myTopic")
  .option("startingOffsets", "latest")
  .filter($"someField" === "someValue")
  .take(10)
  .show
You need to know in advance up to which offsets in which partitions you want to consume from Kafka. If you have that information, you can do something like this:
// Subscribe to a topic, specifying explicit Kafka offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.1.1:9092,...")
  .option("subscribe", "myTopic")
  .option("startingOffsets", """{"myTopic":{"0":20,"1":20}}""")
  .option("endingOffsets", """{"myTopic":{"0":25,"1":25}}""")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
  .filter(...)
More details on the startingOffsets and endingOffsets options are given in the Kafka + Spark Integration Guide.

Performance of Spark structured streaming Stream-Stream join

I am trying out the stream-stream join feature of Spark Structured Streaming using Spark 2.4.0.
I am joining two simple sets of data just to observe the performance of a stream-stream join. I am currently running this on my local machine with just a few input records. I observe that it takes more than a couple of minutes to join data from the two streams and write the output to Kafka.
Here is what I have been trying:
val in1Df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic1"))
  .load()
  .select($"timestamp" as "timestamp1", $"value" cast "string" as "value1")

val in2Df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic2"))
  .load()
  .select($"timestamp" as "timestamp2", $"value" cast "string" as "value2")

val in1DfWithWatermark = in1Df
  .select($"timestamp1", $"value1")
  .withWatermark("timestamp1", "10 seconds")

val in2DfWithWatermark = in2Df
  .select($"timestamp2", $"value2")
  .withWatermark("timestamp2", "20 seconds")

val joinedDf: DataFrame = in1DfWithWatermark.join(in2DfWithWatermark,
  expr("""value1 = value2 AND
          timestamp2 >= timestamp1 AND
          timestamp2 <= timestamp1 + interval 1 minutes"""))

joinedDf.select($"value1".alias("value"))
  .writeStream
  .format("kafka")
  .option("topic", config.getString("SparkStrucStreamingPoc.outTopic"))
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("checkpointLocation", config.getString("SparkStrucStreamingPoc.checkpoint"))
  .start()
  .awaitTermination()
Has anyone else observed this kind of behavior? Does it usually take this long to join two streams?
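One setting that is often worth checking when a stateful join feels slow on a local machine is spark.sql.shuffle.partitions: a stream-stream join shuffles both inputs into the state store, and with the default of 200 partitions every micro-batch runs 200 small tasks even for a handful of records. This is only a hedged suggestion, not a confirmed diagnosis; a minimal sketch of lowering it (the value 8 is arbitrary):

// Must be set before the streaming query is started for the first time,
// because the number of state store partitions is fixed in the checkpoint.
spark.conf.set("spark.sql.shuffle.partitions", "8")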

Mixing Spark Structured Streaming API and DStream to write to Kafka

I've recently noticed some confusion on my part regarding Spark Streaming (I'm currently learning Spark).
I am reading data from a Kafka topic like this:
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
Which creates a DStream.
In order to work with event-time (and not processing-time) I did this:
outputStream
  .foreachRDD(rdd => {
    rdd.toDF().withWatermark("timestamp", "60 seconds")
      .groupBy(
        window($"timestamp", "60 seconds", "10 seconds")
      )
      .sum("meterIncrement")
      .toJSON
      .toDF("value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "taxi-dollar-accurate")
      .start()
  })
And I get the error
'writeStream' can be called only on streaming Dataset/DataFrame
Which surprised me, because the source of the DF is a DStream. Anyway, I managed to solve this by changing .writeStream to .write and .start() to .save().
But I got the feeling that I lost the streaming power on that foreach somehow. Clearly that's why I am writing this question. Is this a correct approach? I've seen other scripts that use
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
But I don't know how different this is from just calling foreach on the DStream and then transforming each RDD to a DataFrame.
When you are calling:
outputStream
  .foreachRDD(rdd => {
    rdd.toDF()
      .[...]
      .toJSON
      .toDF("value")
      .writeStream
      .format("kafka")
your variable rdd (or the DataFrame derived from it) is a single, batch RDD and no longer a stream. Hence, rdd.toDF.[...].writeStream will not work anymore.
Continue with RDD
If you choose to stay with the DStream approach, you can send those individual RDDs to Kafka using the KafkaProducer API.
An example:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // one producer per partition and per batch (see the note below on why this is costly)
    val producer = new KafkaProducer[String, String](kafkaParameters)
    partitionOfRecords.foreach { message =>
      producer.send(new ProducerRecord[String, String]("taxi-dollar-accurate", message))
    }
    producer.close()
  }
}
However, this is not the recommended approach, as you are creating and closing a new KafkaProducer for every partition in every batch interval on each executor. But this should give you a basic understanding of how to write data to Kafka when using the DirectStream API.
To further optimize sending your data to Kafka you can follow the guidance given here.
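A commonly used optimization is to create the producer lazily once per executor JVM and reuse it across partitions and batches. A minimal sketch, assuming kafkaProperties is a java.util.Properties holding the broker address and string serializers (the object name and the properties variable are placeholders, not part of the original answer):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Lazily creates one producer per executor JVM and reuses it across batches.
object KafkaSink {
  @volatile private var producer: KafkaProducer[String, String] = _

  def get(props: Properties): KafkaProducer[String, String] = synchronized {
    if (producer == null) {
      producer = new KafkaProducer[String, String](props)
      // close the producer when the executor JVM shuts down
      sys.addShutdownHook { producer.close() }
    }
    producer
  }
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val producer = KafkaSink.get(kafkaProperties) // reused, not recreated per partition
    partitionOfRecords.foreach { message =>
      producer.send(new ProducerRecord[String, String]("taxi-dollar-accurate", message))
    }
    // no producer.close() here; the shutdown hook takes care of it
  }
}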
Continue with Dataframe
However, you could also transform your RDD into a DataFrame, making sure to call the batch-oriented API to write the data into Kafka:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()
All the details on how to write a batch DataFrame into Kafka are given in the Spark Structured Streaming + Kafka Integration Guide.
Note
Still, and most importantly, I highly recommend not mixing the RDD and Structured APIs for a case like this, and rather sticking to one or the other.
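For reference, a minimal sketch of what the pipeline could look like when kept entirely in Structured Streaming, assuming the Kafka value can be parsed into a timestamp and a meterIncrement field (the schema, source topic and checkpoint path are placeholders; the broker, output topic and window settings are taken from the question):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Placeholder schema for the JSON payload in the source topic
val eventSchema = new StructType()
  .add("timestamp", TimestampType)
  .add("meterIncrement", DoubleType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "taxi-input") // placeholder source topic
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select("e.*")

events
  .withWatermark("timestamp", "60 seconds")
  .groupBy(window(col("timestamp"), "60 seconds", "10 seconds"))
  .sum("meterIncrement")
  .select(to_json(struct("*")).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "taxi-dollar-accurate")
  .option("checkpointLocation", "/tmp/checkpoints/taxi") // placeholder path
  .outputMode("update")
  .start()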

Spark Streaming: Text data source supports only a single column

I am consuming Kafka data and then streaming the data to HDFS.
The data stored in Kafka topic trial is like:
hadoop
hive
hive
kafka
hive
However, when I submit my code, it returns:
Exception in thread "main"
org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}
My question is: as shown above, the data stored in Kafka comprises only ONE column, so why does the program say there are 7 columns?
Any help is appreciated.
My Spark Streaming code:
def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder.master("local[4]")
    .appName("SpeedTester")
    .config("spark.driver.memory", "3g")
    .getOrCreate()

  val ds = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.95.20:9092")
    .option("subscribe", "trial")
    .option("startingOffsets", "earliest")
    .load()
    .writeStream
    .format("text")
    .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
    .awaitTermination()
}
That is explained in the Structured Streaming + Kafka Integration Guide:
Each row in the source has the following schema:
Column          Type
key             binary
value           binary
topic           string
partition       int
offset          long
timestamp       long
timestampType   int
That gives exactly seven columns. If you want to write only the payload (value), select it and cast it to a string:
spark.readStream
  ...
  .load()
  .selectExpr("CAST(value as string)")
  .writeStream
  ...
  .awaitTermination()
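Applied to the code above, a minimal sketch (reusing the same broker, topic and paths from the question) would be:

val ds = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.95.20:9092")
  .option("subscribe", "trial")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(value AS STRING)") // keep only the payload as a single string column
  .writeStream
  .format("text")
  .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
  .awaitTermination()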
