How to include both "latest" and "JSON with specific Offset" in "startingOffsets" while importing data from Kafka into Spark Structured Streaming - apache-spark

I have a streaming query saving data into a file sink. I am using .option("startingOffsets", "latest") together with a checkpoint location. If Spark goes down and the streaming query is started again, I do not want it to resume from where it left off. Instead, I would like to pass ("startingOffsets", """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """) so that processing starts from user-defined offsets.
I tried doing this with separate programs, but I need to achieve it in one single program.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

object OSB_offset_kafkaToSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.
      builder().
      appName("OSB_kafkaToSpark").
      config("spark.mongodb.output.uri", "spark.mongodb.output.uri=mongodb://somemongodb.com:27018").
      getOrCreate()
    println("SparkSession -> " + spark)
    import spark.implicits._

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "somekafkabroker:9092, somekafkabroker:9092")
      .option("subscribe", "someTopic")
      .option("startingOffsets", "latest")
      .option("startingOffsets", """{"someTopic":{"0":438521,"1":438705,"2":254180}}""")
      .option("endingOffsets", """{"someTopic":{"0":-1,"1":-1,"2":-1}}""")
      .option("failOnDataLoss", "false")
      .load()
    val dfs = df.selectExpr("CAST(value AS STRING)")

    val data = dfs.withColumn("splitted", split($"value", "/"))
      .select($"splitted".getItem(4).alias("region"), $"splitted".getItem(5).alias("service"), col("value"))
      .withColumn("service_type", regexp_extract($"service", """.*(Inbound|Outbound|Outound).*""", 1))
      .withColumn("region_type", concat(
        when(col("region").isNotNull, col("region")).otherwise(lit("null")), lit(" "),
        when(col("service").isNotNull, col("service_type")).otherwise(lit("null"))))
      .withColumn("datetime", regexp_extract($"value", """\d{4}-[01]\d-[0-3]\d [0-2]\d:[0-5]\d:[0-5]\d""", 0))

    val extractedDF = data.filter(
        col("region").isNotNull &&
        col("service").isNotNull &&
        col("value").isNotNull &&
        col("service_type").isNotNull &&
        col("region_type").isNotNull &&
        col("datetime").isNotNull)
      .filter("region != ''")
      .filter("service != ''")
      .filter("value != ''")
      .filter("service_type != ''")
      .filter("region_type != ''")
      .filter("datetime != ''")
    val pathstring = "/user/spark_streaming".concat(args(0))

    val query = extractedDF.writeStream
      .format("json")
      .option("path", pathstring)
      .option("checkpointLocation", "/user/some_checkpoint")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()

    query.awaitTermination()
  }
}
I need to run a single program with both .option("startingOffsets", "latest") and .option("startingOffsets", """{"someTopic":{"0":438521,"1":438705,"2":254180}}""").
I am not sure if this is achievable.

This is an old question at this point, so the OP likely got their answer, but when specifying offsets in JSON string format, you can use -2 for earliest and -1 for latest.
From the Structured Streaming + Kafka Integration Guide, for the startingOffsets option:
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
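In practice this means the two options cannot be combined as two separate .option("startingOffsets", ...) calls: the reader's options are kept in a map, so the second call simply overwrites the first. What you can do is express both intents in a single JSON string, using concrete offsets for some partitions and -1 (latest) or -2 (earliest) for the others. A minimal sketch, reusing the topic from the question:

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "somekafkabroker:9092")
  .option("subscribe", "someTopic")
  // Partition 0 starts from a fixed offset; partitions 1 and 2 start from latest (-1).
  .option("startingOffsets", """{"someTopic":{"0":438521,"1":-1,"2":-1}}""")
  .option("failOnDataLoss", "false")
  .load()

Also note that, as the quoted documentation says, startingOffsets only takes effect when the query starts without an existing checkpoint; once the checkpoint under /user/some_checkpoint exists, a restarted query always resumes from the committed offsets, regardless of this option.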

Related

What is the best way to perform multiple filter operations on spark streaming dataframe read from Kafka?

I need to apply multiple filters on a DataFrame read from a Kafka topic and publish the output of each of these filters to an external system (like another Kafka topic).
I read the kafkaDF like this:
val kafkaDF: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "try.kafka.stream")
  .load()
  .select(col("topic"), expr("cast(value as string) as message"))
  .filter(col("message").isNotNull && col("message") =!= "")
  .select(from_json(col("message"), eventsSchema).as("eventData"))
  .select("eventData.*")
I am able to run a foreachBatch on this DataFrame and then iterate over the list of filters to get the filtered data, which can then be published to a Kafka topic, as shown below:
kafkaDF.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // List of filters that needs to be applied
    filterList.par.foreach(filterString => {
      val filteredDF = batch.filter(filterString)
      // Add some columns.
      // Do some operations based on different filter
      filteredDF.toJSON.foreach(value => {
        // Publish a message to Kafka
      })
    })
  }
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()
  .awaitTermination()
But I am not sure this is the best way, given so many iterations. Is there a better way than doing it like this?
If you plan to write data from one Kafka topic into multiple Kafka topics, you can create a column called "topic" in a single DataFrame when writing to Kafka. The value in this column then defines the topic to which a record is produced. This allows you to write to as many different Kafka topics as required.
Therefore, I would just apply your filter logic as a when/otherwise condition or, if it is more complex, as a UDF (a sketch of the UDF variant follows the example below).
Below is example code that should get you started. Based on the value of the consumed Kafka message, a column called "topic" gets created in filteredDf. If value = 1 the record gets produced into the topic "out1"; otherwise the record gets produced into the topic "out2".
val inputDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "try.kafka.stream")
  .option("failOnDataLoss", "false")
  .load()
  .selectExpr("CAST(key AS STRING) as key", "CAST(value AS STRING) as value", "partition", "offset", "timestamp")

// Route records with value = 1 to "out1", everything else to "out2".
val filteredDf = inputDf.withColumn("topic", when(col("value") === 1, lit("out1")).otherwise(lit("out2")))

val query = filteredDf
  .select(
    col("key"),
    to_json(struct(col("*"))).alias("value"),
    col("topic"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/home/michael/sparkCheckpoint/1/")
  .start()

query.awaitTermination()
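If the routing rule is more complex than a simple comparison, the UDF variant mentioned above could look like the following minimal sketch (the routing function routeToTopic and its rule are hypothetical, just to illustrate the shape):

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical routing rule: messages whose value starts with "a" go to "out1",
// everything else goes to "out2".
val routeToTopic = udf((value: String) => if (value != null && value.startsWith("a")) "out1" else "out2")

val filteredDf = inputDf.withColumn("topic", routeToTopic(col("value")))

The rest of the writeStream part stays exactly the same, since the Kafka sink only looks at the key, value and topic columns.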
EDIT: (I might have misunderstood your question initially)
If you just want to find a good way to apply the multiple filters in your filterList, you can combine them using foldLeft:
val filter1 = col("value") === 1
val filter2 = col("key") === 1
val filterList = List(filter1, filter2)

val filterAll = filterList.tail.foldLeft(filterList.head)((f1, f2) => f1.and(f2))

println(filterAll)
// ((value = 1) AND (key = 1))
Then apply .filter(filterAll) to your Dataframe.

Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic.
Code:
1- Reading two topics
val PERSONINFORMATION_df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xx:9092")
  .option("subscribe", "PERSONINFORMATION")
  .option("group.id", "info")
  .option("maxOffsetsPerTrigger", 1000)
  .option("startingOffsets", "earliest")
  .load()

val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxx:9092")
  .option("subscribe", "CANDIDATEINFORMATION")
  .option("group.id", "candent")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", 1000)
  .option("failOnDataLoss", "false")
  .load()
2- Parse data to join them:
val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
  .select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s")).select("s.*")

val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
  .select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s")).select("s.*")

val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
3- Join two frames
val joined_df: DataFrame = df_candidate.join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"), "inner")

val string2json: DataFrame = joined_df.select($"dfcandidate.ID".as("key"),
  to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))
4- Write them to a topic
string2json.writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "xxxx:9092")
  .option("topic", "toDelete")
  .option("checkpointLocation", "checkpoints")
  .option("failOnDataLoss", "false")
  .start()
  .awaitTermination()
Error message:
21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.
Your code looks fine to me; it is rather the checkpointing that is causing the issue.
Based on the error message you are getting, you probably ran this job with only one stream source. Then you added the code for the stream join and tried to re-start the application without removing the existing checkpoint files. Now the application tries to recover from the checkpoint files but realises that you initially had only one source and now you have two sources.
The section Recovery Semantics after Changes in a Streaming Query explains which changes are and are not allowed when using checkpointing. Changing the number of input sources is not allowed:
"Changes in the number or type (i.e. different source) of input sources: This is not allowed."
To solve your problem: delete the current checkpoint files and re-start the job.
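If deleting the old files is awkward in your environment, pointing the restarted query at a fresh, empty checkpoint directory has the same effect, because the joined query then starts as a new query instead of trying to recover single-source offsets. A minimal sketch (the new path is just an illustration):

string2json.writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "xxxx:9092")
  .option("topic", "toDelete")
  // New, empty checkpoint directory so no single-source offsets are recovered.
  .option("checkpointLocation", "checkpoints_joined")  // hypothetical new path
  .option("failOnDataLoss", "false")
  .start()
  .awaitTermination()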

Performance of Spark structured streaming Stream-Stream join

I am trying out the stream-stream join feature of Spark Structured Streaming using Spark 2.4.0.
I am joining two simple sets of data just to observe the performance of the stream-stream join. I am currently running this on my local machine with just a few input records, and I observe that it takes more than a couple of minutes to join the data from the two streams and write the output to Kafka.
Here is what I have been trying:
val in1Df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic1"))
  .load()
  .select($"timestamp" as "timestamp1", $"value" cast "string" as "value1")

val in2Df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("subscribe", config.getString("SparkStrucStreamingPoc.inTopic2"))
  .load()
  .select($"timestamp" as "timestamp2", $"value" cast "string" as "value2")

val in1DfWithWatermark = in1Df
  .select($"timestamp1", $"value1")
  .withWatermark("timestamp1", "10 seconds")

val in2DfWithWatermark = in2Df
  .select($"timestamp2", $"value2")
  .withWatermark("timestamp2", "20 seconds")

val joinedDf = in1DfWithWatermark.join(in2DfWithWatermark,
  expr("""value1 = value2 AND
          timestamp2 >= timestamp1 AND
          timestamp2 <= timestamp1 + interval 1 minutes"""))

joinedDf.select(($"value1").alias("value"))
  .writeStream
  .format("kafka")
  .option("topic", config.getString("SparkStrucStreamingPoc.outTopic"))
  .option("kafka.bootstrap.servers", s"$kafkaHost:$kafkaPort")
  .option("checkpointLocation", config.getString("SparkStrucStreamingPoc.checkpoint"))
  .start()
  .awaitTermination()
Has anyone else observed this kind of behavior? Does it usually take this long to join two streams?

Spark Streaming: Text data source supports only a single column

I am consuming Kafka data and then streaming the data to HDFS.
The data stored in the Kafka topic trial looks like:
hadoop
hive
hive
kafka
hive
However, when I submit my code, it returns:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Text data source supports only a single column, and you have 7 columns.;
=== Streaming Query ===
Identifier: [id = 2f3c7433-f511-49e6-bdcf-4275b1f1229a, runId = 9c0f7a35-118a-469c-990f-af00f55d95fb]
Current Committed Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":13}}}
Current Available Offsets: {KafkaSource[Subscribe[trial]]: {"trial":{"2":13,"1":13,"3":12,"0":14}}}
My question is: as shown above, the data stored in Kafka comprises only ONE column, so why does the program say there are 7 columns?
Any help is appreciated.
My Spark Streaming code:
def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder.master("local[4]")
    .appName("SpeedTester")
    .config("spark.driver.memory", "3g")
    .getOrCreate()

  val ds = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "192.168.95.20:9092")
    .option("subscribe", "trial")
    .option("startingOffsets", "earliest")
    .load()
    .writeStream
    .format("text")
    .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
    .awaitTermination()
}
That is explained in the Structured Streaming + Kafka Integration Guide:
Each row in the source has the following schema:
Column         Type
key            binary
value          binary
topic          string
partition      int
offset         long
timestamp      long
timestampType  int
That gives exactly seven columns. If you want to write only the payload (value), select it and cast it to string:
spark.readStream
  ...
  .load()
  .selectExpr("CAST(value as string)")
  .writeStream
  ...
  .awaitTermination()
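Applied to the code in the question, a minimal sketch (same broker, path and checkpoint location as above) would be:

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.95.20:9092")
  .option("subscribe", "trial")
  .option("startingOffsets", "earliest")
  .load()
  // Keep only the payload, cast from binary to string, so the text sink
  // sees a single column instead of the full seven-column Kafka schema.
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("text")
  .option("path", "hdfs://192.168.95.21:8022/tmp/streaming/fixed")
  .option("checkpointLocation", "/tmp/checkpoint")
  .start()
  .awaitTermination()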

How to handle delayed events per group in spark

Spark's watermark feature comes in handy for delayed events. But I am not sure how to handle a scenario where the stream is generated from multiple devices in the field and some devices may be reporting their events a bit late. If we apply a watermark, the event-time watermark is maintained by Spark across all events, not per groupBy field, so Spark will drop all events coming from devices that are running (syncing) late. What is the best way to handle such a scenario? I have modified the word count program from Spark Structured Streaming to demonstrate the issue.
import java.sql.Timestamp

import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}

case class DeviceData(deviceId: String, value: Double, userId: String, timestamp: Timestamp)

object StructuredNetworkWordCountWindowed {
  def main(args: Array[String]) {
    if (args.length < 3) {
      System.err.println("Usage: StructuredNetworkWordCountWindowed <hostname> <port>" +
        " <window duration in seconds> [<slide duration in seconds>]")
      System.exit(1)
    }

    val host = args(0)
    val port = args(1).toInt
    val windowSize = args(2).toInt
    val slideSize = if (args.length == 3) windowSize else args(3).toInt
    if (slideSize > windowSize) {
      System.err.println("<slide duration> must be less than or equal to <window duration>")
    }
    val windowDuration = s"$windowSize seconds"
    val slideDuration = s"$slideSize seconds"

    val spark = SparkSession
      .builder
      .appName("StructuredNetworkWordCountWindowed")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    val deviceDF: DataFrame = lines.as[String].map(_.split(","))
      .map(value => DeviceData(value(0), value(1).toDouble, value(2), new Timestamp(value(3).toLong))).toDF()

    // Group the data by window and deviceId and compute the count of each group
    val windowedCounts = deviceDF
      .withWatermark("timestamp", "2 minutes")
      .groupBy(window($"timestamp", windowDuration, slideDuration), $"deviceId")
      .count()

    val query = windowedCounts.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
Here, if device1 is syncing in near real time while device2 lags by 5 minutes, the program will completely ignore the events from device2. Is there a way to apply the watermark per groupBy key rather than maintaining it for the query as a whole?
