How to handle delayed events per group in spark - apache-spark

Spark watermark features comes handy when it comes to delayed events. But I am not sure how to handle a scenario where stream is generated from multiple devices in the field, and some devices my be reporting the events bit late. If we apply a watermark, eventTime watermark is maintained in spark against all events and not against the groupBy fields. So spark will drop all the events coming from the devices which are running(syncing) late. What is the best way to handle such scenario? I have modified the word count program from spark structured streaming to demonstrate the issue.
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
case class DeviceData(deviceId:String, value:Double, userId:String, timestamp:Timestamp)
object StructuredNetworkWordCountWindowed {
def main(args: Array[String]) {
if (args.length < 3) {
System.err.println("Usage: StructuredNetworkWordCountWindowed <hostname> <port>" +
" <window duration in seconds> [<slide duration in seconds>]")
System.exit(1)
}
val host = args(0)
val port = args(1).toInt
val windowSize = args(2).toInt
val slideSize = if (args.length == 3) windowSize else args(3).toInt
if (slideSize > windowSize) {
System.err.println("<slide duration> must be less than or equal to <window duration>")
}
val windowDuration = s"$windowSize seconds"
val slideDuration = s"$slideSize seconds"
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCountWindowed")
.master("local[*]")
.getOrCreate()
import spark.implicits._
// Create DataFrame representing the stream of input lines from connection to host:port
val lines = spark.readStream
.format("socket")
.option("host", host)
.option("port", port)
.load()
val deviceDF:DataFrame = lines.as[String].map(_.split(",")).
map(value=>DeviceData(value(0), value(1).toDouble, value(2), new Timestamp(value(3).toLong))).toDF()
// Group the data by window and deviceId and compute the count of each group
val windowedCounts = deviceDF
.withWatermark("timestamp", "2 minutes")
.groupBy(window($"timestamp", windowDuration, slideDuration), $"deviceId")
.count()
val query = windowedCounts.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.start()
query.awaitTermination()
}
}
Here if device1 is syncing almost near real time, while device2 lag by 5 minutes, then the program will completely ignore the events from device2. Is there a way to apply the watermark against the groupBy function than keeping it as a whole?

Related

Why a new batch is triggered without getting any new offsets in streaming source?

I have multiple spark structured streaming jobs and the usual behaviour that I see is that a new batch is triggered only when there are any new offsets in Kafka which is used as source to create streaming query.
But when I run this example which demonstrates arbitrary stateful operations using mapGroupsWithState , then I see that a new batch is triggered even if there is no new data in Streaming source. Why is it so and can it be avoided?
Update-1
I modified the above example code and remove state related operation like updating/removing it. Function simply outputs zero. But still a batch is triggered every 10 seconds without any new data on netcat server.
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming._
object Stateful {
def main(args: Array[String]): Unit = {
val host = "localhost"
val port = "9999"
val spark = SparkSession
.builder
.appName("StructuredSessionization")
.master("local[2]")
.getOrCreate()
import spark.implicits._
// Create DataFrame representing the stream of input lines from connection to host:port
val lines = spark.readStream
.format("socket")
.option("host", host)
.option("port", port)
.option("includeTimestamp", true)
.load()
// Split the lines into words, treat words as sessionId of events
val events = lines
.as[(String, Timestamp)]
.flatMap { case (line, timestamp) =>
line.split(" ").map(word => Event(sessionId = word, timestamp))
}
val sessionUpdates = events
.groupByKey(event => event.sessionId)
.mapGroupsWithState[SessionInfo, Int](GroupStateTimeout.ProcessingTimeTimeout) {
case (sessionId: String, events: Iterator[Event], state: GroupState[SessionInfo]) =>
0
}
val query = sessionUpdates
.writeStream
.outputMode("update")
.trigger(Trigger.ProcessingTime("10 seconds"))
.format("console")
.start()
query.awaitTermination()
}
}
case class Event(sessionId: String, timestamp: Timestamp)
case class SessionInfo(
numEvents: Int,
startTimestampMs: Long,
endTimestampMs: Long)
The reason for the empty batches showing up is the usage of Timeouts within the mapGroupsWithState call.
According to the book "Learning Spark 2.0" it says:
"The next micro-batch will call the function on this timed-out key even if there is not data for that key in that micro.batch. [...] Since the timeouts are processed during the micro-batches, the timing of their execution is imprecise and depends heavily on the trigger interval [...]."
As you have set the timeout to be GroupStateTimeout.ProcessingTimeTimeout it aligns with your trigger time of the query which is 10 seconds. The alternative would be to set the timeout based on event time (i.e. GroupStateTimeout.EventTimeTimeout).
The ScalaDocs on GroupState provide some more details:
When the timeout occurs for a group, the function is called for that group with no values, and GroupState.hasTimedOut() set to true.

How to include both "latest" and "JSON with specific Offset" in "startingOffsets" while importing data from Kafka into Spark Structured Streaming

I have a streaming query saving data into filesink. I am using .option("startingOffsets", "latest") and a checkpoint location. If there is any down time on Spark and when the streaming query starts again i do not want to start processing where the query left off when it went down rather than this scenario i would also like to add ("startingOffsets", """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """) by specifying the user defined offset which needs to process from.
i tried doing this with different programs but i need to achieve this in one single program
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
object OSB_offset_kafkaToSpark {
def main(args: Array[String]): Unit = {
val spark = SparkSession.
builder().
appName("OSB_kafkaToSpark").
config("spark.mongodb.output.uri", "spark.mongodb.output.uri=mongodb://somemongodb.com:27018").
getOrCreate()
println("SparkSession -> "+spark)
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "somekafkabroker:9092, somekafkabroker:9092")
.option("subscribe", "someTopic")
.option("startingOffsets", "latest")
.option("startingOffsets",""" {"someTopic":{"0":438521}}, "someTopic":{"1":438705}}, "someTopic":{"2":254180}}""")
.option("endingOffsets",""" {"someTopic":{"0":-1}}, "someTopic":{"1":-1}}, "someTopic":{"2":-1}} """)
.option("failOnDataLoss", "false")
.load()
val dfs = df.selectExpr("CAST(value AS STRING)")
val data = dfs.withColumn("splitted", split($"value", "/"))
.select($"splitted".getItem(4).alias("region"), $"splitted".getItem(5).alias("service"), col("value"))
.withColumn("service_type", regexp_extract($"service", """.*(Inbound|Outbound|Outound).*""", 1))
.withColumn("region_type", concat(
when(col("region").isNotNull, col("region")).otherwise(lit("null")), lit(" "),
when(col("service").isNotNull, col("service_type")).otherwise(lit("null"))))
.withColumn("datetime", regexp_extract($"value", """\d{4}-[01]\d-[0-3]\d [0-2]\d:[0-5]\d:[0-5]\d""", 0))
val extractedDF = data.filter(
col("region").isNotNull &&
col("service").isNotNull &&
col("value").isNotNull &&
col("service_type").isNotNull &&
col("region_type").isNotNull &&
col("datetime").isNotNull)
.filter("region != ''")
.filter("service != ''")
.filter("value != ''")
.filter("service_type != ''")
.filter("region_type != ''")
.filter("datetime != ''")
val pathstring = "/user/spark_streaming".concat(args(0))
val query = extractedDF.writeStream
.format("json")
.option("path", pathstring)
.option("checkpointLocation", "/user/some_checkpoint")
.outputMode("append")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
query.awaitTermination()
}
}
I need run a single program with both .option("startingOffsets", "latest") and .option("startingOffsets",""" {"someTopic":{"0":438521}}, "someTopic":{"1":438705}}, "someTopic":{"2":254180}}""").
I am not sure if this is achievable
This is an old question at this point, so the OP likely got their answer, but when specifying offsets in JSON string format, you can use -2 for earliest and -1 for latest.
src
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.

Call a function with each element a stream in Databricks

I have a DataFrame stream in Databricks, and I want to perform an action on each element. On the net I found specific purpose methods, like writing it to the console or dumping into memory, but I want to add some business logic, and put some results into Redis.
To be more specific, this is how it would look like in non-stream case:
val someDataFrame = Seq(
("key1", "value1"),
("key2", "value2"),
("key3", "value3"),
("key4", "value4")
).toDF()
def someFunction(keyValuePair: (String, String)) = {
println(keyValuePair)
}
someDataFrame.collect.foreach(r => someFunction((r(0).toString, r(1).toString)))
But if the someDataFrame is not a simple data frame but a stream data frame (indeed coming from Kafka), the error message is this:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Could anyone please help me solving this problem?
Some important notes:
I've read the relevant documentation, like Spark Streaming or Databricks Streaming and a few other descriptions as well.
I know that there must be something like start() and awaitTermination, but I don't know the exact syntax. The descriptions did not help.
It would take pages to list all the possibilities I tried, so I rather not provide them.
I do not want to solve the specific problem of displaying the result. I.e. please do not provide a solution to this specific case. The someFunction would look like this:
val someData = readSomeExternalData()
if (condition containing keyValuePair and someData) {
doSomething(keyValuePair);
}
(Question What is the purpose of ForeachWriter in Spark Structured Streaming? does not provide a working example, therefore does not answer my question.)
Here is an example of reading using foreachBatch to save every item to redis using the streaming api.
Related to a previous question (DataFrame to RDD[(String, String)] conversion)
// import spark and spark-redis
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.streaming._
import org.apache.spark.sql.types._
import com.redislabs.provider.redis._
// schema of csv files
val userSchema = new StructType()
.add("name", "string")
.add("age", "string")
// create a data stream reader from a dir with csv files
val csvDF = spark
.readStream
.format("csv")
.option("sep", ";")
.schema(userSchema)
.load("./data") // directory where the CSV files are
// redis
val redisConfig = new RedisConfig(new RedisEndpoint("localhost", 6379))
implicit val readWriteConfig: ReadWriteConfig = ReadWriteConfig.Default
csvDF.map(r => (r.getString(0), r.getString(0))) // converts the dataset to a Dataset[(String, String)]
.writeStream // create a data stream writer
.foreachBatch((df, _) => sc.toRedisKV(df.rdd)(redisConfig)) // save each batch to redis after converting it to a RDD
.start // start processing
Call simple user defined function foreachbatch in spark streaming.
please try this,
it will print 'hello world' for every message from tcp socket
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession .builder .appName("StructuredNetworkWordCount") .getOrCreate()
# Create DataFrame representing the stream of input lines from connection tolocalhost:9999
lines = spark .readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load()
# Split the lines into words
words = lines.select(
explode(
split(lines.value, " ")
).alias("word")
)
# Generate running word count
wordCounts = words.groupBy("word").count()
# Start running the query that prints the running counts to the console
def process_row(df, epoch_id):
# # Write row to storage
print('hello world')
query = words.writeStream.foreachBatch(process_row).start()
#query = wordCounts .writeStream .outputMode("complete") .format("console") .start()
query.awaitTermination()

What if OOM happens for the complete output mode in spark structured streaming

I am new and learning spark structured streaming,
I have following code that is using complete as the output mode
import java.util.Date
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType
object StreamingWordCount {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("StreamingWordCount")
.config("spark.sql.shuffle.partitions", 1)
.master("local[2]")
.getOrCreate()
import spark.implicits._
val lines = spark
.readStream
.schema(new StructType().add("value", "string"))
.option("maxFilesPerTrigger", 1)
.text("file:///" + data_path)
.as[String]
val wordCounts = lines.flatMap(_.split(" ")).groupBy("value").count()
val query = wordCounts.writeStream
.queryName("t")
.outputMode("complete")
.format("memory")
.start()
while (true) {
spark.sql("select * from t").show(truncate = false)
println(new Date())
Thread.sleep(1000)
}
query.awaitTermination()
}
}
A quick question is that over time, the spark runtime remembers too many states of word and count, so OOM should happen at some time,
I would ask how to do in practice for such kind of scenario.
Memory sink should be used only for debugging purposes on low data volumes as the entire output will be collected and stored in the driver’s memory. The output will be stored in memory as an in-memory table.
So if OOM error occurs, the driver will crashes and all the state maintained in Driver's memory will be lost.
The same applies for Console sink as well.

Refresh Dataframe in Spark real-time Streaming without stopping process

in my application i get a stream of accounts from Kafka queue (using Spark streaming with kafka)
And i need to fetch attributes related to these accounts from S3 so im planning to cache S3 resultant dataframe as the S3 data will not updated atleast for a day for now, it might change to 1hr or 10 mins very soon in future .So the question is how can i refresh the cached dataframe periodically without stopping process.
**Update:Im planning to publish an event into kafka whenever there is an update in S3, using SNS and AWS lambda and my streaming application will subscribe to the event and refresh the cached dataframe based on this event (basically unpersist()cache and reload from S3)
Is this a good approach ?
This question was recently asked on the Spark Mailing List
As far as I know the only way to do what you're asking is to reload the DataFrame from S3 when new data arrives which means you have to recreate the streaming DF as well and restart the query. This is because DataFrames are fundamentally immutable.
If you want to update (mutate) data in a DataFrame without reloading it, you need to try one of the datastores that integrate with or connect to Spark and allow mutations. One that I'm aware of is SnappyData.
Simplest way to achieve , below code reads dimension data folder for every batch but do keep in mind new dimension data values (country names in my case) have to be a new file.
package com.databroccoli.streaming.dimensionupateinstreaming
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.{broadcast, expr}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
object RefreshDimensionInStreaming {
def main(args: Array[String]) = {
#transient lazy val logger: Logger = Logger.getLogger(getClass.getName)
Logger.getLogger("akka").setLevel(Level.WARN)
Logger.getLogger("org").setLevel(Level.ERROR)
Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
Logger.getLogger("io.netty").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.master("local")
.getOrCreate()
val schemaUntyped1 = StructType(
Array(
StructField("id", StringType),
StructField("customrid", StringType),
StructField("customername", StringType),
StructField("countrycode", StringType),
StructField("timestamp_column_fin_1", TimestampType)
))
val schemaUntyped2 = StructType(
Array(
StructField("id", StringType),
StructField("countrycode", StringType),
StructField("countryname", StringType),
StructField("timestamp_column_fin_2", TimestampType)
))
val factDf1 = spark.readStream
.schema(schemaUntyped1)
.option("header", "true")
.csv("src/main/resources/broadcasttest/fact")
var countryDf: Option[DataFrame] = None: Option[DataFrame]
def updateDimensionDf() = {
val dimDf2 = spark.read
.schema(schemaUntyped2)
.option("header", "true")
.csv("src/main/resources/broadcasttest/dimension")
if (countryDf != None) {
countryDf.get.unpersist()
}
countryDf = Some(
dimDf2
.withColumnRenamed("id", "id_2")
.withColumnRenamed("countrycode", "countrycode_2"))
countryDf.get.show()
}
factDf1.writeStream
.outputMode("append")
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.show(10)
updateDimensionDf()
batchDF
.join(
countryDf.get,
expr(
"""
countrycode_2 = countrycode
"""
),
"leftOuter"
)
.show
}
.start()
.awaitTermination()
}
}

Resources