Why does Spark Streaming create batches with 0 events? - apache-spark

Spark Streaming keeps on creating batches with 0 events and queues them to be processed in the next job iteration. But is it really necessary to queue batches which have nothing to be processed, or is there something hidden going on?

This is working as intended because your job could still produce output even in the absence of data (which can also happen after filtering your data).
For example you might write some record to the database that indicates that there's no data available at a given timestamp.
stream.foreachRDD { rdd =>
  if (rdd.isEmpty()) {
    // write "empty" record to db
  } else {
    // write data to db
  }
}
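As a slightly fuller sketch, you can use the foreachRDD overload that also passes the batch time, so the "no data" record can carry the timestamp it refers to. Note that writeEmptyMarker and writeBatch below are hypothetical placeholders for your own database code:
stream.foreachRDD { (rdd, batchTime) =>
  if (rdd.isEmpty()) {
    // record that no data was available for this batch interval
    writeEmptyMarker(batchTime.milliseconds)
  } else {
    writeBatch(batchTime.milliseconds, rdd.collect())
  }
}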

Related

Is there a way to keep multiple sink queries in the same spark streaming application in sync similar to a transaction in a DB?

I have the below requirement. I get data from a streaming source like Kafka or Kinesis.
This data needs to be processed by a Spark streaming application. The application needs to determine the current value (say, of a metric) and the previous value (say, of the same metric) for each userid on the record. The current value is available on the record coming from the streaming source.
So, I created a Spark streaming application with 2 sink queries.
The first sink query processes the data from the streaming source (Kafka or Kinesis). It then looks up a table in a database (say, X) to find the previous value for the userid. If a record is found, it gets the previous value from there; otherwise the previous value is set to NULL. The userid, current value, previous value and other data are written as output to a Kinesis or Kafka sink.
If there are multiple records for the userid in a microbatch (of 10 seconds), then only the most recent record is processed.
The second sink query writes the userid and current value to the table X.
Ideally speaking, query 1 and query 2 need to be part of the same transaction. So, either both are committed or nothing is committed. Since the queries are writing to 2 different sinks, they cannot be committed as part of the same transaction.
Is there a way around this?
If there is no way around it, then the trigger time (of 10 seconds in my example) for both queries above needs to match, since they have to be in sync. If the database writes are slow, then the second query may not finish within 10 seconds.
Will the background thread wait for the second sink query to finish before it starts the next microbatch for the first sink query?
I'm still learning spark streaming. So any help is appreciated.
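For illustration only, here is a rough Scala sketch of the two-query layout described above. It is not an answer to the transaction question; enrichWithPrevious and the connection/checkpoint variables are hypothetical placeholders:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.streaming.Trigger

val source = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", inputTopic)
  .load()

// Sink query 1: enrich each record with the previous value looked up from table X,
// then write userid, current value and previous value to a Kafka sink.
val query1 = enrichWithPrevious(source).writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", outputTopic)
  .option("checkpointLocation", checkpointLocation1)
  .start()

// Sink query 2: write userid and current value back to table X.
val query2 = source.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .foreachBatch { (batch: Dataset[Row], batchId: Long) =>
    batch.write.mode("append").jdbc(jdbcUrl, "X", jdbcProps)   // upsert logic omitted
  }
  .option("checkpointLocation", checkpointLocation2)
  .start()

spark.streams.awaitAnyTermination()
Even with matching trigger intervals, the two queries run independently, so nothing in this sketch makes the writes transactional, which is exactly the gap described above.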

Does count() on Spark mean that all data is already available in memory to be processed?

My data scenario is as follows:
Reading data into a DataFrame from a database over JDBC using PySpark
I make a count() call both to see the number of records and to "know" when the data load is done. I am doing this to understand a potential bottleneck.
Write to file in s3 (in same region)
So, my objective is to know exactly when all the database/table data is loaded, so I can tell whether the problem is in reading or in writing the data when the job gets slow.
In my first attempts, I could get the record count very quickly (after 2 min of the job running), but my guess is that calling count() does not mean that the data is all loaded (in memory).
When you call count(), nothing has been loaded beforehand; count() is an action that triggers the data processing.
If you have a simple logical plan like this:
spark.read(..)
.map(..)
.filter(..)
...
.count()
the database will only be read once you call an action (in this example, count).
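To make this concrete, here is a hedged Scala sketch (the connection details are placeholders): each action re-runs the plan, so a count() followed by a write will read the table over JDBC twice unless the DataFrame is cached.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")   // placeholder connection details
  .option("dbtable", "public.my_table")
  .option("user", "user")
  .option("password", "password")
  .load()                                     // lazy: only the schema is fetched here

val n = df.count()                            // action #1: reads the table to count rows
df.write.parquet("s3a://my-bucket/export/")   // action #2: reads the table again unless df.cache() was called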

Can intermediate state be dropped/controlled in Spark structured streaming in Complete Output mode? (Spark 2.4.0)

I have a scenario where I want to process data from a Kafka topic. I have this Java code to read the data as a stream from the Kafka topic.
Dataset<Row> streamObjs = sparkSession.readStream().format("kafka")
.option("kafka.bootstrap.servers", bootstrapServers).option("subscribe", streamTopic)
.option("failOnDataLoss", false).load();
I cast it to String, define the schema, then try to use watermark (for late data) and window (for grouping and aggregations) and finally output to kafka sink.
Dataset<Row> selectExprImporter = streamObjs.selectExpr("CAST(value AS STRING)");
StructType streamSchema = new StructType().add("id", DataTypes.StringType)
.add("timestamp", DataTypes.LongType)
.add("values", new MapType(DataTypes.StringType, DataTypes.DoubleType, false));
Dataset<Row> selectValueImporter = selectExprImporter
.select(functions.from_json(new Column("value"), streamSchema).alias("data"));
.
.
(More transformations/operations)
.
.
Dataset<Row> aggCount_15min = streamData.withWatermark("timestamp", "2 minute")
.withColumn("frequency", functions.lit(15))
.groupBy(new Column("id"), new Column("frequency"),
functions.window(new Column("timestamp"), "15 minute").as("time_range"))
.agg(functions.mean("value").as("mean_value"), functions.sum("value").as("sum"),
functions.count(functions.lit(1)).as("number_of_values"))
.filter("mean_value > 35").orderBy("id", "frequency", "time_range");
aggCount_15min.selectExpr("to_json(struct(*)) AS value").writeStream()
.outputMode(OutputMode.Complete()).format("kafka").option("kafka.bootstrap.servers", bootstrapServers)
.option("topic", outputTopic).option("checkpointLocation", checkpointLocation).start().awaitTermination();
Questions
Am I correct in understanding that when using Complete Output mode in the kafka sink, the intermediate state will keep on increasing forever until I get OutOfMemory exception?
Also, what is the ideal use case for Complete Output mode? Use it only when intermediate data/state doesn't increase?
Complete Output mode is needed in my case as I want to use the orderBy clause. Is there some way so that I can force spark to drop the state it has after every say 30 mins and work again with new data?
Is there a better way to not use Complete Output mode but still get the desired result? Should I use something else other than spark structured streaming?
The desired result is to aggregate and group the data as per the query above; then, once the 1st batch has been created, drop all state and start fresh for the next batch. Here a batch can be a function of the last processed timestamp: say, drop all state and start fresh when the current timestamp has crossed 20 min from the first received timestamp; or better, make it a function of the window time (15 min in this example), e.g. when 4 batches of 15-minute windows have been processed and the timestamp for the 5th batch arrives, drop the state for the previous 4 batches and start fresh for this batch.
The question asks many things and focuses less on what Spark Structured Streaming (SSS) actually does. Answering your title question, numbered questions and the final non-numbered question in turn:
A. Title Question:
Not as such. Complete mode only stores aggregates, so not all data is
kept, only state that allows re-computation as data is incrementally
added. I find the manual misleading in its description, but it may be
just me. Note that without a streaming aggregation you would get this
error instead:
org.apache.spark.sql.AnalysisException: Complete output mode not
supported when there are no streaming aggregations on streaming
DataFrames/Datasets
Am I correct in understanding that when using Complete Output mode in the kafka sink, the intermediate state will keep on increasing forever until I get OutOfMemory exception?
The Kafka sink does not figure here. The intermediate state is what
Spark Structured Streaming needs to store. It keeps only the aggregates
and discards the raw input once it has been folded in. But in the end
you would get an OOM, due to this or some other error, I suspect.
Also, what is the ideal use case for Complete Output mode? Use it only when intermediate data/state doesn't increase?
For aggregations over all data received. The second part of your question is not logical, so I cannot answer it; the state will generally increase over time.
Complete Output mode is needed in my case as I want to use the orderBy clause. Is there some way so that I can force spark to drop the state it has after every say 30 mins and work again with new data?
No, there is not. Even stopping gracefully and then re-starting is not really an option, as the period would then no longer be 15 mins, and it goes against the SSS approach anyway. From the manuals: sorting operations are supported on streaming Datasets only after an aggregation and in Complete output mode. You cannot drop the state as you would like; again, this comes back to how the aggregates are maintained.
Is there a better way to not use Complete Output mode but still get the desired result? Should I use something else other than spark structured streaming?
No, as you have many requirements that cannot be satisfied by the
current implementation. Unless you drop the order by and do a
non-overlapping window operation (15,15) in Append mode with a
minuscule watermark, if memory serves correctly. You would then rely
on sorting later on in down-stream processing, as order by is not
supported there.
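For what it is worth, here is a rough Scala sketch of that Append-mode alternative. It assumes streamData has a proper TimestampType timestamp column, reuses bootstrapServers, outputTopic and checkpointLocation from the question's code, and drops the orderBy entirely:
import org.apache.spark.sql.functions._

// Non-overlapping 15-minute windows, a small watermark, no orderBy.
// State for a window is dropped once the watermark passes the end of that window.
val agg15min = streamData
  .withWatermark("timestamp", "2 minutes")
  .groupBy(col("id"), window(col("timestamp"), "15 minutes").as("time_range"))
  .agg(
    mean("value").as("mean_value"),
    sum("value").as("sum"),
    count(lit(1)).as("number_of_values"))
  .filter("mean_value > 35")

agg15min.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", outputTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()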
Final overall question: dropping all state once a batch (or a group of 15-minute windows) has been produced, and starting fresh for the next one. Whilst your ideas may be considered understandable, the SSS framework does not support all of this, and specifically not what you want, at least not just yet.

Synchronization between Spark RDD partitions

Say that I have an RDD with 3 partitions and I want to run each executor/worker in sequence, such that, after partition 1 has been computed, partition 2 can be computed, and after 2 is computed, finally, partition 3 can be computed. The reason I need this synchronization is that each partition has a dependency on some computation of the previous partition. Correct me if I'm wrong, but this type of synchronization does not appear to be well suited to the Spark framework.
I have pondered opening a JDBC connection in each worker task node as illustrated below:
rdd.foreachPartition { partition =>
  // 1. open jdbc connection
  // 2. poll database for the completion of dependent partition
  // 3. read dependent edge case value from computed dependent partition
  // 4. compute this partition
  // 5. write this edge case result to database
  // 6. close connection
}
I have even pondered using accumulators, picking the acc value up in the driver, and then re-broadcasting a value so the appropriate worker can start computation, but apparently broadcasting doesn't work like this, i.e., once you have shipped the broadcast variable through foreachPartition, you cannot re-broadcast a different value.
Synchronization is not really the issue. The problem is that you want to use a concurrency layer to achieve this, and as a result you get completely sequential execution. Not to mention that pushing changes to the database just to fetch them back on another worker means you get none of the benefits of in-memory processing. In its current form it doesn't make sense to use Spark at all.
Generally speaking, if you want to achieve synchronization in Spark you should think in terms of transformations. Your question is rather sketchy, but you can try something like this (see the sketch after the next paragraph):
1. Create a first RDD with data from the first partition. Process it in parallel and optionally push results outside.
2. Compute a differential buffer.
3. Create a second RDD with data from the second partition. Merge it with the differential buffer from 2, process it, and optionally push results to the database.
4. Go back to 2. and repeat.
What do you gain here? First of all you can utilize your whole cluster. Moreover, partial results are kept in memory and don't have to be transferred back and forth between the workers and the database.
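As a very rough Scala sketch of that pattern, where chunkIds, loadChunk, process and computeDiff are all hypothetical placeholders for your own chunking and business logic:
// Iterate over logical chunks; each chunk is processed in parallel on the
// cluster, and only a small "differential buffer" is carried between chunks.
var buffer: Map[String, Double] = Map.empty

for (chunkId <- chunkIds) {
  val bufferBc = spark.sparkContext.broadcast(buffer)     // ship the current buffer to the workers

  val processed = loadChunk(spark, chunkId)                // RDD for this chunk
    .map(record => process(record, bufferBc.value))        // parallel processing using the buffer
    .cache()

  processed.saveAsTextFile(s"s3a://bucket/out/chunk-$chunkId")  // optionally push results outside

  buffer = computeDiff(processed)   // reduce to a small summary on the driver for the next chunk
  bufferBc.unpersist()
}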

Collect results from RDDs in a dstream driver program

I have this function in the driver program which collects the results from the RDDs in a DStream into an array and sends it back. However, even though the RDDs (in the dstream) have data, the function is returning an empty array... What am I doing wrong?
def runTopFunction(): Array[(String, Int)] = {
  val topSearches = some function....
  var summary = new ArrayBuffer[(String, Int)]()
  topSearches.foreachRDD(rdd => {
    summary = summary.++(rdd.collect())
  })
  return summary.toArray
}
So while foreachRDD will do what you are looking to do, it is also non-blocking, which means it won't wait until the whole stream is processed. Since you call toArray on your buffer right after the call to foreachRDD, no elements will have been processed yet.
DStream.foreachRDD is an action on the given DStream and will be scheduled for execution on each streaming batch interval. It's a declarative construction of a job to be executed later on.
Accumulating over the values in this way is not supported, because while DStream.foreachRDD is just saying "do this on each iteration", the surrounding accumulation code is executed immediately, resulting in an empty array.
Depending on what happens to the summary data after it is calculated, there are a few options for how to implement this:
If the data needs to be retrieved by another process, use a shared thread-safe structure. A priority queue is great for top-k uses.
If the data will be stored (fs, db), you can just write to the storage after applying the topSearches function to the dstream.
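For the second option, a minimal Scala sketch, where writeToStorage is a hypothetical placeholder for your own fs/db write:
// Write each batch's results to storage from inside the stream instead of
// accumulating them in a driver-side array.
topSearches.foreachRDD { (rdd, time) =>
  val batchResult: Array[(String, Int)] = rdd.collect()
  writeToStorage(time.milliseconds, batchResult)
}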
