Spark Streaming is basically micro-batching: it runs periodically and processes the batch of data that arrived during that interval.
Some other stream processing engines, such as Storm, can be triggered by events: when some data arrives (an event occurs), it is handled immediately.
I wonder why Spark Streaming cannot do the same. For example, when some data arrives, Spark Streaming could turn it into an RDD immediately and run the computation on it. Is that hard? Or does it not make sense? Why does it have to wait for a while (the interval) and then handle the accumulated data together?
Related
What is the advantage of processing data stream in batches over processing every data element in data stream? Why is it not possible to apply processing on each data item as it arrives?
The DStream abstraction in Spark Streaming is built on this micro-batch concept: records are collected over a fixed batch interval and processed together.
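For illustration, a minimal sketch of how the batch interval is fixed when a StreamingContext is created (the 1-second interval and the socket source are arbitrary choices, not taken from the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("micro-batch-example").setMaster("local[2]")
// Every DStream created from this context is processed in 1-second micro-batches.
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print() // one count is emitted per 1-second micro-batch

ssc.start()
ssc.awaitTermination()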
When query execution in Spark Structured Streaming has no trigger setting:
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  //.trigger(???) // <--- Trigger intentionally omitted ----
  .start()
As of Spark 2.4.3 (Aug 2019), the Structured Streaming Programming Guide - Triggers says:
If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.
QUESTION: On what basis does the default trigger determine the size of the micro-batches?
Let's say the input source is Kafka and the job was interrupted for a day because of an outage. When the same Spark job is restarted, it will consume messages from where it left off. Does that mean the first micro-batch will be a gigantic batch with one day's worth of messages that accumulated in the Kafka topic while the job was stopped? Assuming the job takes 10 hours to process that big batch, will the next micro-batch then contain 10 hours' worth of messages? And so on for X iterations, gradually catching up the backlog until the micro-batches become small again?
On which basis the default trigger determines the size of the micro-batches?
It does not. Every trigger (however long) simply requests input datasets from all sources, and whatever they give is processed downstream by the operators. The sources know what to give because they know what has been consumed (processed) so far.
It is as if you asked about a batch structured query and the size of the data this single "trigger" requests to process (BTW, there is also the Trigger.Once trigger).
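For reference, a minimal sketch of such a one-shot query (the console sink is only for illustration):

import org.apache.spark.sql.streaming.Trigger

// Process whatever the sources currently have available in a single micro-batch, then stop.
df.writeStream
  .format("console")
  .trigger(Trigger.Once())
  .start()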
Does that mean the first micro-batch will be a gigantic batch with 1 day of msg which accumulated in the Kafka topic while the job was stopped?
Almost (and it really has little, if anything, to do with Spark Structured Streaming itself).
The number of records the underlying Kafka consumer gets to process in a single poll is configured by max.poll.records and perhaps by some other configuration properties (see Increase the number of messages read by a Kafka consumer in a single poll).
Since Spark Structured Streaming uses the Kafka data source, which is simply a wrapper around the Kafka Consumer API, whatever happens in a single micro-batch is roughly equivalent to a single Consumer.poll call.
You can configure the underlying Kafka consumer using options with the kafka. prefix (e.g. kafka.bootstrap.servers), which are applied to the Kafka consumers on the driver and executors.
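A minimal sketch of passing such options to the Kafka source (the broker address, topic name and max.poll.records value are made up for illustration):

val input = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // kafka.-prefixed: forwarded to the Kafka consumer
  .option("kafka.max.poll.records", "500")             // also forwarded to the consumer
  .option("subscribe", "events")                       // hypothetical topic name
  .option("startingOffsets", "earliest")
  .load()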
I am writing a storage writer for Spark Structured Streaming which will partition the given dataframe and write it to a different blob store account. The Spark documentation says that it ensures exactly-once semantics for file sinks, but also says that exactly-once semantics are only possible if the source is re-playable and the sink is idempotent.
Is the blob store an idempotent sink if I write in parquet format?
Also, how will the behavior change if I do streamingDF.writeStream.foreachBatch(...writing the DF here...).start()? Will it still guarantee exactly-once semantics?
Possible duplicate : How to get Kafka offsets for structured query for manual and reliable offset management?
Update #1: Something like this:
output
  .writeStream
  .foreachBatch((df: DataFrame, _: Long) => {
    // storagePaths and r come from the surrounding code (a list of blob store paths and a Random)
    val path = storagePaths(r.nextInt(3))
    df.persist()
    df.write.parquet(path)
    df.unpersist()
  })
  .start()
Micro-Batch Stream Processing
I assume that the question is about Micro-Batch Stream Processing (not Continuous Stream Processing).
Exactly-once semantics are guaranteed based on the available and committed offsets internal registries (for the current stream execution, a.k.a. runId), as well as regular checkpoints (to persist processing state across restarts).
exactly once semantics are only possible if the source is re-playable and the sink is idempotent.
It is possible that whatever has already been processed but not yet recorded properly internally (see below) will be re-processed:
That means that all streaming sources in a streaming query should be re-playable, to allow polling again for data that has already been requested once.
That also means that the sink should be idempotent, so that data which was processed successfully and added to the sink may be added again without harm, because a failure can happen just before Structured Streaming manages to record the data (offsets) as successfully processed (in the checkpoint).
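For completeness, a minimal sketch of a query with a checkpoint location (the paths are made up); the checkpoint is where the offsets and commits discussed in the Internals section below are persisted, so a restart resumes from the last committed batch:

streamingDF
  .writeStream
  .format("parquet")
  .option("path", "/data/output")                       // hypothetical output path
  .option("checkpointLocation", "/data/checkpoints/q1") // hypothetical checkpoint path (offsets + commits)
  .start()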
Internals
Before the available data (by offset) of any of the streaming sources or readers is processed, MicroBatchExecution commits the offsets to the Write-Ahead Log (WAL) and prints out the following INFO message to the logs:
Committed offsets for batch [currentBatchId]. Metadata [offsetSeqMetadata]
A streaming query (a micro-batch) is executed only when there is new data available (based on offsets) or the last execution requires another micro-batch for state management.
In the addBatch phase, MicroBatchExecution requests the one and only Sink or StreamWriteSupport to process the available data.
Once a micro-batch finishes successfully, MicroBatchExecution commits the available offsets to the commits checkpoint and the offsets are considered processed.
MicroBatchExecution prints out the following DEBUG message to the logs:
Completed batch [currentBatchId]
When you use foreachBatch, Spark guarantees only that foreachBatch is invoked once per successfully completed batch. But if an exception occurs while foreachBatch is executing, Spark will call it again for the same batch. In that case we can get duplicates if we store to multiple storages and the exception occurs partway through storing.
So you have to handle exceptions during storing manually in order to avoid duplication.
In my practice, when I needed to store to multiple storages, I created a custom sink and used the Data Source API v2, which supports commits.
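One common way to do that manual handling is to make the writes idempotent by keying them on the batchId passed to foreachBatch; a rough sketch (the paths and the alreadyWritten/markWritten helpers are hypothetical):

import org.apache.spark.sql.DataFrame

streamingDF
  .writeStream
  .foreachBatch((df: DataFrame, batchId: Long) => {
    // alreadyWritten/markWritten are hypothetical helpers backed by a small "processed batch ids" store
    if (!alreadyWritten("storeA", batchId)) {
      df.write.parquet(s"/mnt/storeA/data/batch=$batchId")
      markWritten("storeA", batchId)
    }
    if (!alreadyWritten("storeB", batchId)) {
      df.write.parquet(s"/mnt/storeB/data/batch=$batchId")
      markWritten("storeB", batchId)
    }
  })
  .start()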
I have two streaming dataframes, firstDataframe and secondDataframe. I want to stream firstDataframe completely, and only if that first stream finishes successfully do I want to stream the other dataframe.
For example, in the code below, I would like the first streaming action to execute completely and only then the second one to begin:
firstDataframe.writeStream.format("console").start
secondDataframe.writeStream.format("console").start
Spark follows FIFO job scheduling by default. This means it would give priority to the first streaming job. However, if the first streaming job does not require all the available resources, it would start the second streaming job in parallel. I essentially want to avoid this parallelism. Is there a way to do this?
Reference: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
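One way to get this sequencing (a sketch, assuming the first query can be run as a one-shot Trigger.Once query and that blocking the driver until it finishes is acceptable):

import org.apache.spark.sql.streaming.Trigger

// Run the first query as a single micro-batch and block until it completes...
val first = firstDataframe.writeStream
  .format("console")
  .trigger(Trigger.Once())
  .start()
first.awaitTermination()

// ...and only then start the second query.
secondDataframe.writeStream
  .format("console")
  .start()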
Current setting: a Spark Streaming job processes a Kafka topic of time-series data. About every second, new data comes in from different sensors. The batch interval is also 1 second. By means of updateStateByKey(), stateful data is computed as a new stream. As soon as this stateful data crosses a threshold, an event is generated on a Kafka topic. When the value later drops below the threshold, again an event is fired on that topic.
So far, so good.
Problem: when applying a new algorithm to the data by re-consuming the Kafka topic, I would like this to go fast. But that means every batch contains (hundreds of) thousands of messages. Moving these to updateStateByKey() in one batch results in one computed value per key on the resulting stream.
Of course that's unacceptable, as loads of data points are reduced to a single one. Alarm events that would be generated on a real-time stream will not appear on the recomputed stream. So comparing algorithms this way is totally useless.
Question: How can I avoid this? Preferably without switching frameworks. It seems to me I'm looking for a true streaming (one event at a time) framework. On the other hand, Spark Streaming is new to me, so I'm definitely missing a lot there.
In Spark 1.6, a new API for interacting with state, mapWithState, was introduced. I believe it will solve your problem.
Have a look at it here.
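A minimal sketch of the mapWithState API (the running-sum state and the (sensorId, reading) key/value types are just an example, not your actual algorithm):

import org.apache.spark.streaming.{State, StateSpec}

// Emits one updated value per incoming record,
// unlike updateStateByKey, which emits one value per key per batch.
def updateSensor(sensorId: String, reading: Option[Double], state: State[Double]): (String, Double) = {
  val newSum = state.getOption.getOrElse(0.0) + reading.getOrElse(0.0)
  state.update(newSum)
  (sensorId, newSum)
}

// sensorStream is assumed to be a DStream[(String, Double)] of (sensorId, reading) pairs
val stateful = sensorStream.mapWithState(StateSpec.function(updateSensor _))
stateful.print()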