spark-streaming batch interval for kafka topic reprocessing - apache-spark

Current setting: a Spark Streaming job processes a Kafka topic of time-series data. New data from different sensors arrives roughly every second, and the batch interval is 1 second. Stateful data is computed as a new stream by means of updateStateByKey(). As soon as this stateful value crosses a threshold, an event is published to a Kafka topic; when the value later drops below the threshold, another event is fired on that topic.
So far, so good.
Problem: when applying a new algorithm to the data by re-consuming the Kafka topic, I would like this to go fast. But that means every batch contains (hundreds of) thousands of messages, and feeding them to updateStateByKey() in a single batch results in just one computed value per key on the resulting stream.
Of course that's unacceptable, as loads of data points are reduced to a single one. Alarm events that would be generated on the real-time stream will not appear on the recomputed stream, so comparing algorithms this way is useless.
Question: how can I avoid this, preferably without switching frameworks? It seems to me I'm looking for a true streaming (one event at a time) framework. On the other hand, Spark Streaming is new to me, so I'm definitely missing a lot there.
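To make the issue concrete, here is a rough sketch (the sensorStream name and the max aggregation are illustrative, not from the question) of how updateStateByKey sees a batch: all values for a key that arrived in that batch are handed over as one Seq, so one giant replay batch yields only one state transition per key.
// Sketch only: updateStateByKey receives *all* values for a key in the current
// batch as a single Seq, so a huge replay batch produces one output per key.
// Assumes `sensorStream: DStream[(String, Double)]` of (sensorId, reading).
val stateful = sensorStream.updateStateByKey[Double] {
  (newReadings: Seq[Double], runningMax: Option[Double]) =>
    // With a 1-second batch this Seq holds ~1 reading; when replaying a topic
    // it may hold thousands, all reduced to a single output value here.
    Some((newReadings :+ runningMax.getOrElse(Double.MinValue)).max)
}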

In Spark 1.6, a new API for interacting with state, mapWithState, has been introduced. I believe it will solve your problem.
Have a look at it here.
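For reference, a minimal sketch of what the mapWithState API looks like (Spark 1.6+); the key/value types and the running-sum logic are assumptions for illustration:
import org.apache.spark.streaming.{State, StateSpec}

// Sketch: per-record state updates with mapWithState (Spark 1.6+).
// Unlike updateStateByKey, the mapping function is invoked once per record,
// so every input value can emit its own output element.
val spec = StateSpec.function {
  (sensorId: String, reading: Option[Double], state: State[Double]) =>
    val previous = state.getOption.getOrElse(0.0)
    val updated  = previous + reading.getOrElse(0.0) // hypothetical aggregation
    state.update(updated)
    (sensorId, updated)                              // one output per input record
}

// Assumes `sensorStream: DStream[(String, Double)]` as above.
val statefulStream = sensorStream.mapWithState(spec)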

Related

Is windowing based on event time possible with Spark Streaming?

According to the Dataflow Model paper, "A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing":
MillWheel and Spark Streaming are both sufficiently scalable, fault-tolerant, and low-latency to act as reasonable substrates, but lack high-level programming models that make calculating event-time sessions straightforward.
Is it always the case?
No, it is not.
To quote from https://dzone.com/articles/spark-streaming-vs-structured-streaming (so as to save my lunch time!):
One big issue in the streaming world is how to process data according to event-time.
Event-time is the time when the event actually happened. It is not necessary for the source of the streaming engine to provide data in real time; there may be latencies in generating the data and handing it over to the processing engine. There is no option in Spark Streaming to work on the data using event-time: it only works with the timestamp at which the data is received by Spark. Based on that ingestion timestamp, Spark Streaming puts the data in a batch even if the event was generated earlier and belonged to an earlier batch, which may result in less accurate information, as it is effectively data loss.
On the other hand, Structured Streaming provides the ability to process data on the basis of event-time when the timestamp of the event is included in the received data. This is a major feature introduced in Structured Streaming, providing a different way of processing the data according to the time of data generation in the real world. With this, we can handle data coming in late and get more accurate results.
With event-time handling of late data, Structured Streaming has the edge over Spark Streaming.
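For illustration, a minimal Structured Streaming sketch (column names are assumed) that windows on the event-time field carried in the data rather than on arrival time:
import org.apache.spark.sql.functions.{col, window}

// Sketch: event-time windowing in Structured Streaming. Assumes a streaming
// DataFrame `events` with columns `eventTime` (timestamp) and `sensorId`
// already parsed from the source, e.g. Kafka.
val countsPerWindow = events
  .withWatermark("eventTime", "10 minutes")                        // tolerate late records
  .groupBy(window(col("eventTime"), "1 minute"), col("sensorId"))  // group by event time, not arrival time
  .count()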

Keeping only small window of output data in Spark Structured Streaming

I am interested in using Spark Structured Streaming for real-time data processing on data from, for example, the last ~24 hours of running, but I am not able to find the correct solution for this problem.
Some useful information about the situation:
Data is constantly flowing in, so the stream is active 24/7.
Spark does some processing and then writes data to files (e.g. Parquet).
Watermarking is used to reduce the state size.
Someone wants to work with only the most recent data returned from Spark Structured Streaming (for example, all data from the last 24 hours) to get a quick view of what happened recently and for further, very specific analysis.
From what I understand, watermarking helps manage the state size so Spark does not hold all state data indefinitely. This is a good thing and solves one problem of running 24/7.
The other problem is output data. Currently Spark only appends data, so the output grows bigger and bigger. Using the memory sink for testing creates memory problems. I didn't try the file sink because it creates one file for each record (ugh), so there's a risk of using up all available inodes extremely quickly. I can create one file per window with the file sink.
So my question is:
Is it possible to force Spark Structured Streaming to delete output data after some amount of time, when it is no longer needed? I want to keep output data from only, say, the last 24 hours. Is there any built-in solution, or do I need to do it on my own? If I had to do it on my own, wouldn't the checkpoint data and Spark metadata get corrupted?
Using watermarking will allow us to keep only the last 24 hours:
df.withWatermark("timestamp", "24 hours") // where "timestamp" is your event-time field
From the Spark documentation:
In Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly.
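Roughly, the watermark fits into a query like this sketch (field names, window size, and paths are assumptions); it bounds the state Spark keeps, while already-written output files would still need separate cleanup:
import org.apache.spark.sql.functions.{col, window}

// Sketch (assumed schema and paths): the watermark lets Spark drop state older
// than 24 hours; it does not delete parquet files that were already written.
val windowed = df
  .withWatermark("timestamp", "24 hours")
  .groupBy(window(col("timestamp"), "1 hour"))
  .count()

val query = windowed.writeStream
  .format("parquet")
  .option("path", "/data/output")              // hypothetical output path
  .option("checkpointLocation", "/data/chk")   // hypothetical checkpoint path
  .outputMode("append")
  .start()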

Spark Streaming input rate drop

Running a Spark Streaming job, I have encountered the following behavior more than once. Processing starts well: the processing time for each batch is well below the batch interval. Then suddenly, the input rate drops to near zero. See these graphs.
This happens even though the program could keep up, and it slows down execution considerably. I believe the drop happens when there is not much unprocessed data left, but because the rate is so low, those final records take up most of the time needed to run the job. Is there any way to avoid this and speed things up?
I am using PySpark with Spark 1.6.2 and using the direct approach for Kafka streaming. Backpressure is turned on and there is a maxRatePerPartition of 100.
Setting backpressure is more meaningful with older Spark Streaming versions, where you need receivers to consume the messages from a stream. Since Spark 1.3 there has been the receiver-less "direct" approach, which ensures stronger end-to-end guarantees, so you do not need to worry about backpressure as much: Spark does most of the fine-tuning.
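For reference, the settings mentioned here look roughly like the following (values are illustrative):
import org.apache.spark.SparkConf

// Illustrative values for rate control with the Kafka direct approach.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")        // let Spark adapt the ingest rate
  .set("spark.streaming.kafka.maxRatePerPartition", "100")    // hard cap: records per partition per second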

Spark: processing multiple kafka topic in parallel

I am using Spark 1.5.2. I need to run a Spark Streaming job with Kafka as the streaming source, reading from multiple Kafka topics and processing each topic differently.
Is it a good idea to do this in the same job? If so, should I create a single stream with multiple partitions, or a different stream for each topic?
I am using the Kafka direct stream. As far as I know, Spark launches long-running receivers for each partition. I have a relatively small cluster, 6 nodes with 4 cores each. If I have many topics and many partitions per topic, would efficiency be impacted because most executors are busy with long-running receivers? Please correct me if my understanding is wrong here.
I made the following observations, in case they are helpful for someone:
In the Kafka direct stream, the receivers are not run as long-running tasks. At the beginning of each batch interval, the data is first read from Kafka in the executors; once read, the processing part takes over.
If we create a single stream with multiple topics, the topics are read one after the other. Also, filtering the DStream to apply different processing logic per topic adds another step to the job.
Creating multiple streams helps in two ways: 1. You don't need to apply a filter operation to process different topics differently. 2. You can read multiple streams in parallel (as opposed to one by one in the case of a single stream). To do so, there is an undocumented config parameter, spark.streaming.concurrentJobs. So I decided to create multiple streams:
sparkConf.set("spark.streaming.concurrentJobs", "4");
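A minimal sketch of the multiple-streams approach with the old 0.8 direct Kafka API (as used around Spark 1.5/1.6); topic names, broker address, and processing are placeholders:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: one direct stream per topic, each with its own processing logic.
// Assumes an existing StreamingContext `ssc`.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

val topicA = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicA"))
val topicB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topicB"))

topicA.foreachRDD(rdd => ()) // topic-A specific logic goes here
topicB.foreachRDD(rdd => ()) // topic-B specific logic goes here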
I think the right solution depends on your use case.
If your processing logic is the same for data from all topics, then without doubt this is the better approach.
If the processing logic is different, I guess you get a single RDD from all the topics and have to create a paired RDD for each processing logic and handle it separately. The problem is that this creates a sort of grouping in the processing, and the overall processing speed will be determined by the topic that takes the longest to process, so topics with less data have to wait until data from all topics is processed. One advantage is that if it's time-series data, the processing proceeds together, which might be a good thing.
Another advantage of running independent jobs is that you get better control and can adjust resource sharing; for example, jobs that process a high-throughput topic can be allocated more CPU/memory.

In-order processing in Spark Streaming

Is it possible to enforce in-order processing in Spark Streaming? Our use case is reading events from Kafka, where each topic needs to be processed in order.
From what I can tell it's impossible: each stream is broken into RDDs, and RDDs are processed in parallel, so there is no way to guarantee order.
You could force the RDD to be a single partition, which removes any parallelism.
"Our use case is reading events from Kafka, where each topic needs to be processed in order. "
As per my understanding, each topic forms a separate DStream, so you could process each DStream one after another.
But most likely you mean you want to process the events you are getting from one Kafka topic in order. In that case, you should not depend on the ordering of records within an RDD; rather, you should tag each record with a timestamp when you first see it (probably way upstream) and use that timestamp to order records later on.
You have other choices, which are bad :)
As Holden suggests, put everything in one partition
Partition with some increasing function based on receiving time, so you fill up partitions one after another. Then you can use zipWithIndex reliably.
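A rough sketch of the timestamp-tagging idea (the stream name and record shape are assumptions): sort each batch by the upstream timestamp before handling it.
// Sketch: assumes a DStream[(Long, String)] of (eventTimeMillis, payload),
// where the timestamp was tagged upstream as suggested above.
taggedStream.foreachRDD { rdd =>
  rdd.sortBy { case (eventTime, _) => eventTime }  // restore event order within the batch
    .zipWithIndex()                                // stable sequence numbers after the sort
    .toLocalIterator                               // walk partitions one at a time, in order, on the driver
    .foreach { case ((eventTime, payload), seq) =>
      // handle one event at a time, in sequence
      println(s"$seq: $eventTime -> $payload")
    }
}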
