PySpark Structured Streaming: continuous vs. processingTime triggers

I've been looking into using triggers for a streaming job, but the differences between the continuous trigger and the processingTime trigger are not clear to me.
As far as I've read on different sites:
continuous is just an attempt to make streaming near real-time instead of micro-batch based, with much lower latency (as low as 1 ms).
As of the time of writing this question, it only supports a couple of sources and sinks, such as Kafka.
Are these two points the only differences between the two triggers?

You are pretty much right. Structured Streaming's continuous trigger was added to address low-latency needs by running a continuous query, unlike the micro-batch approach, where latency depends on the processing time and the duration of each batch job (the micro-batch query).
The docs are pretty useful if you want to go more in-depth.
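To make the contrast concrete, here is a minimal sketch of how each trigger is set in PySpark; the broker address and topic names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()

source = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")            # placeholder topic
          .load())

# processingTime: classic micro-batch; a batch is planned every 5 seconds,
# so end-to-end latency is roughly the trigger interval plus batch duration.
micro = (source.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "out-micro")              # placeholder topic
         .option("checkpointLocation", "/tmp/cp-micro")
         .trigger(processingTime="5 seconds")
         .start())

# continuous: long-running tasks that process records as they arrive
# (~1 ms latency); the argument is the checkpoint interval, not a batch
# interval, and only map-like queries over sources/sinks such as Kafka
# are supported.
cont = (source.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "out-cont")                # placeholder topic
        .option("checkpointLocation", "/tmp/cp-cont")
        .trigger(continuous="1 second")
        .start())
```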

Related

Is windowing based on event time possible with Spark Streaming?

According to the Dataflow Model paper, "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing":
MillWheel and Spark Streaming are both sufficiently scalable,
fault-tolerant, and low-latency to act as reasonable substrates, but
lack high-level programming models that make calculating event-time
sessions straightforward.
Is it always the case?
No, it is not.
To quote from https://dzone.com/articles/spark-streaming-vs-structured-streaming so as to save on my lunch time!:
One big issue in the streaming world is how to process data according
to event-time.
Event-time is the time when the event actually happened. It is not
necessary for the source of the streaming engine to provide data in
real-time. There may be latencies in data generation and in handing over
the data to the processing engine. There is no option in Spark
Streaming to work on the data using event-time. It only works with
the timestamp when the data is received by Spark. Based on the
ingestion timestamp, Spark Streaming puts the data in a batch even if
the event was generated early and belongs to an earlier batch, which
may result in less accurate information, as it is effectively data
loss.
On the other hand, Structured Streaming provides the functionality to
process data on the basis of event-time when the timestamp of the
event is included in the data received. This is a major feature
introduced in Structured Streaming which provides a different way of
processing the data according to the time of data generation in the
real world. With this, we can handle data coming in late and get more
accurate results.
With its event-time handling of late data, Structured Streaming has the edge over Spark Streaming.
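For illustration, here is a minimal sketch of event-time windowing with late-data handling in Structured Streaming; the Kafka source and the eventTime mapping are placeholder assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("event-time-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          # In practice the event time would be parsed out of the payload;
          # here we just rename Kafka's timestamp column for illustration.
          .selectExpr("CAST(value AS STRING) AS body",
                      "timestamp AS eventTime"))

# Count events per 10-minute event-time window, accepting data that is
# up to 5 minutes late; anything later is dropped via the watermark.
counts = (events
          .withWatermark("eventTime", "5 minutes")
          .groupBy(window(col("eventTime"), "10 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/cp-window")
         .start())
```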

5 Minutes Spark Batch Job vs Streaming Job

I am trying to figure out which would be the better approach.
I have a Spark batch job that is scheduled to run every 5 mins, and it takes 2-3 mins to execute.
Since Spark 2.0 added support for dynamic allocation (spark.streaming.dynamicAllocation.enabled), is it a good idea to make it a streaming job that pulls data from the source every 5 mins?
What should I keep in mind while choosing between a streaming and a batch job?
Spark Streaming is an outdated technology; its successor is Structured Streaming.
If you process data every 5 minutes, you are doing batch processing. You can use the Structured Streaming framework and trigger it every 5 minutes to imitate batch processing (see the sketch below), but I usually wouldn't do that.
Structured Streaming has many more limitations than normal Spark. For example, you can only write to Kafka or to a file; otherwise you need to implement the sink yourself using the Foreach sink. Also, if you use a file sink you cannot update it, only append to it. And there are operations that are not supported in Structured Streaming, as well as actions you cannot perform unless you do an aggregation first.
I might use Structured Streaming for batch processing if I read from or write to Kafka, because they work well together and everything is pre-implemented. Another advantage of Structured Streaming is that you automatically continue reading from the place where you stopped.
For more information refer to the Structured Streaming Programming Guide.
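As a sketch of the "imitate batch" pattern mentioned above: with trigger(once=True), the query processes everything new since the last checkpoint and stops, so an external scheduler (e.g. cron) can launch it every 5 minutes. Topic and paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("once-per-run").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load())

query = (df.writeStream
         .format("parquet")                          # file sink is append-only
         .option("path", "/data/out")
         .option("checkpointLocation", "/data/cp")   # progress survives restarts
         .trigger(once=True)
         .start())
query.awaitTermination()
```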
When deciding between streaming and batch, one needs to look at various factors. I am listing some below; based on your use case, you can decide which is more suitable.
1) Input data characteristics - continuous input vs. batch input
If input data arrives in batches, use batch processing.
If input data arrives continuously, stream processing may be more useful; consider the other factors before reaching a conclusion.
2) Output latency
If the required output latency is very low, consider stream processing.
If output latency does not matter, choose batch processing.
3) Batch size (time)
A general rule of thumb: use batch processing if the batch size is more than a minute; otherwise stream processing is required. This is because triggering/spawning a batch process adds latency to the overall processing time.
4) Resource usage
What is the usage pattern of the resources in your cluster?
Are there more batch jobs that execute when other batch jobs are done? If multiple batch jobs run one after another and use the cluster resources optimally, then batch jobs are the better option.
A batch job runs at its scheduled time, and cluster resources sit idle afterwards. If data arrives continuously, consider running a streaming job instead: it may require fewer resources for processing, and output becomes available with lower latency.
There are other things to consider as well: replay, manageability (streaming is more complex), the team's existing skills, etc.
Regarding spark.streaming.dynamicAllocation.enabled, I would avoid using it, because if the input rate varies a lot, executors will be killed and created very frequently, which adds to the latency.

How to avoid Code Redundancy in Lambda Architecture?

We have an existing batch process that works as described below.
Hive SQL is used for daily batch processing.
Data is ingested from either files or an RDBMS.
Data flows Raw --> Staging --> Mart, where staging-to-mart covers all the business transformations and raw-to-staging is just cleansing and formatting of the data.
Now, as part of getting real (or near-real) time data, I am evaluating the Lambda Architecture, and this is the plan:
All source systems will land on Kafka.
The same batch processing system will consume the Kafka topics.
A new Spark application will consume the Kafka topics for streaming.
The serving layer will create views that combine the aggregate data from streaming and batch for real (near-real) time processing.
The problem is that the logic will be duplicated in HiveQL (batch) and Spark (streaming). Is there a way I can avoid or minimize this?
You can build your processing stages using Spark SQL and Spark Structured Streaming: https://spark.apache.org/docs/2.2.0/structured-streaming-programming-guide.html. Depending on your needs there can be some incompatibilities, but I'd try to build the Spark aggregations and transformations using the Dataset[_] API and then run it both ways, batch and streaming.
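A sketch of that idea in PySpark terms (DataFrame rather than the Scala Dataset[_]): define the business logic once as a function over a DataFrame and feed it either a batch or a streaming source. The schema, paths, and topic below are hypothetical.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.appName("shared-logic").getOrCreate()

# Hypothetical payload schema shared by both paths.
schema = StructType([StructField("amount", DoubleType()),
                     StructField("fx_rate", DoubleType())])

def enrich(df: DataFrame) -> DataFrame:
    # Business transformation written once, applied to batch and stream.
    return (df.filter(col("amount") > 0)
              .withColumn("amount_usd", col("amount") * col("fx_rate")))

# Batch path: a static table (path is a placeholder).
batch_out = enrich(spark.read.parquet("/mart/raw"))

# Streaming path: parse JSON payloads from Kafka into the same schema,
# then apply the identical logic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "raw-events")
       .load())
stream_out = enrich(raw.select(from_json(col("value").cast("string"), schema)
                               .alias("e"))
                       .select("e.*"))
```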
The problem of a duplicated code base is inherent in the lambda architecture; it gets a mention in the 'Criticism' section of the Wikipedia page.
Another issue is that the data between batch and stream are not in sync, which can lead to unexpected results when bringing the data together, for example when joining across stream and batch while keys do not yet exist on the batch side.
I believe the lambda architecture comes from a belief that streaming is complex and expensive, so keep batch as much as possible and add streaming only for those elements that require near-real time: we already have batch, let's add a few streaming things.
An alternative architecture is to use streaming for everything. This is based on the realization that batch is a special case of streaming, so you do your batch and stream processing on a single streaming platform.
Further reading:
Use Spark Structured Streaming for batch
Lambda architecture issues and how using only streaming solves them
Questioning the Lambda Architecture

Spark Streaming Real time integration with Kafka

I have integrated a Spark Streaming process with Kafka to read a particular topic. I created the Spark context with a polling time of 5 seconds, and it works fine. But if I want to access the feed in real time, can I reduce it further to 1 second (would that be overkill?), or is there a better option to handle this situation?
Spark Structured Streaming offers several modes, or "triggers", for processing time. You sacrifice throughput for lower latency by using the continuous processing mode, and you sacrifice latency for higher throughput by increasing the trigger duration. You should be fine setting the micro-batch duration to 1 s in Scala and 2 s in Python.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers
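The question itself uses the legacy DStream API, where the "polling time" is the batch interval set on the StreamingContext. A minimal sketch of lowering it to 1 second, with a socket receiver as a stand-in (Python Kafka DStream support was removed in Spark 3.0, so Structured Streaming with a short processingTime trigger is the usual route today):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="one-second-batches")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # any receiver works here
lines.pprint()

ssc.start()
ssc.awaitTermination()
```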

How to achieve ingestion time?

I found the distinction between different notions of time in the Apache Flink documentation in Event Time / Processing Time / Ingestion Time.
Event time is the time that each individual event occurred on its producing device.
And that is what datasets come with, so event time is available in Spark Structured Streaming out of the box.
Processing time refers to the system time of the machine that is executing the respective operation.
Ingestion time is the time that events enter Flink.
Processing time and ingestion time are my concern. I think I know how to achieve processing time, but I am not sure about ingestion time (or perhaps it's the opposite).
How to achieve ingestion time in Spark Structured Streaming 2.2 and later?
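One way to approximate ingestion time is to stamp records as they are read: current_timestamp() is resolved when each micro-batch is planned, which is effectively the time the data entered Spark. The socket source below is just a stand-in (it also offers .option("includeTimestamp", True) to attach an arrival timestamp itself); from there, the ingestion-time column can be used like any event-time column.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, window

spark = SparkSession.builder.appName("ingestion-time").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load()
         .withColumn("ingestTime", current_timestamp()))  # ingestion stamp

# ingestTime now behaves like an event-time column for windows/watermarks.
counts = (lines
          .withWatermark("ingestTime", "1 minute")
          .groupBy(window("ingestTime", "30 seconds"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```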
