Spark streaming with predefined ordering - apache-spark

I have two streaming dataframes - firstDataFrame and secondDataframe. I want to stream firstDataframe completely. And if the first streaming finishes successfully, only then i would like to stream the other dataframe
For example, in the below code, I would like the first streaming action to execute completely and only then the second to begin
firstDataframe.writeStream.format("console").start
secondDataframe.writeStream.format("console").start
Spark follows FIFO job scheduling by default. This means it would give priority to the first streaming job. However, if the first streaming job does not require all the available resources, it would start the second streaming job in parallel. I essentially want to avoid this parallelism. Is there a way to do this?
Reference: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

Related

Spark Structured Streaming - How to ignore checkpoint?

I'm reading messages from Kafka stream using microbatching (readStream), processing them and writing results to another Kafka topic via writeStream. The job (streaming query) is designed to run "forever", processing microbatches of size 10 seconds (of processing time). The checkpointDirectory option is set, since Spark requires checkpointing.
However, when I try to submit another query with the same source stream (same topic etc.) but possibly different processing algorithm), Spark finishes the previous running query and creates a new one with the same ID (so it starts from the very same offset on which the previous job "finished").
How to tell Spark that the second job is different from the first one, so there is no need to restore from checkpoint (i.e. intended behaviour is to create a completely new streaming query not connected to previous one, and keep the previous one running)?
You can achieve independence of the two streaming queries by setting the checkpointLocation option in their respective writeStream call. You should not set the checkpoint location centrally in the SparkSession.
That way, they can run independently and will not interfere from each other.

5 Minutes Spark Batch Job vs Streaming Job

I am trying to figure out what should be a better approach.
I have a Spark Batch Job which is scheduled to run every 5 mints and it takes 2-3 mints to execute.
Since Spark 2.0 have added support for dynamic allocation spark.streaming.dynamicAllocation.enabled, Is it a good idea to make its a streaming job which pulls data from source every 5 mints?
What are things I should keep in mind while choosing between streaming/batch job?
Spark Streaming is an outdated technology. Its successor is Structured Streaming.
If you do processing every 5 mins so you do batch processing. You can use the Structured Streaming framework and trigger it every 5 mins to imitate batch processing, but I usually wouldn't do that.
Structured Streaming has a lot more limitations than normal Spark. For example you can only write to Kafka or to a file, or else you need to implement the sink by yourself using Foreach sink. Also if you use a File sink then you cannot update it, but only append to it. Also there are operations that are not supported in Structured Streaming and there are actions that you cannot perform unless you do an aggrigation before.
I might use Structured Straming for batch processing if I read from or write to Kafka because they work well together and everything is pre-implemented. Another advantage of using Structured Streaming is that you automatically continue reading from the place you stopped.
For more information refer to Structured Streaming Programming Guide.
Deciding between streaming vs. batch, one needs to look into various factors. I am listing some below and based on your use case, you can decide which is more suitable.
1) Input Data Characteristics - Continuous input vs batch input
If input data is arriving in batch, use batch processing.
Else if input data is arriving continuously, stream processing may be more useful. Consider other factors to reach to a conclusion.
2) Output Latency
If required latency of output is very less, consider stream processing.
Else if latency of output does not matter, choose batch processing.
3) Batch size (time)
A general rule of thumb is use batch processing if the batch size > 1 min otherwise stream processing is required. This is because trigerring/spawning of batch process adds latency to overall processing time.
4) Resource Usage
What's the usage pattern of resources in your cluster ?
Are there more batch jobs which execute when other batch jobs are done ? Having more than one batch jobs running one after other and are using cluster respurces optimally. Then having batch jobs is better option.
Batch job runs at it's schedule time and resources in cluster are idle after that. Consider running streaming job if data is arriving continuously, less resources may be required for processing and output will become available with less latency.
There are other things to consider - Replay, Manageability (Streaming is more complex), Existing skill of team etc.
Regarding spark.streaming.dynamicAllocation.enabled, I would avoid using it because if the rate of input varies a lot, executors will be killed and created very frequently which would add to latency.

Alternate to recursively Running Spark-submit jobs

Below is the scenario I would need suggestions on,
Scenario:
Data ingestion is done through Nifi into Hive tables.
Spark program would have to perform ETL operations and complex joins on the data in Hive.
Since the data ingested from Nifi is continuous streaming, I would like the Spark jobs to run every 1 or 2 mins on the ingested data.
Which is the best option to use?
Trigger spark-submit jobs every 1 min using a scheduler?
How do we reduce the over head and time lag in submitting the job recursively to the spark cluster? Is there a better way to run a single program recursively?
Run a spark streaming job?
Can spark-streaming job get triggered automatically every 1 min and process the data from hive? [Can Spark-Streaming be triggered only time based?]
Is there any other efficient mechanism to handle such scenario?
Thanks in Advance
If you need something that runs every minute you better use spark-streaming and not batch.
You may want to get the data directly from kafka and not from hive table, since it is faster.
As for your questions what is better batch / stream. You can think of spark streaming as micro batch process that runs every "batch interval".
Read this : https://spark.apache.org/docs/latest/streaming-programming-guide.html

Order of messages with Spark Executors

I have a spark streaming application which streams data from kafka. I rely heavily on the order of the messages and hence just have one partition created in the kafka topic.
I am deploying this job in a cluster mode.
My question is: Since I am executing this in the cluster mode, I can have more than one executor pick up tasks and will I lose the order of messages received from kafka in that case. If not, how does spark guarantee order?
The distributed processing power wouldn't be there with single partition, so instead use multiple partitions and I would suggest to attach sequence number with every message, either counter or timestamp.
If you don't have timestamp within message then kafka streaming provide a way to extract message timestamp and you can use it to order events based on timestamp then run events based on sequence.
Refer answer on how to extract timestamp from kafka message.
To maintain order using single partition is the right choice, here are few other things you can try:
Turn off speculative execution
spark.speculation - If set to "true", performs speculative execution
of tasks. This means if one or more tasks are running slowly in a
stage, they will be re-launched.
Adjust your batch interval / sizes such that they can finish processing without any lag.
Cheers !

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understood the automated mapping that exists between a Kafka Partition and a Spark RDD partition and ultimately Spark Task. However in order to properly Size My Executor (in number of Core) and therefore ultimately my node and cluster, I need to understand something that seems to be glossed over in the documentations.
In Spark-Streaming how does exactly work the data consumption vs data processing vs task allocation, in other words:
Does a corresponding Spark task to a Kafka partition both read
and process the data altogether ?
The rational behind this question is that in the previous API, that
is, the receiver based, a TASK was dedicated for receiving the data,
meaning a number tasks slot of your executors were reserved for data
ingestion and the other were there for processing. This had an
impact on how you size your executor in term of cores.
Take for example the advise on how to launch spark-streaming with
--master local. Everyone would tell that in the case of spark streaming,
one should put local[2] minimum, because one of the
core, will be dedicated to running the long receiving task that never
ends, and the other core will do the data processing.
So if the answer is that in this case, the task does both the reading
and the processing at once, then the question that follows, is that
really smart, i mean, this sounds like asynchronous. We want to be
able to fetch while we process so on the next processing the data is
already there. However if there only one core or more precisely to
both read the data and process them, how can both be done in
parallel, and how does that make things faster in general.
My original understand was that, things would have remain somehow the
same in the sense that, a task would be launch to read but that the
processing would be done in another task. That would mean that, if
the processing task is not done yet, we can still keep reading, until
a certain memory limit.
Can someone outline with clarity what is exactly going on here ?
EDIT1
We don't even have to have this memory limit control. Just the mere fact of being able to fetch while the processing is going on and stopping right there. In other words, the two process should be asynchronous and the limit is simply to be one step ahead. To me if somehow this is not happening, i find it extremely strange that Spark would implement something that break performance as such.
Does a corresponding Spark task to a Kafka partition both read and
process the data altogether ?
The relationship is very close to what you describe, if by talking about a task we're referring to the part of the graph that reads from kafka up until a shuffle operation. The flow of execution is as follows:
Driver reads offsets from all kafka topics and partitions
Driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.

Resources