Context: I have a Spark Structured Streaming job with Kafka as the source and S3 as the sink. The outputs in S3 are picked up again as input by other MapReduce jobs.
I therefore want to increase the size of the output files on S3 so that the MapReduce jobs work efficiently.
Currently, because of the small input size, the MapReduce jobs are taking far too long to complete.
Is there a way to configure the streaming job to wait for at least 'X' number of records before processing?
You probably want to delay the micro-batch trigger until sufficient data is available at the source. You can use the minOffsetsPerTrigger option to make the query wait until enough data is available in Kafka.
Make sure to also set a sufficient maxTriggerDelay as per your application's needs.
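A minimal sketch (these options require Spark 3.3+; the broker address, topic name, and threshold values are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

    // Delay each micro-batch until at least ~1M offsets are available,
    // but fire anyway once 15 minutes have passed (maxTriggerDelay).
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "input-topic")                // placeholder
      .option("minOffsetsPerTrigger", "1000000")
      .option("maxTriggerDelay", "15m")
      .load()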
No, there is not, in practice.
No for Spark prior to 3.x.
Yes and no for Spark 3.x, which effectively equates to no.
minOffsetsPerTrigger was introduced, but it has a catch, as per below. That means the overall answer still remains no.
From the manuals:
Minimum number of offsets to be processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. Note, if the maxTriggerDelay is exceeded, a trigger will be fired even if the number of available offsets doesn't reach minOffsetsPerTrigger.
I am trying to figure out which approach would be better.
I have a Spark batch job which is scheduled to run every 5 minutes, and it takes 2-3 minutes to execute.
Since Spark 2.0 added support for dynamic allocation (spark.streaming.dynamicAllocation.enabled), is it a good idea to make this a streaming job which pulls data from the source every 5 minutes?
What should I keep in mind while choosing between a streaming and a batch job?
Spark Streaming is an outdated technology. Its successor is Structured Streaming.
If you do processing every 5 minutes, you are doing batch processing. You can use the Structured Streaming framework and trigger it every 5 minutes to imitate batch processing, but I usually wouldn't do that.
Structured Streaming has many more limitations than normal Spark. For example, you can only write to Kafka or to a file; otherwise you need to implement the sink yourself using the Foreach sink. Also, if you use a file sink, you cannot update it, only append to it. There are also operations that are not supported in Structured Streaming, and actions that you cannot perform unless you do an aggregation first.
I might use Structured Streaming for batch processing if I read from or write to Kafka, because they work well together and everything is pre-implemented. Another advantage of using Structured Streaming is that it automatically continues reading from the place where it stopped.
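As a minimal sketch of a query triggered every 5 minutes with a checkpoint so it resumes where it left off (broker, topic, and paths are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("five-minute-batches").getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "input-topic")                // placeholder
      .load()

    input.writeStream
      .format("parquet")
      .option("path", "s3://bucket/output/")             // placeholder
      .option("checkpointLocation", "s3://bucket/chk/")  // placeholder; enables resume
      .trigger(Trigger.ProcessingTime("5 minutes"))      // micro-batch every 5 minutes
      .start()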
For more information refer to Structured Streaming Programming Guide.
When deciding between streaming and batch, one needs to look at various factors. I am listing some below; based on your use case, you can decide which is more suitable.
1) Input Data Characteristics - Continuous input vs batch input
If input data is arriving in batches, use batch processing.
Otherwise, if input data is arriving continuously, stream processing may be more useful; consider the other factors to reach a conclusion.
2) Output Latency
If the required output latency is very low, consider stream processing.
Otherwise, if output latency does not matter, choose batch processing.
3) Batch size (time)
A general rule of thumb is to use batch processing if the batch size is greater than 1 minute; otherwise stream processing is required. This is because triggering/spawning a batch process adds latency to the overall processing time.
4) Resource Usage
What is the usage pattern of resources in your cluster?
Are there more batch jobs which execute when other batch jobs are done? If multiple batch jobs run one after another and use cluster resources optimally, then batch jobs are the better option.
If a batch job runs at its scheduled time and cluster resources sit idle afterwards, consider running a streaming job: if data is arriving continuously, fewer resources may be required for processing, and output will become available with lower latency.
There are other things to consider: replay, manageability (streaming is more complex), the existing skills of the team, etc.
Regarding spark.streaming.dynamicAllocation.enabled, I would avoid using it, because if the input rate varies a lot, executors will be killed and created very frequently, which adds to latency.
So, in one of our Kafka topics, there's close to 100 GB of data.
We are running Spark Structured Streaming to get the data into S3.
When the data is up to 10 GB, streaming runs fine and we are able to get the data into S3.
But with 100 GB, it is taking forever to stream the data from Kafka.
Question: How does Spark Streaming read data from Kafka?
Does it take the entire data from the current offset?
Or does it take it in batches of some size?
Spark will work off consumer groups, just as any other Kafka consumer, but in batches. Therefore it takes as much data as possible (based on various Kafka consumer settings) from the last consumed offsets. In theory, if you have the same number of partitions with the same commit interval as with 10 GB, it should only take 10x longer to do 100 GB. You haven't stated how long that currently takes, but to some people 1 minute vs. 10 minutes might seem like "forever", sure.
I would recommend plotting the consumer lag over time using the kafka-consumer-groups command-line tool combined with something like Burrow or Remora... If you notice an upward trend in the lag, then Spark is not consuming records fast enough.
To overcome this, the first option would be to ensure that the Spark executors are consuming all Kafka partitions evenly.
You'll also want to make sure you're not doing major data transforms other than simple filters and maps between consuming and writing the records, as this also introduces lag.
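As a rough sketch, you can also cap how much each micro-batch reads with maxOffsetsPerTrigger, so the job drains a large backlog in bounded, checkpointable chunks rather than one giant batch (broker, topic, and the cap value are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-backlog").getOrCreate()

    // Cap each micro-batch at ~5M offsets across all partitions.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "big-topic")                  // placeholder
      .option("maxOffsetsPerTrigger", "5000000")         // illustrative cap
      .load()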
For non-Spark approaches, I would like to point out that the Confluent S3 connector is also batch-y in that it'll only periodically flush to S3, but the consumption itself is still closer to real-time than Spark. I can verify that it's able to write very large S3 files (several GB in size), though, if the heap is large enough and the flush configurations are set to large values.
Secor by Pinterest is another option that requires no manual coding.
I have a Spark Streaming application which streams data from Kafka. I rely heavily on the order of the messages, and hence have just one partition in the Kafka topic.
I am deploying this job in cluster mode.
My question: since I am executing this in cluster mode, more than one executor can pick up tasks. Will I lose the order of messages received from Kafka in that case? If not, how does Spark guarantee order?
The distributed processing power wouldn't be there with a single partition, so instead use multiple partitions, and I would suggest attaching a sequence number to every message, either a counter or a timestamp.
If you don't have a timestamp within the message, Kafka streaming provides a way to extract the message timestamp, and you can use it to order events by timestamp and then process them in sequence.
Refer to the answer on how to extract a timestamp from a Kafka message.
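For instance, with Structured Streaming, the Kafka source already exposes the record timestamp as a column; a sketch of ordering by it (shown as a batch read for simplicity, since a plain orderBy is not supported on a streaming query; broker and topic are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("ordered-events").getOrCreate()

    val events = spark.read  // batch read for simplicity
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("subscribe", "events")                     // placeholder
      .load()
      // The Kafka source provides key, value, partition, offset, timestamp, ...
      .select(col("partition"), col("offset"), col("timestamp"), col("value"))
      .orderBy(col("timestamp"))  // order by the broker/event timestamp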
To maintain order, using a single partition is the right choice. Here are a few other things you can try:
Turn off speculative execution (see the sketch below):
spark.speculation - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
Adjust your batch interval/sizes such that batches can finish processing without any lag.
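A minimal sketch of turning speculation off when building the streaming context (app name and batch interval are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("ordered-stream")       // illustrative
      .set("spark.speculation", "false")  // no speculative re-launches of slow tasks

    val ssc = new StreamingContext(conf, Seconds(5))  // illustrative interval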
Cheers!
I use createDirectStream in my Spark Streaming application. I set the batch interval to 7 seconds, and most of the time the batch job finishes within about 5 seconds. However, in very rare cases the batch job takes around 60 seconds, and this delays some batches of jobs.
To cut down the total delay time, I hope I can process more streaming data, spread over the delayed jobs, at one time. This would help the streaming return to normal as soon as possible.
So, I want to know whether there is some method to dynamically update/merge the input batch size for Spark and Kafka when a delay appears.
You can set the "spark.streaming.backpressure.enabled" option to true.
If a batch delay occurs when the backpressure option is true, it initially starts with a small batch size and then dynamically changes to a larger batch size.
See the Spark configuration documentation; the relevant description is below.
Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
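A minimal sketch of enabling this in a DStream application (the rate value is illustrative, not a recommendation):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("backpressure-stream")                         // illustrative
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional upper bound on the dynamically chosen rate:
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // illustrative msgs/sec/partition

    val ssc = new StreamingContext(conf, Seconds(7))  // 7-second interval, as in the question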
Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them.
I think your problem can be solved by Spark Streaming Backpressure.
Check spark.streaming.backpressure.enabled and spark.streaming.backpressure.initialRate.
By default, spark.streaming.backpressure.initialRate is not set and spark.streaming.backpressure.enabled is disabled, so I suppose Spark will take as much as it can.
From the Apache Spark Kafka configuration:
spark.streaming.backpressure.enabled:
This enables the Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as the system can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
And since you want to control the first batch, or to be more specific, the number of messages in the first batch, I think you need spark.streaming.backpressure.initialRate:
spark.streaming.backpressure.initialRate:
This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.
This one is good when your Spark job (or rather your Spark workers overall) is able to process, let's say, 10,000 messages from Kafka, but the Kafka brokers give your job 100,000 messages.
You may also be interested in checking spark.streaming.kafka.maxRatePerPartition, and in some research and suggestions for these properties on a real example by Jeroen van Wilgenburg on his blog.
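A hedged sketch of combining these for the huge-first-batch case (all numbers are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // Cap the very first batch, before the backpressure estimator has any data:
      .set("spark.streaming.backpressure.initialRate", "10000")  // illustrative msgs/sec
      // Hard per-partition ceiling for every batch:
      .set("spark.streaming.kafka.maxRatePerPartition", "2000")  // illustrative msgs/sec/partition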
Apart from the above answers, batch size is the product of three parameters:
batchDuration: the time interval at which streaming data will be divided into batches (in seconds).
spark.streaming.kafka.maxRatePerPartition: sets the maximum number of messages per partition per second. Combined with batchDuration, this controls the batch size. You want maxRatePerPartition to be set and large (otherwise you are effectively throttling your job) and batchDuration to be very small.
The number of partitions in the Kafka topic.
For a better explanation of how this product works when backpressure is enabled/disabled, see "set spark.streaming.kafka.maxRatePerPartition for createDirectStream".
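As an illustrative worked example (the numbers are made up): with batchDuration = 5 seconds, spark.streaming.kafka.maxRatePerPartition = 1000 messages/second, and 10 partitions in the topic, the maximum batch size is 5 × 1000 × 10 = 50,000 messages per batch.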
Limiting the max batch size will greatly help to control the processing time; however, it increases the processing latency of messages.
By setting the properties below, we can control the batch size:
spark.streaming.receiver.maxRate=
spark.streaming.kafka.maxRatePerPartition=
You can even have the batch size set dynamically based on processing time, by enabling backpressure:
spark.streaming.backpressure.enabled:true
spark.streaming.backpressure.initialRate: