I have a Spark Streaming job running on Mesos.
All of its batches take exactly the same time, and this time is much longer than expected.
The jobs pull data from Kafka, process the data, insert it into Cassandra and push it back to Kafka into a different topic.
Each batch (below) has 3 jobs: 2 of them pull from Kafka, process and insert into Cassandra, while the other one pulls from Kafka, processes and pushes back into Kafka.
I inspected the batch in the Spark UI and found that the jobs all take the same time (4s), but drilling down further, they actually process for less than a second each and then sit through a gap of the same length (around 4 seconds).
Adding more executors or more processing power doesn't look like it will make a difference.
Details of batch: Processing time = 12s & total delay = 1.2 s ??
So I drill down into each job of the batch (they all take exactly the same time, 4s, even though they do different processing):
They all take 4 seconds to run one of their stages (the one that reads from Kafka).
Now I drill down into that stage for one of them (they are all very similar):
Why this wait? The whole thing actually takes only 0.5s to run; it is just waiting. Is it waiting for Kafka?
Has anyone experienced anything similar?
What could I have coded wrong or configured incorrectly?
EDIT:
Here is a minimal piece of code that triggers this behaviour. This makes me think that it must be something in the setup.
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object Test {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf(true)
    val streamingContext = new StreamingContext(sparkConf, Seconds(5))

    val kafkaParams = Map[String, String](
      "bootstrap.servers" -> "####,####,####",
      "group.id" -> "test"
    )
    val stream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      streamingContext, kafkaParams, Set("test_topic")
    )
    stream.map(t => "LEN=" + t._2.length).print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}
Even if all the executors are on the same node (spark.executor.cores=2, spark.cores.max=2), the problem is still there and it is exactly 4 seconds as before, with a single Mesos executor.
Even if the topic has no messages (a batch of 0 records), Spark Streaming takes 4 seconds for every batch.
The only way I have been able to fix this is by setting cores=1 and cores.max=1 so that only one task is created to execute.
This task has locality NODE_LOCAL. So it seems that when the locality is NODE_LOCAL the execution is instantaneous, but when the locality is ANY it takes 4 seconds to connect to Kafka. All the machines are on the same 10Gb network. Any idea why this would be?
The problem was with spark.locality.wait; this link gave me the idea.
Its default value is 3 seconds, and this whole time was being spent on every batch processed in Spark Streaming.
I set it to 0 seconds when submitting the job with Mesos (--conf spark.locality.wait=0) and everything now runs as expected.
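For reference, the same property can also be set programmatically on the SparkConf before building the StreamingContext; this is just a minimal sketch of the equivalent setting, not taken from the original job:

import org.apache.spark.SparkConf

// Disable the locality wait so tasks are scheduled immediately at ANY locality
// (equivalent to passing --conf spark.locality.wait=0 to spark-submit)
val sparkConf = new SparkConf(true)
  .set("spark.locality.wait", "0s")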
Related
We need to use maxOffsetsPerTrigger in the Kafka source with Trigger.Once() in Structured Streaming, but based on this issue it seems that Spark 3 reads allAvailable instead. Is there a way to achieve rate limiting in this situation?
Here is some sample code in Spark 3:
def options: Map[String, String] = Map(
  "kafka.bootstrap.servers" -> conf.getStringSeq("bootstrapServers").mkString(","),
  "subscribe" -> conf.getString("topic")
) ++ Option(conf.getLong("maxOffsetsPerTrigger")).map("maxOffsetsPerTrigger" -> _.toString)

val streamingQuery = sparkSession.readStream
  .format("kafka")
  .options(options)
  .load()
  .writeStream
  .trigger(Trigger.Once)
  .start()
There is no other way around it to properly set a rate limit. Since maxOffsetsPerTrigger is not applied to streaming jobs with the Once trigger, you could do one of the following to achieve an equivalent result:
Choose another trigger and use maxOffsetsPerTrigger to limit the rate, then kill the job manually once it has finished processing all the data.
Use the startingOffsets and endingOffsets options while making the job a batch job (see the sketch after this list). Repeat until you have processed all the data within the topic. However, there is a reason why "Streaming in RunOnce mode is better than Batch", as detailed here.
The last option would be to look into the linked pull request and compile Spark on your own.
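For illustration, here is a minimal sketch of the second option, a bounded batch read; the topic name, offsets and output path are placeholders, not from the original question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-batch-read").getOrCreate()

// Bounded batch read: only the offsets between startingOffsets and endingOffsets
// are consumed, which acts as a manual rate limit for one run.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test_topic")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", """{"test_topic":{"0":10000}}""")
  .load()

df.write.format("parquet").save("path/to/output")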
Here is how we "solved" this. It is basically the approach mike wrote about in the accepted answer.
In our case, the message size varied very little, so we knew how much time processing a batch takes. In a nutshell, we:
replaced Trigger.Once() with Trigger.ProcessingTime(<ms>), since maxOffsetsPerTrigger works with this mode,
killed the running query by calling awaitTermination(<ms>) to mimic Trigger.Once(),
and set the processing interval to be larger than the termination interval so that exactly one "batch" would fit in and be processed.
import org.apache.spark.sql.streaming.Trigger

val kafkaOptions = Map[String, String](
  "kafka.bootstrap.servers" -> "localhost:9092",
  "failOnDataLoss" -> "false",
  "subscribePattern" -> "testTopic",
  "startingOffsets" -> "earliest",
  "maxOffsetsPerTrigger" -> "10" // "batch" size
)

val streamWriterOptions = Map[String, String](
  "checkpointLocation" -> "path/to/checkpoints"
)

val processingInterval = 30000L
val terminationInterval = 15000L

sparkSession
  .readStream
  .format("kafka")
  .options(kafkaOptions)
  .load()
  .writeStream
  .options(streamWriterOptions)
  .format("console")
  .trigger(Trigger.ProcessingTime(processingInterval))
  .start()
  .awaitTermination(terminationInterval)
This works because the first batch will be read and processed while respecting the maxOffsetsPerTrigger limit, say in 10 seconds. The second batch then starts to be processed, but it is terminated in the middle of the operation after ~5s and never reaches the 30s mark. It stores the offsets correctly, though, so Spark picks up and processes this "killed" batch in the next run.
A downside of this approach is that you have to know approximately how much time it takes to process one "batch"; if you set the terminationInterval too low, the job's output will constantly be nothing.
Of course, if you don't care about the exact number of batches you process in one run, you can easily adjust the processingInterval to be several times smaller than the terminationInterval. In that case, you may process a varying number of batches in one go, while still respecting the value of maxOffsetsPerTrigger.
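For example (hypothetical numbers, not from our job), flipping the two intervals like this lets several capped micro-batches run before awaitTermination returns:

// Trigger roughly every 5 seconds, stop the query after 30 seconds:
// up to ~6 micro-batches, each limited by maxOffsetsPerTrigger, per run.
val processingInterval = 5000L
val terminationInterval = 30000L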
As you can see, my small application has 4 jobs which run for a total duration of 20.2 seconds; however, there is a big delay between jobs 1 and 2, causing the total time to be over a minute. Job number 1, runJob at SparkHadoopMapReduceWriter.scala:88, is performing a bulk upload of HFiles into an HBase table. Here is the code I used to load the files:
// Directory where the HFiles were written by the previous job
val outputDir = new Path(HBaseUtils.getHFilesStorageLocation(resolvedTableName))
val job = Job.getInstance(hBaseConf)
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, resolvedTableName)
job.setOutputFormatClass(classOf[HFileOutputFormat2])
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
val connection = ConnectionFactory.createConnection(job.getConfiguration)
val hBaseAdmin = connection.getAdmin
val table = TableName.valueOf(Bytes.toBytes(resolvedTableName))
val tab = connection.getTable(table).asInstanceOf[HTable]
val bulkLoader = new LoadIncrementalHFiles(job.getConfiguration)
preBulkUploadCallback.map(callback => callback())
// Move the generated HFiles into the regions of the target table
bulkLoader.doBulkLoad(outputDir, hBaseAdmin, tab, tab.getRegionLocator)
If anyone has any ideas, I would be very grateful.
I can see that there are 26 tasks in job 1, which is based on the number of HFiles created. Even though job 2 shows as completed in 2s, it takes some time to copy these files to the target location, and that's why you are getting a delay between jobs 2 and 3. This can be avoided by reducing the number of tasks in job 1.
Decrease the number of regions for the output table in HBase, which will reduce the number of tasks for your second job.
TableOutputFormat determines the splits based on the number of regions of a given table in HBase.
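As an illustration only (hypothetical table and column family names, HBase 1.x client API): the region count is fixed when the table is created pre-split, so creating it with fewer split points gives fewer regions and hence fewer HFile-writing tasks:

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val admin = connection.getAdmin

val descriptor = new HTableDescriptor(TableName.valueOf("my_table"))
descriptor.addFamily(new HColumnDescriptor("cf"))

// A single split key creates the table with only 2 regions,
// so the HFile-writing job gets far fewer tasks
val splitKeys = Array(Bytes.toBytes("m"))
admin.createTable(descriptor, splitKeys)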
Job number 1, runJob at SparkHadoopMapReduceWriter.scala:88, is performing a bulk upload
This is not quite true. This job merely creates the HFiles outside of HBase. The gap you see between this job and the next one could be explained by the actual bulk loading at bulkLoader.doBulkLoad. That operation involves only a metadata transfer and usually performs faster (in my experience), so you should check the driver logs to see where it hangs.
Thanks for your input, guys. I lowered the number of HFiles created in task 0, and this has decreased the lag by about 20%. I used
HFileOutputFormat2.configureIncrementalLoad(job, tab, tab.getRegionLocator)
which automatically calculates the number of reduce tasks to match the current number of regions for the table. I will say that we are using HBase backed by S3 on AWS EMR instead of the classical HDFS. I am now going to investigate whether this could be contributing to the lag.
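For context, here is a rough sketch (reusing the variable names from the question, so hBaseConf and resolvedTableName are assumed to exist) of how that single call can replace the manual output-format wiring from the original snippet:

val job = Job.getInstance(hBaseConf)
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, resolvedTableName)
val connection = ConnectionFactory.createConnection(job.getConfiguration)
val tab = connection.getTable(TableName.valueOf(resolvedTableName)).asInstanceOf[HTable]
// Configures the output format, map output key/value classes, total-order partitioner
// and one reduce task per region of the target table in a single call
HFileOutputFormat2.configureIncrementalLoad(job, tab, tab.getRegionLocator)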
I would like to know whether reading from a Kafka queue is faster using a batch Kafka RDD instead of KafkaDirectStream when I want to read the whole Kafka queue.
I've observed that reading from different partitions with a batch RDD does not result in concurrent Spark jobs. Is there some Spark property to configure in order to allow this behaviour?
Thanks.
Try running your Spark consumers in different threads or as different processes. That's the approach I take. I've observed that I get the best concurrency by allocating one consumer thread (or process) per topic partition.
Regarding your question about batch vs. KafkaDirectStream, I think even KafkaDirectStream is batch oriented. The batch interval can be specified in the streaming context, like this:
private static final int INTERVAL = 5000; // 5 seconds
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(INTERVAL));
There's a good image that describes how Spark Streaming is batch oriented here:
http://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#discretized-streams-dstreams
Spark is essentially a batch engine, and Spark Streaming takes batching closer to streaming by defining something called micro-batching. A micro-batch is nothing but a batch with a very low interval (it can be as low as 50ms, per the advice in the official documentation). So all that matters now is how low your micro-batch interval is going to be. If you keep it low, it will feel near real-time.
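As a minimal sketch (Scala, hypothetical application name), the micro-batch interval is just the Duration you pass when building the StreamingContext:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// A 500 ms micro-batch interval; lower values get you closer to "real-time"
val conf = new SparkConf().setAppName("low-latency-stream")
val ssc = new StreamingContext(conf, Milliseconds(500))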
On the Kafka consumer front, the Spark direct receiver runs as a separate task in each executor. So if you have as many executors as partitions, it fetches data from all partitions and creates an RDD out of them.
If you are talking about reading from multiple queues, you would create multiple DStreams, which would again need more executors to match the total number of partitions.
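A rough sketch of that last point, assuming a streamingContext and kafkaParams already defined as in the earlier snippets (topic names are placeholders):

import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.streaming.kafka.KafkaUtils

// One direct DStream per topic; union them so all partitions across both topics
// can be consumed in parallel, provided there are enough executor cores.
val topics = Seq("topic_a", "topic_b")
val streams = topics.map { topic =>
  KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
    streamingContext, kafkaParams, Set(topic))
}
val unified = streamingContext.union(streams)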
I am running a Spark Streaming process where I get a batch of 6000 events. But when I look at the executors, only one active task is running. I tried dynamic allocation as well as setting the number of executors, etc. Even if I have 15 executors, only one active task is running at a time. Can anyone please point out what I am doing wrong here?
It looks like you have only one partition in your DStream. You should try to explicitly repartition your input stream:
val input: DStream[...] = ...
val partitionedInput = input.repartition(numPartitions = 16)
This way you would have 16 partitions in your input DStream, and each of those partitions could be processed in a separate task (and each of those tasks could be executed on a separate executor).
Hi, I am new to Spark and Spark Streaming.
From the official documentation I could understand how to manipulate input data and save it.
The problem is that the Spark Streaming quick example made me confused.
I know that the job should get data from the DStream you have set up and do something with it, but since it is running 24/7, how will the application be loaded and run?
Will it run every n seconds, or will it run only once at the beginning and then enter a [read-process-loop] cycle?
BTW, I am using Python, so I checked the Python code of that example. If it is the latter case, how does Spark's executor know which code snippet is the loop part?
Spark Streaming actually does micro-batch processing. That means that every interval, which you can customize, a new batch is executed.
Look at the code of the example you mentioned:
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc,1)
You define a streaming context with a micro-batch interval of 1 second.
The subsequent code, which uses the streaming context,
lines = ssc.socketTextStream("localhost", 9999)
...
gets executed every second.
The streaming process is initially triggered by this line:
ssc.start() # Start the computation
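The quoted example then keeps the driver alive so the scheduler can keep launching a new micro-batch every second until the context is stopped:

ssc.awaitTermination()  # Wait for the computation to terminate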