Order of messages with Spark Executors - apache-spark

I have a spark streaming application which streams data from kafka. I rely heavily on the order of the messages and hence just have one partition created in the kafka topic.
I am deploying this job in a cluster mode.
My question is: Since I am executing this in the cluster mode, I can have more than one executor pick up tasks and will I lose the order of messages received from kafka in that case. If not, how does spark guarantee order?

The distributed processing power wouldn't be there with single partition, so instead use multiple partitions and I would suggest to attach sequence number with every message, either counter or timestamp.
If you don't have timestamp within message then kafka streaming provide a way to extract message timestamp and you can use it to order events based on timestamp then run events based on sequence.
Refer answer on how to extract timestamp from kafka message.

To maintain order using single partition is the right choice, here are few other things you can try:
Turn off speculative execution
spark.speculation - If set to "true", performs speculative execution
of tasks. This means if one or more tasks are running slowly in a
stage, they will be re-launched.
Adjust your batch interval / sizes such that they can finish processing without any lag.
Cheers !

Related

HBase batch loading with speed control cause of slow consumer

We need to load a big part of data from HBase using Spark.
Then we put it into Kafka and read by consumer. But consumer is too slow
At the same time Kafka memory is not enough to keep all scan result.
Our key contain ...yyyy.MM.dd, and now we load 30 days in one Spark job, using operator filter.
But we cant split job to many jobs, (30 jobs filtering each day), cause then each job will have to scan all HBase, and it will make summary scan to slow.
Now we launch Spark job with 100 threads, but we cant make speed slower by set less threads (for example 7 threads). Cause Kafka is used by third hands developers, that make Kafka sometimes too busy to keep any data. So, we need to control HBase scan speed, checking all time is there a memory in Kafka to store our data
We try to save scan result before load to Kafka into some place, for example in ORC files in hdfs, but scan result make many little files, it is problem to group them by memory (or there is a way, if you know please tell me how?), and store into hdfs little files bad. And merging such a files is very expensive operation and spend a lot of time that will make total time too slow
Sugess solutions:
Maybe it is possible to store scan result in hdfs by spark, by set some special flag in filter operator and then run 30 spark jobs to select data from saved result and put each result to Kafka when it possible
Maybe there is some existed mechanism in spark to stop and continue launched jobs
Maybe there is some existed mechanism in spark to separate result by batches (without control to stop and continue loading)
Maybe there is some existed mechanism in spark to separate result by batches (with control to stop and continue loading by external condition)
Maybe when Kafka will throw an exception (that there is no place to store data), there is some backpressure mechanism in spark that will stop scan for some time if there some exceptions appear in execution (but i guess that there is will be limited retry of restarting to execute operator, is it possible to set restart operation forever, if it is a real solution?). But better to keep some free place in Kafka, and not to wait untill it will be overloaded
Do using PageFilter in HBase (but i guess that it is hard to realize), or other solutions variants? And i guess that there is too many objects in memory to use PageFilter
P.S
This https://github.com/hortonworks-spark/shc/issues/108 will not help, we already use filter
Any ideas would be helpful

How does the default (unspecified) trigger determine the size of micro-batches in Structured Streaming?

When the query execution In Spark Structured Streaming has no setting about trigger,
import org.apache.spark.sql.streaming.Trigger
// Default trigger (runs micro-batch as soon as it can)
df.writeStream
.format("console")
//.trigger(???) // <--- Trigger intentionally omitted ----
.start()
As of Spark 2.4.3 (Aug 2019). The Structured Streaming Programming Guide - Triggers says
If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.
QUESTION: On which basis the default trigger determines the size of the micro-batches?
Let's say. The input source is Kafka. The job was interrupted for a day because of some outages. Then the same Spark job is restarted. It will then consume messages where it left off. Does that mean the first micro-batch will be a gigantic batch with 1 day of msg which accumulated in the Kafka topic while the job was stopped? Let assume the job takes 10 hours to process that big batch, then the next micro-batch has 10h worth of messages? And gradually until X iterations to catchup the backlog to arrive to smaller micro-batches.
On which basis the default trigger determines the size of the micro-batches?
It does not. Every trigger (however long) simply requests all sources for input datasets and whatever they give is processed downstream by operators. The sources know what to give as they know what has been consumed (processed) so far.
It is as if you asked about a batch structured query and the size of the data this single "trigger" requests to process (BTW there is ProcessingTime.Once trigger).
Does that mean the first micro-batch will be a gigantic batch with 1 day of msg which accumulated in the Kafka topic while the job was stopped?
Almost (and really has not much if at all to do with Spark Structured Streaming).
The number of records the underlying Kafka consumer gets to process is configured by max.poll.records and perhaps by some other configuration properties (see Increase the number of messages read by a Kafka consumer in a single poll).
Since Spark Structured Streaming uses Kafka data source that is simply a wrapper of Kafka Consumer API whatever happens in a single micro-batch is equivalent to this single Consumer.poll call.
You can configure the underlying Kafka consumer using options with kafka. prefix (e.g. kafka.bootstrap.servers) that are considered for the Kafka consumers on the driver and executors.

Spark Direct Stream Kafka order of events

I have a question regarding reading data with Spark Direct Streaming (Spark 1.6) from Kafka 0.9 saving in HBase.
I am trying to do updates on specific row-keys in an HBase table as recieved from Kafka and I need to ensure the order of events is kept (data received at t0 is saved in HBase for sure before data received at t1 ).
The row key, represents an UUID which is also the key of the message in Kafka, so at Kafka level, I am sure that the events corresponding to a specific UUID are ordered at partition level.
My problem begins when I start reading using Spark.
Using the direct stream approach, each executor will read from one partition. I am not doing any shuffling of data (just parse and save), so my events won't get messed up among the RDD, but I am worried that when the executor reads the partition, it won't maintain the order so I will end up with incorrect data in HBase when I save them.
How can I ensure that the order is kept at executor level, especially if I use multiple cores in one executor (which from my understanding result in multiple threads)?
I think I can also live with 1 core if this fixes the issue and by turning off speculative execution, enabling spark back pressure optimizations and keeping the maximum retries on executor to 1.
I have also thought about implementing a sort on the events at spark partition level using the Kafka offset.
Any advice?
Thanks a lot in advance!

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understood the automated mapping that exists between a Kafka Partition and a Spark RDD partition and ultimately Spark Task. However in order to properly Size My Executor (in number of Core) and therefore ultimately my node and cluster, I need to understand something that seems to be glossed over in the documentations.
In Spark-Streaming how does exactly work the data consumption vs data processing vs task allocation, in other words:
Does a corresponding Spark task to a Kafka partition both read
and process the data altogether ?
The rational behind this question is that in the previous API, that
is, the receiver based, a TASK was dedicated for receiving the data,
meaning a number tasks slot of your executors were reserved for data
ingestion and the other were there for processing. This had an
impact on how you size your executor in term of cores.
Take for example the advise on how to launch spark-streaming with
--master local. Everyone would tell that in the case of spark streaming,
one should put local[2] minimum, because one of the
core, will be dedicated to running the long receiving task that never
ends, and the other core will do the data processing.
So if the answer is that in this case, the task does both the reading
and the processing at once, then the question that follows, is that
really smart, i mean, this sounds like asynchronous. We want to be
able to fetch while we process so on the next processing the data is
already there. However if there only one core or more precisely to
both read the data and process them, how can both be done in
parallel, and how does that make things faster in general.
My original understand was that, things would have remain somehow the
same in the sense that, a task would be launch to read but that the
processing would be done in another task. That would mean that, if
the processing task is not done yet, we can still keep reading, until
a certain memory limit.
Can someone outline with clarity what is exactly going on here ?
EDIT1
We don't even have to have this memory limit control. Just the mere fact of being able to fetch while the processing is going on and stopping right there. In other words, the two process should be asynchronous and the limit is simply to be one step ahead. To me if somehow this is not happening, i find it extremely strange that Spark would implement something that break performance as such.
Does a corresponding Spark task to a Kafka partition both read and
process the data altogether ?
The relationship is very close to what you describe, if by talking about a task we're referring to the part of the graph that reads from kafka up until a shuffle operation. The flow of execution is as follows:
Driver reads offsets from all kafka topics and partitions
Driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.

In-order processing in Spark Streaming

Is it possible to enforce in-order processing in Spark Streaming? Our use case is reading events from Kafka, where each topic needs to be processed in order.
From what I can tell it's impossible - each stream in broken into RDDs, and RDDS are processed in parallel, so there is no way to guaranty order.
You could force the RDD to be a single partition, which removes any parallelism.
"Our use case is reading events from Kafka, where each topic needs to be processed in order. "
As per my understanding, each topic forms separata Dstreams. So you should be process each Dstreams one after another.
But most likely you mean you want to process each events your are getting from 1 Kafka topic in order. In that case, you should not depend on ordering of record in a RDD, rather you should tag each record with the timestamp when you first see them (probably way upstream) and use this timestamp to order later on.
You have other choices, which are bad :)
As Holden suggests, put everything in one partition
Partition with some increasing function based on receiving time, so you fill up partitions one after another. Then you can use zipWithIndex reliably.

Resources