I want to process, in real time, the messages reported at a web server. The messages belong to different sessions, and I want to do some session-level aggregations. For this purpose I plan to use Spark Streaming fed by Kafka. Even before I start, I have listed a few challenges that this architecture is going to raise. Can someone familiar with this ecosystem help me out with these questions:
If each Kafka message belongs to a particular session, how to manage session affinity so that the same Spark executor sees all the messages linked to a session?
How to ensure that messages belonging to a session are processed by a Spark executor in the order they were reported at Kafka? Can we somehow achieve this without putting a constraint on thread count and incurring processing overheads (like sorting by message timestamp)?
When should session state be checkpointed? How is state resurrected from the last checkpoint if an executor node crashes? How is it resurrected if the driver node crashes?
How is state resurrected if a node (executor/driver) crashes before checkpointing its state? If Spark recreates the state RDD by replaying messages, where does it start replaying the Kafka messages from: the last checkpoint onwards, or does it process all the messages needed to recreate the partition? Can/does Spark Streaming resurrect state across multiple streaming batches, or only for the current batch, i.e. can the state be recovered if checkpointing was not done during the last batch?
If each Kafka message belongs to a particular session, how to manage
session affinity so that the same Spark executor sees all the messages
linked to a session?
Kafka divides topics into partitions, and every partition can only be read by one consumer (within a consumer group) at a time, so you need to make sure that all messages belonging to one session go into the same partition. Partition assignment is controlled via the key that you assign to every message, so the easiest way to achieve this is probably to use the session id as the key when sending data. That way the same consumer will get all messages for one session.
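For illustration, here is a minimal producer sketch that keys each message by session id (broker address, topic name, and the sessionId/payload values are made-up examples):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")   // hypothetical broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Using the session id as the record key means the default partitioner hashes it
// to a partition, so all messages of one session land in the same partition.
val sessionId = "session-42"                                   // made-up example values
val payload   = """{"event":"click","ts":1500000000}"""
producer.send(new ProducerRecord[String, String]("web-events", sessionId, payload))
producer.close()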
There is one caveat though: Kafka will rebalance the assignment of partitions to consumers when a consumer joins or leaves the consumer group. If this happens mid-session, it can (and will) happen that half the messages for that session go to one consumer and the other half go to a different consumer after the rebalance. To avoid this, you'll need to manually subscribe to specific partitions in your code, so that every processor has its own fixed set of partitions that never changes. Have a look at ConsumerStrategies.Assign in the Spark-Kafka integration code for this.
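A minimal sketch of such a fixed assignment with the spark-streaming-kafka-0-10 API (topic name, partitions, and consumer settings are illustrative, not taken from your setup):

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf().setAppName("session-aggregation")
val ssc  = new StreamingContext(conf, Seconds(15))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "session-aggregator",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Pin this stream to an explicit, fixed set of partitions; no group rebalance
// will ever move them to another consumer.
val partitions = Seq(new TopicPartition("web-events", 0), new TopicPartition("web-events", 1))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](partitions, kafkaParams)
)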
How to ensure that messages belonging to a session are processed by a
Spark executor in the order they were reported at Kafka? Can we
somehow achieve this without putting a constraint on thread count and
incurring processing overheads (like sorting by message timestamp)?
Kafka preserves ordering per partition, so there is not much you need to do here. The only thing is to avoid having multiple requests from the producer to the broker at the same time, which you can configure via the producer parameter max.in.flight.requests.per.connection. As long as you keep this at 1, you should be safe if I understand your setup correctly.
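Building on the producer sketch above, the ordering-related settings would be added to the props object before constructing the producer; something roughly like this (values are illustrative):

import org.apache.kafka.clients.producer.ProducerConfig

// With at most one in-flight request per connection, a retried batch cannot
// overtake an earlier one and reorder messages within a partition.
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
props.put(ProducerConfig.ACKS_CONFIG, "all")   // optional: also wait for full acknowledgement
props.put(ProducerConfig.RETRIES_CONFIG, "3")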
When should session state be checkpointed? How is state resurrected from the last checkpoint if an executor node crashes? How is it resurrected if the driver node crashes?
I'd suggest reading the offset storage section of the Spark Streaming + Kafka Integration Guide, which should answer a lot of questions already.
The short version is: you can persist your last read offsets into Kafka, and you should definitely do this whenever you checkpoint your executors. That way, whenever a new executor picks up processing, no matter whether it was restored from a checkpoint or not, it will know where to read from in Kafka.
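With the 0-10 direct stream, committing each batch's offsets back to Kafka after your own processing could look roughly like this (the stream variable is the direct stream from the earlier sketch; the state-update step is a placeholder):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Capture this batch's offset ranges before doing anything else with the RDD.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... update / checkpoint your session state here ...

  // Only once the state is safely stored, commit the offsets back to Kafka,
  // so a restarted job resumes from here instead of reprocessing or skipping data.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}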
How is state resurrected if a node (executor/driver) crashes before checkpointing its state? If Spark recreates the state RDD by replaying messages, where does it start replaying the Kafka messages from: the last checkpoint onwards, or does it process all the messages needed to recreate the partition? Can/does Spark Streaming resurrect state across multiple streaming batches, or only for the current batch, i.e. can the state be recovered if checkpointing was not done during the last batch?
My Spark knowledge here is a bit shaky, but I would say that this is not something that is done by Kafka/Spark for you, but rather something that you actively need to influence with your code.
By default, if a new Kafka stream is started up and finds no previously committed offset, it will simply start reading from the end of the topic, so it would only get messages that are produced after the consumer is started. If you need to resurrect state, then you either need to know from what exact offset you want to start re-reading messages, or just start reading from the beginning again. You can pass an offset to read from into the above-mentioned .Assign() method when distributing partitions.
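A sketch of the latter, reusing ssc and kafkaParams from the earlier sketch; the offsets themselves are hypothetical values you would load from wherever you stored them:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Hypothetical starting offsets recovered from your own offset store.
val fromOffsets = Map(
  new TopicPartition("web-events", 0) -> 42000L,
  new TopicPartition("web-events", 1) -> 41750L
)

val recoveredStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)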
I hope this helps a bit. I am sure it is by no means a complete answer to all of your questions, but it is a fairly wide field to work in; let me know if I can be of further help.
Related
I have a Spark Streaming application which streams data from Kafka. I rely heavily on the order of the messages, and hence I have just one partition created in the Kafka topic.
I am deploying this job in cluster mode.
My question is: since I am executing this in cluster mode, more than one executor can pick up tasks. Will I lose the order of messages received from Kafka in that case? If not, how does Spark guarantee order?
The distributed processing power wouldn't be there with a single partition, so instead use multiple partitions, and I would suggest attaching a sequence number to every message, either a counter or a timestamp.
If you don't have a timestamp within the message, then Kafka streaming provides a way to extract the message timestamp, and you can use it to order events by timestamp and then process them in sequence.
Refer to this answer on how to extract the timestamp from a Kafka message.
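As a rough sketch, assuming a 0-10 direct stream named stream and a placeholder process(key, value) function, ordering a micro-batch by the Kafka record timestamp could look like this (collecting to the driver is only reasonable for small batches):

def process(key: String, value: String): Unit = println(s"$key -> $value")  // placeholder

stream.foreachRDD { rdd =>
  rdd
    .map(r => (r.timestamp(), r.key(), r.value()))  // extract plain values inside the reading task
    .sortBy(_._1)                                   // order the whole micro-batch by message timestamp
    .collect()                                      // assumes batches are small enough for the driver
    .foreach { case (_, key, value) => process(key, value) }
}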
Using a single partition to maintain order is the right choice; here are a few other things you can try (see the config sketch after these points):
Turn off speculative execution
spark.speculation - If set to "true", performs speculative execution
of tasks. This means if one or more tasks are running slowly in a
stage, they will be re-launched.
Adjust your batch interval / sizes such that they can finish processing without any lag.
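A minimal sketch of both settings (the application name and the 10-second interval are placeholders you would tune for your workload):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ordered-kafka-consumer")
  .set("spark.speculation", "false")   // do not re-launch "slow" tasks

// Choose a batch interval the job can reliably finish within, so batches never queue up.
val ssc = new StreamingContext(conf, Seconds(10))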
Cheers!
I understood the automated mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
In Spark Streaming, how exactly do data consumption, data processing, and task allocation work? In other words:
Does the Spark task corresponding to a Kafka partition both read and process the data?
The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning that a number of task slots of your executors were reserved for data ingestion and the others were there for processing. This had an impact on how you size your executors in terms of cores.
Take for example the advice on how to launch spark-streaming with --master local. Everyone would tell you that, in the case of Spark Streaming, one should use local[2] at minimum, because one of the cores will be dedicated to running the long receiving task that never ends, while the other core does the data processing.
So if the answer is that in this case the task does both the reading and the processing at once, then the question that follows is: is that really smart? It does not sound asynchronous; we want to be able to fetch while we process, so that for the next processing round the data is already there. However, if there is only one core, or more precisely a single task, to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?
My original understanding was that things would have remained somewhat the same, in the sense that a task would be launched to read, but that the processing would be done in another task. That would mean that, if the processing task is not done yet, we could still keep reading, up to a certain memory limit.
Can someone outline clearly what exactly is going on here?
EDIT1
We don't even need that memory limit control; the mere ability to fetch while the processing is going on, and to stop right there, would be enough. In other words, the two processes should be asynchronous, with the limit being simply to stay one step ahead. To me, if somehow this is not happening, I find it extremely strange that Spark would implement something that breaks performance like this.
Does the Spark task corresponding to a Kafka partition both read and process the data?
The relationship is very close to what you describe, if by a "task" we're referring to the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:
Driver reads offsets from all kafka topics and partitions
Driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.
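To make the shuffle boundary concrete, here is a small sketch (ssc and kafkaParams are assumed to be set up as in the Spark + Kafka integration guide; the topic name is made up):

import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// One RDD partition per Kafka TopicPartition; the task that polls the records
// also runs every narrow transformation below.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("web-events"), kafkaParams)
)

val sizes = stream.map(record => (record.key, record.value.length))  // narrow: same task as the read

// A keyed aggregation introduces a shuffle boundary: from here on, data may
// move to other executors.
sizes.reduceByKey(_ + _).print()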
On a YARN cluster I use the Kafka direct stream as input (e.g. the batch time is 15 s), and I want to aggregate the input messages by separate userIds.
So I use the stateful streaming API, like updateStateByKey or mapWithState. But from the API source I see that mapWithState's default checkpoint duration is batchDuration * 10 (in my case 150 s), while with the Kafka direct stream the partition offsets are checkpointed at every batch (15 s). Actually, every DStream can set a different checkpoint duration.
So, my question is:
When the streaming app crashes and I restart it, the Kafka offsets and the state stream RDD are checkpointed out of sync (at different intervals); in this case, how can I make sure no data is lost? Or do I misunderstand the checkpoint mechanism?
How can I make sure no data is lost?
Stateful streams such as mapWithState or updateStateByKey require you to provide a checkpoint directory because that's part of how they operate: they store the state at every checkpoint interval so they can recover it after a crash.
Other than that, each DStream in the chain is free to request checkpointing as well; the question is, do you really need to checkpoint the other streams?
If an application crashes, Spark takes all the state RDDs stored inside the checkpoint and brings them back into memory, so your data there is as good as it was the last time Spark checkpointed it. One thing to keep in mind is that if you change your application code, you cannot recover state from the checkpoint; you'll have to delete it. This means that if, for instance, you need to do a version upgrade, all data that was previously stored in the state will be gone unless you manually save it yourself in a manner that allows versioning.
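As a rough illustration of what that usually looks like in code (the checkpoint path, state type, and createContext wiring are placeholders; keyedStream stands for the keyed DStream you build from the Kafka input):

import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Illustrative state function: keep a running total per key.
def updateTotal(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  val total = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(total)
  (key, total)
}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(15))   // `conf` assumed to exist
  ssc.checkpoint("hdfs:///checkpoints/session-app")   // hypothetical checkpoint directory
  // ... build the Kafka direct stream and `keyedStream: DStream[(String, Int)]` here ...
  // keyedStream.mapWithState(StateSpec.function(updateTotal _)).print()
  ssc
}

// On a restart this recovers the context, including the state RDDs, from the
// checkpoint, as long as the application code has not changed.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/session-app", createContext _)
ssc.start()
ssc.awaitTermination()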
I'm running a Spark Streaming application that reads data from Kafka.
I have activated checkpointing to recover the job in case of failure.
The problem is that if the application fails, when it restarts it tries to process all the data since the point of failure in a single micro-batch.
This means that if a micro-batch usually receives 10,000 events from Kafka, and the job fails and restarts after 10 minutes, it will have to process one micro-batch of 100,000 events.
Now, if I want the recovery with checkpointing to be successful, I have to assign much more memory than I normally would.
Is it normal that, when restarting, Spark Streaming tries to execute all the past events since the checkpoint at once, or am I doing something wrong?
Many thanks.
If your application finds it difficult to process all events in one micro-batch after recovering from a failure, you can set the spark.streaming.kafka.maxRatePerPartition configuration in the Spark conf, either in spark-defaults.conf or inside your application.
i.e., if you believe your system/app can safely handle 10K events per second, and your Kafka topic has 2 partitions, add this line to spark-defaults.conf:
spark.streaming.kafka.maxRatePerPartition 5000
or add it inside your code:
import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.streaming.kafka.maxRatePerPartition", "5000")
Additionally, I suggest you set this number a little bit higher and enable backpressure. This will try to stream data at a rate that doesn't destabilize your streaming app.
conf.set("spark.streaming.backpressure.enabled","true")
Update: there was a mistake in an earlier version of this answer; the configuration is the number of records per second per partition, not per minute.
When streaming Spark DStreams as a consumer from a Kafka source, one can checkpoint the Spark context, so that when the app crashes (or is hit with a kill -9) it can recover from the context checkpoint. But if the app is accidentally deployed with bad logic, one might want to rewind to the last good topic+partition+offset and replay events from the Kafka partition offsets that were working fine before the bad logic. How are streaming apps rewound to the last good spot (topic+partition+offset) when checkpointing is in effect?
Note: In I (Heart) Logs, Jay Kreps writes about using a parallel consumer (group) process that starts at the diverging Kafka offset locations, runs until it has caught up with the original, and then kills the original. (What would this second Spark Streaming process look like with respect to starting from certain partition/offset locations?)
Sidebar: This question may be related to Mid-Stream Changing Configuration with Check-Pointed Spark Stream as a similar mechanism may need to be deployed.
You are not going to be able to rewind a stream in a running StreamingContext. It's important to keep these points in mind (straight from the docs):
Once a context has been started, no new streaming computations can be set up or added to it.
Once a context has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop()
called stopSparkContext to false.
A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping
the SparkContext) before the next StreamingContext is created
Instead, you are going to have to stop the current stream, and create a new one. You can start a stream from a specific set of offsets using one of the versions of createDirectStream that takes a fromOffsets parameter with the signature Map[TopicAndPartition, Long] -- it's the starting offset mapped by topic and partition.
Another theoretical possibility is to use KafkaUtils.createRDD which takes offset ranges as input. Say your "bad logic" started at offset X and then you fixed it at offset Y. For certain use cases, you might just want to do createRDD with the offsets from X to Y and process those results, instead of trying to do it as a stream.
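A sketch of that batch-style replay with the 0-10 API (topic, partition, and the offsets standing in for X and Y are hypothetical; kafkaParamsJava stands for your consumer configs as a java.util.Map, assumed to be defined elsewhere):

import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

// Hypothetical range: the "bad logic" covered offsets 1000 (X) to 5000 (Y) of partition 0.
val offsetRanges = Array(
  OffsetRange("web-events", 0, 1000L, 5000L)
)

// Read exactly that slice of the topic as a plain RDD and reprocess it once.
val replay = KafkaUtils.createRDD[String, String](
  sc,                  // existing SparkContext
  kafkaParamsJava,     // java.util.Map[String, Object] of consumer configs (assumed defined)
  offsetRanges,
  LocationStrategies.PreferConsistent
)

replay.map(record => record.value).foreach(v => println(v))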