Hazelcast Jet Kafka with non-serializable event handler - hazelcast-jet

I want to use hazelcast-jet-kafka in my app, because in my case the number of Kafka partitions is limited. As I understand it, jet-kafka parallelism doesn't depend on the number of Kafka partitions; it would be nice to find an explanation of how jet-kafka achieves this independence.
But my question is: how can I handle events in Jet when my event handler cannot be made serializable?
For example, I've found one solution: use a map sink and add a local event listener to that map. But to me this seems like a crutch, because I don't need to store these events in a map. Is it possible to set the map size to zero in such a scheme?
Also, I see a new type of sink in the docs, the observable. It seems like what I want, but an observable listener cannot receive only local entries, so it is not suitable for me.
Could you help me find the right solution? Or is hazelcast-jet-kafka not a good choice in this case?

it would be nice to find an explanation of how jet-kafka achieves this independence.
One Jet thread can handle any number of partitions, so it's easy to achieve this independence. Jet just distributes all the partitions fairly among all the Kafka connector threads.
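For illustration, a minimal sketch (assuming the Jet 4.x/5.x pipeline API; the topic name and properties are placeholders) showing that you choose the source's parallelism yourself, regardless of how many partitions the topic has:

import java.util.Map;
import java.util.Properties;

import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.StreamStage;

public class PartitionIndependence {
    public static Pipeline build(Properties kafkaProps) {
        Pipeline p = Pipeline.create();
        StreamStage<Map.Entry<String, String>> source = p
                .readFrom(KafkaSources.<String, String>kafka(kafkaProps, "events"))
                .withNativeTimestamps(0);
        // You pick the number of connector threads per member; Jet splits the
        // topic's partitions among them, however many partitions there are.
        source.setLocalParallelism(2);
        source.writeTo(Sinks.logger());
        return p;
    }
}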
But my question is: how can I handle events in Jet when my event handler cannot be made serializable?
Hazelcast Jet doesn't require your event handler to be serializable. If you need a stateful handler, you have to supply a function that creates the state object. That function must be serializable, but the state it creates doesn't have to be. If you just want a stateless mapping function, that function must be serializable, but usually that's not a problem.
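As a sketch of the stateful case (assuming the Jet 4.x/5.x pipeline API; EventHandler and its handle method are hypothetical stand-ins for your handler): the lambda passed to the service factory is what gets serialized and shipped to the cluster, while the handler it creates lives only on that member and never needs to implement Serializable.

import java.util.Properties;

import com.hazelcast.jet.kafka.KafkaSources;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.ServiceFactories;
import com.hazelcast.jet.pipeline.Sinks;

public class NonSerializableHandler {

    // Deliberately not Serializable: it is created on each member, never shipped over the network.
    static class EventHandler {
        String handle(String payload) {
            return "handled: " + payload;
        }
    }

    public static Pipeline build(Properties kafkaProps) {
        Pipeline p = Pipeline.create();
        p.readFrom(KafkaSources.<String, String>kafka(kafkaProps, "events"))
         .withoutTimestamps()
         .mapUsingService(
                 ServiceFactories.sharedService(ctx -> new EventHandler()),   // this factory lambda is serialized
                 (handler, entry) -> handler.handle(entry.getValue()))        // runs locally on the member
         .writeTo(Sinks.logger());
        return p;
    }
}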
If you are getting an error that says a function is non-serializable, this can be due to a common pitfall of capturing more state than you actually need in the lambda. You should show your code in that case.
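That pitfall typically looks like the sketch below (the class and field names are made up for illustration): referencing an instance field inside a lambda silently captures this, so Jet has to serialize the whole enclosing object, which fails if that object isn't serializable. Copying the field into a local variable first means only that value is captured.

import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.test.TestSources;

public class CapturePitfall {

    private final String prefix = "event-";   // the enclosing class is not Serializable

    Pipeline broken() {
        Pipeline p = Pipeline.create();
        p.readFrom(TestSources.items(1, 2, 3))
         .map(i -> prefix + i)          // captures `this` -> "not serializable" error when the job is submitted
         .writeTo(Sinks.logger());
        return p;
    }

    Pipeline fixed() {
        String localPrefix = prefix;    // copy just the value you need
        Pipeline p = Pipeline.create();
        p.readFrom(TestSources.items(1, 2, 3))
         .map(i -> localPrefix + i)     // captures only the String, which is serializable
         .writeTo(Sinks.logger());
        return p;
    }
}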

Related

Hazelcast, recovering a MessageListener member after termination due to message loss

We have a ReliableMessageListener that synchronizes some data structures it holds across the cluster via its onMessage implementation.
The cluster is composed of three nodes. We noticed that one of the topics got out of sync and had been terminated due to message loss detected by the ring buffer: we get a "Terminating MessageListener, ... Reason: Underlying ring buffer data related to reliable topic is lost" exception. What happens is that this node is still up, but this specific listener no longer gets events/messages from the other two nodes, while they do get messages from it.
We end up with a de facto split-brain for this specific topic.
Our message listener is configured with isLossTolerant = false and isTerminal = false.
I am trying to understand what is considered to be a good strategy for handling such a scenario and recovering from it.
For example, is it good practice to try to subscribe to the topic again?
Is it good practice to send a message telling the other nodes in the cluster to clear their data? Will they even get the message after the ring buffer has gone out of sync?
Thanks
The message Reason: Underlying ring buffer data related to the reliable topic is lost means that the data you are trying to read is not available anymore because it was overwritten by newer data in the underlying Ringbuffer - your producer is likely faster than your consumers.
When this situation occurs, the ReliableTopic is still usable and you can register a new listener.
To prevent it from occurring, you can either increase the size of the underlying Ringbuffer (provide a ringbuffer config with the same name as the reliable topic) or configure a TopicOverloadPolicy. See the documentation for details.
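A sketch of where both settings live (assuming the programmatic Config API; the topic name and capacity are placeholder values):

import com.hazelcast.config.Config;
import com.hazelcast.config.ReliableTopicConfig;
import com.hazelcast.config.RingbufferConfig;
import com.hazelcast.topic.TopicOverloadPolicy;

public class ReliableTopicSizing {
    public static Config config() {
        Config config = new Config();

        // Option 1: enlarge the backing ringbuffer; it must have the same name as the topic.
        config.addRingBufferConfig(
                new RingbufferConfig("my-reliable-topic")
                        .setCapacity(100_000));   // the default capacity is 10,000 items

        // Option 2: control what publishers do when the topic is overloaded
        // (see the TopicOverloadPolicy documentation for the exact semantics of each policy).
        config.addReliableTopicConfig(
                new ReliableTopicConfig("my-reliable-topic")
                        .setTopicOverloadPolicy(TopicOverloadPolicy.BLOCK));

        return config;
    }
}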

Are Hazelcast Jet Reliable Topic Sinks idempotent? (Hazelcast fault-tolerance of a websocket source)

I cannot find this in the Hazelcast Jet 5.0 (or 4.x) documentation, so I hope someone can answer it here: can a reliable topic be used as an idempotent sink, for example to de-duplicate events coming from two identical unreliable sources (like a websocket)?
Or should I use explicit event de-duplication as suggested at https://hazelcast.com/blog/stream-deduplication-with-hazelcast-jet/? Or is there a better way to cope with unreliable sources like websockets (for the case where I don't want to miss events ingested over a websocket, and there is a non-zero chance that a single websocket instance might fail)?
A queue can't, in general, be used for de-duplication. If you offer the same item twice, it has no means to ignore the second call; for that it would have to store the identifiers of the entire history, or you would have to specify storage limits as in the example you linked, where the TTL attribute of filterStateful is used.
I ended up using putIfAbsent() on an IMap journal - I think that's much simpler (and somewhat obvious) for my use case than the de-duplication solution linked above.
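A minimal sketch of that idea (the map name, key format and String payloads are my assumptions, not from the original post): both unreliable sources write every event keyed by a unique event id, putIfAbsent() keeps only the first copy, and whatever reads the map's event journal therefore sees each event once.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class PutIfAbsentDedup {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        // The map needs its event journal enabled in the configuration if
        // downstream consumers are to read it as a stream.
        IMap<String, String> events = hz.getMap("deduped-events");

        // Both websocket consumers would call something like this for every event received.
        String eventId = "trade-42";                        // unique id carried by the event
        String payload = "{\"symbol\":\"ABC\",\"px\":10}";

        String previous = events.putIfAbsent(eventId, payload);
        if (previous != null) {
            // The other source already delivered this event; drop the duplicate.
        }

        hz.shutdown();
    }
}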

Properly Seek and Consume Kafka Messages on Multipartition Topic

I recently found that a topic I've been using is multi-partition rather than single-partition. I need to reconfigure my consumer class to handle the multiple partitions, but I'm a touch confused. I am currently using an offset group, let's call it test_offset_group for the sake of the example below. Under normal circumstances, I will always be parsing linearly and continuing forward in time; as messages get added to the topic I will parse them and move on, but in the event of a crash or the need to go back and re-run the feed for the previous day, I need to be able to seek by timestamp. Kafka is mandatory in this project, so I have no ability to change the type of streaming data service I'm using.
I configure my consumer like this:
test_consumer = KafkaConsumer("test_topic", bootstrap_servers="bootstrap_string", enable_auto_commit=False, group_id="test_offset_group")
In the event I need to seek to a timestamp, I'll provide a timestamp and then seek using the following method:
test_consumer.poll()
tp = TopicPartition("test_topic", 0)
needed_date = datetime.timestamp(timestamp)
rec_in = test_consumer.offsets_for_times({tp: int(needed_date * 1000)})
test_consumer.seek(tp, rec_in[tp].offset)
The above works perfectly for a single-partition consumer, but it feels very clunky and difficult when you consider numerous partitions. I guess I could fetch the number of partitions using
test_consumer.partitions_for_topic("test_topic")
and then iterate over each of them, but again, that seems like I'm going against the grain of Kafka, and I feel like there should be an easier way to do this.
In summary: I'd like to understand how to seek to a number of offsets across multiple partitions while utilizing the offset group functionality, and I'd like to confirm that, by conducting the above operation, I am effectively ignoring all partitions aside from partition 0?
You are doing the right thing logically; you just need to perform it on all partitions assigned to this consumer instance.
You can retrieve the current assignment using assignment().
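The question's snippet uses kafka-python, but for illustration here is the same loop written with the Java kafka-clients consumer, whose assignment(), offsetsForTimes() and seek() calls mirror the Python ones used above (the short poll to trigger assignment and the null check are my assumptions):

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class SeekAllAssignedPartitions {

    // Rewind every partition currently assigned to this consumer to the given timestamp (ms since epoch).
    static void seekToTimestamp(KafkaConsumer<String, String> consumer, long timestampMs) {
        consumer.poll(Duration.ofSeconds(1));               // a short poll so the group coordinator assigns partitions
        Set<TopicPartition> assigned = consumer.assignment();

        Map<TopicPartition, Long> query = new HashMap<>();
        for (TopicPartition tp : assigned) {
            query.put(tp, timestampMs);
        }

        Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
        for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : offsets.entrySet()) {
            if (e.getValue() != null) {                     // null when no record exists at or after the timestamp
                consumer.seek(e.getKey(), e.getValue().offset());
            }
        }
    }
}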

Order Guarantee with Spark Streaming

I am trying to get change events from Kafka that I would like to propagate downstream to another system. However, the order of the changes matters. Hence I wonder what the appropriate way to do that is, with some Spark transformation in the middle.
The only thing I see is to lose the parallelism and put the DStream on a single partition. Maybe there is a way to do the operations in parallel, bring everything back into one partition, and then send it to the external system, or put it back into Kafka and then use a Kafka sink for the matter.
What approach can I try?
In a distributed environment, with some form of caching/buffering at most layers, messages generated from the same machine may reach the back end in a different order. Also, the definition of order is subjective. Implementing a global definition of order will be restrictive (and may not even be correct) for the data as a whole.
So Kafka keeps data in the order it was put, but only per partition, and that is the catch: partitions define the level of parallelism per topic.
At the level of abstraction Kafka sits at, it should not be too concerned with order; it is optimized for maximum throughput, and that is where partitioning comes in handy. Consider ordering just a side effect of supporting streaming.
So whatever logic ensures that the data is put into Kafka in the right order makes more sense in your application (the Spark job).
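One common way to apply that advice (my illustration, not something the answer spells out) is to give all records that must stay in order the same key: Kafka preserves order only within a partition, and records with the same key always hash to the same partition.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Keep retries from reordering in-flight batches.
        props.put("max.in.flight.requests.per.connection", "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String entityId = "account-17";   // all changes for one entity share a key,
                                              // so they land in the same partition, in order
            producer.send(new ProducerRecord<>("change-events", entityId, "created"));
            producer.send(new ProducerRecord<>("change-events", entityId, "updated"));
            producer.send(new ProducerRecord<>("change-events", entityId, "deleted"));
        }
    }
}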

Spark streaming with Kafka - createDirectStream vs createStream

We have been using Spark Streaming with Kafka for a while, and until now we have been using the createStream method from KafkaUtils.
We just started exploring createDirectStream and like it for two reasons:
1) Better/easier "exactly once" semantics
2) Better correlation of Kafka topic partitions to RDD partitions
I did notice that createDirectStream is marked as experimental. The question I have is (sorry if this is not very specific):
Should we explore the createDirectStream method if exactly-once is very important to us? It would be awesome if you could share your experience with it. Are we running the risk of having to deal with other issues, such as reliability?
There is a great, extensive blog post by the creator of the direct approach (Cody) here.
In general, reading the Kafka delivery semantics section, the last part says:
So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
This basically means "we give you at least once out of the box, if you want exactly once, that's on you". Further, the blog post talks about the guarantee of "exactly once" semantics you get from Spark with both approaches (direct and receiver based, emphasis mine):
Second, understand that Spark does not guarantee exactly-once semantics for output actions. When the Spark streaming guide talks about exactly-once, it’s only referring to a given item in an RDD being included in a calculated value once, in a purely functional sense. Any side-effecting output operations (i.e. anything you do in foreachRDD to save the result) may be repeated, because any stage of the process might fail and be retried.
Also, this is what the Spark documentation says about receiver based processing:
The first approach (Receiver based) uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures.
This basically means that if you're using the receiver-based stream with Spark, you may still have duplicated data in case the output transformation fails; it is at-least-once.
In my project I use the direct stream approach, where the delivery semantics depend on how you handle them. This means that if you want to ensure exactly-once semantics, you can store the offsets along with the data in a transaction-like fashion: if one fails, the other fails as well.
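For a concrete picture, here is a sketch of that pattern using the spark-streaming-kafka-0-10 direct stream (a newer API than the one the question refers to, but the older direct API exposes offset ranges the same way); the transactional store itself is left as a placeholder comment.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class DirectStreamOffsets {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("direct-stream-offsets");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "direct-stream-example");
        kafkaParams.put("enable.auto.commit", false);       // we manage offsets ourselves

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("events"), kafkaParams));

        stream.foreachRDD(rdd -> {
            // The direct stream exposes exactly which offsets this batch covers.
            OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            for (OffsetRange range : ranges) {
                // Write the batch's results AND range.topic()/range.partition()/range.untilOffset()
                // in one transaction; on restart, resume from the offsets stored with the last
                // committed results.
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}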
I recommend reading the blog post (link above) and the Delivery Semantics section of the Kafka documentation. To conclude, I definitely recommend you look into the direct stream approach.
