Today I noticed that it is written as below in under High Level Producer of kafka-node.
⚠️WARNING: Batch multiple messages of the same topic/partition together as an array on the messages attribute otherwise you may lose messages!
What does it mean ? Is it means that messages may get lost if I frequently tries to write data to same topic/partition without batch ?
Related
I am using Spark 2.3 structured streaming to read messages from Kafka and write to Postgres (using the Scala programming language), my application is supposed to be a long living application, and it should be able to handle any case of failure without human intervention.
I have been looking for ways to catch unexpected errors in Structured Streaming, and I found this example here:
Spark Structured Streaming exception handling
This way it is possible to catch all errors that are thrown in the Stream, but the problem is, when the application tries again, it is stuck on the same exception again.
Is there a way in Structured Streaming that I can handle the error and tell spark to increment the offset in the "checkpointlocation" programatically so that it proceeds to the consume the next message without being stuck?
This is called in the streaming event processing world as handling a "poison pill"
Please have a look on the following link
https://www.waitingforcode.com/apache-spark-structured-streaming/corrupted-records-poison-pill-records-apache-spark-structured-streaming/read
It suggest several ways to handle this type of scenario
Strategy 1: let it crash
The Streaming application will log a poison pill message and stop the processing. It's not a big deal because thanks to the checkpointed offsets we'll be able to reprocess the data and handle it accordingly, maybe with a try-catch block.
However, as you already saw in your question, it's not a good practice in streaming systems because the consumer stops and during that idle period it accumulates the lag (the producer continues to generate data).
Strategy 2: ignore errors
If you don't want downtime of your consumer, you can simply skip the corrupted events. In Structured Streaming it can be summarized to filtering out null records and, eventually, logging the unparseable messages for further investigation, or records that get you an error.
Strategy 3: Dead Letter Queue
we ignore the errors but instead of logging them, we dispatch them into another data storage.
Strategy 4: sentinel value
You can use a pattern called Sentinel Value and it can be freely used with Dead Letter Queue.
Sentinel Value corresponds to a unique value returned every time in case of trouble.
So in your case, whenever a record cannot be converted to the structure we're processing, you will emit a common object,
For code samples look inside the link
Hi I have a event hub with two consumer group.
Many device are sending data to my event hub any I want to save all message to my data base.
Now data are getting send by multiple device so data ingress is to high so in order two process those message i have written one EventHub Trigger webjob to process the message and save to database.
But since saving these message to my data base is time consuming task or I can say that receiver speed is slow then sender speed.
So is there any way two process these message faster by creating multiple receiver kind of thing.
I have create two event receiver with different consumer group but I found that same message is getting processed by both trigger function so now duplicate data are getting save in my data base.
So please help me to know how I can create multiple receiver which will process unique message parallel.
Please guys help me...
Creating multiple consumer groups won't help you out as you found out yourself. Different consumer groups all read the same data, but they can have their own speed.
In order to increase the speed of processing there are just 2 options:
Make the process/code itself faster, so try to optimize the code that is saving the data to the database
Increase the amount of partitions so more consumers can read the data from a given partition in parallel. This means however that you will have to recreate the Event Hub as you cannot increase/decrease the partition count after the Event Hub is created. See the docs for guidance.
about 2.: The number of concurrent data consumers is equal to the number of partitions created. For example, if you have 4 partitions you can have up to 4 concurrent data readers processing the data.
I do not know your situation but if you have certain peaks in which the processing is too slow but it catches up during more quiet hours you might be able to live with the current situation. If not, you have to do something like I outlined.
I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this:
Load a batch of requests into Spark as as RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for loop comprehension would be run against the final RDD to send each modified request to a downstream system.
Does the above sound correct, or would this not be the right way to use Spark?
Thanks
We have done something similar working on a small IOT project. we tested receiving and processing around 50K mqtt messages per second on 3 nodes and it was a breeze. Our processing included parsing of each JSON message, some manipulation of the object created and saving of all the records to a time series database.
We defined the batch time for 1 second, the processing time was around 300ms and RAM ~100sKB.
A few concerns with streaming. Make sure your downstream system is asynchronous so you wont get into memory issue. Its True that spark supports back pressure, but you will need to make it happen. another thing, try to keep the state to minimal. more specifically, your should not keep any state that grows linearly as your input grows. this is extremely important for your system scalability.
what impressed me the most is how easy you can scale with spark. with each node we added we grew linearly in the frequency of messages we could handle.
I hope this helps a little.
Good luck
Is there any way I can have time threshold associated message in Kafka.
E.g.
Consumer pulls a message out of Kafka but system does not have enough information to process. So I put the message back in "resolver" queue, but I do not want to pull it out of the "resolver" queue for next 15 minutes, is there any way I can achieve that.
No, there is no way to achieve that with just Kafka. Kafka is designed to store messages for a certain period of time or until a certain size. The only way to get messages from Kafka is by offset. The offset identifies which messages have been consumed, messages before the offset have already been read, messages after the offset are yet to be read.
As far as I know Kafka does not provide time stamps with its messages (I could be wrong about that). But reading through the documentation and working with it for several months I have never encountered any example about retrieving messages from Kafka based on time.
I'm just started with Apache Kafka and really try to figure out, how could I design my system to use it in proper manner.
I'm building system which process data and actually my chunk of data is a task (object), that need to be processed. And object knows how it could be processed, so that's not a problem.
My system is actually a splited into 3 main component: Publisher (code which spown tasks), transport - actually kafka, and set of Consumers - it's actually workers who just pull data from the queue, process it somehow. It's important to note, that Consumer could be a publisher itself, if it's task need 2 step computation (Consumer just create tasks and send it back to transport)
So we could start with idea that I have 3 server: 1 single root publisher (kafka server also running there) and 2 consumers servers which actually handle the tasks. Data workflow is like that: Publisher create task, put it to transposrt, than one of consumers take this task from the queue and handle it. And it will be nice if each consumer will be handle the same ammount of tasks as the others (so workload spread eqauly between consumers).
Which kafka configuration pattern I need to use for that case? Does kafka have some message balancing features or I need to create 2 partitions and each consumer will be only binded to single partitions and could consume data only from this partition?
In kafka number of partitions roughly translates to the parallelism of the system.
General tip is create more partitions per topic (eg. 10) and while creating the consumer specify the number of consumer threads corresponding to the number of partitions.
In the High-level consumer API while creating the consumer you can provide the number of streams(threads) to create per topic. Assume that you create 10 partitions and you run the consumer process from a single machine, you can give topicCount as 10. If you run the consumer process from 2 servers you could specify the topicCount as 5.
Please refer to this link
The createMessageStreams call registers the consumer for the topic, which results in rebalancing the consumer/broker assignment. The API encourages creating many topic streams in a single call in order to minimize this rebalancing.
Also you can dynamically increased the number of partitions using kafka-add-partitions.sh command under kafka/bin. After increasing the partitions you can restart the consumer process with increased topicCount
Also while producing you should use the KeyedMessage class based on some random key within your message object so that the messages are evenly distributed across the different partitions