Design question: Kafka Consumer/Producer vs Kafka Streams (Node.js)

I'm working with Node.js microservices (MS) that so far communicate through Kafka Consumers/Producers. Now I need to build a Logger MS which must record all the messages and do some processing (parse and save to a DB), but I'm not sure whether the current approach could be improved by using Kafka Streams or whether I should continue using Consumers.

The Streams API is a higher level abstraction that sits on top of the Consumer/Producer APIs. The Streams API allows you to filter and transform messages, and build a topology of processing steps.
For what you're describing, if you're just picking up messages and doing a single processing step, the Consumer API is probably fine. That said, you could do the same thing with the Streams API and simply not use its other features.
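A minimal sketch of the single-step Consumer approach, assuming the kafkajs client (the question does not name a library). The names `toLogRecord`, `startLogger`, `saveRecord`, and the group id `logger-ms` are hypothetical; the Kafka client instance is passed in, so the snippet itself has no dependencies.

```javascript
// Turn one Kafka message into a plain record. Bad JSON does not throw,
// because a logger should still record malformed payloads.
function toLogRecord(topic, partition, message) {
  const raw = message.value == null ? null : message.value.toString('utf8');
  let parsed = null;
  if (raw !== null) {
    try {
      parsed = JSON.parse(raw);
    } catch (err) {
      parsed = null; // keep the raw text, leave `parsed` empty
    }
  }
  return {
    topic,
    partition,
    offset: message.offset,
    key: message.key ? message.key.toString('utf8') : null,
    raw,
    parsed,
  };
}

// `kafka` is assumed to be a kafkajs `Kafka` instance; `saveRecord` is a
// hypothetical async function that writes one record to your database.
async function startLogger(kafka, topics, saveRecord) {
  // A dedicated consumer group lets the logger read every message
  // independently of the other services' consumer groups.
  const consumer = kafka.consumer({ groupId: 'logger-ms' });
  await consumer.connect();
  for (const topic of topics) {
    await consumer.subscribe({ topic, fromBeginning: true });
  }
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      await saveRecord(toLogRecord(topic, partition, message));
    },
  });
  return consumer;
}
```

Keeping the parsing in a pure function makes the one processing step easy to test without a broker.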

"build a Logger MS which must record all the messages and do some processing (parse and save to db)"
I would suggest using something like the Streams API, or a Node.js Producer + Consumer, to parse the messages and write them back to Kafka.
From the parsed/filtered/sanitized messages, you can then run a Kafka Connect cluster to sink the data into a DB.
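As a rough illustration of the Kafka Connect step, the sketch below builds the JSON body that would be sent to the Connect REST API (`POST /connectors`). It assumes the Confluent JDBC sink connector is installed on the Connect workers; the connector name, topic name, and connection URL are hypothetical placeholders.

```javascript
// Build the request body for `POST /connectors` on the Kafka Connect REST API.
// Assumes the Confluent JDBC sink connector is available on the workers.
function buildJdbcSinkRequest(name, topic, connectionUrl) {
  return {
    name,
    config: {
      'connector.class': 'io.confluent.connect.jdbc.JdbcSinkConnector',
      'tasks.max': '1',
      topics: topic,                   // the topic holding the parsed messages
      'connection.url': connectionUrl, // JDBC URL of the target database
      'insert.mode': 'insert',
      'auto.create': 'true',           // let the connector create the table
    },
  };
}

// Hypothetical values, for illustration only.
const request = buildJdbcSinkRequest(
  'logger-sink',
  'logs.parsed',
  'jdbc:postgresql://db:5432/logs'
);
console.log(JSON.stringify(request, null, 2));
```

Note that the JDBC sink needs records that carry a schema (for example Avro, or JSON with an embedded schema), so the parsed topic would have to be produced accordingly.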
"could be improved using Kafka Stream or if I should continue using Consumers"
Ultimately, it depends on what you need. The peek and foreach methods of the Streams DSL are functionally equivalent to a Consumer.

Related

A serverless solution to run a Kafka consumer/producer in node?

I'm having a hard time figuring out a serverless runtime for my Kafka consumers to process my events, and I want my producer to be a constant listener to pick up and ingest events into corresponding topics.
I'm using Upstash Kafka (serverless), but I don't know where to run my consumer code. I tried AWS Lambda to consume messages even though I don't think that it's the right approach.

Is it possible to make a Poller (or PollableMessageSource) to poll messages as List?

Following the example found in GitHub https://github.com/spring-cloud/spring-cloud-gcp/tree/master/spring-cloud-gcp-samples/spring-cloud-gcp-pubsub-polling-binder-sample regarding polling messages from a PubSub subscription, I was wondering...
Is it possible to make a PollableMessageSource retrieve List<Message<?>> instead of a single message per poll?
I've seen the @Poller annotation used only on Source-typed objects, never on a Processor or Sink. Is it possible to use it in such a context, for example with @StreamListener or with a functional approach?
The PollableMessageSource binding and Source stream applications are fully based on the Poller and MessageSource abstractions from Spring Integration, whose contract is to produce a single message to the configured channel. The point of messaging is really to process a single message without affecting others: a failure for one message should not fail the others in the flow.
On the other hand, you probably mean that GCP Pub/Sub messages should be produced as a list in the Spring message payload. That is possible, but only via some custom code in the Pub/Sub consumer and a MessageSource implementation. I would think twice, though, before expecting batches from the source. You could probably use an aggregator to build small windows if your downstream logic processes lists. But again: the result is still a single Spring message.
It may be better to start thinking about a reactive function implementation, where you can indeed expect a Flux<Message<?>> as input and the Spring Cloud Stream framework takes care of emitting the data from Pub/Sub into the reactive stream you expect.
See more info in docs: https://docs.spring.io/spring-cloud-stream/docs/3.1.0/reference/html/spring-cloud-stream.html#_reactive_functions_support
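The aggregator idea mentioned in this answer can be illustrated independently of Spring. The following is a language-neutral sketch written in JavaScript, not Spring Integration's aggregator API; `createAggregator` and `onWindow` are hypothetical names. It collects single messages into small count-based windows and hands each window downstream as one unit.

```javascript
// Collect single messages into fixed-size windows. Each full window is
// passed to `onWindow` as one array, so downstream still sees "one message"
// whose payload happens to be a list.
function createAggregator(windowSize, onWindow) {
  let buffer = [];
  return {
    add(message) {
      buffer.push(message);
      if (buffer.length >= windowSize) {
        const full = buffer;
        buffer = [];
        onWindow(full);
      }
    },
    // Release a partial window, e.g. on shutdown or from a timer.
    flush() {
      if (buffer.length > 0) {
        const partial = buffer;
        buffer = [];
        onWindow(partial);
      }
    },
  };
}

// Usage: windows of two messages, with the remainder flushed at the end.
const windows = [];
const aggregator = createAggregator(2, (w) => windows.push(w));
[1, 2, 3, 4, 5].forEach((m) => aggregator.add(m));
aggregator.flush();
console.log(windows);
```

A real aggregator would usually also release windows on a timeout, so that a slow trickle of messages is not held back indefinitely.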

Reading messages in bulk through a Pulsar consumer

I am using node pulsar client to consume messages from a Pulsar topic. The consumer is subscribed to the topic using a shared subscription mode. Currently, each call to receive gets a single message from the topic. Is there a way to receive messages in bulk?
The fact that you get messages one by one doesn't mean that the Pulsar client doesn't use batching and other optimization techniques in the background. The official documentation for the Pulsar Java consumer describes the receiverQueueSize parameter, which sets the size of the queue in which received messages accumulate. By default, the Pulsar consumer uses reasonable values for its parameters, and it should perform quite well for most applications. Are you experiencing any issues or slow performance?
Update
Since version 2.4.1 of Apache Pulsar, it is possible to receive messages in batches with the consumer. First, create the consumer with a BatchReceivePolicy config (change the values to suit your use case):
Consumer<GenericRecord> consumer = pulsarClient
    .newConsumer(Schema.AUTO_CONSUME())
    .batchReceivePolicy(BatchReceivePolicy.builder()
        .maxNumMessages(5000)
        .maxNumBytes(10 * 1024 * 1024)
        .timeout(1, TimeUnit.SECONDS)
        .build())
    // .. other configuration such as topic and subscription
    .subscribe();
Second, use the batchReceive method to get a batch of messages:
Messages<GenericRecord> messages = consumer.batchReceive();
When all messages are processed, simply acknowledge all of them:
consumer.acknowledge(messages);
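Since the question is about the Node.js client, here is a rough client-side sketch of collecting a batch by calling receive repeatedly with a timeout. It assumes the consumer object exposes `receive(timeoutMs)`, which rejects when the timeout expires, and `acknowledge(message)`, as the Node.js Pulsar client is understood to do. `receiveBatch` and `processBatch` are hypothetical helpers, and this is not the BatchReceivePolicy mechanism from the Java snippet.

```javascript
// Collect up to `maxMessages` messages, stopping early when a receive
// call does not return a message within `timeoutMs`.
async function receiveBatch(consumer, maxMessages, timeoutMs) {
  const batch = [];
  while (batch.length < maxMessages) {
    try {
      batch.push(await consumer.receive(timeoutMs));
    } catch (err) {
      // Assumption: a timeout surfaces as a rejected promise. A real
      // implementation should tell timeouts apart from other errors.
      break;
    }
  }
  return batch;
}

// Process one batch with the hypothetical async `handle` function,
// then acknowledge every message in it.
async function processBatch(consumer, handle) {
  const batch = await receiveBatch(consumer, 100, 1000);
  if (batch.length > 0) {
    await handle(batch);
    for (const message of batch) {
      await consumer.acknowledge(message);
    }
  }
  return batch.length;
}
```

Messages are acknowledged only after `handle` resolves, so a failure while processing leaves the whole batch unacknowledged and eligible for redelivery.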

Structured Streaming Rollback files in case of Exception

In my Structured Streaming application, I am reading data from MQ, doing some transformation, and writing the results to Kafka. I have implemented a custom MQ source.
My question is how to roll back the messages to MQ in case of exceptions during the transformation or while writing the messages to Kafka.
I am reading the messages in bulk, say 5000 messages per batch, but if Kafka goes down while writing the results, how can we roll back the messages?
Is there any approach to roll back or recover messages when using a custom source (any non-distributed source like MQ)?

Using Apache Kafka for log aggregation

I am learning Apache Kafka from their quickstart tutorial: http://kafka.apache.org/documentation.html#quickstart. Up to now, I have done the setup as follows: a producer node, where a web server is running on port 8888, and a Kafka server (broker), a Consumer, and a ZooKeeper instance on another node. I have tested the default console/file-enabled producer and consumer with 3 partitions. The setup works, and I am able to see the messages I sent in the order they were created (within each partition).
Now, I want to send the logs generated by the web server to the Kafka broker. These messages will be processed by a consumer later. Currently I am using syslog-ng to capture server logs to a text file. I have come up with 3 rough ideas on how to implement a producer to use Kafka for log aggregation.
Producer Implementations
First Kind: Listen to the TCP port of syslog-ng, fetch each message, and send it to the Kafka server. Here we have two middle processes: the producer and syslog-ng.
Second Kind: Use syslog-ng as the producer. I would need to find a way to send messages to the Kafka server instead of writing to a file. Here syslog-ng, acting as the producer, is the only middle process.
Third Kind: Configure the web server itself as the producer.
Am I correct in my thinking? In the last case we don't have any middle process, but I am concerned that its implementation will affect server performance. Can anyone tell me the best way of using Apache Kafka (if the above 3 are not good) and guide me through the appropriate server configuration?
P.S.: I am using node.js for my web server
Thanks,
Sarath
Since you specify that you wish to send the generated logs to the Kafka broker, running a separate process that listens and resends messages mainly creates another point of failure with no additional value (unless you need a specific syslog-ng capability).
syslog-ng can send messages to external applications; see:
http://www.balabit.com/sites/default/files/documents/syslog-ng-ose-3.4-guides/en/syslog-ng-ose-v3.4-guide-admin/html/configuring-destinations-program.html. I don't know whether there are other ways to do that.
For the third option, I am not sure whether Kafka can easily be integrated into Node.js, as it requires a C++ producer, and when I last looked for one I was not able to find it. However, an easy alternative could be to have Kafka read the log file created by the server and send those logs (using the console producer provided with Kafka). This is usually a good approach, as it completely removes the dependency between Kafka and the web server (embedding the producer would require error handling, configuration, etc.). It requires the use of tail --follow, and it works very well for us. If you would like more details on that, I can include them as well. You would still need to supervise the Kafka side to make sure messages are not lost (and provide a recovery option to send, offline, any messages that failed). But the good thing about this method is that there is no dependency between the tools.
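For readers who prefer to stay in Node.js, the tail-based idea can be sketched as a small separate process rather than a pipe into the console producer. This is an illustration under assumptions, not the setup described in this answer: `createLineSplitter` and `followLog` are hypothetical helpers, `send` stands for whatever function hands one log line to a Kafka producer, and a `tail` binary supporting `-F` is assumed to be on the PATH.

```javascript
const { spawn } = require('child_process');

// Split arbitrary text chunks into complete lines, holding back any
// partial line until the rest of it arrives.
function createLineSplitter(onLine) {
  let pending = '';
  return function push(chunk) {
    pending += chunk;
    const lines = pending.split('\n');
    pending = lines.pop(); // last piece is incomplete (or empty)
    for (const line of lines) {
      if (line.length > 0) onLine(line);
    }
  };
}

// Follow a log file with `tail -F` and pass each new line to `send`.
// `-n 0` starts at the end of the file; `-F` keeps following across rotation.
function followLog(path, send) {
  const tail = spawn('tail', ['-n', '0', '-F', path]);
  tail.stdout.setEncoding('utf8');
  tail.stdout.on('data', createLineSplitter(send));
  return tail; // caller can watch 'exit' and restart if needed
}
```

Because the follower runs outside the web server, a Kafka outage or a crash in this process does not affect request handling, which matches the decoupling argument made here.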
Hope it helps...
Eran
