Using Apache Kafka for log aggregation - node.js

I am learning Apache Kafka from their quickstart tutorial: http://kafka.apache.org/documentation.html#quickstart. Up to now, I have done the setup as follows: a producer node, where a web server is running at port 8888, and a Kafka server (broker), a consumer, and a ZooKeeper instance on another node. I have tested the default console/file-enabled producer and consumer with 3 partitions. The setup is perfect, and I am able to see the messages I sent in the order they were created (within each partition).
Now, I want to send the logs generated by the web server to the Kafka broker. These messages will be processed by a consumer later. Currently I am using syslog-ng to capture server logs to a text file. I have come up with 3 rough ideas on how to implement the producer to use Kafka for log aggregation.
Producer Implementations
First Kind: Listen to the TCP port of syslog-ng, fetch each message, and send it to the Kafka server. Here we have two middle processes: the producer and syslog-ng.
Second Kind: Use syslog-ng itself as the producer. We would need to find a way to send messages to the Kafka server instead of writing them to a file. Syslog-ng, the producer, is the only middle process.
Third Kind: Configure the web server itself as the producer.
Am I correct in my thinking? In the last case we don't have any middle process, but I doubt whether its implementation will affect server performance. Can anyone tell me the best way of using Apache Kafka (if the above three are not good) and guide me through the appropriate server configuration?
P.S.: I am using node.js for my web server
Thanks,
Sarath

Since you specify that you wish to send the generated logs to the Kafka broker, it indeed looks as if running a separate process to listen for and resend messages mainly creates another point of failure with no added value (unless you need a specific syslog-ng capability).
Syslog-ng can send messages to external applications using its program() destination: http://www.balabit.com/sites/default/files/documents/syslog-ng-ose-3.4-guides/en/syslog-ng-ose-v3.4-guide-admin/html/configuring-destinations-program.html. I don't know if there are other ways to do that.
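For the second option, a program() destination sketch might look like this; the producer path, broker address, topic name, and source name are placeholders:

```
# Hypothetical syslog-ng destination that pipes every log message to
# stdin of Kafka's console producer (paths and names are placeholders).
destination d_kafka {
    program("/opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic weblogs");
};
log { source(s_src); destination(d_kafka); };
```

syslog-ng keeps the spawned program running and writes one message per line to its stdin, which is exactly what the console producer expects.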
For the third option, I am not sure whether Kafka can easily be integrated into Node.js, as it requires a C++ producer, and when I last looked for one I was not able to find it. However, an easy alternative could be to have Kafka read the log file created by the server and send those logs, using the console producer provided with Kafka. This is usually a good approach, as it completely removes the dependency between Kafka and the web server (embedding the producer would require error handling, configuration, etc.). It requires the use of tail --follow, and it works very well for us. If you wish, I can include more details on that. You would still need to supervise the Kafka execution to make sure messages are not lost (and provide a recovery option to send failed messages offline). But the good thing about this method is that there is no dependency between the tools.
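A minimal sketch of that pipeline, assuming GNU tail and the console producer that ships with Kafka (the log path, install path, and topic name are placeholders):

```sh
# Follow the web server's log file by name (surviving log rotation)
# and ship each new line to Kafka via the bundled console producer.
tail --follow=name --retry /var/log/webserver/access.log \
  | /opt/kafka/bin/kafka-console-producer.sh \
      --broker-list localhost:9092 --topic weblogs
```

Running this under a supervisor (e.g. as a system service with automatic restart) covers the "supervise Kafka execution" concern mentioned above.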
Hope it helps...
Eran

Related

Query a Kafka Topic with nodejs

I'm a bit puzzled. Is there really no Node.js API to query Kafka topics (e.g. as with Kafka Streams and the Java API)? Am I missing something?
Just to get this straight: only being notified of the latest event/record of a topic is not enough for me. I want to query and process the topic's records, and then maybe store an aggregate in another topic.
Thanks for your thoughts on whether this is possible with Node.js and a library only.
Here is what worked for me and what most people use.
Limited solution
If you are stubborn and want to insist on a node.js library to wrap things up for you: https://nodefluent.github.io/kafka-streams/docs/
As of today they offer:
easy access streams
merge streams
split streams
Full blown solution
The easiest way to query Kafka (as of today; there are rumors Confluent is creating more libraries, including one for Node.js) is via the REST API. It is part of ksqlDB, and ksqlDB is part of the Confluent Platform, which also ships with ZooKeeper and a Kafka instance, which you probably already have. If you wonder how to install it:
It spins up in about a minute with the Docker YAML file:
Run docker-compose up -d
See the ports and services running with docker ps
Start requesting the status from the REST API by simply sending a GET request to http://0.0.0.0:8088/. It will return service information:
{
  "KsqlServerInfo": {
    "version": "6.2.0",
    "kafkaClusterId": "uOXfee3zQ76vCKBbREw1yg",
    "ksqlServiceId": "default_",
    "serverStatus": "RUNNING"
  }
}
Hope this spares some of you the initial research. And... if we are lucky there will be a wrapper library soon.
Then create a stream from your topic and voilà: you are ready to query your topic (through the stream) with the REST API. Since the REST API offers HTTP/2, you can also get continuous updates on freshly arriving records in the stream; use push queries for this. Pull queries close the connection after the result has been delivered.
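Concretely, the stream, a materialized aggregate, and the two query styles might look like this in ksqlDB; the topic, stream, table, and column names are illustrative:

```sql
-- Create a stream over an existing topic (names are placeholders).
CREATE STREAM weblogs (method VARCHAR, path VARCHAR, status INT)
  WITH (KAFKA_TOPIC = 'weblogs', VALUE_FORMAT = 'JSON');

-- Materialize an aggregate into a table so it can be pull-queried.
CREATE TABLE hits_by_method AS
  SELECT method, COUNT(*) AS hits FROM weblogs GROUP BY method;

-- Push query: stays open and streams every new row as it arrives.
SELECT * FROM weblogs EMIT CHANGES;

-- Pull query: returns the current state and then terminates.
SELECT hits FROM hits_by_method WHERE method = 'GET';
```

These statements can be sent to the same REST API (the /ksql and /query endpoints) or typed into the ksql CLI.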

Design Question Kafka Consumer/Producer vs Kafka Stream

I'm working with Node.js microservices; so far they communicate through Kafka consumers/producers. Now I need to build a logger microservice which must record all the messages and do some processing (parse and save to DB), but I'm not sure whether the current approach could be improved using Kafka Streams or whether I should continue using plain consumers.
The Streams API is a higher level abstraction that sits on top of the Consumer/Producer APIs. The Streams API allows you to filter and transform messages, and build a topology of processing steps.
For what you're describing, if you're just picking up a message and doing a single processing step, the Consumer API is probably fine. That said, you could do the same thing with the Streams API and simply not use its other features.
build a logger MS which must record all the messages and do some processing (parse and save to DB)
I would suggest using something like the Streams API, or a Node.js producer + consumer, to parse the messages and write them back to Kafka.
From your parsed/filtered/sanitized messages, you can run a Kafka Connect cluster to sink your data into a DB
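For the Connect step, a JDBC sink configuration might look roughly like this; the connector class is Confluent's JDBC sink connector, while the connector name, topic, connection URL, and credentials are placeholders:

```json
{
  "name": "logs-db-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "parsed-logs",
    "connection.url": "jdbc:postgresql://localhost:5432/logs",
    "connection.user": "kafka",
    "connection.password": "secret",
    "auto.create": "true",
    "insert.mode": "insert"
  }
}
```

POSTing this to a Connect worker's REST endpoint starts a sink task that continuously writes the parsed topic into the database table.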
could be improved using Kafka Stream or if I should continue using Consumers
Ultimately, it depends on what you need. The peek and foreach methods of the Streams DSL are functionally equivalent to a consumer.
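Whichever API handles the plumbing, the "parse" step itself can stay a small pure function that both a consumer callback and a Streams-style processor would call. A minimal Node.js sketch, assuming a hypothetical "METHOD path STATUS" payload format (adjust the regex to your real log format):

```javascript
// Hypothetical parser for one raw log line before saving it to the DB.
// Assumes a "METHOD /path STATUS" payload; this format is an assumption.
function parseLogMessage(raw) {
  const match = /^(\w+) (\S+) (\d{3})$/.exec(raw.trim());
  if (!match) return null; // unparseable lines can go to a dead-letter topic
  const [, method, path, status] = match;
  return { method, path, status: Number(status) };
}

// Example: parse the value of a consumed Kafka record.
const record = { value: "GET /index.html 200" };
console.log(parseLogMessage(record.value));
// { method: 'GET', path: '/index.html', status: 200 }
```

Keeping the parsing pure like this makes it trivial to unit-test and to reuse unchanged if you later switch from plain consumers to a Streams-style library.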

Apache Pulsar Java client taking too much memory (OOM)

I wrote a simple Apache Pulsar client with Spring Boot: a Pulsar producer initialized as a bean, which is used in the REST controller to publish incoming API messages to Pulsar, and a consumer that consumes messages, prints some values to the console, and acknowledges them.
As of now the application is very simple, but the moment this Spring Boot app loads I see a memory peak, at times resulting in an OOM. Is there any specific configuration to be used when using the Pulsar client with Spring Boot?
The code is mostly the one found in the Pulsar docs.
I am answering this to document the issue: do not use loops to consume messages; instead, use a MessageListener subscribed to the consumer via
consumer.messageListener(new Myconsumer())
or
consumer.messageListener((consumer, msg)->{//do something})
The docs didn't mention this, but I found it while browsing the consumer API.

How to send a message to ReactPHP/Amp/Swoole/etc. from PHP-FPM?

I'm thinking about making a worker script to handle async tasks on my server, using a framework such as ReactPHP, Amp or Swoole that would be running permanently as a service (I haven't made my choice between these frameworks yet, so solutions involving any of these are helpful).
My web endpoints would still be managed by Apache + PHP-FPM as normal, and I want them to be able to send messages to the permanently running script to make it aware that an async job is ready to be processed ASAP.
Pseudo-code from a web endpoint:
$pdo->exec('INSERT INTO Jobs VALUES (...)');
$jobId = $pdo->lastInsertId();
notify_new_job_to_worker($jobId); // how?
How do you typically handle communication from PHP-FPM to the permanently running script in any of these frameworks? Do you set up a TCP / Unix Socket server and implement your own messaging protocol, or are there ready-made solutions to tackle this problem?
Note: In case you're wondering, I'm not planning to use a third-party message queue software, as I want async jobs to be stored as part of the database transaction (either the whole transaction is successful, including committing the pending job, or the whole transaction is discarded). This is my guarantee that no jobs will be lost. If, worst case scenario, the message cannot be sent to the running service, missed jobs may still be retrieved from the database at a later time.
If your worker "runs permanently" as a service, it should provide some API to interact with. I use AmPHP in my project for async services, and my services implement HTTP/WebSocket servers (using Amp libraries) as the API transport.
Hey, ReactPHP core team member here. It totally depends on what your ReactPHP/Amp/Swoole process does. Looking at your example, my suggestion would be to use a message broker/queue like RabbitMQ. That way the process can pick a job up when it's ready for it and ack it when it's done. If anything happens to your process in the meantime and it dies, the broker will redeliver the message as long as it hasn't been acked. You could also expose a small HTTP API, but that doesn't guarantee reprocessing of messages on fatal failures. Ultimately it all depends on your design; all three projects are toolsets to build your own architectures and systems, so it's all up to you.

Azure Eventhub Apache Storm issue

I followed this article to try Event Hubs with Apache Storm, but when I run the Storm topology it receives events for a minute and then stops receiving. When I restart my program, it receives the remaining messages. Every time I run the program, it stops receiving from Event Hubs after about a minute. Please help me with the possible causes of this issue.
Should I change any configuration in Storm or ZooKeeper?
The above jar contains a fix for a known issue in the QPID JMS client, which is used by the Event Hub spout implementation. When the service sends an empty frame (heartbeat to keep connection alive), a decoding error occurs in the client and that causes the client to stop processing commands. Details of this issue can be found here: https://issues.apache.org/jira/browse/QPID-6378
