Query a Kafka Topic with nodejs - node.js

I'm a bit puzzled. Is there really no Node.js API to query Kafka topics (e.g. as with Kafka Streams and the Java API)? Am I missing something?
Just to get this straight: only being notified of the latest event/record of a topic is not enough for me. I want to query and process the topic's records, and then maybe store an aggregate in another topic.
Thanks for your thoughts on whether this is possible with Node.js and a library only.

Here is what worked for me and what most people use.
Limited solution
If you insist on a Node.js library to wrap things up for you: https://nodefluent.github.io/kafka-streams/docs/
As of today it offers:
easy access streams
merge streams
split streams
Full blown solution
The easiest way to query Kafka (as of today; there are rumors that Confluent is creating more libraries, including one for Node.js) is the REST API. It is part of ksqlDB, and ksqlDB is part of the Confluent Platform, which also ships with ZooKeeper and a Kafka instance, which you probably already have. If you wonder how to install it:
It spins up in about a minute with the Docker Compose YAML file.
Run docker-compose up -d
Check the ports and running services with docker ps
Request the status from the REST API by sending a GET request to http://0.0.0.0:8088/. It returns the service information:
{
  "KsqlServerInfo": {
    "version": "6.2.0",
    "kafkaClusterId": "uOXfee3zQ76vCKBbREw1yg",
    "ksqlServiceId": "default_",
    "serverStatus": "RUNNING"
  }
}
Hope this spares some of you the initial research. And if we are lucky, there will be a wrapper library soon.
Then create a stream out of your topic and voila: you are ready to query your topic (through the stream) with the REST API. Since the REST API offers HTTP/2, you can also receive continuous updates as new records arrive in the stream. Use push queries for this. Pull queries close the connection once the result has been delivered.
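A minimal Node.js sketch of the REST approach described above, for illustration only. It assumes Node 18+ (for the built-in fetch), a ksqlDB server reachable at http://localhost:8088, and a hypothetical topic "orders" with a hypothetical stream "orders_stream"; none of these names come from the setup above. Statements such as CREATE STREAM go to the /ksql endpoint and SELECT queries go to the /query endpoint.

```javascript
// Sketch only: server address, topic, stream and column names are assumptions.
const KSQL_CONTENT_TYPE = 'application/vnd.ksql.v1+json';

// Build the URL and fetch options for a ksqlDB REST call.
// path is '/ksql' for statements (CREATE STREAM ...) and '/query' for SELECTs.
function buildKsqlRequest(baseUrl, path, statement, streamsProperties = {}) {
  return {
    url: new URL(path, baseUrl).toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': KSQL_CONTENT_TYPE, Accept: KSQL_CONTENT_TYPE },
      body: JSON.stringify({ ksql: statement, streamsProperties }),
    },
  };
}

// Send the request and return the raw response text (parsing is left to the caller).
async function postKsql(baseUrl, path, statement, streamsProperties) {
  const { url, options } = buildKsqlRequest(baseUrl, path, statement, streamsProperties);
  const res = await fetch(url, options);
  const text = await res.text();
  if (!res.ok) throw new Error(`ksqlDB returned ${res.status}: ${text}`);
  return text;
}

async function main() {
  const base = 'http://localhost:8088';
  // 1. Register a stream on top of the existing topic.
  await postKsql(base, '/ksql',
    "CREATE STREAM orders_stream (id VARCHAR, amount DOUBLE) " +
    "WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');");
  // 2. Read the first five records from the beginning of the topic.
  //    The LIMIT makes the query terminate; it waits until five rows exist.
  const rows = await postKsql(base, '/query',
    'SELECT id, amount FROM orders_stream EMIT CHANGES LIMIT 5;',
    { 'ksql.streams.auto.offset.reset': 'earliest' });
  console.log(rows);
}

// main().catch(console.error); // uncomment once a ksqlDB server is running
```

Returning the raw text keeps the sketch independent of the exact response framing; for long-running push queries you would read the response body incrementally instead of waiting for it to finish.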

Related

Periodic refresh of static data in Structured Streaming and stateful streaming

I am trying to implement 5-minute batch monitoring using Spark Structured Streaming. The job reads from Kafka, looks up two different static datasets (one huge, one smaller) as part of the ETL logic, and calls a REST API to send the final results to an external application (out of billions of records from Kafka, fewer than 100 will go out to the REST API after ETL).
How can I refresh the static lookups without restarting the whole streaming application? (Register a StreamingQueryListener via StreamingQueryManager.addListener and put our own logic for refreshing/recreating the static DataFrame around StreamingQuery.awaitTermination? Use persist and unpersist on the cache? Any other better ideas?)
Note: I went through the article below, but I am not sure whether HBase is the better option, as the article is an old one.
https://medium.com/#anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
Once a record is enriched with lookup information and some rules/conditions have been applied, we need to keep track of it and send updates via the REST API until the event completes its lifecycle according to custom logic. I am hoping a flatMapGroupsWithState implementation helps here to keep track of event state. Please suggest the best options.
Managing group state in HDFS vs. using HBase: please suggest the best option from an operationalization and monitoring point of view in a production environment where the support team has minimal knowledge of Spark. If we use HDFS for state maintenance, how do we keep event state tracking consistent when the REST API fails to send updates to the end user/system?

Apache Pulsar Java client taking too much memory (OOM)

I wrote a simple Apache Pulsar client with Spring Boot: a Pulsar producer initialized as a bean, which the REST controller uses to publish incoming API messages to Pulsar, and a consumer that consumes messages, prints some values to the console, and acknowledges them.
As of now the application is very simple, but the moment this Spring Boot app loads I see a memory peak, at times ending in an OOM. Is there any specific configuration to use when running the Pulsar client with Spring Boot?
The code is mostly the one found in the Pulsar docs.
I am answering this to document the issue: do not use loops to consume messages. Instead, register a MessageListener when building the consumer (messageListener is a method of the ConsumerBuilder returned by client.newConsumer(), so consumer below refers to that builder) via
consumer.messageListener(new Myconsumer())
or
consumer.messageListener((c, msg) -> { /* do something */ })
The docs didn't mention this, but I found it while browsing the consumer API.

How to implement something similar to Storm DRPC in Samza?

I have a Samza job with a number of tasks, each of which holds some state in its embedded store. I want to expose this store for reading to the outside world via some kind of RPC mechanism. What would be the best solution for this?
Here is one paragraph in Samza documentation about it:
Samza does not currently have an equivalent API to DRPC,
but you can build it yourself using Samza’s stream
processing primitives.
The only solution that comes to my mind is to have my tasks, in addition to their normal processing, consume request messages carrying correlation IDs from a special request topic and put response messages with the same correlation IDs into a special response topic. So it's an RPC-over-Kafka solution, which seems suboptimal to me.
Any thoughts are welcome!
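To make the correlation-ID idea from the question concrete, here is a small in-memory sketch in Node.js. It is not Samza code: an EventEmitter stands in for the request and response topics, and the store contents, channel names, and the query helper are all hypothetical. It only illustrates how a caller can match responses to requests by correlation ID.

```javascript
const { EventEmitter } = require('node:events');
const { randomUUID } = require('node:crypto');

// Stand-in for the two topics ("requests" and "responses"); an assumption for illustration.
const bus = new EventEmitter();

// "Task" side: owns the local store and answers lookups it receives.
const store = new Map([['user:1', { clicks: 3 }]]);
bus.on('requests', ({ correlationId, key }) => {
  const value = store.has(key) ? store.get(key) : null;
  bus.emit('responses', { correlationId, value });
});

// Client side: remember pending requests by correlation ID.
const pending = new Map();
bus.on('responses', ({ correlationId, value }) => {
  const settle = pending.get(correlationId);
  if (settle) {
    pending.delete(correlationId);
    settle(value);
  }
});

// Send a request and resolve when the matching response arrives (or time out).
function query(key, timeoutMs = 1000) {
  return new Promise((resolve, reject) => {
    const correlationId = randomUUID();
    const timer = setTimeout(() => {
      pending.delete(correlationId);
      reject(new Error(`no response for ${key}`));
    }, timeoutMs);
    pending.set(correlationId, (value) => {
      clearTimeout(timer);
      resolve(value);
    });
    bus.emit('requests', { correlationId, key });
  });
}

// Usage: query('user:1').then(console.log);
```

With real topics, the request would also have to reach the task that owns the key's partition, and the timeout covers the case where no task answers.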
As far as I remember, the embedded store is backed up in a Kafka topic: when you set something in the store, a message is produced to that topic. Thus you can consume this topic and "clone" the embedded store into a different database, and then query the database. Or you can use the database directly instead of the embedded store, but this approach could lead to performance issues in your Samza job...

How to connect pinoccio to apache couchdb

Is anyone here using the nice Pinoccio from www.pinocc.io?
I want to use it to post data into an Apache CouchDB using node.js. So I'm trying to poll data from the Pinoccio API, but I'm a little lost about whether to:
schedule the polls
do long polls
take a completely different approach
Any ideas are welcome.
Pitt
Sure. I wrote the Pinoccio API; here's how you do it:
https://gist.github.com/soldair/c11d6ae6f4bead140838
This example depends on the pinoccio npm module ~0.1.3, so make sure to run npm install again to pick up the newest version.
You don't need to poll, because Pinoccio will send you changes as they happen if you have an open connection to either "stats" or "sync". If you want to poll you can, but it's not "real time".
sync gives you the current state and then streams changes as they happen, so it's perfect if you only need to save the changes to your troop while your script is running, or to show the current and last known state on a web page.
The solution that replicates every data point we store is stats. This is the example provided. Stats lets you read everything that has happened to a scout. Digital pins, for example, are the "digital" report. You can ask for data from a specific point in time or just from the current time (the default). Changes to this "digital" report will continue streaming live as they happen until the "end" time is reached, or if "tail" equals 0 in the options passed to stats.
Hope this helps. I tested the script on my local Couch and it worked well. You would need to modify it to copy more stats from each scout. I hope that soon you will be able to request multiple reports from multiple scouts in the same stream; I just have some bugs to sort out ;)
You need to look into two dimensions:
node.js talking to CouchDB. This is well understood, and you can find several existing questions about it here.
Getting the data from the Pinoccio. The API suggests that as long as the connection is open, you get data. So use a short timeout and a loop. You might want to run your own node.js instance for that.
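For the first dimension, a rough sketch of writing one received data point to CouchDB over its HTTP API. It assumes Node 18+ (built-in fetch), a CouchDB URL and database name supplied by you, and a hypothetical event shape ({ troop, scout, type, time, value }); the real Pinoccio event fields may differ, so adapt the mapping to what the stream actually delivers.

```javascript
// Turn one event into a CouchDB document.
// The event shape is hypothetical; adjust it to the data you actually receive.
function toCouchDoc(event, receivedAt = Date.now()) {
  const { troop, scout, type, time = receivedAt, value } = event;
  return {
    // Deterministic id: storing the same data point twice yields a conflict, not a duplicate.
    _id: `${troop}:${scout}:${type}:${time}`,
    troop,
    scout,
    type,
    time,
    value,
  };
}

// Store the document with PUT /{db}/{docid}.
// couchUrl (e.g. 'http://localhost:5984') and db are assumptions supplied by the caller.
async function saveDoc(couchUrl, db, doc) {
  const url = `${couchUrl}/${encodeURIComponent(db)}/${encodeURIComponent(doc._id)}`;
  const res = await fetch(url, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc),
  });
  // 409 Conflict means a document with this id already exists; treat it as already stored.
  if (!res.ok && res.status !== 409) {
    throw new Error(`CouchDB returned ${res.status}`);
  }
  return res.status;
}

// Usage inside whatever callback receives the data:
// saveDoc('http://localhost:5984', 'pinoccio', toCouchDoc(event)).catch(console.error);
```

Deriving the document id from the data point is a design choice that makes re-delivery after a reconnect harmless; if you prefer CouchDB-generated ids, POST the document to the database URL instead.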
Interesting fact: the CouchDB team seems to work on replacing their internal JS engine with node.js

Using Apache Kafka for log aggregation

I am learning Apache Kafka from their quickstart tutorial: http://kafka.apache.org/documentation.html#quickstart. Up to now, I have done the setup as follows: a producer node, where a web server is running on port 8888, and a Kafka server (broker), consumer, and ZooKeeper instance on another node. I have tested the default console/file-enabled producer and consumer with 3 partitions. The setup works, and I am able to see the messages I sent in the order they were created (within each partition).
Now I want to send the logs generated by the web server to the Kafka broker. These messages will be processed by the consumer later. Currently I am using syslog-ng to capture the server logs in a text file. I have come up with 3 rough ideas on how to implement a producer that uses Kafka for log aggregation.
Producer Implementations
First Kind: Listen on the TCP port of syslog-ng. Fetch each message and send it to the Kafka server. Here we have two middle processes: the producer and syslog-ng.
Second Kind: Use syslog-ng as the producer. I would need to find a way to send messages to the Kafka server instead of writing them to a file. Syslog-ng, the producer, is the only middle process.
Third Kind: Configure the web server itself as the producer.
Am I correct in my thinking? In the last case we don't have any middle process, but I suspect its implementation will affect server performance. Can anyone tell me the best way of using Apache Kafka (if the above 3 are not good) and guide me through the appropriate configuration of the server?
P.S.: I am using node.js for my web server
Thanks,
Sarath
Since you specify that you want to send the generated logs to the Kafka broker, running a separate process that listens and resends messages mainly creates another point of failure with no additional value (unless you need a specific syslog-ng capability).
Syslog-ng can send messages to external applications using the program() destination:
http://www.balabit.com/sites/default/files/documents/syslog-ng-ose-3.4-guides/en/syslog-ng-ose-v3.4-guide-admin/html/configuring-destinations-program.html. I don't know if there are other ways to do that.
For the third option, I am not sure Kafka can easily be integrated into Node.js, as it requires a C++ producer and, when I last looked for one, I was not able to find it. However, an easy alternative is to read the log file created by the server and send those logs to Kafka (using the console producer provided with Kafka). This is usually a good way, as it completely removes dependencies between Kafka and the web server (embedding the producer in the server would require error handling, configuration, etc.). It requires the use of tail --follow, and it works very well for us. If you want more details on that, I can include them as well. You would still need to supervise Kafka execution to make sure messages are not lost (and provide a recovery option to send failed messages offline). But the good thing about this method is that there is no dependency between the tools.
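Since the question mentions node.js, here is a hedged sketch of the same "follow the log file and forward it" idea as a small standalone Node.js process, separate from the web server. It is not the console-producer setup described above: the forwardLines helper, the topic name, and the batch size are hypothetical, and the send function is injected so the sketch does not depend on any particular Kafka client (it is expected to accept { topic, messages: [{ value }] }, which is the shape some Node.js Kafka clients use for their producer's send call).

```javascript
const readline = require('node:readline');

// Read newline-delimited log lines from a stream and hand them to send() in batches.
// send is any async function taking { topic, messages }; plug in your Kafka client's producer here.
async function forwardLines(input, send, { topic = 'weblogs', batchSize = 100 } = {}) {
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  let batch = [];
  let sent = 0;
  for await (const line of rl) {
    if (line.trim() === '') continue; // skip blank lines
    batch.push({ value: line });
    if (batch.length >= batchSize) {
      await send({ topic, messages: batch });
      sent += batch.length;
      batch = [];
    }
  }
  // Flush what is left when the input ends.
  if (batch.length > 0) {
    await send({ topic, messages: batch });
    sent += batch.length;
  }
  return sent;
}

// Usage sketch (the log path is an assumption):
// const { spawn } = require('node:child_process');
// const tail = spawn('tail', ['-f', '/var/log/webserver.log']);
// forwardLines(tail.stdout, (payload) => producer.send(payload), { batchSize: 1 });
```

With a followed file the input never ends, so a partially filled batch would wait indefinitely; a batch size of 1 or a timer-based flush avoids that. Error handling and retrying failed sends are left out here and would still be needed to avoid losing messages.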
Hope it helps...
Eran

Resources