Read specific messages using Apache Kafka and NodeJS - node.js

I want to build an API with NodeJS and Kafka, which can take an offset and a topic as an input and output the first 10 messages starting from the offset. I tried this approach with No-Kafka and Kafka-Node.
The consumer API provided by them allows consuming messages from a particular offset. I want to stop consuming the messages once I have read around 10 messages. But both the API calls will continue to fetch the messages till the last message. How can I stop doing that?
Here is my full, edited code:
var Kafka = require('no-kafka');
var express = require("express");
var app = express();
var producer = new Kafka.Producer();
producer.init().then(function() {
console.log("Producer Ready");
});
var consumer = new Kafka.SimpleConsumer();
consumer.init().then(function() {
console.log("Consumer Ready");
});
app.get('/produce/:topic/:msg', function(req, res) {
producer.send({
topic: req.params.topic,
partition: 0,
message: {
value: req.params.msg
}
});
res.send("Added: " + req.params.msg + " to topic: " + req.params.topic);
});
app.get('/consume/:topic/:off', function(req, res) {
console.log("Request for topic: " + req.params.topic + " Offset: " + req.params.off);
consumer.subscribe(req.params.topic, 0, {
offset: req.params.off,
maxBytes: 1000
}, function(messageSet, topic, partition) {
var msg = "";
var size = messageSet.length;
//console.log(messageSet);
messageSet.some(function(m) {
msg += m.message.value.toString('utf8') + " ";
if (parseInt(m.offset, 10) > parseInt(req.params.off, 10) + 10) {
return true;
}
});
res.send("Thank you " + size + " " + req.params.off + " " + msg);
});
});
app.listen(process.env.PORT);
Any response in this regard is appreciated.

You can't really stop consuming from Kafka so abruptly, for a couple of different reasons. For one thing, Kafka consumers, whether JavaScript or something else, don't read a message at a time -- they fetch batches of messages. I know with kafka-node it seems like they come in one at a time, since you get an EventEmitter event for each message. But under the hood, the client fetches them in batches.
The best you can do is keep track of your offsets as you go, ignore any messages that fall outside the range you want, and then unsubscribe from the topic or close the consumer to stop listening.
This definitely gets trickier with partitions -- you have to keep track of offsets relative to all of your partitions. I don't do the same thing you do -- my typical use case is to read from a point in time up to the current offset for each partition. So I have not optimized my partition reads to die out as soon as they hit their last offset. I do addTopics and add all the partitions at once. You, on the other hand, probably need to add the partitions one at a time -- i.e. do addTopic for a specific partition, read that partition until you find your offset, then ignore messages and removeTopic on the partition.
I believe I played around with that flow, and you might even have to stand up a new consumer for each partition, not to mention a whole new client.
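The "track offsets and ignore what is out of range" idea can be sketched independently of any Kafka client. The helper below is hypothetical (`createOffsetWindow` is not part of kafka-node or no-kafka); it assumes each fetched message carries an `offset` field, as in the question's code, and it only tells you when the window is complete. The actual unsubscribe or close call is left to the caller.

```javascript
// Hypothetical helper: keeps messages with offsets in
// [startOffset, startOffset + limit) and reports when the window is complete.
function createOffsetWindow(startOffset, limit) {
  const endOffset = startOffset + limit; // exclusive upper bound
  const messages = [];
  let done = false;
  return {
    messages,
    // Feed one fetched batch. Returns true once no further messages are needed,
    // which is the point where the caller would unsubscribe or close the consumer.
    accept(batch) {
      for (const m of batch) {
        const offset = Number(m.offset);
        if (offset < startOffset) continue; // before the requested range
        if (offset >= endOffset) { done = true; break; } // past the range
        messages.push(m);
        if (offset === endOffset - 1) { done = true; break; } // last wanted offset
      }
      return done;
    },
  };
}

// Two simulated batches: offsets 3..10, then 11..30.
const win = createOffsetWindow(5, 10);
const batch1 = Array.from({ length: 8 }, (_, i) => ({ offset: i + 3 }));
const batch2 = Array.from({ length: 20 }, (_, i) => ({ offset: i + 11 }));
console.log(win.accept(batch1)); // false
console.log(win.accept(batch2)); // true
console.log(win.messages.map((m) => m.offset)); // [5, 6, ..., 14]
```

If the partition holds fewer messages than requested, `accept` never returns true, so a real handler would also need a timeout.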

I was working on a similar project. What I did was set a timeout and define a contract: pass me the offset and the number of records you want. I fetch records starting from the offset you send, but I cannot guarantee the number of records returned; you may get fewer messages if the timeout fires first. The response also includes the offset of the last record read, so that you can call again from that offset.
As @David Griffin said, the problem is that we have to create a new client each time for each partition. Alternatively, store your data in a single partition and read from that partition.
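A minimal sketch of that contract, assuming a hypothetical `fetchBatch(offset)` function that resolves to an array of messages at or after `offset` (empty when nothing is available). It is not tied to a specific Kafka client, and the deadline is only checked between fetches, so a fetch that hangs would still need its own timeout.

```javascript
// Reads up to `count` messages starting at `startOffset`, giving up after
// `timeoutMs`. `fetchBatch` is a hypothetical stand-in for the real client call.
async function readUpTo(fetchBatch, startOffset, count, timeoutMs) {
  const deadline = Date.now() + timeoutMs;
  const messages = [];
  let lastOffset = null; // offset of the last record read, null if none
  let nextOffset = startOffset;
  while (messages.length < count && Date.now() < deadline) {
    const batch = await fetchBatch(nextOffset);
    if (batch.length === 0) {
      // Nothing available yet: wait briefly before polling again.
      await new Promise((resolve) => setTimeout(resolve, 10));
      continue;
    }
    for (const m of batch) {
      if (messages.length === count) break;
      messages.push(m);
      lastOffset = Number(m.offset);
      nextOffset = lastOffset + 1;
    }
  }
  return { messages, lastOffset };
}

// Demo against an in-memory log of 4 records, served 2 per fetch.
const demoLog = [0, 1, 2, 3].map((o) => ({ offset: o, value: `v${o}` }));
const demoFetch = async (from) => demoLog.filter((m) => m.offset >= from).slice(0, 2);
readUpTo(demoFetch, 1, 10, 50).then((r) => {
  console.log(r.messages.length, r.lastOffset); // fewer than 10: the timeout fired
});
```

The caller resumes from `lastOffset + 1` to avoid receiving the last record twice.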

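You can reduce maxBytes so that a fetch holds about 10 messages or fewer, depending on the size of your messages, or you can stop once the message offset reaches the requested offset plus 10, with something like this:
if (m.offset >= parseInt(req.params.off, 10) + 10) { /* return res.. */ }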

Related

How to manage massive calls to Postgresql in Node

I have a question regarding massive calls to PostgreSQL.
This is the scenario:
I have a simple Nodejs app that makes queries to PostgreSQL in a short period of time.
Everything is fine, but sometimes these calls get rejected due to Postgresql maximum pool connections setting, which is equal to 100.
I am thinking of a queue-based approach: add every query to a queue and consume one element per second, so that PostgreSQL receives one query per second.
But I don't know where to start. This is the part that gives me trouble: at some point I have a lot of calls, and I get lots of "ERROR IN QUERY EXECUTION" for the reason explained above.
const pool3 = new Pool(credentialsPostGres);
let res = [];
let sql_call = "select colum1 from table2 where x = y"; //the real query is a bit more complex, but you get the idea.
poll_query.query(sql_call,(err,results) => {
if (err) {
pool3.end();
console.log(err + " ERROR IN QUERY EXECUTION");
} else {
res.push({ data: Object.values(JSON.parse(JSON.stringify(results.rows))) });
pool3.end();
return callback(res,data);
}
})
How I should manage this part into a queue? I am a bit lost.
Help!

Any suggestions about how to publish a huge amount of messages within one round of request / response?

If I publish 50K messages using Promise.all like below:
const pubsub = new PubSub({ projectId: PUBSUB_PROJECT_ID });
const topic = pubsub.topic(topicName, {
batching: {
maxMessages: 1000,
maxMilliseconds: 100,
},
});
const n = 50 * 1000;
const dataBufs: Buffer[] = [];
for (let i = 0; i < n; i++) {
const data = `message payload ${i}`;
const dataBuffer = Buffer.from(data);
dataBufs.push(dataBuffer);
}
const tasks = dataBufs.map((d, idx) =>
topic.publish(d).then((messageId) => {
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
})
);
// publish messages concurrently
await Promise.all(tasks);
// send response to front-end
res.json(data);
I hit this issue: pubsub-emulator throws an error and the publisher throws "Retry total timeout exceeded before any response was received" when publishing 50k messages.
If I use a for loop with async/await, the issue is gone.
const n = 50 * 1000;
for (let i = 0; i < n; i++) {
const data = `message payload ${i}`;
const dataBuffer = Buffer.from(data);
const messageId = await topic.publish(dataBuffer)
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${i}`)
}
// some logic ...
// send response to front-end
res.json(data);
But this blocks the subsequent logic until all messages have been published, and publishing 50k messages takes a long time.
Any suggestions on how to publish a huge number of messages (about 50k) without blocking the execution of subsequent logic? Do I need to use child_process or a queue like bull to publish them in the background without blocking the request/response workflow of the API? I need to respond to the front-end as soon as possible; publishing the 50k messages should be a background task.
There seems to be an in-memory queue inside the @google-cloud/pubsub library, so I am not sure whether I should add another queue like bull.
The time it will take to publish large amounts of data depends on a lot of factors:
Message size. The larger the messages, the longer it takes to send them.
Network capacity (both of the connection between wherever the publisher is running and Google Cloud and, if relevant, of the virtual machine itself). This puts an upper bound on the amount of data that can be transmitted. It is not atypical to see smaller virtual machines with limits in the 40MB/s range. Note that if you are testing via Wifi, the limits could be even lower than this.
Number of threads and number of CPU cores. When having to run a lot of asynchronous callbacks, the ability to schedule them to run can be limited by the parallel capacity of the machine or runtime environment.
Typically, it is not a good idea to send 50,000 publishes simultaneously from one publisher instance. The factors above are likely to overload the client and result in deadline exceeded errors. The best way to prevent this is to limit the number of messages that can be outstanding for publish at one time. Some client libraries, such as the Java one, support this natively. The Node.js library does not yet support this feature, but likely will in the future.
In the meantime, you'd want to keep a counter of the number of messages outstanding and limit it to whatever the client seems to be able to handle. Start with 1000 and work up or down from there based on the results. A semaphore would be a pretty standard way to achieve this behavior. In your case the code would look something like this:
var sem = require('semaphore')(1000);
const tasks = dataBufs.map((d, idx) =>
  new Promise((resolve, reject) => {
    // take() runs the callback once one of the 1000 slots is free
    sem.take(() => {
      topic.publish(d)
        .then((messageId) => {
          console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
          resolve(messageId);
        })
        .catch(reject)
        // release the slot whether the publish succeeded or failed
        .finally(() => sem.leave());
    });
  })
);
// Await the actual publishes
await Promise.all(tasks);
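The same "limit the number of outstanding publishes" idea can also be written without an extra package. The sketch below is a generic helper, not part of the Pub/Sub client: `mapWithLimit` and `fakePublish` are hypothetical names, and the fake publish stands in for `topic.publish(d)`.

```javascript
// Runs `worker` over `items` with at most `limit` calls outstanding at once.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to start
  // Each lane picks up the next unstarted item as soon as its previous one settles.
  async function runLane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, runLane);
  await Promise.all(lanes);
  return results;
}

// Demo with a fake publish that records how many calls overlap.
let inFlight = 0;
let peak = 0;
const fakePublish = async (payload, idx) => {
  inFlight++;
  peak = Math.max(peak, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 1));
  inFlight--;
  return `id-${idx}`;
};
mapWithLimit([...Array(20).keys()], 5, fakePublish).then((ids) => {
  console.log(`published ${ids.length} messages, peak concurrency ${peak}`);
});
```

With the question's code, the call would look like `await mapWithLimit(dataBufs, 1000, (d) => topic.publish(d))`. In this sketch a rejected publish makes the returned promise reject, so retry handling would have to be added in `worker`.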

DynamoDB PutItem using all heap memory - NodeJS

I have a csv with over a million lines, I want to import all the lines into DynamoDB. I'm able to loop through the csv just fine, however, when I try to call DynamoDB PutItem on these lines, I run out of heap memory after about 18k calls.
I don't understand why this memory is being used or how I can get around this issue. Here is my code:
let insertIntoDynamoDB = async () => {
const file = './file.csv';
let index = 0;
const readLine = createInterface({
input: createReadStream(file),
crlfDelay: Infinity
});
readLine.on('line', async (line) => {
let record = parse(`${line}`, {
delimiter: ',',
skip_empty_lines: true,
skip_lines_with_empty_values: false
});
await dynamodb.putItem({
Item: {
"Id": {
S: record[0][2]
},
"newId": {
S: record[0][0]
}
},
TableName: "My-Table-Name"
}).promise();
index++;
if (index % 1000 === 0) {
console.log(index);
}
});
// halts process until all lines have been processed
await once(readLine, 'close');
console.log('FINAL: ' + index);
}
If I comment out the DynamoDB call, I can loop through the file just fine and read every line. Where is this memory usage coming from? My DynamoDB write throughput is at 500; adjusting this value has no effect.
For anyone trudging through the internet trying to find out why DynamoDB is consuming all the heap memory, there is a GitHub bug report here: https://github.com/aws/aws-sdk-js/issues/1777#issuecomment-339398912
Basically, the AWS SDK only has 50 sockets for making HTTP requests. If all sockets are in use, further requests are queued until a socket becomes available. When processing millions of requests, these sockets get consumed immediately, and then the queue builds up until it blows up the heap.
So, then how do you get around this?
Increase heap size
Increase number of sockets
Control how many "events" you are queueing
Options 1 and 2 are the easy way out, but they do not scale. They might work for a one-off job, but if you are trying to build a robust solution, you will want to go with option 3.
To do number 3, I determine the max heap size, and divide it by how large I think an "event" will be in memory. For example: I assume an updateItem event for dynamodb would be 100,000 bytes. My heap size was 4GB, so 4,000,000,000 B / 100,000 B = 40,000 events. However, I only take 50% of this many events to leave room on the heap for other processes that the node application might be doing. This percentage can be lowered/increased depending on your preference. Once I have the amount of events, I then read a line from the csv and consume an event, when the event has been completed, I release the event back into the pool. If there are no events available, then I pause the input stream to the csv until an event becomes available.
Now I can upload millions of entries to dynamodb without any worry of blowing up the heap.
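A rough sketch of option 3, under stated assumptions. It uses the answer's arithmetic for the budget, but instead of explicit pause/resume calls it relies on `for await`, which stops pulling lines while the loop is waiting for a free slot. `processWithBudget` and `handleLine` are hypothetical names; `handleLine` stands in for the `putItem` call.

```javascript
// Budget from the answer: 4 GB heap / 100,000 B per event, using 50% of it.
const heapBytes = 4_000_000_000;
const bytesPerEvent = 100_000; // the answer's estimate, not a measured value
const maxInFlight = Math.floor((heapBytes / bytesPerEvent) * 0.5); // 20000

// Consumes an async iterable of lines with at most `limit` requests in flight.
async function processWithBudget(lines, limit, handleLine) {
  const inFlight = new Set();
  const failures = [];
  let processed = 0;
  for await (const line of lines) {
    // No slot free: stop reading until one request settles.
    while (inFlight.size >= limit) {
      await Promise.race(inFlight);
    }
    const p = handleLine(line)
      .then(() => { processed++; }, (err) => { failures.push({ line, err }); })
      .finally(() => inFlight.delete(p)); // give the slot back
    inFlight.add(p);
  }
  await Promise.all(inFlight); // drain what is still running
  return { processed, failures };
}

// Demo with generated lines and a fake write.
async function* demoLines(n) {
  for (let i = 0; i < n; i++) yield `row-${i}`;
}
const fakeWrite = () => new Promise((resolve) => setTimeout(resolve, 1));
processWithBudget(demoLines(100), 10, fakeWrite).then((r) => {
  console.log(`processed ${r.processed}, failed ${r.failures.length}`);
});
```

With the question's code, `lines` could be the `readline` interface itself, since it is async iterable in current Node.js versions.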

Nodejs Cluster Architecture reading from single REDIS instance

I'm using Nodejs cluster module to have multiple workers running.
I created a basic Architecture where there will be a single MASTER process which is basically an express server handling multiple requests and the main task of MASTER would be writing incoming data from requests into a REDIS instance. Other workers(numOfCPUs - 1) will be non-master i.e. they won't be handling any request as they are just the consumers. I have two features namely ABC and DEF. I distributed the non-master workers evenly across features via assigning them type.
For example, on an 8-core machine:
1 will be the MASTER instance handling requests via the express server
The remaining (8 - 1 = 7) will be distributed evenly: 4 to feature:ABC and 3 to feature:DEF.
Non-master workers are basically consumers, i.e. they read from REDIS, to which only the MASTER worker can write data.
Here's the code for the same:
if (cluster.isMaster) {
// Fork workers.
for (let i = 0; i < numCPUs - 1; i++) {
ClusteringUtil.forkNewClusterWithAutoTypeBalancing();
}
cluster.on('exit', function(worker) {
console.log(`Worker ${worker.process.pid}::type(${worker.type}) died`);
ClusteringUtil.removeWorkerFromList(worker.type);
ClusteringUtil.forkNewClusterWithAutoTypeBalancing();
});
// Start consuming on server-start
ABCConsumer.start();
DEFConsumer.start();
console.log(`Master running with process-id: ${process.pid}`);
} else {
console.log('CLUSTER type', cluster.worker.process.env.type, 'running on', process.pid);
if (
cluster.worker.process.env &&
cluster.worker.process.env.type &&
cluster.worker.process.env.type === ServerTypeEnum.EXPRESS
) {
// worker for handling requests
app.use(express.json());
...
}
}
Everything works fine except consumers reading from REDIS.
Since there are multiple consumers for a particular feature, each one reads the same message and starts processing it individually, which is what I don't want. Suppose there are 4 consumers: 1 is marked as busy and cannot consume until it is free, and 3 are available. Once the MASTER writes a message for that feature to REDIS, all 3 available consumers of that feature start consuming it. This means that a single message is processed as many times as there are available consumers.
const stringifedData = JSON.stringify(req.body);
const key = uuidv1();
const asyncHsetRes = await asyncHset(type, key, stringifedData);
if (asyncHsetRes) {
await asyncRpush(FeatureKeyEnum.REDIS.ABC_MESSAGE_QUEUE, key);
res.send({ status: 'success', message: 'Added to processing queue' });
} else {
res.send({ error: 'failure', message: 'Something went wrong in adding to queue' });
}
The consumer simply accepts messages and stops when it is busy:
module.exports.startHeartbeat = startHeartbeat = async function(config = {}) {
if (!config || !config.type || !config.listKey) {
return;
}
heartbeatIntervalObj[config.type] = setInterval(async () => {
await asyncLindex(config.listKey, -1).then(async res => {
if (res) {
await getFreeWorkerAndDoJob(res, config);
stopHeartbeat(config);
}
});
}, HEARTBEAT_INTERVAL);
};
Ideally, a message should be read by only one consumer of that feature. After consuming, that consumer is marked as busy so it won't consume again until it is free (I have handled this). The next message should then be processed by only one of the remaining available consumers.
Please help me tackle this problem. Again, I want each message to be read by only one free consumer, while the other free consumers wait for a new message.
Thanks
I'm not sure I fully understand your Redis consumer architecture, but I feel it works against the way Redis is meant to be used. What you're trying to achieve is essentially queue-based messaging with the ability to acknowledge a message once it is done.
Redis has its own pub/sub feature, but it is built on a fire-and-forget principle. It doesn't distinguish between consumers: it just sends the data to all of them, assuming it is their job to handle the incoming data.
I recommend using a queue server like RabbitMQ. You can achieve your goal with features that AMQP 0-9-1 supports, such as message acknowledgment and consumer prefetch count. You can configure your cluster flexibly, for example: I want X consumers, each handling 1 unique (!) message at a time, and each receiving a new one only after telling the server (RabbitMQ) that it has finished processing the previous message. This is highly configurable and robust.
However, if you prefer a fully managed service, so that you don't have to provision virtual machines or anything else to run a message queue server, you can use AWS SQS. It has a broadly similar API and feature list.
Hope it helps!

node-kafka pause() method on consumer. Any working version?

I can't make following code to work:
"use strict";
let kafka = require('kafka-node');
var conf = require('./providers/Config');
let client = new kafka.Client(conf.kafka.connectionString, conf.kafka.clientName);
let consumer = new kafka.HighLevelConsumer(client, [ { topic: conf.kafka.readTopic } ], { groupId: conf.kafka.clientName, paused: true });
let threads = 0;
consumer.on('message', function(message) {
threads++;
if (threads > 10) consumer.pause();
if (threads > 50) process.exit(1);
console.log(threads + " >>> " + message.value);
});
consumer.resume();
I see 50 messages in console and process exits by termination statement.
What I'm trying to understand is whether my code is broken or the package is broken. Or maybe I'm just doing something wrong? Has anyone been able to make a Kafka consumer work with pause/resume? I tried several versions of kafka-node, but all of them behave the same way. Thanks!
You are already using pause and resume in your code, so obviously they work. ;)
That's because pause doesn't pause the consumption of messages; it pauses the fetching of messages. I'm guessing the first 50 were already fetched in one go before you received the first message and called pause.
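Because messages that were already fetched keep arriving after pause(), the handler itself has to enforce the limit. A small sketch of such a guard follows; `limitMessages` is a hypothetical wrapper, not a kafka-node API, and it only decides which messages reach your code. It does not deal with committing or rewinding offsets for the messages it drops.

```javascript
// Wraps a message handler so that only the first `limit` messages are processed.
// `onLimitReached` fires once, which is where consumer.pause() would go.
function limitMessages(limit, onMessage, onLimitReached) {
  let seen = 0;
  return function handler(message) {
    if (seen >= limit) return; // already fetched, but beyond the limit: drop it
    seen++;
    onMessage(message);
    if (seen === limit) onLimitReached();
  };
}

// Demo: 50 messages arrive from one fetch, only 10 are processed.
let processedCount = 0;
let pauseCalls = 0;
const guarded = limitMessages(10, () => { processedCount++; }, () => { pauseCalls++; });
for (let i = 0; i < 50; i++) guarded({ value: `message ${i}` });
console.log(processedCount, pauseCalls); // 10 1
```

With the question's consumer this would be wired up as `consumer.on('message', limitMessages(10, handle, () => consumer.pause()))`, where `handle` is your own processing function.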
For kicks, I just tested pause() and resume() in the Node REPL and they work as expected:
var kafka = require('kafka-node');
var client = new kafka.Client('localhost:2181');
var consumer = new kafka.HighLevelConsumer(client, [{topic: 'MyTest'}]);
consumer.on('message', (msg) => { console.log(JSON.stringify(msg)) });
Then I go into another window and run:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic MyTest
And type some stuff, and it shows up in the first window. Then in the first window, I type consumer.pause(); and type some more in the second window. Nothing appears in the first window. Then I run consumer.resume() in the first window, and the delayed messages appear.
BTW, you should be able to play with the Kafka config property fetch.message.max.bytes and control how many messages can be fetched at one time. For example, if you had fixed-width messages of 500 bytes, set fetch.message.max.bytes to something less than 1000 (but greater than 500!) to only receive a single message per fetch. But note that this might not fix the problem entirely -- I am fairly new to Node, but it is asynchronous, and I suspect a second fetch could get kicked off before you processed the first fetch completely (or at all).
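The arithmetic behind that suggestion can be written down as a one-liner. This is a simplification that assumes fixed-width messages and ignores any per-message protocol overhead; `wholeMessagesPerFetch` is a hypothetical helper, not a client API.

```javascript
// How many complete fixed-width messages fit into one fetch of `fetchMaxBytes`.
function wholeMessagesPerFetch(fetchMaxBytes, messageBytes) {
  return Math.floor(fetchMaxBytes / messageBytes);
}

console.log(wholeMessagesPerFetch(999, 500));  // 1: a single message per fetch
console.log(wholeMessagesPerFetch(1000, 500)); // 2: the limit is too high
console.log(wholeMessagesPerFetch(499, 500));  // 0: the limit is too low for any message
```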
