kafka-node asynchronous consumer handler - node.js

This is how my consumer is initialised:
const client = new kafka.Client(config.ZK_HOST)
const consumer = new kafka.Consumer(client, [{ topic: config.KAFKA_TOPIC, offset: 0 }], {
  autoCommit: false
})
The handler is then attached like this: consumer.on('message', message => applyMessage(message))
The thing is, applyMessage talks to the database using knex. The code looks something like:
async function applyMessage(message: kafka.Message) {
  const usersCount = await db('users').count()
  // just assume we ABSOLUTELY need to calculate a number of users,
  // so we need previous state
  await db('users').insert(inferUserFromMessage(message))
}
The code above makes applyMessage execute in parallel for all the messages in Kafka. So, given that there are no users in the database yet, usersCount will ALWAYS be 0, even for the second message from Kafka, where it should already be 1 because the first call to applyMessage inserts a user.
How do I "synchronise" the code so that all the applyMessage calls run sequentially?

You'll need to implement some sort of Mutex: basically a class which queues up things to execute synchronously. Example:
var Mutex = function() {
  this.queue = [];
  this.locked = false;
};

Mutex.prototype.enqueue = function(task) {
  this.queue.push(task);
  if (!this.locked) {
    this.dequeue();
  }
};

Mutex.prototype.dequeue = function() {
  this.locked = true;
  const task = this.queue.shift();
  if (task) {
    this.execute(task);
  } else {
    this.locked = false;
  }
};

Mutex.prototype.execute = async function(task) {
  // swallow errors so one failed task doesn't stall the queue
  try { await task(); } catch (err) { }
  this.dequeue();
}
In order for this to work, your applyMessage function (whichever function handles Kafka messages) needs to return a Promise. Notice also that the async has moved from the parent function into the Promise executor:
function applyMessage(message: kafka.Message) {
  return new Promise(async function(resolve, reject) {
    try {
      const usersCount = await db('users').count()
      // just assume we ABSOLUTELY need to calculate a number of users,
      // so we need previous state
      await db('users').insert(inferUserFromMessage(message))
      resolve();
    } catch (err) {
      reject(err);
    }
  });
}
Finally, each invocation of applyMessage needs to be added to the Mutex queue instead of called directly:
var mutex = new Mutex();
consumer.on('message', message => mutex.enqueue(function() { return applyMessage(message); }))
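A lighter-weight alternative (a minimal sketch, not from the original answer) is to serialize the handlers by chaining every message onto a single promise; the catch keeps one failed message from stalling the chain:
let chain = Promise.resolve();
consumer.on('message', message => {
  // each handler starts only after the previous one settles
  chain = chain
    .then(() => applyMessage(message))
    .catch(err => console.error('applyMessage failed', err));
});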

Related

Trigger the execution of a function if any condition is met

I'm writing an HTTP API with expressjs in Node.js and here is what I'm trying to achieve:
I have a task that I would like to run regularly, approximately every minute. This task is implemented with an async function named task.
In reaction to a call in my API, I would like to have that task run immediately as well.
Two executions of the task function must not be concurrent: each execution should run to completion before another is started.
The code looks like this:
// only a single execution of this function is allowed at a time
// which is not the case with the current code
async function task(reason: string) {
  console.log("do thing because %s...", reason);
  await sleep(1000);
  console.log("done");
}

// call task regularly
setIntervalAsync(async () => {
  await task("ticker");
}, 5000) // normally 1min

// call task immediately
app.get("/task", async (req, res) => {
  await task("trigger");
  res.send("ok");
});
I've put a full working sample project at https://github.com/piec/question.js
If I were in Go this would be easy, but I don't know how to do it with Node.js.
Ideas I have considered or tried:
I could apparently put task in a critical section using a mutex from the async-mutex library, but I'm not too fond of adding mutexes in js code (a sketch of this appears after this list).
Many people seem to be using message queue libraries with worker processes (bee-queue, bullmq, ...), but this usually adds a dependency on an external service like Redis. Also, if I'm correct, the code would be a bit more complex because I'd need a main entrypoint plus an entrypoint for worker processes, and you can't share objects with the workers as easily as in a "normal" single-process situation.
I have tried an RxJS Subject in order to make a producer/consumer channel, but I was not able to limit the execution of task to one at a time (task is async).
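For reference, the async-mutex idea mentioned above would look roughly like this (a minimal sketch, assuming the library's Mutex and its runExclusive method):
const { Mutex } = require('async-mutex');
const taskMutex = new Mutex();

// runExclusive queues the callback until the mutex is free,
// so concurrent calls to task() are serialized
async function task(reason) {
  await taskMutex.runExclusive(async () => {
    console.log("do thing because %s...", reason);
    await sleep(1000);
    console.log("done");
  });
}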
Thank you!
You can make your own serialized asynchronous queue and run the tasks through that.
This queue uses a flag to keep track of whether it's in the middle of running an asynchronous operation already. If so, it just adds the task to the queue and will run it when the current operation is done. If not, it runs it now. Adding it to the queue returns a promise so the caller can know when the task finally got to run.
If the tasks are asynchronous, they are required to return a promise that is linked to the asynchronous activity. You can mix in non-asynchronous tasks too and they will also be serialized.
class SerializedAsyncQueue {
  constructor() {
    this.tasks = [];
    this.inProcess = false;
  }

  // adds a promise-returning function and its args to the queue
  // returns a promise that resolves when the function finally gets to run
  add(fn, ...args) {
    let d = new Deferred();
    this.tasks.push({ fn, args, deferred: d });
    this.check();
    return d.promise;
  }

  check() {
    if (!this.inProcess && this.tasks.length) {
      // run next task
      this.inProcess = true;
      const nextTask = this.tasks.shift();
      Promise.resolve(nextTask.fn(...nextTask.args)).then(val => {
        this.inProcess = false;
        nextTask.deferred.resolve(val);
        this.check();
      }).catch(err => {
        console.log(err);
        this.inProcess = false;
        nextTask.deferred.reject(err);
        this.check();
      });
    }
  }
}
const Deferred = function() {
  if (!(this instanceof Deferred)) {
    return new Deferred();
  }
  const p = this.promise = new Promise((resolve, reject) => {
    this.resolve = resolve;
    this.reject = reject;
  });
  this.then = p.then.bind(p);
  this.catch = p.catch.bind(p);
  if (p.finally) {
    this.finally = p.finally.bind(p);
  }
}

let queue = new SerializedAsyncQueue();

// utility function
const sleep = function(t) {
  return new Promise(resolve => {
    setTimeout(resolve, t);
  });
}
// only a single execution of this function is allowed at a time
// so it is run only via the queue that makes sure it is serialized
async function task(reason: string) {
  async function runIt() {
    console.log("do thing because %s...", reason);
    await sleep(1000);
    console.log("done");
  }
  return queue.add(runIt);
}

// call task regularly
setIntervalAsync(async () => {
  await task("ticker");
}, 5000) // normally 1min

// call task immediately
app.get("/task", async (req, res) => {
  await task("trigger");
  res.send("ok");
});
Here's a version using RxJS#Subject that is almost working. How to finish it depends on your use-case.
async function task(reason: string) {
  console.log("do thing because %s...", reason);
  await sleep(1000);
  console.log("done");
}

const run = new Subject<string>();
const effect$ = run.pipe(
  // Limit one task at a time
  concatMap(task),
  share()
);

const effectSub = effect$.subscribe();

interval(5000).subscribe(_ =>
  run.next("ticker")
);

// call task immediately
app.get("/task", async (req, res) => {
  effect$.pipe(
    take(1)
  ).subscribe(_ =>
    res.send("ok")
  );
  run.next("trigger");
});
The issue here is that res.send("ok") is linked to the effect$ stream's next emission, which may not be the one generated by the run.next you're about to call.
There are many ways to fix this. For example, you can tag each emission with an ID and then wait for the corresponding emission before using res.send("ok").
There are better ways too if calls distinguish themselves naturally.
A Clunky ID Version
Generating an ID randomly is a bad idea, but it gets the general thrust across. You can generate unique IDs however you like. They can be integrated directly into the task somehow or can be kept 100% separate the way they are here (task itself has no knowledge that it's been assigned an ID before being run).
interface IdTask {
  taskId: number,
  reason: string
}
interface IdResponse {
  taskId: number,
  response: any
}

async function task(reason: string) {
  console.log("do thing because %s...", reason);
  await sleep(1000);
  console.log("done");
}

const run = new Subject<IdTask>();
const effect$: Observable<IdResponse> = run.pipe(
  // concatMap only allows one observable at a time to run
  concatMap((eTask: IdTask) => from(task(eTask.reason)).pipe(
    map((response: any) => ({
      taskId: eTask.taskId,
      response
    }) as IdResponse)
  )),
  share()
);

const effectSub = effect$.subscribe({
  next: v => console.log("This is a shared task emission: ", v)
});

interval(5000).subscribe(num =>
  run.next({
    taskId: num,
    reason: "ticker"
  })
);

// call task immediately
app.get("/task", async (req, res) => {
  const randomId = Math.random();
  effect$.pipe(
    filter(({taskId}) => taskId == randomId),
    take(1)
  ).subscribe(_ =>
    res.send("ok")
  );
  run.next({
    taskId: randomId,
    reason: "trigger"
  });
});

How to call a custom function synchronously in node js?

I am developing a server project which needs to call some functions synchronously. Currently I am calling them in an asynchronous fashion. I found some similar questions on StackOverflow, but I can't understand how to apply those solutions to my code. I tried using async/await and ended up with the error: The 'await' operator can only be used in an 'async' function.
Here is my implementation
function findSuitableRoom(_lecturer, _sessionDay, _sessionTime, _sessionDuration, _sessionType) {
  let assignedRoom = selectRoomByLevel(preferredRooms, lecturer.level, _sessionDuration); // <------- needs to be called synchronously
  if (!assignedRoom.success) {
    let rooms = getRooms(_sessionType); // <------- needs to be called synchronously
    assignedRoom = assignRoom(_lecturer.rooms, _sessionDuration, _lecturer.level);
  } else {
    arr_RemovedSessions.push(assignedRoom.removed)
  }
  return assignedRoom;
}
function getRooms(type) {
  switch (type) {
    case 'Tutorial': type = 'Lecture hall'
      break;
    case 'Practical': type = 'Lab'
      break;
    default: type = 'Lecture hall'
  }
  Rooms.find({ type: type },
    (err, rooms) => {
      if (!err) {
        console.log('retrieved rooms ' + rooms)
        return rooms;
      }
    })
}
I have provided only two methods here because the full implementation is very long, and I feel that if I understand how to apply a synchronous approach to one method, I can manage the rest. Can someone please help me?
Well yes, await is only available inside an async function, so put async in front of findSuitableRoom.
You also made a classic mistake: you use return inside a callback function and expect getRooms to return that value.
async function findSuitableRoom(
  _lecturer,
  _sessionDay,
  _sessionTime,
  _sessionDuration,
  _sessionType
) {
  let assignedRoom = selectRoomByLevel(
    preferredRooms,
    lecturer.level,
    _sessionDuration
  );
  if (!assignedRoom.success) {
    try {
      let rooms = await getRooms(_sessionType);
    } catch (err) {
      console.log("no rooms found");
    }
    assignedRoom = assignRoom(
      _lecturer.rooms,
      _sessionDuration,
      _lecturer.level
    );
  } else {
    arr_RemovedSessions.push(assignedRoom.removed);
  }
  return assignedRoom;
}
Also wrap it in a try / catch.
Since .find() returns a promise if you don't pass a callback, you can write it like this:
function getRooms(type) {
  switch (type) {
    case "Tutorial":
      type = "Lecture hall";
      break;
    case "Practical":
      type = "Lab";
      break;
    default:
      type = "Lecture hall";
  }
  return Rooms.find({ type });
}
Note that findSuitableRoom is no longer synchronous. It's async and returns a promise, which means you need to call the function like this:
findSuitableRoom(/* args */).then(res => { console.log(res); })
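Or equivalently, from inside another async function (the argument names here are placeholders, not from the original post):
// inside some other async function
const assignedRoom = await findSuitableRoom(lecturer, day, time, duration, type);
console.log(assignedRoom);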
The 'await' operator can only be used in an 'async' function
This means that whenever you want to use the await keyword, it needs to be inside a function marked with the async keyword (i.e. one that returns a promise).
Read https://javascript.info/async-await for more info.
const add = (a, b) => {
  return new Promise((resolve, reject) => {
    setTimeout(() => { // to make it asynchronous
      if (a < 0 || b < 0) {
        return reject("don't need negative");
      }
      resolve(a + b);
    }, 2000);
  });
};

const jatin = async () => {
  try {
    const sum = await add(10, 5);
    const sum2 = await add(sum, -100);
    const sum3 = await add(sum2, 1000);
    return sum3;
  } catch (e) {
    console.log(e);
  }
};

jatin()
Try this. Let's walk through it:
add is a normal function that performs some asynchronous action, like waiting for 2 seconds.
await can only be used inside an async function, so we wrap the calls in the async function jatin and await each add call. That makes the calls sequential: until the first await add() call completes, the next await add() call won't execute.
Example code, if you want to use this in your app.js:
router.post("/users/login", async (req, res) => {
  try {
    const user = await User.findByCredentials(
      req.body.email,
      req.body.password
    );
    const token = await user.generateToken();
    res.status(200).send({
      user,
      token,
    });
  } catch (error) {
    console.log(error);
    res.status(400).send();
  }
});

Async.queue crashes after processing 3 initial elements of the queue with concurrency = 3

Async.queue() initially runs as expected but crashes after processing the first N elements (N = 3).
When adding callback() after running getAddress(), concurrency is totally ignored; getAddress() then runs for all tasks passed to the queue via the stream.
The problem arose when attempting to build upon this tutorial.
I'm trying to determine the root cause and a solution. It seems possible that this is related to promise chaining?
I have attempted to refactor async.queue() following the async docs, but it appears that the syntax is out of date, and I can't find a working example with chained promises.
const { csvFormat } = require('d3-dsv');
const Nightmare = require('nightmare');
const { readFileSync, writeFileSync } = require('fs');
const numbers = readFileSync('./tesco-title-numbers.csv',
  { encoding: 'utf8' }).trim().split('\n');
const START = 'https://eservices.landregistry.gov.uk/wps/portal/Property_Search';
var async = require("async")
console.log(numbers)

// create a read stream
var ArrayStream = require('arraystream')
var stream = ArrayStream.create(numbers)

// set concurrency
N = 3

var q = async.queue(function (task, callback) {
  let data = getAddress(task)
  // , function(){
  //   callback();
  // },
}, N);

q.drain = function() {
  stream.resume()
  console.log('all items have been processed');
  resolve()
}
// or await the end
// await q.drain()

q.saturated = function() {
  stream.pause();
}

// assign an error callback
q.error = function(err, task) {
  console.error('task experienced an error');
}

stream.on("data", function(data) {
  // console.log(data);
  q.push(data)
})
var getAddress = async id => {
  console.log(`Now checking ${id}`);
  const nightmare = new Nightmare({ show: true });
  // Go to initial start page, navigate to Detail search
  try {
    await nightmare
      .goto(START)
      .wait('.bodylinkcopy:first-child')
      .click('.bodylinkcopy:first-child');
  } catch (e) {
    console.error(e);
  }
  // Type the title number into the appropriate box; click submit
  try {
    let SOMEGLOBALVAR;
    await nightmare
    // does some work
  } catch (e) {
    console.error(e);
    return undefined;
  }
};
Determined the cause of the problem: the promise from getAddress, along with the callback invocation, needs to be returned from the worker.
let dataArray = []
N = 4
var q = async.queue(async function (task, callback) {
  return getAddress(task).then((response) => {
    console.log(response);
    dataArray.push(response);
    callback()
  })
}, N);
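A hedged side note: in async v3, a worker declared as an async function should not also take and invoke a callback; completion is signaled by the returned promise. A minimal sketch, assuming async v3:
// async v3: async workers omit the callback entirely
const q = async.queue(async (task) => {
  const response = await getAddress(task);
  console.log(response);
  dataArray.push(response);
}, N);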

AWS Lambda function that executes 5000+ promises to AWS SQS is extremely unreliable

I'm writing a Node AWS Lambda function that queries around 5,000 items from my DB and sends them via messages into an AWS SQS queue.
My local environment involves me running my lambda with AWS SAM local, and emulating AWS SQS with GoAWS.
An example skeleton of my Lambda is:
async run() {
  try {
    const accounts = await this.getAccountsFromDB();
    const results = await this.writeAccountsIntoQueue(accounts);
    return 'I\'ve written: ' + results + ' messages into SQS';
  } catch (e) {
    console.log('Caught error running job: ');
    console.log(e);
    return e;
  }
}
There are no performance issues with my getAccountsFromDB() function and it runs almost instantly, returning me an array of 5,000 accounts.
My writeAccountsIntoQueue function looks like:
async writeAccountsIntoQueue(accounts) {
  // Extract the sqsClient and queueUrl from the class
  const { sqsClient, queueUrl } = this;
  try {
    // Create array of functions to concurrently call later
    let promises = accounts.map(acc => async () => await sqsClient.sendMessage({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify(acc),
      DelaySeconds: 10,
    }));
    // Invoke the functions concurrently, using helper function `eachLimit`
    let writtenMessages = await eachLimit(promises, 3);
    return writtenMessages;
  } catch (e) {
    console.log('Error writing accounts into queue');
    console.log(e);
    return e;
  }
}
My helper, eachLimit looks like:
async function eachLimit(funcs, limit) {
  let rest = funcs.slice(limit);
  await Promise.all(
    funcs.slice(0, limit).map(
      async (func) => {
        await func();
        while (rest.length) {
          await rest.shift()();
        }
      }
    )
  );
}
To the best of my understanding, it should be limiting concurrent executions to limit.
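As a quick sanity check (a hypothetical snippet, not from my actual code), three tasks with a limit of 2 should show at most two in flight at once:
// expect "start a", "start b", then "end a"/"start c" interleaved
const delayTask = (label, ms) => () => new Promise(resolve => {
  console.log('start', label);
  setTimeout(() => { console.log('end', label); resolve(); }, ms);
});
eachLimit([delayTask('a', 100), delayTask('b', 100), delayTask('c', 100)], 2);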
Additionally, I've wrapped the AWS SDK SQS client to return an object with a sendMessage function that looks like:
sendMessage(params) {
  const { client } = this;
  return new Promise((resolve, reject) => {
    client.sendMessage(params, (err, data) => {
      if (err) {
        console.log('Error sending message');
        console.log(err);
        return reject(err);
      }
      return resolve(data);
    });
  });
}
So nothing fancy there, just Promisifying a callback.
I've got my lambda set up to time out after 300 seconds. The lambda nearly always times out, and when it doesn't, it ends abruptly and misses some final logging that should occur, which makes me think it may even be erroring somewhere silently. When I check the SQS queue, I'm missing around 1,000 entries.
I can see a couple of issues in your code.
First:
let promises = accounts.map(acc => async () => await sqsClient.sendMessage({
  QueueUrl: queueUrl,
  MessageBody: JSON.stringify(acc),
  DelaySeconds: 10,
}));
You're abusing async / await. Always bear in mind that await waits until the promise is resolved before continuing. In this case, whenever you call one of the mapped function items, it waits for the promise wrapped by that function before continuing, which is bad. Since you're only interested in getting the promises back, you could simply do this instead:
const promises = accounts.map(acc => () => sqsClient.sendMessage({
  QueueUrl: queueUrl,
  MessageBody: JSON.stringify(acc),
  DelaySeconds: 10,
}));
Now, for the second part: your eachLimit implementation looks wrong and very verbose. I've refactored it with the help of es6-promise-pool to handle the concurrency limit for you:
const PromisePool = require('es6-promise-pool')

function eachLimit(promiseFuncs, limit) {
  const promiseProducer = function () {
    while (promiseFuncs.length) {
      const promiseFunc = promiseFuncs.shift();
      return promiseFunc();
    }
    return null;
  }
  const pool = new PromisePool(promiseProducer, limit)
  const poolPromise = pool.start();
  return poolPromise;
}
Lastly, but very importantly, have a look at the SQS limits: SQS FIFO allows up to 300 sends/sec. Since you are processing 5k items, you could probably raise your concurrency limit to about 5k / (300 + 50), approx 15. The 50 can be any positive number; it just keeps you a bit away from the limit.
Also, consider using SendMessageBatch, with which you could get much more throughput and reach 3k sends/sec.
EDIT
As I suggested above, using sendMessageBatch gives much better throughput, so I've refactored the code that maps your promises to support sendMessageBatch:
function chunkArray(myArray, chunk_size) {
  var index = 0;
  var arrayLength = myArray.length;
  var tempArray = [];
  for (index = 0; index < arrayLength; index += chunk_size) {
    var myChunk = myArray.slice(index, index + chunk_size);
    tempArray.push(myChunk);
  }
  return tempArray;
}

const groupedAccounts = chunkArray(accounts, 10);

const promiseFuncs = groupedAccounts.map(accountsGroup => {
  const messages = accountsGroup.map((acc, i) => {
    return {
      Id: `pos_${i}`,
      MessageBody: JSON.stringify(acc),
      DelaySeconds: 10
    }
  });
  return () => sqsClient.sendMessageBatch({
    Entries: messages,
    QueueUrl: queueUrl
  })
});
Then you can call eachLimit as usual:
const result = await eachLimit(promiseFuncs, 3);
The difference now is every promise processed will send a batch of messages of size n (10 in the example above).
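One caveat worth adding (a hedged sketch, building on the promisified client assumed above): SendMessageBatch can resolve successfully while individual entries fail; those show up in the result's Failed array, so it's worth inspecting per batch:
return () => sqsClient.sendMessageBatch({
  Entries: messages,
  QueueUrl: queueUrl
}).then(result => {
  // entries rejected by SQS appear here rather than as a thrown error
  if (result.Failed && result.Failed.length) {
    console.log('Failed batch entries:', result.Failed);
  }
  return result;
});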

Does node-sqlite's each function load all rows into memory, or does it stream/pipe them from the hard disk?

I want to handle 10,000,000 rows. This is my code:
stmt.each('select * from very_big_table', function(err, rows, next) {
  console.log('This can take 2 seconds')
})
The question is: will it eat all my RAM, or will it read from the hard drive row after row?
I've just recently created this helper class to handle streaming:
import { ReadableTyped } from '#naturalcycles/nodejs-lib'
import { Database, Statement } from 'sqlite'
import { Readable } from 'stream'
/**
* Based on: https://gist.github.com/rmela/a3bed669ad6194fb2d9670789541b0c7
*/
export class SqliteReadable<T = any> extends Readable implements ReadableTyped<T> {
constructor(private stmt: Statement) {
super( { objectMode: true } );
// might be unnecessary
// this.on( 'end', () => {
// console.log(`SQLiteStream end`)
// void this.stmt.finalize()
// })
}
static async create<T = any>(db: Database, sql: string): Promise<SqliteReadable<T>> {
const stmt = await db.prepare(sql)
return new SqliteReadable<T>(stmt)
}
/**
* Necessary to call it, otherwise this error might occur on `db.close()`:
* SQLITE_BUSY: unable to close due to unfinalized statements or unfinished backups
*/
async close(): Promise<void> {
await this.stmt.finalize()
}
// count = 0 // use for debugging
override async _read(): Promise<void> {
// console.log(`read ${++this.count}`) // debugging
try {
const r = await this.stmt.get<T>()
this.push(r || null)
} catch(err) {
console.log(err) // todo: check if it's necessary
this.emit('error', err)
}
}
}
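A minimal usage sketch for the helper above (assuming an already-opened sqlite Database instance named db):
const readable = await SqliteReadable.create(db, 'select * from very_big_table')
readable.on('data', row => {
  // rows arrive one at a time instead of being buffered all at once
  console.log(row)
})
readable.on('end', () => void readable.close())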
Both. statement.each will not use any extra memory if the callback is fast and synchronous; with asynchronous callbacks, however, it will keep starting asynchronous work as fast as it can load the data from disk, and that will eat up all your RAM.
As is also said in the issue you posted (https://github.com/mapbox/node-sqlite3/issues/686), it is possible to get your desired behavior using Statement.get(). Statement.get() without any parameters will fetch the next row, so you can implement your own async version like this:
function asyncEach(db, sql, parameters, eachCb, doneCb) {
  let stmt;
  let cleanupAndDone = err => {
    stmt.finalize(doneCb.bind(null, err));
  };
  stmt = db.prepare(sql, parameters, err => {
    if (err) {
      return cleanupAndDone(err);
    }
    let next = err => {
      if (err) {
        return cleanupAndDone(err);
      }
      return stmt.get(recursiveGet);
    };
    // Setup recursion
    let recursiveGet = (err, row) => {
      if (err) {
        return cleanupAndDone(err);
      }
      if (!row) {
        return cleanupAndDone(null);
      }
      // Call the each callback which must invoke the next callback
      return eachCb(row, next);
    }
    // Start recursion
    stmt.get(recursiveGet);
  });
}
Note the slightly different syntax from the built-in Statement.each: errors are only sent to the last callback, and there is no support for optional parameters.
Also note that this is slower than the normal Statement.each function. It could be improved by issuing multiple get calls so that the next row is already waiting when next is invoked, but that code is much harder to follow.
Example for using the snippet:
let rowCount = 0;
asyncEach(db, 'SELECT * from testtable', [], (row, next) => {
  assert.isObject(row);
  rowCount++;
  return next();
}, err => {
  if (err) {
    return done(err);
  }
  assert.equal(rowCount, TEST_ROW_COUNT);
  done();
});
