I have an application that makes API calls to another system, and it queues these API calls in a queue using Bull and Redis.
However, it occasionally gets bogged down with lots of API calls, or something stops working properly, and I want an easy way for users to check whether the system is just "busy". (Otherwise, if they perform some action and it hasn't completed 10 minutes later, they'll keep retrying it, and then we get a backlog of even more entries, and in some cases data issues where they've issued duplicate parts, etc.)
Here's what a single "key" looks like for a successful API call in the queue:
HSET "bull:webApi:4822" "timestamp" "1639085540683"
HSET "bull:webApi:4822" "returnvalue" "{"id":"e1df8bb4-fb6c-41ad-ba62-774fe64b7882","workOrderNumber":"WO309967","status":"success"}"
HSET "bull:webApi:4822" "processedOn" "1639085623027"
HSET "bull:webApi:4822" "data" "{"id":"e1df8bb4-fb6c-41ad-ba62-774fe64b7882","token":"eyJ0eXAiOiJKV1QiL....dQVyEpXt64Fznudfg","workOrder":{"members":{"lShopFloorLoad":true,"origStartDate":"2021-12-09T00:00:00","origRequiredQty":2,"requiredQty":2,"requiredDate":"2021-12-09T00:00:00","origRequiredDate":"2021-12-09T00:00:00","statusCode":"Released","imaItemName":"Solid Pin - Black","startDate":"2021-12-09T00:00:00","reference":"HS790022053","itemId":"13840402"}},"socketId":"3b9gejTZjAXsnEITAAvB","type":"Create WO"}"
HSET "bull:webApi:4822" "delay" "0"
HSET "bull:webApi:4822" "priority" "0"
HSET "bull:webApi:4822" "name" "__default__"
HSET "bull:webApi:4822" "opts" "{"lifo":true,"attempts":1,"delay":0,"timestamp":1639085540683}"
HSET "bull:webApi:4822" "finishedOn" "1639085623934"
You can see that in this case it took about 83 seconds from being queued to finishing (1639085623 - 1639085540).
I'd like to be able to provide summary metrics like:
Most recent API call was added to queue X seconds ago
Most recent successful API call completed X seconds ago and took XX seconds to complete.
I'd also like to be able to provide a list of the 50 most recent API calls, formatted in a nice way and tagged with "success", "pending", or "failed".
I'm fairly new to Redis and Bull, and I'm trying to figure out how to query this data (using Redis in Node.js) and return this data as JSON to the application.
I can pull a list of keys like this:
// #route GET /status
const { createClient } = require('redis'); // node-redis v4

async function status(req, res) {
  const client = createClient({
    url: `redis://${REDIS_SERVER}:6379`
  });
  try {
    client.on('error', (err) => console.log('Redis Client Error', err));
    await client.connect();
    const value = await client.keys('*');
    res.json(value);
  } catch (error) {
    console.log('ERROR getting status: ', error.message, new Date());
    res.status(500).json({ message: error.message });
  } finally {
    client.quit();
  }
}
Which will return ["bull:webApi:3","bull:webApi:1","bull:webApi:2"...]
But how can I pull the values associated with the respective keys?
And how can I find the key with the highest number, and then pull the details for the "last 50"? In SQL, it would be like doing an ORDER BY key_number DESC LIMIT 50 - but I'm not sure how to do that in Redis.
I'm a bit late here, but if you're not set on manually digging around in Redis, I think Bull's API, in particular Queue#getJobs(), has everything you need here, and should be much easier to work with. Generally, you shouldn't have to reach into Redis to do any common tasks like this, that's what Bull is for!
If I understand your goal correctly, you should be able to do something like:
const Queue = require('bull')

async function status (req, res) {
  const { listNum = 10 } = req.params
  const api_queue = new Queue('webApi', `redis://${REDIS_SERVER}:6379`)
  const current_timestamp_sec = new Date().getTime() / 1000 // convert to seconds
  const recent_jobs = await api_queue.getJobs(null, 0, listNum)
  const results = recent_jobs.map(job => {
    const processed_on_sec = job.processedOn / 1000
    const finished_on_sec = job.finishedOn / 1000
    return {
      request_data: job.data,
      return_data: job.returnvalue,
      processedOn: processed_on_sec,
      finishedOn: finished_on_sec,
      duration: finished_on_sec - processed_on_sec,
      elapsedSinceStart: current_timestamp_sec - processed_on_sec,
      elapsedSinceFinished: current_timestamp_sec - finished_on_sec
    }
  })
  res.json(results)
}
That will get you the most recent listNum* jobs in your queue. I haven't tested this full code, and I'll leave the error handling and the adding of your custom fields to the job data up to you, but the core of it is solid and I think it should meet your needs without ever having to think about how Bull stores things in Redis.
I also included a suggestion on how to deal with timestamps a bit more nicely: you don't need to do string processing to convert milliseconds to seconds. If you need them to be integers, you can wrap them in Math.floor().
* at least that many, anyway - see the second note below
A couple notes:
The first argument of getJobs() is a list of statuses, so if you want to look at just completed jobs, you can pass ['completed'], or completed and active, do ['completed', 'active'], etc. If no list is provided (null) it defaults to all statuses.
As mentioned in the reference I linked, the limit here is per state - so you'll likely get more than listNum jobs back. It doesn't seem like that should be a problem for your use case, but if it is, you can sort the returned list (probably by job id) and just keep the first listNum (see the sketch after these notes) - you're guaranteed to get at least that many (assuming there are that many jobs in your queue), and won't get more than 6*listNum (since there are 6 states).
Folks new to Bull can get nervous about instantiating a Queue object to do stuff like this - but don't be! By itself a Queue instance doesn't do anything, it's just an interface to the given queue. It won't start processing jobs until you call process() to add a processor. This is, incidentally, also how you'd add jobs from a separate process than you run your queues in, but of course nothing will be added unless you call add().
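To tie the first two notes together, here's a rough, untested sketch: ask for a few specific states, then sort what comes back by numeric job id and keep only the newest listNum (job ids are strings in Bull, hence the Number() casts):

const recent_jobs = await api_queue.getJobs(['completed', 'failed', 'active'], 0, listNum)
const newest_first = recent_jobs
  .sort((a, b) => Number(b.id) - Number(a.id)) // highest id = most recently added
  .slice(0, listNum) // trim back down to exactly listNum jobs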
So I've figured out how to pull the data I need. I'm not saying it's a good method, and I'm open to suggestions; but it seems to work to provide a filtered JSON return with the needed data, without changing how the queue functions.
Here's what it looks like:
// #route GET /status/:listNum
const { createClient } = require('redis'); // node-redis v4

async function status(req, res) {
  const { listNum = 10 } = req.params;
  const client = createClient({
    url: `redis://${REDIS_SERVER}:6379`
  });
  try {
    client.on('error', (err) => console.log('Redis Client Error', err));
    await client.connect();
    // Find size of queue database
    const total_keys = await client.sendCommand(['DBSIZE']);
    const upper_value = total_keys;
    const lower_value = total_keys - listNum;
    // Generate array of candidate ids, counting down from upper_value to lower_value
    const range = (start, stop) => Array.from({ length: (start - stop) + 1 }, (_, i) => start - i);
    var queue_ids = range(upper_value, lower_value);
    queue_ids = queue_ids.filter(function (x) { return x > 0 }); // Filter out ids of zero or below
    // Current timestamp in seconds
    const current_timestamp = parseInt(String(new Date().getTime()).slice(0, -3)); // strip milliseconds ("now")
    var response = []; // Initialize array
    for (const id of queue_ids) { // Loop through queries
      // Query value
      var value = await client.HGETALL('bull:webApi:' + id);
      if (Object.keys(value).length !== 0) { // if it returned a value
        // Grab most of the request (exclude the token & socketId to save space, not used)
        var request_data = JSON.parse(value.data);
        request_data.token = '';
        request_data.socketId = '';
        // Grab & calculate desired times
        const processedOn = value.processedOn.slice(0, -3); // strip milliseconds ("start")
        const finishedOn = value.finishedOn.slice(0, -3); // strip milliseconds ("done")
        const duration = finishedOn - processedOn; // (seconds)
        const elapsedSinceStart = current_timestamp - processedOn;
        const elapsedSinceFinished = current_timestamp - finishedOn;
        // Grab the return value (note: the hash field is all lowercase, "returnvalue")
        const return_data = value.returnvalue;
        // ignoring queue keys of: opts, priority, delay, name, timestamp
        const object_data = { request_data: request_data, processedOn: processedOn, finishedOn: finishedOn, return_data: return_data, duration: duration, elapsedSinceStart: elapsedSinceStart, elapsedSinceFinished: elapsedSinceFinished };
        response.push(object_data);
      }
    }
    res.json(response);
  } catch (error) {
    console.log('ERROR getting status: ', error.message, new Date());
    res.status(500).json({ message: error.message });
  } finally {
    client.quit();
  }
}
It's looping the Redis query, so I wouldn't want to use this for hundreds of keys, but for 10 or even 50 I'm thinking it should work.
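If this ever needs to handle more keys, I believe node-redis can batch all the HGETALLs into a single round trip with MULTI - here's an untested sketch:

const multi = client.multi();
for (const id of queue_ids) {
  multi.hGetAll('bull:webApi:' + id);
}
const values = await multi.exec(); // one reply per id, in the same order as queue_ids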
For now I've resorted to getting the total number of keys and working backwards:
await client.sendCommand(['DBSIZE']);
In my case it will return a total slightly higher than the highest key id (there are a handful of other Bull status keys in there), but it at least gets close, and then I just filter out any non-responses.
I've looked at ZRANGE a bit, but I can't figure out how to get it to give me the last id. Say I have a Redis database (Bull queue) with just the keys bull:webApi:1, bull:webApi:2 and bull:webApi:3:
If there's a simple Redis command I can run that will return "3", I'd probably use that instead. (since bull:webApi:3 has the highest number)
(In actual use case, this might be 9555 or some high number; I just want to get the highest numbered key that exists.)
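One shortcut I may test: Bull seems to keep its auto-incrementing job-id counter in a plain string key named bull:<queueName>:id (that's an assumption on my part - I haven't verified it across versions), so something like this might return the highest id directly:

const lastId = await client.get('bull:webApi:id'); // would hopefully return "3" in the example above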
For now I'll try using the method I've come up with above.
I have been developing a game where on the user submitting data, the client writes some data to a Firebase Realtime Database. I then have a Google Cloud Function which is triggered onUpdate. That function checks the submissions from various players in a particular game and if certain criteria are met, the function writes an update to the DB which causes the clients to move to the next round of the game.
This is all working, however, I have found performance to be quite poor.
I've added logging to the function and can see that the function takes anywhere from 2-10ms to complete, which is acceptable. The issue is that the update is often written anywhere from 10-30 SECONDS after the function has returned the update.
To determine this, my function obtains the UTC epoch timestamp from just before writing the update, and stores this as a key with the value being the Firebase server timestamp.
I then manually compared the two timestamps to work out the time between the function returning and the database write actually landing.
The strange thing is that I have another cloud function which is triggered by an HTTP request, and found that the update from that function is typically from 0.5-2 seconds after the function calls the DB update() API.
The difference between these two functions, aside from how they are triggered, is how the data is written back to the DB.
The onUpdate() function writes data by returning:
return after.ref.update(updateToWrite);
Whereas the HTTP request function writes data by calling the update API:
dbRef.update({
// object to write
});
I've provided a slightly stripped-out version of my onUpdate function here (same structure, but sanitised function names):
exports.onGameUpdate = functions.database.ref("/games/{gameId}")
  .onUpdate(async (snapshot, context) => {
    console.time("onGameUpdate");
    let id = context.params.gameId;
    let after = snapshot.after;
    let updatedSnapshot = snapshot.after.val();
    if (updatedSnapshot.serverShouldProcess && !isFinished) {
      processUpdate().then((res) => {
        // some logic to check res, and if criteria met, move on to promises to determine update data
        determineDataToWrite().then((updateToWrite) => {
          // get current time
          var d = new Date();
          let triggerTime = d.getTime();
          // I use triggerTime as the key, and firebase.database.ServerValue.TIMESTAMP as the value
          if (updateToWrite["timestamps"] !== null && updateToWrite["timestamps"] !== undefined) {
            let timestampsCopy = updateToWrite["timestamps"];
            timestampsCopy[triggerTime] = firebase.database.ServerValue.TIMESTAMP;
            updateToWrite["timestamps"][triggerTime] = firebase.database.ServerValue.TIMESTAMP;
          } else {
            let timestampsObj = {};
            timestampsObj[triggerTime] = firebase.database.ServerValue.TIMESTAMP;
            updateToWrite["timestamps"] = timestampsObj;
          }
          // write the update to the database
          return after.ref.update(updateToWrite);
        }).catch((error) => {
          // error handling
        });
        // this is just here because ES Linter complains if there's no return
        return null;
      }).catch((error) => {
        // error handling
      });
    }
  });
I'd appreciate any help! Thanks :)
I have the following function that gets a message from aws SQS, the problem is I get one at a time and I wish to get all of them, because I need to check the ID for each message:
function getSQSMessages() {
  const params = {
    QueueUrl: 'some url',
  };
  sqs.receiveMessage(params, (err, data) => {
    if (err) {
      console.log(err, err.stack);
      return(err);
    }
    return data.Messages;
  });
};

function sendMessagesBack() {
  return new Promise((resolve, reject) => {
    if (Array.isArray(getSQSMessages())) {
      resolve(getSQSMessages());
    } else {
      reject(getSQSMessages());
    };
  });
};
The function sendMessagesBack() is used in another async/await function.
I am not sure how to get all of the messages; while looking into it, I saw people mention loops, but I could not figure out how to implement one in my case.
I assume I have to put sqs.receiveMessage() in a loop, but then I get confused about what I need to check and when to stop the loop so that I can get the ID of each message.
If anyone has any tips, please share.
Thank you.
I suggest you use the Promise API, which lets you use async/await syntax right away.
const { Messages } = await sqs.receiveMessage(params).promise();
// Messages will contain all your needed info
await sqs.sendMessage(params).promise();
In this way, you will not need to wrap the callback API with Promises.
SQS doesn't return more than 10 messages in the response. To get all the available messages, you need to call the getSQSMessages function recursively.
If you return a promise from getSQSMessages, you can do something like this.
getSQSMessages()
  .then(data => {
    if (!data.Messages || data.Messages.length === 0) {
      // no messages are available. return
    }
    // continue processing each message, or push the messages into an array
    // and call getSQSMessages again.
  });
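For example, a rough sketch of that recursive accumulation (untested; it assumes getSQSMessages returns a promise that resolves with the raw receiveMessage response, as described above). Note the caveat in the next answer: unless you delete messages as you go, you may keep seeing the same ones.

async function collectMessages(accumulated = []) {
  const data = await getSQSMessages();
  if (!data.Messages || data.Messages.length === 0) {
    return accumulated; // nothing more to fetch right now
  }
  accumulated.push(...data.Messages); // each message carries MessageId, Body, ReceiptHandle, ...
  return collectMessages(accumulated); // recurse for the next batch of up to 10
}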
You can never be guaranteed to get all the messages in a queue, unless after you get some of them, you delete them from the queue - thus ensuring that the next request returns a different selection of records.
Each request will return 'up to' 10 messages; if you don't delete them, then there is a good chance that the next request for 'up to' 10 messages will return a mix of messages you have already seen and some new ones - so you will never really know when you have seen them all.
It may be that a queue is not the right tool for your use case - but since I don't know your use case, it's hard to say.
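For what it's worth, the delete-as-you-go idea above might look roughly like this with the aws-sdk v2 promise API (a sketch, not tested; QueueUrl is assumed to be defined elsewhere):

const { Messages = [] } = await sqs.receiveMessage({ QueueUrl, MaxNumberOfMessages: 10 }).promise();
// ... process Messages ...
if (Messages.length > 0) {
  await sqs.deleteMessageBatch({
    QueueUrl,
    Entries: Messages.map((m) => ({ Id: m.MessageId, ReceiptHandle: m.ReceiptHandle })),
  }).promise();
}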
I know this is a bit of a necro, but I landed here last night while trying to pull all the messages from a dead letter queue in SQS. While the accepted answer is absolutely correct ("you cannot guarantee to get all messages" from the queue), I did want to drop an answer for anyone who may land here as well and needs to get around the 10-message-per-request limit from AWS.
Dependencies
In my case I have a few dependencies already in my project that I used to make life simpler.
lodash - This is something we use in our code to help make things more functional. I don't think I used it below, but I'm including it since it's in the file.
cli-progress - This gives you a nice little progress bar on your CLI.
Disclaimer
The below was thrown together while troubleshooting some production errors integrating with another system. Our DLQ messages contain some identifiers that I need in order to formulate CloudWatch queries for troubleshooting. Given that these are two different GUIs in AWS, switching back and forth is cumbersome, especially since our AWS sessions are via a form of federation and a session only lasts one hour max.
The script
#!/usr/bin/env node
const _ = require('lodash');
const awsSdk = require('aws-sdk');
const cliProgress = require('cli-progress');

const queueUrl = 'https://[put-your-url-here]';
const queueRegion = 'us-west-1';

const getMessages = async (sqs) => {
  const resp = await sqs.receiveMessage({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 10,
  }).promise();
  return resp.Messages || []; // Messages is absent when nothing was returned
};

const main = async () => {
  const sqs = new awsSdk.SQS({ region: queueRegion });

  // First thing we need to do is get the current number of messages in the DLQ.
  const attributes = await sqs.getQueueAttributes({
    QueueUrl: queueUrl,
    AttributeNames: ['All'], // Probably could thin this down but it's late
  }).promise();
  const numberOfMessage = Number(attributes.Attributes.ApproximateNumberOfMessages);

  // Next we create an in-memory cache for the messages.
  const allMessages = {};
  let running = true;

  // Honesty here: the examples we have in existing code use the multi-bar. It was about 10PM
  // and I had 28 DLQ messages I was looking into. I didn't feel it was worth converting the
  // multi-bar to a single-bar. Look into the docs on the github page if this is really a
  // sticking point for you.
  const progress = new cliProgress.MultiBar({
    format: ' {bar} | {name} | {value}/{total}',
    hideCursor: true,
    clearOnComplete: true,
    stopOnComplete: true
  }, cliProgress.Presets.shades_grey);
  const progressBar = progress.create(numberOfMessage, 0, { name: 'Messages' });

  // TODO: put in a time limit to avoid an infinite loop.
  // NOTE: For 28 messages I managed to get them all with this approach in about 15 seconds.
  // When/if I clean up this script I plan to add the time-based short-circuit at that point.
  while (running) {
    // Fetch all the messages we can from the queue. The number of messages is not guaranteed
    // per the AWS documentation.
    let messages = await getMessages(sqs);
    for (let i = 0; i < messages.length; i++) {
      // Loop through the fetched messages and only copy messages we have not already cached.
      let message = messages[i];
      let data = allMessages[message.MessageId];
      if (data === undefined) {
        allMessages[message.MessageId] = message;
      }
    }
    // Update our progress bar with the current progress.
    const discoveredMessageCount = Object.keys(allMessages).length;
    progressBar.update(discoveredMessageCount);
    // Give a quick pause just to make sure we don't get rate limited or something.
    await new Promise((resolve) => setTimeout(resolve, 1000));
    running = discoveredMessageCount !== numberOfMessage;
  }

  // Now that we have all the messages I printed them to console so I could copy/paste the
  // output into LibreCalc (excel-like tool). I split on the semicolon for rows out of habit
  // since sometimes similar scripts deal with data that has commas in it.
  const keys = Object.keys(allMessages);
  console.log('Message ID;ID');
  for (let i = 0; i < keys.length; i++) {
    const message = allMessages[keys[i]];
    const decodedBody = JSON.parse(message.Body);
    console.log(`${message.MessageId};${decodedBody.id}`);
  }
};

main();
I was asked to import a csv file from a server daily and parse it, mapping the respective headers to the appropriate fields in mongoose.
My first idea was to make it run automatically with a scheduler, using the cron module.
const CronJob = require('cron').CronJob;
const fs = require("fs");
const csv = require("fast-csv");

new CronJob('30 2 * * *', async function() {
  await parseCSV();
  this.stop();
}, function() {
  this.start();
}, true);
Next, the parseCSV() function code is as follow:
(I have simplify some of the data)
function parseCSV() {
  let buffer = [];
  let stream = fs.createReadStream("data.csv");
  csv.fromStream(stream, { headers: ["lot", "order", "cwotdt"], trim: true })
    .on("data", async (row) => {
      let data = { "order": row.order, "lot": row.lot, "date": row.cwotdt };
      // Only add product that fulfils the following condition
      if (data.date !== "000000") {
        let product = { "order": data.order, "lot": data.lot };
        // Check whether product exists in database or not
        await db.Product.find(product, function (err, foundProduct) {
          if (foundProduct && foundProduct.length !== 0) {
            console.log("Product exists");
          } else {
            buffer.push(product);
            console.log("Product not exists");
          }
        });
      }
    })
    .on("end", function () {
      db.Product.find({}, function (err, productAvailable) {
        // Check whether database exists or not
        if (productAvailable.length !== 0) {
          // console.log("Database Exists");
          // Add subsequent onward
          db.Product.insertMany(buffer);
          buffer = [];
        } else {
          // Add first time
          db.Product.insertMany(buffer);
          buffer = [];
        }
      });
    });
}
It is not a problem if there are just a few rows in the csv file, but at around 2k rows I run into trouble. The culprit is the if-condition check inside the "data" event handler: it has to query the database for every single row to see whether the data is already there.
The reason I'm doing this is that the csv file will keep having new data added to it, and I need to either add everything the first time (if the database is empty) or look at every single row and only add the new data into mongoose.
The 1st approach I took (as in the code above) was using async/await to make sure that all the data had been read before proceeding to the "end" event handler. This helps, but from time to time (with mongoose.set("debug", true);) I see some data being queried twice, and I have no idea why.
The 2nd approach was not to use async/await; the downside is that the data was not fully queried before it proceeded straight to the "end" event handler, which then insertMany-ed only whatever had made it into the buffer.
If I stick with the current approach it is not an issue, but the query takes 1 to 2 minutes, and even longer as the database keeps growing. During those few minutes of querying, the event queue gets blocked, so requests sent to the server time out.
I used stream.pause() and stream.resume() before this code, but I can't get it to work: it just jumps straight to the "end" event handler first. This causes the buffer to be empty every single time, since the "end" handler runs before the "data" handler.
I can't remember the exact links I used, but the fundamentals I picked up came from this:
Import CSV Using Mongoose Schema
I saw these threads:
Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS
Can't populate big chunk of data to mongodb using Node.js
to be similar to what I need, but they're a bit too complicated for me to understand what is going on. It seems like they use sockets or a child process, maybe? Furthermore, I still need to check conditions before adding to the buffer.
Anyone care to guide me on this?
Edit: await is removed from console.log as it is not asynchronous
Forking a child process approach (a rough sketch follows these steps):
When the web service gets a request with a csv data file, save it somewhere in the app
Fork a child process -> child process example
Pass the file url to the child process to run the insert checks
When the child process finishes processing the csv file, delete the file
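A minimal sketch of that fork step (untested; csv-worker.js and processCsvInChild are made-up names):

const { fork } = require('child_process');
const fs = require('fs');

function processCsvInChild(filePath) {
  const worker = fork('./csv-worker.js'); // hypothetical worker script that runs the insert checks
  worker.send({ filePath }); // hand the file location to the child
  worker.on('exit', (code) => {
    if (code === 0) fs.unlink(filePath, () => {}); // delete the file once processed
  });
}

// csv-worker.js would do something like:
// process.on('message', async ({ filePath }) => { await parseCSV(filePath); process.exit(0); });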
Like what Joe said, indexing the DB would speed up the processing time by a lot when there are lots (millions) of tuples.
If you create an index on order and lot, the query should be very fast.
db.Product.createIndex( { order: 1, lot: 1 } )
Note: This is a compound index and may not be the ideal solution. Index strategies
Also, your await on console.log is weird. That may be causing your timing issues. console.log is not async. Additionally the function is not marked async
// removing await from console.log
let product = { "order": data.order, "lot": data.lot };
// Check whether product exists in database or not
await db.Product.find(product, function (err, foundProduct) {
  if (foundProduct && foundProduct.length !== 0) {
    console.log("Product exists");
  } else {
    buffer.push(product);
    console.log("Product not exists");
  }
});
I would try removing the await on console.log (that may be a red herring if the console.log is just for Stack Overflow and is hiding the actual async method). However, be sure to mark the function as async if that is the case.
Lastly, if the problem still exists, I would look into a 2-tiered approach (a rough sketch follows the steps):
Insert all lines from the CSV file into a mongo collection.
Process that mongo collection after the CSV has been parsed, removing the CSV from the equation.
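Roughly, that could look like the following (untested sketch; StagingRow and Product are assumed mongoose models, and it uses the newer fast-csv parse/pipe API rather than fromStream):

const fs = require('fs');
const csv = require('fast-csv');

// Tier 1: stream the CSV straight into a staging collection, no per-row queries.
async function stageCsv(path) {
  const rows = [];
  await new Promise((resolve, reject) => {
    fs.createReadStream(path)
      .pipe(csv.parse({ headers: ['lot', 'order', 'cwotdt'], trim: true }))
      .on('error', reject)
      .on('data', (row) => rows.push(row))
      .on('end', resolve);
  });
  if (rows.length > 0) await StagingRow.insertMany(rows);
}

// Tier 2: move only new (order, lot) pairs into the real collection with one bulk upsert.
async function promoteNewRows() {
  const staged = await StagingRow.find({ cwotdt: { $ne: '000000' } }).lean();
  const ops = staged.map((r) => ({
    updateOne: {
      filter: { order: r.order, lot: r.lot },
      update: { $setOnInsert: { order: r.order, lot: r.lot, date: r.cwotdt } },
      upsert: true,
    },
  }));
  if (ops.length > 0) await Product.bulkWrite(ops);
  await StagingRow.deleteMany({}); // clear staging for the next run
}

The compound index on (order, lot) suggested above is what would keep the upsert filter cheap.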
Been looking around the net for an answer to this, but not found anything conclusive.
I have a node application that (potentially) needs to make a large number of HTTP GET requests.
Let's say http://foo.com/bar allows an 'id' query parameter, and I have a large number of IDs to process (~1k), i.e.
http://foo.com/bar?id=100
http://foo.com/bar?id=101
etc.
What libraries that folks have used might be best suited to this task?
I guess I'm looking for something between a queue and a connection pool:
The setup:
A large array of IDs exists to be processed (up to ~1k IDs)
The process:
Some kind of pool containing X number of 'workers' is defined
Each worker takes an ID and makes a request (with up to X concurrent workers running at a time)
When a worker completes, it takes the next ID from the array and processes that
etc. until all IDs have been processed
Any experience welcome
It was actually a lot simpler than I initially thought, and only requires Bluebird (I'm paraphrasing here a little bit since my final code ended up much more complex):
var Promise = require('bluebird');
...
var allResults = [];
...
Promise.map(idList, (id) => {
  // For each ID in idList, make a HTTP call
  return http.get( ... url: 'http://foo.com/bar?id=' + id ... )
    .then((httpResponse) => {
      return allResults.push(httpResponse);
    })
    .catch((err) => {
      var errMsg = 'ERROR: [' + err + ']';
      console.log(errMsg + (err.stack ? '\n' + err.stack : ''));
    });
}, { concurrency: 10 }) // Max of 10 concurrent HTTP calls at once
.then(() => {
  // All requests are now complete, return all results
  return res.json(allResults);
});
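One small note on that pattern: Bluebird's Promise.map already resolves to an array of results in the same order as idList, so if ordering matters you can skip the shared allResults array. An untested variant, where fetchById is a stand-in for whatever HTTP call you make per ID:

var Promise = require('bluebird');

function processAll(idList, fetchById) {
  return Promise.map(
    idList,
    (id) => fetchById(id).catch((err) => ({ id, error: String(err) })), // record failures, keep going
    { concurrency: 10 }
  );
}

// processAll(idList, fetchById).then((allResults) => res.json(allResults));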
In Node.js I am trying to get the following behaviour: during the runtime of my Express application I accumulate several IDs of objects that need further processing. For further processing, I need to transmit these IDs to a different service. The other service, however, cannot handle a lot of requests and instead requires batch transmission. Hence I need to accumulate a lot of individual requests into a bigger one while allowing persistence.
tl;dr — Over 15 minutes, several IDs are accumulated in my application, then after this 15 minute-window, they're all emitted at once. At the same time, the next windows is opened.
From my investigation, this is probably the abstract data type of a multiset: The items in my multiset (or bag) can have duplicates (hence the multi-), they're packaged by the time window, but have no index.
My infrastructure is already using redis, but I am not sure whether there's a way to accumulate data into one single job. Or is there? Are there any other sensible ways to achieve this kind of behavior?
I might be misunderstanding some of the subtlety of your particular situation, but here goes.
Here is a simple sketch of some code that processes a batch of 10 items at a time. The way you would do this differs slightly depending on whether the processing step is synchronous or asynchronous. You don't need anything more complicated than an array for this, since arrays have constant-time push and length, and that is all you need here. You may want to add another option to flush the batch a given amount of time after an item is inserted (see the sketch after the two examples).
Synchronous example:
var batch = [];
var batchLimit = 10;

var sendItem = function (item) {
  batch.push(item);
  if (batch.length >= batchLimit) {
    processBatchSynch(batch);
    batch = [];
  }
};
Asynchronous example:
// note that in this case the job of emptying the batch array
// has to be done inside the callback.
var batch = [];
var batchLimit = 10;
// your callback might look something like function(err, data) { ... }
var sendItem = function (item, cb) {
batch.push(item);
if (item.length >= batchLimit) {
processBatchAsync(batch, cb);
}
}
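And if you want the time-window behaviour from the question (flush every 15 minutes regardless of size), a rough sketch might look like this, where processBatch is a stand-in for whatever actually handles the accumulated items:

var batch = [];
var windowMs = 15 * 60 * 1000; // 15-minute window

var sendItem = function (item) {
  batch.push(item); // just accumulate during the window
};

// every 15 minutes, hand off the current batch and start a fresh window
setInterval(function () {
  if (batch.length === 0) return; // nothing accumulated this window
  var toProcess = batch;
  batch = [];
  processBatch(toProcess); // stand-in for your real processing/emit step
}, windowMs);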
I came up with an npm module to solve this specific problem using a MySQL database for persistence:
persistent-bag: This is to bags like redis is to queues. A bag (or multiset) is filled over time and processed at once.
On instantiation of the object, the required table is created in the provided MySQL database if necessary.
var PersistentBag = require('persistent-bag');

var bag = new PersistentBag({
  host: 'localhost',
  port: '3306',
  user: 'root',
  password: '',
  database: 'test'
});
Then items can be .add()ed during the runtime of any number of applications:
var item = {
  title: 'Test item to store to bag'
};

bag.add(item, function (err, itemId) {
  console.log('Item id: ' + itemId);
});
Working with the emitted, aggregated items every 15 minutes is done like in kue for redis by subscribing to .process():
bag.process(function worker(bag, done) {
  // bag.data is now an array of all items
  doSomething(bag.data, function () {
    done();
  });
});