In my Node.js application I read messages from an AWS Kinesis stream, and I need to store all messages from the last minute in a cache (Redis). I run the following code in one Node worker:
var loopCallback = function(record) {
    var nowMinute = moment.utc(record.Data.ts).minute();
    //get all cached kinesis records
    var key = "kinesis";
    cache.get(key, function (err, cachedData) {
        if (err) {
            utils.logError(err);
        } else {
            if (!cachedData) {
                cachedData = [];
            } else {
                cachedData = JSON.parse(cachedData);
            }
            //get records with the same minute
            var filtered = _.filter(cachedData, function (item) {
                return moment.utc(item.ts).minute() === nowMinute;
            });
            filtered.push(record.Data);
            cache.set(key, JSON.stringify(filtered), function (saveErr) {
                if (saveErr) {
                    utils.logError(saveErr);
                }
                //do other things with record;
            });
        }
    });
};
Most of the records (a few dozen) arrive at exactly the same moment, so when I try to save them, some records are not stored.
I understand this happens due to a race condition.
Node reads an old version of the array from Redis and then overwrites it while another record is being written to the cache.
I have read about Redis transactions, but as I understand it they will not help me here, because only one transaction will complete and the others will be rejected.
Is there a way to save all the records to the cache in my case?
Thank you
You could use a sorted set, with the score being a Unix timestamp
ZADD kinesis <unixtimestamp> "some data to be cached"
To get the elements added less than one minute ago, create a timestamp for (now - 60 seconds) then use ZRANGEBYSCORE to get the oldest element first:
ZRANGEBYSCORE kinesis (timestamp +inf
or ZREVRANGEBYSCORE if you want the newest element first:
ZREVRANGEBYSCORE kinesis +inf (timestamp
To remove the elements older than one minute, create a timestamp for (now - 60 seconds) then use ZREMRANGEBYSCORE
ZREMRANGEBYSCORE kinesis -inf (timestamp
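On the Node side, a minimal sketch of that approach could look like the following, assuming a node_redis v3-style callback client (ioredis or redis v4 use a slightly different API); the key name and the one-minute window come from the question, the rest is illustrative:

var redis = require("redis");
var client = redis.createClient();

var KEY = "kinesis";
var MINUTE = 60 * 1000;

// Store one record, scored by its own timestamp (assumed to be milliseconds).
function storeRecord(record) {
    client.zadd(KEY, record.Data.ts, JSON.stringify(record.Data), function (err) {
        if (err) { return console.error(err); }
        // Trim everything older than one minute.
        client.zremrangebyscore(KEY, "-inf", "(" + (Date.now() - MINUTE), function (remErr) {
            if (remErr) { console.error(remErr); }
        });
    });
}

// Fetch all records from the last minute, oldest first.
function getLastMinute(callback) {
    client.zrangebyscore(KEY, "(" + (Date.now() - MINUTE), "+inf", function (err, members) {
        if (err) { return callback(err); }
        callback(null, members.map(function (m) { return JSON.parse(m); }));
    });
}

Because ZADD appends a single member atomically, concurrent callbacks no longer overwrite each other the way the read-modify-write of the whole JSON array did.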
I am trying to run several hundred thousand SQL update queries using node/mssql. I am trying to:
insert each record individually (if one fails I don't want the batch to fail)
batch the queries so I don't overload the SQL server (I can open a new connection for every query but the server explodes if I do that)
With my existing code (which works 99% of the time) I occasionally get "operation timed out for an unknown reason", and I'm hoping someone can suggest a fix or improvements.
This is what I have:
let pool;
try {
    const sql = require("mssql");
    pool = await new sql.connect(CONFIG_OBJ);
    let batchSize = 1000;
    let errs = [];
    let queries = [
        `update xxx set [AwsCoID]='10118' where [PrimaryKey]='10118-78843' IF @@ROWCOUNT=0 insert into xxx([AwsCoID]) values('10118')`,
        `update or insert 2`,
        `update or insert 3`, ....]
    for (let i = 0; i < queries.length; i += batchSize) {
        let prom = queries
            .slice(i, i + batchSize)
            .map((qq) => pool.request().query(qq));
        for (let p of await (Promise as any).allSettled(prom)) {
            //make sure connection is still active after batch finishes
            pool = await new sql.connect(CONFIG_OBJ);
            //console.error(`promerr:`, p);
            let status: "fulfilled" | "rejected" = p.status;
            let value = p.value as SqlResult;
            if (status != "fulfilled" || !value.isSuccess) {
                console.log(`batchRunSqlCommands() promERR:`, value);
                errs.push(value);
            }
        }
    }
} catch (e) {
    console.log(`batchSqlCommand err:`, e);
} finally {
    pool.close();
}
For anyone else who writes something like I did: the issue is that SQL Server takes a table lock covering the affected rows when doing an upsert. The fix is to add a clustered index that ensures each record being updated is in its own cluster, so the cluster gets locked but only one row is modified within the cluster at a time.
TL;DR: set a row-unique column (e.g. PrimaryKey) as the clustered index on the table.
This is not great for DB performance, but it will quickly and simply solve the issue. You could also intelligently cluster groups of data, but then you would need to ensure your batch update only touches each cluster once and finishes before trying to access it again.
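As a rough illustration only, clustering on the question's PrimaryKey column could look like the sketch below, run once through the same mssql pool; the table/column names (xxx, PrimaryKey) come from the snippet above, and if the table already has a clustered index (e.g. from a primary key constraint) you would need to drop or rebuild that first:

// Hedged sketch: cluster the table on its row-unique key so each upsert
// only locks its own cluster. Adjust names for your schema.
const sql = require("mssql");

async function clusterOnPrimaryKey(pool) {
    await pool.request().query(`
        CREATE UNIQUE CLUSTERED INDEX IX_xxx_PrimaryKey
        ON xxx ([PrimaryKey])
    `);
}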
I have a database in a Firebase Realtime Database with data that looks like this:
root
|_history
  |_{userId}
    |_{n1}
    | |_ ...
    |_{n2}
    |_{n...}
Nodes n are keyed with a date integer value. Each n node has at least 60 keys, with some values being arrays, max 5 levels deep.
Query times were measured in a fashion similar to this:
const startTime = performance.now();
await query();
const endTime = performance.now();
logger.info(`Query completed in ${endTime - startTime} ms`);
I have a function that queries for n nodes under history/${userId} with keys between and inclusive of the start and end values:
await admin
  .database()
  .ref(`history/${userId}`)
  .orderByKey()
  .startAt(`${start}`)
  .endAt(`${end}`)
  .once("value");
The query is executed in a callable Cloud Function and currently takes approximately 2-3 seconds, returning approximately 225 nodes. The total number of n nodes is currently less than 300. Looking through my logs, queries that returned 0 nodes took approximately 500 milliseconds.
Why are the queries so slow? Am I misunderstanding something about Firebase's Realtime Database?
I've run a few performance tests to give you something to compare against.
I populated my database with this script:
for (var i=0; i < 500; i++) {
ref.push({
refresh_at: Date.now() + Math.round(Math.random() * 60 * 1000)
});
}
This leads to JSON of this form:
{
  "-MlWgH51ia7Iz7ubZb7K" : {
    "refresh_at" : 1633726623247
  },
  "-MlWgH534FgMlb7J4bH2" : {
    "refresh_at" : 1633726586126
  },
  "-MlWgH54gd-uW_M7e6J-" : {
    "refresh_at" : 1633726597651
  },
  ...
}
When retrieved in its entirety through the API, the snapshot.val() for this JSON is 26,001 characters long.
Client-side JavaScript SDK in jsbin
I first tested with the regular client-side JavaScript SDK in a jsbin, using a simple script similar to yours.
Adapted for jsbin, the code I ran is:
const startTime = performance.now();
ref.orderByChild("refresh_at")
  .endAt(Date.now())
  .limitToLast(1000) // 👈 This is what we'll vary
  .once("value")
  .then(function(snapshot) {
    var endTime = performance.now();
    console.log('Query completed in '+Math.round(endTime - startTime)+'ms, retrieved '+snapshot.numChildren()+" nodes, for a total JSON size of "+JSON.stringify(snapshot.val()).length+" chars");
  });
Running it a few times, and changing the limit that I marked, leads to:
Limit | Snapshot size (chars) | Average time in ms
500   | 26,001                | 350ms - 420ms
100   | 5,201                 | 300ms - 350ms
10    | 521                   | 300ms - 320ms
Node.js Admin SDK
I ran the same test with a local Node.js script against the exact same data set, with a modified script that runs 10 times:
for (var i=0; i < 10; i++) {
  const startTime = Date.now();
  const snapshot = await ref.orderByChild("refresh_at")
    .endAt(Date.now())
    .limitToLast(10)
    .once("value");
  const endTime = Date.now();
  console.log('Query completed in '+Math.round(endTime - startTime)+'ms, retrieved '+snapshot.numChildren()+" nodes, for a total JSON size of "+JSON.stringify(snapshot.val()).length+" chars");
}
The results:
Limit | Snapshot size (chars) | Time in ms
500   | 26,001                | 507ms, 78ms, 70ms, 65ms, 65ms, 61ms, 64ms, 65ms, 81ms, 62ms
100   | 5,201                 | 442ms, 59ms, 56ms, 59ms, 55ms, 54ms, 54ms, 55ms, 57ms, 56ms
10    | 521                   | 437ms, 52ms, 49ms, 52ms, 51ms, 51ms, 52ms, 50ms, 52ms, 50ms
So what you can see is that the first run is similar to (but slightly slower than) the JavaScript SDK, and subsequent runs are then a lot faster. This makes sense, as on the initial run the client establishes its (web socket) connection to the database server, which includes a few round trips to determine the right server. Subsequent calls seem more bandwidth-constrained.
Ordering by key
I also tested with ref.orderByKey().startAt("-MlWgH5QUkP5pbQIkVm0").endAt("-MlWgH5Rv5ij42Vel5Sm") in Node.js and got very similar results to ordering by child.
Add an .indexOn rule for the field you are querying on to your Realtime Database rules.
For example:
{
  "rules": {
    ".read": "auth.uid != null",
    ".write": "auth.uid != null",
    "v1": {
      "history": {
        ".indexOn": "refresh_at"
      }
    }
  }
}
I have a simple caching mechanism: I check whether an item is in the cache (table:id#id_number), and if not I check whether that item is in the database. If it is in the database, I cache it. If it is not, I obviously don't.
The issue is that, in my current situation, this is going to happen frequently. Every time someone visits the front page, I will check whether id 10 exists, then whether id 9 exists, etc.
If id 9 doesn't exist, nothing will be cached and my server will keep hitting my database every time someone visits my front page.
My solution right now is very dirty and could easily lead to confusion in the future. I now cache whether or not an id probably exists in the database (pexists:table:id#id_number). If it probably doesn't exist, or if that key isn't set at all, I just assume it doesn't exist. If it probably does exist, I check the cache; only if the item isn't there do I hit my database. I then cache the result from the database along with whether or not it exists.
I am asking if there is a better way of achieving this effect.
/*
This method takes an amount (how many posts you need) and start
(the post id we want to start at). If, for example,
the DB has ids 1, 2, 4, 7, 9, and we ask for
amount=3 and start=7, we should return the items
with ids 7, 4, and 2.
*/
const parametersValidity = await this.assureIsValidPollsRequestAsync(
    amount,
    start
);
const validParameters = parametersValidity.result;
if (!validParameters) {
    throw new Error(parametersValidity.err);
}

let polls = [];
for (
    let i = start, pollId = start;
    (i > start - amount && pollId > 0);
    i--, pollId--
) {
    // There is probably no poll logged in the database.
    if (!(await this.pollProbablyExists(pollId))) {
        i++;
        continue;
    }

    let poll;
    const cacheHash = `polls:id#${pollId}`;
    if (await cache.existsAsync(cacheHash)) {
        poll =
            await this.findKeysFromCacheHash(
                cacheHash,
                "Id", "Title", "Description", "Option1", "Option2", "AuthorId"
            );
    } else {
        // Simulates a slow database retrieval thingy
        // for testing.
        await new Promise(resolve => {
            setTimeout(resolve, 500)
        });
        poll = await this.getPollById(pollId);
        if (typeof poll !== "undefined") {
            this.insertKeysIntoCacheHash(
                cacheHash, 60 * 60 * 3 /* 3 hours to expire */,
                poll
            );
        }
    }

    if (typeof poll === "undefined") {
        // This would happen if a user deleted a poll
        // when the exists:polls:id#poll_id key in cache
        // hadn't expired yet.
        cache.setAsync(`exists:${cacheHash}`, 0);
        cache.expire(`exists:polls:id#${pollId}`, 60 * 60 * 10 /* 10 hours. */);
        i++;
    } else {
        polls.push(poll);
        cache.setAsync(`exists:${cacheHash}`, 1);
    }
}
return polls;
If I understand correctly, you want to avoid lookups for those non-existent keys hitting the database frequently.
If that's the case, a better and simpler way is to cache the non-existent keys.
If the key exists in database, you can cache it in Redis with the value gotten from database.
If the key doesn't exist in database, also cache it in Redis, but with a special value.
For example, say you cache players' scores in Redis. If a player doesn't exist, you still cache the key, but with -1 as the score. When searching the cache, if the key exists but the cached value is -1, that means the key doesn't exist in the database.
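A minimal sketch of that idea adapted to the polls example above; cache and getPollById are the question's objects, while the getAsync method, the __MISSING__ sentinel and the TTL values are assumptions for illustration:

// Negative-caching sketch: remember that an id does NOT exist so repeated
// front-page requests never hit the database for it again.
const MISSING = "__MISSING__"; // sentinel value, plays the role of the -1 score

async function getPollCached(pollId) {
    const key = `polls:id#${pollId}`;
    const cached = await cache.getAsync(key);
    if (cached === MISSING) return undefined;       // known to be absent
    if (cached !== null) return JSON.parse(cached); // regular cache hit

    const poll = await getPollById(pollId);         // miss: hit the database once
    if (typeof poll === "undefined") {
        // Short TTL so a poll created later becomes visible reasonably soon.
        await cache.setAsync(key, MISSING, "EX", 60 * 5);
        return undefined;
    }
    await cache.setAsync(key, JSON.stringify(poll), "EX", 60 * 60 * 3);
    return poll;
}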
I'm writing an application using Node.js 6.3.0 and AWS DynamoDB.
DynamoDB holds statistics that are added from 10 different functions (10 different statistical measures). The interval is set to 10 seconds, which means that every 10 seconds, 10 calls are made to my function to add all the relevant information.
The putItem function:
function putItem(tableName, itemData, callback) {
    var params = {
        TableName: tableName,
        Item: itemData
    };
    docClient.put(params, function(err, data) {
        if (err) {
            logger.error(params, "putItem failed in dynamodb");
            callback(err, null);
        } else {
            callback(null, data);
        }
    });
}
Now I have created a queue:
var queue = require('./dynamoDbQueue').queue;
It implements a simple fixed-size queue that I took from http://www.bennadel.com/blog/2308-creating-a-fixed-length-queue-in-javascript-using-arrays.htm.
The idea is that if there is a network problem, let's say for a minute, I want all the events to be pushed to the queue, and when the problem is resolved, to send the queued information to DynamoDB and free the queue.
So I modified my original function to the following:
function putItem(tableName, itemData, callback) {
    var params = {
        TableName: tableName,
        Item: itemData
    };
    if (queue.length > 0) {
        queue.push(params);
        callback(null, null);
    } else {
        docClient.put(params, function (err, data) {
            if (err) {
                queue.push(params);
                logger.error(params, "putItem failed in dynamodb");
                handleErroredQueue(); // imaginary function that I need to implement
                callback(err, null);
            } else {
                callback(null, data);
            }
        });
    }
}
But since I have 10 insert functions that run in the same second, there is a chance of race conditions, which means that:
Execution 1: one function validates that the queue is empty and is about to call docClient.put().
Execution 2: at the same time another function returns from docClient.put() with an error and, as a result, adds the first row to the queue.
Execution 1: by the time the first function calls docClient.put(), the problem has been resolved and it successfully inserts its data into DynamoDB, which leaves the queue holding earlier data that will only be released in the next iteration.
So, for example, if I insert 4 rows with ids 1, 2, 3, 4, the order in which the rows end up in DynamoDB could be 1, 2, 4, 3.
Is there a way to resolve that?
Thanks!
I think you are on the right track, but instead of checking for an error and then adding to the queue, what I would suggest is adding every operation to the queue first and then always reading the data from the queue.
For instance, in your case you call functions 1, 2, 3, 4 and it results in 1, 2, 4, 3 because you only use the queue at the time of an error/abrupt operation.
Step 1: All your functions make an entry into a queue -> 1, 2, 3, 4
Step 2: Read from the queue and make an insert; on success remove the element, else redo the operation. This way it will insert in the desired sequence.
Another advantage is that because you are using a queue you don't have to provision very high throughput for the table.
Edit:
I guess you just need to ensure that you only start the next operation once the previous one has completed, and not before.
e.g.: fn 1 -> read from queue (don't delete it from the queue yet) -> operation completed (if not, perform it again) -> delete from queue -> perform next operation.
You just have to make sure you read from the queue and wait until you get a response from DynamoDB before moving on; a sketch of this loop follows below.
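A minimal sketch of that flow, reusing the question's docClient, queue and logger; drainQueue() and the retry delays are illustrative names and values, not a prescribed API:

// Sketch: enqueue every write, then drain serially so order is preserved.
function putItem(tableName, itemData, callback) {
    // Instead of writing directly, always enqueue first.
    queue.push({ TableName: tableName, Item: itemData });
    callback(null, null);
}

function drainQueue() {
    if (queue.length === 0) {
        return setTimeout(drainQueue, 1000); // nothing to do, check again later
    }
    var params = queue[0]; // peek, don't remove yet
    docClient.put(params, function (err) {
        if (err) {
            logger.error(params, "putItem failed in dynamodb, will retry");
            return setTimeout(drainQueue, 5000); // retry the same item later
        }
        queue.shift(); // only remove once DynamoDB has confirmed the write
        drainQueue();  // move on to the next item
    });
}

drainQueue();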
Hope this helps.
Say I have a link aggregation app where users vote on links. I sort the links using hotness scores generated by an algorithm that runs whenever a link is voted on. However, running it on every vote seems excessive. How do I limit it so that it runs no more than, say, once every 5 minutes?
a) Use a cron job.
b) Keep track of the timestamp when the procedure was last run, and when (current timestamp - stored timestamp) > 5 minutes, run the procedure and update the timestamp; see the sketch below.
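A minimal sketch of option (b); recomputeHotness() and onVote() are hypothetical placeholders for whatever scoring algorithm and vote handler the app already has:

// Throttle the hotness recomputation to at most once per 5-minute window.
var FIVE_MINUTES = 5 * 60 * 1000;
var lastRunAt = 0;

function onVote(linkId) {
    // ...record the vote itself here...
    if (Date.now() - lastRunAt > FIVE_MINUTES) {
        lastRunAt = Date.now();
        recomputeHotness(); // runs at most once per 5 minutes
    }
}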
var yourVoteStuff = function() {
    ...
    setTimeout(yourVoteStuff, 5 * 60 * 1000);
};
yourVoteStuff();
Before asking why not use setInterval, well, read the comment below.
Why "why setInterval" and not "why a cron job"? Am I that wrong?
First you build a receiver that receives all your link submissions.
Secondly, the receiver push()es each received link onto a queue (I strongly recommend Redis).
Then you have an aggregator which loops at whatever time interval you desire. Within this loop each queued link is poll()ed and passed on to your business logic.
I have used this solution at production level and I can tell you that it scales well and performs well too.
Example of use:
var MIN = 5;         // don't run aggregation for a short queue, saves resources
var THROTTLE = 10;   // aggregations/sec
var queue = [];
var bucket = [];
var interval = 1000; // 1 sec

flow.on("submission", function(link) {
    queue.push(link);
});

___aggregationLoop(interval);

function ___aggregationLoop(interval) {
    setTimeout(function() {
        bucket = [];
        if (queue.length <= MIN) {
            ___aggregationLoop(100); // intensive
            return;
        }
        for (var i = 0; i < THROTTLE; ++i) {
            bucket.push(queue.pop());
        }
        // ...pass the bucket on to your business logic here...
        ___aggregationLoop(interval);
    }, interval);
}
Cheers!