I have a database in a Firebase Realtime Database with data that looks like this:
root
|_ history
   |_ {userId}
      |_ {n1}
      |  |_ ...
      |_ {n2}
      |_ {n...}
The n nodes are keyed by an integer date value. Each n node has at least 60 keys, some of the values are arrays, and the data is at most 5 levels deep.
Query times were measured in a fashion similar to this:
const startTime = performance.now();
await query();
const endTime = performance.now();
logger.info(`Query completed in ${endTime - startTime} ms`);
I have a function that queries for n nodes under history/${userId} with keys between the start and end values, inclusive:
await admin
  .database()
  .ref(`history/${userId}`)
  .orderByKey()
  .startAt(`${start}`)
  .endAt(`${end}`)
  .once("value")
The query is executed in a callable Cloud Function. It currently takes approximately 2-3 seconds and returns approximately 225 nodes; the total number of n nodes is currently fewer than 300. Looking through my logs, queries that returned 0 nodes took approximately 500 milliseconds.
Why are the queries so slow? Am I misunderstanding something about Firebase's Realtime Database?
I've run a few performance tests so you have something to compare against.
I populated my database with this script:
for (var i = 0; i < 500; i++) {
  ref.push({
    refresh_at: Date.now() + Math.round(Math.random() * 60 * 1000)
  });
}
This led to JSON of this form:
{
  "-MlWgH51ia7Iz7ubZb7K" : {
    "refresh_at" : 1633726623247
  },
  "-MlWgH534FgMlb7J4bH2" : {
    "refresh_at" : 1633726586126
  },
  "-MlWgH54gd-uW_M7e6J-" : {
    "refresh_at" : 1633726597651
  },
  ...
}
When retrieved in its entirety through the API, JSON.stringify(snapshot.val()) for this data is 26,001 characters long.
Client-side JavaScript SDK in jsbin
I ran the test with the regular client-side JavaScript SDK in a jsbin, using a simple script similar to yours. Adapted for jsbin, the code I ran is:
var startTime = performance.now();
ref.orderByChild("refresh_at")
   .endAt(Date.now())
   .limitToLast(1000) // 👈 This is what we'll vary
   .once("value")
   .then(function(snapshot) {
     var endTime = performance.now();
     console.log('Query completed in ' + Math.round(endTime - startTime) + 'ms, retrieved ' +
       snapshot.numChildren() + ' nodes, for a total JSON size of ' +
       JSON.stringify(snapshot.val()).length + ' chars');
   });
Running it a few times, and changing the limit that I marked, leads to:
Limit   Snapshot size   Average time in ms
500     26,001          350ms - 420ms
100     5,201           300ms - 350ms
10      521             300ms - 320ms
Node.js Admin SDK
I ran the same test against the exact same data set with a local Node.js script using the Admin SDK, modified to run the query 10 times:
for (var i = 0; i < 10; i++) {
  const startTime = Date.now();
  const snapshot = await ref.orderByChild("refresh_at")
    .endAt(Date.now())
    .limitToLast(10) // 👈 This is what we'll vary
    .once("value");
  const endTime = Date.now();
  console.log('Query completed in ' + Math.round(endTime - startTime) + 'ms, retrieved ' +
    snapshot.numChildren() + ' nodes, for a total JSON size of ' +
    JSON.stringify(snapshot.val()).length + ' chars');
}
The results:
Limit   Snapshot size   Time in ms
500     26,001          507ms, 78ms, 70ms, 65ms, 65ms, 61ms, 64ms, 65ms, 81ms, 62ms
100     5,201           442ms, 59ms, 56ms, 59ms, 55ms, 54ms, 54ms, 55ms, 57ms, 56ms
10      521             437ms, 52ms, 49ms, 52ms, 51ms, 51ms, 52ms, 50ms, 52ms, 50ms
So what you can see is that the first run is similar to (but slightly slower than) the JavaScript SDK, and subsequent runs are then a lot faster. This makes sense, as on the initial run the client establishes its (web socket) connection to the database server, which includes a few round trips to determine the right server. Subsequent calls appear to be mostly bandwidth constrained.
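If you want to time the query itself without the connection setup, one option (a minimal sketch, not code from the question or from my test) is to do a cheap warm-up read first, so the socket is already established before you start measuring:

// Hypothetical sketch: the warm-up read forces the SDK to establish its
// web socket connection, so the timed query below mostly measures
// query + transfer time rather than connection setup.
const ref = admin.database().ref(`history/${userId}`);

await ref.limitToFirst(1).once("value"); // warm-up read, result ignored

const startTime = Date.now();
const snapshot = await ref
  .orderByKey()
  .startAt(`${start}`)
  .endAt(`${end}`)
  .once("value");
console.log(`Query completed in ${Date.now() - startTime} ms, ` +
  `retrieved ${snapshot.numChildren()} nodes`);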
Ordering by key
I also tested with ref.orderByKey().startAt("-MlWgH5QUkP5pbQIkVm0").endAt("-MlWgH5Rv5ij42Vel5Sm") in Node.js and got very similar results to ordering by child.
Add an index on the field you are querying by to your Realtime Database rules.
For example:
{
  "rules": {
    ".read": "auth.uid != null",
    ".write": "auth.uid != null",
    "v1": {
      "history": {
        ".indexOn": "refresh_at"
      }
    }
  }
}
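Note that .indexOn only matters for queries that order by that child; a key-ordered query like the one in the question doesn't need it, because keys are always indexed. A query that benefits from the rule above would look roughly like this (the path follows the rules example, not necessarily your exact data layout):

// Hypothetical sketch: with ".indexOn": "refresh_at" defined at "v1/history",
// the server can order and filter on refresh_at itself instead of sending
// everything to the client to be filtered there.
const snapshot = await admin
  .database()
  .ref("v1/history")
  .orderByChild("refresh_at")
  .endAt(Date.now())
  .once("value");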
Related
I'm trying to execute an insert query for each row of a query stream using pg-promise with pg-query-stream. With the approach I have, memory usage increases with each query executed.
I've also narrowed the problem down to executing any query during the stream, not just inserts. I currently listen for 'data' events on the stream, pause the stream, execute a query, and resume the stream. I've also tried piping the query stream into a writable stream that executes the query, but I get an error saying the db connection is already closed.
let count = 0;
const startTime = new Date();
const qs = new QueryStream('SELECT 1 FROM GENERATE_SERIES(1, 1000000)');

db.stream(qs, stream => {
  stream.on('data', async () => {
    count++;
    stream.pause();
    await db.one('SELECT 1');
    if (count % 10000 === 0) {
      const duration = Math.round((new Date() - startTime) / 1000);
      const mb = Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
      console.log(`row ${count}, ${mb}MB, ${duration} seconds`);
    }
    stream.resume();
  });
});
I expected the memory usage to hover around a constant value, but the output looks like the following:
row 10000, 105MB, 4 seconds
row 20000, 191MB, 6 seconds
row 30000, 278MB, 9 seconds
row 40000, 370MB, 10 seconds
row 50000, 458MB, 14 seconds
It takes over 10 minutes to reach row 60000.
UPDATE:
I edited the code above to use async/await, so it waits for the inner query to finish, and I increased the series to 10,000,000. I ran the Node process with 512MB of memory; the program slows significantly when approaching that limit but doesn't crash. This problem occurred with Node v10 and not with v11+.
This is due to invalid use of promises / asynchronous code.
The call db.one('SELECT 1') isn't chained to anything, so it spawns loose promises at a fast rate, which in turn pollutes memory.
You need to chain it, either with .then().catch() or with await.
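A minimal sketch of the data handler with the inner query chained properly (assuming the same db, QueryStream, and pause/resume approach as in the question):

// Sketch: pause the stream, await the inner query (handling its errors),
// then resume. No promise is left dangling.
db.stream(qs, stream => {
  stream.on('data', async () => {
    stream.pause();
    try {
      await db.one('SELECT 1'); // chained via await
    } catch (err) {
      console.error('inner query failed:', err);
    } finally {
      stream.resume();
    }
  });
});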
I am currently using this piece of code to connect to a massive list of links (a total of 2458 links, dumped at https://pastebin.com/2wC8hwad) to get feeds from numerous sources and deliver them to users of my program.
It basically splits one massive array into multiple batches (arrays), then forks a process to handle a batch by requesting each stored link and checking for a 200 status code. Only when a batch is complete is the next batch sent for processing, and when it's all done the forked process is disconnected. However, I'm seeing apparently inconsistent behavior with this logic, particularly in the part where it requests each link.
const req = require('./request.js')
const process = require('child_process')
const linkList = require('./links.json')
let processor
console.log(`Total length: ${linkList.length}`) // 2458 links

const batchLength = 400
const batchList = [] // Contains batches (arrays) of links
let currentBatch = []

for (var i in linkList) {
  if (currentBatch.length < batchLength) currentBatch.push(linkList[i])
  else {
    batchList.push(currentBatch)
    currentBatch = []
    currentBatch.push(linkList[i])
  }
}
if (currentBatch.length > 0) batchList.push(currentBatch)

console.log(`Batch list length by default is ${batchList.length}`)
// cutDownBatchList(1)
console.log(`New batch list length is ${batchList.length}`)

const startTime = new Date()
getBatchIsolated(0, batchList)
let failCount = 0

function getBatchIsolated (batchNumber) {
  console.log('Starting batch #' + batchNumber)
  let completedLinks = 0
  const currentBatch = batchList[batchNumber]
  if (!processor) processor = process.fork('./request.js')
  for (var u in currentBatch) { processor.send(currentBatch[u]) }
  processor.on('message', function (linkCompletion) {
    if (linkCompletion === 'failed') failCount++
    if (++completedLinks === currentBatch.length) {
      if (batchNumber !== batchList.length - 1) setTimeout(getBatchIsolated, 500, batchNumber + 1)
      else finish()
    }
  })
}

function finish () {
  console.log(`Completed, time taken: ${((new Date() - startTime) / 1000).toFixed(2)}s. (${failCount}/${linkList.length} failed)`)
  processor.disconnect()
}

function cutDownBatchList (maxBatches) {
  for (var r = batchList.length - 1; batchList.length > maxBatches && r >= 0; r--) {
    batchList.splice(r, 1)
  }
  return batchList
}
Below is request.js, using needle. (However, for some strange reason it may completely hang up on a particular site indefinitely - in that case, I just use this workaround)
const needle = require('needle')

function connect (link, callback) {
  const options = {
    timeout: 10000,
    read_timeout: 8000,
    follow_max: 5,
    rejectUnauthorized: true
  }
  const request = needle.get(link, options)
    .on('header', (statusCode, headers) => {
      if (statusCode === 200) callback(null, link)
      else request.emit('err', new Error(`Bad status code (${statusCode})`))
    })
    .on('err', err => callback(err, link))
}

process.on('message', function (linkRequest) {
  connect(linkRequest, function (err, link) {
    if (err) {
      console.log(`Couldn't connect to ${link} (${err})`)
      process.send('failed')
    } else process.send('success')
  })
})
In theory, I think this should perform perfectly fine: it spawns off a separate process to handle the dirty work in sequential batches so it's not overloaded, and it's very scalable. However, when using the full list of 2458 links with a total of 7 batches, I often get massive "socket hang up" errors on random batches on almost every trial, similar to what would happen if I requested all the links at once.
If I cut down the number of batches to 1 using the function cutDownBatchList, it performs perfectly fine on almost every trial. This is all happening on a Debian Linux VPS with two 3.1 GHz vCores and 4 GB RAM from OVH, on Node v6.11.2.
One thing I also noticed is that if I increase the timeout to 30000 (30 sec) in request.js for 7 batches, it works as intended; however, it works perfectly fine with a much lower timeout when I cut it down to 1 batch. If I try to request all 2458 links at once with a higher timeout, I also face no issues (which basically makes this mini algorithm useless if I can't cut down the timeout by handling the links in batches). This all goes back to the inconsistency issue.
The best TL;DR I can do: I'm trying to request a bunch of links in sequential batches in a forked child process. It succeeds almost every time with a lower number of batches, but fails consistently with the full number of batches, even though the behavior should be the same since it's handled in isolated batches.
Any help would be greatly appreciated in solving this issue as I just cannot for the life of me figure it out!
I am developing a realtime application (Node, socket.io and MongoDB with Mongoose) which receives data every 30 seconds. The received data is some metadata about the machine plus 10 pressures.
I have a document per day that is preallocated when the day changes (avoiding moving data around in the database as documents grow while saving), so I only have to do updates.
The data looks like:
{
  metadata: { .... },
  data: {
    "0": {
      "0": {
        "0": {
          "pressures": { ..... }
        },
        "30": {}
      },
      "1": {},
      "59": {}
    },
    "1": {},
    "23": {}
  }
}
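For illustration only, a single 30-second update against that preallocated structure could look roughly like this (the model name, moment usage, and dot-notation path are assumptions based on the example above, not the code actually used):

// Hypothetical sketch: update one 30-second slot in the preallocated day
// document via a dot-notation path, so only that slot is written.
function buildSlotUpdate(date, pressures) {
  // date is assumed to be a moment instance; seconds are either 0 or 30
  const path = `data.${date.hours()}.${date.minutes()}.${date.seconds()}.pressures`;
  return { $set: { [path]: pressures } };
}

// e.g. DataDay.update({ _id: dayId }, buildSlotUpdate(day, pressures)).exec();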
Without doing any database operation on the server, I receive the data from the sensors every 30 seconds without problems and never lose the socket.io connection:
DATA 2016-09-30T16:02:00+02:00
DATA 2016-09-30T16:02:30+02:00
DATA 2016-09-30T16:03:00+02:00
DATA 2016-09-30T16:03:30+02:00
DATA 2016-09-30T16:04:00+02:00
DATA 2016-09-30T16:04:30+02:00
but when I start doing updates (calling DataDay.findById(_id).exec()...) I lose half of the data and sometimes the socket.io connection, e.g. at 16:18h, 17:10h, 17:49h, 18:12h, etc. It's like the server stops receiving socket data at intervals:
DATA 2016-09-30T16:02:00+02:00
LOST
LOST
DATA 2016-09-30T16:03:30+02:00
DATA 2016-09-30T16:04:00+02:00
LOST
LOST
DATA 2016-09-30T16:05:30+02:00
LOST
I am using MongoDB with Mongoose (with Bluebird promises). I am probably doing some blocking operation or something else wrong, but I can't figure out what.
The code handling the incoming data is:
socket.on('machine:data', function (data) {
  console.log('DATA ' + data.metada.date);
  var startAt = Date.now();                 // Only for testing
  dataYun = data;
  var _id = _createIdDataDay(dataYun._id);  // Synchronous
  DataDay.findById(_id).exec()              // Asynchronous
    .then( _handleEntityNotFound )          // Synchronous
    .then( _createPreData )                 // Asynchronous
    .then( _saveUpdates )                   // Asynchronous
    .then( function () {
      console.log('insertData: ' + (Date.now() - startAt) + ' ms');
    })
    .catch( _handleError(data) );
  console.log('AFTER THE INSERT METHOD');
  console.log(data.data.pressures);
});
I have measured how expensive the operations are:
_createIdDataDay: 0 ms
_handleEntityNotFound: 1 ms
_createPreData: 709 ms // only executed once a day
_saveUpdates: 452 ms
insertData: 452 ms
This test has been done with only one machine sending data, but the goal is to receive data from 50 to 100 machines, all of them sending data at the same time.
So from this test the conclusion is that every 30 seconds I have to update the database, and the operation takes more or less 452 ms.
So I don't understand where the problem is.
Is 452 ms too expensive for an update?
Even so, I am not doing any other operation, and the next data arrives 30 seconds later, so it doesn't make sense to lose data.
I know that promises don't work well for multiple events (but I don't think that's the case here); I'm not sure though.
Can it be a problem with socket.io?
Or am I simply doing something that blocks the event loop without seeing it?
Thanks
In my Node.js application I read messages from an AWS Kinesis stream, and I need to store all messages from the last minute in a cache (Redis). I run the following code in a single Node worker:
var loopCallback = function (record) {
  var nowMinute = moment.utc(record.Data.ts).minute();
  // get all cached kinesis records
  var key = "kinesis";
  cache.get(key, function (err, cachedData) {
    if (err) {
      utils.logError(err);
    } else {
      if (!cachedData) {
        cachedData = [];
      } else {
        cachedData = JSON.parse(cachedData);
      }
      // get records with the same minute
      var filtered = _.filter(cachedData, function (item) {
        return moment.utc(item.ts).minute() === nowMinute;
      });
      filtered.push(record.Data);
      cache.set(key, JSON.stringify(filtered), function (saveErr) {
        if (saveErr) {
          utils.logError(saveErr);
        }
        // do other things with record;
      });
    }
  });
};
I receive most of the records (a few dozen) at exactly the same moment, so when I try to save them, some records are not stored.
I understand this happens due to a race condition:
Node reads an old version of the array from Redis and overwrites the array while another record is being written to the cache.
I have read about Redis transactions, but as I understand it they will not help me, because only one transaction will complete and the others will be rejected.
Is there a way to save all records to the cache in my case?
Thank you
You could use a sorted set, with the score being a Unix timestamp
ZADD kinesis <unixtimestamp> "some data to be cached"
To get the elements added less than one minute ago, create a timestamp for (now - 60 seconds) then use ZRANGEBYSCORE to get the oldest element first:
ZRANGEBYSCORE myzset (timestamp +inf
or ZREVRANGEBYSCORE if you want the newest element first:
ZREVRANGEBYSCORE myzset +inf (timestamp
To remove the elements older than one minute, create a timestamp for (now - 60 seconds) then use ZREMRANGEBYSCORE
ZREMRANGEBYSCORE myzset -inf (timestamp
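From Node, that could look roughly like this (a sketch assuming a node_redis-style callback client named cache, as in the question; the key and helper names are illustrative):

// Hypothetical sketch: each record goes into a sorted set scored by its
// timestamp. ZADD is atomic per record, so concurrent writers no longer
// overwrite each other's data the way GET + SET does.
var key = "kinesis";

function saveRecord(record, done) {
  var score = moment.utc(record.Data.ts).valueOf(); // Unix timestamp in ms
  cache.zadd(key, score, JSON.stringify(record.Data), done);
}

function getLastMinute(done) {
  var cutoff = Date.now() - 60 * 1000;
  cache.zrangebyscore(key, '(' + cutoff, '+inf', function (err, members) {
    if (err) return done(err);
    done(null, members.map(function (m) { return JSON.parse(m); }));
  });
}

function trimOldRecords(done) {
  var cutoff = Date.now() - 60 * 1000;
  cache.zremrangebyscore(key, '-inf', '(' + cutoff, done);
}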
Say I have a link aggregation app where users vote on links. I sort the links using hotness scores generated by an algorithm that runs whenever a link is voted on. However, running it on every vote seems excessive. How do I limit it so that it runs no more than, say, once every 5 minutes?
a) Use a cron job.
b) Keep track of the timestamp when the procedure was last run; when the current timestamp minus the stored timestamp is greater than 5 minutes, run the procedure and update the timestamp (a minimal sketch follows).
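A minimal sketch of option b) (the function and variable names are placeholders, not from the question):

// Hypothetical sketch: recompute hotness only if the last run was more
// than 5 minutes ago.
const FIVE_MINUTES = 5 * 60 * 1000;
let lastRunAt = 0;

function onVote(linkId) {
  recordVote(linkId); // your existing vote handling (assumed)
  const now = Date.now();
  if (now - lastRunAt > FIVE_MINUTES) {
    lastRunAt = now;
    recomputeHotnessScores(); // the expensive ranking procedure (assumed)
  }
}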
var yourVoteStuff = function() {
...
setTimeout(yourVoteStuff, 5 * 60 * 1000);
};
yourVoteStuff();
Before asking why I don't use setInterval, well, read the comment below.
Why ask "why not setInterval?" and not "why not a cron job?"? Am I that wrong?
First, you build a receiver that receives all your link submissions.
Second, the receiver push()es each received link onto a queue (I strongly recommend Redis; a Redis-backed sketch follows the example below).
Then you have an aggregator which loops at an interval of your choice. Within this loop, each queued link is poll()ed and passed on to your business logic.
I have used this solution at production level and I can tell you that it scales well, and it performs too.
Example of use:
var MIN = 5;         // don't run aggregation for a short queue, saves resources
var THROTTLE = 10;   // aggregations/sec
var queue = [];
var bucket = [];
var interval = 1000; // 1 sec

flow.on("submission", function (link) {
  queue.push(link);
});

___aggregationLoop(interval);

function ___aggregationLoop(interval) {
  setTimeout(function () {
    bucket = [];
    if (queue.length <= MIN) {
      ___aggregationLoop(100); // intensive
      return;
    }
    for (var i = 0; i < THROTTLE; ++i) {
      (function (index) {
        bucket.push(this);
      }).call(queue.pop(), i);
    }
    ___aggregationLoop(interval);
  }, interval);
}
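Since I recommend Redis for the queue, here is a rough sketch of the same receiver/aggregator split backed by a Redis list instead of an in-memory array (a sketch only; the client setup, key name, and handleSubmission are assumptions, while flow is the same emitter as in the example above):

// Hypothetical sketch: Redis-backed queue using node_redis (callback API).
// The receiver LPUSHes submissions; the aggregator RPOPs up to THROTTLE
// items per tick and hands them to the business logic.
var redis = require('redis');
var client = redis.createClient();
var KEY = 'submissions';
var THROTTLE = 10;
var INTERVAL = 1000; // 1 sec

// receiver
flow.on('submission', function (link) {
  client.lpush(KEY, JSON.stringify(link));
});

// aggregator
(function aggregationLoop() {
  setTimeout(function () {
    var remaining = THROTTLE;
    (function popNext() {
      if (remaining-- === 0) return aggregationLoop();
      client.rpop(KEY, function (err, item) {
        if (err || item === null) return aggregationLoop(); // empty queue or error
        handleSubmission(JSON.parse(item)); // your business logic (assumed)
        popNext();
      });
    })();
  }, INTERVAL);
})();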
Cheers!