Losing data in a MongoDB real-time application - node.js

I am developing a real-time application (Node, socket.io and MongoDB with Mongoose) which receives data every 30 seconds. The data received is some metadata about the machine and 10 pressures.
I have one document per day that is preallocated when the day changes (avoiding moving data around in the database as the document grows while saving data), so I only have to do updates.
The data looks like:
{
    metadata: { .... },
    data: {
        "0": {                      // hour
            "0": {                  // minute
                "0": {              // second (0 or 30)
                    "pressures": { ..... }
                },
                "30": {}
            },
            "1": {},
            "59": {}
        },
        "1": {},
        "23": {}
    }
}
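For illustration only, here is a minimal sketch (assuming the hour/minute/second keys shown above) of how one 30-second sample could be written into this preallocated structure with a single atomic $set on a dot path, instead of a findById + save round trip; the helper name is hypothetical:
// Hypothetical sketch: build a single $set update for one sample.
// Assumes the data.<hour>.<minute>.<second>.pressures layout shown above.
function buildSampleUpdate(date, pressures) {
    var path = 'data.' + date.getHours() + '.' + date.getMinutes() + '.' + date.getSeconds() + '.pressures';
    var update = { $set: {} };
    update.$set[path] = pressures;
    return update;
}

// Possible usage with the DataDay model from the question:
// DataDay.update({ _id: _id }, buildSampleUpdate(new Date(data.metadata.date), data.data.pressures)).exec()
//     .then(function () { console.log('sample stored'); })
//     .catch(_handleError(data));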
Without doing any database operation on the server, I receive the data from the sensors every 30 seconds without problems and never lose the socket.io connection:
DATA 2016-09-30T16:02:00+02:00
DATA 2016-09-30T16:02:30+02:00
DATA 2016-09-30T16:03:00+02:00
DATA 2016-09-30T16:03:30+02:00
DATA 2016-09-30T16:04:00+02:00
DATA 2016-09-30T16:04:30+02:00
but when I start doing updates (calling DataDay.findById(_id).exec()...) I lose half of the data and sometimes the socket.io connection (e.g. at 16:18, 17:10, 17:49, 18:12, ...). It is as if the server stops receiving socket information at intervals:
DATA 2016-09-30T16:02:00+02:00
LOST
LOST
DATA 2016-09-30T16:03:30+02:00
DATA 2016-09-30T16:04:00+02:00
LOST
LOST
DATA 2016-09-30T16:05:30+02:00
LOST
I am using MongoDB with Mongoose (with Bluebird promises), but I am probably doing some blocking operation or something else wrong, and I can't find it.
The code handling the incoming data is:
socket.on('machine:data', function (data) {
    console.log('DATA ' + data.metadata.date);
    var startAt = Date.now(); // Only for testing
    dataYun = data;
    var _id = _createIdDataDay(dataYun._id);  // Synchronous
    DataDay.findById(_id).exec()              // Asynchronous
        .then( _handleEntityNotFound )        // Synchronous
        .then( _createPreData )               // Asynchronous
        .then( _saveUpdates )                 // Asynchronous
        .then( function () {
            console.log('insertData: ' + (Date.now() - startAt) + ' ms');
        })
        .catch( _handleError(data) );
    console.log('AFTER THE INSERT METHOD');
    console.log(data.data.pressures);
});
I have measured how expensive the operations are:
_createIdDataDay: 0 ms
_handleEntityNotFound: 1 ms
_createPreData: 709 ms // only executed once a day
_saveUpdates: 452 ms
insertData: 452 ms
This test has been done with only one machine sending data, but the goal is to receive data from 50 to 100 machines, all of them sending data at the same time.
So from this test, the conclusion is that every 30 seconds I have to update the database, and the operation takes more or less 452 ms.
So I don't understand where the problem is.
Is 452 ms too expensive for an update?
Even so, I am not doing any other operation, and the next data arrives 30 seconds later, so it doesn't make sense to lose data.
I know that promises don't work well for multiple events (though I don't think that's the case here), but I'm not sure.
Can it be a problem with socket.io?
Or am I simply doing something that blocks the event loop and I can't see it?
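(One way to test the "blocking the event loop" hypothesis is a simple lag probe like the sketch below; the interval and threshold are arbitrary and only for diagnosis.)
// Hypothetical diagnostic: log whenever a 1-second timer fires noticeably late.
// Consistently large lag means something is blocking the event loop,
// which would also delay socket.io packets.
var last = Date.now();
setInterval(function () {
    var now = Date.now();
    var lag = now - last - 1000; // how late this tick fired
    if (lag > 100) {
        console.warn('event loop lag: ' + lag + ' ms');
    }
    last = now;
}, 1000);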
Thanks

Related

How to manage massive calls to Postgresql in Node

I have a question regarding massive calls to PostgreSQL.
This is the scenario:
I have a simple Nodejs app that makes queries to PostgreSQL in a short period of time.
Everything is fine, but sometimes these calls get rejected due to PostgreSQL's maximum pool connections setting, which is equal to 100.
I have in mind a queue-consumer approach: add every query to a queue and then consume one element every second, so PostgreSQL receives one query per second.
But my problem is that I don't know where to start. This is the part I am having trouble with: at some point I have a lot of calls and I get lots of "ERROR IN QUERY EXECUTION" for the reason explained above.
const pool3 = new Pool(credentialsPostGres);
let res = [];
let sql_call = "select colum1 from table2 where x = y"; // the real query is a bit more complex, but you get the idea
pool3.query(sql_call, (err, results) => {
    if (err) {
        pool3.end();
        console.log(err + " ERROR IN QUERY EXECUTION");
    } else {
        res.push({ data: Object.values(JSON.parse(JSON.stringify(results.rows))) });
        pool3.end();
        return callback(res, data);
    }
})
How should I manage this part with a queue? I am a bit lost.
Help!
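A minimal sketch of the queue idea described in the question, assuming a single shared Pool (created once and reused) and a fixed drain rate of one query per second; enqueueQuery and the in-memory queue are hypothetical names, not an existing API:
const { Pool } = require('pg');
const pool = new Pool(credentialsPostGres); // create the pool once and reuse it

const queue = [];

// Callers push work onto the queue instead of querying directly.
function enqueueQuery(sql, params, callback) {
    queue.push({ sql, params, callback });
}

// Drain at most one queued query per second.
setInterval(() => {
    const job = queue.shift();
    if (!job) return;
    pool.query(job.sql, job.params, (err, results) => {
        if (err) {
            console.log(err + " ERROR IN QUERY EXECUTION");
            return job.callback(err);
        }
        job.callback(null, results.rows);
    });
}, 1000);

// Usage:
// enqueueQuery("select colum1 from table2 where x = $1", [y], (err, rows) => { ... });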

How to handle adding up to 100K entries to Firestore database in a Node.js application

Here is my function where I am trying to save data extracted from an Excel file. I am using the XLSX npm package to extract the data from the Excel file.
function myFunction() {
    const excelFilePath = "/ExcelFile2.xlsx"
    if (fs.existsSync(path.join('uploads', excelFilePath))) {
        const workbook = XLSX.readFile(`./uploads${excelFilePath}`)
        const [firstSheetName] = workbook.SheetNames;
        const worksheet = workbook.Sheets[firstSheetName];
        const rows = XLSX.utils.sheet_to_json(worksheet, {
            raw: false, // Use raw values (true) or formatted strings (false)
            // header: 1, // Generate an array of arrays ("2D Array")
        });
        // res.send({rows})
        const serviceAccount = require('./*******-d75****7a06.json');
        admin.initializeApp({
            credential: admin.credential.cert(serviceAccount)
        });
        const db = admin.firestore()
        rows.forEach((value) => {
            const docRef = db.collection('users').doc();
            docRef.set(value).then((respo) => {
                console.log("Written")
            })
            .catch((reason) => {
                console.log(reason.note)
            })
        })
        console.log(rows.length)
    }
}
Here is the error I am getting, and this process uses up all of my system memory:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
It's pretty normal in Firebase/Firestore land to have errors like this when trying to add too much data at once.
Firebase Functions tend to time out, and even if you configure them to be able to run all the way to 9 minutes, they'll still time out eventually and you end up with partial data and/or errors.
Here's how I do things like this:
Write a function that writes 500 entries at a time (using batch write)
Use an entry identifier (let's call it userId), so the function knows which user was last recorded to the database. Let's call that value lastUserRecorded.
After each iteration (batch write of 500 entries), have your function record the value of lastUserRecorded inside a temporary document in the database.
When the function runs again, it should first read the value of lastUserRecorded in the db, then write a new batch of 500 users, starting AFTER that value (it would select a new set of 500 users from your Excel file, but start after the value of lastUserRecorded).
To avoid running into function timeout issues, I would schedule the function to run every minute (Cloud Scheduler trigger). This way, it's very highly likely that the function will be able to handle the batch of 500 writes, without timing out and recording partial data.
If you do it this way, 100k entries (200 batches of 500, at one batch per minute) will take around 3 hours and 20 minutes to finish.
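A rough sketch of this approach using firebase-admin; the importState/progress document, the lastUserRecorded field and the writeNextBatch name are placeholders I made up, and rows is assumed to be the array parsed from the Excel file:
const admin = require('firebase-admin');
const db = admin.firestore();

// Hypothetical sketch: write the next batch of up to 500 rows and record progress.
// Intended to be called from a function scheduled to run every minute.
async function writeNextBatch(rows) {
    const progressRef = db.collection('importState').doc('progress');
    const progressSnap = await progressRef.get();
    const lastUserRecorded = progressSnap.exists ? progressSnap.data().lastUserRecorded : -1;

    const start = lastUserRecorded + 1;
    const slice = rows.slice(start, start + 500); // next 500 entries after the last one recorded
    if (slice.length === 0) return; // everything has been written

    const batch = db.batch(); // a batch supports up to 500 writes
    slice.forEach((row) => {
        batch.set(db.collection('users').doc(), row);
    });
    await batch.commit();

    // Record progress separately so the batch itself stays within the 500-write limit.
    await progressRef.set({ lastUserRecorded: start + slice.length - 1 });
}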

Run a Cron Job every 30mins after onCreate Firestore event

I want to have a cron job/scheduler that will run every 30 minutes after an onCreate event occurs in Firestore. The cron job should trigger a cloud function that picks the documents created in the last 30 minutes, validates them against a JSON schema, and saves them in another collection. How do I achieve this, i.e. programmatically write such a scheduler?
What would also be a fail-safe mechanism, and some sort of queuing/tracking of the documents created before the cron job runs, so they can be pushed to another collection?
Building a queue with Firestore is simple and fits your use case perfectly. The idea is to write tasks to a queue collection with a due date; they will then be processed when they become due.
Here's an example.
Whenever your initial onCreate event for your collection occurs, write a document with the following data to a tasks collection:
duedate: new Date() + 30 minutes
type: 'yourjob'
status: 'scheduled'
data: '...' // <-- put whatever data here you need to know when processing the task
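For illustration, a minimal sketch of this step; 'mycollection', 'tasks' and enqueueOnCreate are placeholder names I chose, not anything from your project:
const functions = require('firebase-functions');
const admin = require('firebase-admin');

// Hypothetical sketch: whenever a document is created, enqueue a task due 30 minutes later.
exports.enqueueOnCreate = functions.firestore
    .document('mycollection/{docId}')
    .onCreate((snapshot, context) => {
        const due = new Date(Date.now() + 30 * 60 * 1000); // now + 30 minutes
        return admin.firestore().collection('tasks').add({
            duedate: admin.firestore.Timestamp.fromDate(due),
            type: 'yourjob',
            status: 'scheduled',
            data: { sourceId: context.params.docId }, // whatever the worker needs later
        });
    });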
Have a worker pick up available work regularly - e.g. every minute depending on your needs
// Define what happens on each task type
const workers: Workers = {
    yourjob: (data) => db.collection('xyz').add({ foo: data }),
}

// The following needs to be scheduled
export const checkQueue = functions.https.onRequest(async (req, res) => {
    // Consistent timestamp
    const now = admin.firestore.Timestamp.now();
    // Check which tasks are due
    const query = db.collection('tasks')
        .where('duedate', '<=', now)
        .where('status', '==', 'scheduled');
    const tasks = await query.get();
    // Collect the job promises so they can be awaited together
    const jobs: Promise<unknown>[] = [];
    // Process tasks and mark them in the queue as done
    tasks.forEach(snapshot => {
        const { type, data } = snapshot.data();
        console.info('Executing job for task ' + JSON.stringify(type) + ' with data ' + JSON.stringify(data));
        const job = workers[type](data)
            // Update the task doc with the status or the error
            .then(() => snapshot.ref.update({ status: 'complete' }))
            .catch((err) => {
                console.error('Error when executing worker', err);
                return snapshot.ref.update({ status: 'error' });
            });
        jobs.push(job);
    });
    return Promise.all(jobs).then(() => {
        res.send('ok');
        return true;
    }).catch((onError) => {
        console.error('Error', onError);
    });
});
You have different options to trigger the checking of the queue if there is a task that is due:
Using an HTTP callable function as in the example above. This requires you to perform an HTTP call to this function regularly so it executes and checks whether there is a task to be done. Depending on your needs, you could do it from your own server or use a service like cron-job.org to perform the calls. Note that the HTTP function will be publicly available, and potentially others could call it as well. However, if you make your check code idempotent, it shouldn't be an issue.
Use the Firebase "internal" cron option that uses Cloud Scheduler internally. Using that you can directly trigger the queue checking:
export const scheduledFunctionCrontab =
    functions.pubsub.schedule('* * * * *').onRun((context) => {
        console.log('This will be run every minute!');
        // Include the code from checkQueue above here
    });
Using such a queue also makes your system more robust: if something goes wrong in between, you will not lose tasks that would otherwise only exist in memory; as long as they are not marked as processed, the worker will pick them up and reprocess them once things are working again. This of course depends on your implementation.
You can trigger a cloud function on the Firestore onCreate event which will schedule a Cloud Task to run 30 minutes later. This gives you a queuing and retrying mechanism.
An easy way is to add a created field with a timestamp, and then have a scheduled function run at a predefined period (say, once a minute) and execute certain code for all records where created >= NOW - 31 mins AND created <= NOW - 30 mins (pseudocode). If your time precision requirements are not extremely high, that should work for most cases.
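A hedged sketch of that time-window idea with a scheduled function; 'mycollection', the created field and pickRecentDocs are placeholder names for illustration only:
const functions = require('firebase-functions');
const admin = require('firebase-admin');

// Hypothetical sketch: every minute, select documents created 30-31 minutes ago.
exports.pickRecentDocs = functions.pubsub.schedule('every 1 minutes').onRun(async () => {
    const now = Date.now();
    const windowStart = admin.firestore.Timestamp.fromMillis(now - 31 * 60 * 1000);
    const windowEnd = admin.firestore.Timestamp.fromMillis(now - 30 * 60 * 1000);
    const snap = await admin.firestore().collection('mycollection')
        .where('created', '>=', windowStart)
        .where('created', '<=', windowEnd)
        .get();
    // Validate each document against the JSON schema and copy it to another collection here.
    return null;
});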
If this doesn't suit your needs, you can add a Cloud Task (Google Cloud product). The details are specified in this good article.

How to execute queries during a stream using pg-promise?

I'm trying to execute an insert query for each row of a query stream using pg-promise with pg-query-stream. With the approach I have, memory usage increases with each query executed.
I've also narrowed the problem down to just executing any query during the stream, not just inserts. I currently listen for 'data' events on the stream, pause the stream, execute a query, and resume the stream. I've also tried piping the query stream into a writable stream that executes the query, but I get an error saying that the db connection is already closed.
let count = 0;
const startTime = new Date();
const qs = new QueryStream('SELECT 1 FROM GENERATE_SERIES(1, 1000000)');
db.stream(qs, stream => {
    stream.on('data', async () => {
        count++;
        stream.pause();
        await db.one('SELECT 1');
        if (count % 10000 === 0) {
            const duration = Math.round((new Date() - startTime) / 1000);
            const mb = Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
            console.log(`row ${count}, ${mb}MB, ${duration} seconds`);
        }
        stream.resume();
    });
});
I expected the memory usage to hover around a constant value, but the output looks like the following:
row 10000, 105MB, 4 seconds
row 20000, 191MB, 6 seconds
row 30000, 278MB, 9 seconds
row 40000, 370MB, 10 seconds
row 50000, 458MB, 14 seconds
It takes over 10 minutes to reach row 60000.
UPDATE:
I edited the code above to include async/await to wait for the inner query to finish and I increased the series to 10,000,000. I ran the node process with 512MB of memory and the program slows significantly when approaching that limit but doesn't crash. This problem occurred with v10 and not v11+ of node.
This is due to invalid use of promises / asynchronous code.
Line db.one('SELECT 1'); isn't chained to anything, spawning loose promises at a fast rate, which in turn pollutes memory.
You need to chain it either with .then/.catch or with await.

Best way to query all documents from a mongodb collection in a reactive way w/out flooding RAM

I want to query all the documents in a collection in a reactive way. The collection.find() method of the mongodb nodejs driver returns a cursor that fires events for each document found in the collection. So I made this:
const giant_query = (db) => {
    var req = db.collection('mycollection').find({});
    return Rx.Observable.merge(Rx.Observable.fromEvent(req, 'data'),
                               Rx.Observable.fromEvent(req, 'end'),
                               Rx.Observable.fromEvent(req, 'close'),
                               Rx.Observable.fromEvent(req, 'readable'));
}
It will do what I want: fire for each document, so I can treat them in a reactive way, like this:
Rx.Observable.of('').flatMap(giant_query).do(some_function).subscribe()
I could query the documents in batches of ten, but then I'd have to keep track of an index number each time the observable stream fires, and I'd have to make an observable loop, which I don't know is possible or the right way to do it.
The problem with this cursor is that I don't think it does things in batches. It will probably fire all the events in a short period of time, therefore flooding my RAM. Even if I buffer some events in batches using Observable's buffer, the events and event data (the documents) are going to be sitting in RAM waiting to be processed.
What's the best way to deal with this in a reactive way?
I'm not an expert on mongodb, but based on the examples I've seen, this is a pattern I would try.
I've omitted the events other than data, since throttling that one seems to be the main concern.
var cursor = db.collection('mycollection').find({});
const cursorNext = new Rx.BehaviorSubject('next');   // signal first batch then wait
const nextBatch = () => {
    if (cursor.hasNext()) {
        cursorNext.next('next');
    }
};
cursorNext
    .switchMap(() =>                                  // wait for cursorNext to signal
        Rx.Observable.fromPromise(cursor.next())      // get a single doc
            .repeat()                                 // get another
            .takeWhile(() => cursor.hasNext())        // stop taking if out of data
            .take(batchSize)                          // until full batch
            .toArray()                                // combine into a single emit
    )
    .map(docsBatch => {
        // do something with the batch
        // return docsBatch or modified docsBatch
    })
    ... // other operators?
    .subscribe(x => {
        ...
        nextBatch();
    });
I'm trying to put together a test of this Rx flow without MongoDB; in the meantime, this might give you some ideas.
You might also want to check my solution without using RxJS:
Mongoose Cursor: http bulk request from collection
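For comparison, a minimal sketch of a non-RxJS approach, assuming a driver version whose cursor.hasNext()/cursor.next() return promises; some_function is the processing step from the question and processAll is a hypothetical name:
// Hypothetical sketch: pull documents one at a time so only one document
// sits in memory per iteration, instead of letting 'data' events flood in.
async function processAll(db) {
    const cursor = db.collection('mycollection').find({});
    while (await cursor.hasNext()) {
        const doc = await cursor.next();
        await some_function(doc); // process each document before fetching the next one
    }
}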
