mongodb stops while iterating over large DB - node.js

This is a follow-up to this Stack Overflow question: Async Cursor Iteration with Asynchronous Sub-task, with a slightly different twist this time.
While iterating over MongoDB documents, the task stops in the middle if the target DB is too large (more than 3,000 documents in a single collection, each consisting of lengthy text, so .toArray is not really feasible due to the memory limit; 3,000 is only part of the data, and the full set may exceed 10,000 documents). I've noticed that once a collection holds more than roughly 750 documents, the iteration just stops in the middle of the task.
I've searched previous Stack Overflow questions for a solution: some say that iterating over a large collection requires using stream, each, or map instead of a for/while loop with a cursor. When I tried these recommendations in real life, none of them worked; they also just stop in the middle, with almost no difference from the for/while iteration. I don't really like the idea of extending the timeout, since it may leave the cursor drifting around in memory, but that didn't work either.
Every method below runs inside an async function.
stream method
const cursor = db.collection('mycollection').find()
cursor.on('data', async doc => {
  await doSomething(doc) // do something with doc here
})
while/for method (just replace while with for)
const cursor = db.collection('mycollection').find()
while (await cursor.hasNext()) {
  let doc = await cursor.next()
  await doSomething(doc)
}
map/each/forEach method (replace map with forEach/each)
const cursor = db.collection('mycollection').find()
cursor.map(async doc => {
  await doSomething(doc)
})
None of them behaves any differently from the others: they all stop after iterating roughly 750 documents and just hang. I've even tried registering each document in a Promise.all queue and running the async/await tasks all at once later, so that the cursor doesn't spend too much time iterating, but the same problem arises.
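For reference, here is a minimal sketch of a batched variant of that idea (collect documents from the cursor, then await the sub-tasks in bulk). It assumes the same db handle and doSomething as in the snippets above, and the batch size of 100 is an arbitrary assumption:

const cursor = db.collection('mycollection').find()
const batchSize = 100 // hypothetical bulk size
let batch = []
while (await cursor.hasNext()) {
  batch.push(await cursor.next())
  if (batch.length >= batchSize) {
    // run the async sub-tasks for this bulk at once, then continue iterating
    await Promise.all(batch.map(doc => doSomething(doc)))
    batch = []
  }
}
// flush whatever is left over
if (batch.length) await Promise.all(batch.map(doc => doSomething(doc)))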
EDIT: I think doSomething() is confusing other readers, so I have created sample code so that you can reproduce the problem.
const MongoClient = require('mongodb').MongoClient
const MongoUrl = 'mongodb://localhost:27017/'
const MongoDBname = 'testDB'
const MongoCollection = 'testCollection'
const moment = require('moment')

const getDB = () =>
  new Promise((resolve, reject) => {
    MongoClient.connect(MongoUrl, (err, client) => {
      if (err) return reject(err)
      console.log('successfully connected to db')
      return resolve(client.db(MongoDBname))
      client.close() // note: unreachable after the return above
    })
  })

;(async () => {
  console.log(`iteration begins on ${moment().format('YYYY/MM/DD hh:mm:ss')} ------------`)
  let db = await getDB() // receives mongodb
  // iterate through all db articles...
  const cursor = await db.collection(MongoCollection).find()
  const maxDoc = await cursor.count()
  console.log('Amount of target documents: ' + maxDoc)
  let count = 0
  // replace this with stream/while/map... any other iteration method
  cursor.each((err, doc) => {
    count++
    console.log(`preloading doc No.${count} async ${(count / maxDoc * 100).toFixed(2)}%`)
  })
})()
My apologies: on the test run it actually iterated over all the documents... I think I really have done something wrong in the other parts. I'll elaborate on this once I isolate the parts causing the trouble.

Related

Firebase Firestore transactions incredibly slow (3-4 minutes)

Edit: Removing irrelevant code to improve readability
Edit 2: Reducing example to only uploadGameRound function and adding log output with times.
I'm working on a mobile multiplayer word game and was previously using the Firebase Realtime Database, with fairly snappy performance apart from the cold starts: saving an updated game and setting stats would take at most a few seconds. Recently I decided to switch to Firestore for my game data and player stats / top lists, primarily because of the more advanced queries and the automatic scaling with no need for manual sharding.
Now I've got things working on Firestore, but the time it takes to save an updated game and update a number of stats is just ridiculous. I'm clocking an average of 3-4 minutes before the game is updated, the stats are added, and everything is available in the database for other clients and viewable in the web interface. I'm guessing and hoping that this is because of something I've messed up in my implementation, but the transactions all go through and there are no warnings or anything else to go on, really. Looking at the Cloud Functions log, the total time from function call to the completion log statement appears to be a bit more than a minute, but that log doesn't appear until after the same 3-4 minute wait for the data.
Here's the code as it is. If someone has time to have a look and maybe spot what's wrong I'd be hugely grateful!
This function is called from Unity client:
exports.uploadGameRound = functions.https.onCall((roundUploadData, response) => {
  console.log("UPLOADING GAME ROUND. TIME: ");
  var d = new Date();
  var n = d.toLocaleTimeString();
  console.log(n);

  // CODE REMOVED FOR READABILITY. JUST PREPARING SOME VARIABLES TO USE BELOW.
  // NOTHING HEAVY, NO DATABASE TRANSACTIONS. //

  // Get a new write batch
  const batch = firestoreDatabase.batch();

  // Save game info to activeGamesInfo
  var gameInfoRef = firestoreDatabase.collection('activeGamesInfo').doc(gameId);
  batch.set(gameInfoRef, gameInfo);

  // Save game data to activeGamesData
  const gameDataRef = firestoreDatabase.collection('activeGamesData').doc(gameId);
  batch.set(gameDataRef, { gameDataCompressed: updatedGameDataGzippedString });

  if (foundWord !== undefined && foundWord !== null) {
    const wordId = foundWord.timeStamp + "_" + foundWord.word;
    // Save word to allFoundWords
    const wordRef = firestoreDatabase.collection('allFoundWords').doc(wordId);
    batch.set(wordRef, foundWord);
    exports.incrementNumberOfTimesWordFound(gameInfo.language, foundWord.word);
  }

  console.log("COMMITTING BATCH. TIME: ");
  var d = new Date();
  var n = d.toLocaleTimeString();
  console.log(n);

  // Commit the batch
  batch.commit().then(result => {
    return gameInfoRef.update({ roundUploaded: true }).then(function (result2) {
      console.log("DONE COMMITTING BATCH. TIME: ");
      var d = new Date();
      var n = d.toLocaleTimeString();
      console.log(n);
      return;
    });
  });
});
Again, any help in understanding this weird behaviour would be massively appreciated!
Ok, so I found the problem now and thought I should share it:
Simply adding a return statement before the batch commit fixed the function and reduced the time from 4 minutes to less than a second:
return batch.commit().then(result => {   // <-- the added return
  return gameInfoRef.update({ roundUploaded: true }).then(function (result2) {
    console.log("DONE COMMITTING BATCH. TIME: ");
    var d = new Date();
    var n = d.toLocaleTimeString();
    console.log(n);
    return;
  });
});
Your function isn't returning a promise that resolves with the data to send to the client app. In the absence of a returned promise, it will return immediately, with no guarantee that any pending asynchronous work will terminate correctly.
Calling then on a single promise isn't enough to handle all of the promises involved. You likely have a lot of async work going on here, between commit() and other functions such as incrementNumberOfTimesWordFound. You will need to handle all of those promises correctly, and make sure your overall function returns a single promise that resolves only when all that work is complete.
I strongly suggest taking some time to learn how promises work in JavaScript - this is crucial to writing effective functions. Without a full understanding, things will appear to go wrong, or not happen at all, in strange ways.
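For illustration, here is a hedged sketch of the shape being described: collect every piece of pending async work and return a single promise. The names firestoreDatabase, gameInfoRef, gameInfo, foundWord and incrementNumberOfTimesWordFound are taken from the question, and incrementNumberOfTimesWordFound is assumed to return a promise so it can be awaited:

exports.uploadGameRound = functions.https.onCall((roundUploadData, context) => {
  const batch = firestoreDatabase.batch();
  // ... batch.set(...) calls as in the question ...

  const pending = [];
  if (foundWord !== undefined && foundWord !== null) {
    // assumed to return a promise rather than being fired and forgotten
    pending.push(incrementNumberOfTimesWordFound(gameInfo.language, foundWord.word));
  }
  pending.push(
    batch.commit().then(() => gameInfoRef.update({ roundUploaded: true }))
  );

  // return one promise that resolves only when all pending work is done
  return Promise.all(pending).then(() => ({ roundUploaded: true }));
});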

Avoid timeout when reading 18,000 documents on Firestore

My Firestore contains 17500 documents.
It's a list of tokens used to send push notifications.
I store this data in a dictionary so I can use it later:
users = {"fr":[token, token], "en":[token, token]....}
My code:
async function getAllUsers() {
  const snapshot = await admin.firestore().collection('users').get();
  var users = {};
  snapshot.forEach(doc => {
    const userId = doc.id;
    var lang = doc.data().language;
    if (!(lang in users)) {
      users[lang] = [];
      users[lang].push(doc.data().token);
    } else {
      users[lang].push(doc.data().token);
    }
  });
  return users;
}
My code doesn't work anymore. I get a timeout during the forEach loop.
Is it because I have too many documents?
Any idea?
Thanks
It's not clear from your question what exactly is timing out, but there are a couple things you should be aware of.
You certainly can get errors if you attempt to read too many documents in one query. The alternative to this is to use pagination to read the documents in smaller batches so that you don't exceed any query limits.
By default, Cloud Functions imposes a 60-second timeout on function invocations. If you need more than that, you can increase the timeout, but only up to 9 minutes. Beyond that, you have to split your work up among multiple function invocations.
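As an illustration of the pagination idea, here is a hedged sketch using Firestore query cursors (orderBy + startAfter + limit) over the same users collection from the question; the page size of 1,000 is an arbitrary assumption:

async function getAllUsersPaginated(pageSize = 1000) {
  const users = {};
  let lastDoc = null;
  while (true) {
    let query = admin.firestore().collection('users')
      .orderBy(admin.firestore.FieldPath.documentId())
      .limit(pageSize);
    if (lastDoc) query = query.startAfter(lastDoc);
    const snapshot = await query.get();
    if (snapshot.empty) break;
    snapshot.forEach(doc => {
      const lang = doc.data().language;
      (users[lang] = users[lang] || []).push(doc.data().token);
    });
    // remember the last document so the next page starts after it
    lastDoc = snapshot.docs[snapshot.docs.length - 1];
  }
  return users;
}

If the function still needs more time, the timeout can also be raised (up to the 9-minute limit mentioned above) when declaring it, e.g. with functions.runWith({ timeoutSeconds: 540 }).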

Export Mongoose query as an array of nested arrays

Good morning,
I thought that following the instructions of Push element into nested array mongoose nodejs I would manage to solve my issue but I am still stuck.
I am trying to get the results of a Mongoose query into an array. The code below works when the number of objects is relatively small, but I get a "parse error" whenever the volume increases.
I can see that I am not accounting for the fact that the code is asynchronous, but the attempts I have made end up in Promise { <pending> } at best.
const Collection = require("./schema");
const connection = require("mongoose");

let data = [];

Collection.find({}, (e, result) => {
  result.forEach(doc => {
    data.push([doc["a"], doc["b"]]);
  });
})
  .then(() => connection.close());

module.exports = data;
The above is obviously wrong, since I am not respecting the async nature of the operation.
I implemented the below function, but I don't understand how I should resolve the promise.
async function getdata() {
  const cursor = Collection.find().cursor();
  let data = [];
  await cursor.eachAsync(async function (doc) {
    await new Promise(resolve => resolve(data.push(doc)));
  });
}
The aim is that when I do let data = require("./queryResult.js"), data contains all the necessary data: [[a,b],[a,b],...]
Could anybody give me a hand on this one?
Many thanks in advance.
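For illustration, one way to respect the async nature of the operation is to export a promise-returning function rather than a plain array. This is only a sketch, assuming the same schema with fields a and b as in the question:

const Collection = require("./schema");

async function getData() {
  const cursor = Collection.find().cursor();
  const data = [];
  // eachAsync waits for the callback on each document before moving on
  await cursor.eachAsync(doc => {
    data.push([doc.a, doc.b]);
  });
  return data;
}

module.exports = getData;
// elsewhere: const data = await require("./queryResult.js")();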

nodejs functional programming with generators and promises

Summary
Is functional programming in Node.js general enough? Can it be used for a real-world problem such as handling small bulks of DB records without loading all the records into memory with toArray (and thus running out of memory)? You can read this criticism for background. We want to demonstrate the Mux/DeMux and fork/tee/join capabilities of such Node.js libraries with async generators.
Context
I'm questioning the validity and generality of functional programming in Node.js using any functional programming tool (such as ramda, lodash, or imlazy) or even custom code.
Given
Millions of records from a MongoDB cursor that can be iterated using await cursor.next()
You might want to read more about async generators and for-await-of.
For fake data one can use (on node 10)
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function* getDocs(n) {
  for (let i = 0; i < n; ++i) {
    await sleep(1);
    yield { i: i, t: Date.now() };
  }
}

let docs = getDocs(1000000);
Wanted
We need
first document
last document
number of documents
split into batches/bulks of n documents and emit a socket.io event for that bulk
Make sure that first and last documents are included in the batches and not consumed.
Constraints
The millions of records should not be loaded into RAM; one should iterate over them and hold at most a single batch of them.
The requirement can be met using ordinary Node.js code, but can it be done using something like R.applySpec, as in here?
R.applySpec({
  first: R.head(),
  last: R.last(),
  _: R.pipe(
    R.splitEvery(n),
    R.map((i) => { return "emit " + JSON.stringify(i); })
  )
})(input)
To show how this could be modeled with vanilla JS, we can introduce the idea of folding over an async generator that produces things that can be combined together.
const foldAsyncGen = (of, concat, empty) => (step, fin) => async asyncGen => {
  let acc = empty
  for await (const x of asyncGen) {
    acc = await step(concat(acc, of(x)))
  }
  return await fin(acc)
}
Here the arguments are broken up into three parts:
(of, concat, empty) expects a function that produces a "combinable" thing, a function that combines two "combinable" things, and an empty/initial instance of a "combinable" thing
(step, fin) expects a function that takes a "combinable" thing at each step and produces a Promise of a "combinable" thing to be used for the next step, and a function that takes the final "combinable" thing after the generator is exhausted and produces a Promise of the final result
async asyncGen is the async generator to process
In FP, the idea of a "combinable" thing is known as a Monoid, which defines some laws that detail the expected behaviour of combining two of them together.
We can then create a Monoid that will be used to carry through the first, last and batch of values when stepping through the generator.
const Accum = (first, last, batch) => ({
  first,
  last,
  batch,
})

Accum.empty = Accum(null, null, []) // an initial instance of `Accum`
Accum.of = x => Accum(x, x, [x])    // an `Accum` instance of a single value
Accum.concat = (a, b) =>            // how to combine two `Accum` instances together
  Accum(a.first == null ? b.first : a.first, b.last, a.batch.concat(b.batch))
To capture the idea of flushing the accumulating batches, we can create another function that takes an onFlush function, which performs some action (in a returned Promise) with the values being flushed, and a size n at which to flush the batch.
Accum.flush = onFlush => n => acc =>
  acc.batch.length < n
    ? Promise.resolve(acc)
    : onFlush(acc.batch.slice(0, n))
        .then(_ => Accum(acc.first, acc.last, acc.batch.slice(n)))
We can also now define how we can fold over the Accum instances.
Accum.foldAsyncGen = foldAsyncGen(Accum.of, Accum.concat, Accum.empty)
With the above utilities defined, we can now use them to model your specific problem.
const emit = batch => // This is an analog of where you would emit your batches
  new Promise((resolve) => resolve(console.log(batch)))

const flushEmit = Accum.flush(emit)

// flush and emit every 10 items, and also the remaining batch when finished
const fold = Accum.foldAsyncGen(flushEmit(10), flushEmit(0))
And finally run with your example.
fold(getDocs(100))
  .then(({ first, last }) => console.log('done', first, last))
I'm not sure it's fair to imply that functional programming was going to offer any advantages over imperative programming in terms of performance when dealing with huge amounts of data.
I think you need to add another tool in your toolkit and that may be RxJS.
RxJS is a library for composing asynchronous and event-based programs by using observable sequences.
If you're not familiar with RxJS or reactive programming in general, my examples will definitely look weird, but I think it would be a good investment to get familiar with these concepts.
In your case, the observable sequence is your MongoDB instance that emits records over time.
I'm gonna fake your db:
// assuming RxJS 6 style imports for the snippets below:
const { range, merge } = require('rxjs');
const { first, last, bufferCount } = require('rxjs/operators');

var db = range(1, 5);
The range function is an RxJS creation function that emits each value in the provided range.
db.subscribe(n => {
  console.log(`record ${n}`);
});
//=> record 1
//=> record 2
//=> record 3
//=> record 4
//=> record 5
Now I'm only interested in the first and last record.
I can create an observable that will only emit the first record, and create another one that will emit only the last one:
var db = range(1, 5);
var firstRecord = db.pipe(first());
var lastRecord = db.pipe(last());
merge(firstRecord, lastRecord).subscribe(n => {
  console.log(`record ${n}`);
});
//=> record 1
//=> record 5
However I also need to process all records in batches: (in this example I'm gonna create batches of 10 records each)
var db = range(1, 100);
var batches = db.pipe(bufferCount(10));
var firstRecord = db.pipe(first());
var lastRecord = db.pipe(last());

merge(firstRecord, batches, lastRecord).subscribe(n => {
  console.log(`record ${n}`);
});
//=> record 1
//=> record 1,2,3,4,5,6,7,8,9,10
//=> record 11,12,13,14,15,16,17,18,19,20
//=> record 21,22,23,24,25,26,27,28,29,30
//=> record 31,32,33,34,35,36,37,38,39,40
//=> record 41,42,43,44,45,46,47,48,49,50
//=> record 51,52,53,54,55,56,57,58,59,60
//=> record 61,62,63,64,65,66,67,68,69,70
//=> record 71,72,73,74,75,76,77,78,79,80
//=> record 81,82,83,84,85,86,87,88,89,90
//=> record 91,92,93,94,95,96,97,98,99,100
//=> record 100
As you can see in the output, it has emitted:
The first record
Ten batches of 10 records each
The last record
I'm not gonna try to solve your exercise for you, and I'm not familiar enough with RxJS to expand too much on this.
I just wanted to show you another way and let you know that it is possible to combine this with functional programming.
Hope it helps
I think I may have developed an answer for you some time ago and it's called scramjet. It's lightweight (no thousands of dependencies in node_modules), it's easy to use and it does make your code very easy to understand and read.
Let's start with your case:
DataStream
  .from(getDocs(10000))
  .use(stream => {
    let counter = 0;
    const items = new DataStream();
    const out = new DataStream();

    stream
      .peek(1, async ([first]) => out.whenWrote(first))
      .batch(100)
      .reduce(async (acc, result) => {
        await items.whenWrote(result);
        return result[result.length - 1];
      }, null)
      .then((last) => out.whenWrote(last))
      .then(() => items.end());

    items
      .setOptions({ maxParallel: 1 })
      .do(arr => counter += arr.length)
      .each(batch => writeDataToSocketIo(batch))
      .run()
      .then(() => (out.end(counter)))
    ;

    return out;
  })
  .toArray()
  .then(([first, last, count]) => ({ first, count, last }))
  .then(console.log)
;
So I don't really agree that JavaScript FRP is an antipattern, and I don't claim to have the only answer to it, but while developing the first commits I found that ES6 arrow syntax and async/await written in a chained fashion make the code easy to understand and read.
Here's another example of scramjet code from OpenAQ, specifically this line in their fetch process:
return DataStream.fromArray(Object.values(sources))
  // flatten the sources
  .flatten()
  // set parallel limits
  .setOptions({ maxParallel: maxParallelAdapters })
  // filter sources - if env is set then choose only matching source,
  // otherwise filter out inactive sources.
  // * inactive sources will be run if called by name in env.
  .use(chooseSourcesBasedOnEnv, env, runningSources)
  // mark sources as started
  .do(markSourceAs('started', runningSources))
  // get measurements object from given source
  // all error handling should happen inside this call
  .use(fetchCorrectedMeasurementsFromSourceStream, env)
  // perform streamed save to DB and S3 on each source.
  .use(streamMeasurementsToDBAndStorage, env)
  // mark sources as finished
  .do(markSourceAs('finished', runningSources))
  // convert to measurement report format for storage
  .use(prepareCompleteResultsMessage, fetchReport, env)
  // aggregate to Array
  .toArray()
  // save fetch log to DB and send a webhook if necessary.
  .then(
    reportAndRecordFetch(fetchReport, sources, env, apiURL, webhookKey)
  );
It describes everything that happens with every source of data. So here's my proposal up for questioning. :)
Here are two solutions, using RxJS and scramjet.
Here is the RxJS solution.
The trick was to use share() so that first() and last() don't consume from the iterator; forkJoin is used to combine them to emit the done event with those values.
// assumes: const Rx = require('rxjs'); const { share, first, last, bufferCount, count } = require('rxjs/operators');
// small logging helper (assumed; not in the original snippet)
const log = prefix => x => console.log(prefix, x);

function ObservableFromAsyncGen(asyncGen) {
  return Rx.Observable.create(async function (observer) {
    for await (let i of asyncGen) {
      observer.next(i);
    }
    observer.complete();
  });
}

async function main() {
  let o = ObservableFromAsyncGen(getDocs(100));
  let s = o.pipe(share());
  let f = s.pipe(first());
  let e = s.pipe(last());
  let b = s.pipe(bufferCount(13));
  let c = s.pipe(count());
  b.subscribe(log("batch: "));
  Rx.forkJoin(c, f, e, b).subscribe(function (a) {
    console.log("emit done with count", a[0], "first", a[1], "last", a[2]);
  });
}
Here is the scramjet solution, but it is not pure (the functions have side effects).
async function main() {
  let docs = getDocs(100);
  let first, last, counter;

  let s0 = Sj.DataStream
    .from(docs)
    .setOptions({ maxParallel: 1 })
    .peek(1, (item) => first = item[0])
    .tee((s) => {
      s.reduce((acc, item) => acc + 1, 0)
        .then((item) => counter = item);
    })
    .tee((s) => {
      s.reduce((acc, item) => item)
        .then((item) => last = item);
    })
    .batch(13)
    .map((batch) => console.log("emit batch" + JSON.stringify(batch)));

  await s0.run();
  console.log("emit done " + JSON.stringify({ first: first, last: last, counter: counter }));
}
I'll work with #michał-kapracki to develop a pure version of it.
For exactly this kind of problem I made this library: ramda-generators
Hopefully it's what you are looking for: lazy evaluation of streams in functional JavaScript.
The only problem is that I have no idea how to take the last element and the number of elements from a stream without re-running the generators.
A possible implementation that computes the result without loading the whole DB into memory could be this:
Try it on repl.it
const RG = require("ramda-generators");
const R = require("ramda");

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const getDocs = amount => RG.generateAsync(async (i) => {
  await sleep(1);
  return { i, t: Date.now() };
}, amount);

const amount = 1000000000;

(async (chunkSize) => {
  const first = await RG.headAsync(getDocs(amount).start());
  const last = await RG.lastAsync(getDocs(amount).start()); // Without this line the print of the results would start immediately

  const DbIterator = R.pipe(
    getDocs(amount).start,
    RG.splitEveryAsync(chunkSize),
    RG.mapAsync(i => "emit " + JSON.stringify(i)),
    RG.mapAsync(res => ({ first, last, res })),
  );

  for await (const el of DbIterator())
    console.log(el);
})(100);

Multiple document reads in node js Firestore transaction

I want to perform a transaction that requires updating two documents using the previous values of those documents.
For the sake of the question, I'm trying to transfer 100 tokens from one app user to another. This operation must be atomic to keep the data integrity of my DB, so on the server side I thought to use admin.firestore().runTransaction.
As I understand it, runTransaction needs to perform all reads before performing writes, so how do I read both users' balances before updating the data?
This is what I have so far:
db = admin.firestore();
const user1Ref = db.collection('users').doc(user1Id);
const user2Ref = db.collection('users').doc(user2Id);

transaction = db.runTransaction(t => {
  return t.get(user1Ref).then(user1Snap => {
    const user1Balance = user1Snap.data().balance;
    // Somehow get the second user's balance (user2Balance)
    t.update(user1Ref, { balance: user1Balance - 100 });
    t.update(user2Ref, { balance: user2Balance + 100 });
    return Promise.resolve('Transferred 100 tokens from ' + user1Id + ' to ' + user2Id);
  });
}).then(result => {
  console.log('Transaction success', result);
});
You can use getAll. See documentation at https://cloud.google.com/nodejs/docs/reference/firestore/0.15.x/Transaction?authuser=0#getAll
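A hedged sketch of how that could look with the refs from the question (getAll reads several documents in a single transactional read):

db.runTransaction(t => {
  return t.getAll(user1Ref, user2Ref).then(([user1Snap, user2Snap]) => {
    const user1Balance = user1Snap.data().balance;
    const user2Balance = user2Snap.data().balance;
    t.update(user1Ref, { balance: user1Balance - 100 });
    t.update(user2Ref, { balance: user2Balance + 100 });
    return 'Transferred 100 tokens from ' + user1Id + ' to ' + user2Id;
  });
}).then(result => {
  console.log('Transaction success', result);
});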
You can use Promise.all() to generate a single promise that resolves when all promises in the array passed to it have resolved. Use that promise to continue work after all your document reads are complete - it will contain all the results. The general form of your code should be like this:
const p1 = t.get(user1Ref)
const p2 = t.get(user2Ref)
const pAll = Promise.all([p1, p2])
pAll.then(results => {
  snap1 = results[0]
  snap2 = results[1]
  // work with snap1 and snap2 here, make updates to refs...
})
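Applied to the question's transfer, that pattern might look roughly like this (a sketch, not a drop-in implementation):

db.runTransaction(t => {
  return Promise.all([t.get(user1Ref), t.get(user2Ref)]).then(([user1Snap, user2Snap]) => {
    // both balances are available before any write happens
    t.update(user1Ref, { balance: user1Snap.data().balance - 100 });
    t.update(user2Ref, { balance: user2Snap.data().balance + 100 });
    return 'Transferred 100 tokens from ' + user1Id + ' to ' + user2Id;
  });
});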
