Edit: Removing irrelevant code to improve readability
Edit 2: Reducing example to only uploadGameRound function and adding log output with times.
I'm working on a mobile multiplayer word game. I was previously using the Firebase Realtime Database, with fairly snappy performance apart from the cold starts: saving an updated game and setting stats would take at most a few seconds. Recently I decided to switch to Firestore for my game data and player stats / top lists, primarily because of the more advanced queries and the automatic scaling with no need for manual sharding.

Now I've got things working on Firestore, but the time it takes to save an updated game and update a number of stats is just ridiculous. I'm clocking an average of 3-4 minutes before the game is updated, the stats are added, and everything is available in the database for other clients and viewable in the web interface. I'm guessing and hoping that this is because of something I've messed up in my implementation, but the transactions all go through and there are no warnings or anything else to go on, really. Looking at the Cloud Functions log, the total time from function call to completion log statement appears to be a bit more than a minute, but that log doesn't appear until after the same 3-4 minute wait for the data.
Here's the code as it is. If someone has time to have a look and maybe spot what's wrong I'd be hugely grateful!
This function is called from Unity client:
exports.uploadGameRound = functions.https.onCall((roundUploadData, response) => {
  console.log("UPLOADING GAME ROUND. TIME: ");
  var d = new Date();
  var n = d.toLocaleTimeString();
  console.log(n);

  // CODE REMOVED FOR READABILITY. JUST PREPARING SOME VARIABLES TO USE BELOW. NOTHING HEAVY, NO DATABASE TRANSACTIONS. //

  // Get a new write batch
  const batch = firestoreDatabase.batch();

  // Save game info to activeGamesInfo
  var gameInfoRef = firestoreDatabase.collection('activeGamesInfo').doc(gameId);
  batch.set(gameInfoRef, gameInfo);

  // Save game data to activeGamesData
  const gameDataRef = firestoreDatabase.collection('activeGamesData').doc(gameId);
  batch.set(gameDataRef, { gameDataCompressed: updatedGameDataGzippedString });

  if (foundWord !== undefined && foundWord !== null) {
    const wordId = foundWord.timeStamp + "_" + foundWord.word;
    // Save word to allFoundWords
    const wordRef = firestoreDatabase.collection('allFoundWords').doc(wordId);
    batch.set(wordRef, foundWord);
    exports.incrementNumberOfTimesWordFound(gameInfo.language, foundWord.word);
  }

  console.log("COMMITTING BATCH. TIME: ");
  var d = new Date();
  var n = d.toLocaleTimeString();
  console.log(n);

  // Commit the batch
  batch.commit().then(result => {
    return gameInfoRef.update({ roundUploaded: true }).then(function (result2) {
      console.log("DONE COMMITTING BATCH. TIME: ");
      var d = new Date();
      var n = d.toLocaleTimeString();
      console.log(n);
      return;
    });
  });
});
Again, any help with understanding this weird behaviour massively appreciated!
Ok, so I found the problem now and thought I should share it:
Simply adding a return statement before the batch commit fixed the function and reduced the time from 4 minutes to less than a second:
return batch.commit().then(result => {   // <-- the added return statement
  return gameInfoRef.update({ roundUploaded: true }).then(function (result2) {
    console.log("DONE COMMITTING BATCH. TIME: ");
    var d = new Date();
    var n = d.toLocaleTimeString();
    console.log(n);
    return;
  });
});
Your function isn't returning a promise that resolves with the data to send to the client app. In the absence of a returned promise, it will return immediately, with no guarantee that any pending asynchronous work will terminate correctly.
Calling then on a single promise isn't enough to handle all of the promises involved. You likely have lots of async work going on here, between commit() and other functions like incrementNumberOfTimesWordFound. You will need to handle all of those promises correctly, and make sure your overall function returns only a single promise that resolves when all that work is complete.
I strongly suggest taking some time to learn how promises work in JavaScript - this is crucial to writing effective functions. Without a full understanding, things will appear to go wrong, or not at all, in strange ways.
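For illustration, here is a minimal sketch of the shape being described: all the async work feeds into one promise chain that the function itself returns. This is not the asker's final code, and it assumes incrementNumberOfTimesWordFound is a plain helper that returns a promise.

// Minimal sketch, not the asker's final code.
// Assumes incrementNumberOfTimesWordFound returns a promise.
exports.uploadGameRound = functions.https.onCall(async (roundUploadData, context) => {
  // ...prepare gameId, gameInfo, updatedGameDataGzippedString, foundWord as before...

  const batch = firestoreDatabase.batch();

  const gameInfoRef = firestoreDatabase.collection('activeGamesInfo').doc(gameId);
  batch.set(gameInfoRef, gameInfo);

  const gameDataRef = firestoreDatabase.collection('activeGamesData').doc(gameId);
  batch.set(gameDataRef, { gameDataCompressed: updatedGameDataGzippedString });

  if (foundWord != null) {
    const wordRef = firestoreDatabase.collection('allFoundWords')
      .doc(foundWord.timeStamp + "_" + foundWord.word);
    batch.set(wordRef, foundWord);
    // await every piece of async work so the returned promise only resolves when it is all done
    await incrementNumberOfTimesWordFound(gameInfo.language, foundWord.word);
  }

  await batch.commit();
  await gameInfoRef.update({ roundUploaded: true });

  // whatever is returned here is sent back to the Unity client
  return { roundUploaded: true };
});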
Summary
Is functional programming in Node.js general enough? Can it be used for a real-world problem such as handling small bulks of DB records, without loading all records into memory using toArray (and thus running out of memory)? You can read this criticism for background. We want to demonstrate the Mux/DeMux and fork/tee/join capabilities of such Node.js libraries with async generators.
Context
I'm questioning the validity and generality of functional programming in Node.js using any functional programming tool (such as ramda, lodash, or imlazy), or even custom code.
Given
Millions of records from a MongoDB cursor that can be iterated using await cursor.next()
You might want to read more about async generators and for-await-of.
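For reference, a cursor like that can be wrapped in an async generator along these lines. This is just a sketch, assuming a standard MongoDB Node.js driver cursor that exposes hasNext() and next():

// Sketch: wrap a MongoDB driver cursor so it can be consumed with for-await-of.
async function* docsFromCursor(cursor) {
  while (await cursor.hasNext()) {
    yield await cursor.next();
  }
}
// e.g. const docs = docsFromCursor(collection.find({}));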
For fake data one can use (on Node 10):
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function* getDocs(n) {
  for (let i = 0; i < n; ++i) {
    await sleep(1);
    yield { i: i, t: Date.now() };
  }
}

let docs = getDocs(1000000);
Wanted
We need
first document
last document
number of documents
split into batches/bulks of n documents and emit a socket.io event for each bulk (a placeholder emit sketch follows this list)
Make sure that first and last documents are included in the batches and not consumed.
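As a placeholder for the socket.io part mentioned above, the emit could look roughly like this. Here io is an assumed socket.io server instance and "bulk" is a made-up event name; the answers below use their own stand-ins for it.

// Hypothetical emit callback standing in for the real socket.io handler.
const { Server } = require("socket.io");
const io = new Server(3000);                      // standalone server on an arbitrary port
const emitBulk = bulk => io.emit("bulk", bulk);   // "bulk" is a made-up event name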
Constraints
The millions of records should not be loaded into RAM; one should iterate over them and hold at most one batch of them.
The requirement can be met using ordinary Node.js code, but can it be done using something like applySpec, as in here?
R.applySpec({
  first: R.head,
  last: R.last,
  _: R.pipe(
    R.splitEvery(n),
    R.map((i) => { return "emit " + JSON.stringify(i); })
  )
})(input)
To show how this could be modeled with vanilla JS, we can introduce the idea of folding over an async generator that produces things that can be combined together.
const foldAsyncGen = (of, concat, empty) => (step, fin) => async asyncGen => {
  let acc = empty
  for await (const x of asyncGen) {
    acc = await step(concat(acc, of(x)))
  }
  return await fin(acc)
}
Here the arguments are broken up into three parts:
(of, concat, empty) expects a function that produces a "combinable" thing, a function that combines two "combinable" things, and an empty/initial instance of a "combinable" thing
(step, fin) expects a function that takes a "combinable" thing at each step and produces a Promise of a "combinable" thing to be used for the next step, and a function that takes the final "combinable" thing after the generator has been exhausted and produces a Promise of the final result
async asyncGen is the async generator to process
In FP, the idea of a "combinable" thing is known as a Monoid, which defines some laws that detail the expected behaviour of combining two of them together.
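As a throwaway illustration (not part of the solution below): arrays form a Monoid under concatenation with [] as the empty value, so foldAsyncGen specialised to that Monoid simply accumulates every value. Note that this particular fold holds everything in memory, unlike the Accum defined next.

// Illustration only: the array Monoid (concat as the combining operation, [] as empty).
const countAll = foldAsyncGen(x => [x], (a, b) => a.concat(b), [])(
  acc => acc,        // step: nothing extra to do at each step
  acc => acc.length  // fin: report how many values were seen
)
// countAll(getDocs(5)).then(console.log) //=> 5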
We can then create a Monoid that will be used to carry through the first, last and batch of values when stepping through the generator.
const Accum = (first, last, batch) => ({
  first,
  last,
  batch,
})

Accum.empty = Accum(null, null, [])  // an initial instance of `Accum`
Accum.of = x => Accum(x, x, [x])     // an `Accum` instance of a single value
Accum.concat = (a, b) =>             // how to combine two `Accum` instances together
  Accum(a.first == null ? b.first : a.first, b.last, a.batch.concat(b.batch))
To capture the idea of flushing the accumulating batches, we can create another function that takes an onFlush function (which performs some action, in a returned Promise, with the values being flushed) and a size n at which to flush the batch.
Accum.flush = onFlush => n => acc =>
  acc.batch.length === 0 || acc.batch.length < n
    ? Promise.resolve(acc)
    : onFlush(acc.batch.slice(0, n || acc.batch.length))  // n === 0 flushes whatever is left
        .then(_ => Accum(acc.first, acc.last, acc.batch.slice(n || acc.batch.length)))
We can also now define how we can fold over the Accum instances.
Accum.foldAsyncGen = foldAsyncGen(Accum.of, Accum.concat, Accum.empty)
With the above utilities defined, we can now use them to model your specific problem.
const emit = batch => // This is an analog of where you would emit your batches
new Promise((resolve) => resolve(console.log(batch)))
const flushEmit = Accum.flush(emit)
// flush and emit every 10 items, and also the remaining batch when finished
const fold = Accum.foldAsyncGen(flushEmit(10), flushEmit(0))
And finally run with your example.
fold(getDocs(100))
.then(({ first, last })=> console.log('done', first, last))
I'm not sure it's fair to imply that functional programming is going to offer any advantages over imperative programming in terms of performance when dealing with huge amounts of data.
I think you need to add another tool in your toolkit and that may be RxJS.
RxJS is a library for composing asynchronous and event-based programs by using observable sequences.
If you're not familiar with RxJS or reactive programming in general, my examples will definitely look weird, but I think it would be a good investment to get familiar with these concepts.
In your case, the observable sequence is your MongoDB instance that emits records over time.
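In a real setup the records could reach RxJS through from() over the async generator (newer RxJS versions, 7+, accept async iterables), or through a hand-rolled wrapper like the one shown in a later answer. A rough sketch, assuming RxJS 7+:

const { from } = require("rxjs");    // assumes RxJS 7+, where from() accepts async iterables
const db$ = from(getDocs(1000000));  // each yielded document becomes an emission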
I'm gonna fake your db:
// Imports assumed for the snippets below (RxJS 6 style):
const { range, merge } = require("rxjs");
const { first, last, bufferCount } = require("rxjs/operators");

var db = range(1, 5);
The range function is an RxJS creation function that emits each value in the provided range.
db.subscribe(n => {
console.log(`record ${n}`);
});
//=> record 1
//=> record 2
//=> record 3
//=> record 4
//=> record 5
Now I'm only interested in the first and last record.
I can create an observable that will only emit the first record, and create another one that will emit only the last one:
var db = range(1, 5);
var firstRecord = db.pipe(first());
var lastRecord = db.pipe(last());
merge(firstRecord, lastRecord).subscribe(n => {
console.log(`record ${n}`);
});
//=> record 1
//=> record 5
However, I also need to process all records in batches (in this example I'm gonna create batches of 10 records each):
var db = range(1, 100);
var batches = db.pipe(bufferCount(10))
var firstRecord = db.pipe(first());
var lastRecord = db.pipe(last());
merge(firstRecord, batches, lastRecord).subscribe(n => {
console.log(`record ${n}`);
});
//=> record 1
//=> record 1,2,3,4,5,6,7,8,9,10
//=> record 11,12,13,14,15,16,17,18,19,20
//=> record 21,22,23,24,25,26,27,28,29,30
//=> record 31,32,33,34,35,36,37,38,39,40
//=> record 41,42,43,44,45,46,47,48,49,50
//=> record 51,52,53,54,55,56,57,58,59,60
//=> record 61,62,63,64,65,66,67,68,69,70
//=> record 71,72,73,74,75,76,77,78,79,80
//=> record 81,82,83,84,85,86,87,88,89,90
//=> record 91,92,93,94,95,96,97,98,99,100
//=> record 100
As you can see in the output, it has emitted:
The first record
Ten batches of 10 records each
The last record
I'm not gonna try to solve your exercise for you, and I'm not familiar enough with RxJS to expand much more on this.
I just wanted to show you another way and let you know that it is possible to combine this with functional programming.
Hope it helps
I think I may have developed an answer for you some time ago and it's called scramjet. It's lightweight (no thousands of dependencies in node_modules), it's easy to use and it does make your code very easy to understand and read.
Let's start with your case:
const { DataStream } = require("scramjet"); // assuming scramjet is imported like this

DataStream
  .from(getDocs(10000))
  .use(stream => {
    let counter = 0;

    const items = new DataStream();
    const out = new DataStream();

    stream
      .peek(1, async ([first]) => out.whenWrote(first))
      .batch(100)
      .reduce(async (acc, result) => {
        await items.whenWrote(result);
        return result[result.length - 1];
      }, null)
      .then((last) => out.whenWrote(last))
      .then(() => items.end());

    items
      .setOptions({ maxParallel: 1 })
      .do(arr => counter += arr.length)
      .each(batch => writeDataToSocketIo(batch))
      .run()
      .then(() => (out.end(counter)))
    ;

    return out;
  })
  .toArray()
  .then(([first, last, count]) => ({ first, count, last }))
  .then(console.log)
;
So I don't really agree that JavaScript FRP is an antipattern, and I don't think I have the only answer to that, but while developing the first commits I found that ES6 arrow syntax and async/await written in a chained fashion make the code easily understandable.
Here's another example of scramjet code from OpenAQ, specifically this line in their fetch process:
return DataStream.fromArray(Object.values(sources))
  // flatten the sources
  .flatten()
  // set parallel limits
  .setOptions({maxParallel: maxParallelAdapters})
  // filter sources - if env is set then choose only matching source,
  // otherwise filter out inactive sources.
  // * inactive sources will be run if called by name in env.
  .use(chooseSourcesBasedOnEnv, env, runningSources)
  // mark sources as started
  .do(markSourceAs('started', runningSources))
  // get measurements object from given source
  // all error handling should happen inside this call
  .use(fetchCorrectedMeasurementsFromSourceStream, env)
  // perform streamed save to DB and S3 on each source.
  .use(streamMeasurementsToDBAndStorage, env)
  // mark sources as finished
  .do(markSourceAs('finished', runningSources))
  // convert to measurement report format for storage
  .use(prepareCompleteResultsMessage, fetchReport, env)
  // aggregate to Array
  .toArray()
  // save fetch log to DB and send a webhook if necessary.
  .then(
    reportAndRecordFetch(fetchReport, sources, env, apiURL, webhookKey)
  );
It describes everything that happens with every source of data. So here's my proposal up for questioning. :)
Here are two solutions, using RxJS and scramjet.
Here is the RxJS solution.
The trick was to use share() so that first() and last() won't consume from the iterator; forkJoin was used to combine them to emit the done event with those values.
// Imports assumed for this snippet (RxJS 6 style):
const Rx = require("rxjs");
const { share, first, last, bufferCount, count } = require("rxjs/operators");
// `log` is assumed to be a small curried logging helper, e.g.:
const log = prefix => x => console.log(prefix, x);

function ObservableFromAsyncGen(asyncGen) {
  return Rx.Observable.create(async function (observer) {
    for await (let i of asyncGen) {
      observer.next(i);
    }
    observer.complete();
  });
}

async function main() {
  let o = ObservableFromAsyncGen(getDocs(100));
  let s = o.pipe(share());
  let f = s.pipe(first());
  let e = s.pipe(last());
  let b = s.pipe(bufferCount(13));
  let c = s.pipe(count());
  b.subscribe(log("batch: "));
  Rx.forkJoin(c, f, e, b).subscribe(function (a) {
    console.log("emit done with count", a[0], "first", a[1], "last", a[2]);
  });
}
Here is a scramjet one, but it is not pure (the functions have side effects):
// Assumes scramjet is imported as `Sj`:
const Sj = require("scramjet");

async function main() {
  let docs = getDocs(100);
  let first, last, counter;
  let s0 = Sj.DataStream
    .from(docs)
    .setOptions({ maxParallel: 1 })
    .peek(1, (item) => first = item[0])
    .tee((s) => {
      s.reduce((acc, item) => acc + 1, 0)
        .then((item) => counter = item);
    })
    .tee((s) => {
      s.reduce((acc, item) => item)
        .then((item) => last = item);
    })
    .batch(13)
    .map((batch) => console.log("emit batch " + JSON.stringify(batch)));
  await s0.run();
  console.log("emit done " + JSON.stringify({ first: first, last: last, counter: counter }));
}
I'll work with @michał-kapracki to develop a pure version of it.
For exactly this kind of problem I made this library: ramda-generators.
Hopefully it's what you are looking for: lazy evaluation of streams in functional JavaScript.
The only problem is that I have no idea how to take the last element and the number of elements from a stream without re-running the generators.
A possible implementation that computes the result without loading the whole DB into memory could be this:
Try it on repl.it
const RG = require("ramda-generators");
const R = require("ramda");

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const getDocs = amount => RG.generateAsync(async (i) => {
  await sleep(1);
  return { i, t: Date.now() };
}, amount);

const amount = 1000000000;

(async (chunkSize) => {
  const first = await RG.headAsync(getDocs(amount).start());
  const last = await RG.lastAsync(getDocs(amount).start()); // Without this line, printing the results would start immediately

  const DbIterator = R.pipe(
    getDocs(amount).start,
    RG.splitEveryAsync(chunkSize),
    RG.mapAsync(i => "emit " + JSON.stringify(i)),
    RG.mapAsync(res => ({ first, last, res })),
  );

  for await (const el of DbIterator())
    console.log(el);
})(100);