Can I commit a Firestore batch write without waiting? - node.js

Overview
I want to create some document references in a Cloud Function and return them to be used in another document. My app is time critical, so I don't want to wait for the batch to commit before returning the references.
Current solution
I currently create the references and the destination document in one Cloud Function and then commit the whole batch. This makes my code repetitive, as I need to create these references in other places, also.
My question
If I omit the .then from the batch.commit() can I simply pass the references straight back and leave Cloud Firestore to write the documents in its own time?
I've created this test script, which works. Is there a problem with this approach or should I always wait for a batch to finish writing before continuing code execution?
My sample code
// Set the data to be written
let myData = {test: '123'};
// Create the document references and return them for future processing
let docRefs = writeData(myData);
// Write these references to a master document
const myDoc = {
name: 'A document containing references to other documents',
doc0Ref: docRefs[0],
doc1Ref: docRefs[1],
doc2Ref: docRefs[2]
}
return db.collection('masterCollection').add(myDoc).then(response => {
console.log('Success');
return Promise.resolve();
}).catch(err => {
console.error(err);
return Promise.reject(err);
});
// Create the batch and write the data
function writeData(myData) {
let batch = firestore.batch();
let doc1Ref = firestore.collection('test').doc();
let doc2Ref = firestore.collection('test').doc();
let doc3Ref = firestore.collection('test').doc();
console.log(`doc1Ref: ${doc1Ref.id}, doc2Ref: ${doc2Ref.id}, doc3Ref = ${doc3Ref.id}`);
batch.set(doc1Ref, myData);
batch.set(doc2Ref, myData);
batch.set(doc3Ref, myData);
batch.commit(); // No .then to wait for the batch to be written
return [doc1Ref, doc2Ref, doc3Ref];
}

If your Cloud Function doesn't deal with all asynchronous work correctly (typically, with promises), there is a very good chance that the work may not complete successfully.
For HTTP triggers, you must only send your final response to the client after all the pending work is complete.
For all other types of triggers, you must return a promise that resolves only after all the async work in that function is complete.
What you have right now is a "dangling" promise that's not being handled according to these rules. If you're using ESLint or TSLint to check your code, the linter will likely detect this and complain about it.
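One way to keep the helper reusable without leaving a dangling promise is to return the commit promise alongside the references and let the caller wait on it. The following is only a minimal sketch: it assumes the Firestore instance is passed in as a parameter, and the names `committed` and `createMaster` are illustrative, not part of the Firestore API.

```javascript
// Sketch: build the batch, start the commit, and hand back both the
// references and the commit promise so the caller can wait on it.
// `firestore` is assumed to be an initialized Firestore instance.
function writeData(firestore, myData) {
  const batch = firestore.batch();
  const refs = [0, 1, 2].map(() => firestore.collection('test').doc());
  refs.forEach((ref) => batch.set(ref, myData));
  // `committed` is an illustrative name for the promise returned by commit().
  return { refs, committed: batch.commit() };
}

// The caller can use the references immediately, but it only finishes
// once both the batch and the master document write have settled.
async function createMaster(firestore, myData) {
  const { refs, committed } = writeData(firestore, myData);
  const masterWrite = firestore.collection('masterCollection').add({
    name: 'A document containing references to other documents',
    doc0Ref: refs[0],
    doc1Ref: refs[1],
    doc2Ref: refs[2],
  });
  // Wait for both writes in parallel; a failure in either one rejects.
  await Promise.all([committed, masterWrite]);
  return refs;
}
```

The two writes still run concurrently, so the cost is roughly that of the slower write rather than the sum, and the function does not finish with work outstanding.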

Related

Avoid triggering Firebase functions by real-time database on special cases

Sometimes we use the firebase functions triggered by real-time database (onCreate/onDelete/onUpdate ...) to do some logic (like counting, etc).
My question: would it be possible to avoid this trigger in some cases, mainly when I would like to allow a user to import a huge JSON into Firebase?
Example:
A function E is triggered on the creation of a new child in /examples. Normally, users add examples one by one to /examples, and function E runs to do some logic. However, I would like to allow a user (from the front end) to import 2000 children into /examples, and the logic done by function E can be performed at import time without E. So I do not need E to be triggered in this case, where a large number of function invocations would otherwise occur. (Note: I am aware of the 1000 limit.)
Update:
Based on the accepted answer, I have posted my own answer below.
As far as I know, there is no way to disable a Cloud Function programmatically without just deleting it. However, deleting it introduces an edge case: other data added to the database while the import is taking place would not trigger the function either.
A compromise would be to signal that the data you are uploading should be post-processed. Let's say you were uploading to /examples/{pushId}, instead of attaching the database trigger to /examples/{pushId}, attach it to /examples/{pushId}/needsProcessing (or something similar). Unfortunately this has the trade-off of not being able to make use of change objects for onUpdate() and onWrite().
const result = await firebase.database().ref('/examples').push({
  title: "Example 1A",
  desc: "This is an example",
  attachments: { /* ... */ },
  class: "-MTjzAKMcJzhhtxwUbFw",
  author: "johndoe1970",
  needsProcessing: true
});
async function handleExampleProcessing(snapshot, context) {
  // do post processing if needsProcessing is truthy
  if (!snapshot.exists() || !snapshot.val()) {
    console.log('No processing needed, exiting.');
    return;
  }
  // the snapshot points at .../needsProcessing, so its parent is /examples/{pushId}
  const exampleRef = admin.database().ref(snapshot.ref.parent); // /examples/{pushId}, as admin
  // read the plain value rather than spreading the DataSnapshot itself
  const data = (await exampleRef.once('value')).val();
  // do something with data, like mutate it
  // commit changes
  return exampleRef.update({
    ...data,
    needsProcessing: null /* delete needsProcessing value */
  });
}
const functionsExampleProcessingRef = functions.database.ref("examples/{pushId}/needsProcessing");
export const handleExampleNeedingProcessingOnCreate = functionsExampleProcessingRef.onCreate(handleExampleProcessing);
// this is only needed if you ever intend on writing `needsProcessing = /* some falsy value */`, I recommend just creating and deleting it, then you can use just the above trigger.
export const handleExampleNeedingProcessingOnUpdate = functionsExampleProcessingRef.onUpdate((change, context) => handleExampleProcessing(change.after, context));
An alternative to Sam's approach is to use feature flags to determine if a Cloud Function performs its main function. I often have this in my code:
exports.onUpload = functions.database
.ref("/uploads/{uploadId}")
.onWrite((event) => {
return ifEnabled("transcribe").then(() => {
console.log("transcription is enabled: calling Cloud Speech");
...
})
});
The ifEnabled is a simple helper function that checks (also in Realtime Database) if the feature is enabled:
function ifEnabled(feature) {
  console.log("Checking if feature '" + feature + "' is enabled");
  // Return the promise chain directly so that database errors also reject
  return admin.database().ref("/config/features")
    .child(feature)
    .once('value')
    .then(snapshot => {
      if (snapshot.val()) {
        return snapshot.val();
      }
      throw new Error("No value or 'falsy' value found");
    });
}
Most of my usage of this is during talks at conferences, to enable the Cloud Functions at the right time (as a deploy takes a bit longer than we'd like for a demo). But the same approach should work to temporarily disable features during, for example, a data import.
Okay, another solution would be the following.
A: Add a new table in Firebase, such as /triggers-queue, to which every CRUD operation that should fire a background function is added. In this table, we add a key for each table that should have triggers (in our example, the /examples table). Each key that represents a table should also have /created, /updated, and /deleted keys, as follows.
/examples
.../example-id-1
/triggers-queue
.../examples
....../created
........./example-id
....../updated
........./example-id
............old-value
....../deleted
........./example-id
............old-value
Note that the old-value should be written by the app (front end, etc.).
We always set onCreate triggers on:
/triggers-queue/examples/created/{exampleID} (simulate onCreate)
/triggers-queue/examples/updated/{exampleID} (simulate onUpdate)
/triggers-queue/examples/deleted/{exampleID} (simulate onDelete)
The fired function can know all the necessary info to handle the logic as follows:
Operation type: from the path (either: created, updated, or deleted)
key of the object: from the path
current data: by reading the corresponding table (i.e., /examples/id)
old data: from the triggers table
Good points:
You can import a large amount of data into the /examples table without firing any function, as nothing is added to /triggers-queue.
You can fan out functions to get past the 1000/sec limit, by setting triggers on (as an example, to fan out on create)
/triggers-queue/examples/created0/{exampleID} and
/triggers-queue/examples/created1/{exampleID}
Bad points:
It is more difficult to implement.
The app needs to write more data to Firebase (such as the old data).
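As an illustration of how a fired function could recover the operation type and key from the queue path in approach A, here is a small sketch. The helper name `describeQueueEvent` is hypothetical, and it assumes the path is given relative to the database root in the shape /triggers-queue/{table}/{operation}/{id}, with an optional numeric fan-out suffix on the operation segment.

```javascript
// Hypothetical helper: derive the operation, table and key from a queue path
// shaped like /triggers-queue/{table}/{operation}/{id}.
const OPERATIONS = new Set(['created', 'updated', 'deleted']);

function describeQueueEvent(path) {
  const [root, table, segment, id] = path.split('/').filter(Boolean);
  if (root !== 'triggers-queue' || !table || !segment || !id) {
    throw new Error(`Not a triggers-queue path: ${path}`);
  }
  // Strip a fan-out suffix such as the "0" in "created0".
  const operation = segment.replace(/\d+$/, '');
  if (!OPERATIONS.has(operation)) {
    throw new Error(`Unknown operation segment: ${segment}`);
  }
  return {
    operation,
    table,
    id,
    // Current data lives in the real table; old data sits in the queue entry.
    dataPath: `/${table}/${id}`,
  };
}
```

In a real Cloud Function the table and ID would more likely come from the trigger's path parameters; the string parsing here only makes the mapping explicit.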
B: Another way (although not an answer to this question) is to move the logic in the background function to an HTTP function and call it on every CRUD operation.

Difficulty processing CSV file, browser timeout

I was asked to import a CSV file from a server daily and map the respective headers to the appropriate fields in Mongoose.
My first idea was to make it to run automatically with a scheduler using the cron module.
const CronJob = require('cron').CronJob;
const fs = require("fs");
const csv = require("fast-csv")
new CronJob('30 2 * * *', async function() {
await parseCSV();
this.stop();
}, function() {
this.start()
}, true);
Next, the parseCSV() function code is as follows
(I have simplified some of the data):
function parseCSV() {
let buffer = [];
let stream = fs.createReadStream("data.csv");
csv.fromStream(stream, {headers:
[
"lot", "order", "cwotdt"
]
, trim:true})
.on("data", async (data) =>{
let data = { "order": data.order, "lot": data.lot, "date": data.cwotdt};
// Only add product that fulfill the following condition
if (data.cwotdt !== "000000"){
let product = {"order": data.order, "lot": data.lot}
// Check whether product exist in database or not
await db.Product.find(product, function(err, foundProduct){
if(foundProduct && foundProduct.length !== 0){
console.log("Product exists")
} else{
buffer.push(product);
console.log("Product not exists")
}
})
}
})
.on("end", function(){
db.Product.find({}, function(err, productAvailable){
// Check whether database exists or not
if(productAvailable.length !== 0){
// console.log("Database Exists");
// Add subsequent onward
db.Product.insertMany(buffer)
buffer = [];
} else{
// Add first time
db.Product.insertMany(buffer)
buffer = [];
}
})
});
}
It is not a problem when the CSV file has only a few rows, but at just 2k rows I ran into a problem. The culprit is the if condition inside the data event handler: it has to check every single row to see whether the database already contains that data.
The reason I'm doing this is that the CSV file will have new data added to it, so I need to add all the data the first time if the database is empty, or otherwise look at every single row and add only the new data to the database.
The first approach (as in the code) was to use async/await to make sure that all the data has been read before proceeding to the end event handler. This helps, but from time to time I see (with mongoose.set("debug", true);) some data being queried twice, and I have no idea why.
The second approach was not to use async/await. The downside is that the data was not fully queried: execution proceeded straight to the end event handler, which then called insertMany with only the data that had made it into the buffer.
If I stick with the current approach, it works, but the query takes 1 to 2 minutes, and longer still as the database keeps growing. During those few minutes of querying, the event queue is blocked, so requests sent to the server time out.
I tried stream.pause() and stream.resume() before this code, but I couldn't get it to work, as it just jumps straight to the end event handler first. This causes the buffer to be empty every single time, since the end event handler runs before the data event handler.
I can't remember the links that I used, but I got the fundamentals from this:
Import CSV Using Mongoose Schema
I saw these threads:
Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS
Can't populate big chunk of data to mongodb using Node.js
which seem similar to what I need, but they are a bit too complicated for me to understand what is going on. It seems like they use a socket or a child process, maybe? Furthermore, I still need to check conditions before adding to the buffer.
Would anyone care to guide me on this?
Edit: await is removed from console.log as it is not asynchronous
Forking a child process approach:
When the web service gets a request with a CSV data file, save it somewhere in the app
Fork a child process -> child process example
Pass the file URL to the child process to run the insert checks
When the child process finishes processing the CSV file, delete the file
As Joe said, indexing the DB would speed up processing considerably when there are lots (millions) of tuples.
If you create an index on order and lot, the query should be very fast.
db.Product.createIndex({ order: 1, lot: 1 })
Note: This is a compound index and may not be the ideal solution. Index strategies
Also, your await on console.log is weird. That may be causing your timing issues. console.log is not async. Additionally the function is not marked async
// removing await from console.log
let product = {"order": data.order, "lot": data.lot}
// Check whether product exist in database or not
await db.Product.find(product, function(err, foundProduct){
if(foundProduct && foundProduct.length !== 0){
console.log("Product exists")
} else{
buffer.push(product);
console.log("Product not exists")
}
})
I would try with removing the await on console.log (that may be a red herring if console.log is for stackoverflow and hiding the actual async method.) However, be sure to mark the function with async if that is the case.
Lastly, if the problem persists, I would look into a two-tiered approach:
Insert all lines from the CSV file into a Mongo collection.
Process that Mongo collection after the CSV has been parsed, removing the CSV from the equation.
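On the ordering problem described in the question (the end handler running before the per-row lookups have finished), one option worth considering is to consume the rows with `for await`, which reads the next row only after the current lookup has completed. This is a sketch under assumptions: it presumes the CSV parser exposes a standard Node.js readable stream in object mode, and `exists` is a hypothetical async lookup that you would back with your own database query.

```javascript
const { Readable } = require('stream');

// Sketch: consume rows one at a time so each lookup finishes before the next
// row is read, and the result is returned only after every row was checked.
// `rows` is any object-mode Readable (or async iterable) of parsed CSV rows;
// `exists` is a hypothetical async lookup, e.g. a wrapper around a DB query.
async function collectNewProducts(rows, exists) {
  const buffer = [];
  for await (const row of rows) {
    if (row.cwotdt === '000000') continue; // same filter as in the question
    const product = { order: row.order, lot: row.lot };
    if (!(await exists(product))) {
      buffer.push(product);
    }
  }
  return buffer; // at this point it is safe to pass the buffer to insertMany()
}

// Usage with in-memory rows standing in for the CSV parser output.
const known = new Set(['A1|L1']);
collectNewProducts(
  Readable.from([
    { lot: 'L1', order: 'A1', cwotdt: '010120' },
    { lot: 'L2', order: 'A2', cwotdt: '010120' },
    { lot: 'L3', order: 'A3', cwotdt: '000000' },
  ]),
  async (p) => known.has(`${p.order}|${p.lot}`)
).then((buffer) => console.log(buffer)); // logs [ { order: 'A2', lot: 'L2' } ]
```

This fixes the ordering, not the speed: each row still costs one lookup, so the index suggested above remains relevant.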

How do I make a large but unknown number of REST http calls in nodejs?

I have an OrientDB database. I want to use Node.js with RESTful calls to create a large number of records. I need to get the #rid of each for some later processing.
My pseudocode is:
for each record
write.to.db(record)
when the async of write.to.db() finishes
process based on #rid
carryon()
I have landed in serious callback hell from this. The version that was closest used a tail recursion in the .then function to write the next record to the db. However, I couldn't carry on with the rest of the processing.
A final constraint is that I am behind a corporate proxy and cannot use any other packages without going through the network administrator, so using the native nodejs packages is essential.
Any suggestions?
With a completion callback, the general design pattern for this type of problem makes use of a local function for doing each write:
var records = ....; // array of records to write
var index = 0;
function writeNext(r) {
write.to.db(r, function(err) {
if (err) {
// error handling
} else {
++index;
if (index < records.length) {
writeNext(records[index]);
}
}
});
}
writeNext(records[0]);
The key here is that you can't use synchronous iterators like .forEach() because they won't iterate one at a time and wait for completion. Instead, you do your own iteration.
If your write function returns a promise, you can use the .reduce() pattern that is common for iterating an array.
var records = ...; // some array of records to write
records.reduce(function(p, r) {
return p.then(function() {
return write.to.db(r);
});
}, Promise.resolve()).then(function() {
// all done here
}, function(err) {
// error here
});
This solution chains promises together, waiting for each one to resolve before executing the next save.
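On Node.js versions that support async/await, the same one-at-a-time sequencing can be written as a plain loop, which also makes it easy to collect each #rid and carry on afterwards. A minimal sketch, assuming a hypothetical `writeToDb` function that returns a promise resolving to the new record's #rid:

```javascript
// Sketch: write records one at a time and collect the #rid of each.
// `writeToDb` is a hypothetical function returning a promise for the #rid.
async function writeAll(records, writeToDb) {
  const rids = [];
  for (const record of records) {
    // The loop pauses here until the current write has finished.
    rids.push(await writeToDb(record));
  }
  return rids;
}
```

A caller would then use `writeAll(records, writeToDb).then(function(rids) { /* process based on #rid, then carry on */ })`. No third-party packages are involved.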
It's kinda hard to tell which function would be best for your scenario w/o more detail, but I almost always use asyncjs for this kind of thing.
From what you say, one way to do it would be with async.map:
var recordsToCreate = [...];
function functionThatCallsTheApi(record, cb){
// do the api call, then call cb(null, rid)
}
async.map(recordsToCreate, functionThatCallsTheApi, function(err, results){
// here, err will be if anything failed in any function
// results will be an array of the rids
});
You can also check out the other functions to enable throttling, which is probably a good idea.
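The async library has limit variants such as async.mapLimit for this. Since the question rules out extra packages, the same throttling idea can be sketched with built-in promises only. The name `mapWithLimit` and the shape of `worker` are assumptions made for this illustration.

```javascript
// Sketch of a throttled map using only built-in promises: at most `limit`
// calls to `worker` are in flight at once. Results keep the input order.
function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  // Each runner claims the next unprocessed index until none are left.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }
  const runners = [];
  for (let n = 0; n < Math.min(limit, items.length); n++) {
    runners.push(runner());
  }
  // Rejects as soon as any worker call rejects.
  return Promise.all(runners).then(() => results);
}
```

With a limit of 1 this degenerates into the strictly sequential pattern; a larger limit trades ordering of execution for throughput while still capping the number of open requests.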

Complex sequencing of promises - nested

After a lot of googling I have not been able to confirm the correct approach to this problem. The following code runs as expected but I have a grave feeling that I am not approaching this in the correct way, and I am setting myself up for problems.
The following code is initiated by the main app.js file and is passed a location to start loading XML files from and processing into a mongoDB
exports.processProfiles = function(path) {
var deferrer = q.defer();
q(dataService.deleteProfiles()) // simple mongodb call to empty the Profiles collection
.then(function(deleteResult) {
return loadFilenames(path); // method to load all filenames in the given path using fs
})
.then(function(filenames) {
// now we have all the file names lets load and save
filenames.forEach(function(filename) {
// Here is where i think the problem is!
// kick off another promise chain for the dynamically sized array of files to process
q(loadFileContent(path, filename)) // first we load the data in the file
.then(function(inboundFile) {
// then parse XML structure to my new shiny JSON structure
// and ask Mongo to store it for me
return dataService.createProfile(processProfileXML(filename, inboundFile));
})
.done(function(result) {
console.log(result);
})
});
})
.catch(function(err) {
deferrer.reject('Unable to Process Profile records : ' + err);
})
.done(function() {
deferrer.resolve('Profile Processing Completed');
});
return deferrer.promise;
}
While this code works, these are my main concerns, which I have not been able to resolve on my own after a few hours of Googling and reading.
1) Is this blocking? From the console output it is difficult to tell whether this is running asynchronously as I want it to. I think it is, but advice on whether I am doing something fundamentally wrong would be great.
2) Is having a nested promise a bad idea? Should I be linking it to the outer promise? I have tried but could not get anything to compile or run.
I haven't used Q in a really long time, but I think what you need to do is let it know you're about to hand back an array of promises that all need to be satisfied before moving on.
Additionally, as you're waiting for multiple promises in one section of code, rather than nesting further, throw the 'set' of promises back up once they're all satisfied.
q(dataService.deleteProfiles()) // simple mongodb call to empty the Profiles collection
  .then(function (deleteResult) {
    return loadFilenames(path); // method to load all filenames in the given path using fs
  })
  .then(function (filenames) {
    // return one promise that resolves once every file has been processed
    return q.all(
      filenames.map(function (filename) {
        return q(loadFileContent(path, filename)).then(function (inboundFile) {
          /* Do stuff with your file contents */
          return dataService.createProfile(processProfileXML(filename, inboundFile));
        });
      })
    );
  })
  .then(function (resultsOfLoadFileContentsPromises) {
    console.log('I did stuff with all the things');
  })
  .catch(function (err) {}); // handle or report the error here
What you have is not 'blocking'. But really what you're doing with promises is moving things into a new 'block'ing section. The more blocks you have, the more async-ish your code will appear. If nothing else is running apart from this promise, it will still appear procedural.
But inner promises only hold up the parent chain when they are returned to it; once returned, they must resolve before the parent promise resolves.
Inner promises like the ones you have aren't inherently bad. Personally, I would break them out into separate files to make them easier to reason about, but I wouldn't call that 'bad' unless there's no need for the inner promise to exist. Where possible (as in your example here), I've adjusted the code so that the next set of promises is thrown back up for a new section to deal with the data once it has it.
(I'm not great with Q though, this code will probably require a little further tweaking).
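The difference between starting inner promises inside forEach and returning them as one combined promise can be seen in a small, self-contained sketch. It uses native promises rather than Q (the same reasoning applies to q.all), and `processFile` is a made-up stand-in for loading and storing one file.

```javascript
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Stand-in for loading and storing one file; records when it finishes.
function processFile(name, log) {
  return delay(10).then(() => { log.push(name); });
}

// Inner promises are started but not returned: the chain moves on at once.
function fireAndForget(names, log) {
  return Promise.resolve(names)
    .then((list) => { list.forEach((n) => { processFile(n, log); }); })
    .then(() => { log.push('done'); });
}

// Inner promises are returned via Promise.all: the chain waits for them.
function waitForAll(names, log) {
  return Promise.resolve(names)
    .then((list) => Promise.all(list.map((n) => processFile(n, log))))
    .then(() => { log.push('done'); });
}
```

With `fireAndForget`, 'done' is recorded before any file has finished, which mirrors the question's code reporting completion too early. With `waitForAll`, 'done' is recorded last.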

Node.js + SQLite async transactions

I am using node-sqlite3, but I am sure this problem appears in other database libraries too. I have discovered a bug in my code caused by mixing transactions and async code.
function insertData(arrayWithData, callback) {
// start a transaction
db.run("BEGIN", function() {
// do multiple inserts
slide.asyncMap(
arrayWithData,
function(cb) {
db.run("INSERT ...", cb);
},
function() {
// all done
db.run("COMMIT");
}
);
});
}
// some other insert
setInterval(
function() { db.run("INSERT ...", cb); },
100
);
You can also run the full example.
The problem is that some other code with an insert or update query can be launched during the async pause after the begin or an insert. This extra query is then run inside the transaction. That is not a problem when the transaction is committed. But if the transaction is rolled back, the change made by this extra query is rolled back as well. Oops, we've just unpredictably lost data without any error message.
I thought about this issue and I think that one solution is to create a wrapper class that will make sure that:
Only one transaction is running at the same time.
When transaction is running only queries which belong to the transaction are executed.
All the extra queries are queued and executed after the current transaction is finished.
All attempts to start a transaction when one is already running will also get queued.
But this sounds like too complicated a solution. Is there a better approach? How do you deal with this problem?
First, I would like to state that I have no experience with SQLite. My answer is based on a quick study of node-sqlite3.
The biggest problem with your code, IMHO, is that you try to write to the DB from different locations. As I understand SQLite, you have no control over different parallel "connections" as you have in PostgreSQL, so you probably need to wrap all your communication with the DB. I modified your example to always use the insertData wrapper. Here is the modified function:
function insertData(callback, cmds) {
// start a transaction
db.serialize(function() {
db.run("BEGIN;");
//console.log('insertData -> begin');
// do multiple inserts
cmds.forEach(function(item) {
db.run("INSERT INTO data (t) VALUES (?)", item, function(e) {
if (e) {
console.log('error');
// rollback here
} else {
//console.log(item);
}
});
});
// all done
//here should be commit
//console.log('insertData -> commit');
db.run("ROLLBACK;", function(e) {
return callback();
});
});
}
Function is called with this code:
init(function() {
// insert with transaction
function doTransactionInsert(e) {
if (e) return console.log(e);
setTimeout(insertData, 10, doTransactionInsert, ['all', 'your', 'base', 'are', 'belong', 'to', 'us']);
}
doTransactionInsert();
// Insert increasing integers 0, 1, 2, ...
var i=0;
function doIntegerInsert() {
//console.log('integer insert');
insertData(function(e) {
if (e) return console.log(e);
setTimeout(doIntegerInsert, 9);
}, [i++]);
}
...
I made the following changes:
added the cmds parameter; for simplicity I added it as the last parameter, but the callback should be last (cmds is an array of inserted values; in the final implementation it should be an array of SQL commands)
changed db.exec to db.run (should be quicker)
added db.serialize to serialize requests inside the transaction
omitted the callback for the BEGIN command
left out slide and some underscore
Your test implementation now works fine for me.
I ended up writing a full wrapper around sqlite3 that locks the database during a transaction. While the DB is locked, all queries are queued and executed after the current transaction is over.
https://github.com/Strix-CZ/sqlite3-transactions
IMHO there are some problems with ivoszz's answer:
Since all db.run calls are async, you cannot check the result of the whole transaction, and if one run returns an error you should roll back all commands. To do this, you should call db.run("ROLLBACK") in the callback in the forEach loop. The db.serialize function will not serialize the async run calls, and so a "cannot start a transaction within a transaction" error occurs.
The COMMIT/ROLLBACK after the forEach loop has to check the result of all statements, and you cannot run it before all the previous run calls have finished.
IMHO there is only one way to make transaction management thread-safe (obviously with respect to the background thread pool): create a wrapper function and use the async library to serialize all statements manually. This way you can avoid the db.serialize function and, more importantly, you can check the result of every single db.run in order to roll back the whole transaction (and return a promise if needed).
The main problem of the node-sqlite3 library with regard to transactions is that the serialize function has no callback for checking whether an error occurred.
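The queueing idea from the question and the wrapper idea from this answer can be sketched with a small promise-based lock. This is an illustration under assumptions, not the API of node-sqlite3 or of the linked wrapper: `createLock` and `insertInTransaction` are made-up names, and `run` is a hypothetical promisified wrapper around db.run(sql, params).

```javascript
// Sketch: a promise-based lock. Tasks passed to the returned function run
// strictly one after another, in the order they were submitted.
function createLock() {
  let tail = Promise.resolve();
  return function runExclusive(task) {
    // Start the task once everything queued before it has settled.
    const result = tail.then(() => task());
    // Keep the queue usable even if this task fails.
    tail = result.catch(() => {});
    return result;
  };
}

// `run` is a hypothetical promisified wrapper around db.run(sql, params).
function insertInTransaction(lock, run, values) {
  return lock(async () => {
    await run('BEGIN');
    try {
      for (const value of values) {
        await run('INSERT INTO data (t) VALUES (?)', [value]);
      }
      await run('COMMIT');
    } catch (err) {
      // Any failed statement rolls back the whole transaction.
      await run('ROLLBACK');
      throw err;
    }
  });
}
```

For this to help, every other write has to go through the same lock as well, for example `lock(() => run('INSERT INTO data (t) VALUES (?)', [x]))`, so that it is queued until the running transaction has committed or rolled back.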

Resources