Firebase Cloud Functions - move data in Firestore - node.js

I have a very basic understanding of the TypeScript language, but I would like to know: how can I copy multiple documents from one Firestore collection to another collection?
I know how to send the request from the app's code along with the relevant data (a string and the Firebase Auth user ID), but I am unsure about the TypeScript code to handle the request...

That's a very broad question, but something like this can move moderate amounts of data from one collection to another:
import * as _ from 'lodash';
import {firestore} from 'firebase-admin';

export async function moveFromCollection(collectionPath1: string, collectionPath2: string): Promise<void> {
  try {
    const collection1Ref = firestore().collection(collectionPath1);
    const collection2Ref = firestore().collection(collectionPath2);
    const collection1Snapshot = await collection1Ref.get();
    // Here we get all the snapshots from collection 1. This is OK if you only need
    // to move moderate amounts of data (since all of it will be held in memory).
    // Now let's use lodash chunk to insert the data in batches of 500, the maximum
    // number of operations allowed in a single Firestore batched write.
    const chunkedArray = _.chunk(collection1Snapshot.docs, 500);
    // chunkedArray is now an array of arrays, with at most 500 docs in each
    for (const chunk of chunkedArray) {
      const batch = firestore().batch();
      // Use the batch to insert many Firestore docs
      chunk.forEach(doc => {
        // You might need some business logic to decide on the new document ID,
        // but maybe something like this is enough
        const newDocRef = collection2Ref.doc(doc.id);
        batch.set(newDocRef, doc.data(), {merge: false});
      });
      // Commit the batch
      await batch.commit();
    }
    console.log('Done!');
  } catch (error) {
    console.log(`something went wrong: ${error.message}`);
  }
}
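For context, a minimal sketch of how the app-triggered side might look if you expose this as a callable function; the function name, module path, and data field names here are hypothetical:
import * as functions from 'firebase-functions';
import {moveFromCollection} from './move-from-collection'; // hypothetical module path

export const moveUserData = functions.https.onCall(async (data, context) => {
  // Require an authenticated caller (the app sends the Firebase Auth user ID implicitly)
  if (!context.auth) {
    throw new functions.https.HttpsError('unauthenticated', 'Sign in first.');
  }
  // data.sourcePath and data.targetPath are hypothetical field names sent from the app
  await moveFromCollection(data.sourcePath, data.targetPath);
  return {status: 'done'};
});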
But maybe you can tell us more about the use case?

Related

Firebase Functions and Express: listen to firestore data live

I have a website that runs its frontend on Firebase Hosting and its server, written using Node.js and Express, on Firebase Functions.
What I want is to have redirect links on my website so I can map, for example, mywebsite.com/youtube to my YouTube channel. I create these links from my admin panel and add them to my Firestore database.
My data roughly has the document shape shown in the comments of the second code snippet below (createdAt, creatorEmail, name, url).
The first way I approached this was to query my Firestore database on every request, but that is expensive and slow.
Another way I tried was to set up some kind of background listener on the Firestore database that would always provide up-to-date data, but unfortunately that did not work, because Firebase Functions suspends the main function once the current request's execution ends.
Lastly, and most conveniently, I configured an API route that is called from my admin panel whenever the data changes, and I save the new data to a JSON file. This worked locally, but it did not work in production, because apparently Firebase Functions runs on a read-only filesystem, so we can't edit any files after they are deployed. After some research I found out that Firebase Functions allows writing to the tmp directory, so I went forward with that and deployed it. But again, Firebase Functions was resetting the tmp folder when a request's execution ended.
Here is my API request code, which updates the utm_data.json file in the tmp directory:
// my firestore provider
const db = require('../db');
const fs = require('fs');
const os = require('os');
const mkdirp = require('mkdirp');

const updateUrlsAPI = (req, res) => {
  // we wanna get the utm list from firestore, and update the file
  // tmp/utm_data.json

  // query data from firestore
  db.collection('utmLinks').get().then(async function(querySnapshot) {
    try {
      // get the path to `tmp` folder depending on
      // the os running this program
      let tmpFolderName = os.tmpdir();
      // create `tmp` directory if not exists
      await mkdirp(tmpFolderName);
      let docsData = querySnapshot.docs.map(doc => doc.data());
      let tmpFilePath = tmpFolderName + '/utm_data.json';
      let strData = JSON.stringify(docsData);
      fs.writeFileSync(tmpFilePath, strData);
      res.send('200');
    } catch (error) {
      console.log("error while updating utm_data.json: ", error);
      res.send(error);
    }
  });
}
and this is my code for reading the utm_data.json file on an incoming request:
const readUrlsFromJson = (req, res) => {
  var url = req.path.split('/');
  // the url will be in the format of: 'mywebsite.com/routeName'
  var routeName = url[1];
  try {
    // read the file ../tmp/utm_data.json
    // {
    //   'createdAt': Date
    //   'creatorEmail': string
    //   'name': string
    //   'url': string
    // }
    // our [routeName] should match [name] of the doc
    let tmpFolderName = os.tmpdir();
    let tmpFilePath = tmpFolderName + '/utm_data.json';
    // read links list file and assign it to the `utms` variable
    let utms = require(tmpFilePath);
    if (!utms || !utms.length) {
      return undefined;
    }
    // find the link matching the routeName
    let utm = utms.find(utm => utm.name == routeName);
    if (!utm) {
      return undefined;
    }
    // if we found the doc,
    // then we'll redirect to the url
    res.redirect(utm.url);
  } catch (error) {
    console.error(error);
    return undefined;
  }
}
Is there something I am doing wrong, and if not, what is an optimal solution for this case?
You can initialize the Firestore listener in the global scope. From the documentation:
The global scope in the function file, which is expected to contain the function definition, is executed on every cold start, but not if the instance has already been initialized.
This should keep the listener active even after the function's execution has completed, for as long as that specific instance keeps running (which should be roughly ~30 minutes). Try refactoring the code as shown below:
import * as functions from "firebase-functions";
import * as admin from "firebase-admin";

admin.initializeApp();

let listener = false;
// Store all utmLinks in global scope
let utmLinks: any[] = [];

const initListeners = () => {
  functions.logger.info("Initializing listeners");
  admin
    .firestore()
    .collection("utmLinks")
    .onSnapshot((snapshot) => {
      snapshot.docChanges().forEach(async (change) => {
        functions.logger.info(change.type, "document received");
        switch (change.type) {
          case "added":
            utmLinks.push({ id: change.doc.id, ...change.doc.data() });
            break;
          case "modified": {
            const index = utmLinks.findIndex(
              (link) => link.id === change.doc.id
            );
            utmLinks[index] = { id: change.doc.id, ...change.doc.data() };
            break;
          }
          case "removed":
            utmLinks = utmLinks.filter((link) => link.id !== change.doc.id);
            break;
          default:
            break;
        }
      });
    });
  return;
};

// The HTTPs function
export const helloWorld = functions.https.onRequest(
  async (request, response) => {
    if (!listener) {
      // Cold start, no listener active
      initListeners();
      listener = true;
    } else {
      functions.logger.info("Listeners already initialized");
    }
    response.send(JSON.stringify(utmLinks, null, 2));
  }
);
This example stores all UTM links in an array in the global scope, which won't be persisted across new instances, but you won't have to query Firestore for every request. The onSnapshot() listener will keep utmLinks updated.
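To tie this back to the redirect use case, here is a rough sketch of a second handler that resolves a route from the in-memory array instead of reading a file; the 404 fallback is just an assumption about what you want:
// In the same file, so it shares the global `utmLinks` array and `initListeners`
export const redirect = functions.https.onRequest(async (request, response) => {
  if (!listener) {
    initListeners();
    listener = true;
  }
  // Note: right after a cold start the array may still be empty
  // until the first snapshot arrives.
  // The request path is expected to look like '/routeName'
  const routeName = request.path.split("/")[1];
  const utm = utmLinks.find((link) => link.name === routeName);
  if (!utm) {
    response.status(404).send("Not found"); // hypothetical fallback
    return;
  }
  response.redirect(utm.url);
});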
If you want to persist this data permanently and avoid querying on every cold start, then you can try Google Compute Engine, which keeps running, unlike Cloud Functions instances, which are eventually shut down.

Firebase function Node.js transform stream

I'm creating a Firebase HTTP Function that makes a BigQuery query and returns a modified version of the query results. The query potentially returns millions of rows, so I cannot store the entire query result in memory before responding to the HTTP client. I am trying to use Node.js streams, and since I need to modify the results before sending them to the client, I am trying to use a transform stream. However, when I try to pipe the query stream through my transform stream, the Firebase Function crashes with the following error message: finished with status: 'response error'.
My minimal reproducible example is as follows. I am using a buffer, because I don't want to process a single row (chunk) at a time, since I need to make asynchronous network calls to transform the data.
return new Promise(async (resolve, reject) => {
  const buffer = new Array(5000)
  let bufferIndex = 0
  const [job] = await bigQuery.createQueryJob(options)
  const bqStream = job.getQueryResultsStream()
  const transformer = new Transform({
    writableObjectMode: true,
    readableObjectMode: false,
    transform(chunk, enc, callback) {
      buffer[bufferIndex] = chunk
      if (bufferIndex < buffer.length - 1) {
        bufferIndex++
      } else {
        this.push(JSON.stringify(buffer).slice(1, -1)) // Transformation should happen here.
        bufferIndex = 0
      }
      callback()
    },
    flush(callback) {
      if (bufferIndex > 0) {
        this.push(JSON.stringify(buffer.slice(0, bufferIndex)).slice(1, -1))
      }
      this.push("]")
      callback()
    },
  })
  bqStream
    .pipe(transformer)
    .pipe(response)
  bqStream.on("end", () => {
    resolve()
  })
})
I cannot store the entire query result in memory before responding to the HTTP client
Unfortunately, when using Cloud Functions, this is precisely what must happen.
There is a documented limit of 10MB for the response payload, and that is effectively stored in memory as your code continues to write to the response. Streaming of requests and responses is not supported.
One alternative is to write your response to an object in Cloud Storage, then send a link or reference to that file to the client so it can read the response fully from that object.
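A rough sketch of that alternative, assuming the Admin SDK with a default Storage bucket configured, and reusing bqStream and transformer from the question; the object name and signed-URL expiry are arbitrary choices:
import * as admin from "firebase-admin"

// Inside the async HTTP function handler: stream the transformed query results
// into a Cloud Storage object instead of the HTTP response, then hand the
// client a link to that object.
const file = admin.storage().bucket().file(`exports/${Date.now()}.json`)
await new Promise((resolve, reject) => {
  bqStream
    .pipe(transformer)
    .pipe(file.createWriteStream({ contentType: "application/json" }))
    .on("finish", resolve)
    .on("error", reject)
})
// A signed URL lets the client download the object directly
// (this requires the function's service account to have signing permissions).
const [url] = await file.getSignedUrl({
  action: "read",
  expires: Date.now() + 60 * 60 * 1000, // 1 hour, arbitrary
})
response.send({ url })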
If you need to send a large streamed response, Cloud Functions is not a good choice. Neither is Cloud Run, which is similarly limited. You will need to look into other solutions that allow direct socket access, such as Compute Engine.
I tried to implement the workaround as suggested by Doug Stevenson and got the following error:
#firebase/firestore: Firestore (9.8.2):
Connection GRPC stream error.
Code: 3
Message: 3
INVALID_ARGUMENT: Request payload size exceeds the limit: 11534336 bytes.
I created a workaround to store data in Firestore first. It works fine when the content size is below 10MB.
import * as functions from "firebase-functions";
import * as firestore from "firebase/firestore";
import { initializeApp } from "firebase/app";
import { firebaseConfig } from '../conf/firebase'

// Initialize Firebase
const app = initializeApp(firebaseConfig);
const fs = firestore.getFirestore(app);

export async function storeStudents(data, context) {
  const students = await api.getTermStudents()
  const batch = firestore.writeBatch(fs);
  students.forEach((student) => {
    const ref = firestore.doc(fs, 'students', student.studentId)
    batch.set(ref, student)
  })
  await batch.commit()
  return 'stored'
}
exports.getTermStudents = functions.https.onCall(storeStudents);
UPDATE:
To bypass Firestore's request size limit when using a batched write, I just looped through the array and set (added/updated) each document individually. setDoc() creates or overwrites a single document.
export async function storeStudents(data, context) {
  const students = await api.getTermStudents({images: true})
  students.forEach((student: Student) => {
    const ref = firestore.doc(fs, 'students', student.student_id)
    firestore.setDoc(ref, student)
  })
  return 'stored'
}
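One caveat worth noting as an aside: setDoc() returns a promise, and in a Cloud Function you would likely want to wait for all of those writes before returning, roughly like this:
export async function storeStudents(data, context) {
  const students = await api.getTermStudents({images: true})
  // Collect the write promises so the function doesn't return
  // before all documents have actually been written.
  await Promise.all(
    students.map((student: Student) =>
      firestore.setDoc(firestore.doc(fs, 'students', student.student_id), student)
    )
  )
  return 'stored'
}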

Using firebase cloud function to copy all documents from a master collection in Firestore to new sub-collection

I have a master collection in Firestore with a couple hundred documents (which will grow to a few thousand in a couple of months).
I have a use case where, every time a new user document is created in the /users/ collection, I want all the documents from the master to be copied over to /users/{userId}/.
To achieve this, I have created a Firebase Cloud Function as below:
// setup for new user
exports.setupCollectionForUser = functions.firestore
  .document('users/{userId}')
  .onCreate((snap, context) => {
    const userId = context.params.userId;
    db.collection('master').get().then(snapshot => {
      if (snapshot.empty) {
        console.log('no docs found');
        return;
      }
      snapshot.forEach(function(doc) {
        return db.collection('users').doc(userId).collection('slave').doc(doc.get('uid')).set(doc.data());
      });
    });
  });
This works; the only problem is that it takes forever (~3-5 minutes) for only about 200 documents. This has been a real bummer, because a lot depends on how fast these documents get copied over. I was hoping for no more than a few seconds at most. Also, the documents show up all at once and not as they are written, or at least it seems that way.
Am I doing anything wrong? Why should it take so long?
Is there a way I can break this operation into multiple reads and writes, so that I can guarantee a minimum number of documents within a few seconds rather than waiting until all of them are copied over?
Please advise.
If I am not mistaken, correctly managing the parallel writes with Promise.all() and returning the promise chain should normally improve the speed.
Try to adapt your code as follows:
exports.setupCollectionForUser = functions.firestore
  .document('users/{userId}')
  .onCreate((snap, context) => {
    const userId = context.params.userId;
    return db.collection('master').get().then(snapshot => {
      if (snapshot.empty) {
        console.log('no docs found');
        return null;
      } else {
        const promises = [];
        const slaveRef = db.collection('users').doc(userId).collection('slave');
        snapshot.forEach(doc => {
          promises.push(slaveRef.doc(doc.get('uid')).set(doc.data()))
        });
        return Promise.all(promises);
      }
    });
  });
I would suggest you watch the 3 videos about "JavaScript Promises" from the Firebase video series, which explain why it is key to return a Promise or a value in a background-triggered Cloud Function.
Note also that if you are sure you have fewer than 500 documents to save in the slave collection, you could use a batched write. (You could use it for more than 500 docs, but then you would have to manage several separate batched writes, as sketched below.)
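A rough sketch of what managing several batches could look like for this case; it splits the master docs into groups of at most 500 with plain array slicing, though the lodash chunk approach from the first answer above would work just as well:
exports.setupCollectionForUser = functions.firestore
  .document('users/{userId}')
  .onCreate(async (snap, context) => {
    const userId = context.params.userId;
    const snapshot = await db.collection('master').get();
    if (snapshot.empty) {
      console.log('no docs found');
      return null;
    }
    const slaveRef = db.collection('users').doc(userId).collection('slave');
    const docs = snapshot.docs;
    // Firestore batched writes are limited to 500 operations each
    for (let i = 0; i < docs.length; i += 500) {
      const batch = db.batch();
      docs.slice(i, i + 500).forEach(doc => {
        batch.set(slaveRef.doc(doc.get('uid')), doc.data());
      });
      await batch.commit();
    }
    return null;
  });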

Google Cloud Datastore, how to query for more results

Straight and simple, I have the following function, using Google Cloud Datastore Node.js API:
fetchAll(query, result = [], queryCursor = null) {
  this.debug(`datastoreService.fetchAll, queryCursor=${queryCursor}`);
  if (queryCursor !== null) {
    query.start(queryCursor);
  }
  return this.datastore.runQuery(query)
    .then((results) => {
      result = result.concat(results[0]);
      if (results[1].moreResults === _datastore.NO_MORE_RESULTS) {
        return result;
      } else {
        this.debug(`results[1] = `, results[1]);
        this.debug(`fetch next with queryCursor=${results[1].endCursor}`);
        return this.fetchAll(query, result, results[1].endCursor);
      }
    });
}
The Datastore API object is in the variable this.datastore;
The goal of this function is to fetch all results for a given query, notwithstanding any limits on the number of items returned per single runQuery call.
I have not yet found any definite hard limits imposed by the Datastore API on this, and the documentation seems somewhat opaque on this point, but I have noticed that I always get
results[1] = { moreResults: 'MORE_RESULTS_AFTER_LIMIT' },
indicating that there are still more results to be fetched, while results[1].endCursor remains stuck on a constant value that is passed on again on each iteration.
So, given some simple query that I plug into this function, I just go on running the query iteratively, setting the query start cursor (by doing query.start(queryCursor);) to the endCursor obtained in the result of the previous query. And my hope is, obviously, to obtain the next bunch of results on each successive query in this iteration. But I always get the same value for results[1].endCursor. My question is: Why?
Conceptually, I cannot see a difference from this example given in the Google documentation:
// By default, google-cloud-node will automatically paginate through all of
// the results that match a query. However, this sample implements manual
// pagination using limits and cursor tokens.
function runPageQuery (pageCursor) {
  let query = datastore.createQuery('Task')
    .limit(pageSize);
  if (pageCursor) {
    query = query.start(pageCursor);
  }
  return datastore.runQuery(query)
    .then((results) => {
      const entities = results[0];
      const info = results[1];
      if (info.moreResults !== Datastore.NO_MORE_RESULTS) {
        // If there are more results to retrieve, the end cursor is
        // automatically set on `info`. To get this value directly, access
        // the `endCursor` property.
        return runPageQuery(info.endCursor)
          .then((results) => {
            // Concatenate entities
            results[0] = entities.concat(results[0]);
            return results;
          });
      }
      return [entities, info];
    });
}
(except for the fact that I don't specify a limit on the size of the query result myself, which I have also tried, by setting it to 1000, and which does not change anything.)
Why does my code run into this infinite loop, stuck on each step at the same "endCursor"? And how do I correct this?
Also, what is the hard limit on the number of results obtained per call of datastore.runQuery()? I have not found this information in the Google Datastore documentation thus far.
Thanks.
Looking at the API documentation for the Node.js client library for Datastore there is a section on that page titled "Paginating Records" that may help you. Here's a direct copy of the code snippet from the section:
var express = require('express');
var app = express();

var NUM_RESULTS_PER_PAGE = 15;

app.get('/contacts', function(req, res) {
  var query = datastore.createQuery('Contacts')
    .limit(NUM_RESULTS_PER_PAGE);
  if (req.query.nextPageCursor) {
    query.start(req.query.nextPageCursor);
  }
  datastore.runQuery(query, function(err, entities, info) {
    if (err) {
      // Error handling omitted.
      return;
    }
    // Respond to the front end with the contacts and the cursoring token
    // from the query we just ran.
    var frontEndResponse = {
      contacts: entities
    };
    // Check if more results may exist.
    if (info.moreResults !== datastore.NO_MORE_RESULTS) {
      frontEndResponse.nextPageCursor = info.endCursor;
    }
    res.render('contacts', frontEndResponse);
  });
});
Maybe you can try using one of the other syntax options (instead of Promises). The runQuery method can take a callback function as an argument, and that callback's parameters include explicit references to the entities array and the info object (which has the endCursor as a property).
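For instance, here is a rough, untested sketch of your fetchAll rewritten around that callback signature; the done callback is an addition of this sketch, used to deliver the final result:
fetchAll(query, done, result = [], queryCursor = null) {
  if (queryCursor !== null) {
    query.start(queryCursor);
  }
  this.datastore.runQuery(query, (err, entities, info) => {
    if (err) {
      return done(err);
    }
    result = result.concat(entities);
    if (info.moreResults === _datastore.NO_MORE_RESULTS) {
      return done(null, result);
    }
    // Keep following the end cursor until Datastore reports no more results
    this.fetchAll(query, done, result, info.endCursor);
  });
}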
And there are limits and quotas imposed on calls to the Datastore API as well. Here are links to official documentation that address them in detail:
Limits
Quotas

How to cache a mongoose query in memory?

I have the following queries, which start with the getById method firing up; once that fires and extracts data from another document, it saves it into the race document.
I want to be able to cache the data for ten minutes after I save it. I have taken a look at the cacheman library and am not sure if it is the right tool for the job. What would be the best way to approach this?
getById: function(opts, callback) {
  var id = opts.action;
  var raceData = {};
  var self = this;
  this.getService().findById(id, function(err, resp) {
    if (err)
      callback(null);
    else {
      raceData = resp;
      self.getService().getPositions(id, function(err, positions) {
        self.savePositions(positions, raceData, callback);
      });
    }
  });
},

savePositions: function(positions, raceData, callback) {
  var race = [];
  _.each(positions, function(item) {
    _.each(item.position, function(el) {
      race.push(el);
    });
  });
  raceData.positions = race;
  this.getService().modelClass.update({'_id': raceData._id}, {'positions': raceData.positions}, callback(raceData));
}
I have recently coded and published a module called Monc. You can find the source code over here. It has several useful methods to store, delete, and retrieve data stored in memory.
You may use it to cache Mongoose queries using simple nesting as
test.find({}).lean().cache().exec(function(err, docs) {
  //docs are fetched into the cache.
});
Otherwise you may need to take a look at the core of Mongoose and override the prototype in order to provide a way to use cacheman, as you originally suggested.
Create a node module and force it to extend Mongoose as:
monc.hellocache(mongoose, {});
Inside your module you should extend Mongoose.Query.prototype:
exports.hellocache = module.exports.hellocache = function(mongoose, options, Aggregate) {
  //require cacheman
  var CachemanMemory = require('cacheman-memory');
  var cache = new CachemanMemory();
  var m = mongoose;
  m.execAlter = function(caller, args) {
    //do your stuff here
  };
  m.Query.prototype.exec = function(arg1, arg2) {
    return m.execAlter.call(this, 'exec', arguments);
  };
};
Take a look at Monc's source code, as it may be a good reference for how you can extend and chain Mongoose methods.
I will explain this with the npm redis package, which stores key/value pairs in a cache server. Keys are queries, and Redis stores only strings.
We have to make sure that keys are unique and consistent, so the key should encode the query and also the name of the collection the query is applied to.
When you query, inside the Mongoose library there is
function Query(conditions, options, model, collection) {} //constructor function
responsible for the query. Inside this constructor,
Query.prototype.exec = function exec(op, callback) {}
is the function responsible for executing the queries, so we have to patch this function and have it perform these tasks:
first, check if we have any cached data related to the query
if yes, respond to the request right away with the cached data and return
if not, run the original query, update our cache, and then respond
const redis = require("redis");
const redisUrl = "redis://127.0.0.1:6379";
const client = redis.createClient(redisUrl);
const util = require("util");

// client.get does not return a promise, so promisify it
client.get = util.promisify(client.get);

const exec = mongoose.Query.prototype.exec;

// mongoose code is written using classical prototype inheritance for setting up
// objects and classes inside the library.
mongoose.Query.prototype.exec = async function() {
  // create a unique and consistent key
  const key = JSON.stringify(
    Object.assign({}, this.getQuery(), {
      collection: this.mongooseCollection.name
    })
  );
  // see if we have a value for the key in redis
  const cachedValue = await client.get(key);
  // if we do, return that as a mongoose model.
  // the exec function expects us to return mongoose documents
  if (cachedValue) {
    const doc = JSON.parse(cachedValue);
    return Array.isArray(doc)
      ? doc.map(d => new this.model(d))
      : new this.model(doc);
  }
  // otherwise run exec's original task and cache the result
  const result = await exec.apply(this, arguments);
  // it is saved to the cache server; make sure capital letters EX and time in seconds
  client.set(key, JSON.stringify(result), "EX", 6000);
  return result;
};
If we store values as an array of objects, we need to make sure that each object is individually converted to a Mongoose document.
this.model refers to the model the query was built from, and calling new this.model(obj) converts a plain object into a Mongoose document.
Note that if you are storing nested values, then instead of client.get and client.set you can use client.hget and client.hset, as sketched below.
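For illustration, a minimal sketch of that hash-based variant, using the collection name as the hash key and the serialized query as the field (one possible layout, not the only one):
// client.hget does not return a promise either
client.hget = util.promisify(client.hget);

mongoose.Query.prototype.exec = async function() {
  // hash key: the collection name; field: the serialized query
  const hashKey = this.mongooseCollection.name;
  const field = JSON.stringify(this.getQuery());

  const cachedValue = await client.hget(hashKey, field);
  if (cachedValue) {
    const doc = JSON.parse(cachedValue);
    return Array.isArray(doc)
      ? doc.map(d => new this.model(d))
      : new this.model(doc);
  }

  const result = await exec.apply(this, arguments);
  client.hset(hashKey, field, JSON.stringify(result));
  // note: unlike SET, HSET has no per-field expiry; you would expire the whole hash
  client.expire(hashKey, 6000);
  return result;
};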
Now we have monkey-patched
Query.prototype.exec
so you do not need to export this function. Wherever you have a query operation in your code, Mongoose will execute the code above.
